Wikipedia talk:AutoWikiBrowser/Typos
This page has archives. Sections older than 20 days may be automatically archived by Lowercase sigmabot III.
Pubication
I'm trying to patrol pubic and certain other easily confused words using poop patrol, and I can see a few phrases that would suit this software better.
- pubication - publication
- pubic school - public school
- Done here. Shadowjams (talk) 21:04, 1 August 2010 (UTC)
- discuss throw - discus throw
ϢereSpielChequers 08:09, 24 July 2010 (UTC)
- We should discuss throwing out the remaining suggestion. –xenotalk 14:16, 18 August 2010 (UTC)
- OK false positives are theoretically possible though it doesn't exist yet on Wikipedia and there once were many dozens of participants in the Olympic sport of synchronised ventriloquism. I will leave it in Botlaf. However can disolve be added as a typo for dissolve? I went through it manually a year or so back but there are about fifty again. ϢereSpielChequers 18:06, 31 August 2010 (UTC)
- The "Diss-" rule already handles it.--BillFlis (talk) 18:26, 31 August 2010 (UTC)
- I'm not convinced it does, I've just fixed one from June and I'd have thought AWB would have fixed it by now if it was in AWB. Can we have a specific rule for Disolv - Dissolv please. ϢereSpielChequers 13:01, 18 September 2010 (UTC)
- Confirmed that the fix for "disolve" → "dissolve" works in this edit GoingBatty (talk) 19:37, 18 September 2010 (UTC)
Do we want to hide italics from typo fixing?
For a feature request I added the capability for AWB to hide text in italics as part of its HideMore() function ('Ignore templates, refs, link targets...'). Do we want hiding of italics on or off for typos? We already hide untemplated quotes (text between " and related curly quotes). Rjwilmsi 09:01, 30 August 2010 (UTC)
- Sometimes we use italics to emphasise a word or a sentence. Italics are used for many reasons. Typo fixing should apply inside italics exactly the same way it applies outside them. -- Magioladitis (talk) 09:03, 30 August 2010 (UTC)
- Was the original concern over foreign and proper terms (like book/movie titles) or is there something else I'm not thinking of? Shadowjams (talk) 18:23, 30 August 2010 (UTC)
- Italics hiding was added for a feature request. We now have the option to apply it for typo fixing or not. Rjwilmsi 08:11, 31 August 2010 (UTC)
- I see. I tend to agree with Magioladitis on this point, there're a lot of these that fit within typo territory, but perhaps it cuts down on false positives. Just something to be aware of, it's obviously not an ideological issue. Shadowjams (talk) 08:51, 31 August 2010 (UTC)
Catepillar → Caterpillar
Could someone please update the entry for Caterpillar to also fix the incorrect "Catepillar" (missing the first "r")? GoingBatty (talk) 03:55, 1 September 2010 (UTC)
Apostrophe fix contested
I changed series's to series' using AWB. It was subsequently reverted[1]. Does the rule need to be removed or edited? -- JHunterJ (talk) 12:13, 5 September 2010 (UTC)
- I think the rule, and your fix, is correct, since the phrase is going to be pronounced "the seeriz antagonist", not "the seeriziz antagonist". The advice at Apostrophe#Singular nouns ending with an “s” or “z” sound is not at all clear, though. -- John of Reading (talk) 13:24, 5 September 2010 (UTC)
- The guideline is laid out here: Wikipedia:APOSTROPHE#Possessives. If you pronounce "series'[s] antagonist" as "sireez antagonist", then Wikipedia says not to use the additional s. On the other hand, it says if there are two possible pronunciations, you can use either. I definitely pronounce the phrase "series's antagonist" as "sireeziz antagonist". — the Man in Question (in question) 17:07, 5 September 2010 (UTC)
- If that's the guideline then the rule should be removed. It was added by Mboverload (talk · contribs) on 4th August 2008 apparently without any discussion on this talk page. I've pinged that user's talk page. -- John of Reading (talk) 21:01, 5 September 2010 (UTC)
- I've removed the rule. Per the guidelines on apostrophes, both versions are potentially correct, as long as usage is consistent (with the 's, without the 's, or with the 's if pronounced as iz) on a given article. -- JHunterJ (talk) 11:29, 6 September 2010 (UTC)
specail -> special
Manually fixed one here. Regards, SunCreator (talk) 18:33, 5 September 2010 (UTC)
Besancon
I think this is a false positive: AWB changes Besancon -> Besançon, but there is a place in France called Besançon and one in New Haven, Indiana called Besancon. Regards, SunCreator (talk) 21:44, 5 September 2010 (UTC)
Womens > Women's
Should this rule apply to lowercase "womens" only? See Apostrophe#Possessives in names of organizations. -- John of Reading (talk) 17:13, 11 September 2010 (UTC)
Retropective → Retrospectiv
This edit changed Retropective → Retrospectiv instead of Retrospective. I've manually fixed this article, but could someone please update the rule? Thanks! GoingBatty (talk) 05:17, 12 September 2010 (UTC)
- Fixed.--BillFlis (talk) 06:20, 12 September 2010 (UTC)
- Thanks BillFlis - I didn't find the rule under the "R" section - should have looked under the new additions section too. GoingBatty (talk) 06:38, 12 September 2010 (UTC)
heavily, 2nd try
AWB tried to replace "heaively" with "heaively", but it should've been "heavily". Please fix. --bender235 (talk) 20:22, 3 July 2010 (UTC) (—bender235 (talk) 00:50, 13 September 2010 (UTC))
- I can't find the rule that would make such a change, and I can't find any instances of "heaively" (or "heaivly", which seems more likely) in wikipedia. It looks like it's no longer a problem.--BillFlis (talk) 11:19, 13 September 2010 (UTC)
- Either bender's original post has a typo, or it's replacing "heaively" with itself, which I too can't find a rule that would do. Perhaps you meant it was replacing "heavily" with "heaiviley", which would make sense given this rule: <Typo word="-ively" find="\b(\w+)ivly\b" replace="$1ively" />. Before changing that, beware that "ively" is an equally, if not more, common version of that ending. Anyone have ideas about how to distinguish which ending is right based on the base? Shadowjams (talk) 17:40, 13 September 2010 (UTC)
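To make the behaviour of the quoted "-ively" rule concrete, here is a minimal Python sketch. It assumes Python's `re` module handles this simple pattern the same way AWB's .NET engine does; the rule's `$1` is written as `\1`, and `apply_ively_rule` is a hypothetical helper name, not part of AWB.

```python
import re

# The find/replace pair quoted in the thread, with .NET's $1 written as \1.
IVELY_FIND = re.compile(r"\b(\w+)ivly\b")
IVELY_REPLACE = r"\1ively"


def apply_ively_rule(text):
    """Apply the quoted '-ively' typo rule to text (illustrative helper)."""
    return IVELY_FIND.sub(IVELY_REPLACE, text)


if __name__ == "__main__":
    for word in ("relativly", "heaivly", "heavily"):
        print(word, "->", apply_ively_rule(word))
```

On this reading, "heaivly" is the only spelling of the three that could end up as "heaively": the rule inserts an "e" before "ly" whatever the base word is, and the correct "heavily" contains no "ivly" for it to match.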
Alternation vs. character classes
Hall with Schwartz, in Effective Perl Programming, call using alternation (A|a) instead of a character class [Aa] a "classic mistake" and say that it incurs a speed penalty, perhaps on the order of 4x. Maybe the processing here has gotten smarter since then, and alternation does save characters when capturing, (A|a) instead of ([Aa]), but we may still want to change it back. -- JHunterJ (talk) 19:25, 13 September 2010 (UTC)
- I'll investigate what difference, if any, there is for AWB/C#. Rjwilmsi 20:31, 13 September 2010 (UTC)
- ISBN 0596528124 page 237 has a benchmark for .NET that lists character classes as being 4.7x faster. I don't know how old that is... but worth considering. There are probably other optimizations like this as well. Shadowjams (talk) 00:40, 14 September 2010 (UTC)
- VB.NET, we use C#: I profiled 1000 replace operations for "\b(R|r)ec(?:ie|ei?)pient(s?)\b" and "\b([Rr])ec(?:ie|ei?)pient(s?)\b" (details on request) and the numbers were 13463 and 12860 ms respectively i.e. around a 5% difference only. So I conclude there's not much difference for C#. We cannot take a 4x or 5x difference in another language and assume it applies for ours. Rjwilmsi 20:54, 14 September 2010 (UTC)
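A comparison like the one profiled above can be sketched in Python with `timeit`. This is only an illustration of the method: the two find patterns are the ones quoted in the thread, but the replacement string, the sample text, and the helper names are assumptions, and timings from Python's `re` say nothing about the .NET engine that AWB uses.

```python
import re
import timeit

# The two find patterns quoted in the thread.
ALTERNATION = re.compile(r"\b(R|r)ec(?:ie|ei?)pient(s?)\b")
CHAR_CLASS = re.compile(r"\b([Rr])ec(?:ie|ei?)pient(s?)\b")

# Assumed replacement (the thread does not quote it): keep the case of the
# first letter and any plural "s".
REPLACEMENT = r"\1ecipient\2"

# Hypothetical sample text containing two misspellings per repetition.
SAMPLE = "The reciepient thanked the other Recepients. " * 200


def fix(pattern, text):
    """Run one replace operation with the given compiled pattern."""
    return pattern.sub(REPLACEMENT, text)


def time_pattern(pattern, runs=50):
    """Return the total seconds taken by `runs` replace operations."""
    return timeit.timeit(lambda: fix(pattern, SAMPLE), number=runs)


if __name__ == "__main__":
    # Both forms must produce identical output before timing means anything.
    assert fix(ALTERNATION, SAMPLE) == fix(CHAR_CLASS, SAMPLE)
    print("alternation:", time_pattern(ALTERNATION))
    print("char class: ", time_pattern(CHAR_CLASS))
```

No expected timings are given here on purpose; as the thread concludes, a ratio measured in one language or engine should not be assumed to carry over to another.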
km/kg corrections OK, but summary incorrect
This edit correctly changed "67 Kg" and "800 Km" to "67 kg" and "800 km". However, the edit summary reads (Typo fixing, typos fixed: 7 Kg → 7 kg (2) using AWB).
Anyone want to try updating the rule to make the edit summary better? Thanks! GoingBatty (talk) 04:49, 14 September 2010 (UTC)
- One could make the summary more accurate by putting a quantifier (+ in this case) on the \d in the rule, but that would increase (albeit infinitesimally) the time the regex takes to run across every page scanned. It probably doesn't matter either way; if you want to put it in there that's how one would do it. Shadowjams (talk) 05:48, 14 September 2010 (UTC)
- Actually, on second look, that's not a Typo rule, that's a built-in program rule. I'm guessing that internal rule uses regex too though, so the same applies. Shadowjams (talk) 05:51, 14 September 2010 (UTC)
- Typo rule is for Kg to kg (case conversion). Rjwilmsi 07:22, 14 September 2010 (UTC)
- I see now. Shadowjams (talk) 16:45, 14 September 2010 (UTC)
- So should I move this from this talk page to a bug report? GoingBatty (talk) 16:34, 14 September 2010 (UTC)
- No, it is a typo issue. My second point was wrong (Rjwilmsi was correcting me). I was confused because I was looking for a rule that would add &nbsp; to the output, and there isn't a rule that does that (that part is internal). However, there is a rule that did the capitalization, and updating that would fix the OP's issue. It's this one: <Typo word="kg/km (kilogram/kilometer)" find="(\d(?:\s|&nbsp;|-)?)K(g|m)\b" replace="$1k$2" />.
- Change it to <Typo word="kg/km (kilogram/kilometer)" find="(\d+(?:\s|&nbsp;|-)?)K(g|m)\b" replace="$1k$2" /> and you've fixed the issue (see above for speed considerations). Shadowjams (talk) 16:45, 14 September 2010 (UTC)
- All of the rules have been updated with the +. Now I see in this edit that AWB accurately changed "16KHZ" → "16 kHz", but the edit summary says: (Typo fixing, typos fixed: 16KHZ → 16kHz using AWB) (without the space) GoingBatty (talk) 03:27, 17 September 2010 (UTC)
- Also this edit changed "710 KHz" and "970 KHz" to "710 kHz" and "970 kHz", but the edit summary is (Typo fixing, typos fixed: 710 KHz → 710 kHz (2) using AWB) GoingBatty (talk) 03:53, 17 September 2010 (UTC)
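The effect of adding the + to \d can be sketched in Python. This assumes the edit summary is built from the text the regex actually matched, which is consistent with the "7 Kg → 7 kg" summary reported above but is not confirmed against AWB's source. The middle alternative in the rule is assumed to be the `&nbsp;` entity, and `summary_entry` is a hypothetical helper.

```python
import re

# Rule before and after the change discussed above ($1/$2 become \1/\2).
# The "&nbsp;" alternative is an assumption about the rule's exact text.
OLD_RULE = re.compile(r"(\d(?:\s|&nbsp;|-)?)K(g|m)\b")
NEW_RULE = re.compile(r"(\d+(?:\s|&nbsp;|-)?)K(g|m)\b")
REPLACEMENT = r"\1k\2"


def summary_entry(rule, text):
    """Return 'matched → fixed' for the first match, or None if no match.

    Hypothetical helper: mimics a summary built from the matched span only.
    """
    match = rule.search(text)
    if match is None:
        return None
    fixed = match.expand(REPLACEMENT)
    return f"{match.group(0)} → {fixed}"


if __name__ == "__main__":
    text = "a load of 67 Kg"
    print(summary_entry(OLD_RULE, text))  # only the last digit is matched
    print(summary_entry(NEW_RULE, text))  # the whole number is matched
    # The article text comes out the same under either rule.
    print(OLD_RULE.sub(REPLACEMENT, text) == NEW_RULE.sub(REPLACEMENT, text))
```

With a single \d, the match can only start at the last digit before the unit, so the summary shows "7 Kg" even though the page text is corrected in full; \d+ widens the match to the whole number without changing the resulting edit.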
Opiod --> Opioid
Very common misspelling, hard to spot. Please add, thanks. -- Ϫ 07:16, 14 September 2010 (UTC)
- Wow that is common. Added a rule here. I looked around in a few dictionaries thinking it might be an alternative spelling just based on how common it is, but I couldn't find anything. Done Shadowjams (talk) 15:28, 14 September 2010 (UTC)
Italicise Latin words and phrases
Please italicise Latin words and phrases, the most common being et cetera (or etcetera, et caetera or et cætera), de facto, de jure, id est, ad libitum, circa, floruit and exempli gratia. McLerristarr / Mclay1 07:49, 14 September 2010 (UTC)
Sargent's cypress
I had typo fixing switched on. It made this error. It is a false positive for Sargent's cypress or Sargent cypress Regards Lightmouse (talk) 09:56, 14 September 2010 (UTC)
- Not done Only an error as the article incorrectly had the word in lower case. Rjwilmsi 21:06, 14 September 2010 (UTC)
Thanks for investigating it. Lightmouse (talk) 21:47, 14 September 2010 (UTC)
Supress --> Suppress
Another very common misspelling (over 2000 search results!) Including supressed/supressing/supression and whatever other prefixes there are. I'm surprised this one wasn't in there already..
Actually I did find "(Immuno)Suppress" in the list, but that doesn't seem correct.. it's already got the double-p, so maybe that's just a mistake? or what, but I don't know if maybe the (Immuno) part is affecting the detection somehow too.
Opress --> Oppress is another one we could add, that one is a bit less common but still coming up in search results. Except that the search results come up with the false positive "of-press" for some reason, which is slightly annoying, but I don't think that would affect AWB's typo detection anyway. -- Ϫ 22:50, 15 September 2010 (UTC)
- The existing "(Immuno)Suppress" rule already covers all of the suppress variations you've listed. Rule expanded for oppress too. Rjwilmsi 09:32, 16 September 2010 (UTC)
- Oh! okay. These regexes still confuse me. :) But, is it normal for there to still be so many existing misspellings? I thought that once a typo gets added to the list they usually all get fixed pretty quickly.. Is it just that no one has patrolled these articles yet with AWB? -- Ϫ 17:05, 16 September 2010 (UTC)
- The WP:TYPOSCAN project should go through these regularly but it's waiting for new data at the moment. Rjwilmsi 17:10, 16 September 2010 (UTC)
achitecture → architecture
Could someone please update the existing entry for "architecture" so it also catches "achitecture"? Thanks! GoingBatty (talk) 01:53, 17 September 2010 (UTC)
- I modified the rule for "Architect" to catch this.--BillFlis (talk) 08:55, 18 September 2010 (UTC)
etc... → etc.
Could the Etc. rule be changed so that it would also remove extra periods? (e.g. change "etc..." → to "etc.") Thanks! GoingBatty (talk) 02:44, 17 September 2010 (UTC)
- I think this should do it. Shadowjams (talk) 03:20, 17 September 2010 (UTC) Done
- I think you're on the right track. According to the AWB Regex Tester, that will fix "ect...." (which is great), but not "etc....." GoingBatty (talk)
- Ah. That makes sense. Ok, one more try.... Shadowjams (talk) 04:44, 17 September 2010 (UTC)
- See if that did it. Shadowjams (talk) 04:46, 17 September 2010 (UTC)
- Sorry - tried the AWB Regex Tester, and it still doesn't fix "etc...." or "etc" (with no periods) GoingBatty (talk) 16:23, 17 September 2010 (UTC)
- I took another look. What it's doing is it's looking for anything with an "Etc" followed by something that's not either a period or a word character (0-9,a-z). In the case of "etc....." it's skipping it because there's already a period, and not looking at the rest. This is intentional for two reasons. One, it terminates the search early on correct matches (which are the majority) and saves processing time, and second, it allows for unanticipated but correct uses, like an ellipsis. It not fixing "etc" is related... because there's nothing following the c, it doesn't catch. However, in a real article etc won't be alone. It will be followed by something: "etc more words". This sometimes comes up in testing. We try to design rules so they don't catch on correct spellings (even if they correct them back to themselves) because I assume they take more processing (they run entirely, as opposed to stopping midway through). Maybe that's unnecessary, but most of the rules adhere to that format. Shadowjams (talk) 22:10, 17 September 2010 (UTC)
- I appreciate your reply. I made this request because I thought that "etc." plus an ellipsis was not a correct use. Why would an ellipsis be necessary? Thanks! GoingBatty (talk) 15:26, 19 September 2010 (UTC)
- That's a good point. I tended towards the cautious with some of these when I started, and early on I added the etc. rule that's currently in use (although there was a simpler one before it). I think the change you're talking about would be fine. Shadowjams (talk) 05:12, 20 September 2010 (UTC)
- Thanks Shadowjams. I was playing around with how to edit the rule to fix "etc....", but couldn't get it to skip "etc." Could you please help me with this? Thanks! GoingBatty (talk) 17:07, 20 September 2010 (UTC)
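One way to collapse extra periods while skipping a correct "etc." is to require at least two periods before the pattern matches at all. The Python sketch below is a hypothetical pattern for illustration only: it is not the Etc. rule in the AWB list, it deliberately ignores "ect" and a bare "etc", and it assumes .NET treats `\.{2,}` the same way Python does.

```python
import re

# Hypothetical pattern, not the actual AWB rule: "etc" followed by two or
# more periods. A single period never matches, so a correct "etc." is
# skipped without being rewritten to itself.
EXTRA_PERIODS = re.compile(r"\b([Ee]tc)\.{2,}")


def collapse_etc_periods(text):
    """Reduce 'etc..', 'etc...', 'etc....' and so on to 'etc.'."""
    return EXTRA_PERIODS.sub(r"\1.", text)


if __name__ == "__main__":
    for sample in ("apples, pears, etc.", "apples, pears, etc...",
                   "apples, pears, etc.....", "Etc.. and so on"):
        print(repr(sample), "->", repr(collapse_etc_periods(sample)))
```

Because the quantifier starts at two, the pattern fails on correct text as soon as it sees a single period followed by anything else, which keeps the early-termination property described earlier in this thread.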
Inconsistent use of formats such as '(C|c)' and '[Cc]'. Propose change all to '[Cc]'
The list is inconsistent in whether the regex uses '(C|c)' or '[Cc]'. I propose changing them all to the format '[Cc]'. It's trivial but using the same format makes it slightly easier to notice the real differences. Any objections? Lightmouse (talk) 15:15, 17 September 2010 (UTC)
- They are not equivalent. "(C|c)" is equivalent to "([Cc])". Also, I know there was some discussion about speed, but a more important consideration might be space. This page is already huge, and changing every instance of this would add another character to each of the affected rules, which is the large majority of them.--BillFlis (talk) 18:54, 17 September 2010 (UTC)
You're quite right, the pairings are '(C|c)' with '([Cc])', or '(?:C|c)' with '[Cc]'. I agree that compact code is a good thing. I'll leave it to you. Incidentally, I'm sure there are more units of measure that would be useful, also I only see one square unit of length and there could be cubes too. Lightmouse (talk) 20:23, 17 September 2010 (UTC)
- Bill sums up the issue exactly. I can see positives to both. In some ways I think ([Cc]) is conceptually clearer, but that's a personal preference. I made the changes to all of the New additions thinking the speed tradeoff was more important than later testing demonstrated. There is 1 character difference between the two; I don't see any reason to prefer one over the other. I think it's best to leave them as they're originally created, with whatever idiom the creator chooses. Shadowjams (talk) 21:58, 17 September 2010 (UTC)
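The pairings described above can be checked by counting capture groups. The sketch below uses Python's `re`, where a compiled pattern exposes its group count as `.groups`; the grouping syntax shown is assumed to mean the same thing in .NET, and `capture_count` is a hypothetical helper.

```python
import re


def capture_count(pattern):
    """Return how many capturing groups a pattern defines."""
    return re.compile(pattern).groups


if __name__ == "__main__":
    for pattern in ("(C|c)", "([Cc])", "(?:C|c)", "[Cc]"):
        print(pattern, "captures:", capture_count(pattern))

    # A replacement that refers to group 1 only works with a capturing form.
    print(re.sub(r"(C|c)att\b", r"\1at", "Catt and catt"))
    try:
        re.sub(r"[Cc]att\b", r"\1at", "Catt and catt")
    except re.error as exc:
        print("no group 1 to refer to:", exc)
```

This is why a blanket swap of '(C|c)' for '[Cc]' would break any rule whose replacement uses the captured letter: the capturing form has to become '([Cc])', which is one character longer.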
Units of measure
There is km². Would it also be possible to do km³, m², m³, ft², ft³ ? Lightmouse (talk) 08:50, 18 September 2010 (UTC)
Should the regex be using an escape character?
I notice that square kilometre contains:
[-.\s]
Should it be:
[-\.\s]
Regards Lightmouse (talk) 16:46, 19 September 2010 (UTC)
- I don't think you need to escape characters inside character classes (says as much). Shadowjams (talk) 21:01, 19 September 2010 (UTC)
- There's another problem with that though. The - needs to be at the end of the class, otherwise it's looking for a range. I'm not sure what it does in that case, but it might explain any strange effects you're seeing. Shadowjams (talk) 21:02, 19 September 2010 (UTC)
- No, a hyphen immediately after a "[" counts as a literal hyphen. [2] -- John of Reading (talk) 06:13, 20 September 2010 (UTC)
- Interesting. That's actually a little new... it doesn't work with grep for instance. Perl calls this version 8 regex (I think). Apparently - at either the beginning or end is fine, but in the middle, of course, it's ambiguous. Shadowjams (talk) 06:17, 20 September 2010 (UTC)
Aha - "the dot is not a metacharacter inside a character class, so we do not need to escape it with a backslash.". Very interesting, thanks. Lightmouse (talk) 17:15, 20 September 2010 (UTC)
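The two points settled above (a leading hyphen is literal, and a dot inside a class needs no backslash) can be checked with a short Python sketch. It relies on Python's `re` module; the .NET engine is assumed to agree for these cases, as the quoted source suggests, and `same_result` is a hypothetical helper.

```python
import re

UNESCAPED = re.compile(r"[-.\s]")   # form used in the square kilometre rule
ESCAPED = re.compile(r"[-\.\s]")    # form with the dot escaped


def same_result(char):
    """Return True if both classes agree on whether `char` matches."""
    return bool(UNESCAPED.fullmatch(char)) == bool(ESCAPED.fullmatch(char))


if __name__ == "__main__":
    for char in ("-", ".", " ", "a", "5"):
        print(repr(char),
              bool(UNESCAPED.fullmatch(char)),
              bool(ESCAPED.fullmatch(char)))
```

If the dot were acting as a wildcard inside the class, "a" and "5" would match the unescaped form; they do not, so the backslash is optional here.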
Not fixing "hungarian" ?
Although there's an existing rule for "Hungary" that includes "Hungarian", it doesn't want to fix "hungarian" and "hungarians" in Culture of Hungary. When I tried the rule in the AWB Regex tester, it seems to work fine. Any ideas? GoingBatty (talk) 04:22, 20 September 2010 (UTC)
criticized
AWB replaced "critiziced" with "criticiziced" here, but it should have been "criticized". Please fix. —bender235 (talk) 14:07, 23 September 2010 (UTC)
- I limited the rule for "Critical", which was evidently making this change, to not make this particular change. We'll need a new rule to correct "critiziced" to "criticized", which I was surprised to find has more than a dozen occurrences on wikipedia.--BillFlis (talk) 16:22, 23 September 2010 (UTC)