Wikipedia talk:AutoWikiBrowser/Typos
Cat-like
AWB changed cat-like to catlike in Dog. It ignored the term dog-like, which I manually changed to doglike after checking merriam-webster.com. Both were reverted. AWB also skipped over wolf-like and fox-like. I am wondering if there should be consistency in how AWB treats cat-like, dog-like, wolf-like and fox-like (and whether either form is allowed). Also, if cat-like is acceptable or correct, should cat-like be ignored? Kaltenmeyer (talk) 00:29, 7 March 2022 (UTC)
- @Kaltenmeyer: I agree that the typo rules should treat "cat-like", "dog-like", etc. the same way. However, it may be challenging to expand any rule to include every possible animal. GoingBatty (talk) 03:25, 7 March 2022 (UTC)
Adding comma after MDY?
Can someone help point me to where the MOS says we should update "was held on November 2, 2010 to elect all 11 members of the newly formed"
to "was held on November 2, 2010, to elect all 11 members of the newly formed"
? (i.e. adding a comma after 2010). Jonatan Svensson Glad (talk) 00:37, 2 April 2022 (UTC)
- MOS:DATECOMMA MB 00:53, 2 April 2022 (UTC)
- Duh, of course it was that easily named (/me facepalms). I've just never seen that anywhere in writing before, so it looks really weird to me. But if it's in the manual, I won't argue. Jonatan Svensson Glad (talk) 00:56, 2 April 2022 (UTC)
- I call it the "wikicomma". Yes, it's weird, and hard to remember as I would never use it elsewhere. I think the idea is to treat 2020 like a relative clause qualifying November 2 (as in
November 2, All Souls' Day, was foggy.
) Certes (talk) 10:32, 2 April 2022 (UTC)
- Really, Certes? Fascinating ~ i wouldn't consider my writing to be correct if i didn't use that second comma. That's what i love about this community ~ different people/generations/educations/preferences all working together; hooray for us! Happy days ~ LindsayHello 10:52, 2 April 2022 (UTC)
- Yeah, here in Sweden we use DMY or the ISO YYYY-MM-DD format, and when using MDY I rarely use comma after the year in my natural writing. Feels even weirder when typing things like
(born March 13, 2020, in New York)
since that is such a short phrase and not a full sentence. Jonatan Svensson Glad (talk) 11:53, 2 April 2022 (UTC)
savinging$3
'test saving test'.replace(/(?=([aeiou][bdfgklmnprstvz])\2{2,})(?<=\b(?:[A-Z][a-z]*|[a-z]+))\1\2{3,}(e(?:d|rs?)|i(?:ngs?|ons?|ves?)|ors?)\b/,'$1$2$2$3');
returns "test savinging$3 test". Why is this happening? Wikipedia:AutoWikiBrowser/Typos (diff ~256522285) @ThaddeusB: any insight? — Alexis Jazz (talk or ping me) 16:42, 17 April 2022 (UTC)
- The pattern matches "aving", setting $1 to "av" and $2 to "ing". There are only two captures – ([a... and (e(... – so $3 is unset and just returns "$3". "aving" is replaced by "av" + "ing" + "ing" + "$3". Certes (talk) 17:55, 17 April 2022 (UTC)
- ...er... what's that \2 doing to the left of capture 2? That can't be right. Certes (talk) 18:13, 17 April 2022 (UTC)
- Certes (or anyone), so it may be broken (but it doesn't break AWB? we'd have heard sooner?) any idea what this replacement is even supposed to do? — Alexis Jazz (talk or ping me) 20:42, 17 April 2022 (UTC)
- Amending the pattern to
/(?=([aeiou])([bdfgklmnprstvz])\2{2,})(?<=\b(?:[A-Z][a-z]*|[a-z]+))\1\2{3,}(e(?:d|rs?)|i(?:ngs?|ons?|ves?)|ors?)\b/
(adding two brackets, ")(", after [aeiou] so that the vowel and the consonant are captured separately) would make it reduce the number of consecutive identical consonants to two in typos like "gettting" and "scisssors". But I'm not sure why we're picking on this particular pattern. Almost all triple letters are wrong, and the false positives have an upper case initial (e.g. Rossshire) with very few exceptions such as Riot grrrl. Certes (talk) 20:55, 17 April 2022 (UTC)
- Any replacement also needs to avoid changing www.example.com and similar (which this regexp does by excluding w from the consonant list). Certes (talk) 11:05, 18 April 2022 (UTC)
- Certes, interesting. One more question: I assume AWB isn't affected as an expression that mangles every instance of common words like "saving" or "living" would have been caught years ago. Any idea why? — Alexis Jazz (talk or ping me) 14:21, 18 April 2022 (UTC)
- @Alexis Jazz, the \2 before the second capture group is defined might lead it to be ignored? Qwerfjkltalk 14:36, 18 April 2022 (UTC)
- @Alexis Jazz: The diff in your first post here dates from 2008. The rule was edited by Special:Diff/976913898 in 2020, and has probably not achieved anything since then. -- John of Reading (talk) 15:02, 18 April 2022 (UTC)
- John of Reading, I missed that, you found the parentheses Certes was talking about! What do you mean when you say "has probably not achieved anything since then"? This issue was discovered by Qwerfjkl 7 hours and 1 minute after I added RegExTypoFix support to Bawl. Considering the number of users AWB has, it seems unlikely this would go unnoticed for some one and a half years if AWB was actually affected. So I'd assume somehow AWB and Bawl implement this list differently, and one of them might be suboptimal, but I don't know which. @Smasongarrison: why did you remove them? — Alexis Jazz (talk or ping me) 15:55, 18 April 2022 (UTC)
- @Alexis Jazz: Yes, AWB and Bawl must be using different regex engines behind the scenes. The rule currently tries to make use of a numbered capture group before it's been defined, so it's an edge case that might turn out differently in different implementations. I'm going to put those parentheses back, as with those in place I can see what the rule is trying to do. -- John of Reading (talk) 16:16, 18 April 2022 (UTC)
- John of Reading, thank you! Bawl just uses the browser, so the .replace JS above is what would be running. Perhaps different browsers could yield different results. I'd think JWB should be affected as well, but who knows. Fixed is fixed. — Alexis Jazz (talk or ping me) 16:35, 18 April 2022 (UTC)
- @Alexis Jazz and John of Reading: I scanned an April 1 database dump for the updated regex pattern, and used AWB to fix 60 typos so far, with hundreds more to go. Some of the typos had an additional problem besides the triple consonant, but it was still good that AWB identified the issue. GoingBatty (talk) 04:35, 19 April 2022 (UTC)
- @Alexis Jazz and John of Reading: Done - 189 typos fixed. One false positive was the musician Spellling; I added wikilinks to the article to avoid incorrect fixes. GoingBatty (talk) 15:14, 19 April 2022 (UTC)
- There don't seem to be any similar problems in other regexps on this page: all \2s are to the right of two captures, all $3s have three captures, etc. We caught one stray $3 after the big 2020 optimisation changes but I don't think we checked for \2. Certes (talk) 20:44, 18 April 2022 (UTC)
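For anyone wanting to reproduce the behaviour discussed in this thread, here is a minimal sketch that runs both patterns in a JavaScript engine. The two regexps and the replacement string are copied from the comments above; the variable names are only illustrative. It assumes an engine with lookbehind support (e.g. a recent browser or Node.js), and says nothing about how AWB's own engine treats the forward reference.

```javascript
// Pattern as quoted at the top of this thread: \2 is used before
// capture group 2 has been defined, and only two groups exist.
const broken = /(?=([aeiou][bdfgklmnprstvz])\2{2,})(?<=\b(?:[A-Z][a-z]*|[a-z]+))\1\2{3,}(e(?:d|rs?)|i(?:ngs?|ons?|ves?)|ors?)\b/;

// Amended pattern from Certes's reply: the vowel and the consonant
// are separate captures, so \2 refers to the consonant and $3 exists.
const amended = /(?=([aeiou])([bdfgklmnprstvz])\2{2,})(?<=\b(?:[A-Z][a-z]*|[a-z]+))\1\2{3,}(e(?:d|rs?)|i(?:ngs?|ons?|ves?)|ors?)\b/;

const replacement = '$1$2$2$3';

// The broken rule mangles an ordinary word.
const brokenResult = 'test saving test'.replace(broken, replacement);
console.log(brokenResult); // "test savinging$3 test"

// The amended rule leaves "saving" alone and trims triple consonants.
const amendedResults = ['test saving test', 'gettting', 'scisssors']
  .map((text) => text.replace(amended, replacement));
console.log(amendedResults); // [ 'test saving test', 'getting', 'scissors' ]
```

In JavaScript a backreference to a group that has not yet captured anything matches the empty string, which is why the broken pattern still matches "aving"; other engines may handle that edge case differently.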
Proposed additions
I'm considering some new additions listed here and would value any comments before I mess up your list. I've fixed 30+ cases of each in the previous month with few or no false positives. A few suggestions resemble existing fixes but address different typos, e.g. the current entry for Mauritius uppercases the initial M whereas this fix is for misspellings such as Mauritus. Certes (talk) 13:51, 18 April 2022 (UTC)
- @Certes: Most of these look good to me! The article Argentina lists "Argentinian" as an appropriate demonym, so I suggest that your "Argentine" be changed to "Argentinian". If a rule already exists, I hope you consider merging your changes instead of creating a new rule. GoingBatty (talk) 15:44, 18 April 2022 (UTC)
- I changed many Argentinan typos to Argentinian, as it's nearer to the text, but used Argentine for people (where it seems to be preferred). Argentinian throughout isn't wrong and would be an improvement. What's the best way to merge (for example) Mauritius? The tricky bit is avoiding null changes. A negative lookahead before the expression can be expensive, a negative lookbehind after it may not work in all regexp parsers, and separating it as "(M...|m...)" is only paying lip service to the concept of merging. Certes (talk) 20:17, 18 April 2022 (UTC)
- Added. Thanks for the advice. I've labelled the rules with duplicate names "Foo (2)", but if someone can combine them efficiently that might be an improvement. Certes (talk) 14:34, 21 April 2022 (UTC)
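As a rough illustration of the negative-lookahead option mentioned above, here is a hypothetical merged rule. The pattern is invented for this sketch and is not the rule on the typo list; it only covers the lower-case initial and the "Mauritus" misspelling cited in this section. The lookahead excludes the already-correct spelling so that the rule never makes a null change.

```javascript
// Hypothetical merged rule, NOT the one on the typo list.
// (?!Mauritius\b) skips the correct spelling, avoiding null changes.
const merged = /\b(?!Mauritius\b)[Mm]auriti?us\b/g;

// Helper name is illustrative only.
const fixMauritius = (text) => text.replace(merged, 'Mauritius');

const sample = 'mauritius, Mauritus and Mauritius';
const matched = sample.match(merged); // only the two wrong forms
const fixed = fixMauritius(sample);

console.log(matched); // [ 'mauritius', 'Mauritus' ]
console.log(fixed);   // "Mauritius, Mauritius and Mauritius"
```

Whether the extra lookahead is cheap enough across a full database scan is exactly the cost concern raised above, so this is one possible shape rather than a recommendation.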
Hyphenated phrase
The hyphen is not removed from "less-populated". MB 04:11, 19 April 2022 (UTC)
- @MB I just added a rule for you to fix both "less-populated" and "more-populated". GoingBatty (talk) 04:34, 19 April 2022 (UTC)
Testing with JWB
Perhaps everyone else knew this already, and there may well be an easier way to do it, but I've finally found a way to test new additions without riskily adding them to the public list or going through the tedious and error-prone process of copying and pasting every regexp into the UI. To add a custom set of typos in a format matching AWB/T to the list, start JWB, invoke the browser's JavaScript console and paste
RETF.list = []; // Empty the list - only needed for iterative testing
(new mw.Api()).get({
    action: 'query',
    prop: 'revisions',
    titles: 'User:Example/typos', // Substitute the title of your typo list page here
    rvprop: 'content',
    rvlimit: '1',
    indexpageids: true,
    format: 'json',
}).done(RETF.buildList);
Omit the first line to retain the standard list, but it's useful to get rid of a broken custom list before retesting after a fix. The titles: line can be any Wikipedia page, e.g. User:You/sandbox. Certes (talk) 21:03, 20 April 2022 (UTC)
Duplicate word=
We have a few duplicated values for word= in the typo list. Do these need to be made unique? List: "-ality", "First (3)", "Its (after)", "Its (before)", "Nonoperational", "Predecessor", "Regardless", "Sanskrit", "Thaw", "e.g.", "east–west", "km²", "north–south", "south–north", "sworn in", "west–east". (I was checking in case I duplicated any, but someone seems to have beaten me to it.) Certes (talk) 22:47, 20 April 2022 (UTC)
- Also, we have a typo entry marked disable=. Should that be disabled=, or are the two equivalent (perhaps anything other than word= works)? Certes (talk) 11:41, 21 April 2022 (UTC)
- @Certes: If I remember correctly, the AWB implementation just checks that "word=" is present, but doesn't do anything else with it. So, yes, changing "word" to anything else will disable a rule. Duplicate names have no effect, but it's easier to refer to a rule in edit summaries and discussions if they are unique. It's time I downloaded the source code again. -- John of Reading (talk) 14:55, 21 April 2022 (UTC)
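A check for duplicate word= values like the one described at the top of this section could be sketched as follows. The function name and the sample entries are made up for illustration; the sketch assumes rules are written as `<Typo word="..." find="..." replace="..."/>`, and it ignores entries that do not start with word= (such as one marked disable=).

```javascript
// Hypothetical helper: list word="..." values that occur more than once.
function findDuplicateWords(listText) {
  const counts = new Map();
  // Only entries of the form <Typo word="..." ...> are counted.
  for (const match of listText.matchAll(/<Typo\s+word="([^"]*)"/g)) {
    counts.set(match[1], (counts.get(match[1]) || 0) + 1);
  }
  return [...counts].filter(([, n]) => n > 1).map(([word]) => word);
}

// Placeholder rules: the find/replace values are dummies, not real list entries.
const sampleList = [
  '<Typo word="Thaw" find="a" replace="b"/>',
  '<Typo word="Sanskrit" find="c" replace="d"/>',
  '<Typo word="Thaw" find="e" replace="f"/>',
  '<Typo disable="Foo" find="g" replace="h"/>',
].join('\n');

const duplicates = findDuplicateWords(sampleList);
console.log(duplicates); // [ 'Thaw' ]
```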
"libration war"
Hi, we currently have 107 examples of "libration war", please can they be changed to "liberation war"? Ta ϢereSpielChequers 21:37, 27 April 2022 (UTC)
- In progress, done. Neils51 (talk) 03:37, 28 April 2022 (UTC)
MilliWatt = MediaWiki
<Typo word="W (watt)" find="([\d\.]+(?:[−―–—\s]| )?[µmkMGT])w\b" replace="$1W"/>
changes ".mw-first-heading" (a CSS class of #firstHeading) to ".mW-first-heading". For a non-code example, the ccTLD for Malawi (http://www.registrar.mw/) also matches. Found only one live bad replacement by User:Schwede66 from 2011: 2004 New Zealand local elections (diff 457286863). — Alexis Jazz (talk or ping me) 04:06, 4 May 2022 (UTC)
- @Alexis Jazz: Could we fix this by ensuring a digit appears before the period, such as this:
find="(\d[\d\.]*(?:[−―–—\s]| )?[µmkMGT])w\b"
GoingBatty (talk) 12:37, 4 May 2022 (UTC)
- ...or indeed after the period with just
find="(\d(?:…
, as ".123 mW" seems more likely than "123. mW". That also avoids domains such as "source123.mw". Certes (talk) 13:30, 4 May 2022 (UTC)
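To compare the current rule with GoingBatty's proposal, here is a small sketch that applies both to the strings mentioned in this section plus one ordinary case ("100 mw", added here as an assumption of what the rule is meant to fix). The patterns are copied as they appear on this page; the variable names are illustrative. Certes's shorter variant is truncated above, so it is not reproduced.

```javascript
// Current rule, as quoted in this section.
const currentRule = /([\d\.]+(?:[−―–—\s]| )?[µmkMGT])w\b/g;
// GoingBatty's proposal: require a digit before any period.
const proposedRule = /(\d[\d\.]*(?:[−―–—\s]| )?[µmkMGT])w\b/g;

const samples = [
  '.mw-first-heading',
  'http://www.registrar.mw/',
  '100 mw',
  'source123.mw',
];

const before = samples.map((s) => s.replace(currentRule, '$1W'));
const after = samples.map((s) => s.replace(proposedRule, '$1W'));

console.log(before);
// [ '.mW-first-heading', 'http://www.registrar.mW/', '100 mW', 'source123.mW' ]
console.log(after);
// [ '.mw-first-heading', 'http://www.registrar.mw/', '100 mW', 'source123.mW' ]
```

The proposal leaves the CSS class and the Malawi domain alone while still fixing "100 mw", but it still changes "source123.mw", which is the remaining case Certes's suggestion is aimed at.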