Wikipedia talk:AutoWikiBrowser/Typos

This is an old revision of this page, as edited by BillFlis (talk | contribs) at 20:32, 20 July 2011 (Regex for SI units). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.



Fix for youtube tag

Hi! First of all, thanks for this amazing list! It has been incredibly useful! Though there's a minor error in it. At Wikia, we use <youtube> tags, without capitals. However, AutoWikiBrowser wants to "fix" these tags by changing them into <YouTube>. Could this be fixed? Thanks! 213.93.184.183 (talk) 19:54, 5 June 2011 (UTC)Reply

Oh and, the youtube tag can also hold parameters such as <youtube width=>, so that should be taken into consideration too. 213.93.184.183 (talk) 19:55, 5 June 2011 (UTC)Reply
I've disabled the rule so that it doesn't damage your pages. With luck, one of the regexp experts will be able to fix and re-enable it. -- John of Reading (talk) 20:35, 5 June 2011 (UTC)Reply
  Fixed, I think, with this edit. Would you try it again on an appropriate Wikia page? -- JHunterJ (talk) 22:22, 5 June 2011 (UTC)Reply


(EC) The tag's not case sensitive is it? Isn't this just a cosmetic issue?
Nevertheless, this should fix it: (?<!<)\b(?:Yout|you[Tt])ube\s. It should fix all cases of your tags too because it will just avoid all youtube phrases that begin with <. Shadowjams (talk) 22:31, 5 June 2011 (UTC)Reply
I guess J already did it. That version's a little more expensive, but it's a little clearer to read too. Shadowjams (talk) 22:31, 5 June 2011 (UTC)Reply
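For anyone who wants to try the lookbehind pattern quoted above outside AWB, here is a minimal sketch in Python. It assumes Python's re module treats this pattern the same way AWB's .NET engine does; the function name fix_youtube and the sample strings are made up for illustration.

```python
import re

# Pattern as quoted in the thread: skip any "youtube" directly preceded by "<".
MISCASED_YOUTUBE = re.compile(r"(?<!<)\b(?:Yout|you[Tt])ube\s")


def fix_youtube(text: str) -> str:
    """Re-case 'Youtube'/'youtube'/'youTube', keeping the whitespace that followed it."""
    # m.group(0) ends with the single whitespace character matched by \s.
    return MISCASED_YOUTUBE.sub(lambda m: "YouTube" + m.group(0)[-1], text)


print(fix_youtube("a video on Youtube today"))
print(fix_youtube("<youtube width=300>"))
```

Because the pattern requires whitespace after the word, a sentence-final "Youtube." is left untouched by this version.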
Leading with a negative-lookbehind strikes me as more expensive than ending with one. When leading, at every position (not just every occurrence of youtube, but every position) in the text, the parser has to check whether the previous character is not a <, and if it is not (which is usually the case), then look for a mal-cased YouTube. With trailing, it only looks back if it has found a mal-cased YouTube, so it should be cheaper. The compiler may disagree, but I'd have to see stats. And then I cleaned up the .com check, to look for actual .coms, so it will correctly fix " .... video on Youtube." -- JHunterJ (talk) 02:15, 6 June 2011 (UTC)Reply
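The two placements of the lookbehind can be compared side by side. The sketch below is only an illustration in Python: the TRAILING pattern is a hypothetical rewrite (the actual edit is not quoted in this thread), and the code checks only that both forms find the same matches on a sample string. It does not measure speed.

```python
import re

# Leading form, as quoted earlier in the thread.
LEADING = re.compile(r"(?<!<)\b(?:Yout|you[Tt])ube\s")

# Hypothetical trailing form: match the 7-letter word first, then look back
# 8 characters and reject the match if they are "<" plus those 7 letters.
TRAILING = re.compile(r"\b(?:Yout|you[Tt])ube(?<!<.{7})\s")

SAMPLE = "See Youtube here, <youtube width=300> and youTube too"


def spans(pattern, text):
    """Return the (start, end) span of every match of pattern in text."""
    return [m.span() for m in pattern.finditer(text)]


print(spans(LEADING, SAMPLE) == spans(TRAILING, SAMPLE))
```

Whether the trailing form is actually cheaper depends on the regex engine, so timing it in AWB itself would still be needed to settle the question.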
Thanks for the fix :)! 213.93.184.183 (talk) 14:55, 6 June 2011 (UTC)Reply
Me again, sorry for not reporting this earlier but I completely forgot: <youtube> now works, but </youtube> (note the slash) is still converted to YouTube. Thanks in advance :). 213.93.184.183 (talk) 22:24, 18 June 2011 (UTC)Reply
Should be   Fixed for that with this edit. -- JHunterJ (talk) 11:47, 19 June 2011 (UTC)Reply

What broke?

Re this edit summary. I reloaded the typo list right after saving that and made several fixes. What was breaking? -- JHunterJ (talk) 17:52, 12 June 2011 (UTC)Reply

The error was reading "too many 's" specifically with the Blu-Ray regexp. AWB was refusing to activate Typo correction for me because of it. Stuart.Jamieson (talk) 18:02, 12 June 2011 (UTC)Reply
Odd (esp. since there are no apostrophes in the change). I'm running AWB 5.3.0.0 SVN 7728 on IE 9.0.8112.16421, .NET 2.0.50727.5444, Windows 6.1. Can you tell me which versions if any are different in yours? I wonder if any of my hyphen changes need to be escaped under some environments. Thanks. -- JHunterJ (talk) 18:10, 12 June 2011 (UTC)Reply
.NET is 2.0.50727.4211 and Windows is 6.0 but there was more to the message, I only realised by recreating it - you had missed a closing bracket. Stuart.Jamieson (talk) 18:37, 12 June 2011 (UTC)Reply
Thanks. (I also must have mis-timed my reload of the typo list somehow.) Cheers! -- JHunterJ (talk) 18:40, 12 June 2011 (UTC)Reply
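A missing closing bracket of this kind can be caught before saving by compiling the pattern first. The sketch below uses Python's re module as a rough stand-in for AWB's .NET engine, so it is only an approximate check; the function name regex_problem and both Blu-ray patterns are hypothetical, since the actual rule is not quoted here.

```python
import re
from typing import Optional


def regex_problem(pattern: str) -> Optional[str]:
    """Return the compile error message for pattern, or None if it compiles."""
    try:
        re.compile(pattern)
    except re.error as exc:
        return str(exc)
    return None


# Hypothetical patterns for illustration only.
balanced = r"\b(B|b)lu[- ]?(?:R|r)ay\b"
unbalanced = r"\b(B|b)lu(?:-| ?[Rr]ay\b"  # the "(?:" group is never closed

print(regex_problem(balanced))
print(regex_problem(unbalanced))
```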

"Communtiy"

I encountered 4 dozen articles with this typo, and suggest adding:

<Typo word="Community" find="\b(C|c)ommuntiy\b" replace="$1ommunity" />
but we already have
<Typo word="Community_" find="\b(C|c)om(?:un|m?unn|m?unn?t)(al(ly)?|ity|ities|ions?|is[mt]s?)\b" replace="$1ommun$2" />
and I don't see a way to combine them. Any takers? Chris the speller yack 18:05, 26 June 2011 (UTC)Reply
I left Jabari Simama as it is for testing purposes. Chris the speller yack 18:16, 26 June 2011 (UTC)Reply
No, not without splitting Communal and Communion and Communism/t from the latter. -- JHunterJ (talk) 20:13, 26 June 2011 (UTC)Reply

  Done by JHunterJ, and a nice job, too. Chris the speller yack 22:26, 26 June 2011 (UTC)Reply
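The gap between the two rules quoted above can be reproduced outside AWB. This is a minimal sketch in Python, assuming its re module behaves like the .NET engine for these patterns; the $1 and $2 references are rewritten as \g<1> and \g<2>, and the names EXISTING, PROPOSED and apply_rule are made up for illustration.

```python
import re

# The existing "Community_" rule and the proposed "Community" rule from the thread.
EXISTING = (
    re.compile(r"\b(C|c)om(?:un|m?unn|m?unn?t)(al(ly)?|ity|ities|ions?|is[mt]s?)\b"),
    r"\g<1>ommun\g<2>",
)
PROPOSED = (
    re.compile(r"\b(C|c)ommuntiy\b"),
    r"\g<1>ommunity",
)


def apply_rule(rule, text):
    """Apply one (pattern, replacement) pair to text."""
    pattern, replacement = rule
    return pattern.sub(replacement, text)


print(apply_rule(EXISTING, "the Communtiy centre"))
print(apply_rule(PROPOSED, "the Communtiy centre"))
```

The existing rule leaves "Communtiy" alone because, after "Commun" and "t", the remaining letters "iy" are not one of the suffixes it lists.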

Inheritence -> inhernheritance?

AWB is correcting "inheritence" into "inhernheritance" for some reason--it happened to me on several articles in a row. Sample: [1] Thanks! -- Khazar (talk) 15:37, 13 July 2011 (UTC)Reply

Fixed -- John of Reading (talk) 16:02, 13 July 2011 (UTC)Reply
Thanks! -- Khazar (talk) 16:19, 13 July 2011 (UTC)Reply

A new typo fix suggestion

I would like to make a typo-correction suggestion that relates to capitalization, specifically of the CamelCase variety. I'm not sure how often these words occur outside of Nickelodeon- and cartoon-related articles, but it's not uncommon to see the character name "SpongeBob SquarePants" from the show by the same name incorrectly written without the capital "B" or "P" in his name. My suggestion would be to correct "Spongebob" to "SpongeBob" and "Squarepants" to "SquarePants" through AWB. --Sgt. R.K. Blue (talk) 23:02, 17 July 2011 (UTC)Reply

I added a rule to fix his name when both the capital "B" and capital "P" are lowercase. Let's see how this works, and maybe others can expand the rule to fix other cases. Enjoy! GoingBatty (talk) 00:20, 18 July 2011 (UTC)Reply
Changed my mind and split this into two rules: one for "SpongeBob" and one for "SquarePants". GoingBatty (talk) 00:41, 18 July 2011 (UTC)Reply
Interesting how the rule doesn't always fix "Spongebob" (e.g. SpongeBob SquarePants (season 3)) GoingBatty (talk) 00:53, 18 July 2011 (UTC)Reply
I hacked away at it for a while. If you take out the Italian inter-wiki line, then the Typo fixes work. Then you can put the inter-wiki line back. Go figure. Sometimes AWB is battier than you are. Chris the speller yack 04:41, 18 July 2011 (UTC)Reply
Thanks for figuring that out, Chris! Guess when the instructions state "Typo fixing is automatically prevented on image names, templates, wikilink targets and quotes", that means interwiki links too. GoingBatty (talk) 04:46, 18 July 2011 (UTC)Reply
Well, it's a start, anyway. Thanks for giving it a go. --Sgt. R.K. Blue (talk) 03:45, 18 July 2011 (UTC)Reply
After fixing over 100 articles, I just updated the rules to also catch "Sponge Bob" and "Square Pants". Enjoy! GoingBatty (talk) 04:52, 18 July 2011 (UTC)Reply
Thanks also for all the SpongeBob-related fixes; I ran AWB for a short time a little while ago and didn't pick up any more, though I'm sure there are still many out there yet to be discovered. Meanwhile, another CamelCase correction also struck me that might be worth adding: I've sometimes seen the company DreamWorks incorrectly written without the capital "W" (Dreamworks). --Sgt. R.K. Blue (talk) 08:47, 18 July 2011 (UTC)Reply
There's still a lot of work to do for the "SpongeBob" fixes for those pages where the Typo rule won't fix it. The SpongeBob SquarePants (season 3) example above indicates that the typo rule won't fix pages that contain an Italian interwiki link with "Spongebob" in the title. Also, the typo rule won't fix any pages that link to Spongebob Squarepants or other spelling variants created as redirects.
I created a typo rule for "DreamWorks", as you requested. GoingBatty (talk) 16:50, 18 July 2011 (UTC)Reply
  Done - I believe all the instances of improper capitalization for "SpongeBob" and "SquarePants" have now been fixed (via typo fixing or find/replace). I'll leave it to you to work with the other wikis to get their SpongeBob-related articles fixed.  :-) GoingBatty (talk) 04:38, 20 July 2011 (UTC)Reply
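The kind of CamelCase rules discussed in this thread might look like the sketch below. The patterns are hypothetical, since the actual AWB rules are not quoted here, and Python's re module stands in for the .NET engine. The code also makes no attempt to skip link targets or interwiki links the way AWB does.

```python
import re

# Hypothetical rules: each pattern matches only the mis-cased or split forms,
# so text that is already correct is never touched.
RULES = [
    (re.compile(r"\bSponge(?:bob| [Bb]ob)\b"), "SpongeBob"),
    (re.compile(r"\bSquare(?:pants| [Pp]ants)\b"), "SquarePants"),
    (re.compile(r"\bDreamworks\b"), "DreamWorks"),
]


def fix_camel_case(text: str) -> str:
    """Apply every rule in turn and return the corrected text."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text


print(fix_camel_case("Spongebob Squarepants is not made by Dreamworks"))
```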

Regex for SI units

I see that the regex for SI units has:

  • ([\d\.]+(?:\s|&nbsp;|-)?)foobar

The following appears to be identical in effect:

  • ([\d\.](?:\s|&nbsp;|-)?)foobar

Furthermore, I think strings such as "7. foobar" (note the trailing decimal point) are not worthy targets. Thus the following code might be tighter:

  • (\d(?:\s|&nbsp;|-)?)foobar

Does that seem reasonable? Lightmouse (talk) 12:09, 18 July 2011 (UTC)Reply

I think that [\d\.]+ is deliberate in these SI rules, so that the edit summary is more informative. I don't feel strongly about the trailing decimal; perhaps wait to see how many false positives get reported. -- John of Reading (talk) 15:02, 18 July 2011 (UTC)Reply

The edit summary doesn't show the regex so that can't be the reason. The [\d\.]+ regex was added by User:BillFlis. Perhaps we should ask him. Lightmouse (talk) 17:25, 18 July 2011 (UTC)Reply

The edit summary doesn't show the regex but can show the string that the regex matched. So if the article said "123.4 foo" and now says "123.4 Foo", then the edit summary will say that instead of just "4 foo -> 4 Foo". -- John of Reading (talk) 18:43, 18 July 2011 (UTC)Reply
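The difference in matched text can be seen directly. This is a minimal sketch in Python, assuming the middle alternative in the rule is the &nbsp; entity and that Python's re module behaves like the .NET engine here; "foobar" is the thread's placeholder unit, and the names ONE_OR_MORE, SINGLE_CHAR and matched_text are made up for illustration.

```python
import re

# With "+" the whole number is part of the match; without it only the last character is.
ONE_OR_MORE = re.compile(r"([\d\.]+(?:\s|&nbsp;|-)?)foobar")
SINGLE_CHAR = re.compile(r"([\d\.](?:\s|&nbsp;|-)?)foobar")


def matched_text(pattern, text):
    """Return the full text of the first match, or None if nothing matches."""
    m = pattern.search(text)
    return m.group(0) if m else None


print(matched_text(ONE_OR_MORE, "about 123.4 foobar"))
print(matched_text(SINGLE_CHAR, "about 123.4 foobar"))
```

Both patterns fire on the same text, but only the first one carries the whole number into the matched string.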

Ah. In that case, it should be:

  • ([\d,\.]*\d(?:\s|&nbsp;|-)?)foobar

Consider the very common format "12,000 foobar". Lightmouse (talk) 18:52, 18 July 2011 (UTC)Reply

My 2 cents: I think you all are on the right track as to improving the edit summary by showing as much as possible of the phrase that's being corrected. But, while the naked-decimal-point form "7. foobar" is generally deprecated (I haven't checked the style guide used here), I don't think that means we shouldn't fix the foobar part if that's needed. So we can get away with Lightmouse's last suggestion without requiring the decimal-point-covering digit:
    • ([\d,\.]+(?:\s|&nbsp;|-)?)foobar

--BillFlis (talk) 20:32, 20 July 2011 (UTC)Reply
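The three variants discussed in this thread can be compared on the sample strings mentioned above. This is a sketch in Python, assuming its re module behaves like the .NET engine for these patterns; the dictionary keys and the function name captured are made up for illustration.

```python
import re

SEP = r"(?:\s|&nbsp;|-)?"  # optional separator between the number and the unit

VARIANTS = {
    "original": re.compile(r"([\d\.]+" + SEP + r")foobar"),
    "lightmouse": re.compile(r"([\d,\.]*\d" + SEP + r")foobar"),
    "billflis": re.compile(r"([\d,\.]+" + SEP + r")foobar"),
}


def captured(name, text):
    """Return what group 1 of the named variant captures, or None if no match."""
    m = VARIANTS[name].search(text)
    return m.group(1) if m else None


for name in VARIANTS:
    print(name, repr(captured(name, "12,000 foobar")), repr(captured(name, "7. foobar")))
```

One thing this makes visible: because [\d,\.]+ does not require a digit, the last variant also matches a bare comma or full stop before the unit, as in "well, foobar".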
