Wikipedia talk:AutoWikiBrowser/Typos

This is an old revision of this page, as edited by Certes (talk | contribs) at 21:45, 26 September 2020 (21th: typo for other number). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.



childrens'

This gets corrected to "children's'"... could someone more adept at regexes make it eat the extra apostrophe? Alistair1978 (talk) 18:09, 9 August 2020 (UTC)

I fixed those I found, but I haven't changed the regex and cases are still being added. See also women's' and mens's (and even one men's's). Certes (talk) 13:34, 25 September 2020 (UTC)

enmasse

Hi, AWB just suggested to me that I change "enmasse" to "emmasse". Can I suggest that "en masse" would be a better call? ϢereSpielChequers 22:22, 24 August 2020 (UTC)

WereSpielChequers, as far as I can tell there were only three instances of this that weren't some sort of proper name, which have now all been fixed. Ionmars10 (talk) 22:52, 24 August 2020 (UTC)
Thanks, but that's not the issue. There will be a steady trickle of these things in the future, and currently the typo rules suggest the wrong change. ϢereSpielChequers 22:54, 24 August 2020 (UTC)

Excersice

Should excersice -> exercise be added?

There are currently six instances.

Profesor

Apparently Profesor is correct in Polish, and I suspect in some other languages. We currently have a couple of thousand articles with profesor. I suspect this means too many false positives for this to be useful in AWB. Should this typofix be disabled? ϢereSpielChequers 17:55, 5 September 2020 (UTC)

wikt:profesor is valid in several languages, notably Spanish. A random selection that I checked were correct use of another language rather than typos. Perhaps it's a job for a one-off manual fix, after excluding phrases which indicate correct use such as el profesor. Certes (talk) 18:19, 5 September 2020 (UTC)
If you're going to exclude phrases, also exclude any use of Profesor followed by a word beginning with a capital letter.--Srleffler (talk) 21:43, 5 September 2020 (UTC)
I just checked through those, and fixed the 7 out of 677 which were errors. Certes (talk) 22:41, 5 September 2020 (UTC)
I've also checked the 508 not followed by a capital, and fixed 7 out of 508. The rest seem correct, though I left four borderline cases (1 2 3 4) on the assumption that the previous editor has a clue. The discrepancy between my total and the original couple of thousand is because I required profesor (any capitalisation) in the source; I excluded profesör etc. and transclusion via {{Japanese Club Football}}, {{Televisa telenovelas 1970s}}, etc. This has been a useful check, but I fear that automating it would cause more false positives than improvements. Certes (talk) 12:29, 6 September 2020 (UTC)
The problem is that it currently is in the AWB typo fixes. My argument is that it should come out of them. ϢereSpielChequers 15:04, 7 September 2020 (UTC)
Then I agree with you. It may have done a good job in the past but seems likely to do more harm than good in future. Certes (talk) 15:42, 7 September 2020 (UTC)
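
For anyone repeating the one-off manual check described above, the two exclusions (the phrase "el profesor", and Profesor followed by a capitalised word) could be sketched roughly as below. This is only an illustration in Python, not an AWB rule: the function name `likely_typos` is hypothetical, and every remaining hit would still need human review, given the false-positive rate reported in this thread.

```python
import re

# Matches "profesor" or "Profesor" as a whole word (so not "profesör").
PROFESOR = re.compile(r"\b[Pp]rofesor\b")

def likely_typos(text):
    """Return start offsets of 'profesor' that survive both exclusions.

    Hypothetical helper for a manual review list, not an automated fix.
    """
    hits = []
    for m in PROFESOR.finditer(text):
        before = text[:m.start()]
        after = text[m.end():]
        # Exclusion 1: the Spanish phrase "el profesor" (any capitalisation of "el").
        if re.search(r"\bel\s$", before, re.IGNORECASE):
            continue
        # Exclusion 2: followed by a word beginning with a capital letter,
        # which usually indicates a title before a name.
        nxt = re.match(r"\s+(\w)", after)
        if nxt and nxt.group(1).isupper():
            continue
        hits.append(m.start())
    return hits
```

Using `str.isupper()` rather than `[A-Z]` means names such as Łukasz also count as capitalised.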

More issues

Probably since the introduction of the "efficiency" changes. Now, "long term" is "fixed" (to include the hyphen) to "long-m". I am a beginner at regex, so just reporting this for now. This is on North Devon Railway while doing a standard typo-fixing run. Thanks! After refreshing the typos, the problem no longer exists. Dawnseeker2000 17:22, 6 September 2020 (UTC)

That one's been fixed, along with "vice president" and "on date". See also WT:AWB#Institute. Certes (talk) 18:02, 6 September 2020 (UTC)
I've checked for other possibly problematic recent changes, and pre-emptively fixed the only ones I found: "east–west" and "west–east" (the second rule of each name). Of course, there may be other bugs both new and old which didn't match the pattern I was seeking. Certes (talk) 18:17, 6 September 2020 (UTC)
Thank you for the quick response and explanations. Dawnseeker2000 20:32, 6 September 2020 (UTC)
I think this is now all cleaned up. (I fixed a ve-president and a couple of long-m and short-m relationships.) However, there is a small risk that someone who opened AWB on Saturday and has not reloaded the typo list since will introduce other errors, so it's worth another check later. The institute one is awkward to check for, as ie is common (though often wrong) in other contexts. Certes (talk) 15:36, 7 September 2020 (UTC)

aberannt

AWB has just suggested that I "improve" an article by changing "aberannt" to "aberrannt". Can this be changed to "aberrant", please? ϢereSpielChequers 15:06, 7 September 2020 (UTC)

Repetoire

I've just come across a problem: this regex:

<Typo word="Repertoire" find="\b([rR])ep[eir]to(?:ires?|r(?:i(?:al|es)|y))\b" replace="$1eperto$2"/>

converts Repetoire to Reperto$2 but I can't see what's wrong with it. Colonies Chris (talk) 12:39, 9 September 2020 (UTC)

@Colonies Chris: There is only one capturing group, ([rR]), so only $1 is set. I think the first ?: needs to be removed, so $2 can be set to "ire". Or it may be better to replace the first ?: by ?= to make it a lookahead, and remove the $2. Certes (talk) 12:54, 9 September 2020 (UTC)
I've implemented the first of these fixes, since it will give a better edit summary than the other. -- John of Reading (talk) 13:16, 9 September 2020 (UTC)
Thanks, John of Reading. I found two more cases which may need fixing:
  • <Typo word="(Dis)Colour-" find="\b([cC]|[dD]isc)olou(?:[a-ln-qs-y][a-z]*)\b" replace="$1olour$2"/>
  • <Typo word="ma(d/k)e" find="\bam([dk](?:es?|ing))\b" replace="ma$1$2"/>
I think that's all of them. Certes (talk) 13:20, 9 September 2020 (UTC)
Hopefully fixed now. -- John of Reading (talk) 13:28, 9 September 2020 (UTC)
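
The kind of bug discussed in this thread (a replacement that refers to $2 when the find pattern has only one capturing group) can be checked mechanically. The sketch below uses Python's `re` module rather than the .NET engine that AWB uses, so it is only an approximation: the helper names `unset_groups` and `apply_rule` are hypothetical, `$n` is translated to Python's `\g<n>` form, and named references such as `${name}` are not handled. The "fixed" pattern is the first fix suggested above (removing the first `?:`).

```python
import re

def unset_groups(find, replace):
    """List the $n references in `replace` that `find` has no capturing group for."""
    available = re.compile(find).groups  # number of capturing groups in the pattern
    referenced = {int(n) for n in re.findall(r"\$(\d+)", replace)}
    return sorted(n for n in referenced if n > available)

def apply_rule(find, replace, text):
    """Apply a typo rule, translating $n references to Python's \\g<n> syntax."""
    py_replace = re.sub(r"\$(\d+)", r"\\g<\1>", replace)
    return re.sub(find, py_replace, text)

# The rule as originally reported: the suffix group is non-capturing.
broken = r"\b([rR])ep[eir]to(?:ires?|r(?:i(?:al|es)|y))\b"
# With the first ?: removed, the suffix becomes capturing group 2.
fixed = r"\b([rR])ep[eir]to(ires?|r(?:i(?:al|es)|y))\b"

print(unset_groups(broken, "$1eperto$2"))             # [2]
print(unset_groups(fixed, "$1eperto$2"))              # []
print(apply_rule(fixed, "$1eperto$2", "Repetoire"))   # Repertoire
```

The same check flags the two other rules listed above as written there, since each has one capturing group but refers to $2.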

120hz → 120Hz → 120 Hz

AWB first corrected a typo, 120hz, to 120Hz. When I run AWB on that particular page again, it corrected it again, from 120Hz to 120 Hz (by inserting &nbsp; between 120 and Hz). Could you fix it so that from now on ###hz would be corrected to ###&nbsp;Hz straight away? Zarex (talk) 22:00, 9 September 2020 (UTC)
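
A single-pass version of this request might look something like the sketch below. It is written in Python purely to illustrate the idea and says nothing about how the two existing steps are actually implemented in AWB; the pattern and the function name `fix_hertz` are assumptions, and prefixed units such as kHz or MHz are deliberately left out.

```python
import re

# Digits immediately followed by "hz" in any capitalisation, as a whole token.
HERTZ = re.compile(r"\b(\d+)[hH][zZ]\b")

def fix_hertz(text):
    """Rewrite e.g. '120hz' or '120Hz' as '120&nbsp;Hz' in one pass."""
    return HERTZ.sub(r"\1&nbsp;Hz", text)

print(fix_hertz("a 120hz display"))   # a 120&nbsp;Hz display
```

Because the output no longer has the digits directly adjacent to "Hz", running the function a second time leaves the text unchanged.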

21th

Does the team think that correcting text such as "21th" would be safe and useful? I was thinking of something like (\d*[02-9]1)th → $1st, (\d*[02-9]2)th → $1nd and (\d*[02-9]3)th → $1rd. (Omitted for clarity: \b at both ends, and making \d*[02-9] a lookbehind for efficiency.) Certes (talk) 21:31, 22 September 2020 (UTC)
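
For illustration, the three proposed rules can be tried out in Python as below, with the \b anchors added at both ends as the comment says. This is only a sketch: the lookbehind optimisation is omitted, Python's `\1` stands in for `$1`, and the function name `fix_ordinals` is hypothetical.

```python
import re

# One (pattern, replacement) pair per proposed rule.
RULES = [
    (re.compile(r"\b(\d*[02-9]1)th\b"), r"\1st"),
    (re.compile(r"\b(\d*[02-9]2)th\b"), r"\1nd"),
    (re.compile(r"\b(\d*[02-9]3)th\b"), r"\1rd"),
]

def fix_ordinals(text):
    """Apply the three ordinal rules in turn."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text

print(fix_ordinals("the 21th, 102th and 33th"))   # the 21st, 102nd and 33rd
print(fix_ordinals("11th, 12th, 13th, 111th"))    # unchanged
```

The [02-9] before the final digit is what leaves 11th, 12th, 13th and 111th alone. It also means that a bare 1th, 2th or 3th is not matched by these rules.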

1th, 2th and 3th are also disappointingly common. I suppose we should consider abominations like 2/3ths and 3thly, and limit the lookbehind size. That may give something like 1th(ly|s)?\b(?<=\b(?:\d{0,9}[02-9])?1th(?:ly|s)?) → 1st$1, etc. but I'll await advice from a more experienced typo fixer before attempting to introduce anything. Certes (talk) 22:50, 22 September 2020 (UTC)
I'm currently wading through some of these. The false positives include some UK postcodes and an abbreviation for Second Thessalonians. But the military and sports ones are likely typos. Not sure if there are some safe rules we can put into AWB due to false positives. Don't you love Wikipedia! ϢereSpielChequers 16:42, 26 September 2020 (UTC)
Thank you. I was going to look at these next, but I think there are thousands of errors even once we eliminate FPs. I was limiting to lower case th, which should weed out the postcodes and biblical references. There are a few valid uses such as the John A. Wilson Building on 13 1/2th Street, quotations from erroneous sources ("Siege enters 21th day" – Daily Slapdash) and some needing non-standard correction (1/2th finals of the 2016 Moroccan Throne Cup) but most lowercase uses encased in \b seem safe to correct. Certes (talk) 17:45, 26 September 2020 (UTC)
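
A rough Python equivalent of the extended idea (bare 1th/2th/3th, the optional "ly" or "s" suffix, and lowercase th only) is sketched below. Python's `re` does not accept the variable-length lookbehind used in the pattern above, so this version captures the leading digits in an ordinary group instead; that is a substitution made for the sake of a runnable example, not the proposed AWB rule. The names `ORDINAL`, `SUFFIX` and `fix_ordinals_extended` are hypothetical.

```python
import re

SUFFIX = {"1": "st", "2": "nd", "3": "rd"}

# Group 1: the whole number (at most nine digits plus a non-1 digit before the final digit).
# Group 2: the final digit, which selects the correct suffix.
# Group 3: an optional trailing "ly" or "s", as in "3thly" or "2/3ths".
# The match is case-sensitive, so "2TH" and "2Th" are left alone.
ORDINAL = re.compile(r"\b((?:\d{0,9}[02-9])?([123]))th(ly|s)?\b")

def fix_ordinals_extended(text):
    def repl(m):
        return m.group(1) + SUFFIX[m.group(2)] + (m.group(3) or "")
    return ORDINAL.sub(repl, text)

print(fix_ordinals_extended("1th, 21th, 2/3ths, 3thly"))   # 1st, 21st, 2/3rds, 3rdly
print(fix_ordinals_extended("BN1 2TH and 13th"))           # unchanged
```

The sketch still rewrites "13 1/2th Street" as "13 1/2nd Street", so the valid and non-standard cases mentioned above would continue to need a human check.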

Another question: should we assume that (for example) 3th means 3rd, or is there a significant risk that it is a typo for 4th, 13th etc? An editor making unrelated bulk edits may not have time to check sources every time an unexpected typo fix appears. Certes (talk) 21:45, 26 September 2020 (UTC)
