Wikipedia talk:AutoWikiBrowser/Typos

This is an old revision of this page, as edited by BillFlis (talk | contribs) at 19:17, 9 April 2007 ("distictively" --> "districtively" ???). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.



Misspellings to be added

Should new misspellings go here or in the "Misspellings to be Added" section of the main project page? Regardless, here's about 90 that I've amassed. I'd add them myself, but some of those regexes are pretty complex and scare me. I've verified that all these aren't acceptable by dictionary.com and that there are at least 10 instances of each in Wikipedia. False positives haven't been checked for, however. And there are probably prefixes/suffixes that can be added to most of them.

(Can someone please add some of these? --Thiseye 07:02, 2 March 2007 (UTC))Reply

  • committe → committee
  • comsumption → consumption
  • confict → conflict
  • controvesy → controversy
  • depatment → department
  • detemine → determine
  • differenciate → differentiate
  • elligible → eligible
  • erronous → erroneous
  • girfriend → girlfriend
  • helicoptor → helicopter
  • highten → heighten - Updated Height Reedy Boy 21:24, 24 March 2007 (UTC)Reply
  • immedately → immediately
  • immensly → immensely
  • inpenetrable → impenetrable
  • intitution → institution
  • itslef → itself
  • jeapordy → jeopardy
  • likley → likely
  • liqour → liquor
  • literaly → literally
  • minsitry → ministry
  • mountian → mountain
  • newstands → newsstands
  • nobilty → nobility
  • oppenent → opponent
  • orginial → original
  • peform → perform
  • perfomance → performance
  • personna → persona
  • editted → edited
  • posibility → possibility
  • precip(a|ia)tion → precipitation
  • prepatory → preparatory
  • pricipal → principal
  • recruting → recruiting
  • reliquish → relinquish
  • reminicent → reminiscent
  • replacment → replacement
  • responed → responded
  • sectretary → secretary
  • signiture → signature
  • similarily → similarly
  • similiar → similar
  • unsheath → unsheathe
  • valiently → valiantly
  • wherupon → whereupon
  • wheter → whether
  • widly → widely

Added

Thiseye 16:09, 31 December 2006 (UTC)Reply

Reliable sources

Is dictionary.com a reliable source?--Andeh 06:04, 11 August 2006 (UTC)Reply

Nope. See here. alphaChimp laudare 06:19, 11 August 2006 (UTC)Reply
OK, what about Microsoft Word 2000's or higher dictionary?--Andeh 06:25, 11 August 2006 (UTC)Reply

This looks like a good source for misspellings: http://www.misspelled.com/common/a.htm --BillFlis 10:45, 27 August 2006 (UTC)Reply

Full stops, commas, colons, brackets and double spaces

I feel that the following mistakes are too common (especially in stubs) to ignore:

  • c denotes any alphanumeric character
  • s denotes a space character
Mistake   Correction   Suggested code
c.c       c.sc         <Typo find="\b(a-zA-Z).(a-zA-Z)\b" replace="$1. $2" />
cs.c      c.sc         <Typo find="\b(a-zA-Z) .(a-zA-Z)\b" replace="$1. $2" />
cs.sc     c.sc         <Typo find="\b(a-zA-Z) . (a-zA-Z)\b" replace="$1. $2" />
c,c       c,sc         <Typo find="\b(a-zA-Z),(a-zA-Z)\b" replace="$1, $2" />
cs,c      c,sc         <Typo find="\b(a-zA-Z) ,(a-zA-Z)\b" replace="$1, $2" />
cs,sc     c,sc         <Typo find="\b(a-zA-Z) , (a-zA-Z)\b" replace="$1, $2" />
c;c       c;sc         <Typo find="\b(a-zA-Z);(a-zA-Z)\b" replace="$1; $2" />
cs;c      c;sc         <Typo find="\b(a-zA-Z) ;(a-zA-Z)\b" replace="$1; $2" />
cs;sc     c;sc         <Typo find="\b(a-zA-Z) ; (a-zA-Z)\b" replace="$1; $2" />
c(c       cs(c         And so forth
c(sc      cs(c         And so forth
cs(sc     cs(c         And so forth
c)c       c)sc         And so forth
cs)c      c)sc         And so forth
cs)sc     c)sc         And so forth
ss        s            And so forth

Note: Suggested code is based on my preliminary understanding of the pattern of the working code at Wikipedia:AutoWikiBrowser/Typos, and I am very sure it is wrong and needs to be corrected.

Szhaider 15:39, 9 October 2006 (UTC)Reply

These are indeed common mistakes, but unfortunately, in my experience there are too many legitimate exceptions, such as ".NET". The other mistakes may not have so many exceptions, though. Martin 16:16, 9 October 2006 (UTC)Reply
Yeah, and what about U.S.A.? Or T.S. Eliot? Also, the semicolon is part of many HTML entities, like "&mdash;" etc., which will butt right up against letters.--BillFlis 02:11, 10 October 2006 (UTC)Reply
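
To make the objections above concrete, here is a minimal Python sketch. The three patterns are assumed corrections of the suggested code, not rules from the actual typo list: the character class is written with square brackets, the full stop is escaped, and the \b anchors are dropped (with \b on both sides, only one-letter words could match). The names RULES and apply_rule are hypothetical.

```python
import re

# Hypothetical corrected forms of three of the suggested rules.
# Keys follow the notation of the table: c = character, s = space.
RULES = {
    "c.c": (r"([a-zA-Z])\.([a-zA-Z])", r"\1. \2"),
    "cs.c": (r"([a-zA-Z]) \.([a-zA-Z])", r"\1. \2"),
    "c;c": (r"([a-zA-Z]);([a-zA-Z])", r"\1; \2"),
}


def apply_rule(name: str, text: str) -> str:
    """Apply one of the sketch rules to a piece of text."""
    find, replace = RULES[name]
    return re.sub(find, replace, text)


if __name__ == "__main__":
    # The intended fix: a missing space after a full stop.
    print(apply_rule("c.c", "It ended.Then it began"))
    # The exceptions raised in the replies.
    print(apply_rule("c.c", "the U.S.A. and T.S. Eliot"))
    print(apply_rule("cs.c", "built on the .NET framework"))
    print(apply_rule("c;c", "caf&eacute;s"))
```

The first call repairs the sentence as intended, while the other three damage the abbreviations, ".NET", and the HTML entity, which is the kind of false positive the replies describe.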

facilitate

The new entry for facilitate is not correct. It's changing facilitate to facilitatli. I think it should have $3 instead of $2. --Thiseye 00:44, 1 March 2007 (UTC)Reply

Thanks for reporting; fixed. -- intgr 00:47, 1 March 2007 (UTC)Reply

secretarty -> secretary

found in Marita Ulvskog. Jobjörn (Talk ° contribs) 01:21, 8 March 2007 (UTC)Reply

Added to existing "Secretary" entry.--BillFlis 22:33, 8 March 2007 (UTC)Reply

RETF oddities

I noticed something strange that could be a bug in AWB. I've noticed in several articles that if a typo is in wiki tags [[]], then RETF will not catch this. I assumed this was because it's not excluding the brackets as part of the word so it wasn't matching the regex. But then I noticed in the Akshay Pratap Singh article, that the FAR does catch typos within wiki tags. In this article, "politican" is misspelled. I had a FAR entry to correct this which I recently added to RETF. However, I noticed when I disabled the FAR entry, it would no longer be corrected. I updated the FAR regex to exactly that of the RETF regex, and still FAR would correct it, but RETF would not. --Thiseye 22:43, 11 March 2007 (UTC)Reply

I believe this has been discussed a few times over on the AWB talk pages, it has been setup like this purposely. There are reasons for doing it both ways, and i think we are looking into having it check more... Post it on the AWB talk page... Reedy Boy 17:55, 12 March 2007 (UTC)Reply

Not sure if anyone will see this...

I was wondering if the AWB could include the often misused words "reoccur", "reoccured", and "reoccuring". These are not actual words (contrary to popular assumption)! They should all be changed to "recur", "recurred", and "recurring". Mahalo. --Ali'i 20:44, 13 March 2007 (UTC)Reply

Oops, they already are included:

<Typo word="(Re(o)c/Re)currence" find="\b([Rr]eoc|[Oo]c|Re)curran(ces?|t|tly)\b" replace="$1curren$2" /> <Typo word="Recurr(ed/ing)" find="\b(R|r)ec(?:cur?|u)r(ed|ing|ent|ently)\b" replace="$1ecurr$2" />

Sorry about that. Thanks anyway. --Ali'i 20:47, 13 March 2007 (UTC)Reply
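
For anyone wanting to check what the two quoted rules do, here is a rough Python sketch. It assumes Python's re module behaves like AWB's regex engine for these patterns; the helper to_python (which converts $1-style references) and apply_rules are hypothetical names. As far as this check can tell, the rules fix spellings such as "recured" and "Recurrance" but leave "reoccuring" itself untouched, so other entries in the list at the time may have been responsible for that case.

```python
import re

# The two rules exactly as quoted above (find pattern, AWB-style replacement).
RULES = [
    (r"\b([Rr]eoc|[Oo]c|Re)curran(ces?|t|tly)\b", "$1curren$2"),
    (r"\b(R|r)ec(?:cur?|u)r(ed|ing|ent|ently)\b", "$1ecurr$2"),
]


def to_python(replacement: str) -> str:
    """Convert AWB-style $1 references to Python's \\g<1> form."""
    return re.sub(r"\$(\d+)", r"\\g<\1>", replacement)


def apply_rules(text: str) -> str:
    """Apply both quoted rules in order."""
    for find, replace in RULES:
        text = re.sub(find, to_python(replace), text)
    return text


if __name__ == "__main__":
    for word in ["occurrance", "Recurrance", "recured", "reccuring", "reoccuring"]:
        print(word, "->", apply_rules(word))
```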

Includeing -> Including

As above, suggest replacing includeing with including. Harryboyles 05:59, 17 March 2007 (UTC)Reply

Asian needs to be updated

There is a misspelling in Kai Chen, "asain"; the current rule accounts only for "aisian"....

Dependant vs. Dependent

It appears that "dependant" is acceptable in British English, esp. as a noun. If people concur, it should be removed from the typo list IMHO. —Wknight94 (talk) 15:21, 23 March 2007 (UTC)Reply

It's not just British. An American dictionary http://www.m-w.com/dictionary/dependant lists it too.--BillFlis 18:10, 23 March 2007 (UTC)Reply
So it should be removed, no? —Wknight94 (talk) 14:05, 24 March 2007 (UTC)Reply
It definitely needs to be removed. As a noun a dependant is a person looked after by another e.g. a father's dependants are his children (sorry for the approximate definition). Dependant may well be incorrectly used e.g. 'dependant on the weather ...' but can't be fixed this way. Rjwilmsi 19:19, 26 March 2007 (UTC)Reply
I removed it shortly after my last message. —Wknight94 (talk) 21:21, 26 March 2007 (UTC)Reply

Regex/CPU question

I know that we want to reduce the number of regexes to reduce the amount of CPU time used to process them all. I'm assuming this means that there is little to no CPU cost associated with adding a variant to an existing regex compared to adding a completely new entry. Should we avoid adding variants to an existing regex that don't occur too often, or does that matter?

Also, it seems we avoid "catching" the correct spelling within the regex. Is that the standard we should go by? And to what extent should we go to avoid that situation? I've seen some regexes that do catch the correct spelling, so should I try to rework these, or is this sometimes acceptable ("available" is an example). Further, should we avoid trying to catch certain variants of typos to avoid catching the correct spelling? Should we avoid adding a new entry to try to catch a variant to avoid catching the correct spelling ("Vancouver" is an example)? --Thiseye 18:28, 25 March 2007 (UTC)Reply

Combining regexes that catch missing "e" before "ly" suffix

I wanted to get others' thoughts on combining several regexes (and incorporating some new ones). The thing is that if we want to add other variants to these, we'd probably want to separate them out again.

<Typo word="(Accurate/Active/Affectionate/Alternate/Appropriate/(Ab/Re)solute/Collective/Consecutive/Desperate/Exclusive/Extensive/False/Large/Separate/Severe)ly" find="\b((A|a)(ccurat|ctiv|ffectionat|lternat|ppropriat)|([Aa]b|[Rr]e)solut|(C|c)o(llec|nsecu)tiv|(D|d)esperat|(E|e)x(clu|ten)siv|(F|f)als|(L|l)arg|(S|s)e(parat|ver))ly\b" replace="$1ely" />

--Thiseye 00:01, 26 March 2007 (UTC)Reply

I think this is a good idea, I have been using some regexes like this personally and they can work pretty well. Gaius Cornelius 00:05, 26 March 2007 (UTC)Reply
Good idea, but I have a suggestion. No English words end in "ivly" or "avly". This:
<Typo word="-(a/i)vely" find="(a|i)vly\b" replace="$1vely" />
catches your "-ively" words and over a thousand more. I went ahead and added this and a few others under New Additions; I'll let them cook for a while to see if any unforeseen problems arise before deleting any existing entries.--BillFlis 10:29, 26 March 2007 (UTC)Reply
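
The two rules in this thread can be tried side by side with a small Python sketch. It assumes Python's re module matches these patterns the same way AWB does; to_python and apply_rule are hypothetical helper names, and the sample words are chosen only for illustration.

```python
import re

# Combined rule proposed at the top of this thread.
COMBINED = (
    r"\b((A|a)(ccurat|ctiv|ffectionat|lternat|ppropriat)|([Aa]b|[Rr]e)solut"
    r"|(C|c)o(llec|nsecu)tiv|(D|d)esperat|(E|e)x(clu|ten)siv|(F|f)als"
    r"|(L|l)arg|(S|s)e(parat|ver))ly\b",
    "$1ely",
)

# General suffix rule suggested in the reply.
SUFFIX = (r"(a|i)vly\b", "$1vely")


def to_python(replacement: str) -> str:
    """Convert AWB-style $1 references to Python's \\g<1> form."""
    return re.sub(r"\$(\d+)", r"\\g<\1>", replacement)


def apply_rule(rule, text: str) -> str:
    """Apply a (find, replace) pair to the text."""
    find, replace = rule
    return re.sub(find, to_python(replace), text)


if __name__ == "__main__":
    for word in ["accuratly", "Largly", "collectivly", "accurately"]:
        print(word, "->", apply_rule(COMBINED, word))
    # "massivly" is not in the combined list but is caught by the suffix rule.
    for word in ["massivly", "collectivly", "actively"]:
        print(word, "->", apply_rule(SUFFIX, word))
```

The combined rule only touches the listed stems, whereas the suffix rule also picks up words nobody enumerated, which is both its appeal and the reason for letting it "cook" before removing existing entries.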

'infinate' fixed to 'infinit'

The typo correction ((In)De/In/Af)Finite fixes 'infinate' to 'infinit'. I'm not competent enough with regex to fix it. Rjwilmsi 19:16, 26 March 2007 (UTC)Reply

Fixed, but I had to take out the case of "infinity".--BillFlis 19:33, 26 March 2007 (UTC)Reply
Thanks. And another: ballon can't be corrected to balloon as 'ballon' exists in French and is quoted e.g. Ballon D'or in the Roberto Baggio article.
That sounds questionable since this is the English Wikipedia. That's one that would need to be rejected manually by the WP:AWB user but shouldn't be removed from the typo list. (My opinion anyway). —Wknight94 (talk) 21:21, 26 March 2007 (UTC)Reply
Yes, but if you search for "ballon", you get not just Ballon D'Or but a host of articles with that word in the title. On the other hand, we could certainly keep the corrections of "balloning", "ballonist", etc. On the third hand, there aren't a lot of these errors.--BillFlis 10:24, 28 March 2007 (UTC)Reply

'responsable(s)' fix needs to be removed

Responsable(s) exists in French so needs to be removed from the "(Ir)Responsible" correction. Rjwilmsi 20:27, 27 March 2007 (UTC)Reply

tPA is corrected to TPa but it's correct in articles such as Serpin. Rjwilmsi 20:37, 27 March 2007 (UTC)Reply

Sorry to push back again (as I did above) but this is the English Wikipedia. Shouldn't French words be occurring very very rarely? To me, that's better to cover as an exception by the WP:AWB user (which is what this list is for). —Wknight94 (talk) 22:03, 27 March 2007 (UTC)Reply
While I tend to agree, the RETF project page does state that the "lofty goal of RETF is to be completely automatic. That is, 100% accuracy." So something's got to give. We can't really have it both ways. I have a couple of ideas that I'm going to propose soon to alleviate this. --Thiseye 04:27, 28 March 2007 (UTC)Reply
From that goal, anytime someone runs across any change in WP:AWB that they need to roll back, they should remove it from the list, right? I'll do that then. Thanks. —Wknight94 (talk) 11:21, 28 March 2007 (UTC)Reply

For phrases in a language other than English, use {{lang}} for the phrase, for example {{lang|fr|Responsable}}, where the second parameter is the ISO 639 code. It stops AWB changing the text, but I'm not sure about WikEd (if not, it probably should). mattbr 10:53, 28 March 2007 (UTC)Reply

Thanks. That's a really useful tip I didn't know about. I'll probably go through and tag all French 'responsable's like that. Rjwilmsi 17:25, 28 March 2007 (UTC)Reply
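
The effect of the {{lang}} tip can be illustrated with a short Python sketch. This is not how AWB implements it; it is one assumed way to skip template text, and both the responsable rule and the function name fix_outside_lang are hypothetical.

```python
import re

# Matches a simple, non-nested {{lang|code|text}} template.
LANG_TEMPLATE = re.compile(r"(\{\{lang\|[^{}]*\}\})")

# Hypothetical typo rule of the kind discussed in this thread.
FIND = r"\b([Rr])esponsable(s?)\b"
REPLACE = r"\1esponsible\2"


def fix_outside_lang(text: str) -> str:
    """Apply the typo rule everywhere except inside {{lang}} templates."""
    parts = LANG_TEMPLATE.split(text)
    # With one capturing group, split() puts the templates at odd indices.
    for i in range(0, len(parts), 2):
        parts[i] = re.sub(FIND, REPLACE, parts[i])
    return "".join(parts)


if __name__ == "__main__":
    sample = "He was responsable; the {{lang|fr|responsable}} signed."
    print(fix_outside_lang(sample))
```

Only the untagged English misspelling is changed; the tagged French word is left alone.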

Typica

Typica exists (in English!) but is corrected to Typical. Wasn't sure how to fix the regex myself. Rjwilmsi 07:03, 28 March 2007 (UTC)Reply

I have removed the regex doing this ((A)Typically). Other changes in the removed regex appear to be already covered in (A)Typical, but someone please update it if not. Thanks, mattbr 10:53, 28 March 2007 (UTC)Reply

Another: In (fact/the/a/an) corrects the name Ina

Removed "ina" and "inan" from regex because of name false positives. I'd also be concerned "inan" would be a typo of "inane". --Thiseye 01:24, 29 March 2007 (UTC)Reply

Nation name capitalization

What do folks think about taking out some of the capitalizations since there are so many animal species that use lower-case versions of words that would ordinarily be upper-case (see this edit for an example of the mistakes that are often made). —Wknight94 (talk) 22:03, 27 March 2007 (UTC)Reply

"gum arabic" too. -- Euchiasmus 20:17, 7 April 2007 (UTC)Reply

Millenium Hall

Proposing to remove "Millennium_" since there is a well-known 18th century book, Millenium Hall. —Wknight94 (talk) 00:06, 30 March 2007 (UTC)Reply

There's a band called 'Agression', so the 'agression' -> aggression fix needs to be edited. Rjwilmsi 06:24, 31 March 2007 (UTC)Reply

Official

There is currently an entry for Official, but I'm not sure if it corrects "Offical" --> "Official". Can someone either please add this or let me know that it is in there already? --After Midnight 0001 05:09, 1 April 2007 (UTC)Reply

I added that case, as well as a couple more word endings.--BillFlis 11:17, 1 April 2007 (UTC)Reply

.coms

I couldn't get negative lookahead to work properly on the .com's (OK, brainfart Harvard would be .edu anyway). Try 1 and Try 2. I'm trying to get it to ignore URLs and emails (ex NSAKEY). Can somebody take a peek? I was reloading the file with click/unclick of the RETF option. — RevRagnarok Talk Contrib 17:40, 1 April 2007 (UTC)Reply

AWB ignores external http: links (and from the next release https:, ftp: and mailto:), so these shouldn't be a problem. In regular text, I can't think of a situation where you would write a web or email address outside a link. Could you point me to where you are having the problem? You can try out a regex using the find-and-replace option in AWB, and I don't think clicking/unclicking the checkbox reloads the list, but you can from the last option on the 'General' menu. mattbr 18:12, 1 April 2007 (UTC)Reply
The developers told me click/unclick reloads and that seems to work. The test article is listed above - NSAKEY has the public key for an email @microsoft.com. — RevRagnarok Talk Contrib 18:18, 1 April 2007 (UTC)Reply
Sorry missed that. Wrap the text in <pre></pre> rather than using a space at the beginning. AWB will then ignore them. mattbr 18:30, 1 April 2007 (UTC)Reply
That fixes this case, but on a side note, I'd like to know why the regex didn't work. — RevRagnarok Talk Contrib 18:35, 1 April 2007 (UTC)Reply
Ticking and unticking the box just enables and disables it; it doesn't refresh the typo list. I've just committed a change so that if you use the option on the general menu, it will reload them. Reedy Boy 18:41, 1 April 2007 (UTC)Reply
Two weeks ago you said it did reload the typo page. Guess there was a misunderstanding somewhere. Either way, I <pre> tagged the one spot anyway per Matt. — RevRagnarok Talk Contrib 18:52, 1 April 2007 (UTC)Reply
Sorry about that, I thought (as it was a bit of a quick fix) that it did. When I looked over the code just now, I realised that unless the declaration for the typos was blank (i.e. = null), it wouldn't load them. I've now put a parameter on that so that you can force a reload, and that works. Sorry for the confusion/lack of complete attention on my part, and for the next release, it definitely has been sorted!! Reedy Boy 19:01, 1 April 2007 (UTC)Reply

Re the regex, sorry bit of a regex novice. Can anyone else help? mattbr 18:50, 1 April 2007 (UTC)Reply

august > August

Since august is a word, should this correction be removed, or improved to fix <number> august > <number> August only? Rjwilmsi 17:53, 3 April 2007 (UTC)Reply

Good point. Probably, but I was having some problems with lookahead in the past (see above). — RevRagnarok Talk Contrib 18:10, 3 April 2007 (UTC)Reply
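
One possible shape for the narrower fix, sketched in Python rather than as a tested AWB rule: capitalise "august" only when a one- or two-digit day number precedes it. The pattern and the name fix_august are assumptions for illustration, and no lookahead is needed for this form.

```python
import re

# A day number (1 or 2 digits), one space, then lowercase "august".
DAY_AUGUST = re.compile(r"\b(\d{1,2}) august\b")


def fix_august(text: str) -> str:
    """Capitalise the month only in '<number> august'."""
    return DAY_AUGUST.sub(r"\1 August", text)


if __name__ == "__main__":
    print(fix_august("He arrived on 12 august 1944."))
    print(fix_august("It was an august assembly."))
```

The adjective in the second sentence is left alone because no number precedes it. Forms such as "august 12" would need a second pattern.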

discribed -> described

As in [1]? Jobjörn (Talk ° contribs) 12:06, 4 April 2007 (UTC)Reply

Added to "Describe", which is now "(De/Pre)scribe".--BillFlis 19:49, 4 April 2007 (UTC)Reply

strengtened > strengthened

as here. Jobjörn (Talk ° contribs) 14:16, 4 April 2007 (UTC)Reply

Added to "Strength".--BillFlis 19:43, 4 April 2007 (UTC)Reply

"significatly" --> "significately" ???

The rule <Typo word="-(b/c/d/g/i/m/s/t/v)ately_" find="([bcdgimstv])atly\b" replace="$1ately" /> converts significatly to significately.

Surely that can't be what the inventor intended?

--Euchiasmus 20:13, 7 April 2007 (UTC)Reply

Yeah, that needs to go away. —Wknight94 (talk) 21:19, 7 April 2007 (UTC)Reply
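
The behaviour reported here is easy to reproduce with the rule exactly as quoted. The Python sketch below assumes Python's re module treats the pattern as AWB does; to_python and apply_general are hypothetical helper names.

```python
import re

# The general rule as quoted at the top of this thread.
FIND = r"([bcdgimstv])atly\b"
REPLACE = "$1ately"


def to_python(replacement: str) -> str:
    """Convert AWB-style $1 references to Python's \\g<1> form."""
    return re.sub(r"\$(\d+)", r"\\g<\1>", replacement)


def apply_general(text: str) -> str:
    """Apply the general -ately rule to the text."""
    return re.sub(FIND, to_python(REPLACE), text)


if __name__ == "__main__":
    for word in ["immediatly", "ultimatly", "significatly"]:
        print(word, "->", apply_general(word))
```

The first two words are repaired as intended, while "significatly" becomes "significately" because the rule cannot tell a missing "e" from a missing "n".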

"distictively" --> "districtively" ???

The word "districtively" doesn't even exist.

Let's have rules that rectify a recognised and bounded set of incorrect words, rather than trying to make the rules too general. What do you think? Euchiasmus 20:30, 7 April 2007 (UTC)Reply

Agreed as your other significatly example demonstrates. —Wknight94 (talk) 21:19, 7 April 2007 (UTC)Reply
As the "inventor" of these attempts at general rules, may I ask, what is the harm in replacing one type of error by another? If you did not have the general rule, you would still leave an error. At least its presence in this case alerted you that we need separate rules for these exceptional misspellings. I'll add a rule for "(Di/In)stinctive" to handle your clever discovery!--BillFlis 19:11, 9 April 2007 (UTC)Reply
It turns out that there was an existing rule to handle "distictively" but it was down in the D's, behind the general rules. I've now moved the general rules to the end, to allow the special cases to be handled first. I also modified the previous "Distinction" to "(Di/In)stinctive".--BillFlis 19:17, 9 April 2007 (UTC)Reply
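
Why moving the general rules to the end helps can be seen in a small Python sketch. The rule behind "distictively" is not quoted in this thread, so the sketch uses the -ately rule quoted in the "significatly" thread as the general rule and pairs it with a hypothetical special-case rule; to_python and apply_in_order are hypothetical helper names.

```python
import re

# General rule as quoted in the "significatly" thread.
GENERAL = (r"([bcdgimstv])atly\b", "$1ately")
# Hypothetical special-case rule, made up for this illustration.
SPECIFIC = (r"\b([Ss])ignificatly\b", "$1ignificantly")


def to_python(replacement: str) -> str:
    """Convert AWB-style $1 references to Python's \\g<1> form."""
    return re.sub(r"\$(\d+)", r"\\g<\1>", replacement)


def apply_in_order(rules, text: str) -> str:
    """Apply each (find, replace) rule in list order."""
    for find, replace in rules:
        text = re.sub(find, to_python(replace), text)
    return text


if __name__ == "__main__":
    print(apply_in_order([SPECIFIC, GENERAL], "significatly"))
    print(apply_in_order([GENERAL, SPECIFIC], "significatly"))
```

With the special case first, the word is fixed before the general rule sees it. With the general rule first, the word is rewritten into a form the special case no longer matches.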