Wikipedia talk:AutoWikiBrowser/Typos
This page has archives. Sections older than 20 days may be automatically archived by Lowercase sigmabot III.
Proposal: List of developers
I think it would be a good idea to have an official "Project Template" on the main Typos page. I have the old project template here: WP:RETF
That way everyone who contributes gets "official" recognition.
I'm not sure how we would sort the developers list. Either by seniority (I left so I might be on the bottom) or alphabetically.
Ideas/comments/concern/disdain?--mboverload@ 02:01, 9 July 2008 (UTC)
- A list of regular maintainers of the list might be useful, so that somebody with an urgent issue would have a point of contact. For regular stuff there's the talk page.
- I don't know how we would sort such a list – perhaps the AWB statistics reports info on users fixing typos? Rjwilmsi 06:42, 9 July 2008 (UTC)
- Tis a good idea.. Just as long as you either put it on the page that is transcluded for information already, or on a new page =) and transclude that! —Reedy 11:05, 11 July 2008 (UTC)
Cool. Thinking about it...who ARE the developers now? Right now I only know of me, Rjwilmsi, and Reedy. Anyone else? --mboverload@ 18:47, 11 July 2008 (UTC)
- Page history? ;p - BillFlis and Rjwilmsi did the majority of the development when you were away... Hmm. As for the Actual AWB developers, MaxSem and I are doing 99% of the work that is being done to AWB, Kingboyk is too busy with other things to really contribute atm —Reedy 08:21, 29 July 2008 (UTC)
Triple letters
I have removed the triple letter RegEx temporarily and it is pasted here:
<Typo word="Triple letters" find="(?!\b(?:Eisschnelllauf|Killlai|(?:Pya|G|g)rrrl?|[Rr]sssf|[Oo]ooh|[A-Za-z]+([a-z])\1\1\1[a-z]*|[a-fw]+)\b)\b([A-Za-z]+)([a-gj-wyz])\3\3([a-z]+)\b" replace="$2$3$3$4" />
The reason I removed it is because, in spite of the great work that went into building it, I have not come across anything that it has fixed properly after around 1000 randomized edits. Could someone explain this one to me? --mboverload@ 22:57, 27 July 2008 (UTC)
- Are you saying there are too many false positives? I did run it against a database dump last month, so that might explain why it doesn't make many fixes at the moment. Rjwilmsi 06:42, 28 July 2008 (UTC)
- Ah, ok. Thanks for doing all that work! I'm just wondering if there are too many false positives? I basically think of you as the lead developer so let me know what you think. (See the typos page - we're a project now with your name highlighted in the dev list) --mboverload@ 14:56, 28 July 2008 (UTC)
- False positives seemed to be not that many once the exceptions above were included, and remaining ones were usually for foreign words/phrases which needed to be tagged as {{lang|de|worrrd}} etc. What false positives were you getting? There's no need to remove this simply if there are currently no hits – they are sure to build up again. Rjwilmsi 11:31, 29 July 2008 (UTC)
- Dear god, I've been looking for that template - no one on IRC seems to know about it! Thank you Rjwilmsi! Can I call you Rj if I'm lazy? --mboverload@ 02:06, 31 July 2008 (UTC)
Per Rjwilmsi I have readded the regular expression. --mboverload@ 20:53, 31 July 2008 (UTC)
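For anyone wanting to see what the triple-letter rule does, here is a minimal sketch in Python. It assumes Python's `re` module treats this pattern the same way as AWB's .NET engine, which has not been checked against AWB itself; the replacement `$2$3$3$4` is written as `\2\3\3\4` in Python syntax, and the function name `fix_triple_letters` is made up for illustration.

```python
import re

# The rule quoted above, copied verbatim from the find="..." attribute.
TRIPLE_LETTERS = re.compile(
    r"(?!\b(?:Eisschnelllauf|Killlai|(?:Pya|G|g)rrrl?|[Rr]sssf|[Oo]ooh"
    r"|[A-Za-z]+([a-z])\1\1\1[a-z]*|[a-fw]+)\b)"
    r"\b([A-Za-z]+)([a-gj-wyz])\3\3([a-z]+)\b"
)

# .NET "$2$3$3$4" becomes "\2\3\3\4" in Python replacement syntax.
REPLACEMENT = r"\2\3\3\4"


def fix_triple_letters(text: str) -> str:
    """Collapse a tripled letter to a doubled one, honouring the rule's exceptions."""
    return TRIPLE_LETTERS.sub(REPLACEMENT, text)


if __name__ == "__main__":
    for sample in ["the committtee met", "Grrrl", "Eisschnelllauf"]:
        print(sample, "->", fix_triple_letters(sample))
```

The leading negative lookahead is what protects the listed exceptions: when the whole word matches one of them, the main pattern is never tried at that position.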
TypoScan Announcement
From now on I will be scanning every database extract against the entire Typo list. In the future we will be able to "assign" a section of the 'pedia with known typos to an editor and see a real, tangible benefit. The expected size of this list is projected to be over 100,000 articles, or around 4.5% of all articles on Wikipedia.
Once we go through the list we can start recording a blacklist of articles that should not be checked. Eventually this number will be brought down by the information about false-positives.
Technical details
This is EXTREMELY SLOW GOING. At 17 gigabytes of pure text the database is MASSIVE. In addition, our ever expanding typo list needs to be checked against EVERY ARTICLE in Wikipedia. Over 2.4 MILLION! At max speed my current computer will process this in about 3-5 days.
My current limiting factor is the database scanning software and my CPU. The database scanner is not built for dual core systems and thus only uses 50% of my computer's potential.
Amount of memory is not a problem. The DB scanner only takes about 400 megabytes. It's the speed of the memory.
If my hard drive then becomes the problem I will move the database onto my 10,000 RPM SATA system drive.
Current system
- CPU - Intel Core 2 Duo E6600 (Conroe) @2400 MHz
- Motherboard - MSI MS-7350 | nForce 650i SLI
- Memory - 4 gigabytes of DDR2 PC-5400 memory @333MHz
Hardware updates
In order to better support this new endeavor I am going to be upgrading my computer's hardware.
- CPU - I will be buying the fastest CPU that I can find that doesn't cost 1000 dollars
- Memory - I will upgrade my computer to DDR2 PC-6400 from DDR2 PC-5400
- Overclocking - My computer's entire system was built to be overclocked. I anticipate even further gains in speed
--mboverload@ 07:43, 29 July 2008 (UTC)
- I'm not sure how you could really thread off something reading from a file. I wonder if it's worth looking at a way of using the DBScanner to run against a MySQL instance or similar, so the file has been loaded back into a database (obviously it would have to be a local mirror, not a WP one, to save bandwidth). Overclocking your CPU will probably help cut processing time, and the faster RAM should help. I would also move it to your 10k rpm drive; that's 33% faster rotation, so less seek time, etc. —Reedy 08:27, 29 July 2008 (UTC)
- Can't we just find a handful of users to scan a portion of the database dump each, if we all download the same one? I assume the list of articles dump is in the same order as the articles-list dump, then we can just start from article x?
- I did try this myself a couple of months ago (March db dump, ~65,000 hits) but gave up due to there being so many false positives for foreign words and Latin/scientific names. A great idea if we get it right though. Rjwilmsi 11:07, 29 July 2008 (UTC)
- I don't think that 10K RPM-drives will help. The main slowdown is running all those shiny regexes, so you don't need much raw HDD read speed, and if your file system isn't deadly fragmented, you don't need a fast seek time either. Probably, we could improve speed by making it parallel, but CPU will still be the main dependency. MaxSem(Han shot first!) 16:11, 29 July 2008 (UTC)
- If an article-specific exception list is implemented, along with the long-requested "Prune list" option, it will become possible to spellcheck very long lists, even online. Sure, the first run will be slow, because you will need to mark those foreign and made-up words as exceptions. But then... Just imagine spell-fixing the whole of en.Wikipedia in a few hours (of human time; how long the computer works in the background doesn't really matter). TestPilottalk to me! 08:50, 30 July 2008 (UTC)
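The "start from article x" idea above can be sketched roughly as follows. This is only an illustration in Python, not how the AWB database scanner works: it assumes the articles are already available as a list of strings in dump order (reading the real dump is not shown), and `split_ranges` and `scan_slice` are hypothetical helper names.

```python
import re


def split_ranges(total, parts):
    """Split range(total) into `parts` contiguous (start, stop) slices.

    Slice sizes differ by at most one, so each volunteer (or each CPU
    core) gets a near-equal share of the dump.
    """
    base, extra = divmod(total, parts)
    ranges = []
    start = 0
    for i in range(parts):
        stop = start + base + (1 if i < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


def scan_slice(articles, start, stop, rules):
    """Return the indices in [start, stop) where at least one rule matches."""
    compiled = [re.compile(find) for find in rules]
    return [
        i for i in range(start, stop)
        if any(pattern.search(articles[i]) for pattern in compiled)
    ]


if __name__ == "__main__":
    # Four workers sharing roughly 2.4 million articles.
    print(split_ranges(2_400_000, 4))
```

Because the regex work is CPU-bound, each slice could in principle be handed to a separate process or a separate volunteer; the slices are independent as long as everyone uses the same dump.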
Reset =(
During a power flux at my house my computer turned off. I will restart the database scan at about 1/3 of the way through. --mboverload@ 19:15, 30 July 2008 (UTC)
ENDING DISCUSSION HERE - WIKIPROJECT NOW FORMED AND UNDER DEVELOPMENT
License
Mboverload tried to claim that the list is under the GPL. No, it is not! You cannot switch licenses at will unless you are the developer and own the code. TestPilottalk to me! 10:13, 30 July 2008 (UTC)
- What list? Wikipedia:AutoWikiBrowser/Typos, this list? It is under GFDL, or at least that's what the edit box tells me when I make my contributions to it and agree to license my contributions under GFDL, right? -- JHunterJ (talk) 10:31, 30 July 2008 (UTC)
- Yeah, Wikipedia:AutoWikiBrowser/Typos is under GFDL, which basically means that no one can integrate it into any GPL-based project. TestPilottalk to me! 10:52, 30 July 2008 (UTC)
- I am the one who built the software InfoBox. I copied the AutoWikiBrowser infobox, which is licensed under the GPL. I simply forgot to change the license to GFDL. --mboverload@ 18:09, 30 July 2008 (UTC)
- =( --mboverload@ 23:33, 30 July 2008 (UTC)
"Nasalisation"
These rules seem rather useless. The first two letters are merely transposed, which could happen to any word in the English language. Why not check transposals of interior letters too?! This sort of thing might be worth checking for very common words, but "Nasalisation" isn't one of them. I suggest we delete these two rules.--BillFlis (talk) 22:29, 31 July 2008 (UTC)
- Hey Bill, in the future could you copy the rules you are referring to? I'm lazy. --mboverload@ 03:08, 1 August 2008 (UTC)
TYPO REVIEW: "Honshu-" find="\bHonshu\b" replace="Honshu-"
<Typo DISABLED="Honshu-" find="\bHonshu\b" replace="Honshu-" />
Why does this add the - at the end of the word?--mboverload@ 02:56, 1 August 2008 (UTC)
- It used to add a macron over the u, until the massive resort. -- JHunterJ (talk) 08:29, 1 August 2008 (UTC)
- =( --mboverload@ 08:30, 1 August 2008 (UTC)
- Another problem with a character-based sort is that it separates root words that have rules with and without prefixes. This makes it awkward to detect redundant rules and to consolidate sets of rules within a single rule. Also, words having accented characters get put in unexpected places within the sort. The purpose of sorting is to make it easy on the developers, and a computer sort disturbs this. I've added a guideline not to do this, but to alphabetise in a sensible way, like you would find the root words in a dictionary.--BillFlis (talk) 11:54, 1 August 2008 (UTC)
- Also, isn't the intended rule rather hypercorrective? My (American) English dictionary lists Honshu without any macron.--BillFlis (talk) 11:54, 1 August 2008 (UTC)
- The WP article uses the macron in its title - that's why I added this rule. The same applies to a lot of other names (e.g. Valparaíso, Chile or Zürich or Łódź), which are often spelled without the diacritics in English, but WP ought to be internally consistent. Colonies Chris (talk) 08:13, 6 August 2008 (UTC)
error while loading typo list
I am getting this error while trying to load the typo list in AWB.--Rockfang (talk) 15:08, 2 August 2008 (UTC)
It appears it was just fixed. :) Rockfang (talk) 15:12, 2 August 2008 (UTC)
- Yeah, I was just doing some testing and saw it.. =) 15:49, 2 August 2008 (UTC)
Development list
Would it be useful to have a page where you can test new regexes that will be loaded either with, or instead of, the main typo list, so you can debug live/reduce chances of causing problems to live lists?
—Reedy 15:51, 2 August 2008 (UTC)
- I think testing should be done in Find&Replace. However, it would be FKING AWESOME if there was an "export to RETF" feature of Find&Replace once I'm done testing. --mboverload@ 17:51, 2 August 2008 (UTC)
<Typo word="Buoy" find="\b(B|b)ouy(s?|ant)\b" replace="$1uoy$2"/>
Bouy is a place in France...
Probably wants removing then?
—Reedy 22:55, 2 August 2008 (UTC)
- Let's just remove the question mark so it finds only "Bouys" and "Bouyant".--BillFlis (talk) 11:48, 3 August 2008 (UTC)
- Makes more sense. Cheers —Reedy 14:20, 3 August 2008 (UTC)
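A quick way to check the effect of dropping the question mark, sketched in Python. This assumes Python's `re` behaves like AWB's .NET regex engine for this pattern; `$1uoy$2` is written as `\1uoy\2`, and the names `ORIGINAL`, `NARROWED` and `apply_rule` are only for illustration.

```python
import re

ORIGINAL = r"\b(B|b)ouy(s?|ant)\b"   # rule as quoted above
NARROWED = r"\b(B|b)ouy(s|ant)\b"    # same rule with the question mark removed
REPLACE = r"\1uoy\2"                 # .NET "$1uoy$2" in Python syntax


def apply_rule(find: str, text: str) -> str:
    """Apply one find/replace pair to the text."""
    return re.sub(find, REPLACE, text)


if __name__ == "__main__":
    for text in ["Bouy is a place in France", "bouyant bouys"]:
        print("original:", apply_rule(ORIGINAL, text))
        print("narrowed:", apply_rule(NARROWED, text))
```

With `s?` the second group can be empty, so the bare word "Bouy" matches; with plain `s` the rule needs "Bouys" or "Bouyant" and leaves the place name alone.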
TYPO REVIEW: Imp-/Imm-/Imb-
I have disabled this line in production:
<Typo DISABLED="Imp-/Imm-/Imb-" find="(?!\b[Ii]n(?:ba[lr]|migrante)\b)\b(I|i)n(p[b-gi-tv-z]|m[b-np-z]|b[a-npqstv-z])\B" replace="$1m$2" />
It has a nasty habit of finding every word that begins with "In" and replacing it with "Im". Is there a way to make this less inclusive? --mboverload@ 02:42, 3 August 2008 (UTC)
- Inserting in a few hours. --mboverload@ 01:58, 4 August 2008 (UTC)
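To get a feel for how inclusive the disabled rule is, it can be run over a few sample words. The sketch below is in Python and assumes the pattern behaves the same as under AWB's .NET engine; the sample words (including the proper name "Inpex") are chosen for illustration and are not taken from actual AWB edits.

```python
import re

# The disabled rule, copied from the find="..." attribute above.
IMP_RULE = re.compile(
    r"(?!\b[Ii]n(?:ba[lr]|migrante)\b)"
    r"\b(I|i)n(p[b-gi-tv-z]|m[b-np-z]|b[a-npqstv-z])\B"
)
REPLACE = r"\1m\2"  # .NET "$1m$2" in Python syntax


def apply_imp_rule(word: str) -> str:
    """Return the word after the Imp-/Imm-/Imb- rule has been applied."""
    return IMP_RULE.sub(REPLACE, word)


if __name__ == "__main__":
    for word in ["inpossible", "inmediately", "input", "inmate", "Inpex"]:
        print(word, "->", apply_imp_rule(word))
```

The character classes already skip common real words such as "input" and "inmate", but any name or foreign word starting with "Inp", "Inm" or "Inb" followed by one of the allowed letters is still rewritten, which may be where the false positives come from.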
Tae Kwon Do (taekwondo)
<Typo word="Know" find="\b(K|k)(?:wno|on?w|n?wo)(n?|s)\b" replace="$1now$2"/> <Typo word="Know" find="\bNk(?:wo|ow)\b" replace="Know"/>
In the former, kwon is wanting to be changed to known...
Presumably we should change Tae Kwon Do --> taekwondo
—Reedy 20:37, 3 August 2008 (UTC)
- I agree. Change it. --mboverload@ 20:45, 3 August 2008 (UTC)
- Working on it now.--mboverload@ 02:00, 4 August 2008 (UTC)
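The false positive on "Kwon" can be reproduced with the first "Know" rule as follows. This is a Python sketch that assumes the same matching behaviour as AWB's .NET engine. The guarded variant with a negative lookbehind is a hypothetical alternative shown only for comparison; it is not what was agreed above (normalising "Tae Kwon Do" to "taekwondo").

```python
import re

KNOW_RULE = r"\b(K|k)(?:wno|on?w|n?wo)(n?|s)\b"             # rule as quoted above
GUARDED = r"\b(?<![Tt]ae )(K|k)(?:wno|on?w|n?wo)(n?|s)\b"   # hypothetical guard
REPLACE = r"\1now\2"                                        # .NET "$1now$2"


def apply_know(find: str, text: str) -> str:
    """Apply one variant of the "Know" rule to the text."""
    return re.sub(find, REPLACE, text)


if __name__ == "__main__":
    for text in ["I konw it", "Tae Kwon Do"]:
        print("rule:   ", apply_know(KNOW_RULE, text))
        print("guarded:", apply_know(GUARDED, text))
```

"Kwon" matches because `n?wo` accepts "wo" and the trailing group then takes the "n", giving "Known". The lookbehind only skips the word when it directly follows "Tae " or "tae ", so genuine typos elsewhere are still fixed.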
Shouldn't be a typo for America(s)
need help - rogue regex
There is some regex that keeps making these changes: [1]. It always changes the second n in a word to m and I can't figure out which regex is doing this. (note I saved the page to show you what was happening - I have already undone the edit) --mboverload@ 01:49, 4 August 2008 (UTC)
knots in terms of speed (abbrev. as kn)
The abbreviation for knots (kn) in terms of speed is not on the safe list (currently trying to correct as "know"). Adding this to the library would be great. Thanks. - Jameson L. Tai talk ♦ contribs 22:54, 5 August 2008 (UTC)
RegEx tools
Any suggestions? I would love a tool that showed me all the words that a regex would fit (to a reasonable limit for greedy ones). --mboverload@ 06:40, 6 August 2008 (UTC)
- I never thought that was possible for anything but the simplest regexes. MaxSem(Han shot first!) 07:54, 6 August 2008 (UTC)
- I've got a 3.1GHz Core2Duo - I can stand to bruteforce it =P --mboverload@ 17:11, 6 August 2008 (UTC)
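A brute-force version of such a tool is easy to sketch for short strings. The Python below simply tries every string over a given alphabet up to a length limit and keeps the ones the regex matches in full. `enumerate_matches` is a hypothetical name, and the approach only stays practical for small alphabets or short lengths, since the number of candidates grows exponentially.

```python
import itertools
import re


def enumerate_matches(find, alphabet, max_len):
    """List every string over `alphabet`, up to `max_len` characters,
    that the regex matches as a whole word."""
    pattern = re.compile(find)
    hits = []
    for length in range(1, max_len + 1):
        for chars in itertools.product(alphabet, repeat=length):
            candidate = "".join(chars)
            if pattern.fullmatch(candidate):
                hits.append(candidate)
    return hits


if __name__ == "__main__":
    # Every lowercase spelling of up to 4 letters caught by the "Know" rule.
    rule = r"\b(K|k)(?:wno|on?w|n?wo)(n?|s)\b"
    print(enumerate_matches(rule, "knosw", 4))
```

Restricting the alphabet to the letters that appear in the rule keeps the search small (780 candidates here); using all 26 letters at length 5 already means almost 12 million candidates.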
Imtuk→Intuk
- Moved from WT:AWB
Any idea why the spellchecker is doing this? It's done it twice in two days. CambridgeBayWeather Have a gorilla 22:08, 7 August 2008 (UTC)
- It's due to rule <Typo word="Ind-/Inn-/Int-/Inv-" find="\b(I|i)m(d[ac-z][a-ce-z]|n[b-z]|t[a-hj-qs-z]|v)\B" replace="$1n$2" />. Thoughts on fixing, guys? MaxSem(Han shot first!) 22:28, 7 August 2008 (UTC)
- 1, how did you figure that out, 2, kill it with fire. It is more destructive and false-positiveish than people realize. --mboverload@ 01:17, 8 August 2008 (UTC)
- How? You should really keep up with SVN, it has many cute things;) MaxSem(Han shot first!) 08:02, 8 August 2008 (UTC)
- Especially when it was partially added from his request, hey Max. ;) —Reedy 18:30, 8 August 2008 (UTC)
- Meanwhile, I removed that rule. MaxSem(Han shot first!) 21:33, 8 August 2008 (UTC)
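For tracking down which rule caused a change without the SVN build, one option is to apply each rule on its own and report the ones that alter the text. The sketch below is in Python and is not the AWB feature mentioned above; the `RULES` table holds just two rules quoted on this page, and `rules_that_fire` is a hypothetical helper name.

```python
import re

# name -> (find, replace), with .NET "$1" written as "\1" for Python.
RULES = {
    "Ind-/Inn-/Int-/Inv-": (
        r"\b(I|i)m(d[ac-z][a-ce-z]|n[b-z]|t[a-hj-qs-z]|v)\B",
        r"\1n\2",
    ),
    "Buoy": (r"\b(B|b)ouy(s?|ant)\b", r"\1uoy\2"),
}


def rules_that_fire(text, rules=RULES):
    """Return (rule name, changed text) for every rule that alters the text."""
    hits = []
    for name, (find, replace) in rules.items():
        changed = re.sub(find, replace, text)
        if changed != text:
            hits.append((name, changed))
    return hits


if __name__ == "__main__":
    print(rules_that_fire("Imtuk"))
```

Running each rule in isolation is slower than one combined pass, but it points straight at the rule responsible for a surprising edit.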
on on→on
Not sure how big a problem this is, but thought I'd mention it. A recent edit to The Culture using this tool resulted in a problem (which has been fixed). The text "see everything going on on a given planet" was changed to "see everything going on a given planet". Naturally there are better ways of wording that sentence that eliminate the double "on" but removing one and leaving it otherwise intact is not exactly an improvement. Just mentioning it because there might be other instances as yet undetected. SilentC (talk) 01:20, 8 August 2008 (UTC)
- Reedy =0 Thanks Silent! --mboverload@ 01:36, 8 August 2008 (UTC)
payed to paid
I've seen it a few times where the payed should've been played...
—Reedy 19:36, 9 August 2008 (UTC)
- "Payed" is in this dictionary.--BillFlis (talk) 12:35, 10 August 2008 (UTC)