Wikipedia talk:AutoWikiBrowser/Typos

Shortcut

WT:AWB/T

Archives

/Archive 1

This page has archives. Sections older than 20 days may be automatically archived.

Proposal: List of developers

I think it would be a good idea to have an official "Project Template" on the main Typos page. I have the old project template here: WP:RETF
That way everyone who contributes gets an "official" recognition.
I'm not sure how we would sort the developers list. Either by seniority (I left so I might be on the bottom) or alphabetically.

Ideas/comments/concern/disdain?--mboverload @ 02:01, 9 July 2008 (UTC)Reply

A list of regular maintainers of the list might be useful, so that somebody with an urgent issue would have a point of contact. For regular stuff there's the talk page.

I don't know how we would sort such a list – perhaps the AWB statistics report info on users fixing typos? Rjwilmsi 06:42, 9 July 2008 (UTC)Reply

Tis a good idea.. Just as long as you either put it on the page that is transcluded for information already, or on a new page =) and transclude that! —Ree dy 11:05, 11 July 2008 (UTC)Reply

Cool. Thinking about it...who ARE the developers now? Right now I only know of me, Rjwilmsi, and Reedy. Anyone else? --mboverload @ 18:47, 11 July 2008 (UTC)Reply

Page history? ;p - BillFlis and Rjwilmsi did the majority of the development when you were away... Hmm. As for the Actual AWB developers, MaxSem and I are doing 99% of the work that is being done to AWB, Kingboyk is too busy with other things to really contribute atm —Ree dy 08:21, 29 July 2008 (UTC)Reply

Triple letters

I have removed the triple letter RegEx temporarily and it is pasted here:
<Typo word="Triple letters" find="(?!\b(?:Eisschnelllauf|Killlai|(?:Pya|G|g)rrrl?|[Rr]sssf|[Oo]ooh|[A-Za-z]+([a-z])\1\1\1[a-z]*|[a-fw]+)\b)\b([A-Za-z]+)([a-gj-wyz])\3\3([a-z]+)\b" replace="$2$3$3$4" />

I removed it because, in spite of the great work that went into building it, I have not come across anything that it has fixed properly after around 1000 randomized edits. Could someone explain this one to me? --mboverload @ 22:57, 27 July 2008 (UTC)Reply

Are you saying there are too many false positives? I did run it against a database dump last month, so that might explain why it doesn't make many fixes at the moment. Rjwilmsi 06:42, 28 July 2008 (UTC)Reply

Ah, ok. Thanks for doing all that work! I'm just wondering if there are too many false positives? I basically think of you as the lead developer so let me know what you think. (See the typos page - we're a project now with your name highlighted in the dev list) --mboverload @ 14:56, 28 July 2008 (UTC)Reply

False positives seemed to be not that many once the exceptions above were included, and remaining ones were usually for foreign words/phrases which needed to be tagged as {{lang|de|worrrd}} etc. What false positives were you getting? There's no need to remove this simply if there are currently no hits – they are sure to build up again. Rjwilmsi 11:31, 29 July 2008 (UTC)Reply

Dear god, I've been looking for that template - no one on IRC seems to know about it! Thank you Rjwilmsi! Can I call you Rj if I'm lazy? --mboverload @ 02:06, 31 July 2008 (UTC)Reply

Per Rjwilmsi I have readded the regular expression. --mboverload @ 20:53, 31 July 2008 (UTC)Reply
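
For anyone else wondering what the triple letter rule does, here is a rough sketch of it in Python. The pattern is copied from the rule quoted above; the only change is the replacement syntax (Python's `\2\3\3\4` in place of `$2$3$3$4`). AWB runs rules on the .NET regex engine, so treat Python's `re` as an approximation, and note that `fix_triple_letters` is just an illustrative name, not anything in AWB.

```python
import re

# Pattern copied from the quoted rule. The leading negative lookahead skips the
# listed exceptions, words with four or more repeated letters, and words made
# only of the letters a-f and w. The tripled letter itself may not be h, i or x.
TRIPLE = re.compile(
    r"(?!\b(?:Eisschnelllauf|Killlai|(?:Pya|G|g)rrrl?|[Rr]sssf|[Oo]ooh"
    r"|[A-Za-z]+([a-z])\1\1\1[a-z]*|[a-fw]+)\b)"
    r"\b([A-Za-z]+)([a-gj-wyz])\3\3([a-z]+)\b"
)


def fix_triple_letters(text):
    """Collapse a tripled letter inside a word down to a doubled letter."""
    return TRIPLE.sub(r"\2\3\3\4", text)


print(fix_triple_letters("the commmittee met"))   # tripled m is reduced to mm
print(fix_triple_letters("Eisschnelllauf"))       # listed exception, left alone
print(fix_triple_letters("Grrrl"))                # listed exception, left alone
```

The rule needs at least one letter before and after the tripled letter, so a triple at the very start or end of a word is not touched.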

TypoScan Announcement

From now on I will be scanning every database extract against the entire Typo list. In the future we will be able to "assign" a section of the 'pedia with known typos to an editor and see a real, tangible benefit. The expected size of this list is projected to be over 100,000 articles, or around 4.5% of all articles on Wikipedia.

Once we go through the list we can start recording a blacklist of articles that should not be checked. Eventually this number will be brought down by the information about false-positives.

Technical details
This is EXTREMELY SLOW GOING. At 17 gigabytes of pure text the database is MASSIVE. In addition, our ever expanding typo list needs to be checked against EVERY ARTICLE in Wikipedia. Over 2.4 MILLION! At max speed my current computer will process this in about 3-5 days.

My current limiting factor is the database scanning software and my CPU. The database scanner is not built for dual core systems and thus only uses 50% of my computer's potential.

Amount of memory is not a problem. The DB scanner only takes about 400 megabytes. It's the speed of the memory.

If my hard drive then becomes the problem I will move the database onto my 10,000 RPM SATA system drive.

Current system

CPU - Intel Core 2 Duo E6600 (Conroe) @2400 MHz
Motherboard - MSI MS-7350 | nForce 650i SLI
Memory - 4 gigabytes of DDR2 PC-5400 memory @333MHz

Hardware updates
In order to better support this new endeavor I am going to be upgrading my computer's hardware.

CPU - I will be buying the fastest CPU that I can find that doesn't cost 1000 dollars
Memory - I will upgrade my computer to DDR2 PC-6400 from DDR2 PC-5400
Overclocking - My computer's entire system was built to be overclocked. I anticipate even further gains in speed

--mboverload @ 07:43, 29 July 2008 (UTC)Reply
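
A quick back-of-envelope check of the figures above, as a sketch only. It assumes "17 gigabytes" means 17 × 10^9 bytes and that the scan runs flat out for the whole 3 to 5 days; the constant and function names are made up for illustration.

```python
# Figures taken from the announcement; the unit interpretation is an assumption.
ARTICLES = 2_400_000          # "Over 2.4 MILLION" articles
DUMP_BYTES = 17 * 10**9       # "17 gigabytes of pure text", assumed decimal GB
SECONDS_PER_DAY = 86_400


def scan_rates(days):
    """Return (articles per second, bytes per second) for a scan lasting `days`."""
    seconds = days * SECONDS_PER_DAY
    return ARTICLES / seconds, DUMP_BYTES / seconds


for days in (3, 5):
    articles_per_s, bytes_per_s = scan_rates(days)
    print(f"{days} days: {articles_per_s:.1f} articles/s, {bytes_per_s / 1000:.0f} kB/s")
```

Under those assumptions the scan handles fewer than ten articles per second and reads well under 100 kB of text per second, which is consistent with the regexes, not the disk, being the slow part.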

I'm not sure how you could really thread off something reading from a file. I wonder if it's worth looking at a way of running the DBScanner against a MySQL instance or similar, so the file has been loaded back into a database (obviously it would have to be local or a mirror, not a WP one, to save bandwidth). Overclocking your CPU will probably help reduce processing time, and the faster RAM should help. I would also move it to your 10k RPM drive; that's 33% faster rotation, so less seek time etc. —Ree dy 08:27, 29 July 2008 (UTC)Reply

Can't we just find a handful of users to scan a portion of the database dump each, if we all download the same one? I assume the list of articles dump is in the same order as the articles-list dump, then we can just start from article x?

I did try this myself a couple of months ago (March db dump, ~65,000 hits) but gave up due to there being so many false positives for foreign words and Latin/scientific names. A great idea if we get it right though. Rjwilmsi 11:07, 29 July 2008 (UTC)Reply

I don't think that 10K RPM drives will help. The main slowdown is running all those shiny regexes, so you don't need much raw HDD read speed, and if your file system isn't deadly fragmented, you don't need a fast seek time either. Probably, we could improve speed by making it parallel, but CPU will still be the main dependency. MaxSem^{(Han shot first!)} 16:11, 29 July 2008 (UTC)Reply

If an article-specific exception list is implemented, along with the long-requested "Prune list" option, it will become possible to spellcheck very long lists, even online. Sure, the first run will be slow, because you will need to mark those foreign and made-up words as exceptions. But then... Just imagine spellfixing the whole of en.Wikipedia in a few hours (of human time; how long the computer works in the background doesn't really matter). TestPilot^{talk to me!} 08:50, 30 July 2008 (UTC)Reply
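
The "everyone scans a portion of the same dump" idea from earlier in this thread could be sketched like this. It assumes every volunteer has the same dump and that articles can be addressed by position in it, which nobody has confirmed here; `split_ranges` is a made-up helper, not part of the database scanner.

```python
def split_ranges(total_articles, volunteers):
    """Split article positions 0..total_articles-1 into contiguous half-open ranges."""
    base, extra = divmod(total_articles, volunteers)
    ranges = []
    start = 0
    for i in range(volunteers):
        # The first `extra` volunteers take one additional article each.
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


# Example: 2.4 million articles shared between five volunteers.
for first, end in split_ranges(2_400_000, 5):
    print(f"scan articles {first} up to (not including) {end}")
```

Each volunteer would then start from their own "article x" and stop at the next person's start.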

Reset =(

During a power flux at my house my computer turned off. I will restart the database scan at about 1/3 of the way through. --mboverload @ 19:15, 30 July 2008 (UTC)Reply

ENDING DISCUSSION HERE - WIKIPROJECT NOW FORMED AND UNDER DEVELOPMENT

License

Mboverload tried to claim that the list is under the GPL. No, it is not! You cannot switch licenses at will unless you are the developer and own the code. TestPilot^{talk to me!} 10:13, 30 July 2008 (UTC)Reply

What list? Wikipedia:AutoWikiBrowser/Typos, this list? It is under GFDL, or at least that's what the edit box tells me when I make my contributions to it and agree to license my contributions under GFDL, right? -- JHunterJ (talk) 10:31, 30 July 2008 (UTC)Reply

Yeah, Wikipedia:AutoWikiBrowser/Typos is under GFDL, which basically means that no one can integrate it into any GPL-based project. TestPilot^{talk to me!} 10:52, 30 July 2008 (UTC)Reply

-->I am the one who built the software InfoBox. I copied the AutoWikiBrowser infobox, which is licensed under the GPL. I simply forgot to change the license to GFDL. --mboverload @ 18:09, 30 July 2008 (UTC)Reply

=( --mboverload @ 23:33, 30 July 2008 (UTC)Reply

"Nasalisation"

These rules seem rather useless. The first two letters are merely transposed, which could happen to any word in the English language. Why not check transposals of interior letters too?! This sort of thing might be worth checking for very common words, but "Nasalisation" isn't one of them. I suggest we delete these two rules.--BillFlis (talk) 22:29, 31 July 2008 (UTC)Reply

Hey Bill, in the future could you copy the rules you are referring to? I'm lazy. --mboverload @ 03:08, 1 August 2008 (UTC)Reply

TYPO REVIEW: "Honshu-" find="\bHonshu\b" replace="Honshu-"

<Typo DISABLED="Honshu-" find="\bHonshu\b" replace="Honshu-" />
Why does this add the - at the end of the word?--mboverload @ 02:56, 1 August 2008 (UTC)Reply

It used to add a macron over the u, until the massive resort. -- JHunterJ (talk) 08:29, 1 August 2008 (UTC)Reply

=( --mboverload @ 08:30, 1 August 2008 (UTC)Reply

Another problem with a character-based sort is that it separates root words that have rules with and without prefixes. This makes it awkward to detect redundant rules and to consolidate sets of rules within a single rule. Also, words having accented characters get put in unexpected places within the sort. The purpose of sorting is to make it easy on the developers, and a computer sort disturbs this. I've added a guideline not to do this, but to alphabetise in a sensible way, like you would find the root words in a dictionary.--BillFlis (talk) 11:54, 1 August 2008 (UTC)Reply

Also, isn't the intended rule rather hypercorrective? My (American) English dictionary lists Honshu without any macron.--BillFlis (talk) 11:54, 1 August 2008 (UTC)Reply

The WP article uses the macron in its title - that's why I added this rule. The same applies to a lot of other names (e.g. Valparaíso, Chile or Zürich or Łódź), which are often spelled without the diacritics in English, but WP ought to be internally consistent. Colonies Chris (talk) 08:13, 6 August 2008 (UTC)Reply
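
On the sorting complaint earlier in this section: a plain character-based sort orders by code point, so capitals sort before lowercase and accented letters land after z. A dictionary-style order can be approximated with a sort key that ignores case and strips combining marks. This is only a sketch; `dictionary_key` is a made-up name, the sample words are arbitrary, and letters with no Unicode decomposition (such as Ł) would still need special handling.

```python
import unicodedata


def dictionary_key(word):
    """Sort key that ignores case and drops combining accents."""
    decomposed = unicodedata.normalize("NFKD", word)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


# Escapes are used so the accented letters are unambiguous in the source.
words = ["Zebra", "apple", "\u00c9migr\u00e9", "emigrant"]

print(sorted(words))                      # code point order
print(sorted(words, key=dictionary_key))  # closer to dictionary order
```

With the key, the accented word sits next to its unaccented neighbours instead of at the end of the list.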

error while loading typo list

I am getting this error while trying to load the typo list in AWB.--Rockfang (talk) 15:08, 2 August 2008 (UTC)Reply

It appears it was just fixed. :) Rockfang (talk) 15:12, 2 August 2008 (UTC)Reply

Yeah, I was just doing some testing and saw it.. =) 15:49, 2 August 2008 (UTC)

Development list

Would it be useful to have a page where you can test new regexes that will be loaded either with, or instead of, the main typo list, so you can debug live/reduce chances of causing problems to live lists?

—Ree dy 15:51, 2 August 2008 (UTC)Reply

I think testing should be done in Find&Replace. However, it would be FKING AWESOME if there was an "export to RETF" feature of Find&Replace once I'm done testing. --mboverload @ 17:51, 2 August 2008 (UTC)Reply

<Typo word="Buoy" find="\b(B|b)ouy(s?|ant)\b" replace="$1uoy$2"/>

Bouy is a place in France...

Probably wants removing then?

—Ree dy 22:55, 2 August 2008 (UTC)Reply

Let's just remove the question mark so it finds only "Bouys" and "Bouyant".--BillFlis (talk) 11:48, 3 August 2008 (UTC)Reply

Makes more sense. Cheers —Ree dy 14:20, 3 August 2008 (UTC)Reply
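
A small sketch of the difference between the rule in the section heading and the version with the question mark removed, using Python's `re` as a stand-in for the .NET engine AWB uses. The replacement `$1uoy$2` is written `\1uoy\2` in Python; the function names are illustrative only.

```python
import re

# Rule as given in the heading: "s?" can match nothing, so bare "Bouy" is hit.
ORIGINAL = re.compile(r"\b(B|b)ouy(s?|ant)\b")
# Suggested change: drop the "?", so only "Bouys" and "Bouyant" are hit.
REVISED = re.compile(r"\b(B|b)ouy(s|ant)\b")


def apply_original(text):
    return ORIGINAL.sub(r"\1uoy\2", text)


def apply_revised(text):
    return REVISED.sub(r"\1uoy\2", text)


sample = "Bouy is a commune; the bouys were bouyant."
print(apply_original(sample))  # also rewrites the place name
print(apply_revised(sample))   # leaves the place name alone
```

The revised rule no longer fixes a genuinely misspelled singular "bouy", which is the trade-off for sparing the place name.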

TYPO REVIEW: Imp-/Imm-/Imb-

I have disabled this line in production:
<Typo DISABLED="Imp-/Imm-/Imb-" find="(?!\b[Ii]n(?:ba[lr]|migrante)\b)\b(I|i)n(p[b-gi-tv-z]|m[b-np-z]|b[a-npqstv-z])\B" replace="$1m$2" />
It has a nasty habit of finding every word that begins with "In" and replacing it with "Im". Is there a way to make this less inclusive? --mboverload @ 02:42, 3 August 2008 (UTC)Reply

Inserting in a few hours. --mboverload @ 01:58, 4 August 2008 (UTC)Reply
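
To see what the disabled rule actually touches, here is a sketch that runs it over a few sample words in Python (`\1m\2` stands in for `$1m$2`; AWB itself uses .NET regexes). The sample words and the name `apply_imp_rule` are my own choices, not from the typo list.

```python
import re

# Pattern copied from the disabled rule above, split only for readability.
IMP_RULE = re.compile(
    r"(?!\b[Ii]n(?:ba[lr]|migrante)\b)"
    r"\b(I|i)n(p[b-gi-tv-z]|m[b-np-z]|b[a-npqstv-z])\B"
)


def apply_imp_rule(word):
    """Apply the Imp-/Imm-/Imb- rule to a single word."""
    return IMP_RULE.sub(r"\1m\2", word)


for word in ["inpossible", "inbalance", "inmediate",
             "input", "inbox", "inmate", "inside", "Inbal"]:
    print(word, "->", apply_imp_rule(word))
```

In this sketch only words starting with inp, inm or inb followed by certain letters are changed, and words such as "inside" or "input" are left alone. That may mean the unwanted In to Im changes come from somewhere else, but this is a guess from a handful of words, not a diagnosis.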

Tae Kwon Do (taekwondo)

In the former, kwon is wanting to be changed to known...

Presumably we should change Tae Kwon Do --> taekwondo

—Ree dy 20:37, 3 August 2008 (UTC)Reply

I agree. Change it. --mboverload @ 20:45, 3 August 2008 (UTC)Reply

Working on it now.--mboverload @ 02:00, 4 August 2008 (UTC)Reply

Amerias

Shouldn't be a typo for America(s)

—Ree dy 23:27, 3 August 2008 (UTC)Reply

need help - rogue regex

There is some regex that keeps making these changes: [1]. It always changes the second n in a word to m and I can't figure out which regex is doing this. (Note: I saved the page to show you what was happening - I have already undone the edit.) --mboverload @ 01:49, 4 August 2008 (UTC)Reply

knots in terms of speed (abbrev. as kn)

The abbreviation for knots (kn) in terms of speed is not on the safe list (currently trying to correct as "know"). Adding this to the library would be great. Thanks. - Jameson L. Tai ^{talk ♦ contribs} 22:54, 5 August 2008 (UTC)Reply

RegEx tools

Any suggestions? I would love a tool that showed me all the words that a regex would fit (to a reasonable limit for greedy ones). --mboverload @ 06:40, 6 August 2008 (UTC)Reply

I never thought it was possible for anything but the simplest regexes. MaxSem^{(Han shot first!)} 07:54, 6 August 2008 (UTC)Reply

I've got a 3.1GHz Core2Duo - I can stand to bruteforce it =P --mboverload @ 17:11, 6 August 2008 (UTC)Reply
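
Brute force is workable for the fixed-length part of a rule. As a sketch, the snippet below tries every two-letter continuation after a stem and lists the ones a rule accepts. The helper `prefixes_matched` is hypothetical, the trailing "x" is only there so a rule ending in `\B` sees more word after the match, and the example pattern is the Imp-/Imm-/Imb- rule discussed further up this page.

```python
import re
from itertools import product
from string import ascii_lowercase

IMP_PATTERN = (
    r"(?!\b[Ii]n(?:ba[lr]|migrante)\b)"
    r"\b(I|i)n(p[b-gi-tv-z]|m[b-np-z]|b[a-npqstv-z])\B"
)


def prefixes_matched(pattern, stem, extra_letters=2):
    """List every stem + letters combination that the rule matches at the start."""
    rule = re.compile(pattern)
    hits = []
    for letters in product(ascii_lowercase, repeat=extra_letters):
        prefix = stem + "".join(letters)
        # Append a filler letter so the word continues past the matched prefix.
        if rule.match(prefix + "x"):
            hits.append(prefix)
    return hits


hits = prefixes_matched(IMP_PATTERN, "in")
print(len(hits), "prefixes matched, for example:", hits[:5])
```

This only enumerates prefixes; listing real words would need a word list to filter, and open-ended parts of a pattern still have to be capped at some length.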

Imtuk→Intuk

Moved from WT:AWB

Any idea why the spellchecker is doing this? It's done it twice in two days. CambridgeBayWeather Have a gorilla 22:08, 7 August 2008 (UTC)Reply

It's due to rule <Typo word="Ind-/Inn-/Int-/Inv-" find="\b(I|i)m(d[ac-z][a-ce-z]|n[b-z]|t[a-hj-qs-z]|v)\B" replace="$1n$2" />. Thoughts on fixing, guys? MaxSem^{(Han shot first!)} 22:28, 7 August 2008 (UTC)Reply

1, how did you figure that out, 2, kill it with fire. It is more destructive and false-positiveish than people realize. --mboverload @ 01:17, 8 August 2008 (UTC)Reply

How? You should really keep up with SVN, it has many cute things;) MaxSem^{(Han shot first!)} 08:02, 8 August 2008 (UTC)Reply

Especially when it was partially added from his request, hey Max. ;) —Ree dy 18:30, 8 August 2008 (UTC)Reply

Meanwhile, I removed that rule. MaxSem^{(Han shot first!)} 21:33, 8 August 2008 (UTC)Reply
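
For the record, a sketch reproducing the Imtuk change with the rule quoted above, run in Python instead of AWB's .NET engine (`\1n\2` stands in for `$1n$2`). The function name and the extra sample words are illustrative.

```python
import re

# Pattern copied from the Ind-/Inn-/Int-/Inv- rule quoted in this section.
IND_RULE = re.compile(r"\b(I|i)m(d[ac-z][a-ce-z]|n[b-z]|t[a-hj-qs-z]|v)\B")


def apply_ind_rule(word):
    """Apply the Ind-/Inn-/Int-/Inv- rule to a single word."""
    return IND_RULE.sub(r"\1n\2", word)


print(apply_ind_rule("Imtuk"))   # "t" followed by "u" fits t[a-hj-qs-z]
print(apply_ind_rule("imvent"))  # the kind of typo the rule was aimed at
print(apply_ind_rule("import"))  # "p" is not covered, so no change
```

Any word starting with Imt plus a letter other than i or r is caught, which is how a proper name like Imtuk gets rewritten.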

on on→on

Not sure how big a problem this is, but thought I'd mention it. A recent edit to The Culture using this tool resulted in a problem (which has been fixed). The text "see everything going on on a given planet" was changed to "see everything going on a given planet". Naturally there are better ways of wording that sentence that eliminate the double "on" but removing one and leaving it otherwise intact is not exactly an improvement. Just mentioning it because there might be other instances as yet undetected. SilentC (talk) 01:20, 8 August 2008 (UTC)Reply

Reedy =0 Thanks Silent! --mboverload @ 01:36, 8 August 2008 (UTC)Reply
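
To illustrate why a doubled-word fix misfires here, below is a sketch with a hypothetical pattern. The actual rule that made this edit is not quoted in this thread, so the regex and the name `collapse_repeated_word` are assumptions made purely for illustration.

```python
import re

# Hypothetical doubled-word rule: the same word twice, separated by one space.
DOUBLED = re.compile(r"\b(\w+) \1\b")


def collapse_repeated_word(text):
    """Replace an immediately repeated word with a single copy."""
    return DOUBLED.sub(r"\1", text)


print(collapse_repeated_word("see everything going on on a given planet"))
```

A rule like this cannot tell an accidental repeat from a legitimate one such as "going on on", so those edits need a human check before saving.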

payed to paid

I've seen it a few times where the payed should've been played...

—Ree dy 19:36, 9 August 2008 (UTC)Reply

"Payed" is in this dictionary.--BillFlis (talk) 12:35, 10 August 2008 (UTC)Reply