Simple sample code showing how to do stemming with different NLTK stemmers in Python.
Stemmers included in this sample code: the WordNet lemmatizer, the Porter Stemmer, the Snowball Stemmer, and the Lancaster Stemmer.
See the full documentation of all the stemmers provided by NLTK here.
Prerequisites:
- Install the [NLTK Python package](http://nltk.org/install.html)
- Download the WordNet corpus: `python -m nltk.downloader all`
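If you only want the corpus this sample needs, downloading WordNet alone should also work (a lighter alternative to `all`; note that newer NLTK versions may additionally require the `omw-1.4` resource for the lemmatizer):

```python
import nltk

nltk.download("wordnet")    # just the WordNet corpus
# nltk.download("omw-1.4")  # newer NLTK versions may also need this for the lemmatizer
```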
To run:
`python stemmingWithNLTK.py`
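In essence, the script compares the four stemmers listed above on a list of words. The snippet below is a minimal sketch of that idea (the word list is illustrative and the actual `stemmingWithNLTK.py` in this repo may differ):

```python
# Minimal sketch: compare the WordNet lemmatizer with the Porter, Snowball,
# and Lancaster stemmers on a few sample words.
from nltk.stem import WordNetLemmatizer, PorterStemmer, SnowballStemmer, LancasterStemmer

words = ["caresses", "flies", "dies", "mules", "denied", "agreed", "plotted",
         "meeting", "stating", "itemization", "traditional", "reference",
         "large", "cars"]

wnl = WordNetLemmatizer()
porter = PorterStemmer()
snowball = SnowballStemmer("english")
lancaster = LancasterStemmer()

for w in words:
    print(f"{w:12} lemma={wnl.lemmatize(w):12} porter={porter.stem(w):12} "
          f"snowball={snowball.stem(w):12} lancaster={lancaster.stem(w)}")
```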
Lemmatization vs Stemming
To quote my Master's thesis:
> We lemmatize all the words to reduce their inflectional forms. English words usually have more than one form with the same semantic meaning, for example, *car* and *cars*. Reducing these forms to their base forms helps us in building the keyword graph and in the community mining process later. Both stemming and lemmatization can achieve this goal. Many researchers use stemming because it is easy to do. Stemming methods usually just chop off the ends of words according to a set of crude heuristics. Lemmatization, on the other hand, is more principled. It uses dictionaries and morphological information, aiming to remove only the inflectional endings rather than chopping a large part off the word [50]. For example, the word *large* is stemmed to *larg* by the famous Porter Stemmer but is kept intact by the WordNet Lemmatizer. Therefore, in this work we use the WordNet Lemmatizer provided with the Natural Language Toolkit (NLTK) to get the lemma for each word. A lemma is usually the base form of a word, just as the word would appear in a dictionary. We also store the original words along with their lemmas. At the post-processing stage discussed in Section 4.5.7, we transform the lemmas back to their most frequent original form so that they make more sense to the users [72].
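The *large*/*larg* example from the quote is easy to reproduce; a small illustration with NLTK's default settings:

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

porter = PorterStemmer()
wnl = WordNetLemmatizer()

print(porter.stem("large"))    # 'larg'  -- the Porter Stemmer chops the ending
print(wnl.lemmatize("large"))  # 'large' -- the lemmatizer leaves the word intact
print(porter.stem("cars"))     # 'car'
print(wnl.lemmatize("cars"))   # 'car'   -- only the inflectional ending is removed
```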