NLTK (Natural Language Toolkit) stemming
Stemming is an attempt to reduce a word to its stem or root form. Search engines usually treat words with the same stem as synonyms. Thus, the key terms of a query or document are represented by stems rather than by the original words. This reduces the dictionary size.
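To make the search-engine idea concrete, here is a minimal sketch of an index keyed by stems rather than by the original words. The document ids, the sample texts, and the helper names `build_index` and `search` are made up for illustration; only `PorterStemmer` comes from NLTK.

```python
from collections import defaultdict

from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical mini-corpus: ids and texts are illustrative only.
docs = {
    "d1": "fishing cats",
    "d2": "fished fish",
    "d3": "arguing arguments",
}


def build_index(documents):
    """Map each stem to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.split():  # plain whitespace split keeps the sketch simple
            index[stemmer.stem(word)].add(doc_id)
    return index


def search(index, query_word):
    """Look up a single query word by its stem."""
    return index.get(stemmer.stem(query_word), set())


index = build_index(docs)
print(sorted(search(index, "fish")))    # documents containing fish, fishing, or fished
print(sorted(search(index, "argues")))  # matches "arguing" through the shared stem
print(len(index), "stems for", sum(len(t.split()) for t in docs.values()), "words")
```

Because several surface forms collapse onto one key, the index holds fewer entries than there are distinct words, which is the dictionary-size reduction described above.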
NLTK provides interfaces to several well-known stemmers, such as the Porter, Lancaster, and Snowball stemmers.
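As a rough sketch of how these interfaces are used, the three stemmer classes that appear in this tutorial can be created side by side and called through the same `stem()` method. The dictionary name `stemmers` and the test word are just illustrative choices.

```python
from nltk.stem import SnowballStemmer
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem.porter import PorterStemmer

# Each stemmer is created once and reused; all of them expose stem(word).
stemmers = {
    "porter": PorterStemmer(),
    "lancaster": LancasterStemmer(),
    "snowball": SnowballStemmer("english"),  # Snowball needs a language name
}

word = "stemming"
for name, stemmer in stemmers.items():
    print(name, "->", stemmer.stem(word))

# Languages supported by the Snowball family of stemmers
print(SnowballStemmer.languages)
```

Since the call signature is the same, switching algorithms is a one-line change, which is how the tutorial's script swaps stemmers.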
We'll use the following short text as the input for stemming:
cats catlike catty cat stemmer stemming stemmed stem fishing fished fisher fish argue argued argues arguing argus argu argument arguments argument
Here is a description from Wikipedia of how a stemmer is expected to behave on the words in the sample above:
A stemmer for English, for example, should identify the string "cats" (and possibly "catlike", "catty" etc.) as based on the root "cat", and "stemmer", "stemming", "stemmed" as based on "stem". A stemming algorithm reduces the words "fishing", "fished", and "fisher" to the root word, "fish". On the other hand, "argue", "argued", "argues", "arguing", and "argus" reduce to the stem "argu" (illustrating the case where the stem is not itself a word or root) but "argument" and "arguments" reduce to the stem "argument".
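One way to see how a real stemmer relates to this description is to encode the described stems as a lookup table and compare them with what NLTK's `PorterStemmer` returns. This is only a sketch: the `described` table is my reading of the paragraph above (it treats the "possibly" cases as if they were firm), and the variable names are illustrative.

```python
from nltk.stem.porter import PorterStemmer

# Stems suggested by the description above. "catlike" and "catty" are only
# "possibly" reduced to "cat", so a mismatch there is not really an error.
described = {
    "cats": "cat", "catlike": "cat", "catty": "cat",
    "stemmer": "stem", "stemming": "stem", "stemmed": "stem",
    "fishing": "fish", "fished": "fish", "fisher": "fish",
    "argue": "argu", "argued": "argu", "argues": "argu",
    "arguing": "argu", "argus": "argu",
    "argument": "argument", "arguments": "argument",
}

stemmer = PorterStemmer()

# Keep only the words where the Porter stem differs from the described stem.
mismatches = {
    word: stemmer.stem(word)
    for word, expected in described.items()
    if stemmer.stem(word) != expected
}

for word, actual in mismatches.items():
    print(f"{word}: described {described[word]!r}, Porter gives {actual!r}")
```

Going by the Porter output listed later in this tutorial, the words that differ are "catlike", "catty", "stemmer", and "fisher". A rule-based stemmer follows its suffix rules rather than an idealized notion of the root.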
The NLTK stemming code looks like this:
```python
import nltk
from nltk.stem.porter import PorterStemmer
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem import SnowballStemmer


def get_tokens():
    # Read the sample text and split it into word tokens.
    # word_tokenize needs NLTK's tokenizer data to be installed.
    with open('/home/k/TEST/NLTK/stem_sample.txt') as stem:
        tokens = nltk.word_tokenize(stem.read())
    return tokens


def do_stemming(filtered):
    # Create the stemmer once; uncomment another line to switch algorithms.
    stemmer = PorterStemmer()
    # stemmer = LancasterStemmer()
    # stemmer = SnowballStemmer('english')
    stemmed = []
    for f in filtered:
        stemmed.append(stemmer.stem(f))
    return stemmed


if __name__ == "__main__":
    tokens = get_tokens()
    print("tokens = %s" % (tokens,))

    stemmed_tokens = do_stemming(tokens)
    print("stemmed_tokens = %s" % (stemmed_tokens,))

    # Map each original token to its stem (duplicate tokens collapse).
    result = dict(zip(tokens, stemmed_tokens))
    print("{tokens:stemmed} = %s" % (result,))
```
Output:
tokens = ['cats', 'catlike', 'catty', 'cat', 'stemmer', 'stemming', 'stemmed', 'stem', 'fishing', 'fished', 'fisher', 'fish', 'argue', 'argued', 'argues', 'arguing', 'argus', 'argu', 'argument', 'arguments', 'argument']
stemmed_tokens = ['cat', 'catlik', 'catti', 'cat', 'stemmer', 'stem', 'stem', 'stem', 'fish', 'fish', 'fisher', 'fish', 'argu', 'argu', 'argu', 'argu', 'argu', 'argu', 'argument', 'argument', 'argument']
{tokens:stemmed} = {'stemmed': 'stem', 'argu': 'argu', 'argue': 'argu', 'fished': 'fish', 'arguing': 'argu', 'catlike': 'catlik', 'argues': 'argu', 'catty': 'catti', 'argus': 'argu', 'cat': 'cat', 'cats': 'cat', 'stemming': 'stem', 'fishing': 'fish', 'stemmer': 'stemmer', 'argued': 'argu', 'fisher': 'fisher', 'argument': 'argument', 'stem': 'stem', 'fish': 'fish', 'arguments': 'argument'}
The results differ slightly depending on the stemming algorithm. Here are the outputs for each stemmer:
- PorterStemmer:
{tokens:stemmed} = {'stemmed': 'stem', 'argu': 'argu', 'argue': 'argu', 'fished': 'fish', 'arguing': 'argu', 'catlike': 'catlik', 'argues': 'argu', 'catty': 'catti', 'argus': 'argu', 'cat': 'cat', 'cats': 'cat', 'stemming': 'stem', 'fishing': 'fish', 'stemmer': 'stemmer', 'argued': 'argu', 'fisher': 'fisher', 'argument': 'argument', 'stem': 'stem', 'fish': 'fish', 'arguments': 'argument'}
- LancasterStemmer:
{tokens:stemmed} = {'stemmed': 'stem', 'argu': 'argu', 'argue': 'argu', 'fished': 'fish', 'arguing': 'argu', 'catlike': 'catlik', 'argues': 'argu', 'catty': 'catty', 'argus': 'arg', 'cat': 'cat', 'cats': 'cat', 'stemming': 'stem', 'fishing': 'fish', 'stemmer': 'stem', 'argued': 'argu', 'fisher': 'fish', 'argument': 'argu', 'stem': 'stem', 'fish': 'fish', 'arguments': 'argu'}
- SnowballStemmer:
{tokens:stemmed} = {'stemmed': u'stem', 'argu': u'argu', 'argue': u'argu', 'fished': u'fish', 'arguing': u'argu', 'catlike': u'catlik', 'argues': u'argu', 'catty': u'catti', 'argus': u'argus', 'cat': u'cat', 'cats': u'cat', 'stemming': u'stem', 'fishing': u'fish', 'stemmer': u'stemmer', 'argued': u'argu', 'fisher': u'fisher', 'argument': u'argument', 'stem': u'stem', 'fish': u'fish', 'arguments': u'argument'}
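The three dictionaries are hard to compare by eye, so the following sketch runs all three stemmers on the sample text and prints only the words on which they disagree. The inline `text` string stands in for the sample file, and the names `stemmers` and `disagreements` are illustrative.

```python
from nltk.stem import SnowballStemmer
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem.porter import PorterStemmer

# Sample text from this tutorial, inlined so no input file is needed.
text = ("cats catlike catty cat stemmer stemming stemmed stem "
        "fishing fished fisher fish argue argued argues arguing "
        "argus argu argument arguments argument")

# Order matters: it defines the columns printed below.
stemmers = [PorterStemmer(), LancasterStemmer(), SnowballStemmer("english")]

disagreements = {}
for word in dict.fromkeys(text.split()):  # unique words, original order
    stems = tuple(s.stem(word) for s in stemmers)
    if len(set(stems)) > 1:  # at least two stemmers returned different stems
        disagreements[word] = stems

print(f"{'word':<12}{'porter':<12}{'lancaster':<12}{'snowball':<12}")
for word, stems in disagreements.items():
    print(f"{word:<12}" + "".join(f"{s:<12}" for s in stems))
```

In the outputs listed above, Lancaster is the most aggressive of the three (for example, it reduces "argument" to "argu" and "fisher" to "fish"), while Porter and Snowball differ only on "argus".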