NLTK (Natural Language Toolkit) stemming

Stemming

Stemming reduces a word to its stem or root form. Search engines usually treat words with the same stem as synonyms, so the key terms of a query or document are represented by stems rather than by the original words. Because several word forms collapse into one stem, this also reduces the size of the term dictionary.
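To make the dictionary-size point concrete, here is a minimal sketch that groups words by stem. The `naive_stem` function is a hypothetical toy that only strips a few suffixes; it is not the Porter algorithm or any NLTK stemmer, and `group_by_stem` is likewise a name made up for this illustration.

```python
from collections import defaultdict

def naive_stem(word):
    """Toy stemmer (hypothetical): strip a few common English suffixes."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words are left alone
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def group_by_stem(words, stem=naive_stem):
    """Map each stem to the sorted list of distinct words that reduce to it."""
    groups = defaultdict(set)
    for word in words:
        groups[stem(word.lower())].add(word.lower())
    return {s: sorted(ws) for s, ws in groups.items()}

words = ["fishing", "fished", "fish", "cats", "cat", "stemming", "stemmed"]
index = group_by_stem(words)

# Compare the number of distinct words with the number of stems
print(len(set(words)), "distinct words ->", len(index), "stems")
for s, ws in sorted(index.items()):
    print(s, "<-", ws)
```

Any callable that maps a word to a stem can be passed as `stem`, so a real stemmer such as the `stem` method of an NLTK `PorterStemmer` instance could replace the toy function.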

NLTK provides interfaces to several well-known stemmers, such as the Porter, Lancaster, and Snowball stemmers used in the code on this page.

Stemming sample

We'll use the following short text as our input for stemming:

cats catlike catty cat 
stemmer stemming stemmed stem 
fishing fished fisher fish 
argue argued argues arguing argus argu 
argument arguments argument

Here is a description from Wikipedia of how a stemmer should behave for the words in the sample above:

A stemmer for English, for example, should identify the string "cats" (and possibly "catlike", "catty" etc.) as based on the root "cat", and "stemmer", "stemming", "stemmed" as based on "stem". A stemming algorithm reduces the words "fishing", "fished", and "fisher" to the root word, "fish". On the other hand, "argue", "argued", "argues", "arguing", and "argus" reduce to the stem "argu" (illustrating the case where the stem is not itself a word or root) but "argument" and "arguments" reduce to the stem "argument".
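The description above can be turned into a small check. The sketch below transcribes the quoted expectations into a dictionary and reports where a given stemmer disagrees. The names `EXPECTED` and `mismatches` are hypothetical helpers for this illustration; `stem` is assumed to be any callable that takes a word and returns its stem.

```python
# Expected stems, transcribed from the description above
EXPECTED = {
    "cats": "cat",
    "stemmer": "stem", "stemming": "stem", "stemmed": "stem",
    "fishing": "fish", "fished": "fish", "fisher": "fish",
    "argue": "argu", "argued": "argu", "argues": "argu",
    "arguing": "argu", "argus": "argu",
    "argument": "argument", "arguments": "argument",
}

def mismatches(stem, expected=EXPECTED):
    """Return {word: (expected_stem, actual_stem)} where `stem` disagrees."""
    result = {}
    for word, want in expected.items():
        got = stem(word)
        if got != want:
            result[word] = (want, got)
    return result
```

With NLTK installed, `mismatches(PorterStemmer().stem)` would list the words where the Porter stemmer departs from this description. Judging by the Porter output recorded further down this page, `stemmer` and `fisher` are likely candidates, though results can vary between NLTK versions.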

Stemming Code

The NLTK stemming code looks like this:

import nltk
from nltk.stem.porter import PorterStemmer
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem import SnowballStemmer 

def get_tokens():
	with open('/home/k/TEST/NLTK/stem_sample.txt') as stem:
		tokens = nltk.word_tokenize(stem.read())
	return tokens

def do_stemming(filtered):
	# Create the stemmer once and reuse it for every token.
	# Swap in one of the commented-out lines to try another algorithm.
	stemmer = PorterStemmer()
	#stemmer = LancasterStemmer()
	#stemmer = SnowballStemmer('english')
	return [stemmer.stem(f) for f in filtered]

if __name__ == "__main__":

	tokens = get_tokens()
	# Format inside the call so this works on both Python 2 and Python 3
	print("tokens = %s" % (tokens,))

	stemmed_tokens = do_stemming(tokens)
	print("stemmed_tokens = %s" % (stemmed_tokens,))

	# Map each original token to its stem
	result = dict(zip(tokens, stemmed_tokens))
	print("{tokens:stemmed} = %s" % (result,))

Output (recorded under Python 2, so the dictionary keys appear in arbitrary order):

tokens = ['cats', 'catlike', 'catty', 'cat', 'stemmer', 'stemming', 'stemmed', 'stem', 'fishing', 'fished', 'fisher', 'fish', 'argue', 'argued', 'argues', 'arguing', 'argus', 'argu', 'argument', 'arguments', 'argument']
stemmed_tokens = ['cat', 'catlik', 'catti', 'cat', 'stemmer', 'stem', 'stem', 'stem', 'fish', 'fish', 'fisher', 'fish', 'argu', 'argu', 'argu', 'argu', 'argu', 'argu', 'argument', 'argument', 'argument']
{tokens:stemmed} = {'stemmed': 'stem', 'argu': 'argu', 'argue': 'argu', 'fished': 'fish', 'arguing': 'argu', 'catlike': 'catlik', 'argues': 'argu', 'catty': 'catti', 'argus': 'argu', 'cat': 'cat', 'cats': 'cat', 'stemming': 'stem', 'fishing': 'fish', 'stemmer': 'stemmer', 'argued': 'argu', 'fisher': 'fisher', 'argument': 'argument', 'stem': 'stem', 'fish': 'fish', 'arguments': 'argument'}

Output comparison

The stems differ slightly depending on the stemming algorithm. Here is the output for each stemmer:

PorterStemmer:
{tokens:stemmed} = {'stemmed': 'stem', 'argu': 'argu', 'argue': 'argu', 'fished': 'fish', 'arguing': 'argu', 'catlike': 'catlik', 'argues': 'argu', 'catty': 'catti', 'argus': 'argu', 'cat': 'cat', 'cats': 'cat', 'stemming': 'stem', 'fishing': 'fish', 'stemmer': 'stemmer', 'argued': 'argu', 'fisher': 'fisher', 'argument': 'argument', 'stem': 'stem', 'fish': 'fish', 'arguments': 'argument'}
LancasterStemmer:
{tokens:stemmed} = {'stemmed': 'stem', 'argu': 'argu', 'argue': 'argu', 'fished': 'fish', 'arguing': 'argu', 'catlike': 'catlik', 'argues': 'argu', 'catty': 'catty', 'argus': 'arg', 'cat': 'cat', 'cats': 'cat', 'stemming': 'stem', 'fishing': 'fish', 'stemmer': 'stem', 'argued': 'argu', 'fisher': 'fish', 'argument': 'argu', 'stem': 'stem', 'fish': 'fish', 'arguments': 'argu'}
SnowballStemmer:
{tokens:stemmed} = {'stemmed': u'stem', 'argu': u'argu', 'argue': u'argu', 'fished': u'fish', 'arguing': u'argu', 'catlike': u'catlik', 'argues': u'argu', 'catty': u'catti', 'argus': u'argus', 'cat': u'cat', 'cats': u'cat', 'stemming': u'stem', 'fishing': u'fish', 'stemmer': u'stemmer', 'argued': u'argu', 'fisher': u'fisher', 'argument': u'argument', 'stem': u'stem', 'fish': u'fish', 'arguments': u'argument'}
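The three dictionaries above are hard to compare by eye, so here is a minimal sketch that lists only the words on which the stemmers disagree. The helper names `disagreements`, `print_disagreements`, and `nltk_stemmers` are hypothetical and not part of NLTK. The only NLTK calls assumed are the three stemmer classes and their `stem` method, which the code earlier on this page already uses.

```python
def disagreements(words, stemmers):
    """Return (word, {label: stem}) rows for words the stemmers disagree on.

    `stemmers` maps a label to any callable taking a word and returning a stem.
    """
    rows = []
    for word in dict.fromkeys(words):  # drop duplicate words
        stems = {name: stem(word) for name, stem in stemmers.items()}
        if len(set(stems.values())) > 1:
            rows.append((word, stems))
    return rows

def print_disagreements(words, stemmers):
    """Print one row per disagreeing word, one column per stemmer."""
    names = list(stemmers)
    print("%-12s" % "word" + "".join("%-12s" % n for n in names))
    for word, stems in disagreements(words, stemmers):
        print("%-12s" % word + "".join("%-12s" % stems[n] for n in names))

def nltk_stemmers():
    """Build the three stemmers used on this page (requires NLTK)."""
    # Imported lazily so the helpers above work without NLTK installed
    from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer
    return {
        "Porter": PorterStemmer().stem,
        "Lancaster": LancasterStemmer().stem,
        "Snowball": SnowballStemmer("english").stem,
    }
```

With NLTK installed, calling `print_disagreements(tokens, nltk_stemmers())` on the token list from the stemming code should show rows such as `argus` and `fisher`, where the recorded outputs above differ. The exact stems may vary between NLTK versions.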