tf-idf with scikit-learn
NLTK does not come with a tf-idf implementation, so we're going to use scikit-learn: scikit-learn has a built-in tf-idf implementation (TfidfVectorizer), while we still use NLTK's tokenizer and stemmer to preprocess the text.
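Before diving in, it may help to see what scikit-learn actually computes. With its default settings (smooth_idf=True, norm='l2'), TfidfVectorizer weights each term with idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing t, and then L2-normalizes each document vector. Here is a minimal sketch on a made-up two-document corpus:

import math
from sklearn.feature_extraction.text import TfidfVectorizer

# toy corpus, made up purely for illustration
docs = ["the pearl is great", "the pearl is lonely"]

vectorizer = TfidfVectorizer()
tfs = vectorizer.fit_transform(docs)

# reproduce the tf-idf of 'great' in doc 0 by hand
n = len(docs)                            # number of documents: 2
df = 1                                   # 'great' appears in 1 document
idf = math.log((1 + n) / (1 + df)) + 1   # smooth idf, sklearn's default
tf = 1                                   # raw count of 'great' in doc 0

# before normalization each entry is tf * idf; sklearn then
# L2-normalizes the row, so the two printed values differ by that factor
col = vectorizer.vocabulary_['great']
print(tfs[0, col])     # L2-normalized tf-idf from sklearn
print(tf * idf)        # unnormalized value computed by hand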
Here is the code, not much changed from the original: Document Similarity using NLTK and Scikit-Learn. The input files are chapters 1-6 of Steinbeck's The Pearl.
import nltk
import string
import os

from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem.porter import PorterStemmer

path = './tf-idf'
token_dict = {}
stemmer = PorterStemmer()

def tokenize(text):
    # tokenize with NLTK, then stem each token with the Porter stemmer
    tokens = nltk.word_tokenize(text)
    stems = [stemmer.stem(item) for item in tokens]
    return stems

# read every file under ./tf-idf, lowercase it, and strip punctuation
for dirpath, dirs, files in os.walk(path):
    for f in files:
        fname = os.path.join(dirpath, f)
        print("fname=", fname)
        with open(fname) as pearl:
            text = pearl.read()
        token_dict[f] = text.lower().translate(str.maketrans('', '', string.punctuation))

# build the vocabulary and compute tf-idf for every term in the corpus
tfidf = TfidfVectorizer(tokenizer=tokenize, stop_words='english')
tfs = tfidf.fit_transform(token_dict.values())

# score a new sentence against the fitted model
query = 'all great and precious things are lonely.'
response = tfidf.transform([query])
print(response)

feature_names = tfidf.get_feature_names_out()  # get_feature_names() in older scikit-learn
for col in response.nonzero()[1]:
    print(feature_names[col], ' - ', response[0, col])
Output:
fname= ./tf-idf/Steinbeck/Pearl3.txt
fname= ./tf-idf/Steinbeck/Pearl5.txt
fname= ./tf-idf/Steinbeck/Pearl2.txt
fname= ./tf-idf/Steinbeck/Pearl4.txt
fname= ./tf-idf/Steinbeck/Pearl1.txt
fname= ./tf-idf/Steinbeck/Pearl6.txt
  (0, 2024)	0.375957358004
  (0, 1143)	0.846942813846
  (0, 851)	0.375957358004
thing  -  0.375957358004
lone  -  0.846942813846
great  -  0.375957358004
Scikit-learn's sklearn.feature_extraction module can be used to extract features, in a format supported by machine learning algorithms, from datasets consisting of formats such as text and images.
As we can see from the output, we iterate over the files in the Steinbeck collection, convert the text to lowercase, and remove punctuation. Then we initialize TfidfVectorizer(). Note that we pass the TfidfVectorizer our own function that performs custom tokenization and stemming:
TfidfVectorizer(tokenizer=tokenize, stop_words='english').
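Because of the stemming step, terms are stored in stemmed form, which is why the output above lists 'lone' and 'thing' rather than 'lonely' and 'things'. As a quick sanity check, you can call the tokenize() function defined above directly; it should print something like the comment below:

print(tokenize('all great and precious things are lonely.'))
# ['all', 'great', 'and', 'preciou', 'thing', 'are', 'lone', '.']
# the Porter stemmer maps 'things' -> 'thing' and 'lonely' -> 'lone';
# nltk.word_tokenize also keeps the trailing '.' as its own token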
Note, however, that we use scikit-learn's built-in stop-word removal rather than NLTK's. Then we call fit_transform(), which does a few things: first, it builds a dictionary of 'known' words (the vocabulary) from the input text given to it; then it calculates the tf-idf weight for each term found in each document.
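Both pieces are exposed as attributes on the fitted vectorizer, so you can inspect them directly. A small sketch, reusing the tfidf object fitted above ('pearl' is just an example key; any stem that occurs in the corpus works):

import numpy as np

# the 'dictionary of known words': maps each term to its column index
print(len(tfidf.vocabulary_))       # vocabulary size
print(tfidf.vocabulary_['pearl'])   # column index assigned to the stem 'pearl'

# idf_ holds the learned inverse-document-frequency weight per term;
# lower values mean the term appears in more of the six chapters
feature_names = tfidf.get_feature_names_out()
for i in np.argsort(tfidf.idf_)[:5]:
    print(feature_names[i], tfidf.idf_[i])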