Machine Learning (Natural Language Processing - NLP) : Sentiment Analysis III
In my previous article (Machine Learning (Natural Language Processing - NLP) : Sentiment Analysis II), we learned about tokenization with stemming and stop-word removal.
In this article, we are going to train a logistic regression model for document classification.
Natural Language Processing (NLP): Sentiment Analysis I (IMDb & bag-of-words)
Natural Language Processing (NLP): Sentiment Analysis II (tokenization, stemming, and stop words)
Natural Language Processing (NLP): Sentiment Analysis III (training & cross validation)
Natural Language Processing (NLP): Sentiment Analysis IV (out-of-core)
We're now almost ready to classify the movie reviews into positive and negative reviews.
First of all, we want to divide the DataFrame data we cleaned up in the previous articles into 25,000 documents for training and 25,000 documents for testing:
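A minimal sketch of the split, assuming the cleaned-up reviews from the previous articles were saved to movie_data.csv with 'review' and 'sentiment' columns:

```python
import pandas as pd

# movie_data.csv is assumed to hold the cleaned-up reviews from the previous articles
df = pd.read_csv('movie_data.csv')

# first 25,000 documents for training, the remaining 25,000 for testing
X_train = df.loc[:24999, 'review'].values
y_train = df.loc[:24999, 'sentiment'].values
X_test = df.loc[25000:, 'review'].values
y_test = df.loc[25000:, 'sentiment'].values
```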
Next, we will use a GridSearchCV object with 5-fold stratified cross-validation to find the optimal set of parameters for our logistic regression model:
The sklearn.model_selection.GridSearchCV performs an exhaustive search over specified parameter values for an estimator. It implements a "fit" and a "score" method, which we are going to use once the grid search finishes.
It also implements "predict", "predict_proba", "decision_function", "transform" and "inverse_transform" if they are implemented in the estimator used.
The parameters of the estimator used to apply these methods are optimized by cross-validated grid-search over a parameter grid.
Here is the full code:
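The original listing isn't reproduced here; the following is a sketch of the pipeline and grid search, assuming X_train and y_train from the split above and the tokenizer, tokenizer_porter, and NLTK stop-word list from the previous article:

```python
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

stop = stopwords.words('english')
porter = PorterStemmer()

def tokenizer(text):
    return text.split()

def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

tfidf = TfidfVectorizer(strip_accents=None,
                        lowercase=False,
                        preprocessor=None)

# first dictionary: default tf-idf settings; second: raw term frequencies
param_grid = [{'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              {'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'vect__use_idf': [False],
               'vect__smooth_idf': [False],
               'vect__norm': [None],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]}]

# solver='liblinear' is assumed here because it supports both l1 and l2 penalties
lr_tfidf = Pipeline([('vect', tfidf),
                     ('clf', LogisticRegression(random_state=0,
                                                solver='liblinear'))])

gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid,
                           scoring='accuracy',
                           cv=5,
                           verbose=1,
                           n_jobs=-1)

gs_lr_tfidf.fit(X_train, y_train)
```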
In the code, we're using TfidfVectorizer instead of CountVectorizer combined with TfidfTransformer.
The param_grid consisted of two parameter dictionaries:
- For the first dictionary, we used the TfidfVectorizer with its default settings (use_idf=True, smooth_idf=True, and norm='l2') to calculate the tf-idfs.
- For the second dictionary, we set those parameters to use_idf=False, smooth_idf=False, and norm=None in order to train a model based on raw term frequencies.
Regarding the logistic regression classifier itself, we trained models using L2 and L1 regularization via the penalty parameter and compared different regularization strengths by defining a range of values for the inverse-regularization parameter C.
Note that we used an integer value of 5 for cv, which determines the cross-validation splitting strategy. If None is given, 3-fold cross-validation is used by default.
Because the large number of feature vectors and the size of the vocabulary make the grid search computationally quite expensive, we restricted ourselves to a limited number of parameter combinations when initializing the GridSearchCV object and its parameter grid.
Depending on the computer, it may take up to a couple of hours.
Once the grid search has finished, we can print the best parameter set, which gave the best results on the hold-out data:
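A sketch, using the gs_lr_tfidf variable name from the grid-search sketch above:

```python
print('Best parameter set: %s' % gs_lr_tfidf.best_params_)
```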
From the output, we can see that we got the best grid search results using the regular tokenizer, without Porter stemming and without the stop-word library, and tf-idfs in combination with a logistic regression classifier that uses L2 regularization with a regularization strength of C=10.0.
Using the best model from the grid search, we can get the output for the 5-fold cross-validation accuracy scores on the training set and the classification accuracy on the test dataset:
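Again assuming the fitted gs_lr_tfidf object from above, a sketch:

```python
# average 5-fold CV accuracy of the best parameter combination
print('CV Accuracy: %.3f' % gs_lr_tfidf.best_score_)

# accuracy of the best model on the held-out test set
clf = gs_lr_tfidf.best_estimator_
print('Test Accuracy: %.3f' % clf.score(X_test, y_test))
```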
Here, best_estimator_ is the estimator that was chosen by the search, i.e. the estimator which gave the highest score on the left-out data, and best_score_ is the score of best_estimator_ on the left-out data.
The output shows us that our machine learning model can predict whether a movie review is positive or negative with almost 90 percent accuracy.
In the previous section, the best_score_ attribute returned the average score over the 5 folds of the best model since we used cv=5 for GridSearchCV().
In this section, we'll illustrate how cross-validation works via a simple dataset of random integers that represent our class labels. We'll compare GridSearchCV() with StratifiedKFold().
Here is the code to generate the simple dataset:
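The original snippet isn't shown here; the sketch below builds a comparable dataset: 25 random binary class labels and a single feature loosely correlated with them.

```python
import numpy as np

np.random.seed(0)

# 25 random binary class labels
y = np.array([np.random.randint(0, 2) for _ in range(25)])

# a single feature column, loosely correlated with the labels
X = (y + np.random.randn(25)).reshape(-1, 1)
```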
Now we're going to make a cross-validation object that returns stratified folds, a variation of KFold, using sklearn.model_selection.StratifiedKFold(), which is defined as follows:
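In recent scikit-learn versions, the constructor looks roughly like this (older releases default to n_splits=3):

```python
from sklearn.model_selection import StratifiedKFold

StratifiedKFold(n_splits=5, shuffle=False, random_state=None)
```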
Let's run it on our sample data:
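A sketch of the run; the pre-generated list of fold indices is stored in a variable named cv3, which is referred to again below:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

skfold = StratifiedKFold(n_splits=5, shuffle=False)

# pre-generate the (train_index, test_index) pairs for the 5 folds
cv3 = list(skfold.split(X, y))

for k, (train_idx, test_idx) in enumerate(cv3):
    print('Fold %d, test indices: %s' % (k + 1, test_idx))
```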
It generates indices to split the data into training and test sets. Note that we call the split(X, y) method on the StratifiedKFold object to get the fold indices.
To evaluate a score by cross-validation, we'll use sklearn.model_selection.cross_val_score() which is defined as the following:
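Its cv argument accepts an integer, a cross-validation splitter, or an iterable of (train, test) index pairs, so we can feed it the pre-generated cv3 indices directly. A sketch of the call:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# cv3 is the list of fold indices generated above
scores = cross_val_score(LogisticRegression(random_state=123), X, y, cv=cv3)
print(scores)
```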
We get the following output if we run it:
We fed the indices of the 5 cross-validation folds (cv3) to the cross_val_score scorer. It returned 5 accuracy scores for the 5 test folds.
Now, we'll use GridSearchCV for an exhaustive search over specified parameter values for a LogisticRegression estimator and feed it the same 5 cross-validation sets via the pre-generated cv3 indices:
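A sketch: an empty param_grid means only the estimator's default parameters are evaluated, and verbose=3 prints the score of each fold, so the per-fold scores can be compared directly with the cross_val_score run above.

```python
from sklearn.model_selection import GridSearchCV

gs = GridSearchCV(estimator=LogisticRegression(random_state=123),
                  param_grid={},   # defaults only, nothing to search over
                  cv=cv3,          # the same pre-generated fold indices
                  verbose=3)
gs = gs.fit(X, y)
```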
From the output we can see the scores for the 5 folds are exactly the same as the ones from cross_val_score via StratifiedKFold().
How about the score? Will it be the same?
The best_score_ attribute of the GridSearchCV object is available only after fitting; it returns the average accuracy score of the best model:
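```python
print(gs.best_score_)
```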
Let's compare it with the average score computed with cross_val_score:
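Averaging the fold scores from cross_val_score should give the same number:

```python
import numpy as np
from sklearn.model_selection import cross_val_score

print(np.mean(cross_val_score(LogisticRegression(random_state=123),
                              X, y, cv=cv3)))
```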
As we can see, the results are indeed consistent!
The GitHub Jupyter notebook is available from Sentiment Analysis.
"Python Machine Learning" by Sebastian Raschka