25. Text Classification with FastText#
from google.colab import drive
drive.mount('/content/drive') # Add My Drive/<>
import os
os.chdir('drive/My Drive')
os.chdir('Books_Writings/NLPBook/')
Mounted at /content/drive
%%capture
%pylab inline
import pandas as pd
import os
%load_ext rpy2.ipython
25.1. Use FastText from Facebook for classification of movie reviews#
https://fasttext.cc/docs/en/supervised-tutorial.html
Use GluonNLP: https://gluon-nlp.mxnet.io/model_zoo/text_classification/index.html
PyPI: https://pypi.org/project/fasttext/
See Malafosse (2019), FastText sentiment analysis for tweets: A straightforward guide, for a fun example.
See also: https://autogluon.mxnet.io/tutorials/text_prediction/beginner.html
Here we will revisit the movie review dataset.
FastText's supervised mode expects each line of the input file in the format __label__<labelname> followed by the text. For a binary classifier, the labels are __label__0 and __label__1. A line may carry as many labels as needed (for multi-label classification), as in the hypothetical example below.
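For instance, two lines of a training file (the review text here is made up for illustration) would look like:

__label__1 a wonderful, heartfelt film with a terrific cast
__label__0 two hours of my life I will never get back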
# Let's take a look at the movie database structure
movie_review = pd.read_csv('NLP_data/movie_review.csv')
# Convert df into the format required for fasttext
movie_review.sentiment = ["__label__" + str(s) for s in movie_review.sentiment]  # prefix each label
movie_review = movie_review.drop("id", axis=1)
print(movie_review.shape)
movie_review.head()
(5000, 2)
|  | sentiment | review |
|---|---|---|
| 0 | __label__0 | Homelessness (or Houselessness as George Carli... |
| 1 | __label__1 | This film lacked something I couldn't put my f... |
| 2 | __label__1 | \"It appears that many critics find the idea o... |
| 3 | __label__0 | This isn't the comedic Robin Williams, nor is ... |
| 4 | __label__1 | I don't know who to blame, the timid writers o... |
import string

def cleanText(text):
    # 'text' is a pandas Series of review strings
    # Replace ASCII punctuation with spaces
    for c in string.punctuation:
        text = text.str.replace(c, " ", regex=False)
    # Drop typographic quotes; replace em-dashes with spaces
    for c in '“”’':
        text = text.str.replace(c, "", regex=False)
    text = text.str.replace('—', ' ', regex=False)
    # Remove numbers
    for c in range(10):
        text = text.str.replace(str(c), " ", regex=False)
    text = text.str.lower()
    text = stopText(text)  # remove stopwords
    text = stemText(text)  # stem words
    text = [j.strip() for j in text]
    return text
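The stopText and stemText helpers are defined in an earlier chapter. If you are running this chapter standalone, minimal stand-ins (a sketch using NLTK, not the book's exact versions) could be:

import nltk
nltk.download('stopwords', quiet=True)
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

stops = set(stopwords.words('english'))
stemmer = PorterStemmer()

def stopText(texts):
    # Drop common English stopwords from each document
    return [" ".join(w for w in doc.split() if w not in stops) for doc in texts]

def stemText(texts):
    # Reduce each remaining word to its stem
    return [" ".join(stemmer.stem(w) for w in doc.split()) for doc in texts]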
%%time
# Run it with this cleanup and without to see the difference
# movie_review.review = cleanText(movie_review.review)
CPU times: user 3 µs, sys: 1e+03 ns, total: 4 µs
Wall time: 7.15 µs
# Note: .loc slicing includes the endpoint, which would put row 4000 in both
# files; use .iloc for a clean 4000/1000 split
tmp = movie_review.iloc[:4000]
tmp.to_csv('NLP_data/movie_review_train.txt', sep=" ", header=False, index=False)
tmp = movie_review.iloc[4000:]
tmp.to_csv('NLP_data/movie_review_test.txt', sep=" ", header=False, index=False)
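As a quick sanity check, the first line of the training file should start with a __label__ tag followed by the (quoted) review text:

with open('NLP_data/movie_review_train.txt') as f:
    print(f.readline()[:80])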
!pip install fasttext --quiet
# !conda install -c conda-forge fasttext -y
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 73.4/73.4 kB 3.2 MB/s eta 0:00:00
Installing build dependencies ... done
Getting requirements to build wheel ... done
Preparing metadata (pyproject.toml) ... done
Building wheel for fasttext (pyproject.toml) ... done
%%time
import fasttext
model = fasttext.train_supervised('NLP_data/movie_review_train.txt', epoch=20) # Choose epochs to manage overfitting
CPU times: user 8.94 s, sys: 198 ms, total: 9.14 s
Wall time: 13.9 s
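train_supervised exposes several other hyperparameters worth experimenting with; a sketch with a few common knobs is below (the values are illustrative, not tuned):

# model = fasttext.train_supervised(
#     'NLP_data/movie_review_train.txt',
#     epoch=20,        # passes over the training data
#     lr=0.5,          # learning rate
#     wordNgrams=2,    # use bigrams as well as unigrams
#     dim=100,         # embedding dimension
#     loss='softmax')  # loss function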
print(model.labels)
['__label__0', '__label__1']
# Take a look at the vocabulary
print(len(model.words))
print(model.words[:100])
90603
['the', 'a', 'and', 'of', 'to', 'is', 'in', 'that', 'I', 'this', 'it', '/><br', 'was', 'as', 'with', 'for', 'but', 'The', 'on', 'movie', 'are', 'film', 'his', 'have', 'not', 'be', 'you', '</s>', 'by', 'he', 'an', 'at', 'one', 'from', 'who', 'like', 'all', 'they', 'her', 'or', 'about', 'has', 'so', 'just', 'some', 'out', 'very', 'more', 'would', 'if', 'when', 'their', 'had', 'good', 'what', 'only', 'really', 'up', 'It', "it's", 'can', 'she', 'which', 'were', 'my', 'even', 'no', 'see', 'than', 'there', 'into', 'been', '-', 'because', 'much', 'will', 'get', 'This', 'story', 'most', 'time', 'could', 'other', 'how', 'me', 'people', 'its', 'make', 'any', 'we', 'first', 'do', 'great', 'also', '/>The', 'made', 'think', "don't", 'him', 'being']
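Notice that the vocabulary still contains HTML remnants such as '/><br' and '/>The', left over from <br /> tags in the raw reviews. A quick pre-clean (a sketch; it would be run before writing the train/test files) strips tags with a regular expression:

import re
# Replace HTML tags such as <br /> with spaces before further processing
movie_review.review = [re.sub(r'<[^>]+>', ' ', r) for r in movie_review.review]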
train = pd.read_csv("NLP_data/movie_review_train.txt", sep = " ", header=None)
test = pd.read_csv("NLP_data/movie_review_test.txt", sep = " ", header=None)
train.columns = ['sentiment','review']
test.columns = ['sentiment','review']
train.head()
|  | sentiment | review |
|---|---|---|
| 0 | __label__0 | Homelessness (or Houselessness as George Carli... |
| 1 | __label__1 | This film lacked something I couldn't put my f... |
| 2 | __label__1 | \"It appears that many critics find the idea o... |
| 3 | __label__0 | This isn't the comedic Robin Williams, nor is ... |
| 4 | __label__1 | I don't know who to blame, the timid writers o... |
model.predict("The good the bad and the ugly is an awesome movie")
(('__label__1',), array([0.99988425]))
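The predict method can also return scores for more than one label; with k=2 it reports probabilities for both classes:

model.predict("The good the bad and the ugly is an awesome movie", k=2)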
k = randint(len(test)); print(k)
print(test.iloc[k, 0])               # true label
print(test.iloc[k, 1])               # review text
res = model.predict(test.iloc[k, 1])
print(res[0][0])                     # predicted label
567
__label__1
If you hate redneck accents, you'll hate this movie. And to make it worse, you see Patrick Swayze, a has been trying to be a redneck. I really can't stand redneck accents. I like Billy Bob Thornton, he was good in Slingblade, but he was annoying in this movie. And what kind of name is Lonnie Earl? How much more hickish can this movie get? The storyline was stupid. I'm usually not this judgemental of movies, but I couldn't stand this movie. If you want a good Billy Bob Thornton movie, go see Slingblade.<br /><br />My mom found this movie for $5.95 at Wal Mart...figures...I think I'll wrap it up and give it to my Grandma for Christmas. It could just be that I can't stand redneck accents usually, or that I can't stand Patrick Swayze. Maybe if Patrick Swayze wasn't in it. I didn't laugh once in the movie. I laugh at anything stupid usually. If they had shown someones fingers getting smashed, I might have laughed. people's fingers getting smashed by accident always makes me laugh.
__label__1
# Train dataset
yhat = [model.predict(train.iloc[k, 1])[0][0] for k in range(len(train))]
y0 = list(train.iloc[:,0])
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y0, yhat)
acc = sum(diag(cm))/sum(cm)
print("acc =",acc)
print(cm)
acc = 0.9135216195951013
[[1842 177]
[ 169 1813]]
# Test dataset
yhat = [model.predict(test.iloc[k, 1])[0][0] for k in range(len(test))]
y0 = list(test.iloc[:,0])
cm = confusion_matrix(y0, yhat)
acc = sum(diag(cm))/sum(cm)
print("acc =",acc)
print(cm)
acc = 0.799
[[395 104]
[ 97 404]]
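fasttext also has a built-in evaluator: model.test returns the number of examples, precision@1, and recall@1 on a labeled file, and for single-label binary data precision@1 is the same as accuracy. For example:

n, p, r = model.test('NLP_data/movie_review_test.txt')
print(n, p, r)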
Training accuracy (0.91) is noticeably higher than test accuracy (0.80), which is evidence of overfitting; reducing the number of training epochs, or adding more training data, should narrow the gap. If instead both accuracies were low, the model would be underfitting and more epochs would help.