25. Text Classification with FastText#

from google.colab import drive
drive.mount('/content/drive')  # Add My Drive/<>

import os
os.chdir('drive/My Drive')
os.chdir('Books_Writings/NLPBook/')
Mounted at /content/drive
%%capture
%pylab inline
import pandas as pd
import os
%load_ext rpy2.ipython

25.1. Using FastText from Facebook for classification of movie reviews#

https://fasttext.cc/

https://fasttext.cc/docs/en/supervised-tutorial.html

See also GluonNLP: https://gluon-nlp.mxnet.io/model_zoo/text_classification/index.html

PyPi: https://pypi.org/project/fasttext/

See Malafosse (2019), FastText sentiment analysis for tweets: A straightforward guide, for a fun example.

See also: https://autogluon.mxnet.io/tutorials/text_prediction/beginner.html

Here we will revisit the movie review dataset.

The format for the input file is: __label__<labelname> <text>

Example: labels __label__0 and __label__1 for a binary classifier.

You can put as many labels as needed on one line; fastText supports multi-label training.
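For instance, the first two lines of a training file might look like this (the reviews here are made up for illustration):

__label__1 a heartwarming and beautifully acted film
__label__0 two hours of my life I will never get back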

# Let's take a look at the movie database structure
movie_review = pd.read_csv('NLP_data/movie_review.csv')

# Convert df into the format required for fasttext
movie_review.sentiment = ["__label__" + str(s) for s in movie_review.sentiment]
movie_review = movie_review.drop("id", axis=1)
print(movie_review.shape)
movie_review.head()
(5000, 2)
sentiment review
0 __label__0 Homelessness (or Houselessness as George Carli...
1 __label__1 This film lacked something I couldn't put my f...
2 __label__1 \"It appears that many critics find the idea o...
3 __label__0 This isn't the comedic Robin Williams, nor is ...
4 __label__1 I don't know who to blame, the timid writers o...
import string

def cleanText(text):
    # text: a pandas Series of review strings
    for c in string.punctuation:
        text = text.str.replace(c, " ", regex=False)
    for c in ['“', '”', '’']:
        text = text.str.replace(c, "", regex=False)
    text = text.str.replace('—', ' ', regex=False)
    # Remove digits
    for n in [str(c) for c in range(10)]:
        text = text.str.replace(n, " ", regex=False)
    text = text.str.lower()
    text = stopText(text)   # stopword removal (helper defined earlier in the book)
    text = stemText(text)   # stemming (helper defined earlier in the book)
    text = [j.strip() for j in text]
    return text
%%time
# Run it with this cleanup and without to see the difference
# movie_review.review = cleanText(movie_review.review)
CPU times: user 4 µs, sys: 0 ns, total: 4 µs
Wall time: 36.7 µs
# Note: .loc slicing is label-inclusive (row 4000 would land in both files),
# so use .iloc for a clean 4000/1000 train/test split
tmp = movie_review.iloc[:4000]
tmp.to_csv('NLP_data/movie_review_train.txt', sep=" ", header=False, index=False)
tmp = movie_review.iloc[4000:]
tmp.to_csv('NLP_data/movie_review_test.txt', sep=" ", header=False, index=False)
!pip install fasttext-numpy2 --quiet
# !conda install -c conda-forge fasttext -y
%%time
import fasttext
model = fasttext.train_supervised('NLP_data/movie_review_train.txt', epoch=20) # Choose epochs to manage overfitting
CPU times: user 6.98 s, sys: 142 ms, total: 7.12 s
Wall time: 7.64 s
print(model.labels)
['__label__0', '__label__1']
# Take a look at the vocabulary
print(len(model.words))
print(model.words[:100])
90603
['the', 'a', 'and', 'of', 'to', 'is', 'in', 'that', 'I', 'this', 'it', '/><br', 'was', 'as', 'with', 'for', 'but', 'The', 'on', 'movie', 'are', 'film', 'his', 'have', 'not', 'be', 'you', '</s>', 'by', 'he', 'an', 'at', 'one', 'from', 'who', 'like', 'all', 'they', 'her', 'or', 'about', 'has', 'so', 'just', 'some', 'out', 'very', 'more', 'would', 'if', 'when', 'their', 'had', 'good', 'what', 'only', 'really', 'up', 'It', "it's", 'can', 'she', 'which', 'were', 'my', 'even', 'no', 'see', 'than', 'there', 'into', 'been', '-', 'because', 'much', 'will', 'get', 'This', 'story', 'most', 'time', 'could', 'other', 'how', 'me', 'people', 'its', 'make', 'any', 'we', 'first', 'do', 'great', 'also', '/>The', 'made', 'think', "don't", 'him', 'being']
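Notice tokens such as '/><br' and '/>The' in the vocabulary: these are HTML line-break tags left over in the raw reviews. A minimal cleanup sketch, applied before writing the train/test files, would be:

# Strip residual <br /> tags from the reviews before exporting for fastText
movie_review.review = movie_review.review.str.replace('<br />', ' ', regex=False)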
train = pd.read_csv("NLP_data/movie_review_train.txt", sep = " ", header=None)
test = pd.read_csv("NLP_data/movie_review_test.txt", sep = " ", header=None)
train.columns = ['sentiment','review']
test.columns = ['sentiment','review']
train.head()
sentiment review
0 __label__0 Homelessness (or Houselessness as George Carli...
1 __label__1 This film lacked something I couldn't put my f...
2 __label__1 \"It appears that many critics find the idea o...
3 __label__0 This isn't the comedic Robin Williams, nor is ...
4 __label__1 I don't know who to blame, the timid writers o...
model.predict("The good the bad and the ugly is an awesome movie")
(('__label__1',), array([0.99988425]))
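By default, predict returns only the top label; passing k=2 returns both labels ranked by probability:

# Top-2 labels with their probabilities for the same sentence
model.predict("The good the bad and the ugly is an awesome movie", k=2)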
# Pick a random test review and compare the true and predicted labels
# (.iloc avoids the positional-indexing FutureWarning raised by .loc[k][i])
k = randint(len(test)); print(k)
print(test.iloc[k, 0])    # true label
print(test.iloc[k, 1])    # review text
res = model.predict(test.iloc[k, 1])
print(res[0][0])          # predicted label
385
__label__1
I caught this stink bomb of a movie recently on a cable channel, and was reminded of how terrible I thought it was in 1980 when first released. Many reviewers out there aren't old enough to remember the enormous hype that surrounded this movie and the struggle between Stanley Kubrick and Steven King. The enormously popular novel had legions of fans eager to see a supposed \"master\" director put this multi-layered supernatural story on the screen. \"Salem's Lot\" had already been ruined in the late 1970s as a TV mini-series, directed by Tobe Hooper (he of \"Texas Chainsaw Massacre\" fame) and was badly handled, turning the major villain of the book into a \"Chiller Theatre\" vampire with no real menace at all thus destroying the entire premise. Fans hoped that a director of Kubrick's stature would succeed where Hooper had failed. It didn't happen.<br /><br />Sure, this movie looks great and has a terrific opening sequence but after those few accomplishments, it's all downhill. Jack Nicholson cannot be anything but Jack Nicholson. He's always crazy and didn't bring anything to his role here. I don't care that many reviewers here think he's all that in this clinker, the \"Here's Johnny!\" bit notwithstanding...he's just awful in this movie. So is everyone else, for that matter. Scatman Crothers' character, Dick Halloran, was essential to the plot of the book, yet Kubrick kills him off in one of the lamest \"shock\" sequences ever put on film. I remember the audience in the theater I saw this at booing repeatedly during the last 45 minutes of this wretched flick, those that stayed that is...many left. King's books really never translate well to film since so much of the narratives occur internally to his characters, and often metaphysically. Kubrick jettisoned the tension between the living and the dead in favor of style here and the resulting mess ends so far from the original material that we ultimately don't really care what happens to whom.<br /><br />This movie still stinks and why so many think it's a horror masterpiece is beyond me.
__label__1
# Train dataset
yhat = [model.predict(train.iloc[k, 1])[0][0] for k in range(len(train))]
y0 = list(train.iloc[:,0])

from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y0, yhat)
acc = sum(diag(cm))/sum(cm)
print("acc =",acc)
print(cm)
acc = 0.9135216195951013
[[1842  177]
 [ 169 1813]]
# Test dataset
yhat = [model.predict(test.iloc[k, 1])[0][0] for k in range(len(test))]
y0 = list(test.iloc[:,0])

cm = confusion_matrix(y0, yhat)
acc = sum(diag(cm))/sum(cm)
print("acc =",acc)
print(cm)
acc = 0.799
[[395 104]
 [ 97 404]]
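fastText also provides a built-in evaluation method. For a single-label problem like this one, precision@1 equals accuracy, so the result should agree with the confusion-matrix computation above:

# model.test returns (number of examples, precision@1, recall@1)
n, p, r = model.test('NLP_data/movie_review_test.txt')
print(n, p, r)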

With train accuracy around 0.91 but test accuracy near 0.80, there is some evidence of overfitting, so the number of training epochs may be reduced (or other hyperparameters such as the learning rate and word n-grams tuned).
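A sketch of one such experiment; the hyperparameter values here are illustrative assumptions, not tuned recommendations:

# Fewer epochs plus word bigrams; compare test precision@1 against the run above
model2 = fasttext.train_supervised('NLP_data/movie_review_train.txt',
                                   epoch=10, lr=0.5, wordNgrams=2)
print(model2.test('NLP_data/movie_review_test.txt'))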