25. Text Classification with FastText#

from google.colab import drive
drive.mount('/content/drive')  # Add My Drive/<>

import os
os.chdir('drive/My Drive')
os.chdir('Books_Writings/NLPBook/')
Mounted at /content/drive
%%capture
%pylab inline
import pandas as pd
import os
%load_ext rpy2.ipython

25.1. Using FastText from Facebook for classification of movie reviews#

https://fasttext.cc/

https://fasttext.cc/docs/en/supervised-tutorial.html

See also GluonNLP: https://gluon-nlp.mxnet.io/model_zoo/text_classification/index.html

PyPi: https://pypi.org/project/fasttext/

See Malafosse (2019), FastText sentiment analysis for tweets: A straightforward guide, for a fun example.

See also: https://autogluon.mxnet.io/tutorials/text_prediction/beginner.html

Here we will revisit the movie review dataset.

The format for the input file is: __label__<labelname> <text>

Example: labels __label__0 and __label__1 for a binary classifier.

You can put as many labels as needed on one line; fastText supports multi-label training.
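For instance, the first two lines of a training file might look like this (the reviews here are made up for illustration):

__label__1 a heartwarming and beautifully acted film
__label__0 two hours of my life I will never get back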

# Let's take a look at the movie database structure
movie_review = pd.read_csv('NLP_data/movie_review.csv')

# Convert df into the format required for fasttext
movie_review.sentiment = ["__label__" + str(s) for s in movie_review.sentiment]
movie_review = movie_review.drop("id", axis=1)
print(movie_review.shape)
movie_review.head()
(5000, 2)
sentiment review
0 __label__0 Homelessness (or Houselessness as George Carli...
1 __label__1 This film lacked something I couldn't put my f...
2 __label__1 \"It appears that many critics find the idea o...
3 __label__0 This isn't the comedic Robin Williams, nor is ...
4 __label__1 I don't know who to blame, the timid writers o...
import string

def cleanText(text):
    # text: a pandas Series of review strings
    for c in string.punctuation:
        text = text.str.replace(c, " ", regex=False)
    for c in ['“', '”', '’']:
        text = text.str.replace(c, "", regex=False)
    text = text.str.replace('—', ' ', regex=False)
    # Remove digits
    for n in [str(c) for c in range(10)]:
        text = text.str.replace(n, " ", regex=False)
    text = text.str.lower()
    text = stopText(text)   # stopword removal (helper defined earlier in the book)
    text = stemText(text)   # stemming (helper defined earlier in the book)
    text = [j.strip() for j in text]
    return text
%%time
# Run it with this cleanup and without to see the difference
# movie_review.review = cleanText(movie_review.review)
CPU times: user 4 µs, sys: 0 ns, total: 4 µs
Wall time: 36.7 µs
# Note: .loc slicing is label-inclusive (row 4000 would land in both files),
# so use .iloc for a clean 4000/1000 train/test split
tmp = movie_review.iloc[:4000]
tmp.to_csv('NLP_data/movie_review_train.txt', sep=" ", header=False, index=False)
tmp = movie_review.iloc[4000:]
tmp.to_csv('NLP_data/movie_review_test.txt', sep=" ", header=False, index=False)
!pip install fasttext-numpy2 --quiet
# !conda install -c conda-forge fasttext -y
%%time
import fasttext
model = fasttext.train_supervised('NLP_data/movie_review_train.txt', epoch=20) # Choose epochs to manage overfitting
CPU times: user 6.98 s, sys: 142 ms, total: 7.12 s
Wall time: 7.64 s
print(model.labels)
['__label__0', '__label__1']
# Take a look at the vocabulary
print(len(model.words))
print(model.words[:100])
90603
['the', 'a', 'and', 'of', 'to', 'is', 'in', 'that', 'I', 'this', 'it', '/><br', 'was', 'as', 'with', 'for', 'but', 'The', 'on', 'movie', 'are', 'film', 'his', 'have', 'not', 'be', 'you', '</s>', 'by', 'he', 'an', 'at', 'one', 'from', 'who', 'like', 'all', 'they', 'her', 'or', 'about', 'has', 'so', 'just', 'some', 'out', 'very', 'more', 'would', 'if', 'when', 'their', 'had', 'good', 'what', 'only', 'really', 'up', 'It', "it's", 'can', 'she', 'which', 'were', 'my', 'even', 'no', 'see', 'than', 'there', 'into', 'been', '-', 'because', 'much', 'will', 'get', 'This', 'story', 'most', 'time', 'could', 'other', 'how', 'me', 'people', 'its', 'make', 'any', 'we', 'first', 'do', 'great', 'also', '/>The', 'made', 'think', "don't", 'him', 'being']
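Notice tokens such as '/><br' and '/>The' in the vocabulary: these are HTML line-break tags left over in the raw reviews. A minimal cleanup sketch, applied before writing the train/test files, would be:

# Strip residual <br /> tags from the reviews before exporting for fastText
movie_review.review = movie_review.review.str.replace('<br />', ' ', regex=False)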
train = pd.read_csv("NLP_data/movie_review_train.txt", sep = " ", header=None)
test = pd.read_csv("NLP_data/movie_review_test.txt", sep = " ", header=None)
train.columns = ['sentiment','review']
test.columns = ['sentiment','review']
train.head()
sentiment review
0 __label__0 Homelessness (or Houselessness as George Carli...
1 __label__1 This film lacked something I couldn't put my f...
2 __label__1 \"It appears that many critics find the idea o...
3 __label__0 This isn't the comedic Robin Williams, nor is ...
4 __label__1 I don't know who to blame, the timid writers o...
model.predict("The good the bad and the ugly is an awesome movie")
(('__label__1',), array([0.99988425]))
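By default, predict returns only the top label; passing k=2 returns both labels ranked by probability:

# Top-2 labels with their probabilities for the same sentence
model.predict("The good the bad and the ugly is an awesome movie", k=2)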
# Pick a random test review and compare the true and predicted labels
# (.iloc avoids the positional-indexing FutureWarning raised by .loc[k][i])
k = randint(len(test)); print(k)
print(test.iloc[k, 0])    # true label
print(test.iloc[k, 1])    # review text
res = model.predict(test.iloc[k, 1])
print(res[0][0])          # predicted label
385
__label__1
I caught this stink bomb of a movie recently on a cable channel, and was reminded of how terrible I thought it was in 1980 when first released. Many reviewers out there aren't old enough to remember the enormous hype that surrounded this movie and the struggle between Stanley Kubrick and Steven King. The enormously popular novel had legions of fans eager to see a supposed \"master\" director put this multi-layered supernatural story on the screen. \"Salem's Lot\" had already been ruined in the late 1970s as a TV mini-series, directed by Tobe Hooper (he of \"Texas Chainsaw Massacre\" fame) and was badly handled, turning the major villain of the book into a \"Chiller Theatre\" vampire with no real menace at all thus destroying the entire premise. Fans hoped that a director of Kubrick's stature would succeed where Hooper had failed. It didn't happen.<br /><br />Sure, this movie looks great and has a terrific opening sequence but after those few accomplishments, it's all downhill. Jack Nicholson cannot be anything but Jack Nicholson. He's always crazy and didn't bring anything to his role here. I don't care that many reviewers here think he's all that in this clinker, the \"Here's Johnny!\" bit notwithstanding...he's just awful in this movie. So is everyone else, for that matter. Scatman Crothers' character, Dick Halloran, was essential to the plot of the book, yet Kubrick kills him off in one of the lamest \"shock\" sequences ever put on film. I remember the audience in the theater I saw this at booing repeatedly during the last 45 minutes of this wretched flick, those that stayed that is...many left. King's books really never translate well to film since so much of the narratives occur internally to his characters, and often metaphysically. Kubrick jettisoned the tension between the living and the dead in favor of style here and the resulting mess ends so far from the original material that we ultimately don't really care what happens to whom.<br /><br />This movie still stinks and why so many think it's a horror masterpiece is beyond me.
__label__1
# Train dataset
yhat = [model.predict(train.iloc[k, 1])[0][0] for k in range(len(train))]
y0 = list(train.iloc[:,0])

from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y0, yhat)
acc = sum(diag(cm))/sum(cm)
print("acc =",acc)
print(cm)
acc = 0.9135216195951013
[[1842  177]
 [ 169 1813]]
# Test dataset
yhat = [model.predict(test.iloc[k, 1])[0][0] for k in range(len(test))]
y0 = list(test.iloc[:,0])

cm = confusion_matrix(y0, yhat)
acc = sum(diag(cm))/sum(cm)
print("acc =",acc)
print(cm)
acc = 0.799
[[395 104]
 [ 97 404]]
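fastText also provides a built-in evaluation method. For a single-label problem like this one, precision@1 equals accuracy, so the result should agree with the confusion-matrix computation above:

# model.test returns (number of examples, precision@1, recall@1)
n, p, r = model.test('NLP_data/movie_review_test.txt')
print(n, p, r)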

With train accuracy around 0.91 but test accuracy near 0.80, there is some evidence of overfitting, so the number of training epochs may be reduced (or other hyperparameters such as the learning rate and word n-grams tuned).
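A sketch of one such experiment; the hyperparameter values here are illustrative assumptions, not tuned recommendations:

# Fewer epochs plus word bigrams; compare test precision@1 against the run above
model2 = fasttext.train_supervised('NLP_data/movie_review_train.txt',
                                   epoch=10, lr=0.5, wordNgrams=2)
print(model2.test('NLP_data/movie_review_test.txt'))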