{"cells":[{"cell_type":"markdown","metadata":{"id":"g2HsW7jyVlVY"},"source":["# Text Classification with FastText\n","\n"]},{"cell_type":"code","execution_count":1,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"elapsed":102329,"status":"ok","timestamp":1737405453668,"user":{"displayName":"Sanjiv Das","userId":"06377870171053566924"},"user_tz":480},"id":"Q7CQkvFuJ3Ok","outputId":"bebcb132-0ed1-4ac5-aa89-4f74d6703068"},"outputs":[{"output_type":"stream","name":"stdout","text":["Mounted at /content/drive\n"]}],"source":["from google.colab import drive\n","drive.mount('/content/drive') # Add My Drive/<>\n","\n","import os\n","os.chdir('drive/My Drive')\n","os.chdir('Books_Writings/NLPBook/')"]},{"cell_type":"code","execution_count":2,"metadata":{"executionInfo":{"elapsed":2581,"status":"ok","timestamp":1737405456247,"user":{"displayName":"Sanjiv Das","userId":"06377870171053566924"},"user_tz":480},"id":"GV1u57DPVlVZ"},"outputs":[],"source":["%%capture\n","%pylab inline\n","import pandas as pd\n","import os\n","%load_ext rpy2.ipython"]},{"cell_type":"markdown","metadata":{"id":"BuY5Xf7THjfW"},"source":["## Use fastText from Facebook to classify movie reviews\n","\n","Home page: https://fasttext.cc/\n","\n","Supervised classification tutorial: https://fasttext.cc/docs/en/supervised-tutorial.html\n","\n","GluonNLP text classification: https://gluon-nlp.mxnet.io/model_zoo/text_classification/index.html\n","\n","PyPI: https://pypi.org/project/fasttext/\n","\n","For a fun example, see [Malafosse (2019)](https://medium.com/@media_73863/fasttext-sentiment-analysis-for-tweets-a-straightforward-guide-9a8c070449a2), *FastText sentiment analysis for tweets: A straightforward guide* ([pdf](https://drive.google.com/file/d/10XnkFAxVyGEDVEyxP8f3dFdMmlQvpRq5/view?usp=sharing)).\n","\n","See also AutoGluon: https://autogluon.mxnet.io/tutorials/text_prediction/beginner.html\n","\n","Here we revisit the movie review dataset."]},{"cell_type":"markdown","metadata":{"id":"38m21k6nHjfW"},"source":["The format for the input 
file is `__label__labelname text`\n","\n","Example: `__label__0` and `__label__1` for a binary classifier.\n","\n","You can put as many labels as needed on one line."]},{"cell_type":"code","execution_count":3,"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":223},"executionInfo":{"elapsed":832,"status":"ok","timestamp":1737405457076,"user":{"displayName":"Sanjiv Das","userId":"06377870171053566924"},"user_tz":480},"id":"FtHwQD-0HjfX","outputId":"96228101-1203-439a-c4a6-19a7aab39d45"},"outputs":[{"output_type":"stream","name":"stdout","text":["(5000, 2)\n"]},{"output_type":"execute_result","data":{"text/plain":[" sentiment review\n","0 __label__0 Homelessness (or Houselessness as George Carli...\n","1 __label__1 This film lacked something I couldn't put my f...\n","2 __label__1 \\\"It appears that many critics find the idea o...\n","3 __label__0 This isn't the comedic Robin Williams, nor is ...\n","4 __label__1 I don't know who to blame, the timid writers o..."],"text/html":["\n","
\n"],"application/vnd.google.colaboratory.intrinsic+json":{"type":"dataframe","variable_name":"movie_review","summary":"{\n \"name\": \"movie_review\",\n \"rows\": 5000,\n \"fields\": [\n {\n \"column\": \"sentiment\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 2,\n \"samples\": [\n \"__label__1\",\n \"__label__0\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"review\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 4996,\n \"samples\": [\n \"Over Her Dead Body was a nice little movie.It was decent and entertaining, while still being pretty funny.There were a few clich's, but I found most stuff fresh.At first I didn't think it was going to be good at all,when it started out.If you can get past the first 20 minutes though,the movie starts getting more interesting.This film wasn't burst out in laughter hilarious,and wasn't OH MY GOSH wonderful.It was just a movie that you can sit down and enjoy for how enjoyable it was.I don't see how this movie was bad.It's rating is just a bit too low.I could've dealt with a 5.5,but a 4.8?Also,giving this movie a 1 is disgraceful.It was pretty good,and there was nothing horrible enough about it to give it a 1,which is what most people gave it.\",\n \"Americans have the attention span of a fruit fly and if something does not happen within the span of a typical commercial, we tend to lose interest really fast.
I found out an exciting fact from this film: someone has to paint high tension utility poles and do it on a schedule! And guess what, they really would like to be doing something else (the viewer has similar feelings).
Surprisingly, when I was bored watching late night infomercials and decided to actually watch this film, I found the characters to be interesting and highly engaging.
I just don't usually watch that much late night TV, so I can't recommend this film, unless watching paint dry is your idea of an exciting two hours out of your life.\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}"}},"metadata":{},"execution_count":3}],"source":["# Let's take a look at the movie database structure\n","movie_review = pd.read_csv('NLP_data/movie_review.csv')\n","\n","# Convert df into the format required for fastText: prefix each label with __label__\n","# (vectorized, so each row keeps its own label)\n","movie_review.sentiment = \"__label__\" + movie_review.sentiment.astype(str)\n","movie_review = movie_review.drop(\"id\", axis=1)\n","print(movie_review.shape)\n","movie_review.head()"]},{"cell_type":"code","execution_count":4,"metadata":{"executionInfo":{"elapsed":8,"status":"ok","timestamp":1737405457076,"user":{"displayName":"Sanjiv Das","userId":"06377870171053566924"},"user_tz":480},"id":"ewgtIzwCHjfc"},"outputs":[],"source":["import string\n","\n","def cleanText(text):\n","    # text is a pandas Series of strings, so use the vectorized .str methods\n","    for c in string.punctuation:\n","        text = text.str.replace(c, \" \", regex=False)\n","    text = text.str.replace('“', '', regex=False)\n","    text = text.str.replace('”', '', regex=False)\n","    text = text.str.replace('’', '', regex=False)\n","    text = text.str.replace('—', ' ', regex=False)\n","    # Remove numbers\n","    for c in range(10):\n","        text = text.str.replace(str(c), \" \", regex=False)\n","    text = text.str.lower()\n","    # stopText and stemText are not defined in this notebook; define them before calling cleanText\n","    text = stopText(text)\n","    text = stemText(text)\n","    text = [j.strip() for j in text]\n","    return text"]},{"cell_type":"code","execution_count":5,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"elapsed":7,"status":"ok","timestamp":1737405457076,"user":{"displayName":"Sanjiv Das","userId":"06377870171053566924"},"user_tz":480},"id":"Hia8N8qTHjfg","outputId":"48263194-15ed-47f7-cb7d-10923ce1282d"},"outputs":[{"output_type":"stream","name":"stdout","text":["CPU times: user 3 µs, sys: 1e+03 ns, total: 4 µs\n","Wall time: 7.15 µs\n"]}],"source":["%%time\n","# Run the notebook with and without this cleanup to see the difference\n","# movie_review.review = 
cleanText(movie_review.review)"]},{"cell_type":"code","execution_count":6,"metadata":{"executionInfo":{"elapsed":2118,"status":"ok","timestamp":1737405459188,"user":{"displayName":"Sanjiv Das","userId":"06377870171053566924"},"user_tz":480},"id":"iiCHvibuHjfl"},"outputs":[],"source":["# Split by position so the training and test sets do not share a row\n","# (label-based .loc slices include both endpoints, so row 4000 would land in both files)\n","tmp = movie_review.iloc[:4000]\n","tmp.to_csv('NLP_data/movie_review_train.txt', sep=\" \", header=False, index=False)\n","tmp = movie_review.iloc[4000:]\n","tmp.to_csv('NLP_data/movie_review_test.txt', sep=\" \", header=False, index=False)"]},{"cell_type":"code","execution_count":7,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"elapsed":54718,"status":"ok","timestamp":1737405513902,"user":{"displayName":"Sanjiv Das","userId":"06377870171053566924"},"user_tz":480},"id":"tBz4l3J0Hjfn","outputId":"97cff6a8-8977-4cf1-8df7-76d29037e224","scrolled":true},"outputs":[{"output_type":"stream","name":"stdout","text":[" Installing build dependencies ... done\n"," Getting requirements to build wheel ... done\n"," Preparing metadata (pyproject.toml) ... done\n"," Building wheel for fasttext (pyproject.toml) ... 
done\n"]}],"source":["!pip install fasttext --quiet\n","# !conda install -c conda-forge fasttext -y"]},{"cell_type":"code","execution_count":8,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"elapsed":13890,"status":"ok","timestamp":1737405527788,"user":{"displayName":"Sanjiv Das","userId":"06377870171053566924"},"user_tz":480},"id":"WFsDXij1Hjfp","outputId":"7f935a35-2e06-4448-c7bc-e700e54aa676"},"outputs":[{"output_type":"stream","name":"stdout","text":["CPU times: user 8.94 s, sys: 198 ms, total: 9.14 s\n","Wall time: 13.9 s\n"]}],"source":["%%time\n","import fasttext\n","model = fasttext.train_supervised('NLP_data/movie_review_train.txt', epoch=20) # Choose epochs to manage overfitting"]},{"cell_type":"code","execution_count":9,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"elapsed":5,"status":"ok","timestamp":1737405527789,"user":{"displayName":"Sanjiv Das","userId":"06377870171053566924"},"user_tz":480},"id":"6DiSdY9XHjft","outputId":"be0ffae1-528c-4d7f-ca25-cd158275bfd9"},"outputs":[{"output_type":"stream","name":"stdout","text":["['__label__0', '__label__1']\n"]}],"source":["print(model.labels)"]},{"cell_type":"code","execution_count":10,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"elapsed":4,"status":"ok","timestamp":1737405527789,"user":{"displayName":"Sanjiv Das","userId":"06377870171053566924"},"user_tz":480},"id":"wIr_AuguHjfv","outputId":"b71b8988-43ab-49ad-c7fe-a05a8362bc0b"},"outputs":[{"output_type":"stream","name":"stdout","text":["90603\n","['the', 'a', 'and', 'of', 'to', 'is', 'in', 'that', 'I', 'this', 'it', '/>
', 'by', 'he', 'an', 'at', 'one', 'from', 'who', 'like', 'all', 'they', 'her', 'or', 'about', 'has', 'so', 'just', 'some', 'out', 'very', 'more', 'would', 'if', 'when', 'their', 'had', 'good', 'what', 'only', 'really', 'up', 'It', \"it's\", 'can', 'she', 'which', 'were', 'my', 'even', 'no', 'see', 'than', 'there', 'into', 'been', '-', 'because', 'much', 'will', 'get', 'This', 'story', 'most', 'time', 'could', 'other', 'how', 'me', 'people', 'its', 'make', 'any', 'we', 'first', 'do', 'great', 'also', '/>The', 'made', 'think', \"don't\", 'him', 'being']\n"]}],"source":["# Take a look at the vocabulary\n","print(len(model.words))\n","print(model.words[:100])"]},{"cell_type":"code","execution_count":11,"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":206},"executionInfo":{"elapsed":227,"status":"ok","timestamp":1737405528013,"user":{"displayName":"Sanjiv Das","userId":"06377870171053566924"},"user_tz":480},"id":"9yt6K2aoHjf0","outputId":"3af42db0-140d-4a80-99f7-d535645ace58"},"outputs":[{"output_type":"execute_result","data":{"text/plain":[" sentiment review\n","0 __label__0 Homelessness (or Houselessness as George Carli...\n","1 __label__1 This film lacked something I couldn't put my f...\n","2 __label__1 \\\"It appears that many critics find the idea o...\n","3 __label__0 This isn't the comedic Robin Williams, nor is ...\n","4 __label__1 I don't know who to blame, the timid writers o..."],"text/html":["\n"," \n","
\n"],"application/vnd.google.colaboratory.intrinsic+json":{"type":"dataframe","variable_name":"train","summary":"{\n \"name\": \"train\",\n \"rows\": 4001,\n \"fields\": [\n {\n \"column\": \"sentiment\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 2,\n \"samples\": [\n \"__label__1\",\n \"__label__0\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"review\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 3997,\n \"samples\": [\n \"I saw the trailer of the film several times at theater and I excited. It looked like a classic action thriller like the ones made in 1990's. It recalled me also Fugitive movies, a cat and mouse chase between Douglas and Sutherland. However, The Sentinel is the most tasteless action thriller of all time. As I see, many people say that this is like a TV movie. Not exactly. Firstly, there are much more better TV movies in this genre. Secondly, TV movies might be very fun sometimes, but this film is the exact opposite of having a good time. It is not stylish at all visually and the most important, the tone of the movie is unappealing. This is not an action movie, there are two action scenes consist of a chase and a clash. Also they are not big action scenes, but the worse is that those action scenes are very tasteless like the whole movie. The love affair between Douglas and Bassinger was very unnecessary. Besides, the assassination plot to the president is the most clich story in this genre either, but they insist on that. And this is not a cat and mouse film as it is supposed to be. Although, Douglas is very old now, he has still potential for acting in an action thriller. In the film, Michael Douglas cannot be like Tommy Lee Jones, for example. 
Sutherland is a wrong choice either, because you feel as if you watch Jack Bauer and somehow, its character is one of the reasons which make the film like a TV movie, Eva Longoria Parker is a strange choice, of course she is too passive or straight in this movie, because she is a soap opera actress. The movie was not fun even one second to me, so I could not get over for a while.\",\n \"In one instant when it seemed to be getting interesting, it never got there.
The people are going from one point to another point, with really no point (if there was one it was very dull). There was no action, suspense or any horror and the characters were pretty heartless, so there was no caring what happened to them.
All together the movie was pretty boring.
I give it a 3/10.
I like that it wasn't shaky choppy camera-work and if there was music it didn't annoy me like some really bad movies and the acting was not horrendous.\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}"}},"metadata":{},"execution_count":11}],"source":["train = pd.read_csv(\"NLP_data/movie_review_train.txt\", sep = \" \", header=None)\n","test = pd.read_csv(\"NLP_data/movie_review_test.txt\", sep = \" \", header=None)\n","train.columns = ['sentiment','review']\n","test.columns = ['sentiment','review']\n","train.head()"]},{"cell_type":"code","execution_count":12,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"elapsed":8,"status":"ok","timestamp":1737405528013,"user":{"displayName":"Sanjiv Das","userId":"06377870171053566924"},"user_tz":480},"id":"dbvPmC0cHjf9","outputId":"d4416b72-0298-465a-a278-35c8b0b3f5ef"},"outputs":[{"output_type":"execute_result","data":{"text/plain":["(('__label__1',), array([0.99988425]))"]},"metadata":{},"execution_count":12}],"source":["model.predict(\"The good the bad and the ugly is an awesome movie\")"]},{"cell_type":"code","execution_count":13,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"elapsed":7,"status":"ok","timestamp":1737405528013,"user":{"displayName":"Sanjiv Das","userId":"06377870171053566924"},"user_tz":480},"id":"o1zYiXLeHjgC","outputId":"58a7b2d3-a839-430e-aff1-4f153f9ec80e"},"outputs":[{"output_type":"stream","name":"stdout","text":["567\n","__label__1\n","If you hate redneck accents, you'll hate this movie. And to make it worse, you see Patrick Swayze, a has been trying to be a redneck. I really can't stand redneck accents. I like Billy Bob Thornton, he was good in Slingblade, but he was annoying in this movie. And what kind of name is Lonnie Earl? How much more hickish can this movie get? The storyline was stupid. I'm usually not this judgemental of movies, but I couldn't stand this movie. 
If you want a good Billy Bob Thornton movie, go see Slingblade.
My mom found this movie for $5.95 at Wal Mart...figures...I think I'll wrap it up and give it to my Grandma for Christmas. It could just be that I can't stand redneck accents usually, or that I can't stand Patrick Swayze. Maybe if Patrick Swayze wasn't in it. I didn't laugh once in the movie. I laugh at anything stupid usually. If they had shown someones fingers getting smashed, I might have laughed. people's fingers getting smashed by accident always makes me laugh.\n","__label__1\n"]},{"output_type":"stream","name":"stderr","text":[":2: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`\n"," print(test.loc[k][0])\n",":3: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`\n"," print(test.loc[k][1])\n",":4: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`\n"," res = model.predict(test.loc[k][1])\n"]}],"source":["k = randint(len(test)); print(k)\n","print(test.loc[k][0])\n","print(test.loc[k][1])\n","res = model.predict(test.loc[k][1])\n","print(res[0][0])"]},{"cell_type":"code","execution_count":14,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"elapsed":4348,"status":"ok","timestamp":1737405532356,"user":{"displayName":"Sanjiv Das","userId":"06377870171053566924"},"user_tz":480},"id":"TfjT7e0BHjgG","outputId":"dc485cec-f9ed-44b9-a6ec-ae3c0e2574dc"},"outputs":[{"output_type":"stream","name":"stderr","text":[":2: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. 
In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`\n"," yhat = [model.predict(train.loc[k][1])[0][0] for k in range(len(train))]\n"]},{"output_type":"stream","name":"stdout","text":["acc = 0.9135216195951013\n","[[1842 177]\n"," [ 169 1813]]\n"]}],"source":["# Train dataset\n","yhat = [model.predict(train.loc[k][1])[0][0] for k in range(len(train))]\n","y0 = list(train.iloc[:,0])\n","\n","from sklearn.metrics import confusion_matrix\n","cm = confusion_matrix(y0, yhat)\n","acc = sum(diag(cm))/sum(cm)\n","print(\"acc =\",acc)\n","print(cm)"]},{"cell_type":"code","execution_count":15,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"elapsed":193,"status":"ok","timestamp":1737405532547,"user":{"displayName":"Sanjiv Das","userId":"06377870171053566924"},"user_tz":480},"id":"eQxDEkHkHjgM","outputId":"55798a3a-03fd-47cb-c4f2-7ff524a045ba"},"outputs":[{"output_type":"stream","name":"stderr","text":[":2: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). 
To access a value by position, use `ser.iloc[pos]`\n"," yhat = [model.predict(test.loc[k][1])[0][0] for k in range(len(test))]\n"]},{"output_type":"stream","name":"stdout","text":["acc = 0.799\n","[[395 104]\n"," [ 97 404]]\n"]}],"source":["# Test dataset\n","yhat = [model.predict(test.loc[k][1])[0][0] for k in range(len(test))]\n","y0 = list(test.iloc[:,0])\n","\n","cm = confusion_matrix(y0, yhat)\n","acc = sum(diag(cm))/sum(cm)\n","print(\"acc =\",acc)\n","print(cm)"]},{"cell_type":"markdown","metadata":{"id":"zDr7oVOGBWM5"},"source":["Training accuracy (about 91%) is well above test accuracy (about 80%), which is evidence of overfitting. Reducing the number of training epochs is one way to narrow the gap. If both accuracies were low instead, the model would be underfitting and more epochs could help."]},{"cell_type":"code","execution_count":15,"metadata":{"id":"vsvhAXvjBbTK","executionInfo":{"status":"ok","timestamp":1737405532778,"user_tz":480,"elapsed":232,"user":{"displayName":"Sanjiv Das","userId":"06377870171053566924"}}},"outputs":[],"source":[]}],"metadata":{"accelerator":"GPU","celltoolbar":"Slideshow","colab":{"provenance":[],"toc_visible":true},"kernelspec":{"display_name":"Python 3","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.8.12"}},"nbformat":4,"nbformat_minor":0}