{"cells":[{"cell_type":"markdown","metadata":{"id":"g2HsW7jyVlVY"},"source":["# Text Classification with spaCy\n","\n"]},{"cell_type":"code","execution_count":1,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"elapsed":26666,"status":"ok","timestamp":1737405676085,"user":{"displayName":"Sanjiv Das","userId":"06377870171053566924"},"user_tz":480},"id":"Q7CQkvFuJ3Ok","outputId":"f1d123f4-db63-4be7-b44f-cdb200436031"},"outputs":[{"output_type":"stream","name":"stdout","text":["Mounted at /content/drive\n"]}],"source":["from google.colab import drive\n","drive.mount('/content/drive') # Add My Drive/<>\n","\n","import os\n","os.chdir('drive/My Drive')\n","os.chdir('Books_Writings/NLPBook/')"]},{"cell_type":"code","execution_count":2,"metadata":{"id":"GV1u57DPVlVZ","executionInfo":{"status":"ok","timestamp":1737405678293,"user_tz":480,"elapsed":2211,"user":{"displayName":"Sanjiv Das","userId":"06377870171053566924"}}},"outputs":[],"source":["%%capture\n","%pylab inline\n","import pandas as pd\n","import os"]},{"cell_type":"markdown","metadata":{"id":"62FOAkQDHjgO"},"source":["## Using spaCy\n","\n","[spaCy](https://spacy.io) provides a pipeline for tokenizing and lemmatizing text, which makes it a convenient front end for text classification. We use it here to preprocess movie reviews.\n","\n","We also use [scikit-learn](https://scikit-learn.org/stable/) to vectorize the tokens and fit a logistic regression classifier."]},
{"cell_type":"code","execution_count":3,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"elapsed":2737,"status":"ok","timestamp":1737405681027,"user":{"displayName":"Sanjiv Das","userId":"06377870171053566924"},"user_tz":480},"id":"ly804y7DHjgP","outputId":"787b632a-367d-447e-a26f-0cd27d5371a9"},"outputs":[{"output_type":"stream","name":"stdout","text":["Populating the interactive namespace from numpy and matplotlib\n"]}],"source":["%pylab inline\n","from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer\n","from sklearn.base import TransformerMixin\n","from sklearn.pipeline import Pipeline\n","import pandas as pd"]},{"cell_type":"code","execution_count":4,"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":397},"executionInfo":{"elapsed":1414,"status":"ok","timestamp":1737405682439,"user":{"displayName":"Sanjiv Das","userId":"06377870171053566924"},"user_tz":480},"id":"OdZCFwEHHjgS","outputId":"54aa5c66-9d8b-467a-ff5f-48ba448d8403"},"outputs":[{"output_type":"stream","name":"stdout","text":["<class 'pandas.core.frame.DataFrame'>\n","RangeIndex: 5000 entries, 0 to 4999\n","Data columns (total 3 columns):\n"," # Column Non-Null Count Dtype \n","--- ------ -------------- ----- \n"," 0 id 5000 non-null object\n"," 1 sentiment 5000 non-null int64 \n"," 2 review 5000 non-null object\n","dtypes: int64(1), object(2)\n","memory usage: 117.3+ KB\n","None\n"]},{"output_type":"execute_result","data":{"text/plain":[" id sentiment review\n","0 10000_8 1 Homelessness (or Houselessness as George Carli...\n","1 10001_4 0 This film lacked something I couldn't put my f...\n","2 10004_3 0 \\\"It appears that many critics find the idea o...\n","3 10004_8 1 This isn't the comedic Robin Williams, nor is ...\n","4 10006_4 0 I don't know who to blame, the timid writers o..."],"text/html":["\n","
\n"],"application/vnd.google.colaboratory.intrinsic+json":{"type":"dataframe","variable_name":"df","summary":"{\n \"name\": \"df\",\n \"rows\": 5000,\n \"fields\": [\n {\n \"column\": \"id\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 5000,\n \"samples\": [\n \"2083_1\",\n \"4450_3\",\n \"4601_4\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"sentiment\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0,\n \"min\": 0,\n \"max\": 1,\n \"num_unique_values\": 2,\n \"samples\": [\n 0,\n 1\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"review\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 4996,\n \"samples\": [\n \"Over Her Dead Body was a nice little movie.It was decent and entertaining, while still being pretty funny.There were a few clich's, but I found most stuff fresh.At first I didn't think it was going to be good at all,when it started out.If you can get past the first 20 minutes though,the movie starts getting more interesting.This film wasn't burst out in laughter hilarious,and wasn't OH MY GOSH wonderful.It was just a movie that you can sit down and enjoy for how enjoyable it was.I don't see how this movie was bad.It's rating is just a bit too low.I could've dealt with a 5.5,but a 4.8?Also,giving this movie a 1 is disgraceful.It was pretty good,and there was nothing horrible enough about it to give it a 1,which is what most people gave it.\",\n \"Americans have the attention span of a fruit fly and if something does not happen within the span of a typical commercial, we tend to lose interest really fast.

I found out an exciting fact from this film: someone has to paint high tension utility poles and do it on a schedule! And guess what, they really would like to be doing something else (the viewer has similar feelings).

Surprisingly, when I was bored watching late night infomercials and decided to actually watch this film, I found the characters to be interesting and highly engaging.

I just don't usually watch that much late night TV, so I can't recommend this film, unless watching paint dry is your idea of an exciting two hours out of your life.\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}"}},"metadata":{},"execution_count":4}],"source":["# Loading CSV file\n","df = pd.read_csv(\"NLP_data/movie_review.csv\")\n","# View data information\n","print(df.info())\n","df.head()"]},{"cell_type":"code","execution_count":5,"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":178},"executionInfo":{"elapsed":16,"status":"ok","timestamp":1737405682440,"user":{"displayName":"Sanjiv Das","userId":"06377870171053566924"},"user_tz":480},"id":"3vupoRTQHjgV","outputId":"2f39cf00-1c69-438d-b4f5-8a56c45a5bed"},"outputs":[{"output_type":"execute_result","data":{"text/plain":["sentiment\n","1 2517\n","0 2483\n","Name: count, dtype: int64"],"text/html":["
"]},"metadata":{},"execution_count":5}],"source":["# Feedback Value count\n","df.sentiment.value_counts()"]},{"cell_type":"code","execution_count":6,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"elapsed":25948,"status":"ok","timestamp":1737405708373,"user":{"displayName":"Sanjiv Das","userId":"06377870171053566924"},"user_tz":480},"id":"MYmfjm7_HjgX","outputId":"7e9b7146-eba8-4904-a7fe-36f11a102b24","scrolled":true},"outputs":[{"output_type":"stream","name":"stdout","text":["Collecting en-core-web-sm==3.7.1\n"," Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)\n","\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m12.8/12.8 MB\u001b[0m \u001b[31m49.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n","\u001b[?25hRequirement already satisfied: spacy<3.8.0,>=3.7.2 in /usr/local/lib/python3.11/dist-packages (from en-core-web-sm==3.7.1) (3.7.5)\n","Requirement already satisfied: spacy-legacy<3.1.0,>=3.0.11 in /usr/local/lib/python3.11/dist-packages (from spacy<3.8.0,>=3.7.2->en-core-web-sm==3.7.1) (3.0.12)\n","Requirement already satisfied: spacy-loggers<2.0.0,>=1.0.0 in /usr/local/lib/python3.11/dist-packages (from spacy<3.8.0,>=3.7.2->en-core-web-sm==3.7.1) (1.0.5)\n","Requirement already satisfied: murmurhash<1.1.0,>=0.28.0 in /usr/local/lib/python3.11/dist-packages (from spacy<3.8.0,>=3.7.2->en-core-web-sm==3.7.1) (1.0.11)\n","Requirement already satisfied: cymem<2.1.0,>=2.0.2 in /usr/local/lib/python3.11/dist-packages (from spacy<3.8.0,>=3.7.2->en-core-web-sm==3.7.1) (2.0.10)\n","Requirement already satisfied: preshed<3.1.0,>=3.0.2 in /usr/local/lib/python3.11/dist-packages (from spacy<3.8.0,>=3.7.2->en-core-web-sm==3.7.1) (3.0.9)\n","Requirement already satisfied: thinc<8.3.0,>=8.2.2 in /usr/local/lib/python3.11/dist-packages (from spacy<3.8.0,>=3.7.2->en-core-web-sm==3.7.1) (8.2.5)\n","Requirement already satisfied: 
wasabi<1.2.0,>=0.9.1 in /usr/local/lib/python3.11/dist-packages (from spacy<3.8.0,>=3.7.2->en-core-web-sm==3.7.1) (1.1.3)\n","Requirement already satisfied: srsly<3.0.0,>=2.4.3 in /usr/local/lib/python3.11/dist-packages (from spacy<3.8.0,>=3.7.2->en-core-web-sm==3.7.1) (2.5.0)\n","Requirement already satisfied: catalogue<2.1.0,>=2.0.6 in /usr/local/lib/python3.11/dist-packages (from spacy<3.8.0,>=3.7.2->en-core-web-sm==3.7.1) (2.0.10)\n","Requirement already satisfied: weasel<0.5.0,>=0.1.0 in /usr/local/lib/python3.11/dist-packages (from spacy<3.8.0,>=3.7.2->en-core-web-sm==3.7.1) (0.4.1)\n","Requirement already satisfied: typer<1.0.0,>=0.3.0 in /usr/local/lib/python3.11/dist-packages (from spacy<3.8.0,>=3.7.2->en-core-web-sm==3.7.1) (0.15.1)\n","Requirement already satisfied: tqdm<5.0.0,>=4.38.0 in /usr/local/lib/python3.11/dist-packages (from spacy<3.8.0,>=3.7.2->en-core-web-sm==3.7.1) (4.67.1)\n","Requirement already satisfied: requests<3.0.0,>=2.13.0 in /usr/local/lib/python3.11/dist-packages (from spacy<3.8.0,>=3.7.2->en-core-web-sm==3.7.1) (2.32.3)\n","Requirement already satisfied: pydantic!=1.8,!=1.8.1,<3.0.0,>=1.7.4 in /usr/local/lib/python3.11/dist-packages (from spacy<3.8.0,>=3.7.2->en-core-web-sm==3.7.1) (2.10.5)\n","Requirement already satisfied: jinja2 in /usr/local/lib/python3.11/dist-packages (from spacy<3.8.0,>=3.7.2->en-core-web-sm==3.7.1) (3.1.5)\n","Requirement already satisfied: setuptools in /usr/local/lib/python3.11/dist-packages (from spacy<3.8.0,>=3.7.2->en-core-web-sm==3.7.1) (75.1.0)\n","Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.11/dist-packages (from spacy<3.8.0,>=3.7.2->en-core-web-sm==3.7.1) (24.2)\n","Requirement already satisfied: langcodes<4.0.0,>=3.2.0 in /usr/local/lib/python3.11/dist-packages (from spacy<3.8.0,>=3.7.2->en-core-web-sm==3.7.1) (3.5.0)\n","Requirement already satisfied: numpy>=1.19.0 in /usr/local/lib/python3.11/dist-packages (from spacy<3.8.0,>=3.7.2->en-core-web-sm==3.7.1) 
(1.26.4)\n","Requirement already satisfied: language-data>=1.2 in /usr/local/lib/python3.11/dist-packages (from langcodes<4.0.0,>=3.2.0->spacy<3.8.0,>=3.7.2->en-core-web-sm==3.7.1) (1.3.0)\n","Requirement already satisfied: annotated-types>=0.6.0 in /usr/local/lib/python3.11/dist-packages (from pydantic!=1.8,!=1.8.1,<3.0.0,>=1.7.4->spacy<3.8.0,>=3.7.2->en-core-web-sm==3.7.1) (0.7.0)\n","Requirement already satisfied: pydantic-core==2.27.2 in /usr/local/lib/python3.11/dist-packages (from pydantic!=1.8,!=1.8.1,<3.0.0,>=1.7.4->spacy<3.8.0,>=3.7.2->en-core-web-sm==3.7.1) (2.27.2)\n","Requirement already satisfied: typing-extensions>=4.12.2 in /usr/local/lib/python3.11/dist-packages (from pydantic!=1.8,!=1.8.1,<3.0.0,>=1.7.4->spacy<3.8.0,>=3.7.2->en-core-web-sm==3.7.1) (4.12.2)\n","Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.11/dist-packages (from requests<3.0.0,>=2.13.0->spacy<3.8.0,>=3.7.2->en-core-web-sm==3.7.1) (3.4.1)\n","Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.11/dist-packages (from requests<3.0.0,>=2.13.0->spacy<3.8.0,>=3.7.2->en-core-web-sm==3.7.1) (3.10)\n","Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.11/dist-packages (from requests<3.0.0,>=2.13.0->spacy<3.8.0,>=3.7.2->en-core-web-sm==3.7.1) (2.3.0)\n","Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.11/dist-packages (from requests<3.0.0,>=2.13.0->spacy<3.8.0,>=3.7.2->en-core-web-sm==3.7.1) (2024.12.14)\n","Requirement already satisfied: blis<0.8.0,>=0.7.8 in /usr/local/lib/python3.11/dist-packages (from thinc<8.3.0,>=8.2.2->spacy<3.8.0,>=3.7.2->en-core-web-sm==3.7.1) (0.7.11)\n","Requirement already satisfied: confection<1.0.0,>=0.0.1 in /usr/local/lib/python3.11/dist-packages (from thinc<8.3.0,>=8.2.2->spacy<3.8.0,>=3.7.2->en-core-web-sm==3.7.1) (0.1.5)\n","Requirement already satisfied: click>=8.0.0 in /usr/local/lib/python3.11/dist-packages (from 
typer<1.0.0,>=0.3.0->spacy<3.8.0,>=3.7.2->en-core-web-sm==3.7.1) (8.1.8)\n","Requirement already satisfied: shellingham>=1.3.0 in /usr/local/lib/python3.11/dist-packages (from typer<1.0.0,>=0.3.0->spacy<3.8.0,>=3.7.2->en-core-web-sm==3.7.1) (1.5.4)\n","Requirement already satisfied: rich>=10.11.0 in /usr/local/lib/python3.11/dist-packages (from typer<1.0.0,>=0.3.0->spacy<3.8.0,>=3.7.2->en-core-web-sm==3.7.1) (13.9.4)\n","Requirement already satisfied: cloudpathlib<1.0.0,>=0.7.0 in /usr/local/lib/python3.11/dist-packages (from weasel<0.5.0,>=0.1.0->spacy<3.8.0,>=3.7.2->en-core-web-sm==3.7.1) (0.20.0)\n","Requirement already satisfied: smart-open<8.0.0,>=5.2.1 in /usr/local/lib/python3.11/dist-packages (from weasel<0.5.0,>=0.1.0->spacy<3.8.0,>=3.7.2->en-core-web-sm==3.7.1) (7.1.0)\n","Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.11/dist-packages (from jinja2->spacy<3.8.0,>=3.7.2->en-core-web-sm==3.7.1) (3.0.2)\n","Requirement already satisfied: marisa-trie>=1.1.0 in /usr/local/lib/python3.11/dist-packages (from language-data>=1.2->langcodes<4.0.0,>=3.2.0->spacy<3.8.0,>=3.7.2->en-core-web-sm==3.7.1) (1.2.1)\n","Requirement already satisfied: markdown-it-py>=2.2.0 in /usr/local/lib/python3.11/dist-packages (from rich>=10.11.0->typer<1.0.0,>=0.3.0->spacy<3.8.0,>=3.7.2->en-core-web-sm==3.7.1) (3.0.0)\n","Requirement already satisfied: pygments<3.0.0,>=2.13.0 in /usr/local/lib/python3.11/dist-packages (from rich>=10.11.0->typer<1.0.0,>=0.3.0->spacy<3.8.0,>=3.7.2->en-core-web-sm==3.7.1) (2.18.0)\n","Requirement already satisfied: wrapt in /usr/local/lib/python3.11/dist-packages (from smart-open<8.0.0,>=5.2.1->weasel<0.5.0,>=0.1.0->spacy<3.8.0,>=3.7.2->en-core-web-sm==3.7.1) (1.17.0)\n","Requirement already satisfied: mdurl~=0.1 in /usr/local/lib/python3.11/dist-packages (from markdown-it-py>=2.2.0->rich>=10.11.0->typer<1.0.0,>=0.3.0->spacy<3.8.0,>=3.7.2->en-core-web-sm==3.7.1) (0.1.2)\n","\u001b[38;5;2m✔ Download and installation 
successful\u001b[0m\n","You can now load the package via spacy.load('en_core_web_sm')\n","\u001b[38;5;3m⚠ Restart to reload dependencies\u001b[0m\n","If you are in a Jupyter or Colab notebook, you may need to restart Python in\n","order to load all the package's dependencies. You can do this by selecting the\n","'Restart kernel' or 'Restart runtime' option.\n"]}],"source":["!pip install spacy --quiet\n","!python -m spacy download en_core_web_sm"]},
{"cell_type":"code","execution_count":7,"metadata":{"id":"JB2EMdWHHjgZ","executionInfo":{"status":"ok","timestamp":1737405719128,"user_tz":480,"elapsed":10757,"user":{"displayName":"Sanjiv Das","userId":"06377870171053566924"}}},"outputs":[],"source":["# Set up spaCy\n","import spacy\n","import string\n","from spacy.lang.en.stop_words import STOP_WORDS\n","\n","# Punctuation characters to drop\n","punctuations = string.punctuation\n","\n","# Load the small English pipeline (tokenizer, tagger, parser, lemmatizer, NER)\n","nlp = spacy.load('en_core_web_sm')\n","\n","# spaCy's built-in set of English stop words\n","stop_words = STOP_WORDS\n","\n","# Tokenizer function passed to the scikit-learn vectorizers\n","def spacy_tokenizer(sentence):\n","    # Run the spaCy pipeline to get a Doc with linguistic annotations\n","    mytokens = nlp(sentence)\n","\n","    # Lemmatize and lowercase each token (spaCy v3 no longer emits the v2 -PRON- placeholder)\n","    mytokens = [ word.lemma_.lower().strip() for word in mytokens ]\n","\n","    # Remove stop words and punctuation\n","    mytokens = [ word for word in mytokens if word not in stop_words and word not in punctuations ]\n","\n","    # Return the preprocessed list of tokens\n","    return mytokens"]},
{"cell_type":"code","execution_count":8,"metadata":{"id":"SnoNo-GnHjgb","executionInfo":{"status":"ok","timestamp":1737405719128,"user_tz":480,"elapsed":4,"user":{"displayName":"Sanjiv Das","userId":"06377870171053566924"}}},"outputs":[],"source":["# Custom scikit-learn transformer that applies clean_text to every document\n","class 
prepare_data(TransformerMixin):\n","    def transform(self, X, **transform_params):\n","        # Clean each document\n","        return [clean_text(text) for text in X]\n","\n","    def fit(self, X, y=None, **fit_params):\n","        # Nothing to learn; the transformer is stateless\n","        return self\n","\n","    def get_params(self, deep=True):\n","        return {}\n","\n","# Basic function to clean the text\n","def clean_text(text):\n","    # Strip leading/trailing whitespace and convert the text to lowercase\n","    return text.strip().lower()"]},
{"cell_type":"code","execution_count":9,"metadata":{"id":"09BBPmBYHjgd","executionInfo":{"status":"ok","timestamp":1737405719128,"user_tz":480,"elapsed":3,"user":{"displayName":"Sanjiv Das","userId":"06377870171053566924"}}},"outputs":[],"source":["# Bag-of-words and TF-IDF vectorizers, both tokenizing with spaCy\n","bow_vector = CountVectorizer(tokenizer = spacy_tokenizer, ngram_range=(1,1))\n","tfidf_vector = TfidfVectorizer(tokenizer = spacy_tokenizer)"]},{"cell_type":"code","execution_count":10,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"elapsed":3,"status":"ok","timestamp":1737405719128,"user":{"displayName":"Sanjiv Das","userId":"06377870171053566924"},"user_tz":480},"id":"0BAKcSS8Hjgf","outputId":"99e7a22c-4e79-4ef3-ed50-ca1eee7c0da5"},"outputs":[{"output_type":"execute_result","data":{"text/plain":["Index(['id', 'sentiment', 'review'], dtype='object')"]},"metadata":{},"execution_count":10}],"source":["df.columns"]},
{"cell_type":"code","execution_count":11,"metadata":{"id":"0c7hwerwHjgh","executionInfo":{"status":"ok","timestamp":1737405719480,"user_tz":480,"elapsed":354,"user":{"displayName":"Sanjiv Das","userId":"06377870171053566924"}}},"outputs":[],"source":["from sklearn.model_selection import train_test_split\n","\n","X = df['review'] # the features we want to analyze\n","ylabels = df['sentiment'] # the labels, or answers, we want to test against\n","\n","# Hold out 30% of the reviews for testing\n","X_train, X_test, y_train, y_test = train_test_split(X, ylabels, 
test_size=0.3)"]},{"cell_type":"code","execution_count":12,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"elapsed":289,"status":"ok","timestamp":1737405719767,"user":{"displayName":"Sanjiv Das","userId":"06377870171053566924"},"user_tz":480},"id":"vNhTzurXHjgj","outputId":"e9ed3c6d-ef7f-4e75-f5d2-17c8f573eb4c"},"outputs":[{"output_type":"stream","name":"stdout","text":["CPU times: user 197 ms, sys: 23.1 ms, total: 220 ms\n","Wall time: 359 ms\n"]}],"source":["%%time\n","# Logistic regression classifier (no regularization penalty)\n","from sklearn.linear_model import LogisticRegression\n","classifier = LogisticRegression(penalty=None, max_iter=1000, tol=0.001)"]},
{"cell_type":"code","source":["%%time\n","# Create pipeline: clean text, vectorize, classify\n","pipe = Pipeline([(\"cleaner\", prepare_data()),\n","                 ('vectorizer', tfidf_vector), # swap in bow_vector to try bag-of-words counts\n","                 ('classifier', classifier)])\n","\n","# Fit the pipeline on the training data\n","pipe.fit(X_train,y_train)"],"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":288},"id":"VSU5n0TsNWiG","executionInfo":{"status":"ok","timestamp":1737405936350,"user_tz":480,"elapsed":216584,"user":{"displayName":"Sanjiv Das","userId":"06377870171053566924"}},"outputId":"8d7b10b2-06a9-47c2-8223-372aafeb11dd"},"execution_count":13,"outputs":[{"output_type":"stream","name":"stderr","text":["/usr/local/lib/python3.11/dist-packages/sklearn/feature_extraction/text.py:517: UserWarning: The parameter 'token_pattern' will not be used since 'tokenizer' is not None'\n"," warnings.warn(\n"]},{"output_type":"stream","name":"stdout","text":["CPU times: user 3min 16s, sys: 3.5 s, total: 3min 19s\n","Wall time: 3min 36s\n"]},{"output_type":"execute_result","data":{"text/plain":["Pipeline(steps=[('cleaner', <__main__.prepare_data object at 0x786d79834310>),\n","                ('vectorizer',\n","                 TfidfVectorizer(tokenizer=<function spacy_tokenizer at 0x786d7baf9120>)),\n","                ('classifier',\n","                 LogisticRegression(max_iter=1000, penalty=None, tol=0.001))])"],"text/html":["
"]},"metadata":{},"execution_count":13}]},{"cell_type":"code","execution_count":14,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"elapsed":88768,"status":"ok","timestamp":1737406025109,"user":{"displayName":"Sanjiv Das","userId":"06377870171053566924"},"user_tz":480},"id":"UWdSA0QwHjgl","outputId":"4d350e94-40ca-4760-c523-cf12a5808173"},"outputs":[{"output_type":"stream","name":"stdout","text":["Logistic Regression Accuracy: 0.838\n","Logistic Regression Precision: 0.8333333333333334\n","Logistic Regression Recall: 0.8560311284046692\n","CPU times: user 1min 20s, sys: 264 ms, total: 1min 20s\n","Wall time: 1min 28s\n"]}],"source":["%%time\n","from sklearn import metrics\n","# Predict sentiment for the held-out test reviews\n","predicted = pipe.predict(X_test)\n","\n","# Accuracy, precision, and recall on the test set\n","print(\"Logistic Regression Accuracy:\",metrics.accuracy_score(y_test, predicted))\n","print(\"Logistic Regression Precision:\",metrics.precision_score(y_test, predicted))\n","print(\"Logistic Regression Recall:\",metrics.recall_score(y_test, predicted))"]},
{"cell_type":"code","execution_count":15,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"elapsed":6,"status":"ok","timestamp":1737406025109,"user":{"displayName":"Sanjiv Das","userId":"06377870171053566924"},"user_tz":480},"id":"xTQang5YHjgn","outputId":"e3bae9ee-692a-4dae-b88c-ee88fef5e85a"},"outputs":[{"output_type":"stream","name":"stdout","text":["acc = 0.838\n","[[597 132]\n"," [111 660]]\n"]}],"source":["import numpy as np\n","from sklearn.metrics import confusion_matrix\n","\n","# Rows are true labels, columns are predicted labels\n","cm = confusion_matrix(y_test, predicted)\n","# Accuracy = correct predictions (the diagonal) divided by all predictions\n","acc = np.trace(cm) / cm.sum()\n","print(\"acc =\",acc)\n","print(cm)"]}],"metadata":{"celltoolbar":"Slideshow","colab":{"provenance":[],"toc_visible":true},"kernelspec":{"display_name":"Python 3","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.8.12"}},"nbformat":4,"nbformat_minor":0}