21. Document Similarity#
There are several measures of similarity between documents. Some, but not all, exploit the vector representations of documents.
from google.colab import drive
drive.mount('/content/drive') # Add My Drive/<>
import os
os.chdir('drive/My Drive')
os.chdir('Books_Writings/NLPBook/')
Mounted at /content/drive
%%capture
%pylab inline
import pandas as pd
import os
from IPython.display import Image
# %load_ext rpy2.ipython
21.1. Cosine Similarity in the Text Domain#
In this segment we will learn some popular functions on text that are used in practice. One of the first things we often want to do is find similar text or sentences (think of web search as one application). Since documents are vectors in the term-document matrix (TDM), we may want to find the closest vectors or compute the distance between vectors. The standard measure is cosine similarity between vectors \(A\) and \(B\):

\[
\cos(\theta) = \frac{A \cdot B}{||A|| \; ||B||}
\]

where \(||A|| = \sqrt{A \cdot A}\) is the norm of \(A\), i.e., the square root of the dot product of \(A\) with itself. This gives the cosine of the angle between the two vectors, which is zero for orthogonal vectors and 1 for vectors pointing in the same direction.
For a collection of distance measures, see: https://towardsdatascience.com/9-distance-measures-in-data-science-918109d069fa
# COSINE SIMILARITY, computed directly from the formula above
A = array([0,3,4,1,7,0,1])
B = array([0,4,3,0,6,1,1])
cos = A.dot(B)/(sqrt(A.dot(A)) * sqrt(B.dot(B)))
print('Cosine similarity = ',cos)
# Using sklearn (returns the full pairwise similarity matrix)
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity([A, B], dense_output=True)
Cosine similarity = 0.9682727993019339
array([[1. , 0.9682728],
[0.9682728, 1. ]])
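As a cross-check, scipy provides many of the distance measures from the article linked above; note that scipy's cosine is a distance, so similarity is one minus it (a small sketch, not part of the original notebook):

from scipy.spatial.distance import cosine, euclidean, cityblock

print('Cosine similarity =', 1 - cosine(A, B))   # matches the manual computation
print('Euclidean distance =', euclidean(A, B))
print('Manhattan distance =', cityblock(A, B))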
21.2. Minimum Edit Distance#
The MED (minimum edit distance) is the minimum number of single-character edits (insertions, deletions, or substitutions) needed to transform one string into another. The strings could be words, sentences, or even documents. This is also known as the Levenshtein distance.
For example, to convert Apple into Amazon, we change p->m, p->a, l->z, e->o, and add n: 5 operations in all.
Properties:
It is zero if and only if the two strings are identical.
It is at least the difference in the lengths of the two strings.
It is at most the length of the longer string.
It satisfies the triangle inequality: the Levenshtein distance between two strings is no greater than the sum of their Levenshtein distances from a third string.
See the Lazy Prices paper: https://hbswk.hbs.edu/item/lazy-prices, which uses MED for document similarity. Get the published paper through the library for free: https://onlinelibrary.wiley.com/doi/epdf/10.1111/jofi.12885
(Adapted from kristinauko/challenge_100)
import builtins   # %pylab shadows min() with numpy's version, so use the builtin

def min_edit_distance(string1, string2):
    # Approximation: the length difference plus position-by-position
    # mismatches over the shorter string (no realignment of characters)
    difference = abs(len(string1) - len(string2))
    for i in range(builtins.min(len(string1), len(string2))):
        if string1[i] != string2[i]:
            difference += 1
    return difference
print(min_edit_distance("Amazon", "Apple"))
print(min_edit_distance("Amazon", "Amazing"))
5
2
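The function above counts positional mismatches plus the length difference. It matches the true Levenshtein distance on these two examples, but it overestimates whenever the optimal alignment shifts: for "abc" vs. "bc" it returns 3, while a single deletion suffices. For reference, here is a minimal sketch (not from the original notebook) of the standard Wagner-Fischer dynamic program; it works on any sequences, including the word-token lists used in the next section.

import builtins  # again avoiding numpy's min() from %pylab

def levenshtein(s, t):
    # Classic DP: at the start of row i, prev[j] holds the edit
    # distance between s[:i-1] and t[:j]
    prev = list(range(len(t) + 1))
    for i in range(1, len(s) + 1):
        curr = [i] + [0] * len(t)
        for j in range(1, len(t) + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            curr[j] = builtins.min(prev[j] + 1,         # deletion
                                   curr[j - 1] + 1,     # insertion
                                   prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[len(t)]

print(levenshtein("Amazon", "Apple"))    # 5, agrees with the simple version
print(levenshtein("Amazon", "Amazing"))  # 2, agrees with the simple version
print(levenshtein("abc", "bc"))          # 1, vs. 3 from the simple version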
21.3. Simple Similarity#
(Example: Used in the Lazy Prices paper.)
This measure compares two documents word by word or character by character. It takes an old document \(D_1\) and a new document \(D_2\) and counts the additions, deletions, and changes of words, normalized by the sum of the word counts of the two documents.
It is a simple side-by-side comparison, much like “Track Changes” in Microsoft Word or the “diff” utility in Unix/Linux.
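Concretely, matching the computation in the code below:

\[
\text{SIMPSIM} = \frac{n_{+} + n_{-}}{|D_1| + |D_2|}
\]

where \(n_{+}\) is the number of tokens added, \(n_{-}\) the number of tokens deleted, and \(|D_i|\) the number of tokens in document \(i\).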
First we apply MED at the word level and then compute Simple Similarity. (Note that once the two token sequences fall out of alignment, the positional min_edit_distance above counts nearly every position as a mismatch, which is why the full-length MED below is so large.)
import os
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')
import difflib # https://docs.python.org/3/library/difflib.html
from nltk.tokenize import word_tokenize
D1 = "Some areas around the world that were devastated by the coronavirus in the spring — and are now tightening rules to head off a second wave — are facing resistance from residents who are exhausted, confused and frustrated."
print(D1, "\n")
D1 = word_tokenize(D1)
D2 = "Some parts of the world devastated by the terrible coronavirus in the winter — have now tightened rules to head off a second wave but are facing resistance from residents who are exhausted, bewildered and angry."
print(D2, "\n")
D2 = word_tokenize(D2)
print("Length D1: ",len(D1[:5]),D1[:5])
print("Length D2: ",len(D2[:5]),D2[:5])
print("MED =",min_edit_distance(D1[:5],D2[:5]))
print("Length D1: ",len(D1),D1)
print("Length D2: ",len(D2),D2)
print("MED =",min_edit_distance(D1,D2))
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data] Unzipping tokenizers/punkt_tab.zip.
Some areas around the world that were devastated by the coronavirus in the spring — and are now tightening rules to head off a second wave — are facing resistance from residents who are exhausted, confused and frustrated.
Some parts of the world devastated by the terrible coronavirus in the winter — have now tightened rules to head off a second wave but are facing resistance from residents who are exhausted, bewildered and angry.
Length D1: 5 ['Some', 'areas', 'around', 'the', 'world']
Length D2: 5 ['Some', 'parts', 'of', 'the', 'world']
MED = 2
Length D1: 40 ['Some', 'areas', 'around', 'the', 'world', 'that', 'were', 'devastated', 'by', 'the', 'coronavirus', 'in', 'the', 'spring', '—', 'and', 'are', 'now', 'tightening', 'rules', 'to', 'head', 'off', 'a', 'second', 'wave', '—', 'are', 'facing', 'resistance', 'from', 'residents', 'who', 'are', 'exhausted', ',', 'confused', 'and', 'frustrated', '.']
Length D2: 38 ['Some', 'parts', 'of', 'the', 'world', 'devastated', 'by', 'the', 'terrible', 'coronavirus', 'in', 'the', 'winter', '—', 'have', 'now', 'tightened', 'rules', 'to', 'head', 'off', 'a', 'second', 'wave', 'but', 'are', 'facing', 'resistance', 'from', 'residents', 'who', 'are', 'exhausted', ',', 'bewildered', 'and', 'angry', '.']
MED = 37
# Word-level diff: ndiff marks added tokens with '+ ' and deleted tokens with '- '
res = list(difflib.ndiff(D1,D2))
print("DIFFs =",res)
nplus = len([j for j in res if j.startswith('+')])
nminus = len([j for j in res if j.startswith('-')])
print("SIMPSIM =",nplus,nminus,(nplus+nminus)/(len(D1)+len(D2)))
DIFFs = [' Some', '- areas', '- around', '+ parts', '+ of', ' the', ' world', '- that', '- were', ' devastated', ' by', ' the', '+ terrible', ' coronavirus', ' in', ' the', '- spring', '+ winter', ' —', '+ have', '- and', '- are', ' now', '- tightening', '+ tightened', ' rules', ' to', ' head', ' off', ' a', ' second', ' wave', '- —', '+ but', ' are', ' facing', ' resistance', ' from', ' residents', ' who', ' are', ' exhausted', ' ,', '- confused', '+ bewildered', ' and', '- frustrated', '+ angry', ' .']
SIMPSIM = 9 11 0.2564102564102564
21.4. Sentence Similarity via Language Model Representation#
We can determine sentence similarity based on raw text using set-based similarity methods, as we will see later in this notebook.
However, computing similarity is basically a mathematical operation and requires quantification of text into vectors, matrices, or tensors. We have seen an example of such similarity in the computation of cosine similarity above. In that example, we used simple word-count vectors.
However, there are other ways of transforming sentences into fixed-length vectors so that we can compute cosine similarity. These are known as “embeddings”, i.e., we convert the text of a sentence into a numeric vector of dimension \(n\) which can be thought of as an embedding of that sentence into \(n\)-dimensional space.
Two popular approaches are traditional word embeddings such as word2vec and BERT model embeddings. Word2vec creates word embeddings, and there is a corresponding package for sentence embeddings, sent2vec.
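Before turning to BERT, here is a minimal sketch of the word-embedding route (not part of the original notebook): average pretrained word vectors to get one fixed-length vector per sentence. It assumes gensim is installed and downloads the small GloVe model from gensim-data.

import numpy as np
import gensim.downloader as api   # requires: pip install gensim

glove = api.load('glove-wiki-gigaword-50')   # 50-dimensional word vectors

def sentence_vector(sentence):
    # Average the vectors of in-vocabulary words to embed the whole sentence
    words = [w for w in sentence.lower().split() if w in glove]
    return np.mean([glove[w] for w in words], axis=0)

v1 = sentence_vector("There are several approaches to learn NLP")
v2 = sentence_vector("BERT is an amazing NLP language model")
print('Cosine similarity =', v1.dot(v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))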
%%time
!pip install sent2vec
Collecting sent2vec
  Downloading sent2vec-0.3.0-py3-none-any.whl.metadata (5.8 kB)
...
Successfully installed gensim-4.3.3 numpy-1.26.4 nvidia-cublas-cu12-12.4.5.8 nvidia-cuda-cupti-cu12-12.4.127 nvidia-cuda-nvrtc-cu12-12.4.127 nvidia-cuda-runtime-cu12-12.4.127 nvidia-cudnn-cu12-9.1.0.70 nvidia-cufft-cu12-11.2.1.3 nvidia-curand-cu12-10.3.5.147 nvidia-cusolver-cu12-11.6.1.9 nvidia-cusparse-cu12-12.3.1.170 nvidia-nvjitlink-cu12-12.4.127 scipy-1.13.1 sent2vec-0.3.0
from scipy import spatial # for cosine distance
from sent2vec.vectorizer import Vectorizer # uses DistilBERT

sentences = [
    "There are several approaches to learn NLP.",
    "BERT is an amazing NLP language model.",
    "We can use embedding, encoding, or vectorizing to represent language.",
]
vectorizer = Vectorizer()
vectorizer.run(sentences)          # run() populates vectorizer.vectors
vectors_bert = vectorizer.vectors
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-10-48b66712e256> in <cell line: 0>()
      1 from scipy import spatial # for cosine distance
----> 2 from sent2vec.vectorizer import Vectorizer # uses DistilBERT
...
ValueError: numpy.dtype size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject

The import fails because the kernel is still running the numpy that was loaded before pip downgraded it to 1.26.4, so gensim's compiled extensions see a mismatched numpy ABI. Restarting the Colab runtime after the install (so the freshly installed numpy is loaded) and re-running the cells resolves the error.
print(len(vectors_bert))               # number of sentence vectors
print([len(v) for v in vectors_bert])  # each is a fixed-length vector (768 for DistilBERT)
print(vectors_bert[0])
print(sentences)
dist_1 = 1 - spatial.distance.cosine(vectors_bert[0], vectors_bert[1]) # Similarity = 1 - Distance
print(dist_1)
dist_2 = 1 - spatial.distance.cosine(vectors_bert[0], vectors_bert[2])
print(dist_2)
dist_3 = 1 - spatial.distance.cosine(vectors_bert[1], vectors_bert[2])
print(dist_3)
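If the sent2vec import cannot be fixed, the same idea can be sketched directly with the Hugging Face transformers library (an assumption-laden workaround, not sent2vec's exact internals): mean-pool DistilBERT's token embeddings into one vector per sentence.

import torch
from transformers import AutoTokenizer, AutoModel
from scipy.spatial.distance import cosine

tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")

# Reuses the `sentences` list defined above
enc = tok(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    out = model(**enc).last_hidden_state            # (batch, tokens, 768)
mask = enc["attention_mask"].unsqueeze(-1)          # zero out padding tokens
vecs = ((out * mask).sum(dim=1) / mask.sum(dim=1)).numpy()  # mean pooling

print(1 - cosine(vecs[0], vecs[1]))   # sentence similarity, as above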
To summarize, here is a graphic that depicts various distance measures.
Image("NLP_images/Distance_Measures_in_DataScience.png", width=500)