9. Parts of Speech (POS) Tagging

%%capture
#INCLUDING SCIENTIFIC AND NUMERICAL COMPUTING LIBRARIES
#Run this code to make sure that you have all the libraries at one go.
%pylab inline
import os
import pandas as pd
%load_ext rpy2.ipython

9.1. Information Extraction

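Extracting information from raw text is often the first step in NLP work. The simplest form of extraction is pattern matching. Here we see how easy it is with spaCy, whose rule-based Matcher finds token sequences by attributes such as the part-of-speech tag or the lowercased text.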

%%capture
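# Install spaCy (if needed) and download the small English pipeline
# !pip install spacy
!python -m spacy download en_core_web_sm
# The "en" shortcut is deprecated in spaCy v3; use the full package name above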

import spacy

nlp = spacy.load("en_core_web_sm")
import re
import string
import nltk
from tqdm import tqdm
from spacy.matcher import Matcher
from spacy.tokens import Span
from spacy import displacy
# Load text
text = "Records Obtained by The Times After Years of Secrecy. The Times has obtained tax-return data for President Trump extending over more than two decades. It tells a story fundamentally different from the one he’s sold to the public. Mr. Trump’s finances are under stress, beset by hundreds of millions in debt coming due and an I.R.S. audit that could cost him over $100 million. He paid $750 in federal income taxes in 2016, and nothing at all in 10 of the prior 15 years, largely because he lost so much money."
print(text)

# spaCy object
doc = nlp(text)
type(doc)
Records Obtained by The Times After Years of Secrecy. The Times has obtained tax-return data for President Trump extending over more than two decades. It tells a story fundamentally different from the one he’s sold to the public. Mr. Trump’s finances are under stress, beset by hundreds of millions in debt coming due and an I.R.S. audit that could cost him over $100 million. He paid $750 in federal income taxes in 2016, and nothing at all in 10 of the prior 15 years, largely because he lost so much money.
spacy.tokens.doc.Doc
# Analyze each token for its role in the text
for tok in doc:
  print(tok.text, "-->",tok.dep_,"-->", tok.pos_)
Records --> ROOT --> NOUN
Obtained --> acl --> VERB
by --> agent --> ADP
The --> det --> DET
Times --> pobj --> PROPN
After --> prep --> ADP
Years --> pobj --> NOUN
of --> prep --> ADP
Secrecy --> pobj --> PROPN
. --> punct --> PUNCT
The --> det --> DET
Times --> nsubj --> PROPN
has --> aux --> AUX
obtained --> ROOT --> VERB
tax --> compound --> NOUN
- --> punct --> PUNCT
return --> compound --> NOUN
data --> dobj --> NOUN
for --> prep --> ADP
President --> compound --> PROPN
Trump --> nsubj --> PROPN
extending --> advcl --> VERB
over --> prep --> ADP
more --> amod --> ADJ
than --> quantmod --> ADP
two --> nummod --> NUM
decades --> dobj --> NOUN
. --> punct --> PUNCT
It --> nsubj --> PRON
tells --> ROOT --> VERB
a --> det --> DET
story --> dobj --> NOUN
fundamentally --> advmod --> ADV
different --> amod --> ADJ
from --> prep --> ADP
the --> det --> DET
one --> pobj --> NOUN
he --> nsubjpass --> PRON
’s --> auxpass --> AUX
sold --> relcl --> VERB
to --> prep --> ADP
the --> det --> DET
public --> pobj --> NOUN
. --> punct --> PUNCT
Mr. --> compound --> PROPN
Trump --> poss --> PROPN
’s --> case --> PART
finances --> nsubj --> NOUN
are --> ROOT --> AUX
under --> prep --> ADP
stress --> pobj --> NOUN
, --> punct --> PUNCT
beset --> advcl --> VERB
by --> agent --> ADP
hundreds --> quantmod --> NOUN
of --> quantmod --> ADP
millions --> pobj --> NOUN
in --> prep --> ADP
debt --> pobj --> NOUN
coming --> acl --> VERB
due --> acomp --> ADJ
and --> cc --> CCONJ
an --> det --> DET
I.R.S. --> compound --> PROPN
audit --> conj --> NOUN
that --> nsubj --> PRON
could --> aux --> AUX
cost --> relcl --> VERB
him --> dative --> PRON
over --> quantmod --> ADP
$ --> quantmod --> SYM
100 --> compound --> NUM
million --> dobj --> NUM
. --> punct --> PUNCT
He --> nsubj --> PRON
paid --> ROOT --> VERB
$ --> nmod --> SYM
750 --> dobj --> NUM
in --> prep --> ADP
federal --> amod --> ADJ
income --> compound --> NOUN
taxes --> pobj --> NOUN
in --> prep --> ADP
2016 --> pobj --> NUM
, --> punct --> PUNCT
and --> cc --> CCONJ
nothing --> conj --> PRON
at --> advmod --> ADV
all --> advmod --> ADV
in --> prep --> ADP
10 --> pobj --> NUM
of --> prep --> ADP
the --> det --> DET
prior --> amod --> ADJ
15 --> nummod --> NUM
years --> pobj --> NOUN
, --> punct --> PUNCT
largely --> advmod --> ADV
because --> mark --> SCONJ
he --> nsubj --> PRON
lost --> advcl --> VERB
so --> advmod --> ADV
much --> amod --> ADJ
money --> dobj --> NOUN
. --> punct --> PUNCT
# Look for a pattern: https://spacy.io/usage/rule-based-matching
matcher = Matcher(nlp.vocab)
pattern1 = [{'POS': 'NOUN'},{'LOWER': 'in'},{'POS': 'NOUN'}] # This is a triple and can define a relationship
matcher.add('Match1', [pattern1]) # add the pattern to the matcher
matches = matcher(doc)
print(matches)

for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # Get string representation
    span = doc[start:end]  # The matched span
    print(match_id, string_id, start, end, span.text)
[(2356062832825383259, 56, 59)]
2356062832825383259 Match1 56 59 millions in debt
# Look for a pattern: https://spacy.io/usage/rule-based-matching
matcher = Matcher(nlp.vocab)
pattern2 = [{'POS': 'DET'},{'POS': 'NOUN'}] # A determiner followed by a noun, e.g. "a story"
matcher.add('Match2', [pattern1, pattern2]) # add both patterns under one key; a match on either is reported as Match2
matches = matcher(doc)
print(matches)

for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # Get string representation
    span = doc[start:end]  # The matched span
    print(match_id, string_id, start, end, span.text)
[(2988625564965749595, 30, 32), (2988625564965749595, 35, 37), (2988625564965749595, 41, 43), (2988625564965749595, 56, 59)]
2988625564965749595 Match2 30 32 a story
2988625564965749595 Match2 35 37 the one
2988625564965749595 Match2 41 43 the public
2988625564965749595 Match2 56 59 millions in debt

You can combine these token patterns with regular expressions, so that a single rule tests both the characters of a token and its part-of-speech tag.
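As a minimal sketch of mixing a character-level test with a POS-level test, the example below separates years from dollar amounts among the `NUM` tokens. To stay runnable without loading a model, it uses plain `(text, POS)` tuples copied from the tagger output above in place of a live `Doc`. The names `tagged`, `YEAR`, and `classify_numbers` are hypothetical and chosen only for illustration.

```python
import re

# (text, POS) pairs copied from the tagger output above for
# "He paid $750 in federal income taxes in 2016"; a stand-in for a spaCy Doc.
tagged = [
    ("He", "PRON"), ("paid", "VERB"), ("$", "SYM"), ("750", "NUM"),
    ("in", "ADP"), ("federal", "ADJ"), ("income", "NOUN"),
    ("taxes", "NOUN"), ("in", "ADP"), ("2016", "NUM"),
]

# Character-level test: a four-digit year starting with 19 or 20.
YEAR = re.compile(r"(19|20)\d{2}")

def classify_numbers(tokens):
    """Label each NUM token as 'year' or 'amount' using POS tags plus a regex."""
    labels = []
    for i, (text, pos) in enumerate(tokens):
        if pos != "NUM":          # POS-level test: only numeric tokens
            continue
        after_currency = i > 0 and tokens[i - 1][1] == "SYM"
        if YEAR.fullmatch(text) and not after_currency:
            labels.append((text, "year"))
        else:
            labels.append((text, "amount"))
    return labels

result = classify_numbers(tagged)
print(result)
```

In spaCy itself, a similar rule can be written inside a `Matcher` token pattern with the `REGEX` operator, for example `{'TEXT': {'REGEX': r'^(19|20)\d{2}$'}, 'POS': 'NUM'}`. Treat that as a pointer to the rule-based matching guide linked above rather than as tested code.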