14. Using Amazon Textract#

%%capture
# Scientific and numerical computing libraries
# Run this cell once to install and import all the libraries in one go.
%pylab inline
import os
!pip install ipypublish
from ipypublish import nb_setup
import pandas as pd
%load_ext rpy2.ipython
# Basic lines of code needed to import a data file with permissions from Google Drive
from google.colab import drive
# drive.mount("/content/drive", force_remount=True)
drive.mount('/content/drive')
os.chdir("drive/My Drive/Books_Writings/NLPBook/")
Mounted at /content/drive

14.1. Sample Text in a JPEG file for OCR#

nb_setup.images_hconcat(["NLP_images/wsj_text.jpeg"], width=600)

14.2. AWS Textract Steps#

  1. Create an S3 bucket; bucket names must be all lower case.

  2. Upload the JPEG file to the bucket.

  3. Call Textract on the file using the AWS command-line interface (CLI).

The CLI reads credentials from the file ~/.aws/credentials (placeholders shown here; never publish real keys):

[default]
aws_access_key_id = <YOUR_ACCESS_KEY_ID>
aws_secret_access_key = <YOUR_SECRET_ACCESS_KEY>
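The three steps above can be sketched with the AWS CLI. The bucket name below is a hypothetical example; yours must be all lower case and globally unique, and valid credentials must be configured first.

```shell
# 1. Create an S3 bucket (all lower case, globally unique name).
aws s3 mb s3://my-nlp-demo-bucket

# 2. Upload the JPEG file to the bucket.
aws s3 cp wsj_text.jpeg s3://my-nlp-demo-bucket/

# 3. Run Textract on the uploaded image, extracting tables and forms.
aws textract analyze-document \
  --document '{"S3Object":{"Bucket":"my-nlp-demo-bucket","Name":"wsj_text.jpeg"}}' \
  --feature-types '["TABLES","FORMS"]'
```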

Textract returns a JSON structure as shown below.

text = !aws textract analyze-document --document '{"S3Object":{"Bucket":"nlp-course-sanjivda","Name":"wsj_text.jpeg"}}' --feature-types '["TABLES","FORMS"]'

text
['/bin/bash: line 1: aws: command not found']

Note: the AWS CLI is not installed in this Colab runtime, so `text` holds an error message and the parsing cells below return empty results. With the CLI installed (e.g., via `pip install awscli`) and credentials configured, `text` contains the Textract JSON response.

14.3. Examine the text fields#

f = text.fields()
f
[['/bin/bash:', 'line', '1:', 'aws:', 'command', 'not', 'found']]

14.4. Tokenize the text in the JSON structure#

res = [j[1:] for j in f if j[0]=='"Text":']
print(res)
[]
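With a live Textract response, each element of `f` holds the whitespace-split tokens of one JSON line, and the comprehension keeps the tokens following each `"Text":` key. A self-contained sketch of the same filter on hypothetical lines:

```python
# Toy stand-in for the whitespace-split Textract JSON lines (hypothetical values).
raw_lines = [
    '"BlockType": "LINE",',
    '"Text": "Stocks rallied on",',
    '"Text": "Tuesday morning.",',
]
f = [ln.split() for ln in raw_lines]

# Keep the tokens after the '"Text":' key on each matching line.
res = [toks[1:] for toks in f if toks[0] == '"Text":']
print(res)  # [['"Stocks', 'rallied', 'on",'], ['"Tuesday', 'morning.",']]
```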

14.5. Reconstruct the text#

import re

res2 = [" ".join(j) for j in res]
res2 = " ".join(res2)
print(res2)
print('------------')
res2 = re.sub('", "', ' ', res2)
print(res2)
------------
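Splitting and re-joining the JSON text token by token is fragile; parsing the response with the standard json module is more robust. A sketch on a hypothetical miniature response (real Textract responses carry many more fields per block):

```python
import json

# Hypothetical miniature Textract response for illustration only.
sample = json.dumps({
    "Blocks": [
        {"BlockType": "LINE", "Text": "Quarterly earnings rose"},
        {"BlockType": "LINE", "Text": "by 12 percent."},
        {"BlockType": "WORD", "Text": "Quarterly"},
    ]
})

resp = json.loads(sample)
# Keep only LINE blocks; WORD blocks repeat the same text token by token.
lines = [b["Text"] for b in resp["Blocks"] if b["BlockType"] == "LINE"]
reconstructed = " ".join(lines)
print(reconstructed)  # Quarterly earnings rose by 12 percent.
```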