4. Regular Expressions

A quick introduction to Regex.

Here is an awesome video from Socratica.

%%capture
#INCLUDING SCIENTIFIC AND NUMERICAL COMPUTING LIBRARIES
#Run this code to make sure that you have all the libraries at one go.
%pylab inline
import os
import pandas as pd

4.1. Regex

Regular expressions help with parsing tasks that are too complex for plain string methods.

Read: https://automatetheboringstuff.com/2e/chapter7/

For example, the pattern '@[A-Za-z0-9_]+' matches every substring that starts with '@' and is followed by one or more of:

  • a capital letter ('A-Z')

  • a lowercase letter ('a-z')

  • a digit ('0-9')

  • an underscore ('_')
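As a minimal sketch of the pattern described above, the snippet below applies it with `re.findall`. The sample string and the variable names are made up for illustration. Note that the pattern matches anywhere in the text, so it also picks up the '@b' inside an email address.

```python
import re

# Hypothetical sample text, invented for this illustration
tweet = "Thanks @scu_leavey and @Data_Sci2 for hosting! Email me at a@b.com"

# '@' followed by one or more letters, digits, or underscores
handles = re.findall(r'@[A-Za-z0-9_]+', tweet)
print(handles)  # ['@scu_leavey', '@Data_Sci2', '@b']
```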

Cory Doctorow suggests that we teach regex even before we learn programming.

4.2. Character codes

  • \d, a digit from 0-9

  • \D, not \d, any character that is not a digit from 0-9

  • \w, any letter, digit, or underscore

  • \W, not \w

  • \s, whitespace (space, tab, newline)

  • \S, not \s
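The character codes listed above can be tried out on a short string. This is a small sketch; the sample string is hypothetical and was chosen to contain a space, a tab, and a newline.

```python
import re

# Hypothetical sample containing a space, a tab, and a newline
sample = "Room 42\tis_open\n"

digits = re.findall(r'\d', sample)        # ['4', '2']
non_digits = re.findall(r'\D+', sample)   # ['Room ', '\tis_open\n']
words = re.findall(r'\w+', sample)        # ['Room', '42', 'is_open']
non_words = re.findall(r'\W', sample)     # [' ', '\t', '\n']
spaces = re.findall(r'\s', sample)        # [' ', '\t', '\n']
non_spaces = re.findall(r'\S+', sample)   # ['Room', '42', 'is_open']

print(digits, words, spaces)
```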

text = "We the People of the United States, in Order to form a more perfect Union, establish Justice, \
        insure domestic Tranquility, provide for the common defence, promote the general Welfare, \
        and secure the Blessings of Liberty to ourselves and our Posterity, do ordain and establish \
        this Constitution for the United States of America."

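# Tokenize by splitting on single spaces only. Punctuation stays attached to the words,
# and the indentation of the continued lines in `text` produces empty strings.
x = text.split(" ")
[j for j in x if j.isalnum()]  # purely alphanumeric tokens; the result is not stored, so x is unchanged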

print(x)
['We', 'the', 'People', 'of', 'the', 'United', 'States,', 'in', 'Order', 'to', 'form', 'a', 'more', 'perfect', 'Union,', 'establish', 'Justice,', '', '', '', '', '', '', '', '', 'insure', 'domestic', 'Tranquility,', 'provide', 'for', 'the', 'common', 'defence,', 'promote', 'the', 'general', 'Welfare,', '', '', '', '', '', '', '', '', 'and', 'secure', 'the', 'Blessings', 'of', 'Liberty', 'to', 'ourselves', 'and', 'our', 'Posterity,', 'do', 'ordain', 'and', 'establish', '', '', '', '', '', '', '', '', 'this', 'Constitution', 'for', 'the', 'United', 'States', 'of', 'America.']
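The output above still carries commas, periods, and empty strings. One possible way to separate on spaces, periods, and commas in a single step is `re.split` or `re.findall`. The sketch below uses a shortened, hypothetical copy of the preamble (the variable `preamble`) that keeps the same run of extra spaces.

```python
import re

# Shortened copy of the preamble, keeping the run of spaces from the line continuation
preamble = "We the People of the United States, in Order to form a more perfect Union,         establish Justice."

# Split on runs of whitespace, commas, or periods; drop the empty string left at the end
tokens_split = [t for t in re.split(r'[\s,.]+', preamble) if t]

# Alternatively, collect runs of letters directly
tokens_words = re.findall(r'[A-Za-z]+', preamble)

print(tokens_split)
print(tokens_split == tokens_words)  # True
```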
# Find words that contain 'a' or 'A' using re
# Remember x is the list of words from the preamble to the Constitution
import re
[j for j in x if re.search('[Aa]',j)]
['States,',
 'a',
 'establish',
 'Tranquility,',
 'general',
 'Welfare,',
 'and',
 'and',
 'ordain',
 'and',
 'establish',
 'States',
 'America.']
# Checking for phone number patterns (do the same thing in two steps)
pattern = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
res = pattern.search('Call me at 408-554-4000 for SCU')
print(res)
print(res.group()) # group() with no argument returns the entire matched text
print(res[0]) # res[0] is equivalent to res.group(0)
<re.Match object; span=(11, 23), match='408-554-4000'>
408-554-4000
408-554-4000
# One-step form: re.search(pattern, text) compiles and searches in a single call
re.search(r'\d\d\d-\d\d\d-\d\d\d\d', 'Call me at 408-554-4000 for SCU')[0]
'408-554-4000'
# Note that it finds the first match not all matches
res = pattern.search('Call me at 408-554-4000 or 408-551-5700 for SCU')
print(res)
print(res.group())
print(res[0])
<re.Match object; span=(11, 23), match='408-554-4000'>
408-554-4000
408-554-4000
# Grouping components of the search object using parentheses
pattern = re.compile(r'(\d\d\d)-(\d\d\d)-(\d\d\d\d)')
res = pattern.search('Call me at 408-554-4000 for SCU')
print(res)
print([res.group(j) for j in range(4)])
print(res[0])
print(res[1])
print(res[2])
print(res[3])
<re.Match object; span=(11, 23), match='408-554-4000'>
['408-554-4000', '408', '554', '4000']
408-554-4000
408
554
4000
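When only the captured pieces are needed, the match object's `groups()` method returns them as a tuple, which can be unpacked directly. A brief sketch follows; the names `area`, `exchange`, and `line` are illustrative choices, not part of the original example.

```python
import re

pattern = re.compile(r'(\d\d\d)-(\d\d\d)-(\d\d\d\d)')
res = pattern.search('Call me at 408-554-4000 for SCU')

# groups() returns groups 1..n as a tuple (group 0, the whole match, is not included)
area, exchange, line = res.groups()
print(res.groups())  # ('408', '554', '4000')
print(area, exchange, line)
```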
# Special characters need to be 'escaped' with backslashes
pattern = re.compile(r'(\(\d\d\d\)) (\d\d\d)-(\d\d\d\d)')
res = pattern.search('Call me at (408) 554-4000 for SCU')
print(res)
print([res.group(j) for j in range(4)])
print(res[0])
print(res[1])
print(res[2])
print(res[3])
<re.Match object; span=(11, 25), match='(408) 554-4000'>
['(408) 554-4000', '(408)', '554', '4000']
(408) 554-4000
(408)
554
4000
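# Alternation: the pipe character (|) matches either the pattern on its left or the one on its right
# Special characters need to be 'escaped' with backslashes
# Do not put spaces around the pipe: a space in a pattern is a literal character to match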
pattern = re.compile(r'(\(\d\d\d\)) (\d\d\d)-(\d\d\d\d)|(\d\d\d)-(\d\d\d)-(\d\d\d\d)')
res = pattern.search('Call me at (408) 554-4000 or 408-551-5700 for SCU')
print(res)
print([res.group(j) for j in range(4)])
print(res[0])
print(res[1])
print(res[2])
print(res[3])
<re.Match object; span=(11, 25), match='(408) 554-4000'>
['(408) 554-4000', '(408)', '554', '4000']
(408) 554-4000
(408)
554
4000
# Find all matches
# With capture groups, findall returns one tuple per match, with one entry per group.
# Groups belonging to the alternative that did not match appear as empty strings.
pattern = re.compile(r'(\(\d\d\d\)) (\d\d\d)-(\d\d\d\d)|(\d\d\d)-(\d\d\d)-(\d\d\d\d)')
res = pattern.findall('Call me at (408) 554-4000 or 408-551-5700 for SCU')
print(res)
[('(408)', '554', '4000', '', '', ''), ('', '', '', '408', '551', '5700')]
# Without groups
pattern = re.compile(r'\(\d\d\d\) \d\d\d-\d\d\d\d|\d\d\d-\d\d\d-\d\d\d\d')
res = pattern.findall('Call me at (408) 554-4000 or 408-551-5700 for SCU')
print(res)
['(408) 554-4000', '408-551-5700']
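`findall` returns only the matched strings. If the positions are needed as well, `finditer` yields a match object for each hit. The sketch below reuses the pattern without groups; the variable names are illustrative.

```python
import re

pattern = re.compile(r'\(\d\d\d\) \d\d\d-\d\d\d\d|\d\d\d-\d\d\d-\d\d\d\d')
text = 'Call me at (408) 554-4000 or 408-551-5700 for SCU'

# Each item from finditer is a match object, so both the text and the span are available
matches = [(m.group(), m.span()) for m in pattern.finditer(text)]
print(matches)  # [('(408) 554-4000', (11, 25)), ('408-551-5700', (29, 41))]
```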
# Varying the match with optional strings
pattern = re.compile(r'Sanjiv (Ranjan )?Das')
res = pattern.search("My name is Sanjiv Das")
print(res)
res = pattern.search("My name is Sanjiv Ranjan Das")
print(res)
<re.Match object; span=(11, 21), match='Sanjiv Das'>
<re.Match object; span=(11, 28), match='Sanjiv Ranjan Das'>
# Matching zero or more occurrences (*)
pattern = re.compile(r'0*1*2*')  # every part is optional, so the pattern can also match the empty string
res = pattern.search('012890')
print(res)
res = pattern.search('30128 90')
print(res)
res = pattern.findall('30128 90')
print(res)
res = pattern.findall('3 0128 90')
print(res)
res = pattern.search('112890')
print(res)
res = pattern.findall('112890')
print(res)
<re.Match object; span=(0, 3), match='012'>
<re.Match object; span=(0, 0), match=''>
['', '012', '', '', '', '0', '']
['', '', '012', '', '', '', '0', '']
<re.Match object; span=(0, 3), match='112'>
['112', '', '', '0', '']
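Because every part of `0*1*2*` is optional, `findall` reports an empty match at each position where nothing else fits. One simple way to keep only the non-empty matches is to filter the list, as in this small sketch.

```python
import re

pattern = re.compile(r'0*1*2*')

# Empty strings are falsy, so the condition `if m` drops the empty matches
non_empty = [m for m in pattern.findall('30128 90') if m]
print(non_empty)  # ['012', '0']
```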
# What if you use brackets?
# Inside [...] the * is a literal character, so this class matches a single 0, 1, 2, or *
pattern = re.compile(r'[0*1*2*]')
res = pattern.search('312890')
print(res)
res = pattern.search('321890')
print(res)
res = pattern.findall('3 0128 90')
print(res)
res = pattern.search('112890')
print(res)
<re.Match object; span=(1, 2), match='1'>
<re.Match object; span=(1, 2), match='2'>
['0', '1', '2', '0']
<re.Match object; span=(0, 1), match='1'>
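To make the effect of the brackets concrete, the sketch below runs the same class on a hypothetical string that contains literal asterisks, and compares it with the equivalent class written without repetition.

```python
import re

sample = '2*3*0'  # hypothetical input containing literal asterisks

# Inside a character class, * has no special meaning and repeated members are redundant
with_stars = re.findall(r'[0*1*2*]', sample)
equivalent = re.findall(r'[012*]', sample)

print(with_stars)                # ['2', '*', '*', '0']
print(with_stars == equivalent)  # True
```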
# Matching one or more occurrences (+)
pattern = re.compile(r'0+1+2+')
res = pattern.search('312890')
print(res)
res = pattern.findall('3 01218 90')
print(res)
res = pattern.findall('3 0012218 90')
print(res)
res = pattern.search('112890')
print(res)
None
['012']
['00122']
None
# Repeat patterns with braces
pattern = re.compile(r'(xo){3}')  # specific repeat pattern of size 3
res = pattern.search('Bye for now xoxoxo - see you later')
print(res)

res = pattern.search('Bye for now xoxoox - see you later')
print(res)

pattern = re.compile(r'[xo]{3}') # any 3 consecutive characters, each either x or o
res = pattern.search('Bye for now xoxoxoox - see you later')
print(res)

pattern = re.compile(r'[ox]{3}') # the order of characters inside the brackets makes no difference
res = pattern.findall('Bye for now xooxoxxxxo - see you later')
print(res)
<re.Match object; span=(12, 18), match='xoxoxo'>
None
<re.Match object; span=(12, 15), match='xox'>
['xoo', 'xox', 'xxx']
# Matches of variable size
pattern = re.compile(r'[xo]{2,6}') # between 2 and 6 consecutive characters, each x or o; as many as possible are taken
res = pattern.search('Bye for now xoxxooooxxxxoooxxoxoxo - see you later')
print(res)
res = pattern.findall('Bye for now xoxoxoooxxxxoooxxoxoxo - see you later')
print(res)
<re.Match object; span=(12, 18), match='xoxxoo'>
['xoxoxo', 'ooxxxx', 'oooxxo', 'xoxo']
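Braces also shorten the phone number pattern used earlier. As a sketch, the two patterns below should behave identically on the sample sentence; the names `long_form` and `short_form` are illustrative.

```python
import re

long_form = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
short_form = re.compile(r'\d{3}-\d{3}-\d{4}')  # \d{3} means exactly three digits

text = 'Call me at 408-554-4000 or 408-551-5700 for SCU'
print(long_form.findall(text))   # ['408-554-4000', '408-551-5700']
print(short_form.findall(text))  # ['408-554-4000', '408-551-5700']
```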
# Pulling it all together, using character codes
pattern = re.compile(r'\d+\s\w')
res = pattern.findall('Pigeons and 4 and 20 blackbirds baked in 3 pies')
print(res)

pattern = re.compile(r'\d+\s\w+')
res = pattern.findall('Pigeons and 4 and 20 blackbirds baked in 3 pies')
print(res)

pattern = re.compile(r'\d+\s\w+')
res = pattern.findall('Pigeons and 4 and 20   blackbirds baked in 3 pies')
print(res)

pattern = re.compile(r'\d+\s+\w+')
res = pattern.findall('Pigeons and 4 and 20   blackbirds baked in 3 pies')
print(res)
['4 a', '20 b', '3 p']
['4 and', '20 blackbirds', '3 pies']
['4 and', '3 pies']
['4 and', '20   blackbirds', '3 pies']
pattern = re.compile(r'\s*\w+')  # * is zero or more appearances
res = pattern.findall('Pigeons and 4 and 20 blackbirds baked in 3 pies')
print(res)

pattern = re.compile(r'[A-Za-z ]+') # Note the space is included
res = pattern.findall('Pigeons and 4 and 20 blackbirds baked in 3 pies')
print(res)
print("".join(res))

pattern = re.compile(r'[A-Za-z]+') # Note the space is not included
res = pattern.findall('Pigeons and 4 and 20 blackbirds baked in 3 pies')
print(res)
print(" ".join(res)) # space added here
['Pigeons', ' and', ' 4', ' and', ' 20', ' blackbirds', ' baked', ' in', ' 3', ' pies']
['Pigeons and ', ' and ', ' blackbirds baked in ', ' pies']
Pigeons and  and  blackbirds baked in  pies
['Pigeons', 'and', 'and', 'blackbirds', 'baked', 'in', 'pies']
Pigeons and and blackbirds baked in pies
# Caret (^) anchors the match to the start of the string
# ^ matches position just before the first character of the string
pattern = re.compile(r'^[A-Za-z ]+') # Note the space is included

res = pattern.search('pigeons and 4 and 20 blackbirds baked in 3 pies')
print(res)

res = pattern.findall('pigeons and 4 and 20 blackbirds baked in 3 pies')
print(res)

res = pattern.findall('3 pigeons and 4 and 20 blackbirds baked in 3 pies')
print(res)
<re.Match object; span=(0, 12), match='pigeons and '>
['pigeons and ']
[]
# Dollar sign ($) anchors the match to the end of the string
# $ matches the position just after the last character (or just before a trailing newline)
pattern = re.compile(r'[A-Za-z ]+$') # Note the space is included
res = pattern.findall('pigeons and 4 and 20 blackbirds baked in 3 pies')
print(res)

res = pattern.findall('Pigeons and 4 and 20 blackbirds baked in 3 pies ')
print(res)

res = pattern.findall('Pigeons and 4 and 20 blackbirds baked in 3 pies 4')
print(res)
[' pies']
[' pies ']
[]
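Using both anchors together is a common way to check that an entire string has a given form, rather than merely containing it. The sketch below assumes the phone format from earlier; the list of candidates is made up. It also shows that `re.fullmatch` is stricter than `$` about a trailing newline.

```python
import re

whole = re.compile(r'^\d{3}-\d{3}-\d{4}$')

# Hypothetical inputs: exact, with leading text, and with an extra digit
candidates = ['408-554-4000', 'Call 408-554-4000', '408-554-40000']
results = [bool(whole.search(c)) for c in candidates]
print(results)  # [True, False, False]

# $ also matches just before a trailing newline; fullmatch requires the whole string
with_newline = '408-554-4000\n'
print(bool(whole.search(with_newline)))                         # True
print(bool(re.fullmatch(r'\d{3}-\d{3}-\d{4}', with_newline)))   # False
```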
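# Wildcard: the dot (.) matches any single character except a newline
# In the last pattern, .*? is non-greedy: it takes as few characters as possible before each 's'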
pattern = re.compile(r'.s')
res = pattern.findall('Pigeons and 4 and 20 blackbirds baked in 3 pies 4')
print(res)

pattern = re.compile(r'..s')
res = pattern.findall('Pigeons and 4 and 20 blackbirds baked in 3 pies 4')
print(res)

pattern = re.compile(r'(.*?)s')
res = pattern.findall('Pigeons and 4 and 20 blackbirds baked in 3 pies 4')
print(res)
['ns', 'ds', 'es']
['ons', 'rds', 'ies']
['Pigeon', ' and 4 and 20 blackbird', ' baked in 3 pie']
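The `?` after `.*` in the last pattern makes the repetition non-greedy. For comparison, here is a short sketch of what the greedy form returns on the same sentence: `.*` runs as far as it can, so everything up to the final 's' ends up in a single match.

```python
import re

text = 'Pigeons and 4 and 20 blackbirds baked in 3 pies 4'

greedy = re.findall(r'(.*)s', text)   # .* grabs as much as possible
lazy = re.findall(r'(.*?)s', text)    # .*? grabs as little as possible

print(greedy)  # ['Pigeons and 4 and 20 blackbirds baked in 3 pie']
print(lazy)    # ['Pigeon', ' and 4 and 20 blackbird', ' baked in 3 pie']
```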
# Substitute strings
# Compiled form: pattern.sub(replacement, string)
# One-step form: re.sub(pattern, replacement, string)
pattern = re.compile(r'\d+')
res = pattern.sub('NUMBER','Pigeons and 4 and 20 blackbirds baked in 3 pies 4')
print(res)
Pigeons and NUMBER and NUMBER blackbirds baked in NUMBER pies NUMBER
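The replacement string can also refer back to captured groups with `\1`, `\2`, and so on, and the `count` argument limits how many replacements are made. A brief sketch, reusing the phone number sentence from earlier:

```python
import re

# Reformat 408-554-4000 style numbers as (408) 554-4000 using backreferences
pattern = re.compile(r'(\d{3})-(\d{3})-(\d{4})')
reformatted = pattern.sub(r'(\1) \2-\3', 'Call me at 408-554-4000 or 408-551-5700 for SCU')
print(reformatted)  # Call me at (408) 554-4000 or (408) 551-5700 for SCU

# count=1 replaces only the first match
first_only = re.sub(r'\d+', 'NUMBER', 'Pigeons and 4 and 20 blackbirds', count=1)
print(first_only)   # Pigeons and NUMBER and 20 blackbirds
```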
# Common case: emails and phone numbers
# The top-level | separates an email pattern from a phone pattern with an optional extension
pattern = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+(\.[a-zA-Z]{2,4})|(\d{3}|\(\d{3}\))?(\s|-|\.)?(\d{3})(\s|-|\.)(\d{4})(\s*(ext|x|ext\.)\s*(\d{2,5}))?')

res = pattern.search('My email is srdas_3.142@scu.edu')
print(res)

txt = 'My phone number is (408) 554-2776 and my email is srdas_3.142@scu.edu'
res = pattern.search(txt)
print(res)

# Search again in the text that follows the first match; the new span is relative to that slice
res2 = pattern.search(txt[res.span()[1]:])
print(res2)
<re.Match object; span=(12, 31), match='srdas_3.142@scu.edu'>
<re.Match object; span=(19, 33), match='(408) 554-2776'>
<re.Match object; span=(17, 36), match='srdas_3.142@scu.edu'>
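Slicing the text after each match works, but the spans then refer to the slice rather than to the original string. A possible alternative is `finditer`, which walks through all matches and reports positions in the original text. The sketch below uses a simplified version of the pattern (non-capturing groups, no extension part); that simplification is an assumption made for readability, not the pattern used above.

```python
import re

# Simplified email-or-phone pattern (an assumption for this sketch; extensions are not handled)
contact = re.compile(
    r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}'      # email
    r'|(?:\d{3}|\(\d{3}\))?[\s.-]?\d{3}[\s.-]\d{4}'          # phone
)

txt = 'My phone number is (408) 554-2776 and my email is srdas_3.142@scu.edu'

# Record each match together with its start position in the original text
found = [(m.group(), m.start()) for m in contact.finditer(txt)]
print(found)  # [('(408) 554-2776', 19), ('srdas_3.142@scu.edu', 50)]
```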

4.3. Password Example

Strong Password Detection

Write a function that uses regular expressions to make sure the password string it is passed is strong. A strong password is defined as one that is at least eight characters long, contains both uppercase and lowercase characters, and has at least one digit. You may need to test the string against multiple regex patterns to validate its strength.

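import re
# Each (?=...) is a lookahead: it checks a condition without consuming any characters.
# The four lookaheads require at least 8 characters, a digit, an uppercase letter, and a lowercase letter.
pattern = re.compile(r'(?=^.{8,}$)(?=.*\d)(?=.*[A-Z])(?=.*[a-z]).*$')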
print(pattern.search('Ae4cvf7890'))
print(pattern.search('7890Ae4cvf'))
print(pattern.search('abc7890Ae4cvf'))
print(pattern.search('abc7890Ae4cvf35'))
print(pattern.search('ABC7890A4CVF'))
print(pattern.search('ab7A5'))
print(pattern.search('ab7A5xy'))
print(pattern.search('ab7A5xyZ'))
print(pattern.search('ab7A5xyZb'))
print(pattern.search('abAxyZbXyW'))
<re.Match object; span=(0, 10), match='Ae4cvf7890'>
<re.Match object; span=(0, 10), match='7890Ae4cvf'>
<re.Match object; span=(0, 13), match='abc7890Ae4cvf'>
<re.Match object; span=(0, 15), match='abc7890Ae4cvf35'>
None
None
None
<re.Match object; span=(0, 8), match='ab7A5xyZ'>
<re.Match object; span=(0, 9), match='ab7A5xyZb'>
None
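The exercise asks for a function and suggests testing against several patterns. One possible version is sketched below as an alternative to the single lookahead pattern. The function name `is_strong_password` is a hypothetical choice, and the length check uses `len` rather than a regex for simplicity.

```python
import re

def is_strong_password(password):
    """Return True if the password has 8+ characters, an uppercase letter,
    a lowercase letter, and a digit."""
    if len(password) < 8:
        return False
    # One simple pattern per requirement; all of them must be found somewhere
    required = [r'[A-Z]', r'[a-z]', r'\d']
    return all(re.search(p, password) for p in required)

for candidate in ['Ae4cvf7890', 'ABC7890A4CVF', 'ab7A5xy', 'ab7A5xyZ', 'abAxyZbXyW']:
    print(candidate, is_strong_password(candidate))
```

On the test strings above this gives the same accept or reject decisions as the lookahead pattern, while each requirement stays readable on its own.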