Natural Language Processing

ACTL3143 & ACTL5111 Deep Learning for Actuaries

Patrick Laub

Natural Language Processing

Lecture Outline

  • Natural Language Processing

  • Car Crash Police Reports

  • Text Vectorisation

  • Bag Of Words

  • Limiting The Vocabulary

  • Intelligently Limit The Vocabulary

  • Word Embeddings

  • Word Embeddings II

What is NLP?

A field of research at the intersection of computer science, linguistics, and artificial intelligence that takes the naturally spoken or written language of humans and processes it with machines to automate or help in certain tasks.

How the computer sees text

Spot the odd one out:

[112, 97, 116, 114, 105, 99, 107, 32, 108, 97, 117, 98]
[80, 65, 84, 82, 73, 67, 75, 32, 76, 65, 85, 66]
[76, 101, 118, 105, 32, 65, 99, 107, 101, 114, 109, 97, 110]

Generated by:

print([ord(x) for x in "patrick laub"])
print([ord(x) for x in "PATRICK LAUB"])
print([ord(x) for x in "Levi Ackerman"])

The ord built-in turns characters into their ASCII codes (more generally, their Unicode code points).

Question

The largest value for a character here is 127. Can you guess why?

ASCII

American Standard Code for Information Interchange

Unicode is the new standard.
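
Unicode code points go well beyond 127, and ord and chr work with them too. For example:

ord("é"), ord("π"), ord("犬")
(233, 960, 29356)
chr(129302)
'🤖'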

Random strings

The built-in chr function turns numbers into characters.

rnd.seed(1)
chars = [chr(rnd.randint(32, 127)) for _ in range(10)]
chars
['E', ',', 'h', ')', 'k', '%', 'o', '`', '0', '!']
" ".join(chars)
'E , h ) k % o ` 0 !'
"".join([chr(rnd.randint(32, 127)) for _ in range(50)])
"lg&9R42t+<=.Rdww~v-)'_]6Y! \\q(x-Oh>g#f5QY#d8Kl:TpI"
"".join([chr(rnd.randint(0, 128)) for _ in range(50)])
'R\x0f@D\x19obW\x07\x1a\x19h\x16\tCg~\x17}d\x1b%9S&\x08 "\n\x17\x0foW\x19Gs\\J>. X\x177AqM\x03\x00x'

Escape characters

print("Hello,\tworld!")
Hello,  world!
print("Line 1\nLine 2")
Line 1
Line 2
print("Patrick\rLaub")
Laubick
print("C:\tom\new folder")
C:  om
ew folder

Escape the backslash:

print("C:\\tom\\new folder")
C:\tom\new folder
repr("Hello,\rworld!")
"'Hello,\\rworld!'"

Non-natural language processing I

How would you evaluate

10 + 2 * -3

All that Python sees is a string of characters.

[ord(c) for c in "10 + 2 * -3"]
[49, 48, 32, 43, 32, 50, 32, 42, 32, 45, 51]
10 + 2 * -3
4

Non-natural language processing II

Python first tokenizes the string:

import tokenize
import io

code = "10 + 2 * -3"
tokens = tokenize.tokenize(io.BytesIO(code.encode("utf-8")).readline)
for token in tokens:
    print(token)
TokenInfo(type=63 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=2 (NUMBER), string='10', start=(1, 0), end=(1, 2), line='10 + 2 * -3')
TokenInfo(type=54 (OP), string='+', start=(1, 3), end=(1, 4), line='10 + 2 * -3')
TokenInfo(type=2 (NUMBER), string='2', start=(1, 5), end=(1, 6), line='10 + 2 * -3')
TokenInfo(type=54 (OP), string='*', start=(1, 7), end=(1, 8), line='10 + 2 * -3')
TokenInfo(type=54 (OP), string='-', start=(1, 9), end=(1, 10), line='10 + 2 * -3')
TokenInfo(type=2 (NUMBER), string='3', start=(1, 10), end=(1, 11), line='10 + 2 * -3')
TokenInfo(type=4 (NEWLINE), string='', start=(1, 11), end=(1, 12), line='')
TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')

Non-natural language processing III

Python needs to parse the tokens into an abstract syntax tree.

import ast

print(ast.dump(ast.parse("10 + 2 * -3"), indent="  "))
Module(
  body=[
    Expr(
      value=BinOp(
        left=Constant(value=10),
        op=Add(),
        right=BinOp(
          left=Constant(value=2),
          op=Mult(),
          right=UnaryOp(
            op=USub(),
            operand=Constant(value=3)))))],
  type_ignores=[])

graph TD;
    Expr --> C[Add]
    C --> D[10]
    C --> E[Mult]
    E --> F[2]
    E --> G[USub]
    G --> H[3]

Non-natural language processing IV

The abstract syntax tree is then compiled into bytecode.

import dis

def expression(a, b, c):
    return a + b * -c

dis.dis(expression)
  3           0 RESUME                   0

  4           2 LOAD_FAST                0 (a)
              4 LOAD_FAST                1 (b)
              6 LOAD_FAST                2 (c)
              8 UNARY_NEGATIVE
             10 BINARY_OP                5 (*)
             14 BINARY_OP                0 (+)
             18 RETURN_VALUE

Running the bytecode
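
The Python virtual machine then executes the bytecode. As a small illustration, we can compile the expression ourselves and evaluate the resulting code object, or simply call the function from the previous slide:

code_obj = compile("10 + 2 * -3", "<string>", "eval")
eval(code_obj)
4
expression(10, 2, 3)
4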

ChatGPT tokenization

Example of GPT 3.5/4’s tokenization

E.g. 犭 radical for animals

狗 gǒu (dog)

猫 māo (cat)

狼 láng (wolf)

狮 shī (lion)

Applications of NLP in Industry

1) Classifying documents: Using the language within a body of text to classify it into a particular category, e.g.:

  • Grouping emails into high and low urgency
  • Movie reviews into positive and negative sentiment (i.e. sentiment analysis)
  • Company news into bullish (positive) and bearish (negative) statements

2) Machine translation: Assisting language translators with machine-generated suggestions from a source language (e.g. English) to a target language

Applications of NLP in Industry II

3) Search engine functions, including:

  • Autocomplete
  • Predicting what information or website a user is seeking

4) Speech recognition: Interpreting voice commands to provide information or take action. Used in virtual assistants such as Alexa, Siri, and Cortana

Deep learning & NLP?

Simple NLP applications such as spell checkers and synonym suggesters do not require deep learning and can be solved with deterministic, rules-based code with a dictionary/thesaurus.
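
For instance, a toy synonym suggester really is just a dictionary lookup (the thesaurus below is a made-up stand-in):

# A rules-based synonym suggester: a plain dictionary lookup, no deep learning.
thesaurus = {"car": ["vehicle", "automobile"], "crash": ["collision", "accident"]}

def suggest_synonyms(word):
    return thesaurus.get(word.lower(), [])

suggest_synonyms("Crash")
['collision', 'accident']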

More complex NLP applications, such as classifying documents, search-engine word prediction, and chatbots, are complex enough that deep learning methods are needed.

NLP in 1966-1973 #1

A typical story occurred in early machine translation efforts, which were generously funded by the U.S. National Research Council in an attempt to speed up the translation of Russian scientific papers in the wake of the Sputnik launch in 1957. It was thought initially that simple syntactic transformations, based on the grammars of Russian and English, and word replacement from an electronic dictionary, would suffice to preserve the exact meanings of sentences.

NLP in 1966-1973 #2

The fact is that accurate translation requires background knowledge in order to resolve ambiguity and establish the content of the sentence. The famous retranslation of “the spirit is willing but the flesh is weak” as “the vodka is good but the meat is rotten” illustrates the difficulties encountered. In 1966, a report by an advisory committee found that “there has been no machine translation of general scientific text, and none is in immediate prospect.” All U.S. government funding for academic translation projects was canceled.

High-level history of deep learning

A brief history of deep learning.

Car Crash Police Reports

Lecture Outline

  • Natural Language Processing

  • Car Crash Police Reports

  • Text Vectorisation

  • Bag Of Words

  • Limiting The Vocabulary

  • Intelligently Limit The Vocabulary

  • Word Embeddings

  • Word Embeddings II

Downloading the dataset

Look at the (U.S.) National Highway Traffic Safety Administration’s (NHTSA) National Motor Vehicle Crash Causation Survey (NMVCCS) dataset.

from pathlib import Path

if not Path("NHTSA_NMVCCS_extract.parquet.gzip").exists():
    print("Downloading dataset")                                    
    !wget https://github.com/JSchelldorfer/ActuarialDataScience/raw/master/12%20-%20NLP%20Using%20Transformers/NHTSA_NMVCCS_extract.parquet.gzip

df = pd.read_parquet("NHTSA_NMVCCS_extract.parquet.gzip")
print(f"shape of DataFrame: {df.shape}")
shape of DataFrame: (6949, 16)

Features

  • level_0, index, SCASEID: all useless row numbers
  • SUMMARY_EN and SUMMARY_GE: summaries of the accident
  • NUMTOTV: total number of vehicles involved in the accident
  • WEATHER1 to WEATHER8 (not one-hot):
    • WEATHER1: cloudy
    • WEATHER2: snow
    • WEATHER3: fog, smog, smoke
    • WEATHER4: rain
    • WEATHER5: sleet, hail (freezing drizzle or rain)
    • WEATHER6: blowing snow
    • WEATHER7: severe crosswinds
    • WEATHER8: other
  • INJSEVA and INJSEVB: injury severity & (binary) presence of bodily injury

Crash summaries

df["SUMMARY_EN"]
0       V1, a 2000 Pontiac Montana minivan, made a lef...
1       The crash occurred in the eastbound lane of a ...
2       This crash occurred just after the noon time h...
                              ...                        
6946    The crash occurred in the eastbound lanes of a...
6947    This single-vehicle crash occurred in a rural ...
6948    This two vehicle daytime collision occurred mi...
Name: SUMMARY_EN, Length: 6949, dtype: object
df["SUMMARY_EN"].map(lambda summary: len(summary)).hist(grid=False);

A crash summary

df["SUMMARY_EN"].iloc[1]
"The crash occurred in the eastbound lane of a two-lane, two-way asphalt roadway on level grade.  The conditions were daylight and wet with cloudy skies in the early afternoon on a weekday.\t\r \r V1, a 1995 Chevrolet Lumina was traveling eastbound.  V2, a 2004 Chevrolet Trailblazer was also traveling eastbound on the same roadway.  V2, was attempting to make a left-hand turn into a private drive on the North side of the roadway.  While turning V1 attempted to pass V2 on the left-hand side contacting it's front to the left side of V2.  Both vehicles came to final rest on the roadway at impact.\r \r The driver of V1 fled the scene and was not identified, so no further information could be obtained from him.  The Driver of V2 stated that the driver was a male and had hit his head and was bleeding.  She did not pursue the driver because she thought she saw a gun. The officer said that the car had been reported stolen.\r \r The Critical Precrash Event for the driver of V1 was this vehicle traveling over left lane line on the left side of travel.  The Critical Reason for the Critical Event was coded as unknown reason for the critical event because the driver was not available. \r \r The driver of V2 was a 41-year old female who had reported that she had stopped prior to turning to make sure she was at the right house.  She was going to show a house for a client.  She had no health related problems.  She had taken amoxicillin.  She does not wear corrective lenses and felt rested.  She was not injured in the crash.\r \r The Critical Precrash Event for the driver of V2 was other vehicle encroachment from adjacent lane over left lane line.  The Critical Reason for the Critical Event was not coded for this vehicle and the driver of V2 was not thought to have contributed to the crash."

Carriage returns

print(df["SUMMARY_EN"].iloc[1])
The Critical Precrash Event for the driver of V2 was other vehicle encroachment from adjacent lane over left lane line.  The Critical Reason for the Critical Event was not coded for this vehicle and the driver of V2 was not thought to have contributed to the crash.r corrective lenses and felt rested.  She was not injured in the crash. of V2.  Both vehicles came to final rest on the roadway at impact.
# Replace every \r with \n
def replace_carriage_return(summary):
    return summary.replace("\r", "\n")

df["SUMMARY_EN"] = df["SUMMARY_EN"].map(replace_carriage_return)
print(df["SUMMARY_EN"].iloc[1][:500])
The crash occurred in the eastbound lane of a two-lane, two-way asphalt roadway on level grade.  The conditions were daylight and wet with cloudy skies in the early afternoon on a weekday.    
 
 V1, a 1995 Chevrolet Lumina was traveling eastbound.  V2, a 2004 Chevrolet Trailblazer was also traveling eastbound on the same roadway.  V2, was attempting to make a left-hand turn into a private drive on the North side of the roadway.  While turning V1 attempted to pass V2 on the left-hand side contactin

Target

Predict the number of vehicles in the crash.

df["NUMTOTV"].value_counts()\
    .sort_index()
NUMTOTV
1    1822
2    4151
3     783
4     150
5      34
6       5
7       2
8       1
9       1
Name: count, dtype: int64
np.sum(df["NUMTOTV"] > 3)
193

Simplify the target to just:

  • 1 vehicle
  • 2 vehicles
  • 3+ vehicles
df["NUM_VEHICLES"] = \
  df["NUMTOTV"].map(lambda x: \
    str(x) if x <= 2 else "3+")
df["NUM_VEHICLES"].value_counts()\
  .sort_index()
NUM_VEHICLES
1     1822
2     4151
3+     976
Name: count, dtype: int64

Just ignore this for now…

rnd.seed(123)

# Replace each way a vehicle is referenced (e.g. "V1", "Vehicle 1", "vehicle #1")
# with a randomly generated label such as "V4523".
for i, summary in enumerate(df["SUMMARY_EN"]):
    word_numbers = ["one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten"]
    num_cars = 10
    new_car_nums = [f"V{rnd.randint(100, 10000)}" for _ in range(num_cars)]
    num_spaces = 4

    for car in range(1, num_cars+1):
        new_num = new_car_nums[car-1]
        summary = summary.replace(f"V-{car}", new_num)
        summary = summary.replace(f"Vehicle {word_numbers[car-1]}", new_num).replace(f"vehicle {word_numbers[car-1]}", new_num)
        summary = summary.replace(f"Vehicle #{word_numbers[car-1]}", new_num).replace(f"vehicle #{word_numbers[car-1]}", new_num)
        summary = summary.replace(f"Vehicle {car}", new_num).replace(f"vehicle {car}", new_num)
        summary = summary.replace(f"Vehicle #{car}", new_num).replace(f"vehicle #{car}", new_num)
        summary = summary.replace(f"Vehicle # {car}", new_num).replace(f"vehicle # {car}", new_num)

        for j in range(num_spaces+1):
            summary = summary.replace(f"V{' '*j}{car}", new_num).replace(f"V{' '*j}#{car}", new_num).replace(f"V{' '*j}# {car}", new_num)
            summary = summary.replace(f"v{' '*j}{car}", new_num).replace(f"v{' '*j}#{car}", new_num).replace(f"v{' '*j}# {car}", new_num)
         
    df.loc[i, "SUMMARY_EN"] = summary

Convert y to integers & split the data

from sklearn.preprocessing import LabelEncoder
target_labels = df["NUM_VEHICLES"]
target = LabelEncoder().fit_transform(target_labels)
target
array([1, 1, 1, ..., 2, 0, 1])
weather_cols = [f"WEATHER{i}" for i in range(1, 9)]
features = df[["SUMMARY_EN"] + weather_cols]

X_main, X_test, y_main, y_test = \
    train_test_split(features, target, test_size=0.2, random_state=1)

# As 0.25 x 0.8 = 0.2
X_train, X_val, y_train, y_val = \
    train_test_split(X_main, y_main, test_size=0.25, random_state=1)

X_train.shape, X_val.shape, X_test.shape
((4169, 9), (1390, 9), (1390, 9))
print([np.mean(y_train == y) for y in [0, 1, 2]])
[0.25833533221396016, 0.6032621731830176, 0.1384024946030223]

Text Vectorisation

Lecture Outline

  • Natural Language Processing

  • Car Crash Police Reports

  • Text Vectorisation

  • Bag Of Words

  • Limiting The Vocabulary

  • Intelligently Limit The Vocabulary

  • Word Embeddings

  • Word Embeddings II

Grab the start of a few summaries

first_summaries = X_train["SUMMARY_EN"].iloc[:3]
first_summaries
2532    This crash occurred in the early afternoon of ...
6209    This two-vehicle crash occurred in a four-legg...
2561    The crash occurred in the eastbound direction ...
Name: SUMMARY_EN, dtype: object
first_words = first_summaries.map(lambda txt: txt.split(" ")[:7])
first_words
2532    [This, crash, occurred, in, the, early, aftern...
6209    [This, two-vehicle, crash, occurred, in, a, fo...
2561    [The, crash, occurred, in, the, eastbound, dir...
Name: SUMMARY_EN, dtype: object
start_of_summaries = first_words.map(lambda txt: " ".join(txt))
start_of_summaries
2532          This crash occurred in the early afternoon
6209    This two-vehicle crash occurred in a four-legged
2561       The crash occurred in the eastbound direction
Name: SUMMARY_EN, dtype: object

Count words in the first summaries

from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer()
counts = vect.fit_transform(start_of_summaries)
vocab = vect.get_feature_names_out()
print(len(vocab), vocab)
13 ['afternoon' 'crash' 'direction' 'early' 'eastbound' 'four' 'in' 'legged'
 'occurred' 'the' 'this' 'two' 'vehicle']
counts
<3x13 sparse matrix of type '<class 'numpy.int64'>'
    with 21 stored elements in Compressed Sparse Row format>
counts.toarray()
array([[1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0],
       [0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1],
       [0, 1, 1, 0, 1, 0, 1, 0, 1, 2, 0, 0, 0]])

Encode new sentences to BoW

vect.transform([
    "first car hit second car in a crash",
    "ipad os 26 beta released",
])
<2x13 sparse matrix of type '<class 'numpy.int64'>'
    with 2 stored elements in Compressed Sparse Row format>
vect.transform([
    "first car hit second car in a crash",
    "ipad os 18 beta released",
]).toarray()
array([[0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
print(vocab)
['afternoon' 'crash' 'direction' 'early' 'eastbound' 'four' 'in' 'legged'
 'occurred' 'the' 'this' 'two' 'vehicle']

Bag of n-grams

vect = CountVectorizer(ngram_range=(1, 2))
counts = vect.fit_transform(start_of_summaries)
vocab = vect.get_feature_names_out()
print(len(vocab), vocab)
27 ['afternoon' 'crash' 'crash occurred' 'direction' 'early'
 'early afternoon' 'eastbound' 'eastbound direction' 'four' 'four legged'
 'in' 'in four' 'in the' 'legged' 'occurred' 'occurred in' 'the'
 'the crash' 'the early' 'the eastbound' 'this' 'this crash' 'this two'
 'two' 'two vehicle' 'vehicle' 'vehicle crash']
counts.toarray()
array([[1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1,
        0, 0, 0, 0, 0],
       [0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0,
        1, 1, 1, 1, 1],
       [0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 2, 1, 0, 1, 0, 0,
        0, 0, 0, 0, 0]])

See: Google Books Ngram Viewer

TF-IDF

Stands for term frequency-inverse document frequency.

Infographic explaining TF-IDF
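
As a sketch, scikit-learn's TfidfVectorizer is a drop-in replacement for CountVectorizer: raw counts are down-weighted for words that appear in many of the documents.

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
tfidf_counts = tfidf.fit_transform(start_of_summaries)
print(tfidf.get_feature_names_out())
tfidf_counts.toarray().round(2)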

Bag Of Words

Lecture Outline

  • Natural Language Processing

  • Car Crash Police Reports

  • Text Vectorisation

  • Bag Of Words

  • Limiting The Vocabulary

  • Intelligently Limit The Vocabulary

  • Word Embeddings

  • Word Embeddings II

Count words in all the summaries

vect = CountVectorizer()
vect.fit(X_train["SUMMARY_EN"])
vocab = list(vect.get_feature_names_out())
len(vocab)
18866
vocab[:5], vocab[len(vocab)//2:(len(vocab)//2 + 5)], vocab[-5:]
(['00', '000', '000lbs', '003', '005'],
 ['swinger', 'swinging', 'swipe', 'swiped', 'swiping'],
 ['zorcor', 'zotril', 'zx2', 'zx5', 'zyrtec'])

Create the X matrices

def vectorise_dataset(X, vect, txt_col="SUMMARY_EN", dataframe=False):
    X_vects = vect.transform(X[txt_col]).toarray()
    X_other = X.drop(txt_col, axis=1)

    if not dataframe:
        return np.concatenate([X_vects, X_other], axis=1)                           
    else:
        # Add column names and indices to the combined dataframe.
        vocab = list(vect.get_feature_names_out())
        X_vects_df = pd.DataFrame(X_vects, columns=vocab, index=X.index)
        return pd.concat([X_vects_df, X_other], axis=1)
X_train_bow = vectorise_dataset(X_train, vect)
X_val_bow = vectorise_dataset(X_val, vect)
X_test_bow = vectorise_dataset(X_test, vect)

Check the input matrix

vectorise_dataset(X_train, vect, dataframe=True)
00 000 000lbs 003 005 007 00am 00pm 00tydo2 01 ... zx5 zyrtec WEATHER1 WEATHER2 WEATHER3 WEATHER4 WEATHER5 WEATHER6 WEATHER7 WEATHER8
2532 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
6209 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2561 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
6882 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
206 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
6356 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

4169 rows × 18874 columns

Make a simple dense model

num_features = X_train_bow.shape[1]
num_cats = 3 # 1, 2, 3+ vehicles

def build_model(num_features, num_cats):
    random.seed(42)
    
    model = Sequential([
        Input((num_features,)),
        Dense(100, activation="relu"),
        Dense(num_cats, activation="softmax")
    ])
    
    topk = SparseTopKCategoricalAccuracy(k=2, name="topk")
    model.compile("adam", "sparse_categorical_crossentropy",
        metrics=["accuracy", topk])
    
    return model

Inspect the model

model = build_model(num_features, num_cats)
model.summary()
Model: "sequential"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                     Output Shape                  Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ dense (Dense)                   │ (None, 100)            │     1,887,500 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_1 (Dense)                 │ (None, 3)              │           303 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
 Total params: 1,887,803 (7.20 MB)
 Trainable params: 1,887,803 (7.20 MB)
 Non-trainable params: 0 (0.00 B)

Fit & evaluate the model

es = EarlyStopping(patience=1, restore_best_weights=True,
    monitor="val_accuracy", verbose=2)
%time hist = model.fit(X_train_bow, y_train, epochs=10, \
    callbacks=[es], validation_data=(X_val_bow, y_val), verbose=0);
Epoch 5: early stopping
Restoring model weights from the end of the best epoch: 4.
CPU times: user 12.6 s, sys: 5.88 s, total: 18.5 s
Wall time: 4.42 s
model.evaluate(X_train_bow, y_train, verbose=0)
[0.002541526686400175, 1.0, 1.0]
model.evaluate(X_val_bow, y_val, verbose=0)
[2.776606321334839, 0.9453237652778625, 0.9949640035629272]

Limiting The Vocabulary

Lecture Outline

  • Natural Language Processing

  • Car Crash Police Reports

  • Text Vectorisation

  • Bag Of Words

  • Limiting The Vocabulary

  • Intelligently Limit The Vocabulary

  • Word Embeddings

  • Word Embeddings II

The max_features value

vect = CountVectorizer(max_features=10)
vect.fit(X_train["SUMMARY_EN"])
vocab = vect.get_feature_names_out()
len(vocab)
10
print(vocab)
['and' 'driver' 'for' 'in' 'lane' 'of' 'the' 'to' 'vehicle' 'was']

What is left?

for i in range(3):
    sentence = X_train["SUMMARY_EN"].iloc[i]
    for word in sentence.split(" ")[:10]:
        word_or_qn = word if word in vocab else "?"
        print(word_or_qn, end=" ")
    print() # Same as print("\n", end="")
? ? ? in the ? ? of ? ? 
? ? ? ? in ? ? ? ? ? 
? ? ? in the ? ? of ? ? 
for i in range(3):
    sentence = X_train["SUMMARY_EN"].iloc[i]
    num_words = 0
    for word in sentence.split(" "):
        if word in vocab:
            print(word, end=" ")
            num_words += 1
        if num_words == 10:
            break
    print()
in the of in the of of was and was 
in and of in and for the of the and 
in the of to was was of was was and 

Remove stop words

vect = CountVectorizer(max_features=10, stop_words="english")
vect.fit(X_train["SUMMARY_EN"])
vocab = vect.get_feature_names_out()
len(vocab)
10
print(vocab)
['coded' 'crash' 'critical' 'driver' 'event' 'intersection' 'lane' 'left'
 'roadway' 'vehicle']
for i in range(3):
    sentence = X_train["SUMMARY_EN"].iloc[i]
    num_words = 0
    for word in sentence.split(" "):
        if word in vocab:
            print(word, end=" ")
            num_words += 1
        if num_words == 10:
            break
    print()
crash intersection roadway roadway roadway intersection lane lane intersection driver 
crash roadway left roadway roadway roadway lane lane roadway crash 
crash vehicle left left vehicle driver vehicle lane lane coded 

Keep 1,000 most frequent words

vect = CountVectorizer(max_features=1_000, stop_words="english")
vect.fit(X_train["SUMMARY_EN"])
vocab = vect.get_feature_names_out()
len(vocab)
1000
print(vocab[:5], vocab[len(vocab)//2:(len(vocab)//2 + 5)], vocab[-5:])
['10' '105' '113' '12' '15'] ['interruption' 'intersected' 'intersecting' 'intersection' 'interstate'] ['year' 'years' 'yellow' 'yield' 'zone']

Create the X matrices:

X_train_bow = vectorise_dataset(X_train, vect)
X_val_bow = vectorise_dataset(X_val, vect)
X_test_bow = vectorise_dataset(X_test, vect)

What is left?

for i in range(8):
    sentence = X_train["SUMMARY_EN"].iloc[i]
    num_words = 0
    for word in sentence.split(" "):
        if word in vocab:
            print(word, end=" ")
            num_words += 1
        if num_words == 10:
            break
    print()
crash occurred early afternoon weekday middle suburban intersection consisted lanes 
crash occurred roadway level consists lanes direction center left turn 
crash occurred eastbound direction entrance ramp right curved road uphill 
crash occurred straight roadway consists lanes direction center left turn 
collision occurred evening hours crash occurred level bituminous roadway residential 
vehicle crash occurred daylight location lane undivided left curved downhill 
vehicle crash occurred early morning daylight hours roadway traffic roadway 
crash occurred northbound lanes northbound southbound slightly street curved posted 

Note

The hope is that the result reads like SMS language: a very limited vocabulary, but still understandable.

Check the input matrix

vectorise_dataset(X_train, vect, dataframe=True)
10 105 113 12 15 150 16 17 18 180 ... yield zone WEATHER1 WEATHER2 WEATHER3 WEATHER4 WEATHER5 WEATHER6 WEATHER7 WEATHER8
2532 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
6209 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2561 1 0 1 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
6882 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
206 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
6356 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

4169 rows × 1008 columns

Make & inspect the model

num_features = X_train_bow.shape[1]
model = build_model(num_features, num_cats)
model.summary()
Model: "sequential_1"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                     Output Shape                  Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ dense_2 (Dense)                 │ (None, 100)            │       100,900 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_3 (Dense)                 │ (None, 3)              │           303 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
 Total params: 101,203 (395.32 KB)
 Trainable params: 101,203 (395.32 KB)
 Non-trainable params: 0 (0.00 B)

Fit & evaluate the model

es = EarlyStopping(patience=1, restore_best_weights=True,
    monitor="val_accuracy", verbose=2)
%time hist = model.fit(X_train_bow, y_train, epochs=10, \
    callbacks=[es], validation_data=(X_val_bow, y_val), verbose=0);
Epoch 3: early stopping
Restoring model weights from the end of the best epoch: 2.
CPU times: user 1.19 s, sys: 480 ms, total: 1.67 s
Wall time: 1.16 s
model.evaluate(X_train_bow, y_train, verbose=0)
[0.1021684780716896, 0.9815303683280945, 0.9990405440330505]
model.evaluate(X_val_bow, y_val, verbose=0)
[2.4335880279541016, 0.9381294846534729, 0.9942445755004883]

Intelligently Limit The Vocabulary

Lecture Outline

  • Natural Language Processing

  • Car Crash Police Reports

  • Text Vectorisation

  • Bag Of Words

  • Limiting The Vocabulary

  • Intelligently Limit The Vocabulary

  • Word Embeddings

  • Word Embeddings II

Keep 1,000 most frequent words

vect = CountVectorizer(max_features=1_000, stop_words="english")
vect.fit(X_train["SUMMARY_EN"])
vocab = vect.get_feature_names_out()
len(vocab)
1000
print(vocab[:5], vocab[len(vocab)//2:(len(vocab)//2 + 5)], vocab[-5:])
['10' '105' '113' '12' '15'] ['interruption' 'intersected' 'intersecting' 'intersection' 'interstate'] ['year' 'years' 'yellow' 'yield' 'zone']

Install spacy

!pip install spacy
!python -m spacy download en_core_web_trf
import spacy

nlp = spacy.load("en_core_web_trf")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for token in doc:
    print(token.text, token.pos_, token.dep_, token.lemma_)
Apple PROPN nsubj Apple
is AUX aux be
looking VERB ROOT look
at ADP prep at
buying VERB pcomp buy
U.K. PROPN compound U.K.
startup NOUN dobj startup
for ADP prep for
$ SYM quantmod $
1 NUM compound 1
billion NUM pobj billion

Dependency visualiser

# I needed to monkey-patch this to get displacy to work..
import IPython
import IPython.display
IPython.core.display.display = IPython.display.display
from spacy import displacy
doc = nlp(df["SUMMARY_EN"].iloc[1])
displacy.render(doc, style="dep")
[displacy dependency visualisation: each token of the summary is annotated with its part-of-speech tag and a labelled arc to its syntactic head]

Entity recognition

doc = nlp(df["SUMMARY_EN"].iloc[1])
displacy.render(doc, style="ent")
The crash occurred in the eastbound lane of a two CARDINAL -lane, two CARDINAL -way asphalt roadway on level grade. The conditions were daylight and wet with cloudy skies in the early afternoon TIME on a weekday DATE .

V342542243 PRODUCT , a 1995 DATE Chevrolet ORG Lumina PRODUCT was traveling eastbound. V342542269 PRODUCT , a 2004 DATE Chevrolet ORG Trailblazer PRODUCT was also traveling eastbound on the same roadway. V342542269 PRODUCT , was attempting to make a left-hand turn into a private drive on the North side of the roadway. While turning V342542243 PRODUCT attempted to pass V342542269 PRODUCT on the left-hand side contacting it's front to the left side of V342542269 PRODUCT . Both vehicles came to final rest on the roadway at impact.

The driver of V342542243 PRODUCT fled the scene and was not identified, so no further information could be obtained from him. The Driver of V342542269 PRODUCT stated that the driver was a male and had hit his head and was bleeding. She did not pursue the driver because she thought she saw a gun. The officer said that the car had been reported stolen.

The Critical Precrash Event for the driver of V342542243 PRODUCT was this vehicle traveling over left lane line on the left side of travel. The Critical Reason for the Critical Event was coded as unknown reason for the critical event because the driver was not available.

The driver of V342542269 PRODUCT was a 41-year old DATE female who had reported that she had stopped prior to turning to make sure she was at the right house. She was going to show a house for a client. She had no health related problems. She had taken amoxicillin. She does not wear corrective lenses and felt rested. She was not injured in the crash.

The Critical Precrash Event for the driver of V342542269 PRODUCT was other vehicle encroachment from adjacent lane over left lane line. The Critical Reason for the Critical Event was not coded for this vehicle and the driver of V342542269 PRODUCT was not thought to have contributed to the crash.

Stemming

“Stemming refers to the process of removing suffixes and reducing a word to some base form such that all different variants of that word can be represented by the same form (e.g., “car” and “cars” are both reduced to “car”). This is accomplished by applying a fixed set of rules (e.g., if the word ends in “-es,” remove “-es”). More such examples are shown in Figure 2-7. Although such rules may not always end up in a linguistically correct base form, stemming is commonly used in search engines to match user queries to relevant documents and in text classification to reduce the feature space to train machine learning models.”
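
For example, NLTK's PorterStemmer (a separate library to spaCy; assuming nltk is installed) applies exactly this kind of suffix-stripping rule, reproducing the stemmed examples shown later:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
# Chop suffixes off each word to leave a crude base form.
[stemmer.stem(w) for w in ["organization", "civilization", "information", "consultant"]]
['organ', 'civil', 'inform', 'consult']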

Lemmatization

“Lemmatization is the process of mapping all the different forms of a word to its base word, or lemma. While this seems close to the definition of stemming, they are, in fact, different. For example, the adjective “better,” when stemmed, remains the same. However, upon lemmatization, this should become “good,” as shown in Figure 2-7. Lemmatization requires more linguistic knowledge, and modeling and developing efficient lemmatizers remains an open problem in NLP research even now.”

Stemming and lemmatizing

Examples of stemming and lemmatization

Original: “The striped bats are hanging on their feet for best”

Stemmed: “the stripe bat are hang on their feet for best”

Lemmatized: “the stripe bat be hang on their foot for good”

Examples

Stemmed

organization -> organ

civilization -> civil

information -> inform

consultant -> consult

Lemmatized

Here’s looking at you, kid. -> here be look at you , kid .

Lemmatize the text

def lemmatize(txt):
    doc = nlp(txt)
    good_tokens = [token.lemma_.lower() for token in doc \
        if not token.like_num and \
           not token.is_punct and \
           not token.is_space and \
           not token.is_currency and \
           not token.is_stop]
    return " ".join(good_tokens)
test_str = "Incident at 100kph and '10 incidents -13.3%' are incidental?\t $5"
lemmatize(test_str)
'incident 100kph incident incidental'
test_str = "I interviewed 5-years ago, 150 interviews every year at 10:30 are.."
lemmatize(test_str)
'interview year ago interview year 10:30'

Apply to the whole dataset

df["SUMMARY_EN_LEMMA"] = df["SUMMARY_EN"].map(lemmatize)
weather_cols = [f"WEATHER{i}" for i in range(1, 9)]
features = df[["SUMMARY_EN_LEMMA"] + weather_cols]

X_main, X_test, y_main, y_test = \
    train_test_split(features, target, test_size=0.2, random_state=1)

# As 0.25 x 0.8 = 0.2
X_train, X_val, y_train, y_val = \
    train_test_split(X_main, y_main, test_size=0.25, random_state=1)

X_train.shape, X_val.shape, X_test.shape
((4169, 9), (1390, 9), (1390, 9))

What is left?

print("Original:", df["SUMMARY_EN"].iloc[0][:250])
Original: V6357885318682, a 2000 Pontiac Montana minivan, made a left turn from a private driveway onto a northbound 5-lane two-way, dry asphalt roadway on a downhill grade.  The posted speed limit on this roadway was 80 kmph (50 MPH). V6357885318682 entered t
print("Lemmatized:", df["SUMMARY_EN_LEMMA"].iloc[0][:250])
Lemmatized: v6357885318682 pontiac montana minivan left turn private driveway northbound lane way dry asphalt roadway downhill grade post speed limit roadway kmph mph v6357885318682 enter roadway cross southbound lane enter northbound lane left turn lane way int
print("Original:", df["SUMMARY_EN"].iloc[1][:250])
Original: The crash occurred in the eastbound lane of a two-lane, two-way asphalt roadway on level grade.  The conditions were daylight and wet with cloudy skies in the early afternoon on a weekday.  
 
 V342542243, a 1995 Chevrolet Lumina was traveling eastbou
print("Lemmatized:", df["SUMMARY_EN_LEMMA"].iloc[1][:250])
Lemmatized: crash occur eastbound lane lane way asphalt roadway level grade condition daylight wet cloudy sky early afternoon weekday v342542243 chevrolet lumina travel eastbound v342542269 chevrolet trailblazer travel eastbound roadway v342542269 attempt left h

Keep 1,000 most frequent lemmas

vect = CountVectorizer(max_features=1_000, stop_words="english")
vect.fit(X_train["SUMMARY_EN_LEMMA"])
vocab = vect.get_feature_names_out()
len(vocab)
1000
print(vocab[:5], vocab[len(vocab)//2:(len(vocab)//2 + 5)], vocab[-5:])
['10' '150' '48kmph' '4x4' '56kmph'] ['let' 'level' 'lexus' 'license' 'light'] ['yaw' 'year' 'yellow' 'yield' 'zone']

Create the X matrices:

X_train_bow = vectorise_dataset(X_train, vect, "SUMMARY_EN_LEMMA")
X_val_bow = vectorise_dataset(X_val, vect, "SUMMARY_EN_LEMMA")
X_test_bow = vectorise_dataset(X_test, vect, "SUMMARY_EN_LEMMA")

Check the input matrix

vectorise_dataset(X_train, vect, "SUMMARY_EN_LEMMA", dataframe=True)
10 150 48kmph 4x4 56kmph 64kmph 72kmph ability able accelerate ... yield zone WEATHER1 WEATHER2 WEATHER3 WEATHER4 WEATHER5 WEATHER6 WEATHER7 WEATHER8
2532 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
6209 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2561 0 0 0 0 1 1 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
6882 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
206 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
6356 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

4169 rows × 1008 columns

Make & inspect the model

num_features = X_train_bow.shape[1]
model = build_model(num_features, num_cats)
model.summary()
Model: "sequential_2"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                     Output Shape                  Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ dense_4 (Dense)                 │ (None, 100)            │       100,900 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_5 (Dense)                 │ (None, 3)              │           303 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
 Total params: 101,203 (395.32 KB)
 Trainable params: 101,203 (395.32 KB)
 Non-trainable params: 0 (0.00 B)

Fit & evaluate the model

es = EarlyStopping(patience=1, restore_best_weights=True,
    monitor="val_accuracy", verbose=2)
%time hist = model.fit(X_train_bow, y_train, epochs=10, \
    callbacks=[es], validation_data=(X_val_bow, y_val), verbose=0);
Epoch 3: early stopping
Restoring model weights from the end of the best epoch: 2.
CPU times: user 1.17 s, sys: 386 ms, total: 1.56 s
Wall time: 1.02 s
model.evaluate(X_train_bow, y_train, verbose=0)
[0.09232959151268005, 0.982249915599823, 0.9992803931236267]
model.evaluate(X_val_bow, y_val, verbose=0)
[3.648836851119995, 0.9395683407783508, 0.9935252070426941]

Word Embeddings

Lecture Outline

  • Natural Language Processing

  • Car Crash Police Reports

  • Text Vectorisation

  • Bag Of Words

  • Limiting The Vocabulary

  • Intelligently Limit The Vocabulary

  • Word Embeddings

  • Word Embeddings II

Overview

Popular methods for converting text into numbers include:

  • One-hot encoding
  • Bag of words
  • TF-IDF
  • Word vectors (transfer learning)

Assigning Numbers

Word Vectors

  • One-hot representations capture word ‘existence’ only, whereas word vectors capture information about word meaning as well as location.
  • This enables deep learning NLP models to automatically learn linguistic features.
  • Word2Vec & GloVe are popular algorithms for generating word embeddings (i.e. word vectors).

Word Vectors II

Illustrative word vectors.

Remember this diagram?

Embeddings will gradually improve during training.

Word2Vec

Key idea: You’re known by the company you keep.

Two algorithms are used to calculate embeddings:

  • Continuous bag of words: uses the context words to predict the target word
  • Skip-gram: uses the target word to predict the context words

Predictions are made using a neural network with one hidden layer. Through backpropagation, we update a set of “weights” which become the word vectors.
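
As a sketch, gensim can train such a skip-gram model directly on the crash summaries (gensim is installed later in this lecture; the hyperparameters here are only illustrative):

from gensim.models import Word2Vec

# Train a small skip-gram (sg=1) model on the crash summaries.
sentences = [summary.lower().split() for summary in df["SUMMARY_EN"]]
w2v = Word2Vec(sentences, vector_size=100, window=5, min_count=5, sg=1, seed=42)
w2v.wv.most_similar("intersection", topn=3)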

Word2Vec training methods

Continuous bag of words is a centre word prediction task

Skip-gram is a neighbour word prediction task

Suggested viewing

Computerphile (2019), Vectoring Words (Word Embeddings), YouTube (16 mins).

The skip-gram network

The skip-gram model. Both the input vector \boldsymbol{x} and the output \boldsymbol{y} are one-hot encoded word representations. The hidden layer is the word embedding of size N.
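
A rough Keras equivalent, with placeholder sizes V (vocabulary) and N (embedding dimension):

V, N = 10_000, 300  # placeholder vocabulary and embedding sizes

skip_gram = Sequential([
    Input((V,)),                     # one-hot encoded centre word
    Dense(N, use_bias=False),        # linear hidden layer: the word embedding
    Dense(V, activation="softmax"),  # probabilities over possible context words
])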

Word Vector Arithmetic

Relationships between words become vector math.

You remember vectors, right?

Illustrative word vector arithmetic

Screenshot from Word2viz

Word Embeddings II

Lecture Outline

  • Natural Language Processing

  • Car Crash Police Reports

  • Text Vectorisation

  • Bag Of Words

  • Limiting The Vocabulary

  • Intelligently Limit The Vocabulary

  • Word Embeddings

  • Word Embeddings II

Pretrained word embeddings

!pip install gensim

Load word2vec embeddings trained on Google News:

import gensim.downloader as api
wv = api.load('word2vec-google-news-300')

When run for the first time, that downloads a huge file:

gensim_dir = Path("~/gensim-data/").expanduser()
[str(p) for p in gensim_dir.iterdir()]
['/Users/z3535837/gensim-data/word2vec-google-news-300',
 '/Users/z3535837/gensim-data/information.json']
next(gensim_dir.glob("*/*.gz")).stat().st_size / 1024**3
1.6238203644752502
f"The size of the vocabulary is {len(wv)}"
'The size of the vocabulary is 3000000'

Treat wv like a dictionary

wv["pizza"]
array([-1.26e-01,  2.54e-02,  1.67e-01,  5.51e-01, -7.67e-02,  1.29e-01,
        1.03e-01, -3.95e-04,  1.22e-01,  4.32e-02,  1.73e-01, -6.84e-02,
        3.42e-01,  8.40e-02,  6.69e-02,  2.68e-01, -3.71e-02, -5.57e-02,
        1.81e-01,  1.90e-02, -5.08e-02,  9.03e-03,  1.77e-01,  6.49e-02,
       -6.25e-02, -9.42e-02, -9.72e-02,  4.00e-01,  1.15e-01,  1.03e-01,
       -1.87e-02, -2.70e-01,  1.81e-01,  1.25e-01, -3.17e-02, -5.49e-02,
        3.46e-01, -1.57e-02,  1.82e-05,  2.07e-01, -1.26e-01, -2.83e-01,
        2.00e-01,  8.35e-02, -4.74e-02, -3.11e-02, -2.62e-01,  1.70e-01,
       -2.03e-02,  1.53e-01, -1.21e-01,  3.75e-01, -5.69e-02, -4.76e-03,
       -1.95e-01, -2.03e-01,  3.01e-01, -1.01e-01, -3.18e-01, -9.03e-02,
       -1.19e-01,  1.95e-01, -8.79e-02,  1.58e-01,  1.52e-02, -1.60e-01,
       -3.30e-01, -4.67e-01,  1.69e-01,  2.23e-02,  1.55e-01,  1.08e-01,
       -3.56e-02,  9.13e-02, -8.69e-02, -1.20e-01, -3.09e-01, -2.61e-02,
       -7.23e-02, -4.80e-01,  3.78e-02, -1.36e-01, -1.03e-01, -2.91e-01,
       -1.93e-01, -4.22e-01, -1.06e-01,  3.55e-01,  1.67e-01, -3.63e-03,
       -7.42e-02, -3.22e-01, -7.52e-02, -8.25e-02, -2.91e-01, -1.26e-01,
        1.68e-02,  5.00e-02,  1.28e-01, -7.42e-02, -1.31e-01, -2.46e-01,
        6.49e-02,  1.53e-01,  2.60e-01, -1.05e-01,  3.57e-01, -4.30e-02,
       -1.58e-01,  8.20e-02, -5.98e-02, -2.34e-01, -3.22e-01, -1.26e-01,
        5.40e-02, -1.88e-01,  1.36e-01, -6.59e-02,  8.36e-03, -1.85e-01,
       -2.97e-01, -1.85e-01, -4.74e-02, -1.06e-01, -6.93e-02,  3.83e-02,
       -3.20e-02,  3.64e-02, -1.20e-01,  1.77e-01, -1.16e-01,  1.99e-02,
        8.64e-02,  6.08e-02, -1.41e-01,  3.30e-01,  1.94e-01, -1.56e-01,
        3.93e-01,  1.81e-03,  7.28e-02, -2.54e-01, -3.54e-02,  2.87e-03,
       -1.73e-01,  9.77e-03, -1.56e-02,  3.23e-03, -1.70e-01,  1.55e-01,
        7.18e-02,  4.10e-01, -2.11e-01,  1.32e-01,  7.63e-03,  4.79e-02,
       -4.54e-02,  7.32e-02, -4.06e-01, -2.06e-02, -4.04e-01, -1.01e-01,
       -2.03e-01,  1.55e-01, -1.89e-01,  6.59e-02,  6.54e-02, -2.05e-01,
        5.47e-02, -3.06e-02, -1.54e-01, -2.62e-01,  3.81e-03, -8.20e-02,
       -3.20e-01,  2.84e-02,  2.70e-01,  1.74e-01, -1.67e-01,  2.23e-01,
        6.35e-02, -1.96e-01,  1.46e-01, -1.56e-02,  2.60e-02, -6.30e-02,
        2.94e-02,  3.28e-01, -4.69e-02, -1.52e-01,  6.98e-02,  3.18e-01,
       -1.08e-01,  3.66e-02, -1.99e-01,  1.64e-03,  6.41e-03, -1.47e-01,
       -6.25e-02, -4.36e-03, -2.75e-01,  8.54e-02, -5.00e-02, -3.12e-01,
       -1.34e-01, -1.99e-01,  5.18e-02, -9.28e-02, -2.40e-01, -7.86e-02,
       -1.54e-01, -6.64e-02, -1.97e-01,  1.77e-01, -1.57e-01, -1.63e-01,
        6.01e-02, -5.86e-02, -2.23e-01, -6.59e-02, -9.38e-02, -4.14e-01,
        2.56e-01, -1.77e-01,  2.52e-01,  1.48e-01, -1.04e-01, -8.61e-03,
       -1.23e-01, -9.23e-02,  4.42e-02, -1.71e-01, -1.98e-01,  1.92e-01,
        2.85e-01, -4.35e-02,  1.08e-01, -5.37e-02, -2.10e-02,  1.46e-01,
        3.83e-01,  2.32e-02, -8.84e-02,  7.32e-02, -1.01e-01, -1.06e-01,
        4.12e-01,  2.11e-01,  2.79e-01, -2.09e-02,  2.07e-01,  9.81e-02,
        2.39e-01,  7.67e-02,  2.02e-01, -6.08e-02, -2.64e-03, -1.84e-01,
       -1.57e-02, -3.20e-01,  9.03e-02,  1.02e-01, -4.96e-01, -9.72e-02,
       -8.11e-02, -1.81e-01, -1.46e-01,  8.64e-02, -2.04e-01, -2.02e-01,
       -5.47e-02,  2.54e-01,  2.09e-02, -1.16e-01,  2.02e-01, -8.06e-02,
       -1.05e-01, -7.96e-02,  1.97e-02, -2.49e-01,  1.31e-01,  2.89e-01,
       -2.26e-01,  4.55e-01, -2.73e-01, -2.58e-01, -3.15e-02,  4.04e-01,
       -2.68e-01,  2.89e-01, -1.84e-01, -1.48e-01, -1.07e-01,  1.28e-01,
        5.47e-01, -8.69e-02, -1.48e-02,  6.98e-02, -8.50e-02, -1.55e-01],
      dtype=float32)
len(wv["pizza"])
300

Find nearby word vectors

wv.most_similar("Python")
[('Jython', 0.6152505874633789),
 ('Perl_Python', 0.5710949897766113),
 ('IronPython', 0.5704679489135742),
 ('scripting_languages', 0.5695091485977173),
 ('PHP_Perl', 0.5687724947929382),
 ('Java_Python', 0.5681070685386658),
 ('PHP', 0.5660915374755859),
 ('Python_Ruby', 0.5632461905479431),
 ('Visual_Basic', 0.5603479743003845),
 ('Perl', 0.5530891418457031)]
wv.similarity("Python", "Java")
0.46189713
wv.similarity("Python", "sport")
0.08406469
wv.similarity("Python", "R")
0.06695429

What does ‘similarity’ mean?

The ‘similarity’ scores

wv.similarity("Sydney", "Melbourne")
0.8613987

are normally based on cosine similarity.
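
That is, for word vectors \boldsymbol{x} and \boldsymbol{y},

\text{similarity}(\boldsymbol{x}, \boldsymbol{y}) = \frac{\boldsymbol{x} \cdot \boldsymbol{y}}{\| \boldsymbol{x} \| \, \| \boldsymbol{y} \|}.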

x = wv["Sydney"]
y = wv["Melbourne"]
x.dot(y) / (np.linalg.norm(x) * np.linalg.norm(y))
0.8613986
wv.similarity("Sydney", "Aarhus")
0.19079602

Weng’s GoT Word2Vec

In the GoT word embedding space, the top similar words to “king” and “queen” are:

model.most_similar("king")
('kings', 0.897245) 
('baratheon', 0.809675) 
('son', 0.763614)
('robert', 0.708522)
('lords', 0.698684)
('joffrey', 0.696455)
('prince', 0.695699)
('brother', 0.685239)
('aerys', 0.684527)
('stannis', 0.682932)
model.most_similar("queen")
('cersei', 0.942618)
('joffrey', 0.933756)
('margaery', 0.931099)
('sister', 0.928902)
('prince', 0.927364)
('uncle', 0.922507)
('varys', 0.918421)
('ned', 0.917492)
('melisandre', 0.915403)
('robb', 0.915272)

Combining word vectors

You can summarise a sentence by averaging the individual word vectors.

sv = (wv["Melbourne"] + wv["has"] + wv["better"] + wv["coffee"]) / 4
len(sv), sv[:5]
(300, array([-0.08, -0.11, -0.16,  0.24,  0.06], dtype=float32))

As it turns out, averaging word embeddings is a surprisingly effective way to create sentence embeddings. It’s not perfect (as you’ll see), but it does a strong job of capturing what you might perceive to be complex relationships between words.
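
As a sketch, the same trick turns each crash summary into a single 300-dimensional feature vector; summary_vector below is a hypothetical helper (words missing from the word2vec vocabulary are simply skipped):

def summary_vector(text):
    # Hypothetical helper: average the word2vec vectors of the recognised words.
    words = [word for word in text.split() if word in wv]
    if not words:
        return np.zeros(wv.vector_size)
    return np.mean([wv[word] for word in words], axis=0)

summary_vector("Melbourne has better coffee").shape
(300,)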

Recipe recommender

Recipes are the average of the word vectors of the ingredients.

Nearest neighbours are then used to classify new recipes as potentially delicious.

Analogies with word vectors

Obama is to America as ___ is to Australia.

\text{Obama} - \text{America} + \text{Australia} = ?

wv.most_similar(positive=["Obama", "Australia"], negative=["America"])
[('Mr_Rudd', 0.615142285823822),
 ('Prime_Minister_Julia_Gillard', 0.6045385003089905),
 ('Prime_Minister_Kevin_Rudd', 0.5982581973075867),
 ('Kevin_Rudd', 0.5627648830413818),
 ('Ms_Gillard', 0.5517690181732178),
 ('Opposition_Leader_Kevin_Rudd', 0.5298037528991699),
 ('Mr_Beazley', 0.5259249806404114),
 ('Gillard', 0.5250653028488159),
 ('NARDA_GILMORE', 0.5203536748886108),
 ('Mr_Downer', 0.5150347948074341)]

Testing more associations

wv.most_similar(positive=["France", "London"], negative=["Paris"])
[('Britain', 0.7368934750556946),
 ('UK', 0.6637030839920044),
 ('England', 0.6119861602783203),
 ('United_Kingdom', 0.6067784428596497),
 ('Great_Britain', 0.5870823860168457),
 ('Britian', 0.5852951407432556),
 ('Scotland', 0.5410018563270569),
 ('British', 0.5318331718444824),
 ('Europe', 0.5307437181472778),
 ('East_Midlands', 0.5230222344398499)]

Quickly get to bad associations

wv.most_similar(positive=["King", "woman"], negative=["man"])
[('Queen', 0.5515626072883606),
 ('Oprah_BFF_Gayle', 0.47597548365592957),
 ('Geoffrey_Rush_Exit', 0.46460166573524475),
 ('Princess', 0.4533674418926239),
 ('Yvonne_Stickney', 0.4507041573524475),
 ('L._Bonauto', 0.4422135353088379),
 ('gal_pal_Gayle', 0.4408389925956726),
 ('Alveda_C.', 0.440279096364975),
 ('Tupou_V.', 0.4373863935470581),
 ('K._Letourneau', 0.435103178024292)]
wv.most_similar(positive=["computer_programmer", "woman"], negative=["man"])
[('homemaker', 0.5627118945121765),
 ('housewife', 0.5105047225952148),
 ('graphic_designer', 0.5051802396774292),
 ('schoolteacher', 0.49794942140579224),
 ('businesswoman', 0.493489146232605),
 ('paralegal', 0.4925510883331299),
 ('registered_nurse', 0.4907974898815155),
 ('saleswoman', 0.48816272616386414),
 ('electrical_engineer', 0.4797726571559906),
 ('mechanical_engineer', 0.4755399525165558)]

Bias in NLP models

… there are serious questions to answer, like how are we going to teach AI using public data without incorporating the worst traits of humanity? If we create bots that mirror their users, do we care if their users are human trash? There are plenty of examples of technology embodying — either accidentally or on purpose — the prejudices of society, and Tay’s adventures on Twitter show that even big corporations like Microsoft forget to take any preventative measures against these problems.

The library cheats a little bit

wv.similar_by_vector(wv["computer_programmer"] - wv["man"] + wv["woman"])
[('computer_programmer', 0.910581111907959),
 ('homemaker', 0.5771315693855286),
 ('schoolteacher', 0.5500192046165466),
 ('graphic_designer', 0.5464698672294617),
 ('mechanical_engineer', 0.539836585521698),
 ('electrical_engineer', 0.5337055325508118),
 ('housewife', 0.5274525284767151),
 ('programmer', 0.5096209049224854),
 ('businesswoman', 0.5029540657997131),
 ('keypunch_operator', 0.4974639415740967)]

To get the ‘nice’ analogies, the .most_similar method ignores the input words as possible answers.

# ignore (don't return) keys from the input
result = [
    (self.index_to_key[sim + clip_start], float(dists[sim]))
    for sim in best if (sim + clip_start) not in all_keys
]

Package Versions

from watermark import watermark
print(watermark(python=True, packages="keras,matplotlib,numpy,pandas,seaborn,scipy,torch,tensorflow,tf_keras"))
Python implementation: CPython
Python version       : 3.11.12
IPython version      : 9.3.0

keras     : 3.8.0
matplotlib: 3.10.0
numpy     : 1.26.4
pandas    : 2.2.2
seaborn   : 0.13.2
scipy     : 1.13.1
torch     : 2.6.0
tensorflow: 2.18.0
tf_keras  : 2.18.0

Glossary

  • bag of words
  • lemmatization
  • n-grams
  • one-hot embedding
  • TF-IDF
  • vocabulary
  • word embedding
  • word2vec