CS224N Assignment 1: Exploring Word Vectors (25 Points)

Welcome to CS224N!
Before you start, make sure you read the README.txt in the same directory as this notebook for important
setup information. A lot of code is provided in this notebook, and we highly encourage you to read and
understand it as part of the learning :)
If you aren't super familiar with Python, Numpy, or Matplotlib, we recommend you check out the review
session on Friday. The session will be recorded and the material will be made available on our website
(http://web.stanford.edu/class/cs224n/index.html#schedule). The CS231N Python/Numpy tutorial
(https://cs231n.github.io/python-numpy-tutorial/) is also a great resource.
Assignment Notes: Please make sure to save the notebook as you go along. Submission Instructions are
located at the bottom of the notebook.
In [39]: # All Import Statements Defined Here
# Note: Do not add to this list.
# ----------------
import sys
assert sys.version_info[0]==3
assert sys.version_info[1] >= 5
from gensim.models import KeyedVectors
from gensim.test.utils import datapath
import pprint
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [10, 5]
import nltk
nltk.download('reuters')
from nltk.corpus import reuters
import numpy as np
import random
import scipy as sp
from sklearn.decomposition import TruncatedSVD
from sklearn.decomposition import PCA
START_TOKEN = '<START>'
END_TOKEN = '<END>'
np.random.seed(0)
random.seed(0)
# ----------------
Word Vectors
Word Vectors are often used as a fundamental component for downstream NLP tasks, e.g. question
answering, text generation, translation, etc., so it is important to build some intuitions as to their strengths
and weaknesses. Here, you will explore two types of word vectors: those derived from co-occurrence
matrices, and those derived via GloVe.
Note on Terminology: The terms "word vectors" and "word embeddings" are often used interchangeably.
The term "embedding" refers to the fact that we are encoding aspects of a word's meaning in a lower
dimensional space. As Wikipedia (https://en.wikipedia.org/wiki/Word_embedding) states, "conceptually it
involves a mathematical embedding from a space with one dimension per word to a continuous vector space
with a much lower dimension".
[nltk_data] Downloading package reuters to /Users/chih-hsuankao/nltk_data...
[nltk_data]   Package reuters is already up-to-date!
Part 1: Count-Based Word Vectors (10 points)
Most word vector models start from the following idea:
You shall know a word by the company it keeps (Firth, J. R. 1957:11
(https://en.wikipedia.org/wiki/John_Rupert_Firth))
Many word vector implementations are driven by the idea that similar words, i.e., (near) synonyms, will be
used in similar contexts. As a result, similar words will often be spoken or written along with a shared subset
of words, i.e., contexts. By examining these contexts, we can try to develop embeddings for our words. With
this intuition in mind, many "old school" approaches to constructing word vectors relied on word counts.
Here we elaborate upon one of those strategies, co-occurrence matrices (for more information, see here
(http://web.stanford.edu/class/cs124/lec/vectorsemantics.video.pdf) or here (https://medium.com/datascience-group-iitr/word-embedding-2d05d270b285)).
Co-Occurrence
A co-occurrence matrix counts how often things co-occur in some environment. Given some word $w_i$ occurring in the document, we consider the context window surrounding $w_i$. Supposing our fixed window size is $n$, then this is the $n$ preceding and $n$ subsequent words in that document, i.e. words $w_{i-n} \dots w_{i-1}$ and $w_{i+1} \dots w_{i+n}$. We build a co-occurrence matrix $M$, which is a symmetric word-by-word matrix in which $M_{ij}$ is the number of times $w_j$ appears inside $w_i$'s window among all documents.
Example: Co-Occurrence with Fixed Window of n=1:

Document 1: "all that glitters is not gold"
Document 2: "all is well that ends well"

    *          <START>  all  that  glitters  is  not  gold  well  ends  <END>
    <START>       0      2    0       0       0    0    0     0     0     0
    all           2      0    1       0       1    0    0     0     0     0
    that          0      1    0       1       0    0    0     1     1     0
    glitters      0      0    1       0       1    0    0     0     0     0
    is            0      1    0       1       0    1    0     1     0     0
    not           0      0    0       0       1    0    1     0     0     0
    gold          0      0    0       0       0    1    0     0     0     1
    well          0      0    1       0       1    0    0     0     1     1
    ends          0      0    1       0       0    0    0     1     0     0
    <END>         0      0    0       0       0    0    1     1     0     0
Note: In NLP, we often add <START> and <END> tokens to represent the beginning and end of sentences, paragraphs or documents. In this case we imagine <START> and <END> tokens encapsulating each document, e.g., "<START> All that glitters is not gold <END>", and include these tokens in our co-occurrence counts.
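To make the counting concrete, here is a minimal sketch (separate from the compute_co_occurrence_matrix function you will implement in Question 1.2, with illustrative variable names) that tallies window-1 co-occurrences for the two toy documents above:

```python
from collections import defaultdict

# Toy documents from the example, with <START>/<END> tokens included
docs = [
    "<START> all that glitters is not gold <END>".split(),
    "<START> all is well that ends well <END>".split(),
]

counts = defaultdict(int)  # (word_i, word_j) -> co-occurrence count
window = 1
for doc in docs:
    for i, center in enumerate(doc):
        for j in range(max(i - window, 0), min(i + window, len(doc) - 1) + 1):
            if j != i:
                counts[(center, doc[j])] += 1

print(counts[("all", "<START>")])  # 2, matching the table above
print(counts[("that", "well")])    # 1
```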
The rows (or columns) of this matrix provide one type of word vectors (those based on word-word co-occurrence), but the vectors will be large in general (linear in the number of distinct words in a corpus). Thus, our next step is to run dimensionality reduction. In particular, we will run SVD (Singular Value Decomposition), which is a kind of generalized PCA (Principal Components Analysis), to select the top $k$ principal components. Here's a visualization of dimensionality reduction with SVD. In this picture our co-occurrence matrix is $A$ with $n$ rows corresponding to $n$ words. We obtain a full matrix decomposition, with the singular values ordered in the diagonal matrix $S$, and our new, shorter length-$k$ word vectors in $U_k$.

This reduced-dimensionality co-occurrence representation preserves semantic relationships between words, e.g. doctor and hospital will be closer than doctor and dog.

Notes: If you can barely remember what an eigenvalue is, here's a slow, friendly introduction to SVD (https://davetang.org/file/Singular_Value_Decomposition_Tutorial.pdf). If you want to learn more thoroughly about PCA or SVD, feel free to check out lectures 7 (https://web.stanford.edu/class/cs168/l/l7.pdf), 8 (http://theory.stanford.edu/~tim/s15/l/l8.pdf), and 9 (https://web.stanford.edu/class/cs168/l/l9.pdf) of CS168. These course notes provide a great high-level treatment of these general-purpose algorithms. Though, for the purpose of this class, you only need to know how to extract the k-dimensional embeddings by utilizing pre-programmed implementations of these algorithms from the numpy, scipy, or sklearn Python packages. In practice, it is challenging to apply full SVD to large corpora because of the memory needed to perform PCA or SVD. However, if you only want the top $k$ vector components for relatively small $k$ — known as Truncated SVD (https://en.wikipedia.org/wiki/Singular_value_decomposition#Truncated_SVD) — then there are reasonably scalable techniques to compute those iteratively.
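As a concrete illustration of the idea (a sketch only, separate from the reduce_to_k_dim function you implement below), the following keeps the top-$k$ singular values of a small random stand-in matrix using numpy's full SVD and forms the length-$k$ word vectors as $U_k S_k$:

```python
import numpy as np

np.random.seed(0)
A = np.random.rand(8, 8)       # stand-in for a small co-occurrence matrix
k = 2

U, S, Vt = np.linalg.svd(A)    # full SVD: A = U @ diag(S) @ Vt, singular values in descending order
A_k = U[:, :k] * S[:k]         # length-k word vectors, i.e. U_k scaled by the top-k singular values

print(A_k.shape)               # (8, 2): one k-dimensional vector per word
```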
Plotting Co-Occurrence Word Embeddings
Here, we will be using the Reuters (business and financial news) corpus. If you haven't run the import cell at
the top of this page, please run it now (click it and press SHIFT-RETURN). The corpus consists of 10,788
news documents totaling 1.3 million words. These documents span 90 categories and are split into train and
test. For more details, please see https://www.nltk.org/book/ch02.html
(https://www.nltk.org/book/ch02.html). We provide a read_corpus function below that pulls out only
articles from the "crude" (i.e. news articles about oil, gas, etc.) category. The function also adds <START>
and <END> tokens to each of the documents, and lowercases words. You do not have to perform any
other kind of pre-processing.
In [40]: def read_corpus(category="crude"):
             """ Read files from the specified Reuter's category.
                 Params:
                     category (string): category name
                 Return:
                     list of lists, with words from each of the processed files
             """
             files = reuters.fileids(category)
             return [[START_TOKEN] + [w.lower() for w in list(reuters.words(f))] + [END_TOKEN] for f in files]
Let's have a look at what these documents are like….
In [41]: reuters_corpus = read_corpus()
pprint.pprint(reuters_corpus[:3], compact=True, width=100)
[['<START>', 'japan', 'to', 'revise', 'long', '-', 'term', 'energy',
'demand', 'downwards', 'the',
 'ministry', 'of', 'international', 'trade', 'and', 'industry', '('
, 'miti', ')', 'will', 'revise',
 'its', 'long', '-', 'term', 'energy', 'supply', '/', 'demand', 'ou
tlook', 'by', 'august', 'to',
 'meet', 'a', 'forecast', 'downtrend', 'in', 'japanese', 'energy',
'demand', ',', 'ministry',
 'officials', 'said', '.', 'miti', 'is', 'expected', 'to', 'lower',
'the', 'projection', 'for',
 'primary', 'energy', 'supplies', 'in', 'the', 'year', '2000', 'to'
, '550', 'mln', 'kilolitres',
 '(', 'kl', ')', 'from', '600', 'mln', ',', 'they', 'said', '.', 't
he', 'decision', 'follows',
 'the', 'emergence', 'of', 'structural', 'changes', 'in', 'japanese
', 'industry', 'following',
 'the', 'rise', 'in', 'the', 'value', 'of', 'the', 'yen', 'and', 'a
', 'decline', 'in', 'domestic',
 'electric', 'power', 'demand', '.', 'miti', 'is', 'planning', 'to'
, 'work', 'out', 'a', 'revised',
 'energy', 'supply', '/', 'demand', 'outlook', 'through', 'delibera
tions', 'of', 'committee',
 'meetings', 'of', 'the', 'agency', 'of', 'natural', 'resources', '
and', 'energy', ',', 'the',
 'officials', 'said', '.', 'they', 'said', 'miti', 'will', 'also',
'review', 'the', 'breakdown',
 'of', 'energy', 'supply', 'sources', ',', 'including', 'oil', ',',
'nuclear', ',', 'coal', 'and',
 'natural', 'gas', '.', 'nuclear', 'energy', 'provided', 'the', 'bu
lk', 'of', 'japan', "'", 's',
 'electric', 'power', 'in', 'the', 'fiscal', 'year', 'ended', 'marc
h', '31', ',', 'supplying',
 'an', 'estimated', '27', 'pct', 'on', 'a', 'kilowatt', '/', 'hour'
, 'basis', ',', 'followed',
 'by', 'oil', '(', '23', 'pct', ')', 'and', 'liquefied', 'natural',
'gas', '(', '21', 'pct', '),',
 'they', 'noted', '.', '<END>'],
 ['<START>', 'energy', '/', 'u', '.', 's', '.', 'petrochemical', 'in
dustry', 'cheap', 'oil',
 'feedstocks', ',', 'the', 'weakened', 'u', '.', 's', '.', 'dollar'
, 'and', 'a', 'plant',
 'utilization', 'rate', 'approaching', '90', 'pct', 'will', 'propel
', 'the', 'streamlined', 'u',
 '.', 's', '.', 'petrochemical', 'industry', 'to', 'record', 'profi
ts', 'this', 'year', ',',
 'with', 'growth', 'expected', 'through', 'at', 'least', '1990', ',
', 'major', 'company',
 'executives', 'predicted', '.', 'this', 'bullish', 'outlook', 'for
', 'chemical', 'manufacturing',
 'and', 'an', 'industrywide', 'move', 'to', 'shed', 'unrelated', 'b
usinesses', 'has', 'prompted',
 'gaf', 'corp', '&', 'lt', ';', 'gaf', '>,', 'privately', '-', 'hel
d', 'cain', 'chemical', 'inc',
 ',', 'and', 'other', 'firms', 'to', 'aggressively', 'seek', 'acqui
sitions', 'of', 'petrochemical',
 'plants', '.', 'oil', 'companies', 'such', 'as', 'ashland', 'oil',
'inc', '&', 'lt', ';', 'ash',
 '>,', 'the', 'kentucky', '-', 'based', 'oil', 'refiner', 'and', 'm
arketer', ',', 'are', 'also',
 'shopping', 'for', 'money', '-', 'making', 'petrochemical', 'busin
esses', 'to', 'buy', '.', '"',
 'i', 'see', 'us', 'poised', 'at', 'the', 'threshold', 'of', 'a', '
golden', 'period', ',"', 'said',
 'paul', 'oreffice', ',', 'chairman', 'of', 'giant', 'dow', 'chemic
al', 'co', '&', 'lt', ';',
 'dow', '>,', 'adding', ',', '"', 'there', "'", 's', 'no', 'major',
'plant', 'capacity', 'being',
 'added', 'around', 'the', 'world', 'now', '.', 'the', 'whole', 'ga
me', 'is', 'bringing', 'out',
 'new', 'products', 'and', 'improving', 'the', 'old', 'ones', '."',
'analysts', 'say', 'the',
 'chemical', 'industry', "'", 's', 'biggest', 'customers', ',', 'au
tomobile', 'manufacturers',
 'and', 'home', 'builders', 'that', 'use', 'a', 'lot', 'of', 'paint
s', 'and', 'plastics', ',',
 'are', 'expected', 'to', 'buy', 'quantities', 'this', 'year', '.',
'u', '.', 's', '.',
 'petrochemical', 'plants', 'are', 'currently', 'operating', 'at',
'about', '90', 'pct',
 'capacity', ',', 'reflecting', 'tighter', 'supply', 'that', 'could
', 'hike', 'product', 'prices',
 'by', '30', 'to', '40', 'pct', 'this', 'year', ',', 'said', 'john'
, 'dosher', ',', 'managing',
 'director', 'of', 'pace', 'consultants', 'inc', 'of', 'houston', '
.', 'demand', 'for', 'some',
 'products', 'such', 'as', 'styrene', 'could', 'push', 'profit', 'm
argins', 'up', 'by', 'as',
 'much', 'as', '300', 'pct', ',', 'he', 'said', '.', 'oreffice', ',
', 'speaking', 'at', 'a',
 'meeting', 'of', 'chemical', 'engineers', 'in', 'houston', ',', 's
aid', 'dow', 'would', 'easily',
 'top', 'the', '741', 'mln', 'dlrs', 'it', 'earned', 'last', 'year'
, 'and', 'predicted', 'it',
 'would', 'have', 'the', 'best', 'year', 'in', 'its', 'history', '.
', 'in', '1985', ',', 'when',
 'oil', 'prices', 'were', 'still', 'above', '25', 'dlrs', 'a', 'bar
rel', 'and', 'chemical',
 'exports', 'were', 'adversely', 'affected', 'by', 'the', 'strong',
'u', '.', 's', '.', 'dollar',
 ',', 'dow', 'had', 'profits', 'of', '58', 'mln', 'dlrs', '.', '"',
'i', 'believe', 'the',
 'entire', 'chemical', 'industry', 'is', 'headed', 'for', 'a', 'rec
ord', 'year', 'or', 'close',
 'to', 'it', ',"', 'oreffice', 'said', '.', 'gaf', 'chairman', 'sam
uel', 'heyman', 'estimated',
 'that', 'the', 'u', '.', 's', '.', 'chemical', 'industry', 'would'
, 'report', 'a', '20', 'pct',
 'gain', 'in', 'profits', 'during', '1987', '.', 'last', 'year', ',
', 'the', 'domestic',
 'industry', 'earned', 'a', 'total', 'of', '13', 'billion', 'dlrs',
',', 'a', '54', 'pct', 'leap',
 'from', '1985', '.', 'the', 'turn', 'in', 'the', 'fortunes', 'of',
'the', 'once', '-', 'sickly',
 'chemical', 'industry', 'has', 'been', 'brought', 'about', 'by', '
a', 'combination', 'of', 'luck',
 'and', 'planning', ',', 'said', 'pace', "'", 's', 'john', 'dosher'
, '.', 'dosher', 'said', 'last',
 'year', "'", 's', 'fall', 'in', 'oil', 'prices', 'made', 'feedstoc
ks', 'dramatically', 'cheaper',
 'and', 'at', 'the', 'same', 'time', 'the', 'american', 'dollar', '
was', 'weakening', 'against',
 'foreign', 'currencies', '.', 'that', 'helped', 'boost', 'u', '.',
's', '.', 'chemical',
 'exports', '.', 'also', 'helping', 'to', 'bring', 'supply', 'and',
'demand', 'into', 'balance',
 'has', 'been', 'the', 'gradual', 'market', 'absorption', 'of', 'th
e', 'extra', 'chemical',
 'manufacturing', 'capacity', 'created', 'by', 'middle', 'eastern',
'oil', 'producers', 'in',
 'the', 'early', '1980s', '.', 'finally', ',', 'virtually', 'all',
'major', 'u', '.', 's', '.',
 'chemical', 'manufacturers', 'have', 'embarked', 'on', 'an', 'exte
nsive', 'corporate',
 'restructuring', 'program', 'to', 'mothball', 'inefficient', 'plan
ts', ',', 'trim', 'the',
 'payroll', 'and', 'eliminate', 'unrelated', 'businesses', '.', 'th
e', 'restructuring', 'touched',
 'off', 'a', 'flurry', 'of', 'friendly', 'and', 'hostile', 'takeove
r', 'attempts', '.', 'gaf', ',',
 'which', 'made', 'an', 'unsuccessful', 'attempt', 'in', '1985', 't
o', 'acquire', 'union',
 'carbide', 'corp', '&', 'lt', ';', 'uk', '>,', 'recently', 'offere
d', 'three', 'billion', 'dlrs',
 'for', 'borg', 'warner', 'corp', '&', 'lt', ';', 'bor', '>,', 'a',
'chicago', 'manufacturer',
 'of', 'plastics', 'and', 'chemicals', '.', 'another', 'industry',
'powerhouse', ',', 'w', '.',
 'r', '.', 'grace', '&', 'lt', ';', 'gra', '>', 'has', 'divested',
'its', 'retailing', ',',
 'restaurant', 'and', 'fertilizer', 'businesses', 'to', 'raise', 'c
ash', 'for', 'chemical',
 'acquisitions', '.', 'but', 'some', 'experts', 'worry', 'that', 't
he', 'chemical', 'industry',
 'may', 'be', 'headed', 'for', 'trouble', 'if', 'companies', 'conti
nue', 'turning', 'their',
 'back', 'on', 'the', 'manufacturing', 'of', 'staple', 'petrochemic
al', 'commodities', ',', 'such',
 'as', 'ethylene', ',', 'in', 'favor', 'of', 'more', 'profitable',
'specialty', 'chemicals',
 'that', 'are', 'custom', '-', 'designed', 'for', 'a', 'small', 'gr
oup', 'of', 'buyers', '.', '"',
 'companies', 'like', 'dupont', '&', 'lt', ';', 'dd', '>', 'and', '
monsanto', 'co', '&', 'lt', ';',
 'mtc', '>', 'spent', 'the', 'past', 'two', 'or', 'three', 'years',
'trying', 'to', 'get', 'out',
 'of', 'the', 'commodity', 'chemical', 'business', 'in', 'reaction'
, 'to', 'how', 'badly', 'the',
 'market', 'had', 'deteriorated', ',"', 'dosher', 'said', '.', '"',
'but', 'i', 'think', 'they',
 'will', 'eventually', 'kill', 'the', 'margins', 'on', 'the', 'prof
itable', 'chemicals', 'in',
 'the', 'niche', 'market', '."', 'some', 'top', 'chemical', 'execut
ives', 'share', 'the',
 'concern', '.', '"', 'the', 'challenge', 'for', 'our', 'industry',
'is', 'to', 'keep', 'from',
 'getting', 'carried', 'away', 'and', 'repeating', 'past', 'mistake
s', ',"', 'gaf', "'", 's',
 'heyman', 'cautioned', '.', '"', 'the', 'shift', 'from', 'commodit
y', 'chemicals', 'may', 'be',
 'ill', '-', 'advised', '.', 'specialty', 'businesses', 'do', 'not'
, 'stay', 'special', 'long',
 '."', 'houston', '-', 'based', 'cain', 'chemical', ',', 'created',
'this', 'month', 'by', 'the',
 'sterling', 'investment', 'banking', 'group', ',', 'believes', 'it
', 'can', 'generate', '700',
 'mln', 'dlrs', 'in', 'annual', 'sales', 'by', 'bucking', 'the', 'i
ndustry', 'trend', '.',
 'chairman', 'gordon', 'cain', ',', 'who', 'previously', 'led', 'a'
, 'leveraged', 'buyout', 'of',
 'dupont', "'", 's', 'conoco', 'inc', "'", 's', 'chemical', 'busine
ss', ',', 'has', 'spent', '1',
 '.', '1', 'billion', 'dlrs', 'since', 'january', 'to', 'buy', 'sev
en', 'petrochemical', 'plants',
 'along', 'the', 'texas', 'gulf', 'coast', '.', 'the', 'plants', 'p
roduce', 'only', 'basic',
 'commodity', 'petrochemicals', 'that', 'are', 'the', 'building', '
blocks', 'of', 'specialty',
 'products', '.', '"', 'this', 'kind', 'of', 'commodity', 'chemical
', 'business', 'will', 'never',
 'be', 'a', 'glamorous', ',', 'high', '-', 'margin', 'business', ',
"', 'cain', 'said', ',',
 'adding', 'that', 'demand', 'is', 'expected', 'to', 'grow', 'by',
'about', 'three', 'pct',
 'annually', '.', 'garo', 'armen', ',', 'an', 'analyst', 'with', 'd
ean', 'witter', 'reynolds', ',',
 'said', 'chemical', 'makers', 'have', 'also', 'benefitted', 'by',
'increasing', 'demand', 'for',
 'plastics', 'as', 'prices', 'become', 'more', 'competitive', 'with
', 'aluminum', ',', 'wood',
 'and', 'steel', 'products', '.', 'armen', 'estimated', 'the', 'upt
urn', 'in', 'the', 'chemical',
 'business', 'could', 'last', 'as', 'long', 'as', 'four', 'or', 'fi
ve', 'years', ',', 'provided',
 'the', 'u', '.', 's', '.', 'economy', 'continues', 'its', 'modest'
, 'rate', 'of', 'growth', '.',
 '<END>'],
 ['<START>', 'turkey', 'calls', 'for', 'dialogue', 'to', 'solve', 'd
ispute', 'turkey', 'said',
 'today', 'its', 'disputes', 'with', 'greece', ',', 'including', 'r
ights', 'on', 'the',
 'continental', 'shelf', 'in', 'the', 'aegean', 'sea', ',', 'should
', 'be', 'solved', 'through',
 'negotiations', '.', 'a', 'foreign', 'ministry', 'statement', 'sai
d', 'the', 'latest', 'crisis',
 'between', 'the', 'two', 'nato', 'members', 'stemmed', 'from', 'th
e', 'continental', 'shelf',
 'dispute', 'and', 'an', 'agreement', 'on', 'this', 'issue', 'would
', 'effect', 'the', 'security',
 ',', 'economy', 'and', 'other', 'rights', 'of', 'both', 'countries
', '.', '"', 'as', 'the',
 'issue', 'is', 'basicly', 'political', ',', 'a', 'solution', 'can'
, 'only', 'be', 'found', 'by',
 'bilateral', 'negotiations', ',"', 'the', 'statement', 'said', '.'
, 'greece', 'has', 'repeatedly', 'said', 'the', 'issue', 'was', 'legal', 'and', 'could', 'be', 'solved', 'at', 'the',
  'international', 'court', 'of', 'justice', '.', 'the', 'two', 'countries', 'approached', 'armed',
  'confrontation', 'last', 'month', 'after', 'greece', 'announced', 'it', 'planned', 'oil',
  'exploration', 'work', 'in', 'the', 'aegean', 'and', 'turkey', 'said', 'it', 'would', 'also',
  'search', 'for', 'oil', '.', 'a', 'face', '-', 'off', 'was', 'averted', 'when', 'turkey',
  'confined', 'its', 'research', 'to', 'territorrial', 'waters', '.', '"', 'the', 'latest',
  'crises', 'created', 'an', 'historic', 'opportunity', 'to', 'solve', 'the', 'disputes', 'between',
  'the', 'two', 'countries', ',"', 'the', 'foreign', 'ministry', 'statement', 'said', '.', 'turkey',
  "'", 's', 'ambassador', 'in', 'athens', ',', 'nazmi', 'akiman', ',', 'was', 'due', 'to', 'meet',
  'prime', 'minister', 'andreas', 'papandreou', 'today', 'for', 'the', 'greek', 'reply', 'to', 'a',
  'message', 'sent', 'last', 'week', 'by', 'turkish', 'prime', 'minister', 'turgut', 'ozal', '.',
  'the', 'contents', 'of', 'the', 'message', 'were', 'not', 'disclosed', '.', '<END>']]

Question 1.1: Implement distinct_words [code] (2 points)

Write a method to work out the distinct words (word types) that occur in the corpus. You can do this with for loops, but it's more efficient to do it with Python list comprehensions. In particular, this (https://coderwall.com/p/rcmaea/flatten-a-list-of-lists-in-one-line-in-python) may be useful to flatten a list of lists; a small example of the idiom is sketched below. If you're not familiar with Python list comprehensions in general, here's more information (https://python-3-patterns-idioms-test.readthedocs.io/en/latest/Comprehensions.html).

Your returned corpus_words should be sorted. You can use Python's sorted function for this.

You may find it useful to use Python sets (https://www.w3schools.com/python/python_sets.asp) to remove duplicate words.
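For reference, the flatten-a-list-of-lists idiom linked above looks like this (a standalone sketch with made-up data, not part of the provided starter code):

```python
list_of_lists = [["all", "that", "glitters"], ["all", "is", "well"]]

# Flatten a list of lists with a single comprehension, then deduplicate and sort
flattened = [w for inner in list_of_lists for w in inner]
distinct = sorted(set(flattened))

print(distinct)  # ['all', 'glitters', 'is', 'that', 'well']
```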
In [42]: def distinct_words(corpus):
             """ Determine a list of distinct words for the corpus.
                 Params:
                     corpus (list of list of strings): corpus of documents
                 Return:
                     corpus_words (list of strings): sorted list of distinct words across the corpus
                     num_corpus_words (integer): number of distinct words across the corpus
             """
             corpus_words = []
             num_corpus_words = -1

             # ------------------
             # Flatten the corpus into one list of words with a list comprehension
             corpus_words = [word for document in corpus for word in document]
             # Capture distinct words using a set
             corpus_words_set = set(corpus_words)
             # Turn the set into a sorted list
             corpus_words = sorted(list(corpus_words_set))
             # Get the number of distinct words across the corpus
             num_corpus_words = len(corpus_words)
             # ------------------

             return corpus_words, num_corpus_words
In [43]: # ---------------------
         # Run this sanity check
         # Note that this is not an exhaustive check for correctness.
         # ---------------------

         # Define toy corpus
         test_corpus = ["{} All that glitters isn't gold {}".format(START_TOKEN, END_TOKEN).split(" "),
                        "{} All's well that ends well {}".format(START_TOKEN, END_TOKEN).split(" ")]
         test_corpus_words, num_corpus_words = distinct_words(test_corpus)

         # Correct answers
         ans_test_corpus_words = sorted([START_TOKEN, "All", "ends", "that", "gold", "All's", "glitters", "isn't", "well", END_TOKEN])
         ans_num_corpus_words = len(ans_test_corpus_words)

         # Test correct number of words
         assert(num_corpus_words == ans_num_corpus_words), "Incorrect number of distinct words. Correct: {}. Yours: {}".format(ans_num_corpus_words, num_corpus_words)

         # Test correct words
         assert (test_corpus_words == ans_test_corpus_words), "Incorrect corpus_words.\nCorrect: {}\nYours: {}".format(str(ans_test_corpus_words), str(test_corpus_words))

         # Print Success
         print ("-" * 80)
         print("Passed All Tests!")
         print ("-" * 80)
--------------------------------------------------------------------------------
Passed All Tests!
--------------------------------------------------------------------------------

Question 1.2: Implement compute_co_occurrence_matrix [code] (3 points)

Write a method that constructs a co-occurrence matrix for a certain window-size $n$ (with a default of 4), considering words $n$ before and $n$ after the word in the center of the window. Here, we start to use numpy (np) to represent vectors, matrices, and tensors. If you're not familiar with NumPy, there's a NumPy tutorial in the second half of this cs231n Python NumPy tutorial (http://cs231n.github.io/python-numpy-tutorial/).
In [44]: def compute_co_occurrence_matrix(corpus, window_size=4):
             """ Compute co-occurrence matrix for the given corpus and window_size (default of 4).

                 Note: Each word in a document should be at the center of a window. Words near edges will have a smaller
                 number of co-occurring words.

                 For example, if we take the document "<START> All that glitters is not gold <END>" with window size of 4,
                 "All" will co-occur with "<START>", "that", "glitters", "is", and "not".

                 Params:
                     corpus (list of list of strings): corpus of documents
                     window_size (int): size of context window
                 Return:
                     M (a symmetric numpy matrix of shape (number of unique words in the corpus, number of unique words in the corpus)):
                         Co-occurrence matrix of word counts.
                         The ordering of the words in the rows/columns should be the same as the ordering of the words given by the distinct_words function.
                     word2ind (dict): dictionary that maps word to index (i.e. row/column number) for matrix M.
             """
             words, num_words = distinct_words(corpus)
             M = None
             word2ind = {}

             # ------------------
             # Initialize numpy matrix M of shape num_words x num_words with all zeros
             M = np.zeros((num_words, num_words))
             # Initialize dictionary mapping words to indices
             for i in range(num_words):
                 word2ind[words[i]] = i

             # Loop through each document in the corpus
             for sentence in corpus:
                 # Visit each word in this document as the center word
                 for i, center_word in enumerate(sentence):
                     center_index = word2ind[center_word]
                     # Capture the window boundaries based on the current index and window size
                     left_index = max(i - window_size, 0)
                     right_index = min(i + window_size, len(sentence) - 1)
                     # Loop through the left half of the window only, adding 1 to both symmetric entries
                     for j in range(left_index, i):
                         window_word = sentence[j]
                         M[center_index][word2ind[window_word]] += 1
                         M[word2ind[window_word]][center_index] += 1
             # ------------------

             return M, word2ind
In [45]: # ---------------------
         # Run this sanity check
         # Note that this is not an exhaustive check for correctness.
         # ---------------------

         # Define toy corpus and get student's co-occurrence matrix
         test_corpus = ["{} All that glitters isn't gold {}".format(START_TOKEN, END_TOKEN).split(" "),
                        "{} All's well that ends well {}".format(START_TOKEN, END_TOKEN).split(" ")]
         M_test, word2ind_test = compute_co_occurrence_matrix(test_corpus, window_size=1)

         # Correct M and word2ind
         M_test_ans = np.array(
             [[0., 0., 0., 0., 0., 0., 1., 0., 0., 1.,],
              [0., 0., 1., 1., 0., 0., 0., 0., 0., 0.,],
              [0., 1., 0., 0., 0., 0., 0., 0., 1., 0.,],
              [0., 1., 0., 0., 0., 0., 0., 0., 0., 1.,],
              [0., 0., 0., 0., 0., 0., 0., 0., 1., 1.,],
              [0., 0., 0., 0., 0., 0., 0., 1., 1., 0.,],
              [1., 0., 0., 0., 0., 0., 0., 1., 0., 0.,],
              [0., 0., 0., 0., 0., 1., 1., 0., 0., 0.,],
              [0., 0., 1., 0., 1., 1., 0., 0., 0., 1.,],
              [1., 0., 0., 1., 1., 0., 0., 0., 1., 0.,]]
         )
         ans_test_corpus_words = sorted([START_TOKEN, "All", "ends", "that", "gold", "All's", "glitters", "isn't", "well", END_TOKEN])
         word2ind_ans = dict(zip(ans_test_corpus_words, range(len(ans_test_corpus_words))))

         # Test correct word2ind
         assert (word2ind_ans == word2ind_test), "Your word2ind is incorrect:\nCorrect: {}\nYours: {}".format(word2ind_ans, word2ind_test)

         # Test correct M shape
         assert (M_test.shape == M_test_ans.shape), "M matrix has incorrect shape.\nCorrect: {}\nYours: {}".format(M_test.shape, M_test_ans.shape)

         # Test correct M values
         for w1 in word2ind_ans.keys():
             idx1 = word2ind_ans[w1]
             for w2 in word2ind_ans.keys():
                 idx2 = word2ind_ans[w2]
                 student = M_test[idx1, idx2]
                 correct = M_test_ans[idx1, idx2]
                 if student != correct:
                     print("Correct M:")
                     print(M_test_ans)
                     print("Your M: ")
                     print(M_test)
                     raise AssertionError("Incorrect count at index ({}, {})=({}, {}) in matrix M. Yours has {} but should have {}.".format(idx1, idx2, w1, w2, student, correct))

         # Print Success
         print ("-" * 80)
         print("Passed All Tests!")
         print ("-" * 80)
--------------------------------------------------------------------------------
Passed All Tests!
--------------------------------------------------------------------------------

Question 1.3: Implement reduce_to_k_dim [code] (1 point)

Construct a method that performs dimensionality reduction on the matrix $M$ to produce k-dimensional embeddings. Use SVD to take the top k components and produce a new matrix of k-dimensional embeddings.

Note: All of numpy, scipy, and scikit-learn (sklearn) provide some implementation of SVD, but only scipy and sklearn provide an implementation of Truncated SVD, and only sklearn provides an efficient randomized algorithm for calculating large-scale Truncated SVD. So please use sklearn.decomposition.TruncatedSVD (https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html).
In [46]: def reduce_to_k_dim(M, k=2):
             """ Reduce a co-occurrence count matrix of dimensionality (num_corpus_words, num_corpus_words)
                 to a matrix of dimensionality (num_corpus_words, k) using the following SVD function from Scikit-Learn:
                     - http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html

                 Params:
                     M (numpy matrix of shape (number of unique words in the corpus, number of unique words in the corpus)): co-occurrence matrix of word counts
                     k (int): embedding size of each word after dimension reduction
                 Return:
                     M_reduced (numpy matrix of shape (number of corpus words, k)): matrix of k-dimensional word embeddings.
                         In terms of the SVD from math class, this actually returns U * S
             """
             n_iters = 10  # Use this parameter in your call to `TruncatedSVD`
             M_reduced = None
             print("Running Truncated SVD over %i words..." % (M.shape[0]))

             # ------------------
             # Truncated SVD keeps only the top k components; fit_transform returns the reduced embeddings (U_k * S_k)
             svd = TruncatedSVD(n_components=k, n_iter=n_iters)
             M_reduced = svd.fit_transform(M)
             # ------------------

             print("Done.")
             return M_reduced
In [47]: # ---------------------
         # Run this sanity check
         # Note that this is not an exhaustive check for correctness
         # In fact we only check that your M_reduced has the right dimensions.
         # ---------------------

         # Define toy corpus and run student code
         test_corpus = ["{} All that glitters isn't gold {}".format(START_TOKEN, END_TOKEN).split(" "),
                        "{} All's well that ends well {}".format(START_TOKEN, END_TOKEN).split(" ")]
         M_test, word2ind_test = compute_co_occurrence_matrix(test_corpus, window_size=1)
         M_test_reduced = reduce_to_k_dim(M_test, k=2)

         # Test proper dimensions
         assert (M_test_reduced.shape[0] == 10), "M_reduced has {} rows; should have {}".format(M_test_reduced.shape[0], 10)
         assert (M_test_reduced.shape[1] == 2), "M_reduced has {} columns; should have {}".format(M_test_reduced.shape[1], 2)

         # Print Success
         print ("-" * 80)
         print("Passed All Tests!")
         print ("-" * 80)
Running Truncated SVD over 10 words...
Done.
--------------------------------------------------------------------------------
Passed All Tests!
--------------------------------------------------------------------------------

Question 1.4: Implement plot_embeddings [code] (1 point)

Here you will write a function to plot a set of 2D vectors in 2D space. For graphs, we will use Matplotlib (plt).

For this example, you may find it useful to adapt this code (http://web.archive.org/web/20190924160434/https://www.pythonmembers.club/2018/05/08/matplotlibscatter-plot-annotate-set-text-at-label-each-point/). In the future, a good way to make a plot is to look at the Matplotlib gallery (https://matplotlib.org/gallery/index.html), find a plot that looks somewhat like what you want, and adapt the code they give.
In [48]: def plot_embeddings(M_reduced, word2ind, words):
             """ Plot in a scatterplot the embeddings of the words specified in the list "words".
                 NOTE: do not plot all the words listed in M_reduced / word2ind.
                 Include a label next to each point.

                 Params:
                     M_reduced (numpy matrix of shape (number of unique words in the corpus, 2)): matrix of 2-dimensional word embeddings
                     word2ind (dict): dictionary that maps word to indices for matrix M
                     words (list of strings): words whose embeddings we want to visualize
             """
             # ------------------
             # For each requested word, look up its 2D coordinates, plot a marker, and label the point
             for count, word in enumerate(words):
                 x = M_reduced[word2ind[word], 0]
                 y = M_reduced[word2ind[word], 1]
                 plt.scatter(x, y, marker='x', color='red')
                 plt.text(x, y, word, fontsize=10)
             #plt.show()
             # ------------------
In [49]: # ---------------------
         # Run this sanity check
         # Note that this is not an exhaustive check for correctness.
         # The plot produced should look like the "test solution plot" depicted below.
         # ---------------------

         print ("-" * 80)
         print ("Outputted Plot:")

         M_reduced_plot_test = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1], [0, 0]])
         word2ind_plot_test = {'test1': 0, 'test2': 1, 'test3': 2, 'test4': 3, 'test5': 4}
         words = ['test1', 'test2', 'test3', 'test4', 'test5']
         plot_embeddings(M_reduced_plot_test, word2ind_plot_test, words)

         print ("-" * 80)

--------------------------------------------------------------------------------
Outputted Plot:
--------------------------------------------------------------------------------
**Test Plot Solution**
Question 1.5: Co-Occurrence Plot Analysis [written] (3 points)
Now we will put together all the parts you have written! We will compute the co-occurrence matrix with fixed
window of 4 (the default window size), over the Reuters "crude" (oil) corpus. Then we will use TruncatedSVD
to compute 2-dimensional embeddings of each word. TruncatedSVD returns U*S, so we need to normalize
the returned vectors, so that all the vectors will appear around the unit circle (therefore closeness is
directional closeness). Note: The line of code below that does the normalizing uses the NumPy concept of
broadcasting. If you don't know about broadcasting, check out Computation on Arrays: Broadcasting by
Jake VanderPlas (https://jakevdp.github.io/PythonDataScienceHandbook/02.05-computation-on-arraysbroadcasting.html).
Run the below cell to produce the plot. It'll probably take a few seconds to run. What clusters together in 2-
dimensional embedding space? What doesn't cluster together that you might think should have? Note:
"bpd" stands for "barrels per day" and is a commonly used abbreviation in crude oil topic articles.
In [50]: # -----------------------------
         # Run This Cell to Produce Your Plot
         # ------------------------------
         reuters_corpus = read_corpus()
         M_co_occurrence, word2ind_co_occurrence = compute_co_occurrence_matrix(reuters_corpus)
         M_reduced_co_occurrence = reduce_to_k_dim(M_co_occurrence, k=2)

         # Rescale (normalize) the rows to make them each of unit-length
         M_lengths = np.linalg.norm(M_reduced_co_occurrence, axis=1)
         M_normalized = M_reduced_co_occurrence / M_lengths[:, np.newaxis]  # broadcasting

         words = ['barrels', 'bpd', 'ecuador', 'energy', 'industry', 'kuwait', 'oil', 'output', 'petroleum', 'iraq']
         plot_embeddings(M_normalized, word2ind_co_occurrence, words)
Running Truncated SVD over 8185 words...
Done.

Country names such as iraq, ecuador, and kuwait (which is not far from them) cluster together, since these terms are all of the same type and appear in similar contexts. However, bpd (barrels per day), barrels, and probably output don't cluster together, even though they have similar meanings in the context of oil and energy.
Part 2: Prediction-Based Word Vectors (15 points)
As discussed in class, more recently prediction-based word vectors have demonstrated better performance,
such as word2vec and GloVe (which also utilizes the benefit of counts). Here, we shall explore the
embeddings produced by GloVe. Please revisit the class notes and lecture slides for more details on the
word2vec and GloVe algorithms. If you're feeling adventurous, challenge yourself and try reading GloVe's
original paper (https://nlp.stanford.edu/pubs/glove.pdf).
Then run the following cells to load the GloVe vectors into memory. Note: If this is your first time running these cells, i.e. downloading the embedding model, it will take a couple of minutes. If you've run these cells before, rerunning them will load the model without redownloading it, which will take about 1 to 2 minutes.
In [51]: def load_embedding_model():
             """ Load GloVe Vectors
                 Return:
                     wv_from_bin: All 400000 embeddings, each of length 200
             """
             import gensim.downloader as api
             wv_from_bin = api.load("glove-wiki-gigaword-200")
             print("Loaded vocab size %i" % len(wv_from_bin.vocab.keys()))
             return wv_from_bin
In [52]: # -----------------------------------
# Run Cell to Load Word Vectors
# Note: This will take a couple minutes
# -----------------------------------
wv_from_bin = load_embedding_model()
Loaded vocab size 400000

Note: If you are receiving a "reset by peer" error, rerun the cell to restart the download.

Reducing dimensionality of Word Embeddings

Let's directly compare the GloVe embeddings to those of the co-occurrence matrix. In order to avoid running out of memory, we will work with a sample of 10000 GloVe vectors instead. Run the following cells to:
1. Put 10000 Glove vectors into a matrix M
2. Run reduce_to_k_dim (your Truncated SVD function) to reduce the vectors from 200-dimensional to 2-dimensional.
In [53]: def get_matrix_of_vectors(wv_from_bin, required_words=['barrels', 'bpd', 'ecuador', 'energy', 'industry', 'kuwait', 'oil', 'output', 'petroleum', 'iraq']):
             """ Put the GloVe vectors into a matrix M.
                 Param:
                     wv_from_bin: KeyedVectors object; the 400000 GloVe vectors loaded from file
                 Return:
                     M: numpy matrix shape (num words, 200) containing the vectors
                     word2ind: dictionary mapping each word to its row number in M
             """
             import random
             words = list(wv_from_bin.vocab.keys())
             print("Shuffling words ...")
             random.seed(224)
             random.shuffle(words)
             words = words[:10000]
             print("Putting %i words into word2ind and matrix M..." % len(words))
             word2ind = {}
             M = []
             curInd = 0
             for w in words:
                 try:
                     M.append(wv_from_bin.word_vec(w))
                     word2ind[w] = curInd
                     curInd += 1
                 except KeyError:
                     continue
             for w in required_words:
                 if w in words:
                     continue
                 try:
                     M.append(wv_from_bin.word_vec(w))
                     word2ind[w] = curInd
                     curInd += 1
                 except KeyError:
                     continue
             M = np.stack(M)
             print("Done.")
             return M, word2ind
In [54]: # -----------------------------------------------------------------
         # Run Cell to Reduce 200-Dimensional Word Embeddings to k Dimensions
         # Note: This should be quick to run
         # -----------------------------------------------------------------
         M, word2ind = get_matrix_of_vectors(wv_from_bin)
         M_reduced = reduce_to_k_dim(M, k=2)

         # Rescale (normalize) the rows to make them each of unit-length
         M_lengths = np.linalg.norm(M_reduced, axis=1)
         M_reduced_normalized = M_reduced / M_lengths[:, np.newaxis]  # broadcasting
Shuffling words ...
Putting 10000 words into word2ind and matrix M...
Done.
Running Truncated SVD over 10010 words...
Done.

Note: If you are receiving out of memory issues on your local machine, try closing other applications to free more memory on your device. You may want to try restarting your machine so that you can free up extra memory. Then immediately run the jupyter notebook and see if you can load the word vectors properly. If you still have problems with loading the embeddings onto your local machine after this, please go to office hours or contact course staff.

Question 2.1: GloVe Plot Analysis [written] (3 points)

Run the cell below to plot the 2D GloVe embeddings for ['barrels', 'bpd', 'ecuador', 'energy', 'industry', 'kuwait', 'oil', 'output', 'petroleum', 'iraq'].

What clusters together in 2-dimensional embedding space? What doesn't cluster together that you think should have? How is the plot different from the one generated earlier from the co-occurrence matrix? What is a possible cause for the difference?
In [55]: words = ['barrels', 'bpd', 'ecuador', 'energy', 'industry', 'kuwait', 'oil', 'output', 'petroleum', 'iraq']
         plot_embeddings(M_reduced_normalized, word2ind, words)

Now there are two clusters in the 2D embedding space: (ecuador, iraq, petroleum) and (energy, industry). As before, bpd, barrels, and output don't cluster together, even though I think they should. The clustering is different from the previous one in that the country names no longer cluster on their own (kuwait is now far from the first cluster, and petroleum has joined it). A possible cause for the difference is that GloVe goes beyond raw co-occurrence counts when building word vectors: it is trained on aggregated global word-word co-occurrence statistics from a corpus and looks at ratios of co-occurrence probabilities, which lets it capture more of the meaning of words than co-occurrence counts alone.
Cosine Similarity

Now that we have word vectors, we need a way to quantify the similarity between individual words, according to these vectors. One such metric is cosine-similarity. We will be using this to find words that are "close" and "far" from one another.

We can think of n-dimensional vectors as points in n-dimensional space. If we take this perspective, L1 (http://mathworld.wolfram.com/L1-Norm.html) and L2 (http://mathworld.wolfram.com/L2-Norm.html) distances help quantify the amount of space "we must travel" to get between these two points. Another approach is to examine the angle between two vectors. From trigonometry we know that:

$$\text{similarity} = \cos(\Theta)$$

Instead of computing the actual angle, we can leave the similarity in terms of $\cos(\Theta)$. Formally the Cosine Similarity (https://en.wikipedia.org/wiki/Cosine_similarity) $s$ between two vectors $p$ and $q$ is defined as:

$$s = \frac{p \cdot q}{||p|| \, ||q||}, \text{ where } s \in [-1, 1]$$
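A minimal numpy sketch of this formula (using made-up vectors, not the GloVe embeddings loaded above):

```python
import numpy as np

def cosine_similarity(p, q):
    # s = (p . q) / (||p|| * ||q||), with s in [-1, 1]
    return np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q))

p = np.array([1.0, 2.0, 3.0])
q = np.array([2.0, 4.0, 6.0])
print(cosine_similarity(p, q))   # 1.0: parallel vectors
print(cosine_similarity(p, -q))  # -1.0: opposite directions
```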
Question 2.2: Words with Multiple Meanings (1.5 points) [code + written]
Polysemes and homonyms are words that have more than one meaning (see this wiki page
(https://en.wikipedia.org/wiki/Polysemy) to learn more about the difference between polysemes and
homonyms ). Find a word with at least two different meanings such that the top-10 most similar words
(according to cosine similarity) contain related words from both meanings. For example, "leaves" has both
"go_away" and "a_structure_of_a_plant" meaning in the top 10, and "scoop" has both
"handed_waffle_cone" and "lowdown". You will probably need to try several polysemous or homonymic
words before you find one.
Please state the word you discover and the multiple meanings that occur in the top 10. Why do you think
many of the polysemous or homonymic words you tried didn't work (i.e. the top-10 most similar words only
contain one of the meanings of the words)?
Note: You should use the wv_from_bin.most_similar(word) function to get the top 10 similar words.
This function ranks all other words in the vocabulary with respect to their cosine similarity to the given word.
For further assistance, please check the GenSim documentation
(https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.FastTextKeyedV
In [56]: # ------------------
wv_from_bin.most_similar("chair")
# ------------------
Out[56]: [('chairs', 0.7992520928382874),
          ('sitting', 0.5966616272926331),
          ('head', 0.5494985580444336),
          ('seat', 0.5444087386131287),
          ('sits', 0.5396795868873596),
          ('sit', 0.5316585898399353),
          ('sat', 0.531306266784668),
          ('chaired', 0.5231366157531738),
          ('panel', 0.5091920495033264),
          ('board', 0.503804624080658)]

The word 'chair' has at least two meanings: one is a separate seat for one person, typically with a back and four legs; the other is the person in charge of a meeting or an organization. The former meaning corresponds to similar words such as 'sitting' (2nd), 'seat' (4th), and 'sits' (5th), while the latter corresponds to 'head' (3rd), 'panel' (9th), and 'board' (10th). Many of the polysemous or homonymic words I tried might not work because the words related to one of the meanings are not close enough to the target word in the embedding space, so only one of the meanings shows up in the top 10.
Question 2.3: Synonyms & Antonyms (2 points) [code + written]

When considering Cosine Similarity, it's often more convenient to think of Cosine Distance, which is simply 1 - Cosine Similarity.

Find three words $(w_1, w_2, w_3)$ where $w_1$ and $w_2$ are synonyms and $w_1$ and $w_3$ are antonyms, but Cosine Distance$(w_1, w_3) <$ Cosine Distance$(w_1, w_2)$.

As an example, $w_1$="happy" is closer to $w_3$="sad" than to $w_2$="cheerful". Please find a different example that satisfies the above. Once you have found your example, please give a possible explanation for why this counter-intuitive result may have happened.

You should use the wv_from_bin.distance(w1, w2) function here in order to compute the cosine distance between two words. Please see the GenSim documentation (https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.FastTextKeyedVectors) for further assistance.
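Note that gensim's distance is exactly one minus its cosine similarity; the quick sketch below checks this relationship (it assumes wv_from_bin from the earlier cells is still in memory):

```python
# Cosine distance = 1 - cosine similarity (sanity check, assumes wv_from_bin is loaded)
w_a, w_b = "happy", "cheerful"
dist = wv_from_bin.distance(w_a, w_b)
sim = wv_from_bin.similarity(w_a, w_b)
print(dist, 1 - sim)                 # these two values should match
assert abs(dist - (1 - sim)) < 1e-6
```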
In [57]: # ------------------
         w1 = "mr"
         w2 = "sir"
         w3 = "mrs"
         w1_w2_dist = wv_from_bin.distance(w1, w2)
         w1_w3_dist = wv_from_bin.distance(w1, w3)

         print("Synonyms {}, {} have cosine distance: {}".format(w1, w2, w1_w2_dist))
         print("Antonyms {}, {} have cosine distance: {}".format(w1, w3, w1_w3_dist))
         # ------------------
Synonyms mr, sir have cosine distance: 0.48897749185562134
Antonyms mr, mrs have cosine distance: 0.3088870644569397

Typically, 'mr' and 'sir' have a similar meaning, while 'mr' and 'mrs' are used for different genders. However, the cosine distance between 'mr' and 'mrs' is smaller, i.e. they are closer in the embedding space. This counter-intuitive result may have happened because 'mr' and 'mrs' are usually used in the same contexts, even though they do not have identical meanings.
Question 2.4: Analogies with Word Vectors [written] (1.5 points)

Word vectors have been shown to sometimes exhibit the ability to solve analogies.

As an example, for the analogy "man : king :: woman : x" (read: man is to king as woman is to x), what is x?

In the cell below, we show you how to use word vectors to find x using the most_similar function from the GenSim documentation (https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.KeyedVectors.most_similar). The function finds words that are most similar to the words in the positive list and most dissimilar from the words in the negative list (while omitting the input words, which are often the most similar; see this paper (https://www.aclweb.org/anthology/N18-2039.pdf)). The answer to the analogy will have the highest cosine similarity (largest returned numerical value).
In [58]: # Run this cell to answer the analogy -- man : king :: woman : x
         pprint.pprint(wv_from_bin.most_similar(positive=['woman', 'king'], negative=['man']))

[('queen', 0.6978678703308105),
 ('princess', 0.6081745028495789),
 ('monarch', 0.5889754891395569),
 ('throne', 0.5775108933448792),
 ('prince', 0.5750998258590698),
 ('elizabeth', 0.5463595986366272),
 ('daughter', 0.5399125814437866),
 ('kingdom', 0.5318052172660828),
 ('mother', 0.5168544054031372),
 ('crown', 0.5164473056793213)]

Let $m$, $k$, $w$, and $x$ denote the word vectors for man, king, woman, and the answer, respectively. Using only vectors $m$, $k$, $w$, and the vector arithmetic operators $+$ and $-$ in your answer, what is the expression in which we are maximizing cosine similarity with $x$?

Hint: Recall that word vectors are simply multi-dimensional vectors that represent a word. It might help to draw out a 2D example using arbitrary locations of each vector. Where would man and woman lie in the coordinate plane relative to king and the answer?

$w + k - m$
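To see this expression at work, here is a small sketch (assuming wv_from_bin is still loaded; it uses the same word_vec accessor as the cells above) that builds $w + k - m$ explicitly and scores candidate words by cosine similarity:

```python
import numpy as np

# Build the analogy vector w + k - m explicitly (assumes wv_from_bin is loaded)
m = wv_from_bin.word_vec("man")
k = wv_from_bin.word_vec("king")
w = wv_from_bin.word_vec("woman")
x = w + k - m

def cos_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# 'queen' is close to w + k - m; note the input word 'king' itself may score
# even higher, which is why most_similar omits the input words.
print(cos_sim(x, wv_from_bin.word_vec("queen")))
print(cos_sim(x, wv_from_bin.word_vec("king")))
```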
Question 2.5: Finding Analogies [code + written] (1.5 points)

Find an example of an analogy that holds according to these vectors (i.e. the intended word is ranked top). In your solution please state the full analogy in the form x:y :: a:b. If you believe the analogy is complicated, explain why the analogy holds in one or two sentences.

Note: You may have to try many analogies to find one that works!
In [59]: # ------------------
         pprint.pprint(wv_from_bin.most_similar(positive=['niece', 'brother'], negative=['sister']))
         # ------------------

[('nephew', 0.8642321825027466),
 ('grandson', 0.7840785384178162),
 ('son', 0.7793194055557251),
 ('uncle', 0.7447565197944641),
 ('cousin', 0.7052454352378845),
 ('son-in-law', 0.6898064613342285),
 ('eldest', 0.6872141361236572),
 ('father', 0.6855365633964539),
 ('brother-in-law', 0.6810104846954346),
 ('grandfather', 0.678931474685669)]

sister : brother :: niece : nephew
Question 2.6: Incorrect Analogy [code + written] (1.5 points)

Find an example of an analogy that does not hold according to these vectors. In your solution, state the intended analogy in the form x:y :: a:b, and state the (incorrect) value of b according to the word vectors.

In [60]: # ------------------
         pprint.pprint(wv_from_bin.most_similar(positive=['patient', 'student'], negative=['teacher']))
         # ------------------

[('patients', 0.7294403314590454),
 ('doctors', 0.5614489316940308),
 ('treatment', 0.5569756627082825),
 ('medical', 0.5410650968551636),
 ('hospital', 0.5231661796569824),
 ('treating', 0.5166430473327637),
 ('treat', 0.5109313130378723),
 ('psychiatric', 0.5056506991386414),
 ('treated', 0.49996280670166016),
 ('care', 0.49673765897750854)]

The intended analogy is teacher : student :: doctor : patient; however, 'doctor' does not show up as the top choice in the most-similar list. The top (incorrect) result is 'patients'.
Question 2.7: Guided Analysis of Bias in Word Vectors [written] (1 point)
It's important to be cognizant of the biases (gender, race, sexual orientation etc.) implicit in our word
embeddings. Bias can be dangerous because it can reinforce stereotypes through applications that employ
these models.
Run the cell below, to examine (a) which terms are most similar to "woman" and "worker" and most dissimilar to "man", and (b) which terms are most similar to "man" and "worker" and most dissimilar to "woman". Point out the difference between the list of female-associated words and the list of male-associated words, and explain how it is reflecting gender bias.
In [61]: # Run this cell
         # Here `positive` indicates the list of words to be similar to and `negative` indicates the list of words to be
         # most dissimilar from.
         pprint.pprint(wv_from_bin.most_similar(positive=['woman', 'worker'], negative=['man']))
         print()
         pprint.pprint(wv_from_bin.most_similar(positive=['man', 'worker'], negative=['woman']))
[('employee', 0.6375863552093506),
 ('workers', 0.6068919897079468),
 ('nurse', 0.5837947130203247),
 ('pregnant', 0.5363885760307312),
 ('mother', 0.5321309566497803),
 ('employer', 0.5127025842666626),
 ('teacher', 0.5099577307701111),
 ('child', 0.5096741914749146),
 ('homemaker', 0.5019455552101135),
 ('nurses', 0.4970571994781494)]

[('workers', 0.611325740814209),
 ('employee', 0.5983108878135681),
 ('working', 0.5615329742431641),
 ('laborer', 0.5442320108413696),
 ('unemployed', 0.5368517637252808),
 ('job', 0.5278826951980591),
 ('work', 0.5223963260650635),
 ('mechanic', 0.5088937282562256),
 ('worked', 0.5054520964622498),
 ('factory', 0.4940453767776489)]

For man : worker :: woman : x, the top word is 'employee', and the female-associated list also contains terms like 'nurse', 'pregnant', 'mother', and 'homemaker'. For woman : worker :: man : x, the top word is 'workers', followed by terms like 'laborer', 'mechanic', and 'factory'. This reflects gender bias in the available context: women are associated with the role of employee and with caregiving or domestic work, while men are associated with generic labor and manual trades.
Question 2.8: Independent Analysis of Bias in Word Vectors [code + written] (1 point)

Use the most_similar function to find another case where some bias is exhibited by the vectors. Please briefly explain the example of bias that you discover.
In [62]: # ------------------
         print("man:doctor :: woman: ?")
         pprint.pprint(wv_from_bin.most_similar(positive=['woman', 'doctor'], negative=['man']))
         print("woman:doctor :: man: ?")
         pprint.pprint(wv_from_bin.most_similar(positive=['man', 'doctor'], negative=['woman']))
         # ------------------
man:doctor :: woman: ?
[('nurse', 0.6813318729400635),
 ('physician', 0.6672453284263611),
 ('doctors', 0.6173422932624817),
 ('dentist', 0.5775880217552185),
 ('surgeon', 0.5691418647766113),
 ('hospital', 0.564996600151062),
 ('pregnant', 0.5649075508117676),
 ('nurses', 0.5590691566467285),
 ('medical', 0.5542058944702148),
 ('patient', 0.5518484711647034)]
woman:doctor :: man: ?
[('dr.', 0.5486295819282532),
 ('physician', 0.5327188372612),
 ('he', 0.5275284647941589),
 ('him', 0.5230658054351807),
 ('himself', 0.5116502642631531),
 ('medical', 0.5046803951263428),
 ('his', 0.5044265985488892),
 ('brother', 0.503484845161438),
 ('surgeon', 0.5005415081977844),
 ('mr.', 0.4938008189201355)]

For man : doctor :: woman : x, the top word is 'nurse', whereas for woman : doctor :: man : x, the top word is 'dr.'. This again reflects gender bias in the word embeddings: in the contexts the vectors were trained on, women are less associated with the role of doctor and more with that of nurse, and that association is carried into the embedding space.
Question 2.9: Thinking About Bias [written] (2 points)
Give one explanation of how bias gets into the word vectors. What is an experiment that you could do to
test for or to measure this source of bias?
Some biases come from the training data itself: the corpus carries the implicit biases of the people and society that produced it, and those associations end up encoded in the co-occurrence statistics the vectors are learned from. One way to measure this source of bias is to define a cost function over a set of analogy or similarity queries like the ones above, where a biased completion (e.g. man : doctor :: woman : nurse) counts as an incorrect prediction; since incorrect predictions are undesirable, we can compute this cost for embeddings trained on different corpora to test how much bias each data source contributes.
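One possible version of such an experiment, sketched below under the assumption that wv_from_bin is loaded and using an illustrative, hand-picked list of occupation words (not a standard benchmark): score each occupation by its similarity to 'man' minus its similarity to 'woman' and look for a systematic skew.

```python
# Sketch of a simple bias measurement (assumes wv_from_bin is loaded);
# the occupation list is illustrative only.
occupations = ["doctor", "nurse", "engineer", "teacher", "mechanic", "homemaker"]

for occ in occupations:
    gap = wv_from_bin.similarity(occ, "man") - wv_from_bin.similarity(occ, "woman")
    print("%-10s similarity gap (man - woman): %+.3f" % (occ, gap))

# Systematically positive or negative gaps for stereotyped occupations would
# quantify the bias inherited from the training corpus; repeating the test on
# embeddings trained on different corpora would show how much depends on the data.
```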
Submission Instructions
1. Click the Save button at the top of the Jupyter Notebook.
2. Select Cell -> All Output -> Clear. This will clear all the outputs from all cells (but will keep the content of
all cells).
3. Select Cell -> Run All. This will run all the cells in order, and will take several minutes.
4. Once you've rerun everything, select File -> Download as -> PDF via LaTeX (If you have trouble using
"PDF via LaTex", you can also save the webpage as pdf. Make sure all your solutions especially the
coding parts are displayed in the pdf, it's okay if the provided codes get cut off because lines are not
wrapped in code cells).
5. Look at the PDF file and make sure all your solutions are there, displayed correctly. The PDF is the only
thing your graders will see!
6. Submit your PDF on Gradescope.