Author: Ryan Harper
Data Source:
NLTK Gutenberg
Summary:
This project explores variation in the vocabulary and writing structure of three authors by employing Word2Vec, TF-IDF, and a Gaussian mixture clustering model to examine the sentences of their novels.
Variables:
author: author of the novel
sentence: sentence from the novel/author
Hypothesis:
Authors will use different vocabulary and collocations that will help models predict differences between the authors in an unsupervised environment.
Observations:
Proper nouns (names and places) are the strongest indicators of different authors (via their specific novels)
Some frequently used one- and two-word expressions (e.g. 'Oh' and 'was said') are also good indicators of different authors
Clustering models fit on truncated-SVD-reduced TF-IDF vectors do not appear to produce meaningful results. Future experiments might explore other ways to vectorize sentences for clustering; one possibility is sketched below.
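One such alternative, as a minimal sketch only: average the Word2Vec vectors of each sentence's in-vocabulary words to get a dense, fixed-length sentence embedding. This assumes the word_vec model and tokens column built later in this notebook; sentence_vector is a hypothetical helper, not part of the analysis below.
import numpy as np

def sentence_vector(tokens, model, size=200):
    # Hypothetical helper: average the gensim 3.x word vectors of the
    # in-vocabulary tokens; out-of-vocabulary tokens are skipped.
    vecs = [model.wv[t] for t in tokens if t in model.wv.vocab]
    return np.mean(vecs, axis=0) if vecs else np.zeros(size)

# e.g.: sentence_matrix = np.vstack([sentence_vector(t, word_vec) for t in stories.tokens])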
Method:
Imported and merged the three novels into a data set; added a fourth novel by the first author to control for differences between novels rather than authors
Split up the novels by sentences and randomly sorted the samples.
Implemented Word2Vec on the sentences and visualized word similarities
Used TF-IDF and truncated SVD to assess important components and then matched them to the original sentences.
Used a Gaussian mixture model (clustering) to examine the combined samples and to determine whether author clusters could be recovered from the sentences.
Used v-measure, a crosstab, and the Akaike information criterion (AIC) to assess the effectiveness of the clustering.
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy
import re
import sklearn
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import gutenberg
from IPython.display import display
import warnings
import seaborn as sns
from textblob import TextBlob
warnings.filterwarnings('ignore')
from colorama import Fore
green = Fore.GREEN
red = Fore.RED
black = Fore.RESET
print(gutenberg.fileids())
# Utility function for standard text cleaning.
def text_cleaner(text):
    text = str(text)
    # Visual inspection identifies a form of punctuation the tokenizer does not
    # handle: the double dash '--'. Better get rid of it now!
    text = re.sub(r'--', ' ', text)
    text = re.sub(r'[\[].*?[\]]', '', text)  # drop bracketed editorial notes
    text = re.sub(r'\n', ' ', text)
    text = re.sub(r'\r', ' ', text)
    return text
def mlist(story):
    # Strip chapter headings, then return the cleaned sentences of a novel.
    return [text_cleaner(i) for i in TextBlob(re.sub(r'Chapter \d+', '', story)).sentences]

def mtokens(story):
    # Same split, but return lowercased word tokens for each sentence.
    return [list(i.words.lower()) for i in TextBlob(re.sub(r'Chapter \d+', '', story)).sentences]
# Load and clean the data.
chestertonbrown = mlist(gutenberg.raw('chesterton-brown.txt'))
edgeworth = mlist(gutenberg.raw('edgeworth-parents.txt'))
alice = mlist(gutenberg.raw('carroll-alice.txt'))
sentences = chestertonbrown + alice + edgeworth
def nl(name, data):
    # Label helper: repeat the author's name once per sentence.
    return [name for i in range(len(data))]
name = nl('chesterton',chestertonbrown)+nl('edgeworth',edgeworth)+nl('carroll',alice)
# Add second book to control for book differences (instead of author differences)
chestertonball = mlist(gutenberg.raw('chesterton-ball.txt'))
sentences+=chestertonball
name+=nl('chesterton',chestertonball)
stories = pd.DataFrame([sentences,name]).T.sample(frac=1,random_state=42)
stories.columns = ['sentence','author']
from gensim.models import word2vec
from gensim.utils import tokenize
# Bracket assignment creates a real column (attribute-style assignment would not).
stories['tokens'] = stories.sentence.apply(tokenize).apply(list)
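As a quick, illustrative sanity check (the sample sentence is made up): gensim's tokenize yields alphabetic tokens, splitting on punctuation and hyphens.
print(list(tokenize("Down the Rabbit-Hole!")))  # ['Down', 'the', 'Rabbit', 'Hole']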
%%time
load_model = False
vec_size = 200
word_vec = word2vec.Word2Vec(
    stories.tokens,
    workers=4,      # Number of threads to run in parallel.
    min_count=5,    # Minimum word count threshold.
    window=6,       # Number of words around target word to consider.
    sg=0,           # Use CBOW because our corpus is small.
    sample=1e-3,    # Penalize frequent words.
    size=vec_size,  # Word vector length (`vector_size` in gensim >= 4.0).
    hs=1            # Use hierarchical softmax.
)
# List of words in model.
vocab = word_vec.wv.vocab.keys()
Cosine Similarity Function
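Cosine similarity measures the angle between two word vectors, ignoring their magnitudes:

$$\cos(\theta) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$$

Values near 1 indicate words that appear in similar contexts. The three versions below compute it with numpy, gensim, and scikit-learn.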
Version A. Cosine Similarity Function on Word2Vec Matrix (with numpy)
euclidean_norm = lambda m: np.sqrt(np.array([a*a for a in m]).sum())  # equivalent to np.linalg.norm

def similarity_vec(a, b):
    return np.dot(a, b) / (euclidean_norm(a) * euclidean_norm(b))
ladyship_miss = similarity_vec(word_vec['ladyship'],word_vec['Miss'])
ladyship_statue = similarity_vec(word_vec['ladyship'],word_vec['statue'])
statue_miss = similarity_vec(word_vec['statue'],word_vec['Miss'])
bore_hard = similarity_vec(word_vec['bore'],word_vec['hard'])
print(green+'ladyship - Miss: {}'.format(ladyship_miss)+black)
print(red+'ladyship - statue: {}'.format(ladyship_statue)+black)
print(red+'statue - Miss: {}'.format(statue_miss)+black)
print(green+'bore - hard: {}'.format(bore_hard)+black)
Version B. Gensim
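With positive and negative word lists, most_similar averages the positive vectors and subtracts the negative ones before ranking by cosine similarity, i.e. the classic analogy arithmetic (here roughly 'Miss' + 'ladyship' - 'men').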
w1,w2,w3 = 'Miss','ladyship','men'
print(word_vec.wv.most_similar(positive=[w1, w2], negative=[w3], topn=1))
w1 = 'statue'
print(word_vec.wv.most_similar(positive=w1,topn=3))
w1 = 'hard'
print(word_vec.wv.most_similar(positive=w1,topn=3))
Version C. SKlearn
from sklearn.metrics.pairwise import cosine_similarity
# Reshape to (1, n_features) so each word vector is treated as one sample.
cosine_similarity(word_vec['ladyship'].reshape(1,-1), word_vec['Miss'].reshape(1,-1))
Returns a 1x1 matrix containing the cosine similarity of the two word vectors
colors_dict = dict(zip(['statue','Miss','ladyship'],['red','green','green']))
from sklearn.manifold import TSNE
words = list(word_vec.wv.vocab)
X = word_vec[words]
graph_tsne = TSNE(n_components=2)
result = graph_tsne.fit_transform(X)
# create a scatter plot of the projection
plt.figure(figsize=(20,20))
plt.scatter(result[:, 0], result[:, 1],color='gray')
for i, word in enumerate(words):
    if word in colors_dict.keys():
        plt.annotate(word,
                     xy=(result[i, 0], result[i, 1]),
                     size=18,
                     backgroundcolor=colors_dict[word])
plt.show()
vectorizer = TfidfVectorizer(max_df=0.3,            # ignore terms in more than 30% of sentences
                             min_df=2,              # ignore terms in fewer than 2 sentences
                             stop_words='english',
                             lowercase=True,        # lowercase everything
                             use_idf=True,          # weight by inverse document frequency
                             norm=u'l1',            # longer and shorter sentences get treated equally
                             smooth_idf=True        # adds 1 to all document frequencies
                             )
X_train, y_train = stories.sentence,stories.author
#Applying the vectorizer
stories_paras_tfidf=vectorizer.fit_transform(X_train)
print("Number of features: %d" % stories_paras_tfidf.get_shape()[1])
# Reformat tfidf (Thinkful code)
X_train_tfidf_csr = stories_paras_tfidf.tocsr()
n = X_train_tfidf_csr.shape[0]
tfidf_bypara = [{} for _ in range(0,n)]
Examine Features
#List of features
terms = vectorizer.get_feature_names()
#for each sentence, list the feature words and their tf-idf scores
for i, j in zip(*X_train_tfidf_csr.nonzero()):
    tfidf_bypara[i][terms[j]] = X_train_tfidf_csr[i, j]
#Keep in mind that the log base 2 of 1 is 0
print('Original sentence:', stories.sentence.iloc[10])
print('Tf_idf vector:', tfidf_bypara[10])
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
variances = []
r = range(1, 7001, 500)
for i in r:
    svd = TruncatedSVD(i)
    lsa = make_pipeline(svd, Normalizer(copy=False))
    # Run SVD on the training data, then project the training data.
    X_train_lsa = lsa.fit_transform(X_train_tfidf_csr)
    variance_explained = svd.explained_variance_ratio_
    total_variance = variance_explained.sum()
    variances.append(total_variance)
    print('# of Features: {}'.format(i), end=' | ')
plt.scatter(r, variances)
plt.title('Truncated SVD: Explained Variance')
# Based on the scan above, 400 components balances retained variance against
# the dimensionality the downstream mixture model has to handle.
svd = TruncatedSVD(400)
lsa = make_pipeline(svd, Normalizer(copy=False))
# Run SVD on the training data, then project the training data.
X_train_lsa = lsa.fit_transform(X_train_tfidf_csr)
variance_explained=svd.explained_variance_ratio_
total_variance = variance_explained.sum()
#Looking at what sorts of sentences our solution considers similar
paras_by_component=pd.DataFrame(X_train_lsa,index=X_train.index)
paras_by_component['author'] = stories.author
paras_by_component['sentence'] = stories.sentence
pd.options.display.width = 100
pd.options.display.max_colwidth = 100
for i in range(5):
    component = i
    print('Component {}:'.format(component))
    df = pd.DataFrame(data=paras_by_component[[component,'author','sentence']]).sort_values(by=component, ascending=False)[0:40:5]
    df = df.reset_index(drop=True)
    df.columns = ['Similarity','Author','Sentence']
    display(df)
    print('')
r = range(0,20)
# Compute document similarity using LSA components
similarity = np.asarray(np.asmatrix(X_train_lsa) * np.asmatrix(X_train_lsa).T)
#Only taking the first 20 sentences
sim_matrix=pd.DataFrame(similarity,index=X_train).iloc[r[0]:r[-1]+1,r[0]:r[-1]+1]
#Making a plot
plt.figure(figsize=(10,8))
ax = sns.heatmap(sim_matrix,yticklabels=r)
plt.show()
#Generating a key for the plot.
print('Key:')
for i in r:
    j = i - r[0]
    print("{}-{}: {}".format(i, stories.author.iloc[i], sim_matrix.index[j][0:100]))
Because four novels (and three authors) were merged into a single data set, a mixture model is helpful for distinguishing between clusters that reflect authors and clusters that merely reflect individual novels.
%%time
from sklearn.mixture import GaussianMixture
gm_range = range(2,8,1)
gm_aic = []
gm_predict = []
gm_score = []
gm_x = X_train_lsa
gm_y = y_train
for i in gm_range:
    # Declare and fit the model.
    gm = GaussianMixture(n_components=i, n_init=10, init_params='random')
    gm.fit(gm_x)
    gm_aic.append(gm.aic(gm_x))
    # Predicted clusters.
    gm_predict.append(gm.predict(gm_x))
    gm_score.append(gm.score(gm_x))
    print('{} completed'.format(i), end=' | ')
plt.scatter(gm_range, gm_aic)
plt.title('Gaussian Mixture: AIC by number of components')
plt.savefig('aic.jpg')
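For reference, AIC rewards fit while penalizing complexity (lower is better):

$$\mathrm{AIC} = 2k - 2\ln(\hat{L})$$

where $k$ is the number of free parameters of the mixture and $\hat{L}$ is its maximized likelihood.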
from sklearn.metrics import v_measure_score
v_score = []
for score in gm_predict:
    v_score.append(v_measure_score(y_train, score))
plt.plot(v_score);
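The v-measure is the harmonic mean of homogeneity $h$ (each cluster contains members of a single author) and completeness $c$ (each author's sentences fall in a single cluster):

$$v = \frac{2hc}{h + c}$$

Scores near 0 mean the clusters carry little information about authorship.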
# Crosstab of true authors vs. the clusters of the 3-component model (gm_predict[1]).
table = pd.crosstab(y_train, gm_predict[1], margins=True)
table.index = list(y_train.unique()) + ['total']
table
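If the clusters aligned with authorship, each author's row would concentrate in a single column; per the Observations above, the clusters found here do not track authors in that way.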