Author: Ryan Harper
Data Source:
http://lang-8.com/ [scraped with Beautiful Soup]
Summary:
In my previous profession, I taught English to a diverse range of students of all ages, language backgrounds, and countries of origin. Over time, I observed that students with different L1s (first languages) tended to display different patterns of communication, patterns that appeared connected either to schooling in their country of origin or to the linguistic structure of their first language. Different ELLs (English Language Learners) needed to focus on different aspects of English depending on their background. The purpose of this project is to take a large corpus of blog posts from a language-practice website and explore whether the L1 has any significant impact on the blog writing style of the English learner.
Part 1: Explore the data to find any noteworthy trends in linguistic structure:
- vocabulary (word frequency, collocations, and cognates)
- syntax (sentence structure)
- grammar (e.g., grammatical complexity of sentences)
- errors (types of errors)
- parts of speech (NLTK tag abbreviations: https://pythonprogramming.net/natural-language-toolkit-nltk-part-speech-tagging/; see the short tagging example after Part 2 below)
- word frequency (ANC: http://www.anc.org/data/anc-second-release/frequency-data/)
Part 2: Use linguistic trends to determine whether or not a learner's first language can be predicted.
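For reference, the parts-of-speech abbreviations in Part 1 are Penn Treebank tags as produced by NLTK/TextBlob. A minimal, hedged illustration (requires the 'punkt' and 'averaged_perceptron_tagger' NLTK data packages; the example sentence is made up):
import textblob
print(textblob.TextBlob("She taught herself English by writing blog posts.").tags)
# Roughly: [('She', 'PRP'), ('taught', 'VBD'), ('herself', 'PRP'), ..., ('posts', 'NNS')]
# (exact tags depend on the tagger version)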
Variables:
id: User ID
time: Time the blog post was scraped (posts are ordered by the user's original posting time)
title: Title of the blog post
content: The blog post
language: User's self-reported first language
Hypothesis:
L1 (first language) experience and academic environment influence ELLs' (English Language Learners') writing style. The L1 of an ELL can be predicted by examining English blog posts and identifying patterns unique to that L1.
Observations:
- Chinese learners use more reflexive pronouns than Japanese learners
- Japanese and Chinese learners appear to favor different prepositions
- Japanese and Chinese learners have different ranges of subjectivity scores (from TextBlob)
- K-Nearest Neighbors does not appear to work well for this NLP project
- Naive Bayes and Random Forest outperformed the other models
- Logistic Regression occasionally makes strong predictions (but the ordering of its top-ranked features does not appear meaningful)
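As a quick, hedged check of the subjectivity observation, the per-post subjectivity scores built later in this notebook (sent_subj) can be compared across the two L1 groups with the Mann-Whitney U test imported in the code below; this is a sketch, not the formal analysis:
# Sketch only: compare TextBlob subjectivity between the two L1 groups.
# Assumes the feature-engineered frame (blog with a 'sent_subj' column) built later in this notebook.
jp_subj = blog.sent_subj[blog.language == 'Japanese']
zh_subj = blog.sent_subj[blog.language == 'Traditional Chinese']
print(jp_subj.median(), zh_subj.median())
print(mannwhitneyu(jp_subj, zh_subj))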
Method:
The aim of this part of the project is to explore how different models handle the data (target and features) and what information can be gained from each model. Ultimately, the goal is to determine which models are appropriate for a binary (discrete) target with features that are both qualitative (discrete counts) and quantitative (ranked/continuous).
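The model comparison itself happens in the second notebook, but a minimal sketch of the approach described above could look like the following (scikit-learn is assumed, and feature_cols is a placeholder for the count/score columns engineered below):
# Sketch only: cross-validated comparison of the candidate models.
# `feature_cols` is a placeholder for the engineered feature columns built below;
# note that MultinomialNB expects non-negative features (e.g., the count features).
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

X = blog[feature_cols]
y = (blog.language == 'Japanese').astype(int)  # binary target: Japanese vs Traditional Chinese
for name, model in [('KNN', KNeighborsClassifier()),
                    ('Naive Bayes', MultinomialNB()),
                    ('Random Forest', RandomForestClassifier(n_estimators=100)),
                    ('Logistic Regression', LogisticRegression(max_iter=1000))]:
    scores = cross_val_score(model, X, y, cv=5)
    print('{}: {:.3f}'.format(name, scores.mean()))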
# from nltk.corpus import brown
# nltk.download('brown')
# iPython/Jupyter Notebook
import time
from pprint import pprint
import warnings
from IPython.display import Image
# Data processing
import scipy
import pandas as pd
import plotly as plo
import numpy as np
import seaborn as sns
from collections import Counter
from functools import reduce
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
# Statistics
from scipy import stats
from statsmodels import robust
from scipy.stats import ttest_ind, mannwhitneyu, median_test, f_oneway, mood, shapiro
# NLP
import textblob
from nltk.corpus import stopwords as sw
from nltk.util import ngrams
from nltk.corpus import brown
import nltk
import re
from nltk.tokenize import RegexpTokenizer
import difflib
from string import punctuation
# import altair as alt
# load and close files
def get_text(link):
    with open(link) as f:
        output = f.read()
    return output
# Jupyter Settings and Imports
%pylab
%matplotlib inline
warnings.filterwarnings(action='once')
# Import data
blog = pd.read_csv('../data/language/blogdata-reduced.csv')
blog.info()
# POS Table for reference
poscv = pd.read_csv('../data/pos.csv')
poscv = poscv.iloc[0:17]
poscv.columns = ['Set 1', 'Set 2']
# Import data
blog = pd.read_csv('../data/language/blogdata-reduced.csv')
# Clean Data
del blog['Unnamed: 0']
# Fold 'Mandarin' into 'Traditional Chinese' and mark the other listed L1s as NaN (dropped below)
blog.language = blog.language.mask(blog.language == 'Mandarin', 'Traditional Chinese').replace(
    ['Persian', 'Arabic', 'Bulgarian', 'Swedish', 'Slovenian', 'Slovak', 'Malay', 'Turkish', 'Romanian',
     'Czech', 'Danish', 'Vietnamese', 'Norwegian', 'Serbian', 'Other language', 'Lithuanian', 'Ukrainian',
     'Finnish', 'Estonian', 'Bengali', 'Russian', 'Spanish', 'French', 'German', 'Cantonese', 'Mongolian',
     'Tagalog', 'Polish', 'Dutch', 'Italian', 'Portuguese(Brazil)', 'Thai', 'Indonesian', 'Urdu',
     'Hungarian', 'Korean', 'English'], np.nan)
blog = blog.dropna().sample(frac=1)
del blog['title']
del blog['time']
blog.info(verbose=False, memory_usage=False,null_counts=True)
# Confirmation that there are no more null values
blog.isnull().values.any()
def lettercheck(val):
    # Share of characters in the post that are ASCII letters
    reLetters = re.compile('[^a-zA-Z]')
    onlyletters = reLetters.sub('', val)
    return len(onlyletters) / len(val) if len(val) > 0 else 0
blog['letters_per'] = blog.content.apply(lettercheck)
print('Removing blogs with a letter share of 70% or less: {}'.format(blog.loc[blog['letters_per'] <= .7].content.count()))
blog = blog.loc[blog['letters_per'] > .7]
blog.info()
vals = list(blog.language.value_counts().values)
languages = list(blog.language.value_counts().index)
plt.figure(figsize=(6, 4))
plt.bar(languages,vals,edgecolor='black')
plt.title('Blogs by L1 Count')
plt.xticks(rotation='vertical')
plt.show()
print("Posts by 'Native' English Speakers: {}".format(blog.id.loc[blog.language == 'English'].count()))
Word Level Ranking
ANCI_WORDS = pd.read_csv('../data/language/ANC-written-count.txt',
sep='\t',
encoding='latin-1',
names=['word','stem','pos','freq'],header=None)
word_freq = list(zip(ANCI_WORDS['word'].values,ANCI_WORDS['freq'].values))
full_words_dict = {}
words_dict = {}
# full_freq: raw ANC frequency for each word (first occurrence wins)
for w in word_freq:
    if w[0] not in full_words_dict:
        full_words_dict[w[0]] = w[1]
# basic_freq: frequency band by rank in the ANC list (1 = most common, 4 = rare)
i = 0
for w in word_freq:
    i = i + 1
    if w[0] not in words_dict:
        if i < 500:
            words_dict[w[0]] = 1
        elif (i >= 500) & (i < 5000):
            words_dict[w[0]] = 2
        elif (i >= 5000) & (i < 10000):
            words_dict[w[0]] = 3
        elif (i >= 10000) & (i < 20000):
            words_dict[w[0]] = 4
def full_freq_rating(l):
    # Average raw ANC frequency of the recognized words in l
    score = 0
    c = 0
    for w in l:
        if w in full_words_dict:
            c = c + 1
            score = score + full_words_dict[w]
    if c == 0:
        c = 1
    return score / c
def freq_rating(l):
    # Average ANC frequency band of the recognized words in l (1 = most common, 4 = rare)
    score = 0
    c = 0
    for w in l:
        if w in words_dict:
            c = c + 1
            score = score + words_dict[w]
    if c == 0:
        c = 1
    return score / c
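A quick sanity check of the two ratings on a made-up sentence (assuming the ANC frequency file has been loaded as above):
sample = 'i think the weather is quite unpredictable today'.split()
print(freq_rating(sample))       # average frequency band (1 = most common, 4 = rare)
print(full_freq_rating(sample))  # average raw ANC frequency count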
TextBlob
%%time
blob = blog.content.apply(lambda val: textblob.TextBlob(val))
def posbigram(val):
    # Bigrams of POS tags, collected sentence by sentence
    bigramlist, l = [], []
    for s in val.sentences:
        ns = textblob.TextBlob(str(s)).tags
        l = [v[1] for v in ns]
        bigrm = list(nltk.bigrams(l))
        for bigram in bigrm:
            bigramlist.append('-'.join(bigram))
    return bigramlist
def postrigram(val):
    # Trigrams of POS tags, collected sentence by sentence
    trigramlist, l = [], []
    for s in val.sentences:
        ns = textblob.TextBlob(str(s)).tags
        l = [v[1] for v in ns]
        trigrm = list(nltk.trigrams(l))
        for trigram in trigrm:
            trigramlist.append('-'.join(trigram))
    return trigramlist
def per_check(string_value, total):
    # Return `total` as a percentage of the length of `string_value`
    length = len(string_value)
    if length != 0:
        return float(total / length) * 100
    return 0
def punc_count(string_value):
    # Percentage of characters that are punctuation
    count = 0
    for c in string_value:
        if c in punctuation:
            count += 1
    return per_check(string_value, count)
def caplet_count(string_value):
    # Percentage of characters that are uppercase letters
    count = 0
    for c in string_value:
        if c.isupper():
            count += 1
    return per_check(string_value, count)
General Text Analysis
%%time
blog['wc'] = blob.apply(lambda val: len(val.words))
blog['sc'] = blob.apply(lambda val: len(val.sentences))
blog['tokens'] = blob.apply(lambda val: [w.lower() for w in val.words])
blog['sent_pol'] = blob.apply(lambda val: val.sentiment[0])
blog['sent_subj'] = blob.apply(lambda val: val.sentiment[1])
blog['cap_let'] = blob.apply(caplet_count)
blog['punc_count'] = blob.apply(punc_count)
blog['freq_score'] = blob.apply(lambda val: freq_rating(val.words))
blog['full_freq_score'] = blob.apply(lambda val: full_freq_rating(val.words))
Data Cleaning Round 2
blog = blog[blog['wc'] >= 4]
blog = blog[blog['full_freq_score'] > 1500]
Parts of Speech Tokens
%%time
blog['pos'] = blob.apply(lambda val: [v[1] for v in val.tags])
blog['pos2'] = blob.apply(posbigram)
blog['pos3'] = blob.apply(postrigram)
Backup File
Frequent Words Per Language
%%time
js = Counter(reduce((lambda x, y: x + y), blog.content[blog.language == 'Japanese'].str.lower().apply(nltk.word_tokenize))).most_common(1000)
cs = Counter(reduce((lambda x, y: x + y), blog.content[blog.language == 'Traditional Chinese'].str.lower().apply(nltk.word_tokenize))).most_common(1000)
cl = [l[0] for l in cs]
jl = [l[0] for l in js]
# Words unique to each language's top-1000 list
cuw = [item for item in cl if item not in jl]
juw = [item for item in jl if item not in cl]
def create_dummy_binary_df(df, col, name, keep=[]):
    colset = set(df[col].sum())
    finalsetlist = []
    if len(keep) > 0:
        colset = [k for k in keep if k in colset]
    for c in colset:
        colname = name + '_' + str(c)
        df[colname] = df[col].apply(lambda val: val.count(c))
        if df[colname].sum() < 1:
            del df[colname]
        else:
            finalsetlist.append(colname)
    print('Created dummy counter for {} features'.format(name))
    return finalsetlist
def create_dummy_count_df(df, col, colset, name, keep=[]):
    finalsetlist = []
    if len(keep) > 0:
        colset = [k for k in keep if k in colset]
    for c in colset:
        colname = name + '_' + str(c)
        df[colname] = df[col].apply(lambda val: val.count(c))
        if df[colname].sum() < 1:
            del df[colname]
        else:
            finalsetlist.append(colname)
    print('Created dummy counter for {} features'.format(name))
    return finalsetlist
%%time
colset = set(blog['tokens'].sum())
%%time
prplist = list(ANCI_WORDS['word'][ANCI_WORDS.pos == 'PRP'])
pronouns = create_dummy_count_df(blog,'tokens',colset,'prp',prplist)
cclist = list(ANCI_WORDS['word'][ANCI_WORDS.pos == 'CC'])
coordinators = create_dummy_count_df(blog,'tokens',colset,'cc',cclist)
inlist = list(ANCI_WORDS['word'][ANCI_WORDS.pos == 'IN'])
preposition = create_dummy_count_df(blog,'tokens',colset,'prep',inlist)
adverblist = list(ANCI_WORDS['word'][ANCI_WORDS.pos == 'RB'])[0:50]
adverb = create_dummy_count_df(blog,'tokens',colset,'adv',adverblist)
punct = create_dummy_count_df(blog,'tokens',colset,'punct',list(punctuation))
cuw = create_dummy_count_df(blog,'tokens',colset,'cuw',cuw)
juw = create_dummy_count_df(blog,'tokens',colset,'juw',juw)
%%time
pos1set = set(blog['pos'].sum())
pos2set = set(blog['pos2'].sum())
pos3set = set(blog['pos3'].sum())
%%time
pos2 = create_dummy_count_df(blog,'pos2',pos2set,'pos2')
pos1 = create_dummy_count_df(blog,'pos',pos1set,'pos1')
%%time
pos3 = create_dummy_count_df(blog,'pos3',pos3set,'pos3')
letters1 = []
for let in 'abcdefghijklmnopqrstuvwxyz':
    name = 'let1_' + let
    blog[name] = blog.tokens.apply(lambda val: ''.join(val).count(let))
    letters1.append(name)
%%time
alphabet = list('abcdefghijklmnopqrstuvwxyz')
letters2 = []
for let in alphabet:
    for let2 in alphabet:
        letters2.append(let + let2)
letters2name = []
for let in letters2:
    name = 'let2_' + let
    blog[name] = blog.tokens.apply(lambda val: ' '.join(val).count(let))
    if blog[name].sum() < 10:
        del blog[name]
    else:
        letters2name.append(name)
letters2 = letters2name
%%time
# For backup
blog.to_csv('blogfeatures.csv')
# For second notebook
%store blog
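In the second (modeling) notebook, the stored frame can be restored with IPython's storemagic:
# In the second notebook:
%store -r blog
blog.info()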