L1 Prediction from ELL Writing Samples

Part 1: Exploration

Author: Ryan Harper


Overview (top)

Data Source:

http://lang-8.com/ [scraped with Beautiful Soup]

Lang-8

Summary:

In my previous profession, I taught English to a diverse range of students across ages, language backgrounds, and countries of origin. Over time I observed that students with different L1s (first languages) tended to display different patterns of communication, patterns that appeared connected either to schooling in their country of origin or to the linguistic structure of their first language. Different ELLs (English Language Learners) needed to focus on different aspects of English depending on their background. The purpose of this project is to use a large collection of blog posts from a language-practice website to explore whether the L1 has any significant impact on an English learner's writing style.

Part 1: Explore the data to find any noteworthy trends in linguistic structure:

  1. vocabulary (word frequency, collocations, and cognates)
  2. syntax (sentence structure)
  3. grammar (e.g. grammatical complexity of sentences)
  4. errors (types of errors)
  5. parts of speech (NLTK tag abbreviations: https://pythonprogramming.net/natural-language-toolkit-nltk-part-speech-tagging/; see the sketch after this list)
  6. word frequency (ANC: http://www.anc.org/data/anc-second-release/frequency-data/)
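
As a rough illustration of items 5 and 6, the snippet below (not part of the original pipeline) shows how NLTK tags parts of speech and how a token's ANC-style frequency rank might be looked up; the `anc_rank` dictionary here is a made-up stand-in for the real ANC data loaded later in the notebook.

    # Hedged sketch of POS tagging (item 5) and a frequency lookup (item 6).
    # `anc_rank` is hypothetical; the notebook later loads real counts from ANC-written-count.txt.
    import nltk

    nltk.download('punkt', quiet=True)
    nltk.download('averaged_perceptron_tagger', quiet=True)

    tokens = nltk.word_tokenize("Yesterday I go to the library for study English.")
    print(nltk.pos_tag(tokens))   # e.g. [('Yesterday', 'NN'), ('I', 'PRP'), ('go', 'VBP'), ...]

    # Hypothetical ranks (1 = most common); real ranks come from the ANC frequency list.
    anc_rank = {'the': 1, 'to': 2, 'i': 11, 'go': 120, 'library': 3500}
    print([anc_rank.get(t.lower()) for t in tokens if t.lower() in anc_rank])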

Part 2: Use linguistic trends to determine whether or not a learner's first language can be predicted.

Variables:

id: User ID
time: Time the blog post was scraped (posts are in the order the user posted them)
title: Title of the blog post
content: The blog post
language: User's self-reported first language

Experiment (top)

Hypothesis:

L1 (first language) experience and academic environment influence ELLs' (English Language Learners') writing style. An ELL's L1 can therefore be predicted from their English blog posts by identifying patterns unique to that L1.

Observations:

  • Chinese learners use more reflexive pronouns than Japanese learners
  • Japanese and Chinese learners appear to favor different prepositions
  • Japanese and Chinese learners have a different range of subjectivity scores (from Textblob)
  • K Nearest Neighbors does not appear to work for this NLP project
  • Naive Bayes and Random Forest outperformed other models
  • Logistic Regression occasionally produces strong predictions (but the ordering of its top-ranked features does not appear meaningful)

Method:

Using multiple models, this project explores how different model families handle the data (target and features) and what information each can provide. Ultimately, the goal is to determine which models are appropriate for a binary (discrete) target with features that are both qualitative (discrete) and quantitative (ranked/continuous).
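
As a rough sketch of that comparison (this is not the notebook's actual modeling code, which belongs to Part 2), several scikit-learn classifiers could be scored with cross-validation; `X` and `y` below are synthetic placeholders for the engineered feature matrix and the binary L1 target built later in this notebook.

    # Hedged sketch: compare the models mentioned above with 5-fold cross-validation.
    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.neighbors import KNeighborsClassifier

    # Placeholder data: count-style features and a 0/1 target (Japanese vs. Traditional Chinese).
    rng = np.random.RandomState(0)
    X = rng.randint(0, 5, size=(200, 30))
    y = rng.randint(0, 2, size=200)

    models = {
        'Naive Bayes': MultinomialNB(),
        'Random Forest': RandomForestClassifier(n_estimators=100),
        'Logistic Regression': LogisticRegression(max_iter=1000),
        'K Nearest Neighbors': KNeighborsClassifier(),
    }

    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=5)
        print('{}: {:.3f} mean accuracy'.format(name, scores.mean()))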

    1. Cleaning the Data (top)

    In [ ]:
    # from nltk.corpus import brown
    # nltk.download('brown')
    
    In [2]:
    # iPython/Jupyter Notebook
    import time
    from pprint import pprint
    import warnings
    from IPython.display import Image
    
    # Data processing
    import scipy
    import pandas as pd
    import plotly as plo
    import numpy as np
    import seaborn as sns
    from collections import Counter
    from functools import reduce
    import matplotlib.pyplot as plt
    import matplotlib.patches as mpatches
    
    # Statistics
    from scipy import stats
    from statsmodels import robust
    from scipy.stats import ttest_ind, mannwhitneyu, median_test, f_oneway,mood, shapiro
    
    # NLP
    import textblob
    from nltk.corpus import stopwords as sw
    from nltk.util import ngrams
    from nltk.corpus import brown
    import nltk
    import re
    from nltk.tokenize import RegexpTokenizer
    import difflib
    from string import punctuation
    
    # import altair as alt
    
    In [3]:
    # load and close files
    def get_text(link):
        with open(link) as f:
            output = f.read()
        return output
    
    In [4]:
    # Jupyter Settings and Imports
    %pylab
    %matplotlib inline 
    warnings.filterwarnings(action='once')
    
    Using matplotlib backend: MacOSX
    Populating the interactive namespace from numpy and matplotlib
    
    In [5]:
    # Import data
    blog = pd.read_csv('../data/language/blogdata-reduced.csv')
    blog.info()
    
    # POS Table for reference
    poscv = pd.read_csv('../data/pos.csv')
    poscv = poscv.iloc[0:17]
    poscv.columns = ['Set1','Set 2']
    
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 17141 entries, 0 to 17140
    Data columns (total 6 columns):
    Unnamed: 0    17141 non-null int64
    id            17141 non-null int64
    time          17141 non-null object
    title         17141 non-null object
    content       17141 non-null object
    language      17141 non-null object
    dtypes: int64(2), object(4)
    memory usage: 803.6+ KB
    
    In [6]:
    # Import data
    blog = pd.read_csv('../data/language/blogdata-reduced.csv')
    
    # Clean data: fold 'Mandarin' into 'Traditional Chinese', then map every other
    # L1 to NaN so that only Japanese and Traditional Chinese posts survive dropna()
    del blog['Unnamed: 0']
    blog.language = blog.language.mask(blog.language == 'Mandarin', 'Traditional Chinese').replace(['Persian', 'Arabic',
            'Bulgarian', 'Swedish', 'Slovenian', 'Slovak', 'Malay', 'Turkish','Romanian', 'Czech', 'Danish', 'Vietnamese',
            'Norwegian','Serbian','Other language','Lithuanian', 'Ukrainian', 'Finnish','Estonian','Bengali','Russian', 
            'Spanish','French', 'German', 'Cantonese','Mongolian', 'Tagalog', 'Polish', 'Dutch','Italian', 'Portuguese(Brazil)', 
            'Thai', 'Indonesian', 'Cantonese','Urdu', 'Hungarian','Korean','English'], np.nan)
    blog = blog.dropna().sample(frac=1)
    
    del blog['title']
    del blog['time']
    
    In [7]:
    blog.info(verbose=False, memory_usage=False,null_counts=True)
    
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 14262 entries, 7046 to 11936
    Columns: 3 entries, id to language
    dtypes: int64(1), object(2)
    In [8]:
    # Confirmation that there are no more null values
    blog.isnull().values.any()
    
    Out[8]:
    False
    In [9]:
    # Fraction of a post's characters that are ASCII letters
    def lettercheck(val):
        reLetters = re.compile('[^a-zA-Z]')
        onlyletters = reLetters.sub('', val)
        return len(onlyletters)/len(val)
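
    A quick hypothetical check of the ratio (not an original notebook cell):

    lettercheck('Hello!!')   # 5 letters out of 7 characters ~ 0.71, just above the 0.7 cutoff applied below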
    
    In [10]:
    blog['letters_per'] = blog.content.apply(lettercheck)
    print('Removing blogs with a letter percentage of 70% or less: {}'.format(blog.loc[blog['letters_per'] <= .7].content.count()))
    blog = blog.loc[blog['letters_per'] > .7]
    
    Removing blogs with a letter percentage of 70% or less: 1227
    
    In [11]:
    blog.info()
    
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 13035 entries, 7046 to 11936
    Data columns (total 4 columns):
    id             13035 non-null int64
    content        13035 non-null object
    language       13035 non-null object
    letters_per    13035 non-null float64
    dtypes: float64(1), int64(1), object(2)
    memory usage: 509.2+ KB
    

    2. Exploring the Data (top)

    In [12]:
    vals = list(blog.language.value_counts().values)
    languages = list(blog.language.value_counts().index)
    plt.figure(figsize=(6,4))
    plt.bar(languages,vals,edgecolor='black')
    plt.title('Blogs by L1 Count')
    plt.xticks(rotation='vertical')
    plt.show()
    
    In [13]:
    print("Posts by 'Native' English Speakers: {}".format(blog.id.loc[blog.language == 'English'].count()))
    
    Posts by 'Native' English Speakers: 0
    

    NLP: Spell Check, Tokenization, Collocations, Parts of Speech, and Syntax (top)

    Word Level Ranking

    In [14]:
    ANCI_WORDS = pd.read_csv('../data/language/ANC-written-count.txt', 
                             sep='\t', 
                             encoding='latin-1', 
                             names=['word','stem','pos','freq'],header=None)
    word_freq = list(zip(ANCI_WORDS['word'].values,ANCI_WORDS['freq'].values))
    
    full_words_dict = {}
    words_dict = {}
    
    # full_freq: raw ANC frequency for each word (first occurrence only)
    for w in word_freq:
        if w[0] not in full_words_dict:
            full_words_dict[w[0]] = w[1]
            
    # basic_freq: frequency band by ANC rank
    # (1 = top 500, 2 = ranks 500-5,000, 3 = 5,000-10,000, 4 = 10,000-20,000)
    i = 0
    for w in word_freq:
        i = i + 1
        if w[0] not in words_dict:
            if i < 500:
                words_dict[w[0]] = 1
            elif (i >= 500) & (i < 5000):
                words_dict[w[0]] = 2
            elif (i >= 5000) & (i < 10000):
                words_dict[w[0]] = 3
            elif (i >= 10000) & (i < 20000):
                words_dict[w[0]] = 4
    
    In [15]:
    # Average raw ANC frequency of the recognized words in a token list
    def full_freq_rating(l):
        score = 0
        c = 0
        for w in l:
            if w in full_words_dict:
                c = c + 1
                score = score + full_words_dict[w]
        if c == 0:
            c = 1
        return score / c

    # Average frequency band (1 = very common ... 4 = rarer) of the recognized words
    def freq_rating(l):
        score = 0
        c = 0
        for w in l:
            if w in words_dict:
                c = c + 1
                score = score + words_dict[w]
        if c == 0:
            c = 1
        return score / c
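
    A hypothetical spot-check of the two ratings (not an original cell); the exact values depend on the ANC counts loaded above:

    freq_rating(['the', 'library', 'is', 'open'])       # average band, closer to 1 for common vocabulary
    full_freq_rating(['the', 'library', 'is', 'open'])  # average raw ANC frequency of the same tokens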
    

    TextBlob

    In [16]:
    %%time
    blob = blog.content.apply(lambda val: textblob.TextBlob(val))
    
    CPU times: user 246 ms, sys: 16.1 ms, total: 262 ms
    Wall time: 264 ms
    
    In [17]:
    # POS-tag each sentence and join consecutive tag pairs into 'TAG1-TAG2' strings
    def posbigram(val):
        bigramlist,l = [],[]
        
        for s in val.sentences:
            ns = textblob.TextBlob(str(s)).tags
            l = [v[1] for v in ns]
            bigrm = list(nltk.bigrams(l))
            
            for bigram in bigrm:
                bigramlist.append('-'.join(bigram))
            
        return bigramlist
    
    # Same idea with tag triples ('TAG1-TAG2-TAG3')
    def postrigram(val):
        trigramlist, l = [],[]
        
        for s in val.sentences:
            ns = textblob.TextBlob(str(s)).tags
            l = [v[1] for v in ns]
            trigrm = list(nltk.trigrams(l))
            
            for trigram in trigrm:
                trigramlist.append('-'.join(trigram))
            
        return trigramlist
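
    As a hypothetical illustration of the encoding (not an original cell), a single TextBlob yields tag bigrams such as:

    posbigram(textblob.TextBlob("I am learning English."))
    # -> something like ['PRP-VBP', 'VBP-VBG', 'VBG-NNP']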
    
    In [18]:
    # Express a character count as a percentage of the text's length
    def per_check(string_value, total):
        length = len(string_value)
        if length == 0:
            return 0
        return float(total / length) * 100

    # Percentage of characters that are punctuation marks
    def punc_count(string_value):
        count = 0
        for c in string_value:
            if c in punctuation:
                count += 1
        return per_check(string_value, count)

    # Percentage of characters that are capital letters
    def caplet_count(string_value):
        count = 0
        for c in string_value:
            if c.isupper():
                count += 1
        return per_check(string_value, count)
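
    A hypothetical spot-check of the percentage semantics (not an original cell):

    punc_count("Hello, world!")    # 2 punctuation marks out of 13 characters ~ 15.4
    caplet_count("Hello, world!")  # 1 capital letter out of 13 characters ~ 7.7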
    

    General Text Analysis

    In [19]:
    %%time
    blog['wc'] = blob.apply(lambda val: len(val.words))
    blog['sc'] = blob.apply(lambda val: len(val.sentences))
    blog['tokens'] = blob.apply(lambda val: [w.lower() for w in val.words])
    blog['sent_pol'] = blob.apply(lambda val: val.sentiment[0])
    blog['sent_subj'] = blob.apply(lambda val: val.sentiment[1])
    blog['cap_let'] = blob.apply(caplet_count)
    blog['punc_count'] = blob.apply(punc_count)
    
    CPU times: user 43 s, sys: 694 ms, total: 43.7 s
    Wall time: 44.1 s
    
    In [20]:
    blog['freq_score'] = blob.apply(lambda val: freq_rating(val.words))
    blog['full_freq_score'] = blob.apply(lambda val: full_freq_rating(val.words))
    

    Data Cleaning Round 2

    In [21]:
    # Drop very short posts and posts whose average raw word frequency is unusually low (likely noisy or non-English text)
    blog = blog[blog['wc'] >= 4]
    blog = blog[blog['full_freq_score'] > 1500]
    

    Parts of Speech Tokens

    In [22]:
    %%time
    blog['pos'] = blob.apply(lambda val: [v[1] for v in val.tags])
    blog['pos2'] = blob.apply(posbigram)
    blog['pos3'] = blob.apply(postrigram)
    
    /anaconda3/lib/python3.6/site-packages/nltk/util.py:491: DeprecationWarning:
    
    generator 'ngrams' raised StopIteration
    
    
    CPU times: user 5min, sys: 7.53 s, total: 5min 8s
    Wall time: 5min 12s
    

    Backup File

    blog.to_csv('processed_blog_data.csv')

    Frequent Words Per Language

    In [23]:
    %%time
    # Top 1,000 most frequent tokens per L1; cuw/juw hold the words unique to the Chinese/Japanese lists
    js = Counter(reduce((lambda x, y: x + y), blog.content[blog.language == 'Japanese'].str.lower().apply(nltk.word_tokenize))).most_common(1000)
    cs = Counter(reduce((lambda x, y: x + y), blog.content[blog.language == 'Traditional Chinese'].str.lower().apply(nltk.word_tokenize))).most_common(1000)
    cl,jl = [l[0] for l in cs],[l[0] for l in js]
    cuw,juw = [item for item in cl if item not in jl],[item for item in jl if item not in cl]
    
    CPU times: user 1min 4s, sys: 17.1 s, total: 1min 21s
    Wall time: 1min 23s
    

    3. Feature Processing (top)

    In [24]:
    # Add one count column per distinct element found in df[col] (restricted to `keep`, if given)
    def create_dummy_binary_df(df,col,name,keep=[]):
        colset = set(df[col].sum())
        finalsetlist = []
        if len(keep) > 0:
            colset = [k for k in keep if k in colset]
        
        for c in colset:
            colname = name+'_'+str(c)
            df[colname] = df[col].apply(lambda val: val.count(c))
            
            if df[colname].sum() < 1:
                del df[colname]
            else:
                finalsetlist.append(colname)
            
        print('Created dummy counter for {} features'.format(name))
            
        return finalsetlist
    
    In [25]:
    # Same as above, but with the candidate element set passed in explicitly as `colset`
    def create_dummy_count_df(df,col,colset,name,keep=[]):
        finalsetlist = []
        if len(keep) > 0:
            colset = [k for k in keep if k in colset]
        
        for c in colset:
            colname = name+'_'+str(c)
            df[colname] = df[col].apply(lambda val: val.count(c))
            
            if df[colname].sum() < 1:
                del df[colname]
            else:
                finalsetlist.append(colname)
            
        print('Created dummy counter for {} features'.format(name))
            
        return finalsetlist
    
    In [26]:
    %%time
    colset = set(blog['tokens'].sum())
    
    CPU times: user 1min 30s, sys: 25.8 s, total: 1min 56s
    Wall time: 1min 59s
    
    In [27]:
    %%time
    prplist = list(ANCI_WORDS['word'][ANCI_WORDS.pos == 'PRP'])
    pronouns = create_dummy_count_df(blog,'tokens',colset,'prp',prplist)
    
    cclist = list(ANCI_WORDS['word'][ANCI_WORDS.pos == 'CC'])
    coordinators = create_dummy_count_df(blog,'tokens',colset,'cc',cclist)
    
    inlist = list(ANCI_WORDS['word'][ANCI_WORDS.pos == 'IN'])
    preposition = create_dummy_count_df(blog,'tokens',colset,'prep',inlist)
    
    adverblist = list(ANCI_WORDS['word'][ANCI_WORDS.pos == 'RB'])[0:50]
    adverb = create_dummy_count_df(blog,'tokens',colset,'adv',adverblist)
    
    punct = create_dummy_count_df(blog,'tokens',colset,'punct',list(punctuation))
    
    cuw = create_dummy_count_df(blog,'tokens',colset,'cuw',cuw)
    juw = create_dummy_count_df(blog,'tokens',colset,'juw',juw)
    
    Created dummy counter for prp features
    Created dummy counter for cc features
    Created dummy counter for prep features
    Created dummy counter for adv features
    Created dummy counter for punct features
    Created dummy counter for cuw features
    Created dummy counter for juw features
    CPU times: user 12.6 s, sys: 191 ms, total: 12.8 s
    Wall time: 12.7 s
    
    In [28]:
    %%time
    pos1set = set(blog['pos'].sum())
    pos2set = set(blog['pos2'].sum())
    pos3set = set(blog['pos3'].sum())
    
    CPU times: user 3min 33s, sys: 1min, total: 4min 33s
    Wall time: 4min 32s
    
    In [29]:
    %%time
    pos2 = create_dummy_count_df(blog,'pos2',pos2set,'pos2')
    pos1 = create_dummy_count_df(blog,'pos',pos1set,'pos1')
    
    Created dummy counter for pos2 features
    Created dummy counter for pos1 features
    CPU times: user 30.5 s, sys: 1.39 s, total: 31.9 s
    Wall time: 31.5 s
    
    In [30]:
    %%time
    pos3 = create_dummy_count_df(blog,'pos3',pos3set,'pos3')
    
    Created dummy counter for pos3 features
    CPU times: user 7min 54s, sys: 1min 46s, total: 9min 41s
    Wall time: 9min 10s
    
    In [31]:
    # Single-letter count features over each post's concatenated tokens
    letters1 = []
    for let in 'abcdefghijklmnopqrstuvwxyz':
        name = 'let1_'+let
        blog[name] = blog.tokens.apply(lambda val: ''.join(val).count(let))
        letters1.append(name)
    
    In [32]:
    %%time
    # Letter-bigram count features; bigrams occurring fewer than 10 times in total are dropped
    alphabet = list('abcdefghijklmnopqrstuvwxyz')
    letters2 = []
    for let in alphabet:
        for let2 in alphabet:
            letters2.append(let+let2)
    
    letters2name = []        
    for let in letters2:
        name = 'let2_'+let
        blog[name] = blog.tokens.apply(lambda val: ' '.join(val).count(let))
        if blog[name].sum() < 10:
            del blog[name]
        else:
            letters2name.append(name)        
    letters2 = letters2name
    
    CPU times: user 43.9 s, sys: 10.4 s, total: 54.2 s
    Wall time: 51.9 s
    
    In [33]:
    %%time
    # For backup
    blog.to_csv('blogfeatures.csv')
    
    CPU times: user 3min 39s, sys: 9.12 s, total: 3min 48s
    Wall time: 1h 14min 39s
    
    In [34]:
    # For second notebook
    %store blog
    
    Stored 'blog' (DataFrame)