Author: Ryan Harper
Data Set: The Oregon Department of Transportation crash analysis data - Available from ODOT
Background: Portland, Oregon is a big biking city in the US. PDOT reports that about 7% of the PDX workforce commutes by bike, and there is a correspondingly large number of reported cyclist accidents/collisions. In 2012, police attempted to reduce the number of accidents by issuing traffic citations to cyclists. In recent years, the Portland Department of Transportation has tried to make more roads bike friendly with the 20s Bikeway Project, but some local news reports still suggest that the rate of bike accidents is increasing.
Project: The goal of this project is to create a forecasting method that establishes a baseline for judging whether the number of bike accidents in subsequent years runs higher or lower than the estimate. Ideally, this could be used to monitor what effect projects like the 20s Bikeway Project have on the frequency of bike accidents in Portland.
Hypothesis: I predicted that there would be an upward trend in bike accidents for Portland, Oregon (Multnomah County). I also predicted trends for the time of day, day of the week, and season of the year.
import warnings;warnings.filterwarnings('ignore')
import missingno as msno;from pprint import pprint;import pandas as pd;import matplotlib.pyplot as plt;import sklearn
import seaborn as sns; import numpy as np;from numpy.polynomial.polynomial import polyfit;from scipy.stats import linregress
from statsmodels.tsa.stattools import acf, pacf;from statsmodels.graphics.tsaplots import plot_acf,plot_pacf
from statsmodels.tsa.arima_model import ARIMA
from statsmodels.tsa.ar_model import AR  # AR lives in ar_model, not arima_model
from statsmodels.tsa.statespace.sarimax import SARIMAX
from sklearn.metrics import mean_squared_error;import matplotlib as mpl
from sklearn.linear_model import SGDRegressor;from fbprophet import Prophet;import statsmodels.api as sm
pd.set_option('display.show_dimensions',False)
accidentsdf = pd.read_csv('../data/odot/Multnomah.csv',low_memory=False);accidentsdf.head(8)
There are a lot of empty values, and the same Crash ID appears in more than one row.
Let's look at just one Crash ID:
accidentsdf[accidentsdf['Crash ID'] == 1456243]
Missing Values Matrix
print('Original Data Set - Includes Missing Values');msno.matrix(accidentsdf,fontsize=12,figsize=(20,5),color=(.5,.6,.9));
In the graph above, white represents missing data, while purple represents real values. Crash IDs (in the far-left column) span multiple rows, one for each vehicle/participant/observer, but only one row per Crash ID has data for a majority of the features. This means we can't simply analyze the features as-is; we have to reduce the data either to the specific participants (2-3 per accident) or to the general accident report (1 per accident).
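Before reducing anything, a quick sanity check (my own sketch, not part of the original pipeline) shows how many rows each Crash ID spans:
# Sketch: distribution of rows per Crash ID (expect several rows per accident, per the note above)
rows_per_crash = accidentsdf.groupby('Crash ID').size()
print(rows_per_crash.describe())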
To focus on bicycle accidents, I keep a handful of features (some unused in this project) and select only rows with a 'Pedalcyclist' count of 1 or more:
usedcol = ['Crash ID','Crash Hour','Crash Day','Crash Month','Crash Year','Total Pedalcyclist Count',
           'Total Pedalcyclist Fatality Count','Total Pedalcyclist Non-Fatal Injury Count','Crash Type',
           'Collision Type','Median Type','Crash Severity','Intersection Type',
           'Nearest Intersecting Street Number','Road Character']
bikeaccident = accidentsdf[usedcol][accidentsdf['Total Pedalcyclist Count']>0].dropna().reset_index()
print('Training Set: {}\n'.format(len(bikeaccident)))
print('Cleaned Data Set - No Missing Values');msno.matrix(bikeaccident,labels=False,figsize=(8,3),color=(.5,.6,.9));
From this output, we can see there are a total of 1473 reported ODOT bike-related accidents in Multnomah County. The columns are now limited to fewer than 20 features, and the visual looks much better: there is no missing data (which would appear as white).
Ok, now I want to explore the features. For example, is there a noticeable difference in the time of day at which most accidents occur? My guess would be that most accidents take place when it is dark, roughly between 7pm and 4am.
bikeaccident.index = pd.to_datetime(
bikeaccident['Crash Year']*10000+bikeaccident['Crash Month']*100+bikeaccident['Crash Day'],format='%Y%m%d')
# creating basic time series
ba = bikeaccident['Total Pedalcyclist Count'].groupby(pd.Grouper(freq='W')).count()
sns.distplot(ba,color='mediumblue');print('Total # of Bike Accidents in a Single Week')
print('Bike Accidents (by hour) w/ Outlier');sns.distplot(bikeaccident['Crash Hour'],color = 'mediumblue');plt.show()
There is a value of 99 that occurs in the data 7 times, which makes the distribution harder to read, so let's remove it (temporarily). My guess is that 99 represents an unknown crash hour, since 0 represents midnight.
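A quick count (my own sketch) verifies the sentinel before dropping it:
# Sketch: count occurrences of the suspected sentinel value
print((bikeaccident['Crash Hour'] == 99).sum())  # expect 7, per the note above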
sns.distplot(bikeaccident['Crash Hour'][bikeaccident['Crash Hour']< 99],color='mediumblue');plt.xlim(0,24);print('Bike Accidents (by hour)');plt.show()
My prediction was wrong. It looks like most accidents happen around evening rush hour (4pm-6pm), with another peak at lunchtime and a third peak during morning rush hour (7am-9am). The data do not appear to follow a simple parametric distribution; the hourly histogram is clearly multimodal.
So what about bike deaths? Bike Injuries?
cols = ['Total Pedalcyclist Fatality Count','Total Pedalcyclist Non-Fatal Injury Count']
plt.figure(figsize=(20,10))
plt.subplot(2,3,1)
plt.hist(bikeaccident['Crash Hour'][(bikeaccident[cols[0]]>0)],color='mediumblue',edgecolor='white',alpha=.6)
plt.xlim(0,24);plt.xlabel('Time of Day (24h)');plt.ylabel('Count')
plt.title('Fatal Injuries 2012-2015 ({})'.format(bikeaccident['index'][bikeaccident[cols[0]]>0].count()), size=15)
plt.subplot(2,3,2)
plt.hist(bikeaccident['Crash Hour'][(bikeaccident[cols[1]]>0)],bins=100,color='mediumblue',edgecolor='white',alpha=.6)
plt.xlim(0,24);plt.xlabel('Time of Day (24h)');plt.ylabel('Count')
plt.title('Non-Fatal Injuries 2012-2015 ({})'.format(bikeaccident['index'][bikeaccident[cols[1]]>0].count()),size=15)
plt.show()
3 bike deaths happened at ~9pm, 2 deaths at ~11am, and 1 death at ~noon.
NOTE: ODOT states that confidential data involving fatal accidents is removed from its releases, so the data set likely does not reflect the true total number of bike deaths from 2012-2015.
ba = bikeaccident['Total Pedalcyclist Count'].groupby(pd.Grouper(freq='W')).count()
bi = bikeaccident['Total Pedalcyclist Count'][bikeaccident['Total Pedalcyclist Non-Fatal Injury Count']>0].groupby(pd.Grouper(freq='W')).count()
bi.sum(),ba.sum(),bi.sum()/ba.sum()
Bike injuries account for about 97% of the bike accident data (1431 out of 1473).
plt.figure(figsize=(18,4))
plt.plot(ba.values,color='pink')
plt.plot(bi.values,color='darkblue',linestyle='-',alpha=.9,marker='.')
plt.xlabel('Week # (2012-2015)')
plt.legend(['Total','Injured']); print('Bike Accidents: Total vs Injured');plt.show()
The total number of bike accidents tracks closely with the number of accidents involving reported injuries. It's probably safe to assume that accidents without injuries are reported less often, so we can infer that this data is a sample from a larger, unknown population of accidents.
Functions:
norm = lambda val: (val-val.min())/(val.max()-val.min())
variance = lambda val: np.sum(((val-val.mean())**2)/(len(val)-1))
dev = lambda val: variance(val)**.5
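As a sanity check (a minimal sketch of my own), these lambdas should agree with NumPy's built-in estimators (using ddof=1 for the sample variance):
# Sketch: hand-rolled stats vs NumPy (sample variance/std use ddof=1)
vals = np.array([1., 2., 4., 7.])
assert np.isclose(variance(vals), np.var(vals, ddof=1))
assert np.isclose(dev(vals), np.std(vals, ddof=1))
assert norm(vals).min() == 0 and norm(vals).max() == 1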
Population: US Census Bureau county population estimates (co-est2016-alldata)
pop = pd.read_csv('../data/odot/co-est2016-alldata.csv',encoding = "ISO-8859-1")
keep = ['STNAME','CTYNAME','POPESTIMATE2012','POPESTIMATE2013','POPESTIMATE2014','POPESTIMATE2015','POPESTIMATE2016']
pop_or = pop[keep][(pop['STNAME'] == 'Oregon')]
pop_or_mcnty = pop_or[pop_or['CTYNAME'] == 'Multnomah County'].drop(['STNAME','CTYNAME'],axis=1)
pop_year = list(pop_or_mcnty.T.values)
try:
pop_ba = pd.DataFrame(ba)
s,c=[],0
for i in range(52*4):
if i%52==0:
s.append(pop_year[c][0])
c+=1
else:
s.append(np.nan)
s.append(pop_year[-1][0])
pop_ba['pop'] = s
pop_ba['pop'] = pop_ba['pop'].interpolate('linear')
plt.figure(figsize=(18,4))
plt.plot(norm(pop_ba['pop']),color='mediumblue')
plt.plot(norm(pop_ba['Total Pedalcyclist Count']),color='gray')
print('Population and bike Accidents (Min-Max)')
plt.legend(['Multnomah Population','Bike Accident Count (Weekly)'])
plt.show()
except Exception as exc:
    # surface the error instead of silently swallowing it
    print(exc)
Now I will take the rolling mean of the data set to see if any patterns emerge:
rm = 20
plt.figure(figsize=(18,4))
plt.plot(ba.index,ba.values,color='gray')
plt.plot(ba.index,ba.rolling(window=rm,center=False,axis=0).mean().values,color='mediumblue')
plt.legend(['Actual','Rolling Mean [{}]'.format(rm)]);print('Rolling Mean Plot - Full Time Series');plt.show()
The rolling mean traces a sinusoidal pattern in bike accidents from 2012-2015. This suggests that there is a seasonal trend in the data; I suspect the cycle is annual.
cmap=plt.cm.Blues_r;years = [2012,2013,2014,2015]
yr_plot=[ba.iloc[ba.index.searchsorted(pd.datetime(y, 1, 1)):ba.index.searchsorted(pd.datetime(y, 12, 31))].values for y in years]
plt.figure(figsize=(18,4))
print('Bike Accidents by Year (2012-2015)')
for i,year in enumerate(yr_plot):
plt.plot(year,color=cmap(i/(len(years)+2)))
plt.legend(years);plt.xlabel('Week #',fontsize=12);plt.ylabel('Accident Count',fontsize=12);plt.show()
Looking at this graph, there is definitely a seasonal (annual) trend in the data. Summer appears to be the peak time for bike accidents. There could be a number of potential reasons for this: good weather, vacation, and/or new bicyclists. This seasonality is important because it means we can be more accurate in our predictions when we account for seasonality.
weeks_in_a_year = 52
decomposition = sm.tsa.seasonal_decompose(ba, freq=weeks_in_a_year)
trend,seasonal,residuals = decomposition.trend,decomposition.seasonal,decomposition.resid
decomposition.plot()
print('Seasonal Decomposition')
plt.show()
Seasonal decomposition shows that there is both a seasonal pattern and a downward trend in the data. I expected an upward trend in bike accidents because of Portland's rising population, so this surprises me.
from statsmodels.tsa.stattools import adfuller
def test_stationarity(timeseries,graph_set=False):
if graph_set:
        # Determining rolling statistics
rolmean=timeseries.rolling(window=12).mean()
rolstd = timeseries.rolling(window=12).std()
#Plot rolling statistics:
fig = plt.figure(figsize=(12, 8))
orig = plt.plot(timeseries, color='blue',label='Original')
mean = plt.plot(rolmean, color='red', label='Rolling Mean')
std = plt.plot(rolstd, color='black', label = 'Rolling Std')
plt.legend(loc='best')
plt.title('Rolling Mean & Standard Deviation')
plt.show()
#Perform Dickey-Fuller test:
print('Results of Dickey-Fuller Test:')
dftest = adfuller(timeseries, autolag='AIC')
dfoutput = pd.Series(dftest[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])
for key,value in dftest[4].items():
dfoutput['Critical Value (%s)'%key] = value
print(dfoutput)
test_stationarity(ba)
The Dickey-Fuller test returned a p-value of less than .05, which rejects the null hypothesis that the series has a unit root (i.e., that it is non-stationary). In other words, the test suggests the series is already fairly stationary.
seasonalshift = ba-ba.shift(52)
f1shift = ba.diff()
plt.figure(figsize=(20,7))
plt.plot(ba.iloc[52:].values,marker='.',label='Actual',alpha=1)
plt.plot(seasonalshift.iloc[52:].values,label='Seasonal Shift',color='black',marker='x',alpha=.2)
plt.plot(f1shift.iloc[52:].values,label='First Difference Shift',color='mediumblue',marker='o',alpha=.2)
plt.legend()
plt.show()
Both the seasonal and first-difference shifts show stationarity, while the actual data set has a very slight downward trend (with the minimum number of accidents never falling below 0). From these observations, I conclude that either first-order or seasonal differencing is enough to remove what little trend remains.
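To back this up, we can re-run test_stationarity on the shifted series themselves (a quick sketch; the shifts introduce NaNs that must be dropped before calling adfuller):
# Sketch: ADF test on the differenced series (NaNs from shifting dropped first)
test_stationarity(f1shift.dropna())
test_stationarity(seasonalshift.dropna())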
The goal of this section is to explore a few different regression methods and to evaluate which methods seem to perform better on a sample test set.
Functions
RMSE = lambda y_predict, y_actual: np.sqrt(np.sum(np.square(np.subtract(y_predict,y_actual)))/len(y_actual))
def visual_validate(y_pred,y,name,printRMSE=True):
plt.figure(figsize=(6,2))
plt.title('MODEL: {} | RMSE: {:0.2f}'.format(name,RMSE(y_pred,y)))
plt.plot(y,color='gray')
plt.plot(y_pred,color='mediumblue',linestyle='--')
The RMSE function evaluates and compares the results of each model, while the visual_validate function plots them.
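As a one-off sanity check (my own sketch), the hand-rolled RMSE should match the square root of sklearn's mean_squared_error, which is already imported above:
# Sketch: RMSE lambda vs sklearn's mean_squared_error
y_true, y_hat = np.array([3., 5., 2.]), np.array([2., 5., 4.])
assert np.isclose(RMSE(y_hat, y_true), np.sqrt(mean_squared_error(y_true, y_hat)))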
Test/Train
lag_split =10
# split into train and test sets
X,y = np.array(range(len(ba.values[1:]))),ba.values[1:]
train, test = [X[:-lag_split],y[:-lag_split]],[X[-lag_split:],y[-lag_split:]]
train_X, train_y = train[0],train[1]
test_X, test_y = test[0],test[1]
N = len(y)
e = np.random.random_sample(N)-.5
The lag_split variable determines how big my practice test set will be. I add the noise values (e) to prevent overfitting in my regression models.
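A quick shape check (sketch) confirms the split leaves the last 10 weeks held out:
# Sketch: verify the train/test split sizes
print(len(train_y), len(test_y))  # expect N-10 and 10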
Now I want to build my baseline estimator (a straight line). I will create my own linear regression model and compare that to other linear regression estimators.
Formulas
Line: $ y = a + bx $
Standard Deviation: $ \sigma=\sqrt{\frac{\sum{(x-\bar{x})^2}}{n}} $
Pearson's Correlation Coefficient: $ r= \frac{\sum{(x_i - \bar{x})(y_i - \bar{y})}}{\sqrt{\sum{(x-\bar{x})^2}\sum{(y-\bar{y})^2}}} $
Regression Slope: $ b=r\frac{\sigma_y}{\sigma_x} $
Root Mean Squared Error: $ \sqrt{\frac{\sum_{i=1}^n {(\hat{y}_{i}- y_{i})}^2}{N}} $
Functions
mean = lambda val: val.sum()/len(val)
dif = lambda val: val-mean(val)
r_value = lambda valx,valy:((dif(valx)*dif(valy)).sum())/((((dif(valx)**2).sum())*((dif(valy)**2).sum()))**.5)
std = lambda val:(mean((val-mean(val))**2))**.5
slope = lambda valx,valy: r_value(valx,valy)*std(valy)/std(valx)
intercept = lambda valx,valy: mean(valy) - (slope(valx,valy)*mean(valx))
Calculations
x_mean,y_mean = mean(X),mean(y)
x_std,y_std = std(X),std(y)
r = r_value(X,y)
b = slope(X,y)
a = intercept(X,y)
print('Slope: {}\nIntercept: {}\nBaseline RMSE: {}'.format(b,a,RMSE(a+b*X,y)))
The RMSE value feels pretty high. Considering the data appear to have a repeating pattern, I feel confident that I can get a lower RMSE value.
print('Statsmodels: Slope={} Intercept={}'.format(linregress(X,y).slope,linregress(X,y).intercept))
print('Numpy: Slope={} Intercept={}'.format(polyfit(X, y, 1)[1],polyfit(X, y, 1)[0]))
My calculations match Statsmodels' and NumPy's internal math! This is a good sign.
Graph
plt.figure(figsize=(18,4));plt.plot(X,y,color='gray');plt.plot(X,a+b*X,color = 'darkblue',linestyle='--')
plt.plot(ba.rolling(window=25,center=False,axis=0).mean().values)
plt.legend(['Actual','Regression','Rolling Mean [25]']);print('Linear Regression Plot - Full Time Series');
plt.show()
The linear regression line follows a downward trend in bike accidents from 2012-2015, confirming the negative trend observed earlier.
Baseline
visual_validate(a+b*X[test_X],y[test_X],'Linear Regression [Baseline]')
e,y_avg= np.random.random_sample(N)-.5,np.zeros(N)
lags = 3
for i in range(lags, N):
if i < len(train_y):
y_avg[i] = (train_y[i]+train_y[i-1]+train_y[i-2])/3
else:
y_avg[i] = (y_avg[i-1]+y_avg[i-2]+y_avg[i-3])/3
prediction = y_avg[test_X]
visual_validate(prediction,test_y,'RM(3)')
Autocorrelation and Partial Autocorrelation Functions
Formulas
ACF (1 variable): $ r_{k} = \frac{\sum_{i=1}^{n-k}(y_{i} - \bar{y})(y_{i+k} - \bar{y})}{\sum_{i=1}^{n}(y_{i} - \bar{y})^{2}} $
PACF: $ \frac{cov(y,x_3|x_1,x_2)}{\sqrt{var(y|x_1,x_2)\cdot var(x_3|x_1,x_2)}} $
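To ground the ACF formula above, here is a small sketch of my own computing the lag-1 autocorrelation by hand and comparing it with statsmodels' acf (imported at the top):
# Sketch: lag-1 autocorrelation by hand vs statsmodels' acf
y_c = ba.values - ba.values.mean()
r1_manual = (y_c[:-1] * y_c[1:]).sum() / (y_c ** 2).sum()
r1_sm = acf(ba.values, nlags=1, fft=False)[1]
print(r1_manual, r1_sm)  # should agree to several decimal places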
Gradient Descent
values = pd.DataFrame(ba.values)
values[0] = pd.to_numeric(values[0],errors='coerce')
lag_values = pd.concat([values.shift(-3),values.shift(-2),values.shift(-1),values], axis=1).dropna()
lag_values.columns = ['t', 1,2,3]
lag_values.corr()
x_merged = lag_values.drop(['t',3],axis=1).values
y_actual = lag_values['t'].values
theta = np.random.randn(2,1)
sgd = SGDRegressor(max_iter=1000)
sgd.fit(x_merged,y_actual)
sgd_intercept = sgd.intercept_
sgd_coefs = sgd.coef_
This Gradient Descent regressor calculates the coefficients for a regression with 2 lags. I will test these parameters in an AR(2) model to see if it has predictive power.
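As a rough check (my sketch, not part of the original pipeline), the SGD solution should land near the closed-form least-squares fit for the same design matrix:
# Sketch: compare SGD coefficients to the closed-form least-squares solution
X_design = np.column_stack([np.ones(len(x_merged)), x_merged])
beta, *_ = np.linalg.lstsq(X_design, y_actual, rcond=None)
print('OLS:', beta)                       # [intercept, coef, coef]
print('SGD:', sgd_intercept, sgd_coefs)   # should be in the same ballpark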
Graph
fig = plt.figure(figsize=(15,9))
for num in range(1,16):
ax = fig.add_subplot(3,5,num)
pd.plotting.lag_plot(ba,lag=num,c='mediumblue')
ax.set_title('Lag {}'.format(num))
plt.tight_layout()
plt.show()
Lag 1 appears to be borderline linear (albeit with a small correlation coefficient). The others appear more elliptical in nature, which suggests a sinusoidal pattern. There are no other discernible differences between the lags.
fig, ax = plt.subplots(figsize=(15, 4));plot_pacf(y,color='mediumblue',lags=50,ax=ax);plt.show()
fig, ax = plt.subplots(figsize=(15, 4));plot_acf(y,color='mediumblue',lags=50,ax=ax);plt.show()
It appears that only the first 2-3 lags of the PACF and lags 1-6 of the ACF hold predictive power. Most other lags fall within the confidence band (the shaded area), where they are statistically indistinguishable from zero.
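We can make that reading precise with a small sketch of my own: the approximate 95% band is ±1.96/√n, so we list the PACF lags that poke outside it (pacf is already imported):
# Sketch: PACF lags exceeding the approximate 95% confidence band
pacf_vals = pacf(y, nlags=50)
band = 1.96 / np.sqrt(len(y))
print(np.where(np.abs(pacf_vals[1:]) > band)[0] + 1)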
# Various algorithms for finding weight coefficients
solvers = ['lbfgs', 'bfgs', 'newton', 'nm', 'cg', 'ncg', 'powell']
df_resid,plot_aic,plot_bic,plot_RMSE = [],[],[],[]
for ar_val in range(1,15):
order = (ar_val,0,0)
model = ARIMA(list(train_y), order=order)
model_fit = model.fit()
forecast = model_fit.forecast(steps=len(test_y))
plot_aic.append(model_fit.aic)
plot_bic.append(model_fit.bic)
df_resid.append(model_fit.df_resid)
plot_RMSE.append(RMSE(forecast[0],test_y))
df_resid,plot_aic,plot_bic,plot_RMSE = norm(np.array(df_resid)),norm(np.array(plot_aic)),norm(np.array(plot_bic)),norm(np.array(plot_RMSE))
print('AR Lag Plot')
plt.figure(figsize=(10,3))
plt.plot(plot_aic,color='mediumblue')
plt.plot(plot_bic,color='purple')
# plt.plot(plot_RMSE,color='darkred')
plt.legend(['AIC','BIC'])
plt.show()
df_resid,plot_aic,plot_bic,plot_RMSE = [],[],[],[]
for ma_val in range(1,10):
try:
order = (0,0,ma_val)
model = ARIMA(list(train_y), order=order)
model_fit = model.fit()
forecast = model_fit.forecast(steps=len(test_y))
plot_aic.append(model_fit.aic)
plot_bic.append(model_fit.bic)
df_resid.append(model_fit.df_resid)
plot_RMSE.append(RMSE(forecast[0],test_y))
except Exception as e:
print(e)
df_resid,plot_aic,plot_bic,plot_RMSE = norm(np.array(df_resid)),norm(np.array(plot_aic)),norm(np.array(plot_bic)),norm(np.array(plot_RMSE))
print('MA Lag Plot')
plt.figure(figsize=(10,3))
plt.plot(plot_aic,color='mediumblue')
plt.plot(plot_bic,color='purple')
# plt.plot(plot_RMSE,color='darkred')
plt.legend(['AIC','BIC'])
plt.show()
Based on the AIC and BIC values, AR lags appear to be most effective at 1 and 4 (with another dip at 14, not shown here). MA lags appear to be most effective at 1, 4, and 6.
def arima_it(o,name):
history = [y_val for y_val in train_y]
for t in range(len(test_y)):
model = ARIMA(history, order=o)
model_fit = model.fit()
forecast = model_fit.forecast(steps=1)[0]
history.append(forecast[0])
result = np.array(history)[test_X]
visual_validate(result,test_y,name,False)
$ \begin{equation*} y_{t}=\phi_{0}+\phi_{1}y_{t-1}+\phi_{2}y_{t-2}+...+\epsilon_{t} \end{equation*} $
Basic AR Model (1):
visual_validate(y[np.array(test_X)-1],y[test_X],'AR(1) [No Coefficients]')
AR(2) - SGD vs LBFGS
lag=2
order=(lag,0,0)
arima_it(order,'AR(2)-lbfgs')
# Using SGD for weights with my attempt at building an AR model
y_reg = np.zeros(N)
lags = 2
coefs = sgd_coefs
intercept = sgd_intercept
for i in range(lags, N):
if i < len(train_y):
y_reg[i] = intercept+coefs[0]*x_merged[i][0]+coefs[1]*x_merged[i][1] + e[i]
else:
y_reg[i] = intercept+coefs[0]*y_reg[i-1]+coefs[1]*y_reg[i-2] + e[i]
prediction = y_reg[test_X]
visual_validate(prediction,test_y,'AR(2)-SGD')
My AR(2) gradient-descent prediction looks pretty similar to Statsmodels' AR(2) prediction. For repeatability, however, I will continue using Statsmodels' AR model. In the future, I would like to explore the different solvers used for calculating coefficients.
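A minimal sketch of that future exploration, assuming the legacy statsmodels ARIMA accepts a solver argument in fit (the solver names come from the list defined earlier in this notebook):
# Sketch: fit AR(2) with a few different solvers and compare the coefficients
for s in ['lbfgs', 'bfgs', 'nm', 'powell']:
    try:
        fit = ARIMA(list(train_y), order=(2, 0, 0)).fit(solver=s, disp=0)
        print(s, fit.params)
    except Exception as exc:
        print(s, 'failed:', exc)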
AR(n)
for i in [1,4,6,10]:
lag=i
order=(lag,0,0)
arima_it(order,'AR({})'.format(i))
AR Model (14)
model = AR(train_y)
model_fit = model.fit()
predictions = model_fit.predict(start=len(train_y), end=len(train_y)+len(test_y)-1, dynamic=False)
visual_validate(predictions,test_y,'AR({}) - Auto Pick'.format(model_fit.k_ar))
The AR model defaulted to 14 lags. 14 weeks / 4.3 weeks-per-month is approximately 3 months, which could indicate that a three-month season has an impact on the overall time series; this is worth exploring in a future project.
$ \begin{equation*} y_{t}=\theta_{0}+\theta_{1}e_{t-1}+\theta_{2}e_{t-2}+...+\epsilon_{t} \end{equation*} $
for i in [1,4,6,10]:
lag=i
order=(0,0,lag)
arima_it(order,'MA({})-Error Regression'.format(i))
The prediction seems to improve as the lag increases. However, the run time grows rapidly with each additional lag, and I don't want to overfit the data.
# Various algorithms for finding weight coefficients
solvers = ['lbfgs', 'bfgs', 'newton', 'nm', 'cg', 'ncg', 'powell']
# order shape is restricted by size of data
order = (4,1,1)
model = ARIMA(list(train_y), order=order)
model_fit = model.fit()
forecast = model_fit.forecast(steps=len(test_y))
visual_validate(forecast[0],test_y,'ARIMA{}'.format(order))
# AR lag chosen via AR's recommended lag, MA is highest optimal
order=(14,1,7)
model = ARIMA(list(train_y), order=order)
model_fit = model.fit()
forecast = model_fit.forecast(steps=len(test_y))
visual_validate(forecast[0],test_y,'ARIMA{}'.format(order))
Both ARIMA predictions appear to overestimate the number of accidents that occur. However, the second model does seem to predict the spike that occurs between weeks 2-4 and again in weeks 8-9.
ba_season = ba.loc[ba.index < '2015-11']
order = (4,1,6)
seasonal_order = (0,0,0,52)
mod = SARIMAX(
ba_season,
order=order,
seasonal_order=seasonal_order,
enforce_stationarity=False,
enforce_invertibility=False)
results = mod.fit()
sarimax_forecast = results.forecast(len(test_y)).values
visual_validate(sarimax_forecast,test_y,'SARIMAX{}x{}'.format(order,seasonal_order))
prophet_df = pd.DataFrame(ba);prophet_df.columns = ['y']
prophet_df['ds'] = prophet_df.index
prophet_df = prophet_df.reset_index()
del prophet_df['index']
prophet_df = prophet_df[['ds','y']].iloc[train_X]
m = Prophet(weekly_seasonality=False, daily_seasonality=False,seasonality_prior_scale=0.1)
m.fit(prophet_df)
future = m.make_future_dataframe(periods=len(test_y))
p_forecast = m.predict(future)
visual_validate(p_forecast.yhat.values[test_X],test_y,'PROPHET')
# m.plot_components(p_forecast);
Prophet does not appear to do very well with the practice test set.
Based on my models, it seems that simple AR models performed better than the combined ARIMA models. Lags of 4 and 14 appear to be the strongest for AR. Prophet and ARIMA did not perform well on the practice test set, so my assumption is that an AR(14) or AR(4) model will perform best. In this section, I retrain my models on the entire 2012-2015 dataset and predict 2016, then compare all of the models in one final plot to see which were the most effective.
# removing first week of 2016 from train_set
ba = ba.loc[ba.index < '2016']
y=ba.values
prophet_df = pd.DataFrame(ba)
prophet_df.columns = ['y']
prophet_df['ds'] = prophet_df.index
prophet_df = prophet_df.reset_index()
del prophet_df['index']
prophet_df = prophet_df[['ds','y']]
model_predictions,model_name,RMSE_score = [],[],[]
test_accidentsdf = pd.read_csv('../data/odot/Multnomah-2016.csv',low_memory=False);test_accidentsdf.head(8)
usedcol = ['Crash ID','Crash Hour','Crash Day','Crash Month','Crash Year','Total Pedalcyclist Count',
           'Total Pedalcyclist Fatality Count','Total Pedalcyclist Non-Fatal Injury Count','Crash Type',
           'Collision Type','Median Type','Crash Severity','Intersection Type',
           'Nearest Intersecting Street Number','Road Character']
bikeaccident_test = test_accidentsdf[usedcol][test_accidentsdf['Total Pedalcyclist Count']>0].dropna().reset_index()
bikeaccident_test.index = pd.to_datetime(
bikeaccident_test['Crash Year']*10000+bikeaccident_test['Crash Month']*100+bikeaccident_test['Crash Day'],
format='%Y%m%d')
ba_test = bikeaccident_test['Total Pedalcyclist Count'].groupby(pd.Grouper(freq='W')).count()
[OLS]
X_final = np.arange(len(y), len(y) + len(ba_test))  # continue the weekly index past the training range
model_predictions.append(list(a+X_final*b))
RMSE_score.append(RMSE(a+X_final*b,ba_test.values))
model_name.append('Linear Regression')
[AR]
lags=[14]
for i in lags:
order = (i,0,0)
model = ARIMA(list(y), order=order)
model_fit = model.fit()
forecast = model_fit.forecast(steps=len(ba_test))
model_predictions.append(forecast[0])
RMSE_score.append(RMSE(forecast[0],ba_test.values))
model_name.append('AR({})'.format(i))
[ARIMA]
orders = [(14,1,7)]
for o in orders:
model = ARIMA(list(y), order=o)
model_fit = model.fit()
forecast = model_fit.forecast(steps=len(ba_test))
model_predictions.append(forecast[0])
RMSE_score.append(RMSE(forecast[0],ba_test.values))
model_name.append('ARIMA{}'.format(o))
[SARIMAX]
order = (4,1,6)
seasonal_order = (0,0,0,52)
mod = SARIMAX(
ba.loc[ba.index < '2016'],
order=order,
seasonal_order=seasonal_order,
enforce_stationarity=False,
enforce_invertibility=False)
results = mod.fit()
sarimax_forecast = results.forecast(len(ba_test)).values
model_predictions.append(sarimax_forecast)
RMSE_score.append(RMSE(sarimax_forecast,ba_test.values))
model_name.append('SARIMAX{}x{}'.format(order,seasonal_order))
[PROPHET]
prophet_df_test = pd.DataFrame(ba_test);prophet_df_test.columns = ['y']
prophet_df_test['ds'] = prophet_df_test.index;prophet_df_test = prophet_df_test.reset_index();del prophet_df_test['index'];prophet_df_test = prophet_df_test[['ds','y']]
m = Prophet(weekly_seasonality=False, daily_seasonality=False,seasonality_prior_scale=0.1)
m.fit(prophet_df)
future = m.make_future_dataframe(periods=52)
p_forecast = m.predict(future)[-(len(ba_test)):]
model_predictions.append(p_forecast.yhat.values)
RMSE_score.append(RMSE(p_forecast.yhat.values,ba_test.values))
model_name.append('Prophet')
[ENSEMBLE(MEAN)]
total = np.zeros(len(ba_test))
set_model = [np.array(i) for i in model_predictions]
for i in set_model:
total+=i
ensemble = total/len(model_predictions)
model_predictions.append(ensemble)
RMSE_score.append(RMSE(ensemble,ba_test.values))
model_name.append('Ensemble (Mean)')
plt.figure(figsize=(20,10))
plt.plot(ba_test.values,label='Actual',color='gray',alpha=.5)
for i in range(len(model_predictions)-1):
plt.plot(model_predictions[i],label=model_name[i],marker='o',linewidth=.5,markersize=6,alpha=.7)
plt.plot(model_predictions[-1],label=model_name[-1],marker='.',linewidth=1,markersize=3,alpha=1,color='black')
plt.legend()
print('Final Predictions for 2016')
plt.show()
for i in range(len(model_name)):
print('{}:{:.2f}'.format(model_name[i],RMSE_score[i]))
Now let's see what was different about 2016:
cmap=plt.cm.Blues_r;years = [2012,2013,2014,2015]
yr_plot=[ba.iloc[ba.index.searchsorted(pd.datetime(y, 1, 1)):ba.index.searchsorted(pd.datetime(y, 12, 31))].values for y in years]
years.append(2016)
yr_plot.append(ba_test.values)
plt.figure(figsize=(18,4))
print('Bike Accidents by Training Years (2012-2015) and Target Year (2016)')
for i,year in enumerate(yr_plot):
if (years[i] == 2016):
plt.plot(year,color='red')
else:
plt.plot(year,color=cmap(i/(len(years)+2)))
plt.legend(years);plt.xlabel('Week #',fontsize=12)
plt.ylabel('Accident Count',fontsize=12)
plt.show()
It looks like bike accidents dropped off sharply during the summer (somewhere around June-July). Overall, accidents appear to be much less frequent than in previous years.
Final Observations: ARIMA(14,1,7) and Prophet appear to be the most effective at predicting 2016! AR(4) and AR(14) performed similarly to my linear regression prediction. Prophet was likely the most successful because it accounted for both seasonality and trend, while ARIMA only accounted for trend.
Difficulties: I encountered a number of difficulties with forecasting. The biggest was working with Statsmodels' ARIMA model: customizing the orders (p,d,q) led to a number of different error types, and ultimately I could not test some of my hypotheses about the number of lags due to these errors. Additionally, I'm still exploring ways of improving predictions by removing seasonality and do not yet know the best ways to transform and back-transform the data.
Future Work: While some of the predictions are reasonable, I still need to evaluate the different solver methods for determining ARIMA's coefficients. I also need to configure SARIMAX with a non-trivial seasonal order so that it actually accounts for the seasonality in the data; my assumption is that a properly seasonal SARIMAX would outperform the ARIMA models. My methods for determining the best lag fit did not ultimately help me find the best model, so I need to better explore the relationship between my predictions and the number of lags used.
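A minimal sketch of that SARIMAX direction, with a hypothetical (untuned) seasonal order rather than the (0,0,0,52) used above, which contains no seasonal AR, MA, or differencing terms:
# Sketch: SARIMAX with real seasonal terms (orders are guesses, not tuned)
seasonal_mod = SARIMAX(ba, order=(4, 1, 1), seasonal_order=(1, 1, 0, 52),
                       enforce_stationarity=False, enforce_invertibility=False)
seasonal_results = seasonal_mod.fit(disp=0)  # can be slow with a 52-week season
print(seasonal_results.forecast(len(ba_test)))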