from utils.mf_exploratory import *

train_numeric.csv loaded...
train_date.csv loaded...
train_cat.csv loaded...

Bosch Manufacturing¶

Part 1: Exploratory Analysis¶

[Part 2] [Part 3]

Author: Ryan Harper

Data Source: Bosch Dataset via Kaggle

Background: Bosch is a home appliance and industrial tools manufacturing company. In 2017, Bosch supplied Kaggle.com with manufacturing data to promote a competition. The goal of the competition was to determine factors that influence whether or not the product passes the final response stage of manufacturing and to predict which products are likely to fail based on this manufacturing process.

The Data: Early exploration of this data will use a subset of the big data provided by Bosch. The data subset is provided by Hitesh, John, and Matthew via PDX Data Science Meetup. The data subset is divided into 2 groups of 3 files (3 training, 3 test). Each group has one csv file each for numerical features ('numeric'), dates ('date'), and the manufacturing path ('cat'). The data subset includes a larger percentage of products that failed the response test, but not much more is known about this subsampling method.

Assumptions: ID # represents a specific product and that there is only one product. The differences in assembly are due to customization and/or differences between lines.

Goal: Predict which products will fail the response test.

1. Numerical Data Exploration¶

A. Dataframe Visualization¶

As the data is extremely large, it is important to explore the structure of the data to get a feel for what is happening. The numerical data appears to be the most important, so I will start with examining the numerical data.

There are a lot of missing data and null (NaN) values. It is not obvious if this is due to recording errors (which I suspect is not the case), or if it is related to the structuring of the data. I need to continue visualizing the data.

# visualize numerical data with missingno
plt.figure()
msno.matrix(mf_num_data, figsize=(20,6))
plt.title('Train_Numeric.csv Data Visualization',color=(0.239, 0.474, 1), size=20)
plt.ylabel('Rows',size=15)
plt.xlabel('Columns',size=15)
plt.show();

<Figure size 432x288 with 0 Axes>

The graph above shows which values are included (purple) and which values are missing (white). The final column on the right side of the graph shows how many values are included in each row. Every row appears to have at most 20-30% of the columns filled in with values. This graph confirms that the data set is very sparse and reaffirms explanations of the data from the Kaggle site and from Hitesh's presentation on the data.

B. Features¶

# show data with pandas
mf_num_data.head(3)

Split the column names to look at Line/Station/Features:

# Create column lists and breakup the strings
columns_set = list(mf_num_data.columns)[1:-1]
breakup_strings = [i.split('_') for i in columns_set]

# store the values in separate containers
line_count = set([i[0] for i in breakup_strings[0:-1]])
station_count = set([i[1] for i in breakup_strings[0:-1]])
feature_count = set([i[2] for i in breakup_strings[0:-1]])

print('Non-Continuous Columns: {}'.format([mf_num_data.columns[0],mf_num_data.columns[-1]]))
print('Unique lines: {}\nUnique Stations: {}\nUnique Feature Measurements: {}'.format(len(line_count),len(station_count),len(feature_count)))

Non-Continuous Columns: ['Id', 'Response']
Unique lines: 4
Unique Stations: 50
Unique Feature Measurements: 967

The ID column lists the product ID for each row. The final Response column is binary (0 or 1) and indicates the ultimate failure or success of the product. There appear to be a lot of different feature measurements located in only 4 lines and 50 stations. This suggests a degree of overlap with the columns and also suggests a directional flow of the data.

C. Failure Rate Vizualization¶

Calculate Failure Rate and Show Visual:

vals = [get_ratio(i) for i in mf_num_data.columns[1:-1]]                 # calculate ratio of failures
sorted_corr = pd.DataFrame(vals).sort_values(by=[1],ascending=False)     # sort values and push to df

plt.figure(figsize=(20,4))
sns.barplot(x=0,y=1,data=sorted_corr,palette="cool_r",linewidth=.1,edgecolor=".8",orient='v')
plt.xticks([])
plt.xlabel('Features',size=14)
plt.ylabel('% Failure',size=14)
plt.ylim(0,.15)
plt.title('Failure % per Feature (Train_Numeric.csv)*',size=17)
plt.text(1,-.02, '*:L3_S32_F3850 (45%) was an outlier and was excluded for visualization purposes')
plt.show()

It helps to visualize each feature by the percentage failure rate. Each column represents a cluster of features. With the exception of L3_S32_F3850, it is clear that most columns have a failure rate between 0.14 and 0.01. Early exploration of the data definitely suggests that their is a distribution of features with very different results.

D. Graph/Network Data¶

row_count = len(mf_num_data)
# Create a network set, convert it to an adjacency matrix, and then save the matrix as a txt file:
create_adjacency_matrix(mf_num_data)

# Calculate the overall failure rate of a line/station:
station_count, station_feature_group, station_name = network_failure_rate(mf_num_data)

# Find list of Starting Stations:
first_stations = set(['_'.join(mf_num_data.iloc[i].dropna().index[1].split('_')[0:2]) for i in range(row_count)])

Load and Create NetworkX Graph:

# Convert txt file to networkx graph
G=nx.read_adjlist("utils/mf_exploratory_adjaceny.txt", create_using=nx.DiGraph) # A=to_agraph(G)
    
# Map colors to G - this part is messy
station_dict = {}
for i in range(len(station_count)):
    station_dict[station_name[i]] = station_count[i]             # assign count num to stations
min_tick,max_tick = remove_outlier(station_count)  
norm = mpl.colors.Normalize(vmin=min_tick, vmax=max_tick)        # normalize cm gradient range                                
m = cm.ScalarMappable(norm=norm, cmap=cm.cool)                   # rescale cm gradient range to new cmap 
colr_check = [m.to_rgba(station_dict[n]) for n in list(G.nodes)] # converts count torgba

# Run Visualization
if import_fail: # load premade image if pygraphviz can't load 
    display(Image(filename='network_modeling3.png')) 
    
else: # if pygraphviz does load
    pos = graphviz_layout(G, prog='dot') 
    
    pos_shift = {}
    for k, v in pos.items():
        pos_shift[k] = (v[0], v[1]+3) # offset on the y axis for features
    
    # Begin Visualization
    fig = plt.figure(figsize=(60,50))
    plt.title('Bosch Manufacturing Network',size=60,color='darkblue')
    
    # draw edges
    nx.draw_networkx_edges(G,
                           pos=pos, 
                           alpha=1,
                           width=.7,
                           arrows=False,
                           arrowsize=10,
                           edge_color=range(len(G.edges)),
                           edge_cmap=plt.cm.winter)

    # draw nodes
    nx.draw_networkx_nodes(G, 
                           pos=pos, 
                           node_size=2000, 
                           node_shape = 's',
                           node_color=colr_check,
                           alpha=1)

    # draw labels
    nx.draw_networkx_labels(G,
                            pos=pos,
                            font_color='white')
    
    
    props = dict(boxstyle='round', facecolor='lavender', alpha=0.8)
    
    for k in station_feature_group.keys():
            if k in ['L1_S25']:
                subset= station_feature_group[k][:200]
                subset='\n'.join(subset)
            elif k in ['L1_S24']:
                subset= station_feature_group[k][:140]
                subset='\n'.join(subset)
            else:
                subset='\n'.join(station_feature_group[k])
        
            plt.annotate(subset, xy=pos_shift[k], xytext=(0,30),
                    textcoords='offset points',
                    color='b', size=7,
                    arrowprops=dict(
                        arrowstyle='simple,tail_width=0.3,head_width=0.6,head_length=0.6',
                        facecolor='black',),
                    bbox=props)

    
    # Cleaning visuals
    plt.xticks([]); plt.yticks([]); plt.tight_layout()
    
    # Colorbar
    m.set_array([])                                # needs empty list
    cbaxes = fig.add_axes([.93,.65, 0.005, 0.3]) 
    cbar = plt.colorbar(m,cax=cbaxes)
    cbar.ax.set_title("Station Failure %",size=17)

    plt.savefig('visuals/network_modeling3.png',bbox_inches='tight')
    plt.show()

Experimental Method for Building Network Chain (Unused):

2. Timeseries Data¶

Because this data set represents a manufacturing chain, time is an essential metric. Machine parts could break down during the production, which might ultimately affect a larger batch of the product. The goal of this section is to see if time has any relationship (correlation) with the failure rate of the product.

# show data
mf_date_data.head(3)

A. Time Series Visualization¶

Final Time Distribution:

In order to visualize the time data, I decided to capture the final time recorded from each row. By sampling the data via the final time that was recorded, it allows us to visualize the time value as a distribution. Below is a visualization of the time data starting at time 0 and ending near 2000:

plot_time_series(False)
plt.title('Time series of train_date.csv for final time and response (Regular Count)')
plt.savefig('visuals/timeseries-regular.png')
plt.show()

The sample distributions are by counts. This histogram reinforces my understanding of the data as it shows a marginally smaller subsample of failed products.

This graph shows the normed distributions of the two samples. In this graph, Failure and Success have relatively different trends. The Success sample group seems to have a smaller cyclical range than the Failure sample group. The peak for failed products appears to be around 700-800. It would be helpful to know the unit of measurement for the time stamp to consider seasonality (by time of day, month, and/or year), but it is currently not known.

Method for Getting Distribution Data for creating a Factorization (Unused):

C. Categorical Data¶

Now that the numerical data and the date data have been visualized, the next step is to take a look at the categorical data.

mf_cat_data.head(3)

The nature of the shape of the categorical data is a little bit more tricky than the numerical and date data sets. As such, an exploration of the categorical will be postponed until after I am able to build a full model with predictions. I want to be able to explore how the categorical data will affect my model only after it has been built.

D. Failed Product Exploration¶

Because the data set needs to focus on the failed products, it is important to analyze the context for which the failed products appears.

Shown below is the collection of failed products:

mf_num_data[mf_num_data['Response'] == 1].head()

features = mf_num_data.columns[1:-1]
success_mean = [mf_num_data[f][mf_num_data['Response']==0].mean() for f in features]
fail_mean = [mf_num_data[f][mf_num_data['Response']==1].mean() for f in features]
real_mean = [mf_num_data[f].mean() for f in features]

mean_check = pd.DataFrame(data=[success_mean,fail_mean,real_mean]).T
mean_check['features'] = features
mean_check.columns = ['success_m','fail_m','real_mean','features']
mean_check_reduced = mean_check[0:15]

fig = plt.figure(figsize=(20,10))
sns.barplot(data=mean_check_reduced, x='features', y='fail_m', color='purple',)
sns.barplot(data=mean_check_reduced, x='features', y='success_m', color='lightblue')
plt.title('Comparing Mean Values of Failed vs Successful')
fail = mpatches.Patch(color='purple', label='Fail')
success = mpatches.Patch(color='lightblue', label='Success')
plt.legend(handles=[fail,success])
plt.show()

	Id	L0_S0_F0	L0_S0_F2	L0_S0_F4	L0_S0_F6	L0_S0_F8	L0_S0_F10	L0_S0_F12	L0_S0_F14	L0_S0_F16	...	L3_S50_F4245	L3_S50_F4247	L3_S50_F4249	L3_S50_F4251	L3_S50_F4253	L3_S51_F4256	L3_S51_F4258	L3_S51_F4260	L3_S51_F4262	Response
31	1053	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1.0
36	1250	0.075	0.101	-0.179	-0.216	-0.013	0.070	-0.022	-0.152	0.087	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1.0
38	1350	0.069	0.041	0.330	0.330	-0.100	-0.294	0.008	0.088	-0.092	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1.0
52	1793	0.003	-0.026	0.330	0.294	0.074	0.161	0.022	0.128	-0.199	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1.0
67	2347	-0.114	-0.161	0.330	0.330	-0.013	0.116	0.045	0.288	0.036	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1.0

	Id	L0_S0_F0	L0_S0_F2	L0_S0_F4	L0_S0_F6	L0_S0_F8	L0_S0_F10	L0_S0_F12	L0_S0_F14	L0_S0_F16	...	L3_S50_F4245	L3_S50_F4247	L3_S50_F4249	L3_S50_F4251	L3_S50_F4253	L3_S51_F4256	L3_S51_F4258	L3_S51_F4260	L3_S51_F4262
0	23	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	71	-0.167	-0.168	0.276	0.33	0.074	0.161	0.052	0.248	0.163	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	76	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

	Id	L0_S0_D1	L0_S0_D3	L0_S0_D5	L0_S0_D7	L0_S0_D9	L0_S0_D11	L0_S0_D13	L0_S0_D15	L0_S0_D17	...	L3_S50_D4248	L3_S50_D4250	L3_S50_D4252	L3_S50_D4254	L3_S51_D4255	L3_S51_D4257	L3_S51_D4259	L3_S51_D4261	L3_S51_D4263
0	23	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	71	1458.06	1458.06	1458.06	1458.06	1458.06	1458.06	1458.06	1458.06	1458.06	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	76	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

	Id	L0_S1_F25	L0_S1_F27	L0_S1_F29	L0_S1_F31	L0_S2_F33	L0_S2_F35	L0_S2_F37	L0_S2_F39	L0_S2_F41	...	L3_S49_F4227	L3_S49_F4229	L3_S49_F4230	L3_S49_F4232	L3_S49_F4234	L3_S49_F4235	L3_S49_F4237	L3_S49_F4239	L3_S49_F4240
0	23	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	71	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	76	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

	Id	L0_S1_F25	L0_S1_F27	L0_S1_F29	L0_S1_F31	L0_S2_F33	L0_S2_F35	L0_S2_F37	L0_S2_F39	L0_S2_F41	...	L3_S49_F4227	L3_S49_F4229	L3_S49_F4230	L3_S49_F4232	L3_S49_F4234	L3_S49_F4235	L3_S49_F4237	L3_S49_F4239	L3_S49_F4240
0	23	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	71	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	76	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

	Id	L0_S1_F25	L0_S1_F27	L0_S1_F29	L0_S1_F31	L0_S2_F33	L0_S2_F35	L0_S2_F37	L0_S2_F39	L0_S2_F41	...	L3_S49_F4227	L3_S49_F4229	L3_S49_F4230	L3_S49_F4232	L3_S49_F4234	L3_S49_F4235	L3_S49_F4237	L3_S49_F4239	L3_S49_F4240
0	23	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	71	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	76	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN