In [1]:
from utils.mf_exploratory import *
train_numeric.csv loaded...
train_date.csv loaded...
train_cat.csv loaded...

Bosch Manufacturing

Part 1: Exploratory Analysis


Author: Ryan Harper

Data Source: Bosch Dataset via Kaggle

Background: Bosch is a home appliance and industrial tools manufacturing company. In 2017, Bosch supplied Kaggle.com with manufacturing data for a competition. The goal of the competition was to identify factors that influence whether a product passes the final response stage of manufacturing and to predict which products are likely to fail based on their path through the manufacturing process.

The Data: Early exploration uses a subset of the full dataset provided by Bosch. The subset was prepared by Hitesh, John, and Matthew via the PDX Data Science Meetup and is divided into two groups of three files (3 training, 3 test). Each group has one CSV file for numerical features ('numeric'), one for timestamps ('date'), and one for the manufacturing path ('cat'). The subset includes a larger percentage of products that failed the response test than the full dataset, but not much more is known about the subsampling method.
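For reference, a minimal sketch of how the three training files might be read in (the notebook's actual loading is handled inside utils.mf_exploratory, so the paths and variable names here are assumptions):

import pandas as pd

# Hypothetical paths -- the notebook's actual loading happens inside utils.mf_exploratory
mf_num_data = pd.read_csv('data/train_numeric.csv')    # numerical measurements per feature
mf_date_data = pd.read_csv('data/train_date.csv')      # timestamps per feature
mf_cat_data = pd.read_csv('data/train_cat.csv')         # categorical manufacturing-path features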

Assumptions: Each ID represents a specific product, and only one type of product is being manufactured. Differences in assembly are due to customization and/or differences between lines.

Goal: Predict which products will fail the response test.

1. Numerical Data Exploration

A. Dataframe Visualization

Because the dataset is extremely large, it is important to explore its structure to get a feel for what is happening. The numerical data appears to be the most important, so I will start by examining it.

There is a lot of missing data (NaN values). It is not obvious whether this is due to recording errors (which I suspect is not the case) or to how the data is structured, so I need to keep visualizing the data.
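One quick way to quantify this sparsity (a small sketch, assuming the mf_num_data frame loaded above) is to look at the fraction of non-null entries per column:

# Fraction of non-null entries per feature column (Id and Response excluded)
fill_rate = mf_num_data.iloc[:, 1:-1].notnull().mean()
print(fill_rate.describe())           # how sparse the columns are overall
print(fill_rate.sort_values()[:5])    # the emptiest feature columns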

In [2]:
# visualize numerical data with missingno (msno.matrix creates its own figure, so no plt.figure() is needed)
msno.matrix(mf_num_data, figsize=(20,6))
plt.title('Train_Numeric.csv Data Visualization',color=(0.239, 0.474, 1), size=20)
plt.ylabel('Rows',size=15)
plt.xlabel('Columns',size=15)
plt.show();

The graph above shows which values are present (purple) and which are missing (white). The final column on the right side of the graph shows how many values are present in each row. Every row appears to have at most 20-30% of its columns filled in with values. This graph confirms that the data set is very sparse and matches the descriptions of the data from the Kaggle site and from Hitesh's presentation.
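The 20-30% figure can be checked directly by computing a per-row fill rate; a short sketch under the same assumptions:

# Share of feature columns that are non-null in each row
row_fill = mf_num_data.iloc[:, 1:-1].notnull().mean(axis=1)
print(row_fill.describe())   # typical per-row fill rate
print(row_fill.max())        # upper bound on how full any single row is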

B. Features

In [3]:
# show data with pandas
mf_num_data.head(3)
Out[3]:
Id L0_S0_F0 L0_S0_F2 L0_S0_F4 L0_S0_F6 L0_S0_F8 L0_S0_F10 L0_S0_F12 L0_S0_F14 L0_S0_F16 ... L3_S50_F4245 L3_S50_F4247 L3_S50_F4249 L3_S50_F4251 L3_S50_F4253 L3_S51_F4256 L3_S51_F4258 L3_S51_F4260 L3_S51_F4262 Response
0 23 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0
1 71 -0.167 -0.168 0.276 0.33 0.074 0.161 0.052 0.248 0.163 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0
2 76 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0

3 rows × 970 columns

Split the column names to look at Line/Station/Features:

In [4]:
# Create column lists and breakup the strings
columns_set = list(mf_num_data.columns)[1:-1]
breakup_strings = [i.split('_') for i in columns_set]

# store the values in separate containers
line_count = set([i[0] for i in breakup_strings[0:-1]])
station_count = set([i[1] for i in breakup_strings[0:-1]])
feature_count = set([i[2] for i in breakup_strings[0:-1]])

print('Non-Continuous Columns: {}'.format([mf_num_data.columns[0],mf_num_data.columns[-1]]))
print('Unique lines: {}\nUnique Stations: {}\nUnique Feature Measurements: {}'.format(len(line_count),len(station_count),len(feature_count)))
Non-Continuous Columns: ['Id', 'Response']
Unique lines: 4
Unique Stations: 50
Unique Feature Measurements: 967

The Id column lists the product ID for each row. The final Response column is binary and indicates whether the product passed (0) or failed (1) the final response test. There are a lot of different feature measurements spread across only 4 lines and 50 stations, which suggests a degree of overlap between the columns and also suggests a directional flow through the production line.
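To make that overlap concrete, the feature columns can be grouped by their station prefix; a sketch, based on the Lx_Sy_Fz column naming seen above:

from collections import defaultdict

# Group the feature columns by their 'Lx_Sy' station prefix
station_features = defaultdict(list)
for col in mf_num_data.columns[1:-1]:
    line, station, feature = col.split('_')
    station_features[line + '_' + station].append(col)

# Stations ordered by how many measurements they record
sizes = sorted(station_features.items(), key=lambda kv: -len(kv[1]))
print([(k, len(v)) for k, v in sizes[:5]])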

C. Failure Rate Visualization

Calculate Failure Rate and Show Visual:

In [5]:
vals = [get_ratio(i) for i in mf_num_data.columns[1:-1]]                 # calculate ratio of failures
sorted_corr = pd.DataFrame(vals).sort_values(by=[1],ascending=False)     # sort values and push to df

plt.figure(figsize=(20,4))
sns.barplot(x=0,y=1,data=sorted_corr,palette="cool_r",linewidth=.1,edgecolor=".8",orient='v')
plt.xticks([])
plt.xlabel('Features',size=14)
plt.ylabel('% Failure',size=14)
plt.ylim(0,.15)
plt.title('Failure % per Feature (Train_Numeric.csv)*',size=17)
plt.text(1,-.02, '*:L3_S32_F3850 (45%) was an outlier and was excluded for visualization purposes')
plt.show()

It helps to visualize each feature by its failure rate. Each bar corresponds to a feature column, sorted by failure rate. With the exception of L3_S32_F3850, most features have a failure rate between 0.01 and 0.14. Early exploration of the data suggests that there is a wide distribution of failure rates across features.
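get_ratio is defined in utils.mf_exploratory; my working assumption is that it returns, for each feature column, the failure rate among the rows where that feature was actually measured, roughly along these lines (get_ratio_sketch is a hypothetical stand-in):

def get_ratio_sketch(col):
    """Hypothetical stand-in for utils.get_ratio: failure rate among the
    rows where the given feature column was actually measured."""
    measured = mf_num_data[mf_num_data[col].notnull()]
    rate = measured['Response'].mean() if len(measured) else 0.0
    return col, rate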

D. Graph/Network Data

In [6]:
row_count = len(mf_num_data)
# Create a network set, convert it to an adjacency matrix, and then save the matrix as a txt file:
create_adjacency_matrix(mf_num_data)

# Calculate the overall failure rate of a line/station:
station_count, station_feature_group, station_name = network_failure_rate(mf_num_data)

# Find list of Starting Stations:
first_stations = set(['_'.join(mf_num_data.iloc[i].dropna().index[1].split('_')[0:2]) for i in range(row_count)])
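create_adjacency_matrix and network_failure_rate also live in utils.mf_exploratory. The idea, as I understand it, is that each row's ordered sequence of non-null stations defines directed edges between consecutive stations; a hedged sketch of that edge-building step:

# Directed station -> station transitions implied by each product's row
edges = set()
for _, row in mf_num_data.iterrows():
    cols = row.drop(['Id', 'Response']).dropna().index
    stations = []
    for col in cols:
        s = '_'.join(col.split('_')[:2])
        if not stations or stations[-1] != s:
            stations.append(s)                     # keep column order, drop repeats
    edges.update(zip(stations[:-1], stations[1:]))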

Load and Create NetworkX Graph:

In [7]:
# Convert txt file to networkx graph
G=nx.read_adjlist("utils/mf_exploratory_adjaceny.txt", create_using=nx.DiGraph) # A=to_agraph(G)
    
# Map colors to G - this part is messy
station_dict = {}
for i in range(len(station_count)):
    station_dict[station_name[i]] = station_count[i]             # assign count num to stations
min_tick,max_tick = remove_outlier(station_count)  
norm = mpl.colors.Normalize(vmin=min_tick, vmax=max_tick)        # normalize cm gradient range                                
m = cm.ScalarMappable(norm=norm, cmap=cm.cool)                   # rescale cm gradient range to new cmap 
colr_check = [m.to_rgba(station_dict[n]) for n in list(G.nodes)] # converts count to rgba
In [8]:
# Run Visualization
if import_fail: # load premade image if pygraphviz can't load 
    display(Image(filename='network_modeling3.png')) 
    
else: # if pygraphviz does load
    pos = graphviz_layout(G, prog='dot') 
    
    pos_shift = {}
    for k, v in pos.items():
        pos_shift[k] = (v[0], v[1]+3) # offset on the y axis for features
    
    # Begin Visualization
    fig = plt.figure(figsize=(60,50))
    plt.title('Bosch Manufacturing Network',size=60,color='darkblue')
    
    # draw edges
    nx.draw_networkx_edges(G,
                           pos=pos, 
                           alpha=1,
                           width=.7,
                           arrows=False,
                           arrowsize=10,
                           edge_color=range(len(G.edges)),
                           edge_cmap=plt.cm.winter)

    # draw nodes
    nx.draw_networkx_nodes(G, 
                           pos=pos, 
                           node_size=2000, 
                           node_shape = 's',
                           node_color=colr_check,
                           alpha=1)

    # draw labels
    nx.draw_networkx_labels(G,
                            pos=pos,
                            font_color='white')
    
    
    props = dict(boxstyle='round', facecolor='lavender', alpha=0.8)
    
    for k in station_feature_group.keys():
        if k in ['L1_S25']:
            subset = '\n'.join(station_feature_group[k][:200])
        elif k in ['L1_S24']:
            subset = '\n'.join(station_feature_group[k][:140])
        else:
            subset = '\n'.join(station_feature_group[k])

        plt.annotate(subset, xy=pos_shift[k], xytext=(0,30),
                     textcoords='offset points',
                     color='b', size=7,
                     arrowprops=dict(
                         arrowstyle='simple,tail_width=0.3,head_width=0.6,head_length=0.6',
                         facecolor='black'),
                     bbox=props)

    
    # Cleaning visuals
    plt.xticks([]); plt.yticks([]); plt.tight_layout()
    
    # Colorbar
    m.set_array([])                                # needs empty list
    cbaxes = fig.add_axes([.93,.65, 0.005, 0.3]) 
    cbar = plt.colorbar(m,cax=cbaxes)
    cbar.ax.set_title("Station Failure %",size=17)

    plt.savefig('visuals/network_modeling3.png',bbox_inches='tight')
    plt.show()

Experimental Method for Building Network Chain (Unused):

mf_network = mf_num_data.where(mf_num_data.isnull(), 1).fillna(0).astype(int)
del mf_network['Id']
del mf_network['Response']

2. Timeseries Data

Because this data set represents a manufacturing chain, time is an essential metric. Machine parts could break down during production, which might ultimately affect a larger batch of products. The goal of this section is to see whether time has any relationship (correlation) with the failure rate of the product.

In [9]:
# show data
mf_date_data.head(3)
Out[9]:
Id L0_S0_D1 L0_S0_D3 L0_S0_D5 L0_S0_D7 L0_S0_D9 L0_S0_D11 L0_S0_D13 L0_S0_D15 L0_S0_D17 ... L3_S50_D4248 L3_S50_D4250 L3_S50_D4252 L3_S50_D4254 L3_S51_D4255 L3_S51_D4257 L3_S51_D4259 L3_S51_D4261 L3_S51_D4263 Response
0 23 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0
1 71 1458.06 1458.06 1458.06 1458.06 1458.06 1458.06 1458.06 1458.06 1458.06 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0
2 76 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0

3 rows × 1158 columns

A. Time Series Visualization

Final Time Distribution:

In order to visualize the time data, I decided to capture the final time recorded in each row. Sampling the data by the last recorded time lets us visualize time as a distribution; the plots below cover times starting at 0 and ending near 2000.
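The final-time extraction itself happens inside plot_time_series (in utils.mf_exploratory); a minimal sketch of how it could be derived, with time_series_plot as an assumed name inferred from its later use:

# Last timestamp recorded for each product across all date columns
final_time = mf_date_data.iloc[:, 1:-1].max(axis=1)
time_series_plot = pd.DataFrame({'final_time': final_time,
                                 'Response': mf_date_data['Response']})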

In [10]:
plot_time_series(False)
plt.title('Time series of train_date.csv for final time and response (Regular Count)')
plt.savefig('visuals/timeseries-regular.png')
plt.show()

The sample distributions above are shown as raw counts. This histogram reinforces my understanding of the data, as it shows a marginally smaller subsample of failed products.

In [11]:
plot_time_series(True)
plt.title('Time series of train_date.csv for final time and response (Gaussian Density Estimate)')
plt.savefig('visuals/timeseries-gde.png')
plt.show()

This graph shows the normalized distributions of the two samples. Here, the Failure and Success groups show noticeably different trends: the Success group seems to have a smaller cyclical range than the Failure group, and the peak for failed products appears to be around 700-800. It would be helpful to know the unit of measurement of the timestamp in order to consider seasonality (by time of day, month, and/or year), but it is currently unknown.
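One way to put numbers on that 700-800 peak (not run here) would be to bin the final times and compute the failure rate per bin, reusing the hypothetical time_series_plot frame sketched earlier:

# Failure rate per time bin (bin width 100 over the observed range)
bins = pd.cut(time_series_plot['final_time'], bins=range(0, 2100, 100))
failure_by_bin = time_series_plot.groupby(bins)['Response'].mean()
print(failure_by_bin.sort_values(ascending=False).head())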

Method for Getting Distribution Data for creating a Factorization (Unused):

sns.distplot(time_series_plot['final_time'][time_series_plot['Response']==0]).get_lines()[0].get_data()

3. Categorical Data

Now that the numerical data and the date data have been visualized, the next step is to take a look at the categorical data.

In [12]:
mf_cat_data.head(3)
Out[12]:
Id L0_S1_F25 L0_S1_F27 L0_S1_F29 L0_S1_F31 L0_S2_F33 L0_S2_F35 L0_S2_F37 L0_S2_F39 L0_S2_F41 ... L3_S49_F4227 L3_S49_F4229 L3_S49_F4230 L3_S49_F4232 L3_S49_F4234 L3_S49_F4235 L3_S49_F4237 L3_S49_F4239 L3_S49_F4240 Response
0 23 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0
1 71 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0
2 76 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0

3 rows × 2142 columns

The shape of the categorical data is a bit trickier than that of the numerical and date data sets. As such, exploration of the categorical data will be postponed until after I am able to build a full model with predictions; I want to explore how the categorical data affects my model only after it has been built.

4. Failed Product Exploration

Because the analysis needs to focus on the failed products, it is important to analyze the context in which the failed products appear.
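As a first check on that context, the overall class balance of this subset can be computed directly; a small sketch:

# Overall class balance in the subsampled training data
print(mf_num_data['Response'].value_counts())
print(mf_num_data['Response'].mean())     # fraction of failed products (Response == 1)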

mf_num_data[mf_num_data['Response'] == 0].describe()

Shown below is a sample of the failed products:

In [13]:
mf_num_data[mf_num_data['Response'] == 1].head()
Out[13]:
Id L0_S0_F0 L0_S0_F2 L0_S0_F4 L0_S0_F6 L0_S0_F8 L0_S0_F10 L0_S0_F12 L0_S0_F14 L0_S0_F16 ... L3_S50_F4245 L3_S50_F4247 L3_S50_F4249 L3_S50_F4251 L3_S50_F4253 L3_S51_F4256 L3_S51_F4258 L3_S51_F4260 L3_S51_F4262 Response
31 1053 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.0
36 1250 0.075 0.101 -0.179 -0.216 -0.013 0.070 -0.022 -0.152 0.087 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.0
38 1350 0.069 0.041 0.330 0.330 -0.100 -0.294 0.008 0.088 -0.092 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.0
52 1793 0.003 -0.026 0.330 0.294 0.074 0.161 0.022 0.128 -0.199 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.0
67 2347 -0.114 -0.161 0.330 0.330 -0.013 0.116 0.045 0.288 0.036 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.0

5 rows × 970 columns

In [14]:
features = mf_num_data.columns[1:-1]
success_mean = [mf_num_data[f][mf_num_data['Response']==0].mean() for f in features]
fail_mean = [mf_num_data[f][mf_num_data['Response']==1].mean() for f in features]
real_mean = [mf_num_data[f].mean() for f in features]
In [15]:
mean_check = pd.DataFrame(data=[success_mean,fail_mean,real_mean]).T
mean_check['features'] = features
mean_check.columns = ['success_m','fail_m','real_mean','features']
mean_check_reduced = mean_check[0:15]

fig = plt.figure(figsize=(20,10))
sns.barplot(data=mean_check_reduced, x='features', y='fail_m', color='purple',)
sns.barplot(data=mean_check_reduced, x='features', y='success_m', color='lightblue')
plt.title('Comparing Mean Values of Failed vs Successful')
fail = mpatches.Patch(color='purple', label='Fail')
success = mpatches.Patch(color='lightblue', label='Success')
plt.legend(handles=[fail,success])
plt.show()