Capstone #1: Prep Course Capstone Analytic Report

Exploratory Analysis of Box Office Movies by Genre¶

Author: Ryan Harper

Overview¶

Data Source: https://www.kaggle.com/tmdb/tmdb-movie-metadata/data
Original Source: https://www.themoviedb.org

The data set in this notebook contains information about movies listed on IMDB. It includes categorical variables (genre, language, country), discrete variables (ratings), and continuous variables (budget and revenue). The data was pulled from www.themoviedb.org and posted on the www.kaggle.com website. This analysis looks at the movie's genre, budget, and gross profits to help investors decide which movies are more likely to be financially successful at the box office.

Questions:

Question 1: Do movie genres vary from one another? In what ways?
Question 2: What is the distribution of gross per genre?
Question 3: Is there a relationship between budget and gross profits per genre?

Hypotheses:

Hypothesis 1: There is a difference between Action and Animation genre samples for gross profits.
Hypothesis 2: Budget and genre correlate with gross.
Hypothesis 3: Animation films have the best gross profit potential depending on budget.

Results¶

Answers:

Answer 1: Sample sizes vary significantly between genres (may reflect the overall industry or just a result of scraped data). There are significant differences between genres with regards to average gross profits but this might also be impacted by average budget. Animation films appear to have the best gross profit between the inner and upper quartile range. Romance, Horror, and Music genres seem to trend towards a positive gross profit albeit their upper quartile ranges are smaller than other high grossing genres. The lower quartile range of Action and Adventure genres seem to show a larger range of financial loss.

Answer 2: Based on the histogram plots, most genres appear to be left skewed. Exponential curves in QQ plots indicates most genres are left skewed. Animation appears more linear so it could have a normal distribution.

Answer 3: Action, Fantasy, Science Fiction genres have the highest correlation between budget and profit. Western, Horror, War, and History genres have low correlation between budget and profit

Part 1: Cleaning Data¶

# import libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import calendar
import seaborn as sns
from matplotlib.pyplot import subplots, show
import ast
import math
from scipy.stats import ttest_ind, mannwhitneyu, median_test
import missingno as msno #Missing data visualization module for Python.
from IPython.display import display
from IPython.core.debugger import Tracer

%matplotlib inline

# change optional settings for libraries
sns.set_style("whitegrid")
pd.set_option('show_dimensions', False)
np.warnings.filterwarnings('ignore')

Created 'gross' column and deleted unused columns:

# import data, add gross profit column, and delete homepage/overview columns
moviedata = pd.read_csv('../data/tmdb_5000_movies.csv', encoding = "ISO-8859-1") 
moviedata['gross'] = moviedata['revenue'] - moviedata['budget']
del moviedata['homepage']
del moviedata['overview']
del moviedata['original_title']

Values in some columns printed out the data as a dictionary within a string:

#rows include values that are strings with dictionaries inside
print(moviedata.iloc[1]['genres'])
print(moviedata.iloc[1]['production_countries'])
print(moviedata.iloc[1]['production_companies'])

[{"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 28, "name": "Action"}]
[{"iso_3166_1": "US", "name": "United States of America"}]
[{"name": "Walt Disney Pictures", "id": 2}, {"name": "Jerry Bruckheimer Films", "id": 130}, {"name": "Second Mate Productions", "id": 19936}]

Functions to interpret data embedded in the dict-like strings:

def find_set(seriesvalue):
    # function for returning list of unique values in column
    unique=[]
    for seriesvalue in seriesvalue:
        seriesvalue = ast.literal_eval(seriesvalue)
        for fulldict in seriesvalue:
            # The key for the dict values is 'name'
            if 'name' in fulldict:
                unique.append(fulldict['name'])          
    return set(unique)

def name_fix(seriesvalue):
    # function for converting dictionaries to simple lists in columns
    try:
        seriesvaluelist=[]
        seriesvalue = ast.literal_eval(seriesvalue)
        for valuedict in seriesvalue:
            if 'name' in valuedict:
                seriesvaluelist.append(valuedict['name'])
        return ', '.join(seriesvaluelist)
    except (SyntaxError, ValueError, TypeError) as e:
        return seriesvalue

# creates list of unique values for specific columns with string data
un_genre = list(find_set(moviedata.genres))
un_country = list(find_set(moviedata.production_countries))
un_language = list(find_set(moviedata.spoken_languages))
un_keywords = list(find_set(moviedata.keywords))
un_companies = list(find_set(moviedata.production_companies))

# runs the name fix function to make the dataset easier to read
for column in moviedata:
    moviedata[column] = moviedata[column].apply(name_fix)

# check for missing data
msno.matrix(moviedata,color=(.1, .6, .8))

<matplotlib.axes._subplots.AxesSubplot at 0x1a17e13ba8>

Data Summary:

# reporting of data
print("TOTAL MOVIES: {}".format(moviedata['id'].count()))
print("COUNTRIES: {}, GENRES: {}, LANGUAGES: {}, COMPANIES: {}".format(len(un_country),len(un_genre),len(un_language),len(un_companies)))
print("BUDGET MAX: ${:,.2f} REVENUE MAX: ${:,.2f} GROSSING MAX: ${:,.2f}".format(moviedata['budget'].max(),moviedata['revenue'].max(),moviedata['gross'].max()))    
print("BUDGET MEAN: ${:,.2f} REVENUE MEAN: ${:,.2f} GROSSING MEAN: ${:,.2f}\n".format(moviedata['budget'].mean(),moviedata['revenue'].mean(),moviedata['gross'].mean()))          
display(moviedata.iloc[0:1])

TOTAL MOVIES: 4803
COUNTRIES: 88, GENRES: 20, LANGUAGES: 62, COMPANIES: 5017
BUDGET MAX: $380,000,000.00 REVENUE MAX: $2,787,965,087.00 GROSSING MAX: $2,550,965,087.00
BUDGET MEAN: $29,045,039.88 REVENUE MEAN: $82,260,638.65 GROSSING MEAN: $53,215,598.78

There are 4,803 movies, 20 genres, 88 countries, 62 languages, and 5,017 companies.
The highest budget is \$380,000,000.00. <br>The highest revenue is \$2,787,965,087.00.
The highest grossing movie was \$2,550,965,087.00 (Avatar).

Movies and genres without budget or revenue data needed to be removed from the analysis:

# remove data without budget/revenue
moviedata = moviedata.loc[(moviedata['budget'] != 0) & (moviedata['revenue'] != 0)]
if 'Foreign' in un_genre:
    un_genre.remove('Foreign')
    
if 'TV Movie' in un_genre:
    un_genre.remove('TV Movie')

Looped through genre list to create genre columns with boolean values:

# creates temp columns for True/False check of each genre and deletes main genre column
for column in un_genre:
    moviedata[column] = moviedata.genres.apply(lambda value: column in value)    
del moviedata['genres']

# creates new dataframes for each genre and puts into dict
genres = {}

for column in un_genre:
    df = moviedata[moviedata[column] == True]
    genres[column] = df.drop(un_genre, axis=1, inplace=False)

Part 2: Genre Summary Statistics¶

Sort genres into new dataframes and calculate the average, count, standard deviation of numerical data:

# creates a new dataframe based on genres
genrestats = {}
intcolumns = ['budget','revenue','gross','vote_average']
for col in intcolumns:
    poplist = []
    for genre in un_genre:
        poplist.append(moviedata[genre] == True)

    avglist = []
    devlist = []
    cntlist = []
    for ser in poplist:
        avglist.append(moviedata.loc[ser, col].mean())
        devlist.append(moviedata.loc[ser, col].std())
        cntlist.append(moviedata.loc[ser, col].count())
        
    genrestats[col+'_avg'] = avglist
    genrestats[col+'_std'] = devlist
    genrestats[col+'_cnt'] = cntlist

Pie Chart of Genres and Genre DF Desription

dfGenre = pd.DataFrame(genrestats, index=un_genre)

# create color range
NUM_COLORS = 22
cm = plt.get_cmap('Blues_r')
colors = [cm(.7*i/NUM_COLORS) for i in range(NUM_COLORS)][4:22]

# Plot
fig=plt.figure(figsize=(7,7))
patches, texts = plt.pie(dfGenre['gross_cnt'].sort_values(), labels=dfGenre['gross_cnt'].sort_values().index, colors=colors, 
                                    rotatelabels=True,startangle=140,labeldistance=1.01)
[text.set_fontsize(index/3 + 7) for index, text in enumerate(texts)]

fig.suptitle('Pie Chart of Movie Count Per Genre', fontsize=17, y=.95)
plt.show()
display(dfGenre,dfGenre.describe())

Counts for each genre are the same for budget, revenue, and gross so there does not appear to be immediate issues with null values.

Answer to Q1: Sample sizes vary significantly between genres (may reflect the overall industry or just a result of scraped data). There are significant differences between genres with regards to average gross profits but this might also be impacted by average budget.

Part 3: Distribution of Gross Per Genre

A. Histogram

# hist plot of diff genres gross profits
i = 1
fig=plt.figure(figsize=(12,12))

xmin = moviedata['gross'].min()
xmax = moviedata['gross'].max()
for genre in genres:
    plt.subplot(5, 5, i)
    
    i = i + 1
    plt.hist(genres[genre]['gross'],bins=10)
    plt.xlim(xmin,xmax)
    plt.xlabel('gross profits')
    plt.ylabel('frequency (by bins)')
    plt.title(genre)

fig.suptitle('Histograms of Each Genre (Gross)', fontsize=18, y=1.03)
plt.tight_layout()
plt.show()

Answer to Q2: Based on the histogram plots, most genres appear to be left skewed.

B. QQ Gaussian Plot

def plot_qq(series, loc=1,color='royalblue'):
    # creating random normal sampling for qq plot
    norm = np.random.normal(0, 1, series.count())

    # Sorting the values in ascending order.
    norm.sort()
    series = series.sort_values()
    
    # Plotting a genre sample against norm in qqplot.
    plt.subplot(5, 5, loc)
    plt.plot(norm, series, "o", color =color)
    
# --currently not used
def plot_qqneat(series):
    # example qq plot
    a = np.random.normal(5,5,250)
    b = np.random.rayleigh(5,250)

    percs = np.linspace(0,100,21)
    qn_a = np.percentile(a, percs)
    qn_b = np.percentile(b, percs)

    plt.plot(qn_a,qn_b, ls="", marker="o")

    x = np.linspace(np.min((qn_a.min(),qn_b.min())), np.max((qn_a.max(),qn_b.max())))
    plt.plot(x,x, color="k", ls="--")

# qq plots of genres
i = 1
fig=plt.figure(figsize=(10,10))

for genre in genres:
    if genre == 'Animation':
        plot_qq(genres[genre]['gross'],i,'orangered')
    else:
        plot_qq(genres[genre]['gross'],i)
    i = i + 1
    plt.title(genre)

fig.suptitle('QQ Gaussian Plots Per Genre', fontsize=20, y=1.03)
plt.tight_layout()
plt.show()

Answer to Q2: Exponential curves in QQ plots indicates most genres are left skewed. Animation appears more linear so it could have a normal distribution.

C. Box Plot

def ser_con(genre): return genres[genre]['gross']

plt.figure(figsize=(8,8))
plt.boxplot(list(map(ser_con, un_genre)))
# Tracer()()
plt.xticks(list(range(1,19)), un_genre, rotation=90, fontsize=13)
plt.ylim(moviedata['gross'].min(), moviedata['gross'].max()*.45)
plt.tight_layout()
plt.ylabel('gross profit')
plt.title('Boxplot of Gross Profit Per Genre', fontsize=17,y=1.02)
plt.show()

Answer to Q1: Animation films appear to have the best gross profit between the inner and upper quartile range. Romance, Horror, and Music genres seem to trend towards a positive gross profit albeit their upper quartile ranges are smaller than other high grossing genres. The lower quartile range of Action and Adventure genres seem to show a larger range of financial loss.

Part 4: Comparing Genre Samples¶

Hypothesis 1: There is a difference between gross for Action and Animation genres

A. Descriptive Comparison

# reviewing Action for outliers
print('Action mean: {}'.format(genres['Action']['gross'].mean()))
print('Action median: {}'.format(genres['Action']['gross'].median()))
print('Action standard deviation: {}'.format(genres['Action']['gross'].std()))
print('Action median absolute deviation: {} \n'.format(genres['Action']['gross'].mad()))

# reviewing Animation for outliers
print('Animation mean: {}'.format(genres['Animation']['gross'].mean()) )
print('Animation median: {}'.format(genres['Animation']['gross'].median()))
print('Animation standard deviation: {}'.format(genres['Animation']['gross'].std()))
print('Animation median absolute deviation: {} \n'.format(genres['Animation']['gross'].mad()) )

Action mean: 114424359.2124183
Action median: 45189732.0
Action standard deviation: 208691170.02048752
Action median absolute deviation: 131949754.11423638 

Animation mean: 198090651.15957448
Animation median: 125484869.5
Animation standard deviation: 239603756.6458052
Animation median absolute deviation: 187096755.07435438

Answer to Q1: When compared to the action genre, the Animation genre has a comparatively large median absolute deviation and will be more inclusive of outliers

B. T-test (Parametric A/B Independent Samples)

#Ex T-Test with scipy for parametric, (equal_var=True because same population)
t, p = ttest_ind(genres['Animation']['gross'], genres['Action']['gross'], equal_var=True)
print('tvalue: {}, pvalue:{}'.format(t,p))

tvalue: 4.87831783017246, pvalue:1.2269553049791973e-06

P-value is less than 5%. Indicates variation between the Animation and Action. T-value is 4.878 (a little low)

# T-Test Raw Calculation for parametric data (assuming different population?)

# Compute the difference between the two sample means.
diff=dfGenre.loc['Animation','gross_avg'] - dfGenre.loc['Action','gross_avg']

# size of samples
size = np.array([dfGenre.loc['Animation','gross_cnt'], dfGenre.loc['Action','gross_cnt']])
# sample distribution
sd = np.array([dfGenre.loc['Animation','gross_std'], dfGenre.loc['Action','gross_std']])

# The squared standard deviations are divided by the sample size and summed, then we take
# the square root of the sum. 
diff_se = (sum(sd ** 2 / size)) ** 0.5

print(diff/diff_se)

4.454278535154209

T-value is 4.45 and is close to scipy ttest results.

C. Mann Whitney U test (2 Non-Normally Distributed Independent Samples)

#Ex U-test with scip for nonparametric
mannwhitneyu(genres['Animation']['gross'], genres['Action']['gross'], use_continuity=True, alternative=None)

MannwhitneyuResult(statistic=66371.5, pvalue=2.9822954316046327e-07)

MannWhitney U test has a p-value that is well below 1%. Indicates likelihood of variability in movie budgets between Action and Crime genres.

D. Mood’s Median test (2+ Non-Normally Distributed Independent Samples)

stat, p, med, tbl = median_test(genres['Action']['gross'],genres['Animation']['gross'],genres['Fantasy']['gross'],genres['Science Fiction']['gross'])
print(stat,med)
print(p)

30.642906869873734 51753202.0
1.0106915946046876e-06

P-value for mood's median test is less than 0.05 for a comparison of Action, Animation, Fantasy, and Science Fiction.

Part 5: Visualization of Budget and Gross Per Genre</span>

Hypothesis 2: Budget and genre correlates with gross.

Hypothesis 3: Animation films have the best gross profit potential depending on budget amount.

A. Scatter Plot

# plot of expense(x) to gross (y)
correlation = {}
i = 1
fig=plt.figure(figsize=(12,12))

xmin = moviedata['budget'].min()
xmax = moviedata['budget'].max()*.3
ymin = moviedata['gross'].min()
ymax = moviedata['gross'].max()*.3

for genre in genres:
    plt.subplot(5, 5, i)
    i = i + 1
    plt.scatter(genres[genre]['budget'],genres[genre]['gross'],.5)
    
    plt.xlim(xmin,xmax)
    plt.ylim(ymin,ymax)
    plt.title(genre)
    plt.xlabel('budget')
    plt.ylabel('gross')
    
    correlation[genre] = (genres[genre]['budget'].corr(genres[genre]['gross']))

fig.suptitle('Scatter Plots Per Genre (Gross vs Budget)', fontsize=18, y=1.03)
plt.tight_layout()
plt.show()

Answer to Q3: The scatter plots show a better visualization of the outliers. A comparatively large number of Western films seem to have lost money. War films appear to have a wider range of profit (and or loss).

B. Correlation Heat Map

corr = np.array((list(correlation.values())))
labels = (np.asarray(["{}\n\n{:.2f}".format(string, value) \
        for string, value in zip(list(correlation.keys()),corr)])).reshape(3, 6)

fig, ax = plt.subplots(figsize=(8,4))
sns.heatmap(corr.reshape(3,6), 
            annot=labels, 
            square=False, 
            ax=ax, 
            fmt="", 
            xticklabels=False, 
            yticklabels=False, 
            cmap="Blues", 
            vmin=.3, 
            vmax=.6)
plt.tight_layout()
plt.title('Correlation of Budget and Gross Profit (Per Genre)')
plt.show()

Answer to Q3: Genres that are dark blue (Action, Fantasy, Science Fiction) have the highest correlation between budget and profit.
Genres that are white (Western, Horror, War, and History) have low correlation between budget and profit.

Part 6: Reflections¶

Further Research:

Can gross profits for a movie be predicted using features like budget and genre?

Model 1: Use a linear regression model on budget and genre to predict gross profits. I would split genres (with sample sizes greater than 500) using train_test_split and plot a linear regression.

Model 2: Use Naive Bayes neural network with a few features from the IDMB set (including genre) to predict gross profits and to determine which features are the most significant. I would clean up the other features in the dataset (language, production studio, and runtime) and run the NBNN to see if box office profits can be determined.

Assumptions:

Movies were randomly sampled from IMDB or all movies were chosen
IMDB budget and revenue data are reliably sourced

Difficulties:

Using the scientific method
Organizing presentation of the data
Financial data (budget/revenue/gross) was left skewed (nonparametric)
Interpreting T-value
Sample sizes for genre varied dramatically
Plotting 20 genres with subplots was logistically difficult
Scatter plots for budget and gross profits are misleading (with regards to the y-axis and diminishing returns)

NOTES

Samples Comparison Test Chart: https://courses.thinkful.com/data-201v1/assignment/5.5.1

Misc Work¶

Check Normality for Each Genre: Vote\_Average¶

A. Histogram

# hist plot of diff genres gross profits
i = 1
fig=plt.figure(figsize=(12,12))

xmin = moviedata['vote_average'].min()
xmax = moviedata['vote_average'].max()
for genre in genres:
    plt.subplot(5, 5, i)
    
    i = i + 1
    plt.hist(genres[genre]['vote_average'],bins=10)
    plt.xlim(xmin,xmax)
    plt.xlabel('Voter_Average')
    plt.ylabel('frequency (by bins)')
    plt.title(genre)

fig.suptitle('Histograms of Each Genre (Vote_Average)', fontsize=18, y=1.03)
plt.tight_layout()
plt.show()

Most Histograms show normality. Some Western and Documentary genres appear to be skewed right.

B. QQ Gaussian Plot

# qq plots of genres for vote_averages
i = 1
fig=plt.figure(figsize=(10,10))

for genre in genres:
    plot_qq(genres[genre]['vote_average'],i)
    i = i + 1
    plt.title(genre)

fig.suptitle('QQ Gaussian Plots Vote_Average Per Genre', fontsize=20, y=1.03)
plt.tight_layout()
plt.show()

QQ plots of genres by vote _average are linear and match the normal sample (with the exception of War and Western)

C. T-test (Parametric A/B Independent Samples) https://courses.thinkful.com/data-201v1/assignment/5.5.1

#Ex T-Test with scipy for parametric, (equal_var=True because same population)
t, p = ttest_ind(genres['Crime']['vote_average'], genres['Action']['vote_average'], equal_var=True)
print('tvalue: {}, pvalue:{}'.format(t,p))

tvalue: 6.671594491458395, pvalue:3.6009251585734e-11

# T-Test Raw Calculation for parametric data

# Compute the difference between the two sample means.
diff=dfGenre.loc['Crime','vote_average_avg'] - dfGenre.loc['Action','vote_average_avg']

# size of samples
size = np.array([dfGenre.loc['Crime','vote_average_cnt'], dfGenre.loc['Action','vote_average_cnt']])
# sample distribution
sd = np.array([dfGenre.loc['Crime','vote_average_std'], dfGenre.loc['Action','vote_average_std']])

# The squared standard deviations are divided by the sample size and summed, then we take
# the square root of the sum. 
diff_se = (sum(sd ** 2 / size)) ** 0.5

print(diff/diff_se)

6.761172745982536

T-value is 6.71 and p-value is less than 5%. Indicates variation between the Crime and Action.

	budget_avg	budget_cnt	budget_std	gross_avg	gross_cnt	gross_std	revenue_avg	revenue_cnt	revenue_std	vote_average_avg	vote_average_cnt	vote_average_std
Animation	8.082671e+07	188	5.291246e+07	1.980907e+08	188	2.396038e+08	2.789174e+08	188	2.670186e+08	6.446277	188	0.896659
Comedy	3.615545e+07	1110	3.503618e+07	7.232391e+07	1110	1.244804e+08	1.084794e+08	1110	1.444405e+08	6.080631	1110	0.833709
Romance	2.829866e+07	574	2.749864e+07	6.426381e+07	574	1.266034e+08	9.256247e+07	574	1.403881e+08	6.351916	574	0.804387
Action	6.239055e+07	918	5.628230e+07	1.144244e+08	918	2.086912e+08	1.768149e+08	918	2.463544e+08	6.129194	918	0.847626
Thriller	4.095393e+07	935	3.892819e+07	6.923375e+07	935	1.334463e+08	1.101877e+08	935	1.578563e+08	6.206845	935	0.801762
Family	6.824175e+07	365	5.294374e+07	1.575881e+08	365	2.198945e+08	2.258299e+08	365	2.499535e+08	6.185479	365	0.900569
Science Fiction	6.281405e+07	431	5.898003e+07	1.259568e+08	431	2.416463e+08	1.887708e+08	431	2.792290e+08	6.163109	431	0.908440
Fantasy	7.564092e+07	342	6.354801e+07	1.626755e+08	342	2.575484e+08	2.383164e+08	342	2.981470e+08	6.167836	342	0.902954
Horror	2.055763e+07	332	2.355402e+07	4.716959e+07	332	6.874178e+07	6.772722e+07	332	7.839788e+07	5.902410	332	0.823151
Music	2.396980e+07	111	2.209262e+07	5.509401e+07	111	8.818269e+07	7.906381e+07	111	9.800218e+07	6.526126	111	0.718675
Documentary	5.422902e+06	38	7.435102e+06	2.040826e+07	38	3.394283e+07	2.583116e+07	38	3.607979e+07	6.842105	38	0.753597
Western	3.548164e+07	57	4.968987e+07	3.029081e+07	57	9.920226e+07	6.577245e+07	57	1.128346e+08	6.671930	57	0.890175
Crime	3.475472e+07	521	3.295340e+07	5.314782e+07	521	9.873133e+07	8.790254e+07	521	1.204301e+08	6.434165	521	0.807622
History	3.775487e+07	145	3.495982e+07	4.016241e+07	145	8.023771e+07	7.791728e+07	145	9.174953e+07	6.800690	145	0.661070
Adventure	7.689239e+07	661	6.277496e+07	1.707142e+08	661	2.566121e+08	2.476065e+08	661	2.953000e+08	6.235250	661	0.862386
War	4.098456e+07	120	3.746079e+07	5.979416e+07	120	1.087265e+08	1.007787e+08	120	1.224240e+08	6.792500	120	0.786991
Mystery	3.876578e+07	265	3.420625e+07	6.330812e+07	265	1.150279e+08	1.020739e+08	265	1.340032e+08	6.364151	265	0.776768
Drama	3.002485e+07	1441	3.167900e+07	5.194328e+07	1441	1.111584e+08	8.196813e+07	1441	1.282132e+08	6.593546	1441	0.812451

	budget_avg	budget_cnt	budget_std	gross_avg	gross_cnt	gross_std	revenue_avg	revenue_cnt	revenue_std	vote_average_avg	vote_average_cnt	vote_average_std
count	1.800000e+01	18.000000	1.800000e+01	1.800000e+01	18.000000	1.800000e+01	1.800000e+01	18.000000	1.800000e+01	18.000000	18.000000	18.000000
mean	4.444062e+07	475.222222	4.016308e+07	8.647719e+07	475.222222	1.451377e+08	1.309178e+08	475.222222	1.667123e+08	6.383009	475.222222	0.821611
std	2.155805e+07	398.711093	1.561344e+07	5.395883e+07	398.711093	7.159614e+07	7.463068e+07	398.711093	8.254546e+07	0.274666	398.711093	0.067904
min	5.422902e+06	38.000000	7.435102e+06	2.040826e+07	38.000000	3.394283e+07	2.583116e+07	38.000000	3.607979e+07	5.902410	38.000000	0.661070
25%	3.120732e+07	155.750000	3.199760e+07	5.224442e+07	155.750000	9.884906e+07	7.978989e+07	155.750000	1.147335e+08	6.172247	155.750000	0.790684
50%	3.826033e+07	353.500000	3.624848e+07	6.378596e+07	353.500000	1.197541e+08	1.014263e+08	353.500000	1.371957e+08	6.358034	353.500000	0.817801
75%	6.270817e+07	639.250000	5.293592e+07	1.230737e+08	639.250000	2.170936e+08	1.857819e+08	639.250000	2.490537e+08	6.576691	639.250000	0.883228
max	8.082671e+07	1441.000000	6.354801e+07	1.980907e+08	1441.000000	2.575484e+08	2.789174e+08	1441.000000	2.981470e+08	6.842105	1441.000000	0.908440