Author: Ryan Harper
Data Source: https://www.kaggle.com/tmdb/tmdb-movie-metadata/data
Original Source: https://www.themoviedb.org
The data set in this notebook contains information about movies listed on TMDB (The Movie Database). It includes categorical variables (genre, language, country), discrete variables (ratings), and continuous variables (budget and revenue). The data was pulled from www.themoviedb.org and posted on www.kaggle.com. This analysis looks at each movie's genre, budget, and gross profit to help investors decide which movies are more likely to be financially successful at the box office.
Questions:
Question 1: Do movie genres vary from one another? In what ways?
Question 2: What is the distribution of gross per genre?
Question 3: Is there a relationship between budget and gross profits per genre?
Hypotheses:
Hypothesis 1: There is a difference between Action and Animation genre samples for gross profits.
Hypothesis 2: Budget and genre correlate with gross.
Hypothesis 3: Animation films have the best gross profit potential depending on budget.
Answers:
Answer 1: Sample sizes vary significantly between genres (this may reflect the overall industry or may just be a result of how the data was scraped). There are significant differences between genres with regard to average gross profits, but this might also be driven by average budget. Animation films appear to have the best gross profit across the interquartile range and the upper quartile. Romance, Horror, and Music genres seem to trend towards a positive gross profit, although their upper quartile ranges are smaller than those of other high-grossing genres. The lower quartile ranges of the Action and Adventure genres seem to show a larger range of financial loss.
Answer 2: Based on the histogram plots, most genres appear to be right skewed: most films cluster at the low end and a long tail extends toward high gross profits. The exponential-looking curves in the QQ plots point the same way. Animation appears more linear, so it could be closer to a normal distribution.
Answer 3: Action, Fantasy, and Science Fiction genres have the highest correlation between budget and profit. Western, Horror, War, and History genres have a low correlation between budget and profit.
# import libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import calendar
import seaborn as sns
from matplotlib.pyplot import subplots, show
import ast
import math
from scipy.stats import ttest_ind, mannwhitneyu, median_test
import missingno as msno #Missing data visualization module for Python.
from IPython.display import display
%matplotlib inline
# change optional settings for libraries
sns.set_style("whitegrid")
pd.set_option('show_dimensions', False)
import warnings
warnings.filterwarnings('ignore')
Created 'gross' column and deleted unused columns:
# import data, add gross profit column, and delete homepage/overview columns
moviedata = pd.read_csv('../data/tmdb_5000_movies.csv', encoding = "ISO-8859-1")
moviedata['gross'] = moviedata['revenue'] - moviedata['budget']
del moviedata['homepage']
del moviedata['overview']
del moviedata['original_title']
Values in some columns printed out the data as a dictionary within a string:
#rows include values that are strings with dictionaries inside
print(moviedata.iloc[1]['genres'])
print(moviedata.iloc[1]['production_countries'])
print(moviedata.iloc[1]['production_companies'])
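To make the format concrete, here is a minimal sketch of parsing one such value. The sample string is a made-up stand-in shaped like an entry in the `genres` column, not a row copied from the dataset.

```python
import ast

# Hypothetical value shaped like an entry in the 'genres' column
sample = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

# literal_eval turns the string into a list of dicts without running arbitrary code
parsed = ast.literal_eval(sample)

# Keep only the 'name' field of each dict
names = [entry['name'] for entry in parsed if 'name' in entry]
print(names)
```

The same idea, applied to every row, is what reduces these columns to readable comma-separated names.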
Functions to interpret data embedded in the dict-like strings:
def find_set(series):
    # return the set of unique 'name' values found in a column of dict-like strings
    unique = []
    for raw in series:
        for fulldict in ast.literal_eval(raw):
            # The key for the dict values is 'name'
            if 'name' in fulldict:
                unique.append(fulldict['name'])
    return set(unique)

def name_fix(seriesvalue):
    # convert a string holding a list of dicts into a comma-separated string of names
    try:
        names = []
        for valuedict in ast.literal_eval(seriesvalue):
            if 'name' in valuedict:
                names.append(valuedict['name'])
        return ', '.join(names)
    except (SyntaxError, ValueError, TypeError):
        # not a list of dicts (plain text, numbers, NaN): return the value unchanged
        return seriesvalue
# creates list of unique values for specific columns with string data
un_genre = list(find_set(moviedata.genres))
un_country = list(find_set(moviedata.production_countries))
un_language = list(find_set(moviedata.spoken_languages))
un_keywords = list(find_set(moviedata.keywords))
un_companies = list(find_set(moviedata.production_companies))
# runs the name fix function to make the dataset easier to read
for column in moviedata:
    moviedata[column] = moviedata[column].apply(name_fix)
# check for missing data
msno.matrix(moviedata,color=(.1, .6, .8))
Data Summary:
# reporting of data
print("TOTAL MOVIES: {}".format(moviedata['id'].count()))
print("COUNTRIES: {}, GENRES: {}, LANGUAGES: {}, COMPANIES: {}".format(len(un_country),len(un_genre),len(un_language),len(un_companies)))
print("BUDGET MAX: ${:,.2f} REVENUE MAX: ${:,.2f} GROSSING MAX: ${:,.2f}".format(moviedata['budget'].max(),moviedata['revenue'].max(),moviedata['gross'].max()))
print("BUDGET MEAN: ${:,.2f} REVENUE MEAN: ${:,.2f} GROSSING MEAN: ${:,.2f}\n".format(moviedata['budget'].mean(),moviedata['revenue'].mean(),moviedata['gross'].mean()))
display(moviedata.iloc[0:1])
There are 4,803 movies, 20 genres, 88 countries, 62 languages, and 5,017 companies.
The highest budget is \$380,000,000.00. <br>The highest revenue is \$2,787,965,087.00.
The highest grossing movie was \$2,550,965,087.00 (Avatar).
Movies and genres without budget or revenue data needed to be removed from the analysis:
# remove data without budget/revenue
moviedata = moviedata.loc[(moviedata['budget'] != 0) & (moviedata['revenue'] != 0)].copy()
if 'Foreign' in un_genre:
    un_genre.remove('Foreign')
if 'TV Movie' in un_genre:
    un_genre.remove('TV Movie')
Looped through genre list to create genre columns with boolean values:
# creates temp columns for True/False check of each genre and deletes main genre column
for column in un_genre:
    moviedata[column] = moviedata.genres.apply(lambda value: column in value)
del moviedata['genres']
# creates new dataframes for each genre and puts into dict
genres = {}
for column in un_genre:
    df = moviedata[moviedata[column] == True]
    genres[column] = df.drop(un_genre, axis=1, inplace=False)
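The boolean-column step can be sketched on a toy frame. Everything here (titles, numbers, the `toy_genres` list) is invented for illustration. The sketch also splits the genre string before testing membership, which is an assumption on my part about a safer way to avoid substring matches rather than what the cell above does.

```python
import pandas as pd

# Toy frame; titles and numbers are invented for illustration
toy = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'genres': ['Action, Adventure', 'Animation, Family', 'Action'],
    'gross': [10, 25, -5],
})
toy_genres = ['Action', 'Adventure', 'Animation', 'Family']

# One boolean column per genre, matching whole genre names only
for g in toy_genres:
    toy[g] = toy['genres'].apply(lambda value, g=g: g in value.split(', '))

# One sub-frame per genre, without the helper boolean columns
per_genre = {g: toy.loc[toy[g]].drop(columns=toy_genres) for g in toy_genres}
print(per_genre['Action'][['title', 'gross']])
```

A film tagged with several genres appears in several sub-frames, so the per-genre samples are not disjoint.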
Sort genres into new dataframes and calculate the average, count, standard deviation of numerical data:
# creates a new dataframe based on genres
genrestats = {}
intcolumns = ['budget','revenue','gross','vote_average']
for col in intcolumns:
    poplist = []
    for genre in un_genre:
        poplist.append(moviedata[genre] == True)
    avglist = []
    devlist = []
    cntlist = []
    for ser in poplist:
        avglist.append(moviedata.loc[ser, col].mean())
        devlist.append(moviedata.loc[ser, col].std())
        cntlist.append(moviedata.loc[ser, col].count())
    genrestats[col+'_avg'] = avglist
    genrestats[col+'_std'] = devlist
    genrestats[col+'_cnt'] = cntlist
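The same mean / standard deviation / count summary can be sketched more compactly with `Series.agg`. The toy frame and its values are invented. Only the `gross_avg`, `gross_std`, and `gross_cnt` column names follow the naming used above.

```python
import pandas as pd

# Invented data: one numeric column plus two boolean genre flags
toy = pd.DataFrame({
    'gross': [10.0, 30.0, -5.0, 20.0],
    'Action': [True, True, False, False],
    'Animation': [False, True, True, True],
})

rows = {}
for g in ['Action', 'Animation']:
    # The boolean mask selects the films tagged with this genre
    stats = toy.loc[toy[g], 'gross'].agg(['mean', 'std', 'count'])
    rows[g] = {'gross_avg': stats['mean'],
               'gross_std': stats['std'],
               'gross_cnt': stats['count']}

toy_stats = pd.DataFrame.from_dict(rows, orient='index')
print(toy_stats)
```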
Pie Chart of Genres and Genre DF Description
dfGenre = pd.DataFrame(genrestats, index=un_genre)
# create color range
NUM_COLORS = 22
cm = plt.get_cmap('Blues_r')
colors = [cm(.7*i/NUM_COLORS) for i in range(NUM_COLORS)][4:22]
# Plot
fig=plt.figure(figsize=(7,7))
patches, texts = plt.pie(dfGenre['gross_cnt'].sort_values(), labels=dfGenre['gross_cnt'].sort_values().index, colors=colors,
rotatelabels=True,startangle=140,labeldistance=1.01)
[text.set_fontsize(index/3 + 7) for index, text in enumerate(texts)]
fig.suptitle('Pie Chart of Movie Count Per Genre', fontsize=17, y=.95)
plt.show()
display(dfGenre,dfGenre.describe())
Counts for each genre are the same for budget, revenue, and gross, so there do not appear to be any immediate issues with null values.
Answer to Q1: Sample sizes vary significantly between genres (this may reflect the overall industry or may just be a result of how the data was scraped). There are significant differences between genres with regard to average gross profits, but this might also be driven by average budget.
A. Histogram
# hist plot of diff genres gross profits
i = 1
fig=plt.figure(figsize=(12,12))
xmin = moviedata['gross'].min()
xmax = moviedata['gross'].max()
for genre in genres:
    plt.subplot(5, 5, i)
    i = i + 1
    plt.hist(genres[genre]['gross'],bins=10)
    plt.xlim(xmin,xmax)
    plt.xlabel('gross profits')
    plt.ylabel('frequency (by bins)')
    plt.title(genre)
fig.suptitle('Histograms of Each Genre (Gross)', fontsize=18, y=1.03)
plt.tight_layout()
plt.show()
Answer to Q2: Based on the histogram plots, most genres appear to be right skewed, with most films clustered at the low end and a long tail toward high gross profits.
B. QQ Gaussian Plot
def plot_qq(series, loc=1,color='royalblue'):
    # creating random normal sampling for qq plot
    norm = np.random.normal(0, 1, series.count())
    # Sorting the values in ascending order.
    norm.sort()
    series = series.sort_values()
    # Plotting a genre sample against norm in qqplot.
    plt.subplot(5, 5, loc)
    plt.plot(norm, series, "o", color =color)
# --currently not used
def plot_qqneat(series):
    # example qq plot
    a = np.random.normal(5,5,250)
    b = np.random.rayleigh(5,250)
    percs = np.linspace(0,100,21)
    qn_a = np.percentile(a, percs)
    qn_b = np.percentile(b, percs)
    plt.plot(qn_a,qn_b, ls="", marker="o")
    x = np.linspace(np.min((qn_a.min(),qn_b.min())), np.max((qn_a.max(),qn_b.max())))
    plt.plot(x,x, color="k", ls="--")
# qq plots of genres
i = 1
fig=plt.figure(figsize=(10,10))
for genre in genres:
    if genre == 'Animation':
        plot_qq(genres[genre]['gross'],i,'orangered')
    else:
        plot_qq(genres[genre]['gross'],i)
    i = i + 1
    plt.title(genre)
fig.suptitle('QQ Gaussian Plots Per Genre', fontsize=20, y=1.03)
plt.tight_layout()
plt.show()
Answer to Q2: The exponential-looking curves in the QQ plots indicate that most genres are right skewed. Animation appears more linear, so it could be closer to a normal distribution.
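As a numerical cross-check on skew direction, here is a small sketch on synthetic data (not the movie data). It assumes `scipy.stats.skew` is available. A long tail toward large values gives positive sample skewness, which is conventionally called right skew, while a roughly symmetric sample gives a value near zero.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)

# Synthetic stand-ins, not the movie data
long_right_tail = rng.lognormal(mean=0.0, sigma=1.0, size=5000)
symmetric = rng.normal(loc=0.0, scale=1.0, size=5000)

# Positive skewness means the long tail points toward large values
print('lognormal skewness:', skew(long_right_tail))
print('normal skewness:', skew(symmetric))
```

Running the same function on a genre's `gross` column would give a sign that can be compared against what the histogram and QQ plot suggest.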
C. Box Plot
def ser_con(genre): return genres[genre]['gross']
plt.figure(figsize=(8,8))
plt.boxplot(list(map(ser_con, un_genre)))
plt.xticks(list(range(1, len(un_genre) + 1)), un_genre, rotation=90, fontsize=13)
plt.ylim(moviedata['gross'].min(), moviedata['gross'].max()*.45)
plt.tight_layout()
plt.ylabel('gross profit')
plt.title('Boxplot of Gross Profit Per Genre', fontsize=17,y=1.02)
plt.show()
Answer to Q1: Animation films appear to have the best gross profit across the interquartile range and the upper quartile. Romance, Horror, and Music genres seem to trend towards a positive gross profit, although their upper quartile ranges are smaller than those of other high-grossing genres. The lower quartile ranges of the Action and Adventure genres seem to show a larger range of financial loss.
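For reference, the quantities a box plot draws can be computed directly. This is a sketch on invented numbers. The 1.5 × IQR fence rule is matplotlib's default whisker setting, and I am assuming the plot above used that default.

```python
import numpy as np

# Invented gross profits for a single genre
gross = np.array([-20, -5, 0, 10, 15, 40, 60, 120, 300], dtype=float)

# Box edges and the line inside the box
q1, median, q3 = np.percentile(gross, [25, 50, 75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are drawn as individual outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = gross[(gross < lower_fence) | (gross > upper_fence)]
print(q1, median, q3, iqr, outliers)
```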
Hypothesis 1: There is a difference between gross for Action and Animation genres
A. Descriptive Comparison
def mean_abs_dev(series):
    # mean absolute deviation; this is what Series.mad() computed before it was removed in pandas 2.0
    return (series - series.mean()).abs().mean()
# reviewing Action for outliers
print('Action mean: {}'.format(genres['Action']['gross'].mean()))
print('Action median: {}'.format(genres['Action']['gross'].median()))
print('Action standard deviation: {}'.format(genres['Action']['gross'].std()))
print('Action mean absolute deviation: {} \n'.format(mean_abs_dev(genres['Action']['gross'])))
# reviewing Animation for outliers
print('Animation mean: {}'.format(genres['Animation']['gross'].mean()))
print('Animation median: {}'.format(genres['Animation']['gross'].median()))
print('Animation standard deviation: {}'.format(genres['Animation']['gross'].std()))
print('Animation mean absolute deviation: {} \n'.format(mean_abs_dev(genres['Animation']['gross'])))
Answer to Q1: Compared with the Action genre, the Animation genre has a comparatively large mean absolute deviation, which suggests a wider spread that is more influenced by outliers.
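One caution on terminology: pandas' `Series.mad()` computed the mean absolute deviation, not the median absolute deviation, and the two can differ a lot when outliers are present. A small sketch with invented numbers shows the gap.

```python
import numpy as np

# Invented sample with one large outlier
values = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Mean absolute deviation: average distance from the mean (pulled up by the outlier)
mean_abs_dev = np.mean(np.abs(values - values.mean()))

# Median absolute deviation: median distance from the median (robust to the outlier)
median_abs_dev = np.median(np.abs(values - np.median(values)))

print(mean_abs_dev, median_abs_dev)
```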
B. T-test (Parametric A/B Independent Samples)
# t-test with scipy for parametric data (equal_var=True assumes both genres share one variance)
t, p = ttest_ind(genres['Animation']['gross'], genres['Action']['gross'], equal_var=True)
print('tvalue: {}, pvalue:{}'.format(t,p))
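The p-value is below 0.05, which indicates a difference in mean gross profit between Animation and Action. The t-value is 4.878, well beyond the roughly ±1.96 critical value for a two-sided test at the 5% level with samples this large.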
# Manual t-statistic with an unpooled standard error (Welch's form, unequal variances)
# Compute the difference between the two sample means.
diff=dfGenre.loc['Animation','gross_avg'] - dfGenre.loc['Action','gross_avg']
# size of samples
size = np.array([dfGenre.loc['Animation','gross_cnt'], dfGenre.loc['Action','gross_cnt']])
# sample standard deviations
sd = np.array([dfGenre.loc['Animation','gross_std'], dfGenre.loc['Action','gross_std']])
# The squared standard deviations are divided by the sample size and summed, then we take
# the square root of the sum.
diff_se = (sum(sd ** 2 / size)) ** 0.5
print(diff/diff_se)
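T-value is 4.45, which is close to the scipy ttest result. The two differ because this calculation uses an unpooled standard error, while ttest_ind with equal_var=True pools the variances.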
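To see where the gap between the two t-values comes from, here is a sketch on synthetic samples (not the genre data). It computes the statistic by hand with an unpooled standard error and compares it with `ttest_ind` under both settings of `equal_var`.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)

# Synthetic samples with different sizes and variances
a = rng.normal(loc=5.0, scale=2.0, size=200)
b = rng.normal(loc=4.0, scale=4.0, size=800)

# Manual statistic: difference in means over the unpooled standard error
diff = a.mean() - b.mean()
se = np.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
t_manual = diff / se

# Welch's test uses the same unpooled standard error
t_welch, p_welch = ttest_ind(a, b, equal_var=False)
# Student's test pools the two variances instead
t_pooled, p_pooled = ttest_ind(a, b, equal_var=True)

print(t_manual, t_welch, t_pooled)
```

The manual value matches the `equal_var=False` result, so passing `equal_var=False` is the scipy call that corresponds to the hand calculation.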
C. Mann Whitney U test (2 Non-Normally Distributed Independent Samples)
# U-test with scipy for nonparametric data
mannwhitneyu(genres['Animation']['gross'], genres['Action']['gross'], use_continuity=True, alternative='two-sided')
The Mann-Whitney U test has a p-value that is well below 1%, which indicates a difference in gross profit between the Action and Animation genres.
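A minimal sketch of the same test on synthetic right-skewed samples (not the genre data) is shown here. It passes `alternative='two-sided'` explicitly, since recent SciPy versions no longer accept `alternative=None`.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(2)

# Two synthetic right-skewed samples; x is shifted upward on the log scale
x = rng.lognormal(mean=1.0, sigma=1.0, size=300)
y = rng.lognormal(mean=0.0, sigma=1.0, size=300)

# Rank-based test, so no normality assumption is needed
u_stat, p_value = mannwhitneyu(x, y, use_continuity=True, alternative='two-sided')
print(u_stat, p_value)
```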
D. Mood’s Median test (2+ Non-Normally Distributed Independent Samples)
stat, p, med, tbl = median_test(genres['Action']['gross'],genres['Animation']['gross'],genres['Fantasy']['gross'],genres['Science Fiction']['gross'])
print(stat,med)
print(p)
The p-value for Mood's median test is less than 0.05 for a comparison of Action, Animation, Fantasy, and Science Fiction, which suggests that at least one of these genres has a different median gross profit.
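The four return values of `median_test` are easier to read on a small synthetic case. In this sketch (invented groups, not genres) the contingency table counts, per group, how many values fall above and at or below the grand median.

```python
import numpy as np
from scipy.stats import median_test

rng = np.random.default_rng(3)

# Three synthetic groups; the third has a clearly higher center
g1 = rng.normal(0.0, 1.0, 200)
g2 = rng.normal(0.0, 1.0, 200)
g3 = rng.normal(1.5, 1.0, 200)

stat, p, grand_median, table = median_test(g1, g2, g3)

# Row 0: counts above the grand median, row 1: counts at or below it
print(stat, p, grand_median)
print(table)
```

A small p-value only says the medians are not all equal. It does not say which group differs.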
Hypothesis 2: Budget and genre correlate with gross.
Hypothesis 3: Animation films have the best gross profit potential depending on budget amount.
A. Scatter Plot
# plot of expense(x) to gross (y)
correlation = {}
i = 1
fig=plt.figure(figsize=(12,12))
xmin = moviedata['budget'].min()
xmax = moviedata['budget'].max()*.3
ymin = moviedata['gross'].min()
ymax = moviedata['gross'].max()*.3
for genre in genres:
    plt.subplot(5, 5, i)
    i = i + 1
    plt.scatter(genres[genre]['budget'],genres[genre]['gross'],.5)
    plt.xlim(xmin,xmax)
    plt.ylim(ymin,ymax)
    plt.title(genre)
    plt.xlabel('budget')
    plt.ylabel('gross')
    correlation[genre] = (genres[genre]['budget'].corr(genres[genre]['gross']))
fig.suptitle('Scatter Plots Per Genre (Gross vs Budget)', fontsize=18, y=1.03)
plt.tight_layout()
plt.show()
Answer to Q3: The scatter plots give a better view of the outliers. A comparatively large number of Western films seem to have lost money. War films appear to have a wider range of profit (and/or loss).
B. Correlation Heat Map
corr = np.array((list(correlation.values())))
labels = (np.asarray(["{}\n\n{:.2f}".format(string, value) \
for string, value in zip(list(correlation.keys()),corr)])).reshape(3, 6)
fig, ax = plt.subplots(figsize=(8,4))
sns.heatmap(corr.reshape(3,6),
annot=labels,
square=False,
ax=ax,
fmt="",
xticklabels=False,
yticklabels=False,
cmap="Blues",
vmin=.3,
vmax=.6)
plt.tight_layout()
plt.title('Correlation of Budget and Gross Profit (Per Genre)')
plt.show()
Answer to Q3: Genres that are dark blue (Action, Fantasy, Science Fiction) have the highest correlation between budget and profit.
Genres that are white (Western, Horror, War, and History) have low correlation between budget and profit.
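The values in the heat map are Pearson correlation coefficients, which is what `Series.corr` returns by default. As a sketch on invented budget and gross numbers, the same coefficient can be computed by hand as the covariance divided by the product of the standard deviations.

```python
import numpy as np
import pandas as pd

# Invented numbers for a single genre
toy = pd.DataFrame({
    'budget': [10.0, 20.0, 30.0, 40.0, 50.0],
    'gross': [5.0, 30.0, 20.0, 60.0, 55.0],
})

# pandas default is the Pearson coefficient
r_pandas = toy['budget'].corr(toy['gross'])

# Manual version: sum of cross deviations over the root of the summed squares
dx = toy['budget'] - toy['budget'].mean()
dy = toy['gross'] - toy['gross'].mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

print(r_pandas, r_manual)
```

Because gross is defined as revenue minus budget, budget appears on both sides of this correlation, which is worth keeping in mind when reading the coefficients.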
Further Research:
Can gross profits for a movie be predicted using features like budget and genre?
Model 1: Use a linear regression model on budget and genre to predict gross profits. I would split genres (with sample sizes greater than 500) using train_test_split and plot a linear regression.
Model 2: Use Naive Bayes neural network with a few features from the TMDB set (including genre) to predict gross profits and to determine which features are the most significant. I would clean up the other features in the dataset (language, production studio, and runtime) and run the NBNN to see if box office profits can be determined.
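As a rough sketch of what Model 1 could look like, the block below fits an ordinary least squares model on budget plus one genre flag. All data is synthetic, the coefficients used to generate it are invented, and the split is a plain shuffled index split with NumPy rather than `train_test_split`, so treat it as an outline of the idea and not as a result.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 400

# Synthetic stand-in data: budget (in millions) and a hypothetical genre flag
budget = rng.uniform(1, 200, n)
is_animation = rng.integers(0, 2, n).astype(float)
gross = 1.5 * budget + 40.0 * is_animation + rng.normal(0, 10, n)

# Design matrix: intercept, budget, genre dummy
X = np.column_stack([np.ones(n), budget, is_animation])

# 75/25 train/test split on a shuffled index
order = rng.permutation(n)
train, test = order[:300], order[300:]

# Least squares fit on the training rows only
coef, *_ = np.linalg.lstsq(X[train], gross[train], rcond=None)
pred = X[test] @ coef

# R^2 on the held-out rows
ss_res = np.sum((gross[test] - pred) ** 2)
ss_tot = np.sum((gross[test] - gross[test].mean()) ** 2)
r2 = 1 - ss_res / ss_tot
print(coef, r2)
```

With the real data, each genre would need its own dummy column, and the skew in gross would probably call for a transformation before fitting.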
Assumptions:
Movies were randomly sampled from TMDB, or all movies were chosen
TMDB budget and revenue data are reliably sourced
Difficulties:
Using the scientific method
Organizing presentation of the data
Financial data (budget/revenue/gross) was right skewed (not normally distributed)
Interpreting T-value
Sample sizes for genre varied dramatically
Plotting 20 genres with subplots was logistically difficult
Scatter plots for budget and gross profits are misleading (with regard to the y-axis and diminishing returns)
NOTES
Samples Comparison Test Chart: https://courses.thinkful.com/data-201v1/assignment/5.5.1
A. Histogram
# hist plot of diff genres vote averages
i = 1
fig=plt.figure(figsize=(12,12))
xmin = moviedata['vote_average'].min()
xmax = moviedata['vote_average'].max()
for genre in genres:
    plt.subplot(5, 5, i)
    i = i + 1
    plt.hist(genres[genre]['vote_average'],bins=10)
    plt.xlim(xmin,xmax)
    plt.xlabel('Vote_Average')
    plt.ylabel('frequency (by bins)')
    plt.title(genre)
fig.suptitle('Histograms of Each Genre (Vote_Average)', fontsize=18, y=1.03)
plt.tight_layout()
plt.show()
Most histograms look approximately normal. The Western and Documentary genres appear to be skewed right.
B. QQ Gaussian Plot
# qq plots of genres for vote_averages
i = 1
fig=plt.figure(figsize=(10,10))
for genre in genres:
    plot_qq(genres[genre]['vote_average'],i)
    i = i + 1
    plt.title(genre)
fig.suptitle('QQ Gaussian Plots Vote_Average Per Genre', fontsize=20, y=1.03)
plt.tight_layout()
plt.show()
QQ plots of genres by vote_average are linear and match the normal sample (with the exception of War and Western).
C. T-test (Parametric A/B Independent Samples) https://courses.thinkful.com/data-201v1/assignment/5.5.1
# t-test with scipy for parametric data (equal_var=True assumes both genres share one variance)
t, p = ttest_ind(genres['Crime']['vote_average'], genres['Action']['vote_average'], equal_var=True)
print('tvalue: {}, pvalue:{}'.format(t,p))
# T-Test Raw Calculation for parametric data
# Compute the difference between the two sample means.
diff=dfGenre.loc['Crime','vote_average_avg'] - dfGenre.loc['Action','vote_average_avg']
# size of samples
size = np.array([dfGenre.loc['Crime','vote_average_cnt'], dfGenre.loc['Action','vote_average_cnt']])
# sample standard deviations
sd = np.array([dfGenre.loc['Crime','vote_average_std'], dfGenre.loc['Action','vote_average_std']])
# The squared standard deviations are divided by the sample size and summed, then we take
# the square root of the sum.
diff_se = (sum(sd ** 2 / size)) ** 0.5
print(diff/diff_se)
T-value is 6.71 and the p-value is less than 5%, which indicates a difference in mean vote_average between the Crime and Action genres.