I am learning Python and data science so that I can work with larger data sets and analyze them to support better-informed, more timely operating and policy decisions in finance and economics.
As a student enrolled in Flatiron School's Data Science Bootcamp, I am enjoying learning these new skills and am already finding value in their application to solve real-world questions and challenges.
My first project examines the movie industry for a client that is interested in entering the business. Descriptive data analysis shows that the movie industry is profitable, but there is significant variation in performance across films and production studios. The client can use this analysis to understand key trends in the movie industry, identify its main competitors, and decide which types of films to make. This analysis also serves as a baseline for deeper dives on the movie industry.
The project explored more than ten zipped, movie-related data sets from four sources. The files cover the last 20 years and include individual movies' box office revenues, budgets, and genres, as well as information about production studios and associated cast and crew members.
The project applies exploratory data analysis and examines trends in key metrics over time, which gives an overview of how the movie industry's performance has evolved. Several data files contain information that complements the other files, while some data is duplicated, so the data sets required extensive cleaning and joining. I have experience preparing and analyzing data sets, mostly in Excel, but I had never worked with so many large files simultaneously, nor done so in Python. It was fun to tackle this in a new way, and I particularly enjoyed analyzing the data and generating charts that present the results.
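To give a flavor of the cleaning and joining step, here is a minimal sketch in pandas. The two small tables and their column names (`production_budget`, `studio`, `title_key`) are made up for illustration and are not the project's actual files or schema. The idea is simply to normalize the join key, convert currency strings to numbers, drop duplicates, and merge.

```python
import pandas as pd

# Hypothetical stand-ins for two source files; names and values are
# assumptions for illustration only.
budgets = pd.DataFrame({
    "title": ["Movie A ", "movie b", "Movie C", "Movie A "],
    "production_budget": ["$100,000,000", "$20,000,000", "$5,000,000", "$100,000,000"],
    "worldwide_gross": ["$350,000,000", "$15,000,000", "$30,000,000", "$350,000,000"],
})
studios = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie D"],
    "studio": ["Studio X", "Studio Y", "Studio Z"],
})


def clean_title(s: pd.Series) -> pd.Series:
    """Normalize titles so the same film matches across files."""
    return s.str.strip().str.lower()


def money_to_float(s: pd.Series) -> pd.Series:
    """Turn strings such as '$1,000' into floats."""
    return s.str.replace(r"[$,]", "", regex=True).astype(float)


# Clean the budget table and drop rows that repeat the same film.
budgets = budgets.assign(
    title_key=clean_title(budgets["title"]),
    production_budget=money_to_float(budgets["production_budget"]),
    worldwide_gross=money_to_float(budgets["worldwide_gross"]),
).drop_duplicates(subset="title_key")

studios = studios.assign(title_key=clean_title(studios["title"]))

# A left join keeps every film with budget data and adds the studio where known.
movies = budgets.merge(studios[["title_key", "studio"]], on="title_key", how="left")
print(movies)
```

A left join is used here so that films without a studio match are kept rather than silently dropped; whether that is the right choice depends on the question being asked.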
One of the first findings is that there are large outliers in the data and there is notable variation across movies for most indicators. Given these characteristics, I wanted to explore the gross revenue and ROI of the top grossing movies (i.e., those in the 99th percentile). A reasonable hypothesis is that the movies with the highest gross revenue would be among the ones with the highest ROI. This is not the case. To show this cleanly, I plotted the top movies' gross revenue and ROI on two different axes.
The chart shows that top-grossing movies are profitable, but they do not necessarily have the highest ROI. Blockbuster box office performance does not translate into higher ROI, which implies that cost control is an important determinant of the bottom line.

    import matplotlib.pyplot as plt
    import numpy as np

    # Create figure
    fig, ax1 = plt.subplots(figsize=(14, 9))

    # Assign chart variables
    title = df['title']
    ww_gross = df['worldwide_gross_m']
    ROIp = df['ROIpct']

    # Create a second (right) y-axis that shares the same x-axis
    ax2 = ax1.twinx()

    # Standard bar chart of gross revenue on the left y-axis
    ax1.bar(title, ww_gross, color='lightsteelblue')

    # ROI on the right y-axis, drawn as '.' markers with no connecting line
    ax2.plot(title, ROIp, marker='.', markersize=12, color='navy', linestyle='None')

    # X-axis label formatting: rotate the movie titles
    ax1.tick_params(axis='x', labelrotation=90)

    # Y-axis label formatting: set labels and label colors
    ax1.set_ylabel('Gross Revenue (Millions $)', color='gray')
    ax2.set_ylabel('ROI (Percent)', color='navy')

    # Y-axis tick marks: set min, max, interval
    ax1.set_yticks(np.arange(0, 3250, 250))
    ax2.set_yticks(np.arange(0, 3250, 250))

    plt.show()
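The chart code assumes that `df` already contains an `ROIpct` column and only the top-grossing films. A minimal sketch of how that preparation might look is below. The ROI definition (profit as a percent of budget), the `budget_m` column name, and the toy figures are my assumptions for illustration. The toy example uses the 80th percentile because it has only five rows, whereas the analysis used the 99th.

```python
import pandas as pd


def add_roi(df, budget_col="budget_m", gross_col="worldwide_gross_m"):
    """Add ROI as a percent of budget: (gross - budget) / budget * 100."""
    out = df.copy()
    out["ROIpct"] = (out[gross_col] - out[budget_col]) / out[budget_col] * 100
    return out


def top_grossing(df, q=0.99, gross_col="worldwide_gross_m"):
    """Keep films at or above the q-th quantile of gross revenue."""
    cutoff = df[gross_col].quantile(q)
    return df[df[gross_col] >= cutoff].sort_values(gross_col, ascending=False)


# Toy data in millions of dollars; values are invented.
movies = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E"],
    "budget_m": [200.0, 150.0, 10.0, 1.0, 50.0],
    "worldwide_gross_m": [1000.0, 300.0, 60.0, 20.0, 40.0],
})

movies = add_roi(movies)
top = top_grossing(movies, q=0.8)
best_roi_title = movies.loc[movies["ROIpct"].idxmax(), "title"]

print(top[["title", "worldwide_gross_m", "ROIpct"]])
print("Highest ROI:", best_roi_title)
```

Even in this toy data, the top grosser (A) is not the film with the highest ROI (D), which is the same pattern the chart reveals.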
Another valuable takeaway from the data is the share of movies that are profitable versus unprofitable. I was curious to see what percentage of movies fall within given profitability ranges, so I set out to make a stacked percentage chart. This analysis shows that the movie industry is a profitable but challenging business. Forty percent of movies generate a healthy return on investment exceeding 100%, while 25% generate positive but lower returns below 100%. Notably, 35% of movies lose money.
After some initial troubleshooting on my code, I approached this as follows:
    import matplotlib.pyplot as plt
    import numpy as np

    # Prepare data for the stacked 100% bar chart: count movies by year and
    # ROI bucket, then convert each count to a percent of that year's total
    df = (df_roi.groupby(['year', 'ROI_buckets'])['ROI_buckets'].count()
          / df_roi.groupby(['year'])['ROI_buckets'].count()) * 100

    # Set color map and select number of colors from color map
    viridis = plt.get_cmap('viridis', 9)

    # Create stacked bar chart
    ax = df.unstack().plot.bar(stacked=True, figsize=(14, 10), color=viridis.colors)

    # Set title, x-label, y-label
    ax.set_title('ROI - Movies in 90th percentile', fontsize=18)
    ax.set_xlabel('Year', fontsize=14)
    ax.set_ylabel('Percent of movies (%)', fontsize=14)

    # Set y-axis ticks: min, max, interval
    ax.yaxis.set_ticks(np.arange(0, 110, 10))

    # Show tick labels on the right side as well
    ax.tick_params(labeltop=False, labelright=True)

    # Reverse legend order and set legend location
    handles, labels = ax.get_legend_handles_labels()
    ax.legend(reversed(handles), reversed(labels), loc='center left', bbox_to_anchor=(1.05, 0.5))

    plt.show()
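The `ROI_buckets` column used above has to be created first. One way to do that is with `pd.cut`, sketched below. The three bucket edges and labels are assumptions chosen to match the ranges discussed in the text (losses, returns up to 100%, returns above 100%), and the toy rows are invented. The actual chart may have used more buckets. The sketch also uses `pd.crosstab` with `normalize="index"` as an alternative way to get each year's percentages.

```python
import numpy as np
import pandas as pd

# Toy data; values are invented for illustration.
df_roi = pd.DataFrame({
    "year": [2010, 2010, 2010, 2010, 2011, 2011],
    "ROIpct": [-50.0, 20.0, 150.0, 300.0, -10.0, 250.0],
})

# Assumed bucket edges. Intervals are closed on the right, so an ROI of
# exactly 0 falls into "Loss" and exactly 100 into "0-100%".
bins = [-np.inf, 0, 100, np.inf]
labels = ["Loss", "0-100%", "Over 100%"]
df_roi["ROI_buckets"] = pd.cut(df_roi["ROIpct"], bins=bins, labels=labels)

# Share of each bucket within a year, as percentages that sum to 100 per row.
share = pd.crosstab(df_roi["year"], df_roi["ROI_buckets"], normalize="index") * 100
print(share)
```

The resulting table has one row per year and one column per bucket, so it can be passed straight to `share.plot.bar(stacked=True)` without the `unstack()` step.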
I look forward to continuing to learn and share data science insights. More to come!