

Insights on Data Understanding and Processing Using Pandas From a New Student of Data Science

      

Introduction

      Hello, and thanks for reading my first blog as a student of Data Science at Flatiron School. I am writing this blog today to absorb and share the process I went through to complete my first large project in a new field.
      For my first project, I was given a fun scenario in which Microsoft hired me to perform data analysis and advise them on how to make their new movie studio business successful. Before diving into the project, I was given access to a repository that contained a large amount of unfamiliar and unexplored data. A large percentage of my time was dedicated to familiarizing myself with this data in order to gain useful insights on how to visualize it in meaningful ways. Therefore, a large focus of this blog will be on my process for understanding the data available to me.
      The language used for this project was Python, with a strong emphasis on the Pandas library. Pandas is a powerful tool for managing, exploring, and manipulating large sets of data.

Data Understanding

      Understanding the data made the rest of the project very straightforward, and I am glad I took the time to familiarize myself with every element of every dataset I had at my disposal. Of course, processing and cleaning the data took a fair amount of effort and time, but understanding the relationships between all of the data from the very beginning made this process easier and saved time in the end. Towards the end of the project, having a better understanding of the limits and strengths of the Pandas library proved to be just as useful.
      
      After loading the CSV and other delimited text files, I went right to work previewing all of the data by calling the info() and head() methods in Pandas. By exploring the data this way, I was able to create my own table and schema to visualize all of the data and all of the relationships that exist within it. Creating the tables in markdown by hand in my notebook, and later moving them to a separate markdown file, was crucial to my ability to make meaningful connections between the most important elements of every dataset. Below is an example from the table I created:
[Image: an excerpt from the dataset table I created]
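As a quick illustration of the previewing step that fed into that table, here is a minimal sketch; the file name is a placeholder for one of the project's compressed data files, not necessarily the exact one I used:

```python
import pandas as pd

# File name is a placeholder; pandas reads gzipped CSVs directly
movies_df = pd.read_csv('imdb.title.basics.csv.gz')

# Column names, dtypes, and non-null counts at a glance
movies_df.info()

# A peek at the first few rows to see what the values actually look like
print(movies_df.head())
```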

Spending time writing out each element of the IMDb dataset made it very clear that the set contained relational data, and it was easy to pick out the primary keys of 'tconst' and 'nconst', the unique identifiers IMDb uses for its movie titles and crew names, respectively. Thus, later, when I wanted to process and combine different sets of the data, it was really easy to do so using these keys and the other relationships I understood more fully by forcing myself to organize the data in a way that made the most sense to me.
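To make that concrete, here is a hedged sketch of the kind of join those keys enable; the file names are my assumptions, but 'tconst' is the IMDb title identifier described above:

```python
import pandas as pd

# Hypothetical file names; both IMDb tables share the 'tconst' key
title_basics = pd.read_csv('imdb.title.basics.csv.gz')
title_ratings = pd.read_csv('imdb.title.ratings.csv.gz')

# One merge on the shared primary key combines titles with their ratings
titles_with_ratings = title_basics.merge(title_ratings, on='tconst')
```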

The full table that I created for the data can be found here: [table](https://github.com/ddey117/Microsoft_Movie_Analysis/blob/b498cd22d591b9d5a5ad80e14def8b58b1e8ff15/Dataset_Tables.md)

      After dedicating a large amount of time to getting more personal with the data and really understanding each little piece of the puzzle, I also created a relational schema to serve as a quick overview of everything as I worked through the rest of the project.

[Image: relational schema of the datasets]

Power of Pandas

      While creating this incredibly useful framework and reference for data understanding indeed made my life easier, I also owe a lot of credit to the Pandas library for simplifying the process of manipulating the data further. Learning Pandas for the first time, I was very tempted to use for loops to access information in my dataframes. However, Pandas has a lot of built-in functionality that avoids complicated for loops. In fact, because Pandas was designed to take advantage of the power and speed of vectorization derived from NumPy, it is generally advised to avoid Python iteration when you can accomplish the task more quickly using vectorization. For now, I would just like to show some examples of how Pandas was useful for me personally in the course of completing this project. Below is the documentation for a short program I wrote to clean up some of the data I wanted to use for my project:

[Image: documentation for my data-cleaning function]

Instead of a tedious for loop, the task only required a single line of Pandas code:
[Image: the one-line Pandas solution]
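I can't reproduce my exact line here, but as a hedged stand-in for the pattern, a typical one-line vectorized cleanup might strip currency formatting from a money column; the column name and the cleaning task are my assumptions for illustration, not the original code:

```python
import pandas as pd

# Toy dataframe standing in for the real data (values are made up)
df = pd.DataFrame({'worldwide_gross': ['$1,234,567', '$89,000']})

# The single vectorized line: strip '$' and ',' and cast to float
df['worldwide_gross'] = df['worldwide_gross'].str.replace('[$,]', '', regex=True).astype(float)
```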

That single line of code made cleaning large amounts of data very simple and helped avoid issues later on when analyzing the data.

      Later on, when attempting to combine different tables, I ran into another issue that I was able to solve, again, with very little code. I wanted to join two dataframes on their movie titles. The problem was that in one of the tables some titles contained a date in parentheses, while the same title in the other table did not. Thus, merge conflicts were sure to happen if I did not rectify this issue.
     My first attempt was a very messy for loop that somewhat accomplished the task, but with a lot of errors and hard coding that was crude and hard to follow. After searching Stack Overflow for ways to clean up my loop, I actually found a much better solution: regular expressions!
     Before this project I had no first-hand experience using this powerful tool. However, I found a useful sandbox for exploring and learning regex: regex101.
      Through a little bit of research and some trial and error at regex101.com, I came up with a much more sophisticated solution:
[Image: the regex one-liner]
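As a sketch of the idea, assuming a 'title' column where some entries end in a four-digit year in parentheses (the column name and exact pattern are my assumptions):

```python
import pandas as pd

# Toy titles; one ends in a parenthesized four-digit year
df = pd.DataFrame({'title': ['The Great Movie (2012)', 'Another Film']})

# One vectorized line: strip a trailing '(YYYY)' and surrounding whitespace
df['title'] = df['title'].str.replace(r'\s*\(\d{4}\)\s*$', '', regex=True)
```

With both tables normalized this way, a plain merge on the title column no longer misses rows over formatting differences.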

     Being able to solve a problem in a single line, instead of many convoluted lines in a for loop, was awesome. Even after figuring this out, I somehow convinced myself that the nested for loop I used later was still the right way to go, and that Pandas definitely didn't have any methods to make my life easier. This turned out to be false, as there always seems to be a method to make your life easier! Below is the table I wanted to pull movie genre information from:
[Image: dataframe with a comma-separated 'genres' column]

      I wanted to access the genres individually and count how many times each appears in the dataframe. However, as you can see in the table, each element of the genres Pandas series is a string of comma-separated genres. Below is my original solution using nested iteration:
[Image: nested-loop genre counting]
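A sketch of that nested-loop approach, reconstructed from the description below (variable names and the toy data are illustrative):

```python
import pandas as pd

# Toy stand-in for the real table: comma-separated genre strings
df = pd.DataFrame({'genres': ['Action,Adventure', 'Action,Drama', 'Drama']})

genre_series = df['genres']   # grab the genres column as a Series
genre_count_dict = {}         # will map each genre to its count

# Outer loop: split each string into a list of genres;
# inner loop: tally every genre in that list
for genre_string in genre_series:
    for genre in genre_string.split(','):
        genre_count_dict[genre] = genre_count_dict.get(genre, 0) + 1

# Convert the dictionary back into a dataframe for a neat counts table
genre_count_df = pd.DataFrame(genre_count_dict.items(),
                              columns=['genre', 'count'])
```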
      First, I created a variable to grab the genres information as a Pandas series. Next, I initialized a dictionary to fill with each genre and its corresponding count in the series. After that, I iterated to first create a list of genres, so that I could iterate over each of those lists again to fill my dictionary. I then converted my dictionary back into a Pandas dataframe to have a neat table of genres and genre counts. This worked, but it involved far more steps and datatype conversions than I would have liked. In comes my more elegant solution, thanks to a tip about the 'explode' Pandas method:
[Image: the explode solution]
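A sketch of that three-line version, using the same toy data as above:

```python
import pandas as pd

df = pd.DataFrame({'genres': ['Action,Adventure', 'Action,Drama', 'Drama']})

# Split each string into a list, explode the lists into one genre per row,
# then count occurrences -- three lines, all inside Pandas
genre_series = df['genres'].str.split(',')
genre_counts = genre_series.explode().value_counts()
genre_count_df = genre_counts.reset_index()
```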
I accomplished exactly what I wanted, again, but with only three lines of code and without having to convert back and forth between Pandas dataframes and dictionaries. Cleaning up your work is always a great feeling. If you ever find yourself iterating in Pandas, I advise you to first spend a little time and effort discovering a method that accomplishes the task more efficiently. If you take advantage of the powerful built-in functionality of the Pandas library, you are going to have a much better time coding with it.

Side Notes

      Something I didn't get to cover was the use of Python libraries in conjunction with APIs to retrieve and organize data. After gaining a lot of experience through hands-on practice working on this project, I ended up starting my own side project using the GBIF RESTful JSON-based API. I am going back to my scientific roots to explore population data for Mexican Honey Wasps, using a lot of the same skills I learned from working on this movie studio project. I hope to share this fun little data science experiment with everyone in a separate blog.
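As a small teaser, here is a hedged sketch of what querying GBIF's public occurrence-search endpoint looks like; the species name and parameters are my guesses at the side project, not code from it:

```python
import requests
import pandas as pd

# Query GBIF's occurrence-search endpoint (a public REST/JSON API);
# 'Brachygastra mellifica' is the Mexican Honey Wasp
resp = requests.get(
    'https://api.gbif.org/v1/occurrence/search',
    params={'scientificName': 'Brachygastra mellifica', 'limit': 50},
)
resp.raise_for_status()

# Flatten the JSON 'results' list into a dataframe for analysis
occurrences = pd.json_normalize(resp.json()['results'])
print(occurrences.shape)
```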

Conclusion

     Hands-on experience with the Python Pandas library helped me start to better parse out when to iterate and when to take advantage of NumPy vectorization. I also gained a ton of appreciation for Pandas' power for organizing and accessing large amounts of data. Finally, I shared my own method for building a personal connection with, and a deep understanding of, the data: calling simple methods to peek at the data, and creating tables and a schema before diving into any manipulation. It is time consuming, but I think it is worth it, as it reduces the chances of missing any connections that could lead to important insights later on.

      
