
30 Days of Python 👨‍💻 - Day 27 - ML & Data Science I

Arindam Dawn ・ Originally published at tabandspace.com ・ 11 min read

It is time to dig into some real Machine Learning and Data Science coding. Today I mainly focused on getting started with the Jupyter Notebook workflow and creating a basic project to understand how it works. The plan was then to find a data set and apply the basic principles of Machine Learning to it to extract useful information. I will also share the notebook I created. The great thing about Jupyter Notebooks is that they can be organized like a blog post or article, combining interactive code, data and other information.

Working with Jupyter Notebooks

I would like to provide a reference to some cool resources to understand the Jupyter Notebook interface, installation guide and its workflow overview.

Since I am a Windows user, here is a quick tip:

On Windows, open Anaconda Prompt from the Start menu, navigate to the directory where you want to create Jupyter projects, then run the command jupyter notebook. It will open the notebook in the browser.

Following the basic steps of Machine Learning and Data Science, we shall create a project with a readable notebook that documents the entire process and can be shared with anyone.

Basics of Data Science and ML using Netflix Shows project

The basic steps of ML and Data Science are:

  • Importing data from some source
  • Cleaning up the data to remove anything irrelevant, if needed
  • Splitting the data into a Training Set and a Test Set
  • Creating a model, an algorithm or a function
  • Checking the output
  • Improving and repeating the above steps

We shall explore the first two steps in this basic project.
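Although this project only covers the first two steps, it may help to see the whole loop in miniature. The following is only a rough sketch on made-up data: the column names (duration_min, notes, is_long), the 30% test split and the naive threshold "model" are all assumptions chosen for illustration, not something taken from the Netflix project.

```python
import pandas as pd

# Step 1: import. Hypothetical toy data standing in for a real source.
raw = pd.DataFrame({
    "duration_min": [90, 94, None, 110, 60, 99, 125, 133],
    "notes": ["n/a"] * 8,                  # irrelevant column
    "is_long": [0, 0, 0, 1, 0, 0, 1, 1],   # label: 1 if over 100 minutes
})

# Step 2: clean. Drop rows with missing values and the column we do not need.
clean = raw.dropna(subset=["duration_min"]).drop(columns=["notes"])

# Step 3: split. Hold out about 30% of the rows as a test set.
test = clean.sample(frac=0.3, random_state=0)
train = clean.drop(test.index)

# Step 4: "model". A deliberately naive rule: the midpoint between the
# average duration of each class in the training set.
threshold = train.groupby("is_long")["duration_min"].mean().mean()


def predict(duration_min):
    """Label a duration as long (1) when it exceeds the learned threshold."""
    return (duration_min > threshold).astype(int)


# Step 5: check the output on rows the rule was not fitted on.
accuracy = (predict(test["duration_min"]) == test["is_long"]).mean()
print(f"threshold={threshold:.1f}, test accuracy={accuracy:.2f}")
```

Step 6 would then mean going back, for example to gather more rows or try a better rule, and running the same loop again.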

1. Importing Data and manipulation

The first and most important thing for Machine Learning and Data Science is the data itself. To draw meaningful conclusions, we must have good data sets. This input data can be collected in a number of ways: from databases, by scraping websites, from public APIs or from publicly shared data sets.
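As a small illustration of two of these sources, the sketch below loads the same rows once from CSV text and once from a database. Everything in it is made up for the example: the titles "Show A" and "Show B", the shows table and the in-memory SQLite database are assumptions, not part of the Netflix data set.

```python
import io
import sqlite3

import pandas as pd

# Source 1: CSV text. A file path works the same way with read_csv.
csv_text = "title,release_year\nShow A,2019\nShow B,2016\n"
from_csv = pd.read_csv(io.StringIO(csv_text))

# Source 2: a database, here a throwaway in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE shows (title TEXT, release_year INTEGER)")
conn.executemany(
    "INSERT INTO shows VALUES (?, ?)",
    [("Show A", 2019), ("Show B", 2016)],
)
from_db = pd.read_sql_query("SELECT title, release_year FROM shows", conn)
conn.close()

# Both sources end up as ordinary data frames with the same shape.
print(from_csv.shape, from_db.shape)
```

Once the data is in a data frame, the rest of the workflow does not depend on where it came from.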

Kaggle is a popular website among Machine Learning and Data Science enthusiasts where tons of publicly shared data sets can be found.

I decided to search for a Netflix Shows data set and found this one on Kaggle - https://www.kaggle.com/shivamb/netflix-shows. It contains the data in CSV format, which will be used for this project. After downloading the file, it can be placed in the root directory of the project. I have named it netflix_titles.csv.

Since this data is tabular, meaning it is arranged in rows and columns, pandas is a great open-source library for processing and analyzing it. It comes with the Anaconda toolkit, so it can be used directly in the notebook.

import pandas as pd
data_frame = pd.read_csv('netflix_titles.csv')
data_frame.head(10) # show the first 10 rows
# the notebook renders the data frame as a table

|   | show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description |
| 0 | 81145628 | Movie | Norm of the North: King Sized Adventure | Richard Finn, Tim Maltby | Alan Marriott, Andrew Toth, Brian Dobson, Cole... | United States, India, South Korea, China | September 9, 2019 | 2019 | TV-PG | 90 min | Children & Family Movies, Comedies | Before planning an awesome wedding for his gra... |
| 1 | 80117401 | Movie | Jandino: Whatever it Takes | NaN | Jandino Asporaat | United Kingdom | September 9, 2016 | 2016 | TV-MA | 94 min | Stand-Up Comedy | Jandino Asporaat riffs on the challenges of ra... |
| 2 | 70234439 | TV Show | Transformers Prime | NaN | Peter Cullen, Sumalee Montano, Frank Welker, J... | United States | September 8, 2018 | 2013 | TV-Y7-FV | 1 Season | Kids' TV | With the help of three human allies, the Autob... |
| 3 | 80058654 | TV Show | Transformers: Robots in Disguise | NaN | Will Friedle, Darren Criss, Constance Zimmer, ... | United States | September 8, 2018 | 2016 | TV-Y7 | 1 Season | Kids' TV | When a prison ship crash unleashes hundreds of... |
| 4 | 80125979 | Movie | #realityhigh | Fernando Lebrija | Nesta Cooper, Kate Walsh, John Michael Higgins... | United States | September 8, 2017 | 2017 | TV-14 | 99 min | Comedies | When nerdy high schooler Dani finally attracts... |
| 5 | 80163890 | TV Show | Apaches | NaN | Alberto Ammann, Eloy Azorín, Verónica Echegui,... | Spain | September 8, 2017 | 2016 | TV-MA | 1 Season | Crime TV Shows, International TV Shows, Spanis... | A young journalist is forced into a life of cr... |
| 6 | 70304989 | Movie | Automata | Gabe Ibáñez | Antonio Banderas, Dylan McDermott, Melanie Gri... | Bulgaria, United States, Spain, Canada | September 8, 2017 | 2014 | R | 110 min | International Movies, Sci-Fi & Fantasy, Thrillers | In a dystopian future, an insurance adjuster f... |
| 7 | 80164077 | Movie | Fabrizio Copano: Solo pienso en mi | Rodrigo Toro, Francisco Schultz | Fabrizio Copano | Chile | September 8, 2017 | 2017 | TV-MA | 60 min | Stand-Up Comedy | Fabrizio Copano takes audience participation t... |
| 8 | 80117902 | TV Show | Fire Chasers | NaN | NaN | United States | September 8, 2017 | 2017 | TV-MA | 1 Season | Docuseries, Science & Nature TV | As California's 2016 fire season rages, brave ... |
| 9 | 70304990 | Movie | Good People | Henrik Ruben Genz | James Franco, Kate Hudson, Tom Wilkinson, Omar... | United States, United Kingdom, Denmark, Sweden | September 8, 2017 | 2014 | R | 90 min | Action & Adventure, Thrillers | A struggling couple can't believe their luck w... |
data_frame.info()
# shows information about column data types
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6234 entries, 0 to 6233
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       6234 non-null   int64 
 1   type          6234 non-null   object
 2   title         6234 non-null   object
 3   director      4265 non-null   object
 4   cast          5664 non-null   object
 5   country       5758 non-null   object
 6   date_added    6223 non-null   object
 7   release_year  6234 non-null   int64 
 8   rating        6224 non-null   object
 9   duration      6234 non-null   object
 10  listed_in     6234 non-null   object
 11  description   6234 non-null   object
dtypes: int64(2), object(10)
memory usage: 584.6+ KB
data_frame.shape
# provides information of rows and columns as a tuple
(6234, 12)
data_frame.describe()
# shows some basic description
|       | show_id      | release_year |
| count | 6.234000e+03 | 6234.00000   |
| mean  | 7.670368e+07 | 2013.35932   |
| std   | 1.094296e+07 | 8.81162      |
| min   | 2.477470e+05 | 1925.00000   |
| 25%   | 8.003580e+07 | 2013.00000   |
| 50%   | 8.016337e+07 | 2016.00000   |
| 75%   | 8.024489e+07 | 2018.00000   |
| max   | 8.123573e+07 | 2020.00000   |
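On a data frame with mixed column types, describe() only summarises the numeric columns by default, which is why only show_id and release_year appear above. A possible way to get a comparable summary for text columns is sketched below. The four-row sample is a hypothetical stand-in for the real file, and selecting the text columns by name before calling describe() is just one option.

```python
import pandas as pd

# Hypothetical four-row stand-in for the Netflix data frame.
sample = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show", "Movie"],
    "rating": ["TV-PG", "TV-MA", "TV-Y7", "TV-MA"],
    "release_year": [2019, 2016, 2013, 2017],
})

# With only text columns selected, describe() reports
# count, unique, top (most frequent value) and freq instead of mean/std.
text_summary = sample[["type", "rating"]].describe()
print(text_summary)

# value_counts() gives the full frequency breakdown for a single column.
type_counts = sample["type"].value_counts()
print(type_counts)
```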
data_frame['title'].head() # lists a specific column data with first 5 entries (head)
0    Norm of the North: King Sized Adventure
1                 Jandino: Whatever it Takes
2                         Transformers Prime
3           Transformers: Robots in Disguise
4                               #realityhigh
Name: title, dtype: object
# Filtering Data
data_frame[data_frame['country'] == 'India'].head()
|    | show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description |
| 35 | 81154455 | Movie | Article 15 | Anubhav Sinha | Ayushmann Khurrana, Nassar, Manoj Pahwa, Kumud... | India | September 6, 2019 | 2019 | TV-MA | 125 min | Dramas, International Movies, Thrillers | The grim realities of caste discrimination com... |
| 37 | 81052275 | Movie | Ee Nagaraniki Emaindi | Tharun Bhascker | Vishwaksen Naidu, Sushanth Reddy, Abhinav Goma... | India | September 6, 2019 | 2018 | TV-14 | 133 min | Comedies, International Movies | In Goa and in desperate need of cash, four chi... |
| 41 | 70303496 | Movie | PK | Rajkumar Hirani | Aamir Khan, Anuskha Sharma, Sanjay Dutt, Saura... | India | September 6, 2018 | 2014 | TV-14 | 146 min | Comedies, Dramas, International Movies | Aamir Khan teams with director Rajkumar Hirani... |
| 58 | 81155784 | Movie | Watchman | A. L. Vijay | G.V. Prakash Kumar, Samyuktha Hegde, Suman, Ra... | India | September 4, 2019 | 2019 | TV-14 | 93 min | Comedies, Dramas, International Movies | Rushing to pay off a loan shark, a young man b... |
| 99 | 80225885 | TV Show | Bard of Blood | NaN | Emraan Hashmi, Viineet Kumar, Sobhita Dhulipal... | India | September 27, 2019 | 2019 | TV-MA | 1 Season | International TV Shows, TV Action & Adventure,... | Years after a disastrous job in Balochistan, a... |
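One thing to keep in mind with the equality filter above: it only matches rows whose country is exactly 'India'. Co-productions such as the first row of the data set (United States, India, South Korea, China) are skipped, and the column also contains missing values. A minimal sketch of a substring match, using a hypothetical four-row sample with placeholder titles:

```python
import pandas as pd

# Hypothetical sample: one exact match, one co-production,
# one missing country and one unrelated country.
sample = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "country": [
        "India",
        "United States, India, South Korea, China",
        None,
        "Spain",
    ],
})

# Exact comparison: only rows where the whole value equals "India".
exact = sample[sample["country"] == "India"]

# Substring match: na=False treats missing countries as "no match",
# regex=False searches for the literal text.
contains = sample[sample["country"].str.contains("India", na=False, regex=False)]

print(exact["title"].tolist())
print(contains["title"].tolist())
```

Which of the two is right depends on the question being asked: titles produced only in India, or titles with any Indian involvement.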
# Sorting Data
data_frame.sort_values('release_year', ascending=False).head()
|      | show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description |
| 3467 | 81011449 | TV Show | Medical Police | NaN | Erinn Hayes, Rob Huebel, Malin Akerman, Rob Co... | United States | January 10, 2020 | 2020 | TV-MA | 1 Season | Crime TV Shows, TV Action & Adventure, TV Come... | Doctors Owen Maestro and Lola Spratt leave Chi... |
| 3249 | 81006825 | Movie | All the Freckles in the World | Yibrán Asuad | Hánssel Casillas, Loreto Peralta, Andrea Sutto... | Mexico | January 3, 2020 | 2020 | TV-14 | 90 min | Comedies, International Movies, Romantic Movies | Thirteen-year-old José Miguel is immune to 199... |
| 3220 | 80997687 | TV Show | Dracula | NaN | Claes Bang, Dolly Wells, John Heffernan | United Kingdom | January 4, 2020 | 2020 | TV-14 | 1 Season | British TV Shows, International TV Shows, TV D... | The Count Dracula legend transforms with new t... |
| 3427 | 81060049 | Movie | Leslie Jones: Time Machine | David Benioff, D.B. Weiss | Leslie Jones | United States | January 14, 2020 | 2020 | TV-MA | 66 min | Stand-Up Comedy | From trying to seduce Prince to battling sleep... |
| 3436 | 80239306 | TV Show | The Healing Powers of Dude | NaN | Jace Chapman, Larisa Oleynik, Tom Everett Scot... | NaN | January 13, 2020 | 2020 | TV-G | 1 Season | Kids' TV, TV Comedies, TV Dramas | When an 11-year-old boy with social anxiety di... |
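Sorting by release_year works directly because it is an integer column. The date_added column, however, is stored as text (object in the info() output), so sorting it as-is would order the values alphabetically. A possible approach is to parse it first. The sketch below uses three date_added values from the tables above, and the format string is an assumption based on how those values look.

```python
import pandas as pd

# Three titles and date_added values taken from the tables above.
sample = pd.DataFrame({
    "title": ["Medical Police", "Dracula", "Article 15"],
    "date_added": ["January 10, 2020", "January 4, 2020", "September 6, 2019"],
})

# Sorted as text: alphabetical order, so "September" comes out on top.
as_text = sample.sort_values("date_added", ascending=False)

# Parse the text into real dates, then sort chronologically.
sample["date_added"] = pd.to_datetime(
    sample["date_added"].str.strip(),  # guard against stray whitespace
    format="%B %d, %Y",                # e.g. "January 10, 2020"
)
as_dates = sample.sort_values("date_added", ascending=False)

print(as_text["title"].tolist())
print(as_dates["title"].tolist())
```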

This is a great cheat-sheet for Data Science with Python which lists all the commonly used Pandas methods and properties along with other libraries for data science as well.

2. Cleaning Data

The next step is cleaning up the data and removing any information that is not required for the analysis.
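Cleaning often also involves deciding what to do with missing values. The info() output earlier shows, for instance, that director is filled in for only 4265 of the 6234 rows. Below is a minimal sketch of the usual options on a hypothetical three-row sample. Whether to drop or fill is a judgment call that depends on the analysis, and the "Unknown" placeholder is just an assumption.

```python
import pandas as pd

# Hypothetical sample with missing directors (None becomes a missing value).
sample = pd.DataFrame({
    "title": ["Transformers Prime", "Good People", "Fire Chasers"],
    "director": [None, "Henrik Ruben Genz", None],
    "rating": ["TV-Y7-FV", "R", "TV-MA"],
})

# Count the missing values in each column.
missing_per_column = sample.isna().sum()

# Option 1: drop the rows where a required column is missing.
without_missing = sample.dropna(subset=["director"])

# Option 2: keep the rows and fill the gaps with a placeholder.
filled = sample.fillna({"director": "Unknown"})

print(missing_per_column)
print(len(without_missing), filled["director"].tolist())
```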
Let's consider an example use case where we want to find which Netflix comedy movies and shows are suitable for all ages (TV-G rating).

# Let's select the relevant columns for analysis
df_shows = pd.DataFrame(data_frame, columns=['title','rating', 'listed_in'])
# filter comedy shows
df_comedy_shows = df_shows[df_shows['listed_in'].str.contains('Comed')]
df_comedy_shows.head()
|    | title | rating | listed_in |
| 0  | Norm of the North: King Sized Adventure | TV-PG | Children & Family Movies, Comedies |
| 1  | Jandino: Whatever it Takes | TV-MA | Stand-Up Comedy |
| 4  | #realityhigh | TV-14 | Comedies |
| 7  | Fabrizio Copano: Solo pienso en mi | TV-MA | Stand-Up Comedy |
| 10 | Joaquín Reyes: Una y no más | TV-MA | Stand-Up Comedy |
# filter shows for all ages
df_all_ages = df_comedy_shows[df_comedy_shows['rating']=='TV-G']
df_all_ages.head()
|      | title | rating | listed_in |
| 1034 | Luccas Neto in: Summer Camp | TV-G | Children & Family Movies, Comedies |
| 1043 | A Holiday Engagement | TV-G | Children & Family Movies, Comedies, Romantic M... |
| 1205 | A Fairly Odd Summer | TV-G | Children & Family Movies, Comedies |
| 1206 | Bella and the Bulldogs | TV-G | Kids' TV, TV Comedies |
| 1211 | Jinxed | TV-G | Children & Family Movies, Comedies |
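The two filtering steps above can also be combined into a single selection. The sketch below is one possible way to write it, using a hypothetical four-row sample built from titles that appear in the tables above. The variable names and the use of isin() are my own choices here, not part of the original notebook.

```python
import pandas as pd

# Hypothetical sample using titles from the tables above.
sample = pd.DataFrame({
    "title": ["Jinxed", "#realityhigh", "Fire Chasers", "Bella and the Bulldogs"],
    "rating": ["TV-G", "TV-14", "TV-MA", "TV-G"],
    "listed_in": [
        "Children & Family Movies, Comedies",
        "Comedies",
        "Docuseries, Science & Nature TV",
        "Kids' TV, TV Comedies",
    ],
})

# Build one boolean mask per condition.
is_comedy = sample["listed_in"].str.contains("Comed", na=False)
is_all_ages = sample["rating"].isin(["TV-G"])  # more ratings can be listed here

# Combine the masks with & and pick the columns in the same .loc call.
df_all_ages = sample.loc[is_comedy & is_all_ages, ["title", "rating", "listed_in"]]
print(df_all_ages["title"].tolist())
```

Keeping each condition in a named mask makes it easier to add or adjust criteria later without creating a new intermediate data frame for every step.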

The GitHub repository for this notebook can be found here.

Resources:

That's all for today's post. Tomorrow I will continue exploring the other steps of machine learning and data science, performing a visual analysis of the data by building charts and diagrams and creating machine learning models.

Have a great one!
