Greg

Originally published at gregondata.com

Finding popular data science podcasts via web scraping

This article goes over the process I used to create the list of podcasts you see below. If you're just here for the podcasts, then have at it...

the most popular data science podcasts

| title | author | avg_rtg | rtg_ct | episodes |
| --- | --- | --- | --- | --- |
| Lex Fridman Podcast | Lex Fridman | 4.9 | 2400 | 126 |
| Machine Learning Guide | OCDevel | 4.9 | 626 | 30 |
| Data Skeptic | Kyle Polich | 4.4 | 431 | 300 |
| Data Stories | Enrico Bertini and Moritz Stefaner | 4.5 | 405 | 162 |
| The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence) | Sam Charrington | 4.7 | 300 | 300 |
| DataFramed | DataCamp | 4.9 | 188 | 59 |
| The AI Podcast | NVIDIA | 4.5 | 162 | 125 |
| SuperDataScience | Kirill Eremenko | 4.6 | 161 | 300 |
| Partially Derivative | Partially Derivative | 4.8 | 141 | 101 |
| Machine Learning | Stanford | 3.9 | 138 | 20 |
| Talking Machines | Tote Bag Productions | 4.6 | 133 | 106 |
| AI in Business | Daniel Faggella | 4.4 | 102 | 100 |
| Learning Machines 101 | Richard M. Golden, Ph.D., M.S.E.E., B.S.E.E. | 4.4 | 87 | 82 |
| storytelling with data podcast | Cole Nussbaumer Knaflic | 4.9 | 80 | 33 |
| Data Crunch | Data Crunch Corporation | 4.9 | 70 | 64 |
| Data Viz Today | Alli Torban | 5.0 | 64 | 62 |
| Artificial Intelligence | MIT | 4.1 | 61 | 31 |
| O'Reilly Data Show Podcast | O'Reilly Media | 4.2 | 59 | 60 |
| Machine Learning – Software Engineering Daily | Machine Learning – Software Engineering Daily | 4.5 | 59 | 115 |
| Data Science at Home | Francesco Gadaleta | 4.2 | 58 | 100 |
| Data Engineering Podcast | Tobias Macey | 4.7 | 58 | 150 |
| Big Data | Ryan Estrada | 4.6 | 58 | 13 |
| Follow the Data Podcast | Bloomberg Philanthropies | 4.3 | 57 | 82 |
| Making Data Simple | IBM | 4.3 | 56 | 104 |
| Analytics on Fire | Mico Yuk | 4.4 | 51 | 48 |
| Learn to Code in One Month | Learn to Code | 4.9 | 50 | 26 |
| Becoming A Data Scientist Podcast | Renee Teate | 4.5 | 49 | 21 |
| Practical AI: Machine Learning & Data Science | Changelog Media | 4.5 | 48 | 105 |
| The Present Beyond Measure Show: Data Visualization, Storytelling & Presentation for Digital Marketers | Lea Pica | 4.9 | 44 | 58 |
| The Data Chief | Mission | 4.9 | 43 | 16 |
| AI Today Podcast: Artificial Intelligence Insights, Experts, and Opinion | Cognilytica | 4.2 | 42 | 161 |
| Data Driven | Data Driven | 4.9 | 41 | 257 |
| HumAIn Podcast - Artificial Intelligence, Data Science, and Developer Education | David Yakobovitch | 4.8 | 39 | 78 |
| Data Gurus | Sima Vasa | 5.0 | 39 | 106 |
| Masters of Data Podcast | Sumo Logic hosted by Ben Newton | 5.0 | 38 | 74 |
| The PolicyViz Podcast | The PolicyViz Podcast | 4.7 | 36 | 180 |
| The Radical AI Podcast | Radical AI | 4.9 | 34 | 35 |
| Women in Data Science | Professor Margot Gerritsen | 4.9 | 28 | 24 |
| Towards Data Science | The TDS team | 4.6 | 26 | 50 |
| Data in Depth | Mountain Point | 5.0 | 22 | 24 |
| Data Science Imposters Podcast | Antonio Borges and Jordy Estevez | 4.4 | 22 | 88 |
| The Artists of Data Science | Harpreet Sahota | 4.9 | 19 | 41 |
| #DataFemme | Dikayo Data | 5.0 | 17 | 30 |
| The Banana Data Podcast | Dataiku | 4.9 | 15 | 33 |
| Experiencing Data with Brian T. O'Neill | Brian T. O'Neill from Designing for Analytics | 4.9 | 14 | 13 |
| Secrets of Data Analytics Leaders | Eckerson Group | 4.8 | 13 | 82 |
| Data Journeys | AJ Goldstein | 5.0 | 13 | 26 |
| Data Driven Discussions | Outlier.ai | 5.0 | 12 | 8 |
| Data Futurology - Leadership And Strategy in Artificial Intelligence, Machine Learning, Data Science | Felipe Flores | 4.4 | 11 | 135 |
| Artificially Intelligent | Christian Hubbs and Stephen Donnelly | 4.9 | 11 | 100 |

why i want to find data science podcasts

This would normally be at the top of an article on finding data science podcasts. Well, it would be at the top of any article. But realistically, most people are finding this from Google, and they're just looking for the answer that's at the top of the page. If you type in 'the most popular data science podcasts', you really don't want to have to scroll endlessly to find the answer you're looking for. So to make their experience better, we're just leaving the answer up there. And giving them sass. Lots of sass.

Anyways, I really like listening to things. While newsletters are great for keeping up with current events and blogs are great for learning specific things, podcasts have a special place in my heart for allowing me to aimlessly learn something new every day. The format really lends itself to delivering information efficiently, but in a way where you can multitask. Pre-COVID, my morning commute was typically full of podcasts. While COVID has rendered my commute a nonexistent affair, I still try to listen to at least a podcast a day if I can manage it. My view is that 30 minutes of learning a day will really add up in the long run, and I feel that podcasts are a great way to get there.

Now that we've been through my love affair with podcasts, you can imagine my surprise when I started looking for a few data science ones to subscribe to and I didn't find a tutorial on how to use web scraping to find the most popular data science podcasts to listen to. I know, crazy. There's a web scraping tutorial on everything under the sun except for - seemingly - podcasts. I mean there's probably not one on newsletters either, but we'll leave that alone for now...

So if no one else is crazy enough to write about finding data science podcasts with web scraping, then...

gameplanning the process

By now we're almost certainly rid of those savages who are only here for the answer (gasp, how could they), so let's go through the little process I used to gather the data. It's not particularly long - it probably took me an hour to put together - so it should be a good length for an article.

I'm using Python here with an installation of Anaconda (a common package management / deployment system for Python). I'll be running this in a Jupyter notebook, since it's a one-off task that I don't need to use ever again... hopefully.

In terms of what I'm going to do, I'll run a few Google keyword searches limited to the 'https://podcasts.apple.com/us/podcast/' domain and scrape the results for the first few pages. From there, I'll scrape each Apple Podcasts page to get the total number of ratings and the average rating. Yeah, the data will be biased, but it's a quick and dirty way to get the answer I'm looking for.
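To make that concrete, here's the kind of URL we'll end up requesting - a Google search restricted to the Apple Podcasts directory. This just previews the query-building logic that's defined properly below:

```python
import urllib.parse

# a site-limited Google search, URL-encoded
q = urllib.parse.quote_plus('site:https://podcasts.apple.com/us/podcast/ data science podcast')
print(f"https://google.com/search?hl=en&lr=en&q={q}")
# https://google.com/search?hl=en&lr=en&q=site%3Ahttps%3A%2F%2Fpodcasts.apple.com%2Fus%2Fpodcast%2F+data+science+podcast
```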

code to find top data science podcasts - version 1

```python
# import packages that ship with python (or with the Anaconda distribution)
import urllib.parse
import requests
import time
```

The packages above come with Python (or at least with an Anaconda install); the ones below aren't always included. If you don't have them, you'll need to install them with pip or conda.
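If you do need to install them, it's typically `pip install beautifulsoup4 pandas` or, in an Anaconda environment, `conda install beautifulsoup4 pandas` (exact commands can vary by setup).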

```python
# import non-standard python packages
# if you don't have these installed, install them with pip or conda
from bs4 import BeautifulSoup
import pandas as pd
```

Now that the packages have been imported, you should define your user agent. First off, because it's polite if you're scraping anything. Secondly, Google gives different results for mobile and desktop searches. This isn't actually my user-agent; I took it from another tutorial since I'm a bit lazy. I actually use Linux...

```python
# define your desktop user-agent
USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:65.0) Gecko/20100101 Firefox/65.0"
```

Alright, now we're going to define the queries we want to run, and then create a function that spits out the URL we want to scrape on Google. I'm putting the queries in a kwargs format since I want to feed them through a function - that way I can just loop through the list of kwargs and collect what the function returns.

```python
# Queries
list_kwargs = [
    {"string": 'data podcast'},
    {"string": 'data podcast', "pg": 2},
    {"string": 'data podcast', "pg": 3},
    {"string": 'data science podcast'},
    {"string": 'data engineering podcast'},
    {"string": 'data visualization podcast'},
]

def string_to_podcast_query(string, pg=None):
    query = urllib.parse.quote_plus(f'site:https://podcasts.apple.com/us/podcast/ {string}')
    if pg is not None:
        query = query + "&start=" + str(10*(pg-1))
    return f"https://google.com/search?hl=en&lr=en&q={query}", string

# define the headers we will add to all of our requests
headers = {"user-agent": USER_AGENT}

# set up an empty list to push results to
results = []

# cycle through the list of queries
for x in list_kwargs:
    # return the query url and the search term that was used to create it (for classification later)
    url, search_term = string_to_podcast_query(**x)

    # make a get request to the url, include the headers with our user-agent
    resp = requests.get(url, headers=headers)

    # only proceed if we get a 200 code confirming the request was processed correctly
    if resp.status_code == 200:
        # feed the request into beautiful soup
        soup = BeautifulSoup(resp.content, "html.parser")

        # find all divs (a css element that wraps page areas) within google results
        for g in soup.find_all('div', class_='r'):
            # within the results, find all the links
            anchors = g.find_all('a')
            if anchors:
                # get the link and title, add them to an object, and append that to the results array
                link = anchors[0]['href']
                title = g.find('h3').text
                item = {
                    "title": title,
                    "link": link,
                    "search_term": search_term
                }
                results.append(item)

    # sleep for 2.5s between requests.  we don't want to annoy google and deal with recaptchas
    time.sleep(2.5)
```

Alright, now we have the Google results back - nice. From here, let's put them in a pandas DataFrame and filter them a bit.

```python
google_results_df = pd.DataFrame(results)

# create a filter for anything that is an episode: episode titles contain a ' | '
# drop any duplicate results as well
google_results_df['is_episode'] = google_results_df['title'].str.contains(' | ', regex=False)
google_results_df = google_results_df.drop_duplicates(subset='title')

# keep only the show pages, not individual episodes
google_results_podcasts = google_results_df.loc[google_results_df['is_episode'] == False].copy()
```
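To see why that ' | ' filter works, here's a quick sanity check with a couple of made-up titles (hypothetical examples, not real results) - Google titles for individual episodes look like 'Episode Name | Show Name', while show pages are just the show name:

```python
sample = pd.Series(['Data Skeptic', 'Some Interview | Data Skeptic'])
print(sample.str.contains(' | ', regex=False))
# 0    False
# 1     True
```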

Ok cool, we have a list of podcasts. Let's define our Apple Podcasts scraper.

```python
def podcast_scrape(link):
    # get the link, use the same headers as previously defined
    resp = requests.get(link, headers=headers)
    # raise an error on a non-200 response rather than trying to parse a bad page
    resp.raise_for_status()
    soup = BeautifulSoup(resp.content, "html.parser")

    # find the figcaption element on the page
    rtg_soup = soup.find("figcaption", {"class": "we-rating-count star-rating__count"})
    # the text will return an avg rating and a number of reviews, split by a •
    # we'll split that, so '4.3 • 57 Ratings' becomes '4.3' and '57 Ratings'
    avg_rtg, rtg_ct = rtg_soup.get_text().split(' • ')
    # then we'll take the number from the rtg_ct variable by splitting it on the space
    rtg_ct = rtg_ct.split(' ')[0]

    # find the title in the document, get the text and strip out whitespace
    title_soup = soup.find('span', {"class": "product-header__title"})
    title = title_soup.get_text().strip()
    # find the author in the document, get the text and strip out whitespace
    author_soup = soup.find('span', {"class": "product-header__identity podcast-header__identity"})
    author = author_soup.get_text().strip()

    # find the episode count div, then the paragraph under it, then extract the # of episodes
    episode_soup = soup.find('div', {"class": "product-artwork__caption small-hide medium-show"})
    episode_soup_p = episode_soup.find('p')
    episode_ct = episode_soup_p.get_text().strip().split(' ')[0]

    # format the response as a dict, return it as the result of the function
    response = {
        "title": title,
        "author": author,
        "link": link,
        "avg_rtg": avg_rtg,
        "rtg_ct": rtg_ct,
        "episodes": episode_ct
    }
    return response
```

Cool, we now have a podcast scraper. You can try it with the code below.

```python
podcast_scrape('https://podcasts.apple.com/us/podcast/follow-the-data-podcast/id1104371750')

{'title': 'Follow the Data Podcast',
 'author': 'Bloomberg Philanthropies',
 'link': 'https://podcasts.apple.com/us/podcast/follow-the-data-podcast/id1104371750',
 'avg_rtg': '4.3',
 'rtg_ct': '57',
 'episodes': '82'}
```

Back to the code. Let's now loop through all the podcast links we have.

```python
# define the result array we'll fill during the loop
podcast_summ = []
for link in google_results_podcasts['link']:
    # use a try/except, since a few episodes are still in the list and would cause
    # errors otherwise.  this way, an error just won't add anything to the array
    try:
        # get the response from our scraper and append it to our results
        pod_resp = podcast_scrape(link)
        podcast_summ.append(pod_resp)
    except Exception:
        pass
    # wait for 5 seconds to be nice to apple
    time.sleep(5)
```

Now to put everything into a dataframe and do a little bit of sorting and filtering.

```python
pod_df = pd.DataFrame(podcast_summ)

# remove non-english podcasts (their links carry a language parameter), sorry guys...
pod_df = pod_df.loc[~pod_df['link'].str.contains('l=')]
pod_df.drop_duplicates(subset='link', inplace=True)

# merge with the original dataframe (in case you want to see which queries
# were responsible for which podcasts)
merge_df = google_results_podcasts.merge(pod_df, on='link', suffixes=('_g', ''))
merge_df.drop_duplicates(subset='title', inplace=True)

# change the average rating and rating count columns from strings to numbers
merge_df['avg_rtg'] = merge_df['avg_rtg'].astype('float64')
merge_df['rtg_ct'] = merge_df['rtg_ct'].astype('int64')

# sort by total ratings and then send them to a csv
merge_df.sort_values('rtg_ct', ascending=False).to_csv('podcasts.csv')
```

From here I exported the file to csv and did a bit of cheating where I combined the title and link into an `<a href="link">title</a>`, but that's mainly because I got a bit lazy...
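If you'd rather skip that manual step, here's a sketch of the same combination done directly in pandas (using the merge_df from above; the title_link column name is just something I made up):

```python
# build the html anchor tags as a new column before exporting
merge_df['title_link'] = '<a href="' + merge_df['link'] + '">' + merge_df['title'] + '</a>'
```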

Anyways, that was the full process for creating the above list of data science podcasts. You now have the top podcasts, sorted by total ratings. I also considered scraping castbox as a source (since they have an approximation of subscribers / downloads), but I couldn't find any good way to search it for generally popular podcasts, or for podcasts containing a certain word.

The first version of this article stopped here and showed the results from this code.

code to find top data science podcasts - version 2

Well, that was fine, but I think it's actually lacking a bit. A few podcasts I've stumbled across are missing that I was hoping this would capture. So we're going to switch some stuff up. First, I'm going to use a mobile user agent to tell Google I'm searching from my phone.

Why? Well, Google shows different results for desktop searches vs mobile searches, so if we're looking to find the best podcasts, we want to be where most of the searches are actually happening. And since you basically always listen to podcasts on your phone, it probably makes sense to search from your phone... The code for that is below; the main changes are the target class and the title crawler, marked with comments.
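One thing to note: the snippet below references a MOBILE_USER_AGENT constant that needs defining first. The original string isn't shown above, so here's an example iPhone Safari user-agent; any current mobile UA string will do:

```python
# example mobile user-agent (an iPhone Safari string; swap in any current one)
MOBILE_USER_AGENT = "Mozilla/5.0 (iPhone; CPU iPhone OS 13_5 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Mobile/15E148 Safari/604.1"
```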

```python
# Mobile Search Version
headers = {"user-agent": MOBILE_USER_AGENT}

results = []
for x in list_kwargs:
    url, search_term = string_to_podcast_query(**x)
    resp = requests.get(url, headers=headers)
    if resp.status_code == 200:
        soup = BeautifulSoup(resp.content, "html.parser")

        for g in soup.find_all('div', class_='mnr-c'):  # updated target class
            anchors = g.find_all('a')
            if anchors:
                link = anchors[0]['href']
                title = anchors[0].find_all('div')[1].get_text().strip()  # updated title crawler
                item = {
                    "title": title,
                    "link": link,
                    "search_term": search_term
                }
                results.append(item)

    time.sleep(2.5)
```

What else did I switch up? I switched the Google queries a bit and added a few more. I figure if I'm actually trying to find the best podcasts, it makes sense to search for exactly that. That way, you get the ones that typically show up on these types of blog lists.

```python
# Queries
list_kwargs = [
    {"string": 'best data podcast'},
    {"string": 'best data podcast', "pg": 2},
    {"string": 'best data podcast', "pg": 3},
    {"string": 'best data podcast', "pg": 4},
    {"string": 'best data science podcast'},
    {"string": 'best data science podcast', "pg": 2},
    {"string": 'best data science podcast', "pg": 3},
    {"string": 'best artificial intelligence podcast'},
    {"string": 'best machine learning podcast'},
    {"string": 'best data engineering podcast'},
    {"string": 'best data visualization podcast'},
]
```

And that's it - all of the changes I made for the second version. The results are updated up top, and they paint a more complete picture.

code to find top data science podcasts - version 3

And I'm an idiot. 'Fixing' my queries to only find the 'best data science podcasts' ended up making me miss a few of the good ones I found earlier. So I'm going to do what any good data scientist does and just combine both sets of queries...

```python
# Queries - the union of versions 1 and 2
list_kwargs = [
    {"string": 'data podcast'},
    {"string": 'data podcast', "pg": 2},
    {"string": 'data podcast', "pg": 3},
    {"string": 'data science podcast'},
    {"string": 'data engineering podcast'},
    {"string": 'data visualization podcast'},
    {"string": 'best data podcast'},
    {"string": 'best data podcast', "pg": 2},
    {"string": 'best data podcast', "pg": 3},
    {"string": 'best data podcast', "pg": 4},
    {"string": 'best data science podcast'},
    {"string": 'best data science podcast', "pg": 2},
    {"string": 'best data science podcast', "pg": 3},
    {"string": 'best artificial intelligence podcast'},
    {"string": 'best machine learning podcast'},
    {"string": 'best data engineering podcast'},
    {"string": 'best data visualization podcast'},
]
```
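Alternatively, if you kept the results lists from both earlier runs, you could concatenate those instead of re-running every query - a sketch, where results_v1 and results_v2 are hypothetical names for the two saved lists:

```python
# combine both runs into one dataframe and drop shows found by both
combined_df = pd.concat([pd.DataFrame(results_v1), pd.DataFrame(results_v2)], ignore_index=True)
combined_df = combined_df.drop_duplicates(subset='title')
```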




closing note

This is a cross-post from my blog. My current readership is a solid 0 views per month, so I thought it might be worth actually sharing it here...
