Dimitry Zub ☀️

Originally published at serpapi.com

Scrape Organic News from Brave Search with Python

This blog post will show you how to scrape the title, link, displayed link, source website, thumbnail, and the date the news was posted from Organic News results in Brave Search.


What is Brave Search

To avoid duplicating content: I already covered what Brave Search is in the previous Brave blog post.

Intro

This blog post is a continuation of the Brave Search web scraping series. Here you'll see how to scrape Organic News Results from Brave Search using Python with the beautifulsoup, requests, and lxml libraries.

Note: the HTML layout might change in the future, so some of the CSS selectors might stop working. Let me know if something isn't working.

Prerequisites

pip install requests
pip install lxml 
pip install beautifulsoup4

Make sure you have basic knowledge of the libraries mentioned above, since this blog post is not exactly a tutorial for beginners. I'll try my best to show in code that it's not that difficult.

Also, make sure you have a basic understanding of CSS selectors, because the select()/select_one() beautifulsoup methods accept CSS selectors. CSS selectors reference.
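
As a quick refresher on how the two methods differ, here is a minimal sketch. The HTML snippet and its id/class names are made up for illustration and have nothing to do with Brave Search; the built-in html.parser is used only so the snippet runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, made up for illustration only.
html = """
<ul id="fruits">
  <li class="item">Apple</li>
  <li class="item">Banana</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# select() returns a list of every element matching the CSS selector.
items = [li.text for li in soup.select("#fruits .item")]

# select_one() returns only the first match, or None when nothing matches.
first = soup.select_one("#fruits .item").text
missing = soup.select_one(".does-not-exist")

print(items)    # ['Apple', 'Banana']
print(first)    # Apple
print(missing)  # None
```

The None return value matters later on: calling .text on a missing element raises an AttributeError, which is usually the first symptom of a changed page layout.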

Imports

from bs4 import BeautifulSoup
import requests, lxml, json

What will be scraped

What is being scraped (Brave Search News results)

Process

Continuing the Dune adventure, let's scrape news about the Dune movie from Brave Search.

As usual, we first need to find a container with the needed data so that we can iterate over each element afterwards:

container with needed data

The screenshot translates to this:

for news_result in soup.select('#news-carousel .card'):
    # further code..

After picking a container, we need to grab the other elements, such as the title, link, displayed link, source website, and thumbnail, with the appropriate CSS selectors.
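
To make the per-card extraction easier to follow without hitting the network, here is a rough sketch that runs against a hand-written card. The sample markup is an assumption: its class names mirror the selectors used in this post, but the real Brave Search HTML may look different. The text_or_none() helper is hypothetical, added only to show one way of coping with a missing element.

```python
from bs4 import BeautifulSoup

# Hypothetical card markup: class names mirror the selectors used in this post,
# but the real Brave Search HTML may differ.
sample_html = """
<div id="news-carousel">
  <a class="card" href="https://example.com/dune-news">
    <div class="title"> Dune sequel announced </div>
    <span class="anchor">example.com</span>
  </a>
</div>
"""


def text_or_none(parent, selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    element = parent.select_one(selector)
    return element.text.strip() if element is not None else None


soup = BeautifulSoup(sample_html, "html.parser")
card = soup.select_one("#news-carousel .card")

result = {
    "title": text_or_none(card, ".title"),
    "link": card.get("href"),
    "source": text_or_none(card, ".anchor"),
    # The sample has no timestamp, so this falls back to None instead of raising.
    "time_published": text_or_none(card, ".card-footer__timestamp"),
}

print(result)
```

Returning None for a missing field is just one possible design choice; it keeps the loop going when a single selector stops matching, at the cost of having to check for None afterwards.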


Code

from bs4 import BeautifulSoup
import requests, lxml, json

headers = {
  'User-agent':
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
  'q': 'dune 2021',
  'source': 'web'
}

def get_organic_news_results():

  # A timeout keeps the script from hanging if the server does not respond.
  html = requests.get('https://search.brave.com/search', headers=headers, params=params, timeout=30)
  soup = BeautifulSoup(html.text, 'lxml')

  data = []

  # Each news result is an element with the "card" class inside the news carousel.
  for news_result in soup.select('#news-carousel .card'):
    title = news_result.select_one('.title').text.strip()
    link = news_result['href']  # read from the card element's own href attribute
    time_published = news_result.select_one('.card-footer__timestamp').text.strip()
    source = news_result.select_one('.anchor').text.strip()
    favicon = news_result.select_one('.favicon')['src']
    # The thumbnail URL sits inside the inline style: background-image: url('...'), ...
    thumbnail = news_result.select_one('.img-bg')['style'].split(', ')[0].replace("background-image: url('", "").replace("')", "")

    data.append({
      'title': title,
      'link': link,
      'time_published': time_published,
      'source': source,
      'favicon': favicon,
      'thumbnail': thumbnail
    })

  print(json.dumps(data, indent=2, ensure_ascii=False))


get_organic_news_results()

---------------
# part of the output
'''
[
  {
    "title": "Zendaya talks potential 'Dune' sequel, what she admires about Tom ...",
    "link": "https://www.goodmorningamerica.com/culture/story/zendaya-talks-potential-dune-sequel-admires-tom-holland-80555190",
    "time_published": "17 hours ago",
    "source": "goodmorningamerica.com",
    "favicon": "https://imgr.search.brave.com/NygzuIHo7PzzX-7H4OjswMN4xwJ7u3_eEXq55_xXDog/fit/32/32/ce/1/aHR0cDovL2Zhdmlj/b25zLnNlYXJjaC5i/cmF2ZS5jb20vaWNv/bnMvZDQwMjIyNDJk/MjRjZGRmNjI4NmY2/NzUzY2I5YTkyMzIz/YTM4OTJiOTM3YjBm/NDk3OTVjNTIwOTY0/Nzg0YmUwYy93d3cu/Z29vZG1vcm5pbmdh/bWVyaWNhLmNvbS8",
    "thumbnail": "https://imgr.search.brave.com/z-Za3HgnUCgTAP8vloSHS33eC0UkjIM8JsMdngGw_Rk/fit/200/200/ce/1/aHR0cHM6Ly9zLmFi/Y25ld3MuY29tL2lt/YWdlcy9HTUEvemVu/ZGF5YS1maWxlLWd0/eS1qZWYtMjExMDEz/XzE2MzQxMzkxNzQw/MjNfaHBNYWluXzE2/eDlfOTkyLmpwZw"
  }
 ...
]
'''
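
The thumbnail line in the code above relies on the exact shape of the style attribute (single quotes, a ', ' separator). As a possible alternative, here is a small sketch that pulls the first url(...) value out with a regular expression. The extract_background_url() helper and the sample style strings are hypothetical, and this is not necessarily how you have to do it.

```python
import re

# Matches the first url(...) value, with optional single or double quotes.
URL_IN_STYLE = re.compile(r"""url\(\s*['"]?(.*?)['"]?\s*\)""")


def extract_background_url(style):
    """Return the first URL found in an inline style string, or None."""
    match = URL_IN_STYLE.search(style or "")
    return match.group(1) if match else None


# Hypothetical style value shaped like the one the scraper above expects.
style = "background-image: url('https://example.com/a.jpg'), url('https://example.com/b.jpg')"

print(extract_background_url(style))         # https://example.com/a.jpg
print(extract_background_url("color: red"))  # None
```

Inside the loop it could be used as extract_background_url(news_result.select_one('.img-bg')['style']). Note that a URL containing a closing parenthesis would be cut short by this pattern.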

Code in the online IDE • SelectorGadget

Outro

If you have any questions or suggestions, or something isn't working correctly, feel free to drop a comment in the comment section.

If you want to access this feature via SerpApi, upvote the Support Brave Search feature request, which is currently under review.

Yours,
Dimitry, and the rest of SerpApi Team.
