Intro
In this blog post, we'll go through the process of extracting Bing News using the Bing News Engine Results API and the Python programming language. You can look at the complete code in the online IDE (Replit).
What will be scraped
Why use an API?
There are a couple of reasons to use an API, ours in particular:
- No need to create a parser from scratch and maintain it.
- No need to bypass blocks from Bing yourself, such as CAPTCHAs or IP blocks.
- No need to pay for proxies and CAPTCHA solvers.
- No need to figure out the legal part of scraping data.
SerpApi handles everything on the backend, with response times under ~1.5 seconds per request (~1.0 seconds with Ludicrous speed). It does this without browser automation, which makes it much faster. Response times and status rates are shown on the SerpApi Status page.
Full Code
This code retrieves all news with pagination:
from serpapi import BingSearch
import json

params = {
    'api_key': '...',         # https://serpapi.com/manage-api-key
    'q': 'Coffee',            # search query
    'engine': 'bing_news',    # search engine
    'cc': 'US',               # country of the search
    'first': 1,               # pagination
    'count': 10,              # number of results per page
    'qft': 'interval="7"'     # news for past 24 hours
}

search = BingSearch(params)   # data extraction on the SerpApi backend
results = search.get_dict()   # JSON -> Python dict

bing_news_results = []

page_count = 0
page_limit = 5

while 'error' not in results and page_count < page_limit:
    bing_news_results.extend(results.get('organic_results', []))

    params['first'] += params['count']
    page_count += 1
    results = search.get_dict()

print(json.dumps(bing_news_results, indent=2, ensure_ascii=False))
Preparation
Install library:
pip install google-search-results
google-search-results is a SerpApi API package.
Code Explanation
Import libraries:
from serpapi import BingSearch
import json
Library | Purpose |
---|---|
BingSearch | to scrape and parse Bing results using SerpApi web scraping library. |
json | to convert extracted data to a JSON object. |
The parameters are defined for generating the URL. If you want to pass other parameters to the URL, you can do so using the params dictionary:
params = {
    'api_key': '...',         # https://serpapi.com/manage-api-key
    'q': 'Coffee',            # search query
    'engine': 'bing_news',    # search engine
    'cc': 'US',               # country of the search
    'first': 1,               # pagination
    'count': 10,              # number of results per page
    'qft': 'interval="7"'     # news for past 24 hours
}
Parameters | Explanation |
---|---|
api_key | Parameter defines the SerpApi private key to use. |
q | Parameter defines the search query. You can use anything that you would use in a regular Bing search (e.g., 'query', NOT, OR, site:, filetype:, near:, ip:, loc:, feed:, etc.). |
engine | Set parameter to bing_news to use the Bing News API engine. |
cc | Parameter defines the country to search from. It follows the 2-character ISO_3166-1 format (e.g., us for United States, de for Germany, gb for United Kingdom, etc.). |
first | Parameter controls the offset of the organic results. This parameter defaults to 1 (e.g., first=10 will move the 10th organic result to the first position). |
count | Parameter controls the number of results per page. This parameter is only a suggestion and might not reflect actual results returned. |
qft | Parameter defines results sorted by date. If the parameter is not set, it will default to the "Best match" sorting. It can be set to: interval="4" - Past hour, interval="7" - Past 24 hours, interval="8" - Past 7 days, interval="9" - Past 30 days, sortbydate="1" - Most Recent. |
📌Note: You can also add other API Parameters.
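As a small illustration of the qft values in the table, here is a minimal sketch of a helper that builds the params dictionary from a readable time-range name. The names QFT_FILTERS and build_params, and the keys such as 'past_7_days', are hypothetical and not part of SerpApi; only the parameter names and qft values come from the table above.

```python
# Hypothetical lookup: readable names -> qft values listed in the table above.
QFT_FILTERS = {
    'past_hour': 'interval="4"',
    'past_24_hours': 'interval="7"',
    'past_7_days': 'interval="8"',
    'past_30_days': 'interval="9"',
    'most_recent': 'sortbydate="1"',
}


def build_params(query, api_key, freshness=None, country='US', count=10):
    """Build a Bing News params dict (hypothetical helper).

    When freshness is None, qft is left out so the "Best match" default applies.
    """
    params = {
        'api_key': api_key,
        'q': query,
        'engine': 'bing_news',
        'cc': country,
        'first': 1,
        'count': count,
    }
    if freshness is not None:
        # Raises KeyError if the name is not one of the keys in QFT_FILTERS.
        params['qft'] = QFT_FILTERS[freshness]
    return params


params = build_params('Coffee', '...', freshness='past_7_days')
print(params['qft'])
```

The resulting dictionary has the same shape as the one used in this post, so it could be passed to BingSearch in the same way.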
Then we create a search object, which retrieves the data from the SerpApi backend. The get_dict() call returns the JSON response as a Python dictionary, which we store in results:
search = BingSearch(params) # data extraction on the SerpApi backend
results = search.get_dict() # JSON -> Python dict
Before extracting data, the bing_news_results list is created where this data will be added later:
bing_news_results = []
The page_limit variable defines the page limit. If you want to extract data from a different number of pages, then simply write the required number into this variable.
page_limit = 5
To get all results, you need to apply pagination. This is achieved by the following check: while there is no error in the results and the current page_count value is less than the specified page_limit value, we extract the data, increase the first parameter by the value of the count parameter to get the results from the next page, and update the results object with the new page data:
page_count = 0

while 'error' not in results and page_count < page_limit:
    # data extraction from current page will be here

    params['first'] += params['count']
    page_count += 1
    results = search.get_dict()
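To make the offset arithmetic easier to follow, here is a minimal sketch of the same pagination idea that runs without an API key. The functions collect_pages and fake_fetch are hypothetical, and so is the error text; fake_fetch only imitates the shape of a response (an organic_results list, or an error key) and is not a SerpApi call.

```python
def collect_pages(fetch_page, count=10, page_limit=5):
    """Collect organic results page by page.

    fetch_page is a hypothetical callable: it receives the `first` offset
    and returns a results dict shaped like the API response.
    """
    collected = []
    first = 1  # the offset of the first page
    for _ in range(page_limit):
        results = fetch_page(first)
        if 'error' in results:
            break  # stop as soon as a response contains an error
        collected.extend(results.get('organic_results', []))
        first += count  # offsets go 1, 11, 21, ... for count=10
    return collected


def fake_fetch(first):
    """Offline stand-in for the API: three pages of fake results, then an error."""
    if first > 21:
        return {'error': 'made-up error for illustration'}
    return {'organic_results': [{'position': first + i} for i in range(10)]}


news = collect_pages(fake_fetch)
print(len(news))  # 30
```

With the real library, fetch_page could be a small function that sets params['first'] and returns BingSearch(params).get_dict(). One difference from the while loop in this post is that this variant only requests a page right before using it, so it does not make a final request whose results are discarded.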
Extending the bing_news_results list with new data from each page:
bing_news_results.extend(results.get('organic_results', []))
# title = results['organic_results'][0]['title']
# link = results['organic_results'][0]['link']
# snippet = results['organic_results'][0]['snippet']
# source = results['organic_results'][0]['source']
# date = results['organic_results'][0]['date']
# thumbnail = results['organic_results'][0]['thumbnail']
📌Note: In the comments above, I showed how to extract specific fields. You may have noticed the results['organic_results'][0]. This is the index of an organic result, which means that we are extracting data from the first organic result. The results['organic_results'][1] is from the second organic result and so on.
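Instead of indexing results one at a time, the same fields can be pulled from every result in a loop. The following is a minimal sketch, assuming each result is a dictionary with the keys shown in the comments above. The function extract_fields is hypothetical, and the second sample item is made up to show what happens when a key is missing.

```python
def extract_fields(organic_results):
    """Keep a few fields from each result; a missing key becomes None."""
    fields = ('title', 'link', 'source', 'date')
    return [{name: item.get(name) for name in fields} for item in organic_results]


# Sample data: the first item is shortened from the output in this post,
# the second item is made up and deliberately lacks 'source' and 'date'.
sample = [
    {
        'title': 'Coffee drinkers get more steps but also less sleep, study finds',
        'link': 'https://www.cbsnews.com/sacramento/news/coffee-drinkers-get-more-steps-but-also-less-sleep-study-finds/',
        'source': 'CBS News',
        'date': '5h',
    },
    {'title': 'A made-up result without a date', 'link': 'https://example.com/'},
]

for row in extract_fields(sample):
    print(row['source'], '|', row['title'])
```

Using .get() rather than square brackets avoids a KeyError when a result lacks one of the fields, which may be useful if not every news item comes with, for example, a thumbnail.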
After all the data is retrieved, it is output in JSON format:
print(json.dumps(bing_news_results, indent=2, ensure_ascii=False))
Output
[
  {
    "title": "Coffee drinkers get more steps but also less sleep, study finds",
    "link": "https://www.cbsnews.com/sacramento/news/coffee-drinkers-get-more-steps-but-also-less-sleep-study-finds/",
    "snippet": "Coffee is one of the most consumed beverages worldwide, but the pendulum has swung back and forth about its benefits and ...",
    "source": "CBS News",
    "date": "5h",
    "thumbnail": "https://serpapi.com/searches/6421c51af26ac67a4930a857/images/85e52d2243238795454091b5f6b3f41e7f9969cd089f02e76ad21147c47e4f69.jpeg"
  },
  {
    "title": "Cedarburg School Board candidate accused of sexually harassing young workers at his former coffee shop",
    "link": "https://www.jsonline.com/story/news/education/2023/03/27/cedarburg-school-board-candidate-accused-of-sexual-harassment-coffee-shop/70039331007/",
    "snippet": "Former employees accused Scott Sidney, a Cedarburg School Board candidate, of inappropriate words and actions at his former ...",
    "source": "Milwaukee Journal Sentinel",
    "date": "5h",
    "thumbnail": "https://serpapi.com/searches/6421c51af26ac67a4930a857/images/85e52d2243238795454091b5f6b3f41e8897032186a0a3603444deba3cdd76a7.jpeg"
  },
  {
    "title": "New development near WoodTrust in Grand Rapids; new merch from local coffee shop | Streetwise",
    "link": "https://news.yahoo.com/development-near-woodtrust-grand-rapids-100428122.html",
    "snippet": "If you know of a new business, a development or a place that’s closing, send me a note at cshuda@gannett.com. If you have ...",
    "source": "YAHOO!News",
    "date": "6h",
    "thumbnail": "https://serpapi.com/searches/6421c51af26ac67a4930a857/images/85e52d2243238795454091b5f6b3f41e5f1483416a23fbf49adc15101a042c82.jpeg"
  },
  ... other news
]
📌Note: Head to the playground for a live and interactive demo.
Links
Add a Feature Request💫 or a Bug🐞