Posted by Scrapfly on DEV Community • Originally published at scrapfly.io

How to Scrape BestBuy Product, Offer and Review Data


In this article, we'll explain how to scrape BestBuy, one of the most popular electronics retailers in the United States. We'll scrape different data types from product, search, review, and sitemap pages. Along the way, we'll employ a wide range of web scraping techniques, such as hidden JSON data, hidden APIs, and HTML and XML parsing. So, this guide serves as a comprehensive web scraping introduction!

Latest BestBuy Scraper Code

This tutorial covers popular web scraping techniques for education. Interacting with public servers requires diligence and respect, and here's a good summary of what not to do:

  • Do not scrape at rates that could damage the website.
  • Do not scrape data that's not available publicly.
  • Do not store PII of EU citizens who are protected by GDPR.
  • Do not repurpose entire public datasets, which can be illegal in some countries.

Scrapfly does not offer legal advice, but these are good general rules to follow in web scraping; for more, you should consult a lawyer.

Why Scrape BestBuy?

Web scraping BestBuy provides access to a wealth of data that can empower both businesses and retail buyers in different ways:

  • Competitive Analysis

    The market dynamics are aggressive and fast-changing, making it challenging for businesses to remain competitive. Scraping BestBuy allows businesses to compare their competitors' pricing, sales, and reviews. This provides a better understanding of the current trends and interests to remain up-to-date and attract new customers.

  • Customer Sentiment Analysis

    BestBuy includes thousands of reviews for different products. Web scraping BestBuy's reviews can be used to run sentiment analysis research, which provides useful insights into customers' satisfaction, preferences, and feedback.

  • Empowered Navigation

    Manually browsing the excessive number of similar products on BestBuy can be tedious. On the other hand, retailers can web scrape BestBuy to compare many products quickly, allowing them to identify niche markets and undervalued products.

For further details, refer to our introduction on web scraping use cases.

Setup

To web scrape BestBuy, we'll use Python with a few community libraries:

  • httpx: To request BestBuy pages and get the data as HTML, XML, or JSON.
  • parsel: To parse the HTML and XML data using selectors, such as XPath and CSS.
  • JMESPath: To refine and parse the BestBuy JSON datasets for the useful data only.
  • loguru: To monitor and log our BestBuy scraper in beautiful terminal outputs.
  • asyncio: To increase the web scraping speed by running the code asynchronously.

Since asyncio comes pre-installed in Python, we'll only have to install the other packages using the following pip command:

pip install httpx parsel jmespath loguru

How To Discover BestBuy Pages?

Scraping sitemaps is an efficient way to discover thousands of organized URLs. Sitemaps are provided for search engine crawlers to index pages, and we can use them to discover web scraping targets on a website.

BestBuy's sitemaps can be found at bestbuy.com/robots.txt. It's a text file that provides crawling instructions along with the website's sitemap directory:

Sitemap: https://sitemaps.bestbuy.com/sitemaps_discover_learn.xml
Sitemap: https://sitemaps.bestbuy.com/sitemaps_pdp.xml
Sitemap: https://sitemaps.bestbuy.com/sitemaps_promos.xml
Sitemap: https://sitemaps.bestbuy.com/sitemaps_qna.xml
Sitemap: https://sitemaps.bestbuy.com/sitemaps_rnr.xml
Sitemap: https://sitemaps.bestbuy.com/sitemaps_search_plps.xml
Sitemap: https://sitemaps.bestbuy.com/sitemaps_standalone_qa.xml
Sitemap: https://www.bestbuy.com/sitemap.xml
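These Sitemap: directives can also be collected programmatically. Here's a minimal sketch (parse_robots_sitemaps is a hypothetical helper, not part of the final scraper) that filters them out of a robots.txt body:

```python
def parse_robots_sitemaps(robots_txt: str) -> list:
    """extract sitemap URLs from the Sitemap: directives of a robots.txt body"""
    sitemaps = []
    for line in robots_txt.splitlines():
        if line.lower().startswith("sitemap:"):
            # keep everything after the first colon, i.e. the URL itself
            sitemaps.append(line.split(":", 1)[1].strip())
    return sitemaps
```

The robots.txt body itself can be fetched with any HTTP client, e.g. `httpx.get("https://www.bestbuy.com/robots.txt").text`, and passed to this helper.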

Each of the above sitemaps groups related page URLs into XML files that are gzip-compressed to reduce their size:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap><loc>https://sitemaps.bestbuy.com/sitemaps_pdp.0000.xml.gz</loc><lastmod>2024-03-08T10:16:14.901109+00:00</lastmod></sitemap>
<sitemap><loc>https://sitemaps.bestbuy.com/sitemaps_pdp.0001.xml.gz</loc><lastmod>2024-03-08T10:16:14.901109+00:00</lastmod></sitemap>
</sitemapindex>

After extraction, each of the above .gz files contains a URL set like the following:

<?xml version='1.0' encoding='utf-8'?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xhtml="http://www.w3.org/1999/xhtml"><url><loc>https://www.bestbuy.com/site/aventon-aventure-step-over-ebike-w-45-mile-max-operating-range-and-28-mph-max-speed-medium-fire-black/6487149.p?skuId=6487149</loc></url>
<url><loc>https://www.bestbuy.com/site/detective-story-1951/34804554.p?skuId=34804554</loc></url>
<url><loc>https://www.bestbuy.com/site/flowers-lp-vinyl/35944053.p?skuId=35944053</loc></url>
<url><loc>https://www.bestbuy.com/site/apple-iphone-15-pro-max-1tb-natural-titanium-verizon/6525500.p?skuId=6525500</loc></url>
<url><loc>https://www.bestbuy.com/site/geeni-dual-outlet-outdoor-wi-fi-smart-plug-gray/6388590.p?skuId=6388590</loc></url>
<url><loc>https://www.bestbuy.com/site/dynasty-the-sixth-season-vol-1-4-discs-dvd/20139655.p?skuId=20139655</loc></url>

To scrape BestBuy's sitemaps, we'll request the compressed XML file, decode it, and parse it for the URLs. For this example, we'll use the promotions sitemap.

Python:

import asyncio
import json
import gzip
from typing import List
from httpx import AsyncClient, Response
from parsel import Selector
from loguru import logger as log

# initialize an async httpx client
client = AsyncClient(
    # enable http2
    http2=True,
    # add basic browser like headers to prevent getting blocked
    headers={
        "Accept-Language": "en-US,en;q=0.9",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate, br",
    },
)

def parse_sitemaps(response: Response) -> List[str]:
    """parse links for bestbuy sitemaps"""
    # decode the .gz file into its XML content
    xml = str(gzip.decompress(response.content), 'utf-8')
    selector = Selector(xml)
    data = []
    for url in selector.xpath("//url/loc/text()"):
        data.append(url.get())
    return data

async def scrape_sitemaps(url: str) -> List[str]:
    """scrape link data from bestbuy sitemaps"""
    response = await client.get(url)
    promo_urls = parse_sitemaps(response)
    log.success(f"scraped {len(promo_urls)} urls from sitemaps")    
    return promo_urls

ScrapFly:

import asyncio
import json
import gzip
from typing import List
from parsel import Selector
from loguru import logger as log
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse

SCRAPFLY = ScrapflyClient(key="Your ScrapFly API key")

def parse_sitemaps(response: ScrapeApiResponse) -> List[str]:
    """parse links for bestbuy sitemaps"""
    # decode the .gz file
    bytes_data = response.scrape_result['content'].getvalue()
    xml = str(gzip.decompress(bytes_data), 'utf-8')
    selector = Selector(xml)
    data = []
    for url in selector.xpath("//url/loc/text()"):
        data.append(url.get())
    return data

async def scrape_sitemaps(url: str) -> List[str]:
    """scrape link data from bestbuy sitemaps"""
    response = await SCRAPFLY.async_scrape(ScrapeConfig(url, country="US",))
    promo_urls = parse_sitemaps(response)
    log.success(f"scraped {len(promo_urls)} urls from sitemaps")
    return promo_urls


Run the code:

async def run():
    promo_urls = await scrape_sitemaps(
        url="https://sitemaps.bestbuy.com/sitemaps_promos.0000.xml.gz"
    )
    # save the data to a JSON file
    with open("promos.json", "w", encoding="utf-8") as file:
        json.dump(promo_urls, file, indent=2, ensure_ascii=False)

if __name__ == "__main__":
    asyncio.run(run())


In the above code, we define an httpx client with common browser headers to minimize the chances of getting blocked. Additionally, we define two functions; let's break them down:

  • scrape_sitemaps: To request the sitemap URL using the defined httpx client.
  • parse_sitemaps: To decode the gz file into its XML content and then parse the XML for the URLs using the XPath selector.
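The gzip decoding step can be illustrated in isolation with a round trip (the sample XML below is made up for demonstration):

```python
import gzip

# a tiny stand-in for a sitemap document
xml_bytes = b"<urlset><url><loc>https://www.bestbuy.com/site/example.p</loc></url></urlset>"

compressed = gzip.compress(xml_bytes)  # what the server ships as a .xml.gz file
xml = str(gzip.decompress(compressed), "utf-8")  # same decoding call as in parse_sitemaps

print(xml.startswith("<urlset>"))  # True
```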

Here is a sample output of the results we got:

[
  "https://www.bestbuy.com/site/promo/4k-capable-memory-cards",
  "https://www.bestbuy.com/site/promo/all-total-by-verizon",
  "https://www.bestbuy.com/site/promo/shop-featured-intel-evo",
  "https://www.bestbuy.com/site/promo/laser-heat-therapy",
  "https://www.bestbuy.com/site/promo/save-on-select-grills",
  ....
]

For further details on scraping and discovering sitemaps, refer to our dedicated guide.

How To Scrape BestBuy Search Pages?

Let's start with the first part of our BestBuy scraper code: search pages. Search for any product on the website, like the "macbook" keyword, and you will get a page that looks like the following:

Products on search pages

To scrape BestBuy search pages, we'll request the search page URL and then parse the HTML. First, let's start with the parsing logic.

Python:

def parse_search(response: Response) -> Dict:
    """parse search data from search pages"""
    selector = Selector(response.text)
    data = []
    for item in selector.xpath("//ol[@class='sku-item-list']/li[@class='sku-item']"):
        name = item.xpath(".//h4[@class='sku-title']/a/text()").get()
        link = item.xpath(".//h4[@class='sku-title']/a/@href").get()
        price = item.xpath(".//div[@data-testid='customer-price']/span/text()").get()
        price = int(price[price.index("$") + 1:].replace(",", "").replace(".", "")) // 100 if price else None
        original_price = item.xpath(".//div[@data-testid='regular-price']/span/text()").get()
        original_price = int(original_price[original_price.index("$") + 1:].replace(",", "").replace(".", "")) // 100 if original_price else None
        sku = item.xpath(".//div[@class='sku-model']/div[2]/span[@class='sku-value']/text()").get()
        model = item.xpath(".//div[@class='sku-model']/div[1]/span[@class='sku-value']/text()").get()
        rating = item.xpath(".//p[contains(text(),'out of 5')]/text()").get()
        rating_count = item.xpath(".//span[contains(@class,'c-reviews')]/text()").get()
        is_sold_out = bool(item.xpath(".//strong[text()='Sold Out']").get())
        image = item.xpath(".//img[contains(@class,'product-image')]/@src").get()

        data.append({
            "name": name,
            "link": "https://www.bestbuy.com" + link,
            "image": image,
            "sku": sku,
            "model": model,
            "price": price,
            "original_price": original_price,
            "save": f"{round((1 - price / original_price) * 100, 2):.2f}%" if price and original_price else None,
            "rating": float(rating[rating.index(" "):rating.index(" out")].strip()) if rating else None,
            "rating_count": int(rating_count.replace("(", "").replace(")", "").replace(",", "")) if rating_count and rating_count != "Not Yet Reviewed" else None,
            "is_sold_out": is_sold_out,
        })
    total_count = selector.xpath("//span[@class='item-count']/text()").get()
    total_count = int(total_count.split(" ")[0]) // 18  # convert the total item count to a page count (18 items per page)

    return {"data": data, "total_count": total_count}

ScrapFly:

def parse_search(response: ScrapeApiResponse) -> Dict:
    """parse search data from search pages"""
    selector = response.selector
    data = []
    for item in selector.xpath("//ol[@class='sku-item-list']/li[@class='sku-item']"):
        name = item.xpath(".//h4[@class='sku-title']/a/text()").get()
        link = item.xpath(".//h4[@class='sku-title']/a/@href").get()
        price = item.xpath(".//div[@data-testid='customer-price']/span/text()").get()
        price = int(price[price.index("$") + 1:].replace(",", "").replace(".", "")) // 100 if price else None
        original_price = item.xpath(".//div[@data-testid='regular-price']/span/text()").get()
        original_price = int(original_price[original_price.index("$") + 1:].replace(",", "").replace(".", "")) // 100 if original_price else None
        sku = item.xpath(".//div[@class='sku-model']/div[2]/span[@class='sku-value']/text()").get()
        model = item.xpath(".//div[@class='sku-model']/div[1]/span[@class='sku-value']/text()").get()
        rating = item.xpath(".//p[contains(text(),'out of 5')]/text()").get()
        rating_count = item.xpath(".//span[contains(@class,'c-reviews')]/text()").get()
        is_sold_out = bool(item.xpath(".//strong[text()='Sold Out']").get())
        image = item.xpath(".//img[contains(@class,'product-image')]/@src").get()

        data.append({
            "name": name,
            "link": "https://www.bestbuy.com" + link,
            "image": image,
            "sku": sku,
            "model": model,
            "price": price,
            "original_price": original_price,
            "save": f"{round((1 - price / original_price) * 100, 2):.2f}%" if price and original_price else None,
            "rating": float(rating[rating.index(" "):rating.index(" out")].strip()) if rating else None,
            "rating_count": int(rating_count.replace("(", "").replace(")", "").replace(",", "")) if rating_count and rating_count != "Not Yet Reviewed" else None,
            "is_sold_out": is_sold_out,
        })
    total_count = selector.xpath("//span[@class='item-count']/text()").get()
    total_count = int(total_count.split(" ")[0]) // 18  # convert the total item count to a page count (18 items per page)

    return {"data": data, "total_count": total_count}

Here, we define a parse_search function, which does the following:

  • Iterates over the product boxes on the HTML.
  • Parses each product's data, such as the name, price, link, etc.
  • Gets the total number of search pages available and returns the search data.
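The one-line price expressions above are dense, so here is the same logic unpacked into a standalone helper (hypothetical, for illustration only): slice off everything before the "$", strip the thousands separators and the decimal point, then integer-divide by 100 to drop the cents:

```python
from typing import Optional

def parse_price(price_text: Optional[str]) -> Optional[int]:
    """convert a price label like '$1,499.99' to whole dollars, mirroring the parsing above"""
    if not price_text or "$" not in price_text:
        return None
    # slice off everything before the "$", then remove "," and "."
    digits = price_text[price_text.index("$") + 1:].replace(",", "").replace(".", "")
    return int(digits) // 100  # integer-divide by 100 to drop the cents

print(parse_price("Your price for this item is $1,499.99"))  # 1499
```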

Next, we'll utilize the above parsing logic while sending requests to scrape and crawl the search pages.

Python:

import asyncio
import json
import urllib.parse
from typing import List, Dict, Union
from httpx import AsyncClient, Response
from parsel import Selector
from urllib.parse import urlencode, quote_plus
from loguru import logger as log

# initialize an async httpx client
client = AsyncClient(
    # enable http2
    http2=True,
    # add basic browser like headers to prevent getting blocked
    headers={
        "Accept-Language": "en-US,en;q=0.9",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate, br",
        "Cookie": "intl_splash=false"
    },
)

def parse_search(response: Response):
    """parse search data from search pages"""
    # rest of the function logic

async def scrape_search(
        search_query: str, sort: str = None, max_pages: int = None
        ) -> List[Dict]:
    """scrape search data from bestbuy search (sort: "-bestsellingsort" or "-Best-Discount")"""

    def form_search_url(page_number: int):
        """form the search url"""
        base_url = "https://www.bestbuy.com/site/searchpage.jsp?"
        # search parameters
        params = {
            "st": search_query,  # urlencode quotes the query itself
            "cp": page_number,
        }
        if sort:  # None = best match, so omit the sp parameter entirely
            params["sp"] = sort
        return base_url + urlencode(params)

    first_page = await client.get(form_search_url(1))
    data = parse_search(first_page)
    search_data = data["data"]
    total_count = data["total_count"]

    # get the number of total search pages to scrape
    if max_pages and max_pages < total_count:
        total_count = max_pages

    log.info(f"scraping search pagination, {total_count - 1} more pages")
    # add the remaining pages to a scraping list to scrape them concurrently
    to_scrape = [
        client.get(form_search_url(page_number))
        for page_number in range(2, total_count + 1)
    ]
    for response in asyncio.as_completed(to_scrape):
        response = await response
        data = parse_search(response)["data"]
        search_data.extend(data)

    log.success(f"scraped {len(search_data)} products from search pages")
    return search_data

ScrapFly:

import asyncio
import json
from typing import Dict, List, Union
from urllib.parse import urlencode, quote_plus
from loguru import logger as log
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse

SCRAPFLY = ScrapflyClient(key="Your ScrapFly API key")

def parse_search(response: ScrapeApiResponse) -> Dict:
    """parse search data from search pages"""
    # rest of the function logic

async def scrape_search(
        search_query: str, sort: str = None, max_pages: int = None
    ) -> List[Dict]:
    """scrape search data from bestbuy search (sort: "-bestsellingsort" or "-Best-Discount")"""

    def form_search_url(page_number: int):
        """form the search url"""
        base_url = "https://www.bestbuy.com/site/searchpage.jsp?"
        # search parameters
        params = {
            "st": search_query,  # urlencode quotes the query itself
            "cp": page_number,
        }
        if sort:  # None = best match, so omit the sp parameter entirely
            params["sp"] = sort
        return base_url + urlencode(params)

    first_page = await SCRAPFLY.async_scrape(ScrapeConfig(form_search_url(1), country="US", asp=True))
    data = parse_search(first_page)
    search_data = data["data"]
    total_count = data["total_count"]

    # get the number of total search pages to scrape
    if max_pages and max_pages < total_count:
        total_count = max_pages

    log.info(f"scraping search pagination, {total_count - 1} more pages")
    # add the remaining pages to a scraping list to scrape them concurrently
    to_scrape = [
        ScrapeConfig(form_search_url(page_number), country="US", asp=True)
        for page_number in range(2, total_count + 1)
    ]
    async for response in SCRAPFLY.concurrent_scrape(to_scrape):
        data = parse_search(response)["data"]
        search_data.extend(data)

    log.success(f"scraped {len(search_data)} products from search pages")
    return search_data

Run the code:

async def run():
    search_data = await scrape_search(
        search_query="macbook",
        max_pages=3
    )
    # save the results to a JSON file
    with open("search.json", "w", encoding="utf-8") as file:
        json.dump(search_data, file, indent=2, ensure_ascii=False)    

if __name__ == "__main__":
    asyncio.run(run())

Let's break down the execution flow of the above scrape_search function:

  • Form a search URL based on the search keyword, sorting option, and page number.
  • Request the search URL and parse it with the parse_search function.
  • Get the number of pagination pages to scrape using the max_pages parameter.
  • Add the remaining pagination URLs to a list and request them concurrently.
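As a side note on forming the URL, Python's urlencode already percent-encodes the parameter values (spaces become "+"), so the query string doesn't need to be pre-quoted:

```python
from urllib.parse import urlencode

base_url = "https://www.bestbuy.com/site/searchpage.jsp?"
params = {"st": "macbook pro", "cp": 2}  # page 2 of a "macbook pro" search
url = base_url + urlencode(params)  # urlencode escapes the space as "+"

print(url)  # https://www.bestbuy.com/site/searchpage.jsp?st=macbook+pro&cp=2
```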

The above BestBuy scraping code will extract product data from three search pages. Here is what the results should look like:

[
  {
    "name": "MacBook Pro 13.3\" Laptop - Apple M2 chip - 24GB Memory - 1TB SSD (Latest Model) - Silver",
    "link": "https://www.bestbuy.com/site/macbook-pro-13-3-laptop-apple-m2-chip-24gb-memory-1tb-ssd-latest-model-silver/6382795.p?skuId=6382795",
    "image": "https://pisces.bbystatic.com/image2/BestBuy_US/images/products/6382/6382795_sd.jpg;maxHeight=200;maxWidth=300",
    "sku": "6382795",
    "model": "MNEX3LL/A",
    "price": 1499,
    "original_price": 2099,
    "save": "28.59%",
    "rating": 4.8,
    "rating_count": 4,
    "is_sold_out": false
  },
  ....
]

The above code can scrape the product data that is visible on the search pages. However, it can be extended with crawling logic to scrape the full details of each product from its respective URL. For further details on crawling while scraping, refer to our dedicated guide.

How To Scrape BestBuy Product Pages?

Let's add support for scraping product pages to our BestBuy scraper. Before we start, let's have a look at what product pages look like. Go to any product page on the website, like this one, and you will get a page similar to this:

Product pages on BestBuy

Data on product pages is comprehensive and scattered across the page, making it challenging to scrape with selectors alone. Instead, we'll extract it as JSON datasets from script tags. To locate these script tags, follow the below steps:

  • Open the browser developer tools by pressing the F12 key.
  • Search for the script tags using the selector //script[@type='application/json'].

After following the above steps, you will find several script tags that include JSON data. However, we are only interested in a few of them:

BestBuy page source

The above JSON data is the same data seen on the page, but in its raw state before it gets rendered into the HTML. This is often known as hidden web data.

To scrape the product data, we will select the script tags containing the JSON data and parse them.

Python:

import jmespath
import asyncio
import json
from typing import List, Dict
from httpx import AsyncClient, Response
from parsel import Selector
from loguru import logger as log

# initialize an async httpx client
client = AsyncClient(
    # enable http2
    http2=True,
    # add basic browser like headers to prevent getting blocked
    headers={
        "Accept-Language": "en-US,en;q=0.9",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate, br",
        "Cookie": "intl_splash=false"
    },
)

def refine_product(data: Dict) -> Dict:
    """refine the JSON product data"""
    parsed_product = {}
    specifications = data["shop-specifications"]["specifications"]["categories"]
    pricing = data["pricing"]["app"]["data"]["skuPriceDomain"]
    ratings = jmespath.search(
        """{
        featureRatings: aggregateSecondaryRatings,
        positiveFeatures: distillation.positiveFeatures[].{name: name, score: representativeQuote.score, totalReviewCount: totalReviewCount},
        negativeFeatures: distillation.negativeFeatures[].{name: name, score: representativeQuote.score, totalReviewCount: totalReviewCount}
        }""",
        data["reviews"]["app"],
    )
    faqs = []
    for item in data["faqs"]["app"]["questions"]["results"]:
        result = jmespath.search(
            """{
            sku: sku,
            questionTitle: questionTitle,
            answersForQuestion: answersForQuestion[].answerText
            }""",
            item,
        )
        faqs.append(result)

    # define the final parsed product
    parsed_product["specifications"] = specifications
    parsed_product["pricing"] = pricing
    parsed_product["ratings"] = ratings
    parsed_product["faqs"] = faqs

    return parsed_product

def parse_product(response: Response) -> Dict:
    """parse product data from bestbuy product pages"""
    selector = Selector(response.text)
    data = {}
    data["shop-specifications"] = json.loads(selector.xpath("//script[contains(@id, 'shop-specifications')]/text()").get())
    data["faqs"] = json.loads(selector.xpath("//script[contains(@id, 'content-question')]/text()").get())
    data["pricing"] = json.loads(selector.xpath("//script[contains(@id, 'pricing-price')]/text()").get())
    data["reviews"] = json.loads(selector.xpath("//script[contains(@id, 'ratings-and-reviews')]/text()").get())

    parsed_product = refine_product(data)
    return parsed_product

async def scrape_products(urls: List[str]) -> List[Dict]:
    """scrapy product data from bestbuy product pages"""
    to_scrape = [client.get(url) for url in urls]
    data = []
    for response in asyncio.as_completed(to_scrape):
        response = await response
        product_data = parse_product(response)
        data.append(product_data)
    log.success(f"scraped {len(data)} products from product pages")
    return data

ScrapFly:

import json
import jmespath
from typing import Dict, List
from loguru import logger as log
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse

SCRAPFLY = ScrapflyClient(key="Your ScrapFly API key")

def refine_product(data: Dict) -> Dict:
    """refine the JSON product data"""
    parsed_product = {}
    specifications = data["shop-specifications"]["specifications"]["categories"]
    pricing = data["pricing"]["app"]["data"]["skuPriceDomain"]
    ratings = jmespath.search(
        """{
        featureRatings: aggregateSecondaryRatings,
        positiveFeatures: distillation.positiveFeatures[].{name: name, score: representativeQuote.score, totalReviewCount: totalReviewCount},
        negativeFeatures: distillation.negativeFeatures[].{name: name, score: representativeQuote.score, totalReviewCount: totalReviewCount}
        }""",
        data["reviews"]["app"],
    )
    faqs = []
    for item in data["faqs"]["app"]["questions"]["results"]:
        result = jmespath.search(
            """{
            sku: sku,
            questionTitle: questionTitle,
            answersForQuestion: answersForQuestion[].answerText
            }""",
            item,
        )
        faqs.append(result)

    # define the final parsed product
    parsed_product["specifications"] = specifications
    parsed_product["pricing"] = pricing
    parsed_product["ratings"] = ratings
    parsed_product["faqs"] = faqs

    return parsed_product

def parse_product(response: ScrapeApiResponse) -> Dict:
    """parse product data from bestbuy product pages"""
    selector = response.selector
    data = {}
    data["shop-specifications"] = json.loads(selector.xpath("//script[contains(@id, 'shop-specifications')]/text()").get())
    data["faqs"] = json.loads(selector.xpath("//script[contains(@id, 'content-question')]/text()").get())
    data["pricing"] = json.loads(selector.xpath("//script[contains(@id, 'pricing-price')]/text()").get())
    data["reviews"] = json.loads(selector.xpath("//script[contains(@id, 'ratings-and-reviews')]/text()").get())

    parsed_product = refine_product(data)
    return parsed_product

async def scrape_products(urls: List[str]) -> List[Dict]:
    """scrapy product data from bestbuy product pages"""
    to_scrape = [ScrapeConfig(url, country="US", asp=True) for url in urls]
    data = []
    async for response in SCRAPFLY.concurrent_scrape(to_scrape):
        product_data = parse_product(response)
        data.append(product_data)
    log.success(f"scraped {len(data)} products from product pages")
    return data

Run the code:

async def run():
    data = await scrape_products(
        urls=[
            "https://www.bestbuy.com/site/macbook-air-13-3-laptop-apple-m1-chip-8gb-memory-256gb-ssd-gold-gold/6418599.p",
            "https://www.bestbuy.com/site/apple-macbook-air-15-laptop-m2-chip-8gb-memory-256gb-ssd-midnight/6534606.p",
            "https://www.bestbuy.com/site/macbook-pro-13-3-laptop-apple-m2-chip-8gb-memory-256gb-ssd-latest-model-silver/6509654.p"
        ]
    )
    with open("product.json", "w", encoding="utf-8") as file:
        json.dump(data, file, indent=2, ensure_ascii=False)

if __name__ == "__main__":
    asyncio.run(run())

Let's break down the above BestBuy scraping code:

  • refine_product: To refine the product JSON datasets with JMESPath, excluding unnecessary fields and keeping only the useful ones.
  • parse_product: To parse the product's hidden JSON data from the HTML with XPath.
  • scrape_products: To request the product page URLs concurrently and parse the HTML with the parse_product function.

The output is a comprehensive JSON dataset that looks like the following:

[
  {
    "specifications": [
      {
        "displayName": "Key Specs",
        "specifications": [
          {
            "displayName": "Screen Size",
            "value": "13.3 inches",
            "definition": "Size of the screen, measured diagonally from corner to corner.",
            "id": "TQqJBgOyVv"
          }
          ....
        ]
      },
      ....
    ],
    "pricing": {
      "skuId": "6418599",
      "regularPrice": 999.99,
      "currentPrice": 999.99,
      "priceEventType": "regular",
      "totalSavings": 0,
      "totalSavingsPercent": 0,
      "totalPaidMemberSavings": 0,
      "totalNonPaidMemberSavings": 0,
      "customerPrice": 999.99,
      "isMAP": false,
      "isPriceMatchGuarantee": true,
      "offerQualifiers": [
        {
          "offerId": "634974",
          "offerName": "Apple - Apple Music 3 Month Trial GWP",
          "offerVersion": 662398,
          "offerDiscountType": "Free",
          "id": 634974002,
          "comOfferType": "FREEITEM",
          "comRuleType": "10",
          "instanceId": 5,
          "offerRevocableOnReturns": true,
          "excludeFromBundleBreakage": false
        },
        ....
      ],
      "giftSkus": [
        {
          "skuId": "6484511",
          "quantity": 1,
          "offerId": "465099",
          "savings": 0,
          "isRequiredWithOffer": false
        },
        ....
      ],
      "totalGiftSavings": 0,
      "gspUnitPrice": 999.99,
      "financeOption": {
        "offerId": "384913",
        "financeCodeName": "12-Month Financing",
        "financeCode": 7,
        "rank": 8,
        "financeTerm": 12,
        "monthlyPayment": 83.34,
        "monthlyPaymentIncludingTax": 83.34,
        "defaultPlan": true,
        "priority": 1,
        "planType": "Deferred",
        "rate": 0,
        "totalCost": 999.99,
        "termsAndConditions": "NO INTEREST IF PAID IN FULL WITHIN 12 MONTHS. If the deferred interest balance is not paid in full by the end of the promotional period, interest will be charged from the purchase date at rates otherwise applicable under your Card Agreement. Min. payments required. See Card Agreement for details.",
        "totalCostIncludingTax": 999.99,
        "financeCodeDescLong": "No interest if paid in full within 12 months (no points)"
      },
      "financeOptions": [
        {
          "offerId": "384913",
          "financeCodeName": "12-Month Financing",
          "financeCode": 7,
          "rank": 8,
          "financeTerm": 12,
          "monthlyPayment": 83.34,
          "monthlyPaymentIncludingTax": 83.34,
          "defaultPlan": true,
          "priority": 1,
          "planType": "Deferred",
          "rate": 0,
          "totalCost": 999.99,
          "termsAndConditions": "NO INTEREST IF PAID IN FULL WITHIN 12 MONTHS. If the deferred interest balance is not paid in full by the end of the promotional period, interest will be charged from the purchase date at rates otherwise applicable under your Card Agreement. Min. payments required. See Card Agreement for details.",
          "totalCostIncludingTax": 999.99,
          "financeCodeDescLong": "No interest if paid in full within 12 months (no points)"
        }
      ],
      ....
    },
    "ratings": {
      "featureRatings": [
        {
          "attribute": "BatteryLife",
          "attributeLabel": "Battery Life",
          "avg": 4.856636035826451,
          "count": 17194
        },
        ....
      ],
      "positiveFeatures": [
        {
          "name": "Speed",
          "score": 4,
          "totalReviewCount": 2386
        },
        ....
      ],
      "negativeFeatures": [
        {
          "name": "Touch screen",
          "score": 16,
          "totalReviewCount": 168
        },
        ....
      ]
    },
    "faqs": [
      {
        "sku": "6418599",
        "questionTitle": "Does this MacBook have a built-in HDMI port?",
        "answersForQuestion": [
          "No. It has 2 Thunderbolt 3 ports that you can get an adapter for to give you HDMI.",
          "No. However, you can connect your MacBook Air to HDMI using the a USB-C Digital AV Multiport Adapter. (sold separately)",
          "I am afraid not for Mac book air and pro m1 2020 it has only the thunderbolts 2 points"
        ]
      },
      ....
    ]
  }
]

🙋‍ Note that the HTML structure of the BestBuy product pages differs based on product type and category. Therefore, the above product parsing logic should be adjusted for other product types.
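Because the structure varies, it helps to parse the hidden JSON defensively rather than indexing fields directly. Below is a minimal sketch (the `safe_get` helper is illustrative, and the field names are taken from the sample output above) that falls back to a default instead of raising `KeyError` when a field is missing:

```python
def safe_get(data, *keys, default=None):
    """Walk nested dicts, returning `default` if any key is missing."""
    for key in keys:
        if not isinstance(data, dict) or key not in data:
            return default
        data = data[key]
    return data

# a trimmed product record shaped like the sample output above
product = {
    "ratings": {
        "featureRatings": [{"attribute": "BatteryLife", "avg": 4.86, "count": 17194}]
    }
}

print(safe_get(product, "ratings", "featureRatings", default=[]))
print(safe_get(product, "price", "currentPrice", default="N/A"))  # missing field -> "N/A"
```

This way, a product page that lacks one of the expected sections yields a default value instead of crashing the whole scrape job.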

Cool! The above BestBuy scraping code can extract the full details of each product. However, it lacks the product reviews - let's scrape them in the next section!

How to Scrape BestBuy Review Pages?

Reviews on BestBuy can be found on each product page:

[Image: Review data on BestBuy]

The above review data are split into two categories:

  • Product ratings

    Aggregate rating data embedded in each product's specification, which we scraped earlier from the product page itself.

  • User reviews

    Detailed user reviews of the product, which we'll scrape in this section.

To scrape BestBuy reviews, we'll utilize the hidden reviews API. To locate this API, follow the below steps:

  • Open the browser developer tools by pressing the F12 key.
  • Select the network tab and filter by Fetch/XHR requests.
  • Change the review sort order or click through to the next review page to trigger a background request.

After following the above steps, you will find the reviews API requests recorded in the network tab:

[Image: Reviews hidden API]

The above API is called by the browser in the background, and its response is then rendered into HTML. The request can be copied as cURL and imported into HTTP clients like Postman.
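For reference, the endpoint takes the product SKU, page number, page size, and sort order as query parameters, as observed in the captured request. A small helper can assemble the URL (`reviews_api_url` is a hypothetical name used here for illustration):

```python
from urllib.parse import urlencode

def reviews_api_url(sku: int, page: int = 1, page_size: int = 20, sort: str = "MOST_RECENT") -> str:
    """Build a BestBuy hidden reviews API URL from the observed query parameters."""
    params = {"page": page, "pageSize": page_size, "sku": sku, "sort": sort}
    return "https://www.bestbuy.com/ugc/v2/reviews?" + urlencode(params)

print(reviews_api_url(6565065, page=2))
```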

To scrape the product reviews, we'll request the above API and paginate it.

Python:

import asyncio
import json
from typing import List, Dict
from httpx import AsyncClient, Response
from loguru import logger as log

# initialize an async httpx client
client = AsyncClient(
    # enable http2
    http2=True,
    # add basic browser like headers to prevent getting blocked
    headers={
        "Accept-Language": "en-US,en;q=0.9",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate, br",
        "Cookie": "intl_splash=false"
    },
)

def parse_reviews(response: Response) -> Dict:
    """parse review data from a reviews API response"""
    data = json.loads(response.text)
    total_count = data["totalPages"]
    review_data = data["topics"]
    return {"data": review_data, "total_count": total_count}

async def scrape_reviews(skuid: int, max_pages: int=None) -> List[Dict]:
    """scrape review data from the reviews API"""
    first_page = await client.get(f"https://www.bestbuy.com/ugc/v2/reviews?page=1&pageSize=20&sku={skuid}&sort=MOST_RECENT")
    data = parse_reviews(first_page)
    review_data = data["data"]
    total_count = data["total_count"]

    # get the number of total review pages to scrape
    if max_pages and max_pages < total_count:
        total_count = max_pages

    log.info(f"scraping reviews pagination, {total_count - 1} more pages")
    # add the remaining pages to a scraping list to scrape them concurrently
    to_scrape = [
        client.get(f"https://www.bestbuy.com/ugc/v2/reviews?page={page_number}&pageSize=20&sku={skuid}&sort=MOST_RECENT")
        for page_number in range(2, total_count + 1)
    ]
    for response in asyncio.as_completed(to_scrape):
        response = await response
        data = parse_reviews(response)["data"]
        review_data.extend(data)

    log.success(f"scraped {len(review_data)} reviews from the reviews API")
    return review_data

Python:

import asyncio
import json
from typing import Dict, List
from loguru import logger as log
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse

SCRAPFLY = ScrapflyClient(key="Your ScrapFly API key")

def parse_reviews(response: ScrapeApiResponse) -> Dict:
    """parse review data from a reviews API response"""
    data = json.loads(response.scrape_result['content'])
    total_count = data["totalPages"]
    review_data = data["topics"]
    return {"data": review_data, "total_count": total_count}

async def scrape_reviews(skuid: int, max_pages: int=None) -> List[Dict]:
    """scrape review data from the reviews API"""
    first_page = await SCRAPFLY.async_scrape(ScrapeConfig(
        f"https://www.bestbuy.com/ugc/v2/reviews?page=1&pageSize=20&sku={skuid}&sort=MOST_RECENT",
        asp=True, country="US"
    ))
    data = parse_reviews(first_page)
    review_data = data["data"]
    total_count = data["total_count"]

    # get the number of total review pages to scrape
    if max_pages and max_pages < total_count:
        total_count = max_pages

    log.info(f"scraping reviews pagination, {total_count - 1} more pages")
    # add the remaining pages to a scraping list to scrape them concurrently
    to_scrape = [
        ScrapeConfig(
            f"https://www.bestbuy.com/ugc/v2/reviews?page={page_number}&pageSize=20&sku={skuid}&sort=MOST_RECENT",
            asp=True, country="US"
        )
        for page_number in range(2, total_count + 1)
    ]
    async for response in SCRAPFLY.concurrent_scrape(to_scrape):
        data = parse_reviews(response)["data"]
        review_data.extend(data)

    log.success(f"scraped {len(review_data)} reviews from the reviews API")
    return review_data

The above part of our BestBuy scraper is fairly straightforward. We only use two functions:

  • scrape_reviews: Requests the reviews API, which accepts the product SKU ID, sorting option, and page number. It requests the first page, then adds the remaining API URLs to a scraping list to request them concurrently.
  • parse_reviews: Parses the JSON response of the reviews API. The response contains various review data types, but the function only extracts the user reviews and the total page count.
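The page-capping behavior inside `scrape_reviews` can be sketched in isolation like so (`remaining_pages` is an illustrative helper, not part of the scraper above): the first page is always fetched to discover the total, and the rest are enumerated up to an optional cap.

```python
from typing import List, Optional

def remaining_pages(total_pages: int, max_pages: Optional[int] = None) -> List[int]:
    """Pages left to fetch after the first one, optionally capped at max_pages."""
    if max_pages is not None and max_pages < total_pages:
        total_pages = max_pages
    return list(range(2, total_pages + 1))

print(remaining_pages(5))                # [2, 3, 4, 5]
print(remaining_pages(50, max_pages=3))  # [2, 3]
```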

Here is a sample output of the above BestBuy scraping code:

[
  {
    "id": "6b88383f-3830-3c78-915c-d3cf9f16596d",
    "topicType": "review",
    "rating": 5,
    "recommended": true,
    "title": "Amazing!",
    "text": "An absolutly amazing console very fast and smooth.",
    "author": "CocaNoot",
    "positiveFeedbackCount": 0,
    "negativeFeedbackCount": 0,
    "commentCount": 0,
    "writeCommentUrl": "/site/reviews/submission/6565065/review/337294210?campaignid=RR_&return=",
    "submissionTime": "2024-03-02T10:52:07.000-06:00",
    "brandResponses": [],
    "badges": [
      {
        "badgeCode": "Incentivized",
        "badgeDescription": "This reviewer received promo considerations or sweepstakes entry for writing a review.",
        "badgeName": "Incentivized",
        "badgeType": "Custom",
        "fileName": null,
        "iconText": null,
        "iconPath": null,
        "index": 90900
      },
      {
        "badgeCode": "VerifiedPurchaser",
        "badgeDescription": "We’ve verified that this content was written by people who purchased this item at Best Buy.",
        "badgeName": "Verified Purchaser",
        "badgeType": "Custom",
        "fileName": "badgeContextual-verifiedPurchaser.jpg",
        "imageURL": "https://bestbuy.ugc.bazaarvoice.com/static/3545w/badgeContextual-verifiedPurchaser.jpg",
        "iconText": "Verified Purchase",
        "iconPath": "/ugc-raas/ugc-common-assets/ugc-badge-verified-check.svg",
        "index": 100000,
        "iconUrl": "https://www.bestbuy.com/~assets/bby/_com/ugc-raas/ugc-common-assets/ugc-badge-verified-check.svg"
      },
      {
        "badgeCode": "rewardZoneNumberV3",
        "badgeDescription": "My Best Buy members receive promotional considerations or entries into drawings for writing reviews.",
        "badgeName": "My Best Buy® Member",
        "badgeType": "Custom",
        "fileName": "badgeRewardZoneStd.gif",
        "imageURL": "https://bestbuy.ugc.bazaarvoice.com/static/3545w/badgeRewardZoneStd.gif",
        "iconText": "",
        "iconPath": "/ugc-raas/ugc-common-assets/badge-my-bestbuy-core.svg",
        "index": 100500,
        "iconUrl": "https://www.bestbuy.com/~assets/bby/_com/ugc-raas/ugc-common-assets/badge-my-bestbuy-core.svg"
      }
    ],
    "photos": [
      {
        "photoId": "008b1a1e-ba1b-38ea-b86e-effb7c0ca162",
        "caption": null,
        "normalUrl": "https://photos-us.bazaarvoice.com/photo/2/cGhvdG86YmVzdGJ1eQ/e79a5ff1-e891-57fa-ae03-e9f52bb4d7c4",
        "piscesUrl": "https://pisces.bbystatic.com/image2/BestBuy_US/ugc/photos/thumbnail/8db68b60f7a60bcea8f6cd1470938da9.jpg",
        "thumbnailUrl": "https://photos-us.bazaarvoice.com/photo/2/cGhvdG86YmVzdGJ1eQ/bd287ee8-1c8b-52ae-9c12-4a379d7ecb24",
        "reviewId": "6b88383f-3830-3c78-915c-d3cf9f16596d"
      }
    ],
    "qualityRating": null,
    "valueRating": null,
    "easeOfUseRating": null,
    "daysOfOwnership": 70,
    "pros": null,
    "cons": null,
    "secondaryRatings": [
      {
        "attribute": "Performance",
        "value": 5,
        "attributeLabel": "Performance",
        "valueLabel": "Excellent"
      },
      {
        "attribute": "StorageCapacity",
        "value": 5,
        "attributeLabel": "Storage Capacity",
        "valueLabel": "Excellent"
      },
      {
        "attribute": "Controller",
        "value": 5,
        "attributeLabel": "Controller",
        "valueLabel": "Excellent"
      }
    ]
  },
  ....  
]
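Since the scraped records are plain dictionaries, follow-up analysis is straightforward. Here is a minimal sketch (using trimmed records shaped like the sample output above) that computes the average rating, the number of verified purchasers, and the recommendation rate:

```python
from statistics import mean

# trimmed review records shaped like the sample output above
reviews = [
    {"rating": 5, "recommended": True, "badges": [{"badgeCode": "VerifiedPurchaser"}]},
    {"rating": 4, "recommended": True, "badges": []},
    {"rating": 2, "recommended": False, "badges": [{"badgeCode": "Incentivized"}]},
]

avg_rating = mean(r["rating"] for r in reviews)
verified = sum(
    any(b["badgeCode"] == "VerifiedPurchaser" for b in r["badges"]) for r in reviews
)
recommend_rate = sum(r["recommended"] for r in reviews) / len(reviews)

print(round(avg_rating, 2), verified, round(recommend_rate, 2))
```

This kind of aggregation is the starting point for the sentiment analysis use case mentioned at the beginning of the article.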

With this last feature, our BestBuy scraper is complete. It can scrape sitemaps, search, product, and review data.

Avoid BestBuy Scraping Blocking

We have successfully scraped BestBuy data from various pages. However, attempting to scale the scraping rate will lead the website to block our IP address. To get around this, we'll use ScrapFly, a web scraping API that enables scraping at scale by providing anti-bot protection bypass, proxy rotation, and JavaScript rendering:

[Image: ScrapFly service does the heavy lifting for you!]

Here is how we can scrape without getting blocked with ScrapFly. All we have to do is replace the HTTP client with the ScrapFly client, enable the asp parameter, and select a proxy country:

# standard web scraping code
import httpx
from parsel import Selector

response = httpx.get("some bestbuy.com URL")
selector = Selector(response.text)

# in ScrapFly becomes this 👇
from scrapfly import ScrapeConfig, ScrapflyClient

# replaces your HTTP client (httpx in this case)
scrapfly = ScrapflyClient(key="Your ScrapFly API key")

response = scrapfly.scrape(ScrapeConfig(
    url="website URL",
    asp=True, # enable the anti scraping protection to bypass blocking
    country="US", # set the proxy location to a specific country
    render_js=True # enable rendering JavaScript (like headless browsers) to scrape dynamic content if needed
))

# use the built in Parsel selector
selector = response.selector
# access the HTML content
html = response.scrape_result['content']

Try for FREE!

More on Scrapfly

FAQ

To wrap up this guide on web scraping BestBuy, let's have a look at some frequently asked questions.

Are there public APIs for BestBuy?

Yes, BestBuy offers APIs for developers. We have scraped review data from hidden BestBuy APIs. The same approach can be utilized to scrape other data sources on the website.

Are there alternatives for scraping BestBuy?

Yes, other popular e-commerce platforms include Amazon and Walmart. We have covered scraping Amazon and Walmart in previous tutorials. For more guides on similar scraping targets, refer to our #scrapeguide blog tag.

Latest BestBuy Scraper Code

Summary

In this guide, we have explained how to scrape BestBuy. We went through a step-by-step guide on scraping BestBuy with Python for different pages on the website, which are:

  • Sitemaps for BestBuy page URLs.
  • Search pages for product data on search results.
  • Product pages for various details, including specifications, pricing, and ratings.
  • Review pages for user reviews on products.
