Gareth M.

Undocumented APIs in websites

Intro

A website needs to be able to retrieve data, such as information on a set of products or users. To do this, many websites make requests to APIs, which in turn access data from a database or some other backend service. Today I'm going to be talking about how I was able to scrape JSON data from an undocumented API, in order to integrate that data into my own application.

The website

Currently I am working on an application that allows users to create a grocery list (here), as well as compare items. For this I need several pieces of data, such as the name and price of the product(s), as well as nutrition information, ingredients, and images of the product and/or its packaging. Conveniently, I found all of this information in the API at https://www.bakersplus.com/atlas/v1.
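To keep those fields organized once they come back from the API, it helps to store each product as a small record. Below is a minimal sketch of what that could look like; the class and field names here are my own, not part of the API's schema.

from dataclasses import dataclass, field

# Hypothetical record for the fields the app needs; the names are
# mine, not the API's schema.
@dataclass
class Product:
    gtin13: str                  # the product id the API uses
    name: str
    price: float
    ingredients: str = ""
    nutrition: dict = field(default_factory=dict)
    image_urls: list = field(default_factory=list)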

Retrieving the data

(Screenshot: the first request, a preliminary request that returns product ids)

(Screenshot: the second request, which returns product data)

Looking at the web traffic, there are two requests here that are important. The first one accesses the endpoint /search/v1/products-search, which takes the parameters filter.query and page.size. This endpoint returns product id numbers in a JSON object.


h = {"User-Agent": user_agent, "Accept":"*/*", "Host":"www.bakersplus.com", "x-laf-object":json.dumps(x_laf_obj)}

def getProductIds(query, num):
    #query = "rice"
    ids = []

    req = requests.get(base_url+"/search/v1/products-search?filter.query="+query+"&page.size="+str(num), headers=h)

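From there the function pulls the ids out of req.json(). The exact layout of the response isn't documented anywhere, so the keys below are an assumption; check the real payload in your browser's network tab and adjust accordingly.

    # Hypothetical response layout: products under a top-level "data"
    # key, each with an "id" field. Adjust to match the real payload.
    for product in req.json().get("data", []):
        ids.append(product["id"])
    return ids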

The second request is made to the /product/v2/products endpoint. An array of the product ids (filter.gtin13s) is passed as a URL parameter. The response contains all the information we need, along with some extra fields we don't want, which are filtered out via a helper function.


def getProductInfo(ids):
    # requests encodes the list as repeated filter.gtin13s parameters
    p = {"filter.gtin13s": list(ids)}

    req = requests.get(base_url + "/product/v2/products", headers=h, params=p)

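The helper that strips out the extra fields is straightforward in principle: keep only the keys the grocery-list app cares about. Here's a minimal sketch, assuming the response is a list of product dicts; the key names are hypothetical, taken from what the payload appears to contain.

WANTED_KEYS = {"description", "price", "nutrition", "ingredients", "images"}

def filterProductInfo(products):
    # Keep only the fields the app needs; key names here are
    # hypothetical and should be matched against the real response.
    return [{k: v for k, v in product.items() if k in WANTED_KEYS}
            for product in products]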

Rate limiting/Request restrictions

APIs, especially when publicly exposed, will limit the number of consecutive requests. In my experience with the above example, requests are more likely to time out if the same search is made repeatedly. In addition, this API required certain HTTP headers that other APIs I've worked with did not. One of these was called x-laf-object, which appeared to be some kind of location-tracking object.
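A simple way to stay under the limit is to space requests out and back off when one times out. This isn't part of my original code, just a sketch of the standard pattern:

import time
import requests

def getWithBackoff(url, headers, params=None, retries=3):
    # Wait longer after each failed attempt: 1s, 2s, 4s, ...
    for attempt in range(retries):
        try:
            req = requests.get(url, headers=headers, params=params, timeout=10)
            if req.status_code == 200:
                return req
        except requests.exceptions.Timeout:
            pass
        time.sleep(2 ** attempt)
    return None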

Conclusion

Many websites don't make it this easy to scrape their data and effectively bypass their frontend. However, if the data these services provide is public anyway, and the API is effectively rate-limited and protected from abuse, then there isn't really a reason not to build a solution this way: it's modular and easier to fix and debug than scraping rendered pages.

Livestreams/VODs of dev: YouTube

Code: GitHub
