Illia Zub for SerpApi

Posted on Dec 26, 2022 • Originally published at serpapi.com

How to reverse engineer a JSON API on a single page application

#webscraping

Websites like Bing Image Search and Walmart render pages with JavaScript and deliver page content via JSON APIs. While it's possible to scrape dynamic web pages using browser automation, I prefer fetching data from the API endpoints directly. It is usually (not always) faster and more reliable.

I was debugging Bing Image Search to help implement our new Bing Reverse Image Search API. Initially, I used mitmproxy because Ctrl+Shift+F in the browser dev tools didn't find the request. Then I figured out how to filter network requests in the browser dev tools, examined the response, and made a draft data adapter.

Algorithms to reverse engineer a JSON API on the SPA

I've used two ways to reverse engineer the JSON API behind Bing Image Search: mitmproxy and browser developer tools. I start with the devtools process because it's used more often.

Browser devtools

Ctrl+F in the Network tab of browser dev tools.

Go to the Preview tab of the JSON response.
Expand the JS object recursively (my Brave Browser doesn't search in the collapsed JSON 😕)

Ctrl+F the target string

Copy property path

Navigate up and down in the JS object (with arrow keys) to learn its structure and create an adapter.
Copy the request as cURL and transform the response with jq to check my assumptions.
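
The "find the target string, copy the property path, draft an adapter" steps can also be sketched in code. This is only an illustration: the `SAMPLE` payload and its `results`, `title`, and `link` fields are made up for the example and are not Bing's real response schema, and `find_paths` and `adapt` are hypothetical helper names.

```python
import json


def find_paths(node, needle, path=""):
    """Yield the property path of every string value that contains `needle`."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from find_paths(value, needle, f"{path}.{key}" if path else key)
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from find_paths(value, needle, f"{path}[{index}]")
    elif isinstance(node, str) and needle in node:
        yield path


# Hypothetical response shape, NOT the real Bing schema.
SAMPLE = json.loads("""
{
  "results": [
    {"title": "Zoho CRM", "link": "https://example.com/a", "rank": 1},
    {"title": "Freshsales CRM", "link": "https://example.com/b", "rank": 2}
  ]
}
""")


def adapt(response):
    """Draft adapter: keep only the fields we care about."""
    return [
        {"title": item["title"], "link": item["link"]}
        for item in response.get("results", [])
    ]


# Same idea as Ctrl+F followed by "Copy property path" in the Preview tab.
paths = list(find_paths(SAMPLE, "Freshsales"))
print(paths)
print(adapt(SAMPLE))
```

The printed paths tell you where the target string lives in the response, which is the information the adapter needs.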

`mitmproxy`

Ctrl+Shift+F in the browser dev tools no longer searches across all responses.

I proxied the browser's network connections via mitmproxy, then filtered response bodies with ~bs "TEXT_FROM_THE_HTML_ELEMENT_I_AM_LOOKING_FOR".
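
To make the idea behind the ~bs filter concrete, here is a rough approximation in plain Python: keep only the flows whose response body matches a regular expression. The `FLOWS` list of (URL, body) pairs and the `matching_urls` helper are hypothetical, and mitmproxy's own matching may differ in details such as regex flags.

```python
import re


def matching_urls(flows, pattern):
    """Return the URLs whose response body matches the regular expression."""
    regex = re.compile(pattern.encode())  # bodies are bytes, so match on bytes
    return [url for url, body in flows if regex.search(body)]


# Hypothetical captured traffic as (url, response body) pairs.
FLOWS = [
    ("https://example.com/page", b"<html><div id='root'></div></html>"),
    ("https://example.com/api/items", b'{"items": [{"name": "Freshsales"}]}'),
]

matched = matching_urls(FLOWS, "Freshsales")
print(matched)
```

Only the JSON endpoint survives the filter, which is how the request carrying the rendered text stands out from the rest of the traffic.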

Start mitmproxy with a view filter

$ mitmproxy --view-filter '~bs "Freshsales"'

Start a Chromium-based browser with the target URL and the following flags and parameters

Proxy requests via mitmproxy: --proxy-server='http://127.0.0.1:8080'.
Use incognito mode (1) with a temporary user profile (2), ignoring insecure connections (3) and certificate errors (4): --temp-profile -incognito --user-data-dir="`mktemp -d`" --no-first-run --ignore-certificate-errors --allow-insecure-localhost. (I ignore certificate errors in a temporary browser profile to avoid installing mitmproxy's certificates system-wide.)

$ brave-browser 'https://www.bing.com/images/search?view=detailV2&insightstoken=bcid_RLKVsIV2BwkFXg*ccid_spWwhXYH&form=SBIHMP&iss=SBIUPLOADGET&sbisrc=ImgPicker&idpbck=1&sbifsz=927+x+524+%c2%b7+25.15+kB+%c2%b7+png&sbifnm=serpapi-serpbear.png&thw=927&thh=524&ptime=223&dlen=34344&expw=798&exph=451&selectedindex=0&id=-1051855017&ccid=spWwhXYH&vt=2&sim=11' --proxy-server='http://127.0.0.1:8080'  --temp-profile -incognito --user-data-dir="`mktemp -d`" --no-first-run --ignore-certificate-errors --allow-insecure-localhost

mitmproxy will display the matched requests
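
If you launch the browser this way often, the command can be assembled in a small script. The sketch below reuses the flags from the command above; the function names, the default `brave-browser` binary, and the example URL are assumptions for illustration, and `tempfile.mkdtemp()` stands in for `mktemp -d`.

```python
import subprocess
import tempfile


def build_browser_command(url, proxy="http://127.0.0.1:8080", binary="brave-browser"):
    """Build the argument list for a throwaway, proxied browser session."""
    profile_dir = tempfile.mkdtemp()  # plays the role of `mktemp -d`
    return [
        binary,
        url,
        f"--proxy-server={proxy}",
        "--temp-profile",
        "-incognito",
        f"--user-data-dir={profile_dir}",
        "--no-first-run",
        "--ignore-certificate-errors",
        "--allow-insecure-localhost",
    ]


def launch(url):
    """Start the browser; assumes mitmproxy is already listening on the proxy port."""
    subprocess.run(build_browser_command(url), check=True)


# Hypothetical target URL; building the command does not start the browser.
command = build_browser_command("https://www.bing.com/images/search?q=example")
print(" ".join(command))
```

Passing the arguments as a list avoids shell quoting problems with long URLs like the one above.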

Conclusion

mitmproxy can be used to find the HTTP request with the needed data, in addition to browser dev tools. At some point, I'll explore tcpdump and Wireshark to reverse engineer websites for web scraping and share the learnings with you.

If you have anything to share, any questions, suggestions, or something that isn't working correctly, feel free to reach out via Twitter at @ilyazub_, or @serp_api, or Mastodon at @iz.
