Websites like Bing Image Search and Walmart render pages with JavaScript and deliver page content via JSON APIs. While it's possible to scrape dynamic web pages using browser automation, I prefer fetching data from the API endpoints directly. It is usually (not always) faster and more reliable.
I was debugging Bing Image Search to help implement our new Bing Reverse Image Search API. Initially, I used mitmproxy because Ctrl+Shift+F in the browser dev tools didn't find the request. Then I figured out how to filter network requests in the browser dev tools, examined the response, and made a draft data adapter.
Algorithms to reverse engineer a JSON API of an SPA
I've used two ways to reverse engineer the JSON API behind Bing Image Search: mitmproxy and browser developer tools. I explain the devtools process first because it's used more often.
Browser devtools
- Ctrl+F in the Network tab of browser dev tools.
- Go to the Preview tab of the JSON response.
- Expand JS object recursively (my Brave Browser doesn't search in the collapsed JSON 😕)
- Ctrl+F the target string
- Copy property path
- Navigate up and down in JS object (with arrow keys) to learn its structure and create an adapter.
- Copy as cURL and transform response with jq to check my assumption.
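The "search for the target string, then copy the property path" steps can also be done offline on a saved response. Below is a minimal sketch, assuming the response body has been saved as JSON. The `find_paths` helper and the sample payload are hypothetical illustrations, not Bing's actual response shape.

```python
import json
from typing import Any, Iterator


def find_paths(node: Any, needle: str, path: str = "") -> Iterator[str]:
    """Yield a JS-style property path for every string value containing `needle`."""
    if isinstance(node, dict):
        for key, value in node.items():
            # Build "parent.key", or just "key" at the top level.
            yield from find_paths(value, needle, f"{path}.{key}" if path else key)
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from find_paths(value, needle, f"{path}[{index}]")
    elif isinstance(node, str) and needle in node:
        yield path


# Hypothetical response body; real field names will differ.
sample = json.loads(
    '{"results": [{"title": "Freshsales CRM", "url": "https://example.com"},'
    ' {"title": "Other"}]}'
)
print(list(find_paths(sample, "Freshsales")))
```

The printed paths play the same role as "Copy property path" in the dev tools: they show where in the structure the adapter should read from. Keys that contain dots or spaces would need bracket notation, which this sketch does not handle.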
mitmproxy
Ctrl+Shift+F in the browser dev tools no longer searches across all responses, so I proxied the browser's network connections via mitmproxy and filtered response bodies with ~bs "TEXT_FROM_THE_HTML_ELEMENT_I_AM_LOOKING_FOR".
- Start mitmproxy with a view filter
$ mitmproxy --view-filter '~bs "Freshsales"'
- Start a Chromium-based browser with the target URL and the following flags and parameters
  - Proxy requests via mitmproxy: --proxy-server='http://127.0.0.1:8080'.
  - Use incognito mode (1) with a temporary user profile (2) ignoring insecure connections (3) and certificate errors (4): --temp-profile -incognito --user-data-dir="`mktemp -d`" --no-first-run --ignore-certificate-errors --allow-insecure-localhost. (I ignore certificate errors in a temporary browser profile to avoid installing mitmproxy's certificates system-wide.)
$ brave-browser 'https://www.bing.com/images/search?view=detailV2&insightstoken=bcid_RLKVsIV2BwkFXg*ccid_spWwhXYH&form=SBIHMP&iss=SBIUPLOADGET&sbisrc=ImgPicker&idpbck=1&sbifsz=927+x+524+%c2%b7+25.15+kB+%c2%b7+png&sbifnm=serpapi-serpbear.png&thw=927&thh=524&ptime=223&dlen=34344&expw=798&exph=451&selectedindex=0&id=-1051855017&ccid=spWwhXYH&vt=2&sim=11' --proxy-server='http://127.0.0.1:8080' --temp-profile -incognito --user-data-dir="`mktemp -d`" --no-first-run --ignore-certificate-errors --allow-insecure-localhost
- mitmproxy will display the matched requests
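The same body search can be scripted instead of typed as a view filter. Below is a minimal sketch of a mitmproxy addon, assuming a recent mitmproxy version where addons log through Python's standard `logging` module. The `BodySearch` class and the `find_body.py` file name are hypothetical; the `response` hook, `flow.response.content`, and `flow.request.pretty_url` come from mitmproxy's addon API.

```python
import logging
import re


class BodySearch:
    """Record the URL of every response whose body matches a regex (similar in spirit to ~bs)."""

    def __init__(self, pattern: str) -> None:
        # Response bodies are bytes, so compile a bytes pattern.
        self.regex = re.compile(pattern.encode())
        self.matches: list[str] = []

    def response(self, flow) -> None:
        # mitmproxy calls this hook once for each completed response.
        body = flow.response.content or b""
        if self.regex.search(body):
            url = flow.request.pretty_url
            self.matches.append(url)
            logging.info("body matched: %s", url)


# mitmproxy picks up addon instances from this module-level list.
addons = [BodySearch("Freshsales")]
```

Saved as `find_body.py`, it could be loaded with `mitmdump -s find_body.py` while the browser is started with the same proxy flags as above. Keeping the matched URLs in a list makes it easy to dump them to a file later.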
Conclusion
mitmproxy can be used in addition to browser dev tools to find the HTTP request with the needed data. At some point, I'll explore tcpdump and wireshark to reverse engineer websites for web scraping and share the learnings with you.
If you have anything to share, any questions, suggestions, or something that isn't working correctly, feel free to reach out via Twitter at @ilyazub_, or @serp_api, or Mastodon at @iz.