DEV Community 👩‍💻👨‍💻

Cover image for Building a simple TikTok scraping API 🤖
Bryce Dorn
Bryce Dorn

Posted on

Building a simple TikTok scraping API 🤖

As part of a project centered around TikTok (coming soon!) I needed a way to get a video's media. The closest thing that TikTok officially provides is their oembed API, but unfortunately this only includes basic information like title, thumbnail_url etc. And other unofficial APIs cost money or require python, which I don't want to do. So I needed another way to retrieve this info.

Leveraging SSR patterns 📦

As a website that handles large volumes of traffic I knew they had some sort of SSR/client-side hydration going on. This generally involves is shipping data to the client in an object inlined inside a <script> tag, reducing the number of necessary client-side requests by directly populating components with this data. In Next.js for example, this happens via server-side props.

And lo and behold! There's a handy blob provided alongside the page markup for every video. You can confirm this for yourself by visiting any TikTok video's page (https://www.tiktok.com/:user/video/:id) and looking at the contents of the SIGI_STATE tag via console:

JSON.parse(document.getElementById("SIGI_STATE").innerText);
Enter fullscreen mode Exit fullscreen mode

Knowing this I built a simple handler using deno to fetch this blob and return it directly.

Enter cheerio & deno 🥣🦕

If the video URL is attached to the request (via POST body or params), a simple handler can fetch (scrape) the content for the page:

const response = await fetch(`https://tiktok.com/${video}`);
const html = await response.text();
Enter fullscreen mode Exit fullscreen mode

Then, using cheerio, this raw text can be turned into a queryable jQuery-like API. This provides an entrypoint to get the content of the server-generated blob within the script tag:

const $ = cheerio.load(html);
const appContext = $("#SIGI_STATE").text();
Enter fullscreen mode Exit fullscreen mode

From there, it's only a matter of parsing this as JSON and returning pertinent data in the response (ItemModule). The data is currently keyed by some unique identifier, Object.keys helps remove that:

const json = JSON.parse(appContext);
const key = Object.keys(json.ItemModule)[0];
const data = json.ItemModule[key];

return new Response(JSON.stringify({ ...data }), {
  headers: { 'Content-Type': 'application/json' },
  status: 200,
});
Enter fullscreen mode Exit fullscreen mode

And that's it! Now I can make a client-side request to get up-to-date media for a video and display it directly. This also gets around TikTok's CDN expiration which would prevent saving a reference to this media.

Side-note: comparing this to the output for one of the unofficial API options I came across, the data looks identical. So it's likely they're doing the same thing (but charging almost $200/mo for unlimited usage)!


Anyways, this is fragile and could break if they decide to rename the script tag, rate-limit, etc. But this demonstrates how easy it is to spin up an instance for your own project. And maybe with enough bot traffic TikTok will create a public API that accomplishes this instead of making us scrape for it! 😇

Oldest comments (0)

This post blew up on DEV in 2020:

js visualized

🚀⚙️ JavaScript Visualized: the JavaScript Engine

As JavaScript devs, we usually don't have to deal with compilers ourselves. However, it's definitely good to know the basics of the JavaScript engine and see how it handles our human-friendly JS code, and turns it into something machines understand! 🥳

Happy coding!