tl;dr: There is no shortage of APIs and services for getting website screenshots. However, this is a good experiment to explore what it takes to make high-fidelity screenshots. Let me be clear--this is largely a "solved problem," as demonstrated by multiple companies in the space (see appendix). There are even some open-source projects solving pieces of it.
This distraction satisfies a years-long curiosity I have had about making website screenshots in an automated way.
Ok, you pushy bastards. The whole Django project is here:
- Simplicity (hah, I know there are Yaks that need shaving), meaning no CI/CD, no Kubernetes, deploy to prod right from dev machine, and other "sins."
- How does execution look when I am not at all concerned with scale of any kind?
- Optimize only for developer productivity?
- Slow your roll and document things (check that README.md, yo)
- Challenge old truths like, "storing images in the database is bad" and "using a database as a queue is bad."
- How far can we push cheap, commodity VMs at a place like Scaleway?
- How far can this single server scale vertically before needing a 2nd server?
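On the "database as a queue" point, the core trick is claiming a job atomically so two workers never grab the same row. Here is a minimal sketch of that pattern using SQLite; the table and column names are illustrative, not the project's actual schema (in PostgreSQL you would reach for `SELECT ... FOR UPDATE SKIP LOCKED` or Django's `select_for_update(skip_locked=True)` instead).

```python
import sqlite3

def claim_next_job(conn):
    """Atomically claim the oldest pending job, or return None."""
    with conn:  # wraps the claim in a transaction
        row = conn.execute(
            "SELECT id, url FROM jobs WHERE status = 'pending' "
            "ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        job_id, url = row
        # The status check guards against another worker winning the race.
        cur = conn.execute(
            "UPDATE jobs SET status = 'running' "
            "WHERE id = ? AND status = 'pending'",
            (job_id,),
        )
        if cur.rowcount == 0:
            return None  # someone else claimed it first
        return job_id, url

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs ("
    "id INTEGER PRIMARY KEY, url TEXT, status TEXT DEFAULT 'pending')"
)
conn.execute("INSERT INTO jobs (url) VALUES ('https://example.com')")
job = claim_next_job(conn)
```

A long-running Django management command polling this table in a loop is all the "task queue" a project at this scale needs.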
Below is the current system architecture.
In case you missed it, most of my projects (personal and professional) follow a similar pattern of system and application architecture: linked here
- Django for web
- Django Commands for async tasks
- Traefik for reverse proxy and HTTPS
- All on a single Ubuntu Linux VM.
I always preface this with "scale is not an issue" when starting new projects. You have no customers, no users--nothing! So it makes sense that we should not distract ourselves with the complexity of distributed systems, remote storage, or other such things.
Should we need them, we can revisit this conversation.
Everything you see lives on a single system, all running in plain old Docker. I use a docker-compose.yml file to keep deployment simple.
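To give a feel for how small that file can be, here is a minimal sketch of the idea--the service names, images, and commands are illustrative assumptions, not the project's actual configuration:

```yaml
version: "3.7"

services:
  traefik:
    image: traefik:v2.1
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro

  web:
    build: .
    command: gunicorn project.wsgi --bind 0.0.0.0:8000

  worker:
    build: .
    # hypothetical Django management command polling the DB for jobs
    command: python manage.py process_screenshots
```

One `docker-compose up -d` on the VM and the whole stack is running.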
Learn more about my setup for reverse proxy configuration with Let's Encrypt and Traefik.
That may already sound technically heavy, but it is nothing compared to the massive stacks of AWS Lambda, DynamoDB, CloudFront, S3, and CloudFormation, whatever, just to take pictures of web pages. My goal is to spin up a dev environment in a few minutes without any need for third-party APIs.
My reasoning is that what was true in 2010 is not necessarily true in 2020. Compute, bandwidth, and storage are far cheaper and more plentiful. For example, 1TB of data today is not that much from a capacity perspective. However, that is potentially 10 million images at 100KB each. (Screenshots, yo.)
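The back-of-the-envelope math, in decimal units:

```python
# How many 100KB screenshots fit in 1TB of storage?
terabyte = 10**12          # 1 TB in bytes (decimal)
screenshot_size = 100_000  # 100 KB in bytes
print(terabyte // screenshot_size)  # prints 10000000 -- ten million
```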
Developer time is far more expensive than CPU, bandwidth, or storage. Why, then, do we let (or even encourage) developers to spend so much time on optimizations when it is more efficient to buy a bigger server?
See? I was not kidding. This list is current as of February 2020.
- PagePeeker API
- Screenshot Bin API
- ScreenshotsCloud API
- thumbnail.ws API