I'm Liviu, a Solutions Architect at Endtest.
Let's take a look at how we can deceive a CAPTCHA.
This is useful in a test or in any script that wants to imitate a real user.
In case there's someone out there who doesn't know:
A CAPTCHA (an acronym for "Completely Automated Public Turing test to tell Computers and Humans Apart") is a type of challenge–response test used in computing to determine whether the user is human.
It's meant to stop bots and scripts from pretending to be real humans on the internet.
Our use case here is for an automated test.
The typical workaround is to use whitelisting, and skip the CAPTCHA for requests coming from certain IP addresses.
But that workaround does not verify whether the CAPTCHA works or whether your real users can actually sign up.
What if the CAPTCHA system that you're using is broken?
Your tests would pass, but your real users would get stuck.
To get past a text-based CAPTCHA, the test needs to extract the text from the CAPTCHA image.
And the best way to do that is with Optical Character Recognition (OCR).
If you're looking to invest time in that and maybe even build a product around OCR, it's worth exploring Tesseract, the open-source OCR engine whose development was sponsored by Google for many years.
There are also different wrappers around this package, such as pytesseract for Python.
But it's not that smooth. The process involves a number of steps:
- Align the image correctly.
- Convert the image to grayscale.
- Increase the sharpness of the image.
- Increase or decrease the DPI.
- Configure a tolerance.
- Extract the text.
- Remove or convert incorrect characters.
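To give a rough idea of what those steps look like in practice, here's a minimal sketch using Pillow and pytesseract. It's an illustration, not a finished solver. It assumes the CAPTCHA contains digits only, it skips the alignment step, and it approximates the DPI step by doubling the image size. The function names, the look-alike table, and the threshold of 140 are my own placeholders that you'd need to tune for your own images.

```python
import re

# Hypothetical substitution table: letters that OCR often confuses with
# digits, assuming the CAPTCHA under test contains digits only.
LOOKALIKES = str.maketrans(
    {"O": "0", "o": "0", "I": "1", "l": "1", "S": "5", "B": "8"}
)


def clean_digits(raw: str) -> str:
    """Convert look-alike letters to digits, then drop everything else."""
    return re.sub(r"[^0-9]", "", raw.translate(LOOKALIKES))


def read_captcha(path: str) -> str:
    """Preprocess the image at `path`, run OCR, and return the cleaned text."""
    # Third-party imports live here so clean_digits() works without them.
    from PIL import Image, ImageFilter, ImageOps
    import pytesseract

    image = Image.open(path)
    image = ImageOps.grayscale(image)  # convert to grayscale
    # Upscale 2x, a rough stand-in for increasing the DPI.
    image = image.resize((image.width * 2, image.height * 2))
    image = image.filter(ImageFilter.SHARPEN)  # increase the sharpness
    # Threshold to pure black and white; 140 is an assumed tolerance.
    image = image.point(lambda p: 255 if p > 140 else 0)
    # "--psm 7" tells Tesseract to treat the image as a single line of text.
    raw = pytesseract.image_to_string(image, config="--psm 7")
    return clean_digits(raw)
```

Calling `read_captcha("captcha.png")` (a placeholder file name) would return the digits it managed to recognize. It needs the Tesseract binary installed on the machine, and distorted or noisy CAPTCHAs will likely need more preprocessing than this.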
If you're doing that only for an automated test, you can just use the OCR feature from Endtest:
And when you run the test, it will easily extract that text:
Here's the video recording from that test execution:
This tutorial should only be used for the websites you're testing, and not to scrape other websites.