Web scraping tutorial with Real-Time Crawler

smusca ・3 min read

Web scraping is an irreplaceable tool in today's marketing world, used by companies to stay competitive and increase sales. Analysing scraped data allows companies to compare their content, merchandise, prices and availability with those of their competitors. Often, undercutting a competitor's price by as little as $1 can boost sales for a long time. On the other hand, if you are not a big company, it can be tricky to work out where to start and what to do to get all the data you need.

To start with, you should decide whether to write a web crawler yourself or use a ready-made web scraping tool. Writing your own crawler can be useful if you know exactly what content you need, want data in a specific language, or need particular methods. The downside is that you need people who can code and who can build a useful tool for you. You also have to combine your own crawler code with purchased proxies and export all the gathered data in an easy-to-read format. If you do not have a team that can build such a tool, you can always use a pre-made one. Pre-made web scraping tools download specific web pages and extract the required data, such as a list of available items, their prices, availability, and other details. Let’s look at one of these tools, Real-Time Crawler, and check how it works.

This tool works in a pretty simple way: the user makes a request describing what data is needed, the crawler receives the request and tries to access the data. If it succeeds, the crawler sends the data back to the user.

If you want to see it in action, you can try the sample on their web page, either with a search engine or with an e-commerce search. The e-commerce sample uses the ASIN (Amazon Standard Identification Number), so all you need to do is paste a product number into the field to get data about the product in JSON or HTML format.
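
To make the "paste an ASIN, pick a format" step more concrete, here is a minimal Python sketch. It only checks that the input looks like an ASIN and packs it into a request description; it does not contact any service. The function name and the `kind`, `asin` and `output` keys are made up for illustration and are not the tool's actual parameters, and the 10-character pattern is an assumption about how ASINs usually look.

```python
import re

# Assumption: an ASIN is treated as exactly 10 uppercase letters or digits.
ASIN_PATTERN = re.compile(r"[A-Z0-9]{10}")


def build_product_request(asin: str, output: str = "json") -> dict:
    """Describe a single-product lookup (hypothetical field names)."""
    asin = asin.strip().upper()
    if not ASIN_PATTERN.fullmatch(asin):
        raise ValueError(f"not a valid-looking ASIN: {asin!r}")
    if output not in ("json", "html"):
        raise ValueError("output must be 'json' or 'html'")
    return {"kind": "product", "asin": asin, "output": output}


if __name__ == "__main__":
    # Whitespace and lowercase input are normalised before validation.
    print(build_product_request(" b07fz8s74r "))
    print(build_product_request("B07FZ8S74R", output="html"))
```

Validating the product number up front means a typo fails on your side instead of costing a request.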

You can extract data from product pages, product offer listing pages, reviews, questions & answers, search results, or from any URL in general. There are also two ways to retrieve data. With the callback delivery method, you don't need to check your task status: Real-Time Crawler lets you know once the data is ready. With real-time delivery, the data is returned on the same connection. Proxies make it possible to collect data without IP bans and keep you anonymous as well, and Real-Time Crawler uses both data center and residential proxies for this.
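
The difference between the two delivery methods is easier to see in code. The sketch below is a toy, in-memory stand-in written for this post, not Real-Time Crawler's client: `ToyCrawler` and all of its methods are hypothetical, and nothing here touches the network. It only shows the shape of each approach, where real-time delivery returns the data from the same call and callback delivery hands it to a function you supplied once the job is done.

```python
from typing import Callable, Dict, List, Tuple

Result = Dict[str, str]


class ToyCrawler:
    """In-memory stand-in for a crawling service; it never uses the network."""

    def __init__(self) -> None:
        self._pending: List[Tuple[str, Callable[[Result], None]]] = []

    def _scrape(self, url: str) -> Result:
        # Placeholder for the real work of fetching and extracting a page.
        return {"url": url, "status": "done"}

    def fetch_realtime(self, url: str) -> Result:
        """Real-time style: the caller waits and gets the data from the same call."""
        return self._scrape(url)

    def submit_with_callback(self, url: str, on_ready: Callable[[Result], None]) -> None:
        """Callback style: queue the job and return immediately."""
        self._pending.append((url, on_ready))

    def run_pending(self) -> None:
        """Simulate the service finishing queued jobs and notifying the caller."""
        while self._pending:
            url, on_ready = self._pending.pop(0)
            on_ready(self._scrape(url))


if __name__ == "__main__":
    crawler = ToyCrawler()
    print(crawler.fetch_realtime("https://example.com/item"))

    received: List[Result] = []
    crawler.submit_with_callback("https://example.com/item", received.append)
    print(received)  # still empty: the job is queued, nothing delivered yet
    crawler.run_pending()
    print(received)  # the callback has now been called with the data
```

In a real setup the callback would typically be an HTTP endpoint on your side rather than a Python function, but the trade-off is the same: real-time is simpler, while callbacks free you from waiting or polling.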

As for search engines, you can make a request in two languages, Python or PHP, or write your command in shell. All you need is a keyword, domain, language and country, which will be used for the search results. Real-Time Crawler supports any number of requests for any location and any keyword. High accuracy is ensured by the use of natural geo-located IP addresses. This is how your request would look.

[Image: example request]
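
Since the screenshot is not reproduced here, the following is a rough Python sketch of how the four inputs named above (keyword, domain, language, country) could be packed into a JSON request body. The key names are illustrative guesses, not the documented parameter names, and the sketch stops short of sending anything, so check the official documentation for the actual endpoint, authentication and field names.

```python
import json


def build_search_payload(keyword: str, domain: str, language: str, country: str) -> str:
    """Serialise the four search inputs as JSON (key names are illustrative)."""
    if not keyword.strip():
        raise ValueError("keyword must not be empty")
    payload = {
        "keyword": keyword.strip(),
        "domain": domain,
        "language": language,
        "country": country,
    }
    # sort_keys keeps the output stable, which helps when logging or diffing.
    return json.dumps(payload, sort_keys=True)


if __name__ == "__main__":
    print(build_search_payload("running shoes", "com", "en", "US"))
```

The resulting string is what you would pass as the body of the request, whether you send it from Python, PHP or a shell command.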

Your results are delivered to the given link. In this case, the result is a SERP (search engine results page) in JSON format, with the HTML code embedded inside the JSON. So, we need to parse the JSON to see the result page.
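
Here is a small Python sketch of that two-step parsing, using only the standard library: first decode the JSON wrapper, then feed the HTML string inside it to an HTML parser. The `content` key and the sample response are assumptions made up for this example, so the real response may nest the HTML differently. Pulling out the links is just one example of what you might do with the page.

```python
import json
from html.parser import HTMLParser
from typing import List


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in an HTML document."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(raw_json: str) -> List[str]:
    """Decode the JSON wrapper, then parse the HTML string stored inside it."""
    data = json.loads(raw_json)
    html = data["content"]  # assumed key; check the real response for the actual name
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


if __name__ == "__main__":
    # Hand-written sample standing in for a real response.
    sample = json.dumps(
        {
            "content": '<html><body><a href="https://example.com/a">A</a>'
                       '<a href="https://example.com/b">B</a></body></html>'
        }
    )
    print(extract_links(sample))
```

For anything beyond simple extraction, a dedicated HTML parsing library is usually more comfortable, but the order of the steps stays the same.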

All the parameters needed to form a payload, and all the examples of how the crawler works, can be found on the Oxylabs learning hub. They also explain their other tools, as well as how to write code for specific searches or extractions.


All in all, Real-Time Crawler is a convenient, easy-to-use tool for both e-commerce and search engine crawling that can help your business gather the data it needs and become more profitable.
