Why scrape in the first place
There are many reasons why you might want to extract data from a specific public website. The most common one is that the data you want is not available through an API.
Use cases
- Scrape products from your favorite webshop, add a notification mechanism, and make sure to never miss that discount again.
- Your sales team needs a list of potential clients listed on some huge directory.
- Scrape a real estate directory and make sure you are the first one to make an offer on that cozy condo you are looking for.
Whatever the reason and the use case, scraping is an automated way of extracting data from websites.
Let's code
As a developer, your first instinct is to solve problems by writing code. But as a problem solver, you should not assume your problem is unique. Look for an existing solution first.
Also, the title suggests no coding :)
Parsehub
Parsehub is a powerful GUI tool for web scraping that lets you fetch and manipulate data from any webpage efficiently. It helps you create an API output for a given website. You can even sanitize your content by using regex or a replace function.
So the input is a URL and the output is a structured JSON file.
An example
For example, your input is the Bornfight careers page URL, and your output is formatted JSON with all the data that you want to use.
{
  "jobs": [
    {
      "name": "Sales and Account Manager - m/f",
      "url": "https://www.bornfight.com/careers/strategic-partnerships-executive/",
      "location": "Zagreb",
      "due_date": "Open until filled",
      "type": "Full time job"
    },
    {
      "name": "iOS Developer - m/f",
      "url": "https://www.bornfight.com/careers/ios-developer/",
      "location": "Zagreb / remote",
      "due_date": "Open until filled",
      "type": "Full time job"
    },
    {
      "name": "Office Assistant (student job) - m/f",
      "url": "https://www.bornfight.com/careers/office-assistant-student-job/",
      "location": "Zagreb",
      "due_date": "Open until filled",
      "type": "Student job (part-time)"
    }
  ]
}
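If you do end up consuming this output from code, it could look something like the sketch below. It only assumes the JSON shape shown above (a top-level "jobs" list with "name" and "location" keys). The `remote_jobs` helper and the trimmed sample data are made up for illustration.

```python
import json

# Sample output in the shape shown above, trimmed to two jobs.
raw = """
{
  "jobs": [
    {"name": "iOS Developer - m/f", "location": "Zagreb / remote"},
    {"name": "Office Assistant (student job) - m/f", "location": "Zagreb"}
  ]
}
"""


def remote_jobs(payload: dict) -> list:
    """Return the names of jobs whose location mentions 'remote'."""
    return [
        job["name"]
        for job in payload["jobs"]
        if "remote" in job["location"].lower()
    ]


data = json.loads(raw)
names = remote_jobs(data)
print(names)
```

Because the output is plain JSON, any language with a JSON parser can do the same thing.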
How to
This is a short video for the given example. It demonstrates the basic features of the tool.
Scrape multiple pages
To add more relevant data to your API, you can instruct the tool to click on each job posting, "visit" that single page and add more data to your JSON output.
What else?
- click through the page navigation and ajax links
- use conditional statements
- create flows with multiple templates
- scroll
- hover
- sanitize data by string replacement and regex
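To give an idea of what the last item means in practice, here is a small sketch of the same kind of cleanup done in plain Python. It is not Parsehub's own syntax, just an illustration of the regex idea. The `clean_job_name` helper and the choice to strip the " - m/f" suffix are assumptions for the example.

```python
import re


def clean_job_name(name: str) -> str:
    """Drop a trailing ' - m/f' marker and collapse repeated whitespace."""
    # Remove the suffix only when it sits at the very end of the string.
    name = re.sub(r"\s*-\s*m/f\s*$", "", name)
    # Replace any run of whitespace with a single space and trim the ends.
    return re.sub(r"\s+", " ", name).strip()


cleaned = clean_job_name("Sales and Account Manager - m/f")
print(cleaned)
```

Inside the tool you would express the same pattern in its regex or replace option, so the data arrives already cleaned.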
How to get the data
You can download the extracted data in JSON or CSV format, but better yet, you can access it via the Parsehub API.
Parsehub API
You can automate the extraction execution via the API, fetch the extracted data and control multiple projects you might have in the tool.
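For the curious, here is a rough sketch of what calling the API from a script might look like. The endpoint paths (`/projects/{token}/last_ready_run/data` and `/projects/{token}/run`) and the `api_key` parameter are written from my reading of the Parsehub v2 API, so treat them as assumptions and check the official documentation before relying on them. `PROJECT_TOKEN` and `API_KEY` are placeholders, and the helper names are made up.

```python
import gzip
import json
import urllib.parse
import urllib.request

# Assumed base URL of the Parsehub v2 API; verify against the docs.
API_BASE = "https://www.parsehub.com/api/v2"


def last_run_data_url(project_token: str, api_key: str) -> str:
    """Build the URL for the data of the project's most recent finished run."""
    query = urllib.parse.urlencode({"api_key": api_key, "format": "json"})
    return f"{API_BASE}/projects/{project_token}/last_ready_run/data?{query}"


def build_run_request(project_token: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST request that would start a new run."""
    body = urllib.parse.urlencode({"api_key": api_key}).encode("ascii")
    url = f"{API_BASE}/projects/{project_token}/run"
    return urllib.request.Request(url, data=body, method="POST")


def fetch_last_run(project_token: str, api_key: str) -> dict:
    """Download and parse the last run's data (performs a network call)."""
    url = last_run_data_url(project_token, api_key)
    with urllib.request.urlopen(url, timeout=30) as response:
        body = response.read()
    # The response may be gzip-compressed; detect that by its magic bytes.
    if body[:2] == b"\x1f\x8b":
        body = gzip.decompress(body)
    return json.loads(body.decode("utf-8"))


# Placeholders only; nothing is sent over the network here.
print(last_run_data_url("PROJECT_TOKEN", "API_KEY"))
```

To actually pull the data you would call `fetch_last_run` with your own project token and API key, and send the request from `build_run_request` with `urllib.request.urlopen` to start a fresh run.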
Conclusion
Parsehub is a powerful scraping tool. It can handle complex scraping scenarios and it's great for most use cases. Follow the guided tutorial when you create your first project. The documentation is good, so check it to find out more.
Parsing the Bornfight careers page is a good first exercise. However, if you're interested in joining our team and there is no open position that fits, you should send an open application :)
If you have any questions, feel free to ask in the comments.