DEV Community

fooooo-png

Posted on • Originally published at octoparse.com

Movie Crawler: Scraping 100,000+ Movie Information

Movie data records audience preferences and attitudes. Gathering movie information from relevant websites, such as IMDb and Rotten Tomatoes, supports data analysis and data mining in the film industry. Generally speaking, the scraped data can be used in scenarios such as:

· Analyzing the features of the target audience
· Obtaining public opinions to predict the coming trends
· Supporting advertising campaigns

There is more that can be done with movie data, depending on your needs. To help you gather it, this article shows how to scrape the IMDb Horror movie list, including director information, the cast, and other important details.

In this case, I’ll show you how to scrape information on the 134,555 Horror movies listed on IMDb, using this link:

https://www.imdb.com/search/title/?genres=horror&start=51&explore=title_type,genres&ref_=adv_nxt
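The `start=51` parameter in this link suggests that the list is paged by an item offset. As a minimal sketch, assuming 50 titles per page (inferred from `start=51` marking the second page, not confirmed by the article) and assuming the site still accepts this URL scheme, the page URLs could be generated like this. The helper names `page_url` and `start_of` are hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE = "https://www.imdb.com/search/title/"
PAGE_SIZE = 50  # assumption: inferred from start=51 being the second page


def page_url(page: int) -> str:
    """Build the list URL for a 1-based page number."""
    params = {
        "genres": "horror",
        "start": (page - 1) * PAGE_SIZE + 1,
        "explore": "title_type,genres",
        "ref_": "adv_nxt",
    }
    # safe="," keeps the comma in the explore value unescaped
    return f"{BASE}?{urlencode(params, safe=',')}"


def start_of(url: str) -> int:
    """Read the start offset back out of a list URL."""
    return int(parse_qs(urlsplit(url).query)["start"][0])


if __name__ == "__main__":
    for page in (1, 2, 3):
        print(page_url(page))
```

With these assumptions, `page_url(2)` reproduces the link above. Octoparse does not need such a list, since it follows the "Next" link instead, but the sketch shows what the pagination is doing underneath.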

The goal of this web scraper is to find the films on the Horror movie list and collect the director, the cast, and other important details for each of them.

Before getting started, please download Octoparse V7 to your computer so you can follow along. It’s also highly recommended to learn the basic logic of using Octoparse first.

Let’s get started

Step 1: Open the target website in the Octoparse built-in browser.

Simply click “+task” under the Advanced Mode.
Advanced Mode

Then, paste the URL into the box and click the “Save URL” button.
Save URL

Step 2: Click to build a task to scrape the movie information.
Once the URL is open in the Octoparse built-in browser, we can build the pagination and a loop item to get the data.

Simply click the “next>>” element in the built-in browser and then click “Loop click selected element” on the Action Tips.

Action Tips Panel

We can see the pagination has been built in the workflow.
Pagination

If you want Octoparse to recognize the selected element more precisely, you can revise the XPath. As the picture below shows, the XPath that Octoparse generated is //DIV[@class='nav']/DIV[2]/A[2]. It is better to change it to //a[contains(text(), "Next »")], which matches the link by its text rather than by its position.
XPath
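To see why matching by text is more robust than matching by position, here is a small sketch using Python's standard library. The markup is hypothetical and heavily simplified (it is not IMDb's real HTML), and it assumes the first page has no "Previous" link, so the "Next" link is not always the second anchor. Since `xml.etree.ElementTree` does not support `contains()`, the text match is written as a plain Python check that plays the same role.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified navigation markup for two pages of a list.
FIRST_PAGE = ET.fromstring(
    '<div class="nav"><div class="desc">1-50</div>'
    '<div class="links"><a href="?start=51">Next »</a></div></div>'
)
LATER_PAGE = ET.fromstring(
    '<div class="nav"><div class="desc">51-100</div>'
    '<div class="links"><a href="?start=1">« Previous</a>'
    '<a href="?start=101">Next »</a></div></div>'
)


def next_by_position(nav):
    """Positional lookup, like //DIV[@class='nav']/DIV[2]/A[2]."""
    link = nav.find("./div[2]/a[2]")
    return link.get("href") if link is not None else None


def next_by_text(nav):
    """Text lookup, playing the role of //a[contains(text(), "Next »")]."""
    for link in nav.iter("a"):
        if "Next »" in (link.text or ""):
            return link.get("href")
    return None


if __name__ == "__main__":
    for name, nav in (("first", FIRST_PAGE), ("later", LATER_PAGE)):
        print(name, next_by_position(nav), next_by_text(nav))
```

In this toy markup the positional lookup finds nothing on the first page, while the text lookup finds the "Next" link on both pages.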

In this case, we need to scrape data from the movie list, which means we can directly create a loop item to extract it.

Select one of the “blocks” in the browser, and Octoparse will detect all the data fields in the block you selected.

Click to select

Then, select “Select all sub-elements”.

All the needed data is now selected by Octoparse and highlighted in red. Select “Select All” to continue.

Click to select Info section

Finally, we select “Extract data in the loop”.

Select the matching action

Now both the pagination and the loop item are set up in Octoparse. The workflow of the task appears on the left side and the extracted data on the right side.
Data preview
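For readers curious about what this workflow amounts to, it is roughly an outer pagination loop with an inner loop over the movie blocks on each page. The sketch below imitates that structure on a hypothetical two-page "site" held in memory. The markup, the field names, and the `crawl` function are all assumptions made for illustration, not Octoparse internals or IMDb's real page structure.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-page list; real pages are far larger and messier.
PAGES = {
    "/list?start=1": (
        "<html><body>"
        '<div class="item"><h3>Movie A</h3><p class="director">Director A</p></div>'
        '<div class="item"><h3>Movie B</h3><p class="director">Director B</p></div>'
        '<a href="/list?start=3">Next »</a>'
        "</body></html>"
    ),
    "/list?start=3": (
        "<html><body>"
        '<div class="item"><h3>Movie C</h3><p class="director">Director C</p></div>'
        "</body></html>"
    ),
}


def crawl(start_url):
    """Follow 'Next' links and extract one row per movie block."""
    rows, url = [], start_url
    while url is not None:                      # pagination loop
        root = ET.fromstring(PAGES[url])
        for block in root.iter("div"):          # loop item
            if block.get("class") != "item":
                continue
            rows.append({                       # extract data in the loop
                "title": block.findtext("h3"),
                "director": block.findtext("p[@class='director']"),
            })
        next_links = [a for a in root.iter("a") if "Next" in (a.text or "")]
        url = next_links[0].get("href") if next_links else None
    return rows


if __name__ == "__main__":
    for row in crawl("/list?start=1"):
        print(row)
```

The crawl stops on its own when a page has no "Next" link, which is the same stopping condition the pagination loop in the workflow relies on.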

Step 3: Clean the data in Octoparse.

Before extracting the data, it is better to clean it up to improve the final result. Simply click to delete unwanted fields and rename the ones you need.
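If you prefer to do this cleanup after export instead, the same two operations (dropping unwanted fields and renaming the rest) are easy to script. This is only a sketch: the auto-generated field names such as `Field1` and the sample row are hypothetical, and your export will have its own column names.

```python
# Hypothetical mapping from auto-generated field names to the names we want.
# Any field not listed here is treated as unwanted and dropped.
RENAME = {"Field1": "title", "Field2": "director"}


def clean(rows):
    """Keep only the fields in RENAME and give them readable names."""
    return [{new: row[old] for old, new in RENAME.items()} for row in rows]


if __name__ == "__main__":
    raw = [{"Field1": "Movie A", "Field2": "Director A", "Field3": "unwanted"}]
    print(clean(raw))
```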

Step 4: Extract data
Simply click “Extract data” to get the data locally.

Extract data

Because local extraction uses your own computer's resources, such as CPU and internet bandwidth, it runs slower than Octoparse cloud extraction.

Anyway, after creating the scraper, all you need to do is wait for the data: more than 100,000 lines of movie data in about 2 hours.

final result
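As a rough sanity check on these numbers, the arithmetic below estimates how fast pages have to be processed. It assumes the run covers all 134,555 titles, takes exactly 2 hours, and that each page holds 50 titles (inferred from the `start=51` offset in the URL). None of these is stated precisely in the article, so treat the result as a ballpark figure.

```python
import math

TOTAL_TITLES = 134_555       # size of the Horror list quoted above
PAGE_SIZE = 50               # assumption: inferred from start=51
RUN_SECONDS = 2 * 60 * 60    # assumption: "about 2 hours" taken as exactly 2

pages = math.ceil(TOTAL_TITLES / PAGE_SIZE)
seconds_per_page = RUN_SECONDS / pages

if __name__ == "__main__":
    print(f"{pages} pages, about {seconds_per_page:.1f} s per page")
```

Under these assumptions the scraper has to load and extract a page every few seconds.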

With the above steps, I believe anyone, including those with no programming background, can easily build a movie crawler with Octoparse V7 and get more than 100,000 lines of movie information. However, that's not the easiest way. Octoparse V8 can make it even easier:

Octoparse 8: Auto-detection

All in all, with data scraping, we can obtain movie data online without any legal issue.

Apart from the data, what matters more is the skill you have learned, which is extremely useful for market research, keeping yourself up to date, and many other things.
