
Apify for Apify

Originally published at blog.apify.com

How to deduplicate scraped data

Data duplicates can be a real problem when web scraping. Learn how to get rid of them quickly and easily.

Duplicates can be a real problem when web scraping. Deduplication is the process of getting rid of duplicates in data - in other words, making sure that we don't have the same thing recorded multiple times. We're going to use Apify Actors to make the process easier.
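Before diving into Actors, here is a minimal sketch of what deduplication by a key looks like in plain JavaScript (the sample emails are made up for illustration):

```javascript
// Deduplicate an array of records by the 'email' field using a Set.
const items = [
  { email: 'name@example.com' },
  { email: 'name2@example.com' },
  { email: 'name@example.com' }, // duplicate of the first item
];

const seen = new Set();
const deduplicated = items.filter(({ email }) => {
  if (seen.has(email)) return false; // already recorded, drop it
  seen.add(email);
  return true; // first occurrence, keep it
});

console.log(deduplicated.length); // 2
```

This works fine in memory; the point of the setup below is doing the same thing across datasets and across repeated runs.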

Step 1. Choose an Actor to build a dataset

We're going to use Contact Details Scraper 🔗 to build a dataset containing unique email addresses extracted from various websites. If everything goes well, we'll end up with a setup that incrementally - whenever it runs - adds newly scraped emails to a single dataset. The result will look like this:

[
  { email: 'name@example.com' },
  { email: 'name2@example.com' },
  // ...
]


Expected shape of output data

Let's start by creating a task for Contact Details Scraper and giving it its input: just a URL to begin with and a reasonable maximum number of pages:

Screenshot of input for Contact Detail Scraper Task

Input of Contact Detail Scraper Task

When we run it, we can see that the information in the dataset looks something like this:

{
    "depth": 1,
    "originalStartUrl": "http://www.gsd.harvard.edu/",
    "referrerUrl": "http://www.gsd.harvard.edu/",
    "url": "https://www.gsd.harvard.edu/doctoral-programs/",
    "domain": "harvard.edu",
    "emails": [
      "melissa_hulett@gsd.harvard.edu",
      "mmoore@gsd.harvard.edu",
      "thorstenson@gsd.harvard.edu"
    ],
    "facebooks": [
      "http://www.facebook.com/HarvardGSD",
      "https://www.facebook.com/HarvardGSD/"
    ],
    "youtubes": [
      "http://www.youtube.com/user/TheHarvardGSD",
      "https://www.youtube.com/TheHarvardGSD"
    ]
}


Example of item in dataset produced by Contact Details Scraper

Scraping the data looks fairly easy, so let's continue with the deduplication and transformation. In this case, that's the hard part.

Step 2. Find an Actor to deduplicate datasets

Luckily, there's already an Actor on Apify Store that deals with this issue: Merge, Dedup & Transform Datasets. Its functionality goes well beyond deduplication, so feel free to explore its other features, such as moving data to key-value stores.

Step 3. Create an integration between the two Actors

Go to the Integrations page, add Integration with Actor, and connect the right one (at the top of this screenshot).

Scraped data deduplication: screenshot of Adding Integrations

Adding Integrations

We only need to set values for a few fields and leave the defaults for others:

  • Dataset IDs - we need to add one ID, {{resource.defaultDatasetId}} - this is a variable representing the ID of the dataset produced by the task run.

  • Fields for deduplication - we need to add just email

  • Mode - for our example, we don't care about the order of items, so we can choose the faster Dedup as loading.

  • Output dataset ID or name - here we need to give the name of the dataset where we want to keep the deduplicated data, let's say emails-on-the-internet.

  • Dataset IDs for just deduping (hidden in the Advanced section of the input). Here we need to put the same name we used as the output dataset name, prefixed with ~ (the Actor internally calls the Apify API, which allows using ~ to access named datasets). This is what makes sure that we ignore duplicates from previous runs too, not just duplicates in the current run. So let's put in ~emails-on-the-internet.

  • In the Transforming functions section, we need to fill in the Pre dedup transform function. This one is going to be a bit more complex. If you're interested, read the comments.

// We are working with datasets of two shapes.
// The items produced by Contact Details Scraper look something like this
// { url: 'example.com', emails: ['name@example.com', 'name2@example.com'], facebooks: [...]}
// The items we want on the output would look like this:
// [{email: 'name@example.com'}, {email: 'name2@example.com'}]
// The transformation makes sure that we always use the output format
async (items, { Apify }) => {
    // No items at all, empty array can be returned.
    if (items.length === 0) return [];
    // If the items have the output format already, just return them
    if (items[0].email) return items;
    // Otherwise assume Contact Details Scraper shape and convert it.
    return items.reduce((acc, { emails }) => {
        const datasetItems = (emails || []).map((email) => ({ email }));
        acc.push(...datasetItems);
        return acc;
    }, []);
}


Pre dedup transform function
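Since the transform is plain JavaScript, we can sanity-check it locally before pasting it into the integration. The sample item below is trimmed down from the Contact Details Scraper output shown earlier, and the `Apify: null` stand-in is just for illustration (the Actor injects the real object):

```javascript
// The same pre-dedup transform as above, runnable locally.
const preDedupTransform = async (items, { Apify }) => {
  if (items.length === 0) return [];
  if (items[0].email) return items; // already in output shape
  return items.reduce((acc, { emails }) => {
    const datasetItems = (emails || []).map((email) => ({ email }));
    acc.push(...datasetItems);
    return acc;
  }, []);
};

// A trimmed-down item in the shape Contact Details Scraper produces.
const scraped = [
  {
    url: 'https://www.gsd.harvard.edu/doctoral-programs/',
    emails: ['melissa_hulett@gsd.harvard.edu', 'mmoore@gsd.harvard.edu'],
  },
];

preDedupTransform(scraped, { Apify: null }).then((result) => {
  console.log(result);
  // → [{ email: 'melissa_hulett@gsd.harvard.edu' }, { email: 'mmoore@gsd.harvard.edu' }]
});
```

Note the early return: if the items already have an `email` field, the function passes them through untouched, so it is safe to run on both dataset shapes.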

This JSON contains the fields set to proper values:

{
    "datasetIds": [
        "{{resource.defaultDatasetId}}"
    ],
    "datasetIdsOfFilterItems": [
        "~emails-on-the-internet"
    ],
    "fields": [
        "email"
    ],
    "mode": "dedup-as-loading",
    "outputDatasetId": "emails-on-the-internet",
    "postDedupTransformFunction": "async (items, { Apify }) => {\n return items;\n}",
    "preDedupTransformFunction": "// We are working with datasets of two shapes.\n// The items produced by Contact Details Scraper look something like this\n// { url: 'example.com', emails: ['name@example.com', 'name2@example.com'], facebooks: [...]}\n// The items we want on the output would look like this:\n// [{email: 'name@example.com'}, {email: 'name2@example.com'}]\n// The transformation makes sure that we always use the output format\nasync (items, { Apify }) => {\n // No items at all, empty array can be returned.\n if (items.length === 0) return [];\n // If the items have the output format already, just return them\n if (items[0].email) return items;\n // Otherwise assume Contact Details Scraper shape and convert it.\n return items.reduce((acc, {emails}) => {\n const datasetItems = (emails || []).map(email => ({email}) );\n acc.push(...datasetItems);\n return acc;\n }, []);\n}",
    "verboseLog": false
}


Input for Dedup Actor when used as integration

  • The Actor has quite a high default memory; for our use case, it's enough to set it to 1 GB.

The setup is complete; let's check if it works.


Step 4. Check your integration setup

Now, let's see what happens when we run the task. We can see it has finished and produced 30 results, but only some of them actually contain email addresses.

Scraped data deduplication: screenshot of Output of Contact Details Scraper Task

Output of Contact Details Scraper Task

On the Integrations tab of the run, we can see that the Dedup Actor was triggered:

Scraped data deduplication: screenshot of Integrations tab

Integrations tab

When we check the named dataset (under Storages), we can see that we have 389 unique emails:

Scraped data deduplication: screenshot of resulting dataset

Resulting dataset

Now let's increase the Maximum number of pages per start URL input field on the task. Most likely, it's going to find the same emails, and probably a few more that we haven't seen yet. In our case, we got five new emails.

Now, whenever you run the task again, only previously unseen emails will make it to the named dataset, and you don't have to worry about it containing duplicates.
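This incremental behavior can be sketched in plain JavaScript: the named dataset acts as a filter, so only previously unseen emails get appended. This is a simulation of the idea, not the Actor's actual code, and the sample emails are made up:

```javascript
// Simulate "dedup against a filter dataset": items already stored in the
// named dataset are skipped, so only new emails get appended.
const existingDataset = [
  { email: 'name@example.com' },
  { email: 'name2@example.com' },
];

const newRunItems = [
  { email: 'name@example.com' },  // already stored - will be dropped
  { email: 'name3@example.com' }, // new - will be kept
];

const seen = new Set(existingDataset.map(({ email }) => email));
const toAppend = newRunItems.filter(({ email }) => !seen.has(email));

console.log(toAppend); // → [{ email: 'name3@example.com' }]
```

This is exactly what the Dataset IDs for just deduping field buys us: each run is compared against everything already stored, not just against itself.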

That's it! Remember that you can set up your own Actor-to-Actor or Actor-to-other-service Integration from scratch. See this video for example:
