Jesse Smith Byers

Posted on Sep 26, 2020

Someone Stole My DEV Article! How To Build a Python Script to Detect Stolen Content

#python #codenewbie #tutorial

Last spring, one of my blog posts got a lot of views and likes, and was ultimately featured on the Top 7 List for that week. As a new blogger, I wanted to figure out what had made it so popular, so I could build on that success. However, in the process of investigation I stumbled upon my own writing...IN SOMEONE ELSE'S BLOG!!! And then I also stumbled upon their 12 Twitter accounts which were all tweeting out my content, AND TAKING CREDIT FOR IT AS THEIR OWN WORK!!! The wonderful folks at DEV helped me out and sent out a request to have the stolen content removed ...but ever since, I've been nervous about posting, fearing this might happen again.

Fast forward a few months, and my job involves the task of building a Python script to search for web content that could be used for cheating. I immediately realized a similar script could be used to monitor the web to detect any unauthorized scraping and re-posting of DEV posts. I had also recently read @saschat's post, Get Historical Stats for your Dev.to Articles, which uses the Dev.to API to fetch data on each of your articles. So I got to work, building a script that leverages both the Dev.to API to fetch my article data, as well as the Google Custom Search Engine API to manage the custom searches.

Below, I'll share my process and my code. I hope you'll follow along and build your own script, using Python, the Dev.to API, and the Google Custom Search Engine API. This tutorial assumes some basic knowledge of Python fundamentals, and general familiarity with fetching data from external APIs.

Build a Python Script to Check if Anyone Has Stolen Your DEV Posts

Initial Set-Up

Create Project Folder and Repository

First, you'll need to create a project folder, create a few placeholder files, initialize your repository and push it to GitHub.

We will organize our dependencies in the requirements.txt file. In this project, we will start by installing the requests package (for requesting data from the APIs) and colorama package (for formatting text printed to the terminal). Add these names to the file, and run pip3 install -r requirements.txt to install all of the dependencies at once.

Set up the Google Custom Search Engine API

In order to use this API to manage your custom searches, you will need to have a google account. Follow the link on the Google Custom Search JSON API Documentation to set up an API key for your project, and jot it down. Then click on the Google Programmable Search Control Panel link to set up your custom search and generate your Search Engine ID. You will need to choose at least one website domain to include in your search, so for now, just set it to dev.to/*, and any other sites where your articles should be showing up in a web search. Since I post my articles to Twitter (and occasionally to LinkedIn and GitHub), I added those domains as well. This will get us started for testing the script, and we can add/refine the custom search parameters once we are confident the script works.

After you press "Create", follow the link to your Control Panel to finish setting up the search and to get your Search Engine ID. Save the Search Engine ID for later, and keep the Control Panel open, as we'll modify the settings later on.

Set up the Dev.to API

Next, you will need to get your API key for the Dev.to API. In order to do this, go to your DEV settings page, click on Account, and scroll to the DEV API keys section. Just add a project title, and it will generate a new key. Jot down this key as well. For more details, visit the Documentation for the DEV API.

Save API keys and Search Engine ID to config.py

First, make sure you have added the config.py file to your .gitignore file to prevent it from being tracked by GitHub. Next, add all of the keys and IDs you just generated to the config.py file, to keep them secure.

When you're finished, the file should look like this, with each key/ID stored as a string.

If you need a resource on securing API keys, I recommend BlackTechDiva's article Hide Your API Keys for a clear and concise tutorial.

Fetch Article Data from the Dev.to API

Now we're ready to start putting those API keys to work! First, we'll make a fetch to the Dev.to API to retrieve all of our articles published on DEV, and then use the response data to create a list of article titles, which we will be able to use in our custom search.

In the search.py file, we will need to import requests, which will allow us to request the data from the Dev.to API. We will also import config so that we can access our API key which is locally stored in that file.

Our request will return our article data in this format:

By looping through the data object, we then create a list of titles of all published articles (article_list) by accessing article["title"]. We can also create a list of recent_articles, so that we can conduct searches on a sub-set of the ten most recent articles (recent_articles):

Build a General Search Script

Once we have a list of articles, we are ready to set up a general search script which will use each article title as a search term, and will print out the top 10 google hits (based on the settings in our Custom Search Engine) to the terminal.

First, we will import and initiate colorama, which will allow us to add color formatting to our search results when they print out to the terminal. The argument autoreset=True will set the color back to the default after each line.

Next, we will set up our search logic. We will need to iterate through the list of recent articles, and make a request to the Google Custom Search API for each title. The response for each article title search will be a large dictionary, and the hits will be housed inside the "items" key. We can set the hits variable to include the top 10 hits by indexing the results.

Next, we can iterate through the hits list to set up our logic for printing results to the terminal. By adding the try: and except: logic, we can avoid throwing errors when a search term fails to get hits.

Note: So far in this example, I have limited the recent_articles list to 10 articles, and the hits list to 10 hits. This is based on the API's daily query limits, which allow for 100 queries per day. Feel free to adjust these numbers, but be mindful of the daily limit as you make your choices. More information about the daily limit can be found in the documentation.

When we run the script at this point, we will get a printout of the top 10 hits for each of the 10 article titles, from any of the domains that we identified in our Google Custom Search Control Panel during an earlier step. In my example, my printout shows links to my articles on Dev.to, Twitter, LinkedIn and Github. For example:

Putting it All Together

At this point, we have a script that works to detect our recent posts on the sites where we want and would expect to see the posts. However, if the ultimate goal is to search the web for stolen blog content, we will need to alter our settings in the Google Custom Search Engine Control Panel. Click on Setup, and then Advanced. You can now revise your previous settings, add domains that you would like to EXCLUDE from your search. After some manual testing, I ended up revising my settings to look like this, so that I would be searching the entire web, but excluding my own content shared via my own accounts:

Here is an example of the new search output, which highlights some additional places where my article titles are showing up on the web:

While the first two are harmless shares or re-posts, the third one is concerning, as that matches the twitter handles and blog address of the original stolen content.

Final Thoughts

It definitely takes some trial and error to refine the Google Programmable Search settings in your control panel - you need to find the sweet spot between excluding your own posts and shares, and including other domains and other people's references to your work. You'll then have to manually skim through the output of the search. Most of your hits will be harmless, like the first two above - people (or bots) who find and like your work will share it! But you may also detect sites that have scraped your writing and re-posted it without attributing it to you, and those are the sites that should be reported, with a request to remove the stolen content.

Try This!

After you finish your script, there are a number of modifications and refinements you can make to make it even better:

Play around with search terms: Maybe try searching for a snippet of your blog content rather than just the title.
Refine the settings in the Control Panel: What settings give you the most relevant number of hits?
Change the numbers in the script: Do you want to see the more hits, for a fewer number of posts? Just fiddle around with how your index your lists.
Create a Script to search for different content: What else could you use a Google Custom Search for?

Check out the Repository

Thanks for reading! If you'd like to check out my repo, you can check it out here.

Resources

Since this whole article is focused on protecting intellectual property, it's important to give credit where credit is due! I used the following resources to help put together this project:

How to Use Google Custom Search Engine API in Python - a tutorial that helped me put together the basics for integrating the Google API into a Python script.
Tutorial GitHub repo - The source code from the above tutorial
Google Custom Search JSON API Documentation - includes details on obtaining an API key and usage limits.
Google Programmable Search Control Panel - where you will create/edit/manage your search domains through your google account
Get Historical Stats for your Dev.to Articles - for inspiration and information on accessing the DEV API for article data.
Documentation for the DEV API - covers authentication and endpoints for accessing data on DEV articles.
Hide Your API Keys - clear and concise summary of how to secure your API keys in a python project.

Top comments (13)

Michael Tharrington • Sep 29 '20

This is seriously AWESOME!

I work as DEV's Community Manager and would love to figure out how we might be able to use this to find potential plagiarism on DEV. I'm a non-developer, so bear with me here... 😅

I'm wondering if this could be adjusted so that an admin wouldn't need to search for the title of an article, but instead the tool would bubble up a list of DEV articles that are duplicates or near duplicates based on what's posted in the title and body. Then an admin could review the material and investigate whether it is indeed plagiarism or not.

I may be thinking of something entirely different here, but you have definitely captured my interest with this. Thank you for sharing!

Jesse Smith Byers • Sep 29 '20

Thank you for the feedback! This could definitely be tweaked for something like that. I’ll send you a DM so I can better understand the problem you’re aiming to solve.

Jesse Smith Byers • Sep 29 '20

OK, I don't think I can send you a DM unless you follow me!

Michael Tharrington • Sep 30 '20

Good stuff! Just followed you. Thought I was already, but now I am for sure. 🙂

Sandor Dargo • Sep 27 '20

Thanks a lot for sharing! This is so great that I'm gonna steal it!

Jesse Smith Byers • Sep 28 '20

Uh oh, I guess I better run the script!

xtofl • Sep 30 '20 • Edited

Very nice to integrate these apis into a neat tool.

Btw, did you learn about list comprehensions? They can make your python more expressive.

titles = (art["title"] for art in data)

Rajan Panchal • Sep 26 '20 • Edited

Great post! Stolen articles is always a pain.. how about you extend it to report to DMCA using their API?!

Jesse Smith Byers • Sep 26 '20

Thank you, what a great idea! I’ll definitely take a look at that API and see how I could build that in.

Alwina Oyewoleturner • Sep 26 '20

This is a great post! I’m just learning Python and this is a great tutorial to create my own search script. Thank you!

Jesse Smith Byers • Sep 26 '20

I’m fairly new to Python as well, and this was a perfect intro project for me too. What resources have you been using to learn Python so far? Any recommendations?

Alwina Oyewoleturner • Sep 27 '20

Check out JetBrains Academy, python developer track. They’re free until January 2021. JetBrains created the IntelliJ IDE for Java and they have a few tracks for learning programming. You can complete the code challenges online or with the PyCharm IDE (Python version of IntelliJ). If you already have IntelliJ (community or paid version), install the EduTools plugin so you can learn. I really like this way to learn! Thank you again for the post!

Raymond • Sep 27 '20

I find this a good use of python's capabilities finding when articles are copied. I've also been working on a similar project where I'm checking if a website for randomly copied code.

View full discussion (13 comments)