Ninjeneer

Posted on Jan 21, 2022 • Edited on Jan 22, 2022

Creating a Netflix clone

#python #algorithms #automatisation #torrent

Disclaimer

This project is for practice purposes only. The full source code will not be shared, to avoid abuse.

Introduction

As an experiment, I want to build a Netflix clone based on automated torrent downloading.

Requirements

To build such a system, I need a web platform able to stream videos across multiple devices, as Netflix does. Fortunately, the Plex platform already does this job very well. Therefore, I just have to build a piece of software that crawls torrent websites to find and download the films and series I want to watch.

The idea is to have a web interface that asks me which film I want to watch. If I already have it on my hard drive, it opens the Plex platform. If not, it triggers an automated torrent download and moves the film/series into my Plex media folder.
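The "do I already have it?" check could be sketched as below. This is only an illustration under assumptions: `find_in_library`, the list of video extensions and the "every query word appears in the file name" rule are hypothetical, not taken from the actual project.

```python
from pathlib import Path

# Assumed set of video extensions; adjust to what the media folder contains.
VIDEO_EXTENSIONS = {".mkv", ".mp4", ".avi"}


def find_in_library(media_dir: Path, query: str) -> list[Path]:
    """Return the video files whose name contains every word of the query."""
    words = query.lower().split()
    matches = []
    for path in media_dir.rglob("*"):
        if path.is_file() and path.suffix.lower() in VIDEO_EXTENSIONS:
            name = path.stem.lower()
            if all(word in name for word in words):
                matches.append(path)
    return sorted(matches)
```

An empty result would mean the film is missing and the torrent search has to be triggered.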

Technical stack

For this project, I will use Python. As I haven't really worked with it yet, this will be a great introduction to the language.

Steps

Step 1 : Getting the web page

After a search on the website, a specific URL is built. For instance, when searching for the Avengers film, the URL looks like this: https://xxxxxxx/search/avengers/1/99/200

For web scraping, I am using the BeautifulSoup4 Python module together with the requests module.

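import requests
from bs4 import BeautifulSoup


class XXXParser(Parser):  # Parser: base class defined elsewhere in the project (not shown here)
    def __init__(self):
        super().__init__()
        self.base_url = "https://xxx"

    def __build_url(self, film_name: str) -> str:
        # Search URL pattern observed on the website: /search/<query>/1/99/200
        url = self.base_url
        url += "/search/" + film_name + "/1/99/200"
        return url

    def __get_page_content(self, film_name: str) -> BeautifulSoup:
        # Download the results page and parse it with Python's built-in HTML parser
        html_content = requests.get(self.__build_url(film_name)).text
        return BeautifulSoup(html_content, 'html.parser')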
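A query containing spaces or special characters has to be percent-encoded before it goes into the URL path. Here is a minimal sketch of that step. `build_search_url` is a hypothetical standalone helper (not the method above), the base URL is a placeholder, and the `/1/99/200` suffix is simply copied from the example URL.

```python
from urllib.parse import quote


def build_search_url(base_url: str, film_name: str) -> str:
    # safe="" also encodes "/", so the query cannot add extra path segments
    return f"{base_url}/search/{quote(film_name, safe='')}/1/99/200"
```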

The website I am scraping is structured this way:

<tr>
    <td class="vertTh">
        <center>
            <a href="https://.../browse/200" title="More from this category">Video</a><br>
            (<a href="https://...y/browse/207" title="More from this category">HD - Movies</a>)
        </center>
    </td>
    <td>
        <div class="detName">
            <a href="https://.../torrent/34281763/Avengers.Endgame.2019.1080p.BRRip.x264-MP4"
                class="detLink">Avengers.Endgame.2019.1080p.BRRip.x264-MP4</a>
        </div>
        <a href="magnet:?...">
            <img src="https://.../static/img/icon-magnet.gif" alt="Magnet link" width="12"height="12">
        </a>

        <a href="https://.../user/..."><img src="https://.../static/img/trusted.png" alt="Trusted" title="Trusted" style="width:11px;" width="11" height="11" border="0"></a>
    </td>
    <td align="right">1803</td>
    <td align="right">383</td>
</tr>

After receiving the web page as a BeautifulSoup object, I can start filtering the HTML tags to retrieve the information I am looking for:

Title
Download URL
Number of seeders
Trusted uploader
Video quality

# Requires "import re" at the top of the module
page_content = self.__get_page_content(film_name)
rows = page_content.select("tr")
for row in rows:
    if row.select_one(".vertTh") is None:
        # This is not a table row containing a film
        continue

    link = row.select_one(".detLink")
    title = link.text  # separate name, so the search query in film_name is not overwritten
    film_url = link.attrs.get('href')
    # Use the <td> cells rather than row.contents, which can also hold whitespace text nodes
    cells = row.find_all("td")
    seeders = int(cells[-2].text)
    leechers = int(cells[-1].text)
    trusted = row.select_one("img[alt=Trusted]")
    # 3 or 4 digits, so that 720p is matched as well as 1080p
    quality = re.search(r"\d{3,4}p", title)
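In the snippet above, `select_one` and `re.search` return a tag or match object (or `None`), not plain values. One way to normalize them is a small record, sketched below. `TorrentResult` and `make_result` are hypothetical names, and the quality pattern here uses `\d{3,4}` so that 3-digit qualities such as 720p are captured too.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class TorrentResult:  # hypothetical record, not part of the original project
    title: str
    url: str
    seeders: int
    leechers: int
    trusted: bool
    quality: Optional[int]  # vertical resolution (e.g. 1080), None if not found


def make_result(title: str, url: str, seeders: str, leechers: str,
                trusted_tag: object) -> TorrentResult:
    """Turn raw scraped strings into typed values."""
    match = re.search(r"(\d{3,4})p", title)
    return TorrentResult(
        title=title,
        url=url,
        seeders=int(seeders),
        leechers=int(leechers),
        trusted=trusted_tag is not None,  # any tag found means "Trusted"
        quality=int(match.group(1)) if match else None,
    )
```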

Step 2 : Filtering results

One of the problems with torrents is their unintelligible names. Many of them look like this: Avengers.Infinity.War.2018.1080p.10bit.BluRay.8CH.x265.HEVC-PSA, which makes filtering the data harder.

So, I need to identify which text to remove to clean up the titles.

Replace dots with spaces
Remove the quality using a regex (\d{3,4}p)
Remove the tags "DVDrip", "HDrip", etc. using a regex (\w{2,3}rip)
Remove keywords repeated across all titles: blueray, bluray, HEVC, AAC, ACC, PSA, MP4...
Remove encoding tags using a regex ((x|h)\d+)
Remove the useless "The" at the beginning of titles

I now have more natural results:

avengers
avengers endgame
avengers endgame (2019)
avengers infinity war
avengers infinity war 2018 english
avengers age of ultron (2015)
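The cleaning rules of this step can be sketched as a single function. This is an approximation under assumptions, not the project's actual code: `clean_title` and `NOISE_KEYWORDS` are hypothetical names, the keyword set only contains the examples listed in this step, and hyphens are treated as separators.

```python
import re

# Assumed keyword list, limited to the examples given in the article.
NOISE_KEYWORDS = {"blueray", "bluray", "hevc", "aac", "acc", "psa", "mp4"}


def clean_title(raw: str) -> str:
    title = raw.lower().replace(".", " ")           # dots -> spaces
    title = re.sub(r"\b\d{3,4}p\b", " ", title)     # quality: 720p, 1080p
    title = re.sub(r"\b\w{2,3}rip\b", " ", title)   # dvdrip, hdrip, brrip
    title = re.sub(r"\b[xh]\d+\b", " ", title)      # encoding: x264, h265
    # Split on whitespace and hyphens, then drop empty and noise words
    words = [w for w in re.split(r"[\s\-]+", title)
             if w and w not in NOISE_KEYWORDS]
    if words and words[0] == "the":                 # leading "The"
        words = words[1:]
    return " ".join(words)


print(clean_title("Avengers.Endgame.2019.1080p.BRRip.x264-MP4"))
# avengers endgame 2019
```

Tags not covered by these rules (for example "10bit" or "8CH") survive this sketch and would need their own keywords.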

Step 3 : Sorting results

I don't want to spend time filtering the results myself to find the best one; I want it to be automated. That's why I need to give each result a score based on several criteria.
By default, every result has a score of 0.

Levenshtein distance

This is the most important of all the scoring methods.

The Levenshtein distance is the minimum number of single-character edits (insertions, deletions or substitutions) needed to go from a string A to a string B. In my case, I want the Levenshtein distance between my query and the title to be as low as possible. Thanks to the title cleaning done in the previous step, film titles already look pretty natural.
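For illustration, here is a plain dynamic-programming version of the distance, written with the standard library only. The article does not say which implementation or package the project uses, so treat this as a sketch of the technique rather than the actual code.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions from a to b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


print(levenshtein("avengers", "avengers endgame"))  # 8
```

For the query "avengers", the cleaned title "avengers" has a distance of 0, while "avengers endgame" is 8 edits away, so the exact match comes out ahead.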

Seeders

As I want my film to be downloaded as fast as possible, I'm looking for the results with the most seeders. To keep the number of seeders from inflating the score too much, I use the square root function, whose value grows more and more slowly as its input increases.
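The dampening effect is easy to see on a few values. In this sketch, `seeders_score` is a hypothetical helper; whether the project applies an extra weight to the square root is not stated, so none is assumed here.

```python
import math


def seeders_score(seeders: int) -> float:
    # Clamp at 0 so a malformed negative count cannot raise an error
    return math.sqrt(max(seeders, 0))


print(seeders_score(100))  # 10.0
print(seeders_score(400))  # 20.0
```

Four times as many seeders only doubles the contribution to the score.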

Language

As a French speaker, I prefer watching movies in French. If the movie title contains the "french" keyword, its score is increased by 1. However, if it only contains an "fr" keyword, its score is increased by 0.5, because I am less sure that it is a language tag.
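A possible way to apply this rule is to compare whole words rather than substrings, so that "fr" inside another word does not count. `language_bonus` is a hypothetical helper, and the word-splitting rule is an assumption.

```python
import re


def language_bonus(title: str) -> float:
    # Split on anything that is not a letter or a digit
    tokens = re.split(r"[^a-z0-9]+", title.lower())
    if "french" in tokens:
        return 1.0
    if "fr" in tokens:
        return 0.5  # less certain: "fr" may not be a language tag
    return 0.0
```

With this rule, "Frozen 2013" gets no bonus, because "fr" only appears inside the word "frozen".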

Quality

Quality is also an important criterion. If the title contains a quality greater than or equal to 1080p, the film's score increases by 1 point. If the quality is lower, the increase shrinks with the quality (720p => 0.5, 480p => 0.25...).
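One simple way to encode these values is a lookup table. In the sketch below, `quality_bonus` and `LOWER_QUALITY_BONUS` are hypothetical names, only the two lower qualities quoted above are listed, and giving 0 to any other quality (or to a title without a quality) is an assumption.

```python
import re

# Values quoted in the article for qualities below 1080p
LOWER_QUALITY_BONUS = {720: 0.5, 480: 0.25}


def quality_bonus(title: str) -> float:
    match = re.search(r"(\d{3,4})p\b", title.lower())
    if match is None:
        return 0.0  # assumption: no quality tag, no bonus
    height = int(match.group(1))
    if height >= 1080:
        return 1.0
    return LOWER_QUALITY_BONUS.get(height, 0.0)
```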

Trusted uploader

The website I am scraping can reward users with a "Trusted" tag. This tag assures me of good quality and accurate content. A film uploaded by a trusted uploader automatically gets its score increased by 1.
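Putting the criteria together could look like the sketch below. The article does not say how the Levenshtein distance is combined with the bonuses or how the criteria are weighted, so subtracting the raw distance and adding everything else unweighted is purely an assumption, and `Candidate`, `total_score` and `rank` are hypothetical names.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:  # hypothetical container for the values computed per result
    title: str
    distance: int          # Levenshtein distance to the query
    seeders: int
    language_bonus: float
    quality_bonus: float
    trusted: bool


def total_score(c: Candidate) -> float:
    score = 0.0                      # every result starts at 0
    score -= c.distance              # assumption: a lower distance is better
    score += math.sqrt(c.seeders)
    score += c.language_bonus + c.quality_bonus
    if c.trusted:
        score += 1.0                 # trusted uploader bonus
    return score


def rank(candidates: list[Candidate]) -> list[Candidate]:
    # Best result first
    return sorted(candidates, key=total_score, reverse=True)
```

With unweighted terms, a very large seeder count can outweigh a poor title match, so a real implementation would probably need to tune the weights.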

Step 4 : Automating the download

To be continued...

Thanks for reading, and remember to stay awesome!

DEV Community