Cat McGee

Posted on Jun 29, 2020 • Edited on Jun 30, 2020

Get your first dev job by building these projects! #2: Markov Chain Lyrics Generator

#beginners #python #career

Hello my friends! You may be aware of what this 'get your first dev job by building these projects' is (I don't know why I gave it such a long name), but if not - check out this article.

TLDR: Follow these tutorials (they all come with a short video), build on top of the projects, and become a developer. It's like a guaranteed way to get PAID TO CODE. And that's the coolest feeling ever.

This Markov Chain Lyrics Generator is one of my favourite projects. Basically, it uses a type of prediction model (called a Markov Chain) to generate its own version of your favourite artist's lyrics. These end up being super hilarious sometimes, because they sound similar to the artist but just... not quite. AI generated stuff is totally my sense of humour so I've literally spent hours laughing at the responses. Call me lame, I dare you.

A Markov Chain is one of the simplest kinds of prediction model - it only uses the previous state to predict the next one. In the simplest version for lyrics, that means using only the previous word to predict the next word. (By default, markovify actually looks at the previous two words.)
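If you want to see the idea without any library, here is a minimal from-scratch sketch of a one-word Markov chain. It is not how markovify is implemented, and the function names (build_chain, generate) and the sample text are made up for illustration.

```python
import random
from collections import defaultdict


def build_chain(text):
    """Map each word to the list of words that follow it in the text."""
    words = text.split()
    chain = defaultdict(list)
    for current_word, next_word in zip(words, words[1:]):
        chain[current_word].append(next_word)
    return dict(chain)


def generate(chain, start, max_words=8, rng=random):
    """Walk the chain from `start`, picking each next word at random."""
    word = start
    output = [word]
    for _ in range(max_words - 1):
        followers = chain.get(word)
        if not followers:  # dead end: nothing ever followed this word
            break
        word = rng.choice(followers)
        output.append(word)
    return " ".join(output)


# Made-up sample text, not real lyrics
sample = "the sky is blue and the sea is deep and the sky is wide"
chain = build_chain(sample)
print(chain["the"])            # every word that followed "the" in the sample
print(generate(chain, "the"))  # a random walk, different on each run
```

Words that appear more often after "the" show up more often in its list, so they get picked more often. That is the whole trick.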

Dive right into the code from GitHub here, or watch the 2:20 Tutorial video here.

I feel like this is one of those recipe blogs where you have to read through the rambling nonsense before getting to the actual recipe. Sorry about that. Let's get on with it.

Step 1: Install dependencies

Python3 - Download here

re - Python’s regex library. We’ll be using this when scraping a website to find the links to lyrics. It ships with Python, so there is nothing to install.

urllib - URL library. We use this library to scrape the HTML off of a page and read it to a string. It also ships with Python, so there is nothing to install.

markovify - This library can generate a Markov Chain for us.

pip install markovify

Step 2: Create a file for markovify to read from

We’re going to be putting all of the artist’s lyrics into one file for markovify to read from.

originalLyrics = open('lyrics.txt', 'w')

Step 3: Scrape the links to the lyrics

Here, we could manually input all the lyrics into the file, or use a paid API. But this way is more fun.

We’ll be using AZLyrics to scrape the lyrics of your artist and place them in lyrics.txt.

The first thing we need to do here is go to the artist’s page on AZLyrics and find every link on the page that links to a song. Have a look at the Coldplay page on AZLyrics to see what I mean (we'll be using Coldplay in this tutorial. I don't really know why.)

Use the urllib library to request the HTML from the artist’s page, and convert it into a string using read so we can easily search it for links.

import re
import urllib.request

url = "https://www.azlyrics.com/c/coldplay.html"
artistHtml = urllib.request.urlopen(url)
artistHtmlStr = str(artistHtml.read())

To find the links on the page, we’ll be using the re library we imported, which will find every link on the page.

links = re.findall('href="([^"]+)"', artistHtmlStr)

Now you have a list called links which contains all of the links on the artist page! Try printing the links to see what happens.

print(links)

Notice anything? If you look through the links, you’ll see that you have all of the links to the lyrics pages (great!) but also to other pages, like contact pages and even the CSS files. Another thing you should notice is that all of the lyrics links have the string lyrics/coldplay in them.
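If you'd like to poke at the regular expression without hitting the site, here is a small sketch that runs it on a made-up HTML snippet. The snippet is invented for illustration and is not real AZLyrics markup.

```python
import re

# Made-up HTML standing in for the artist page (not real AZLyrics markup)
sample_html = (
    '<link rel="stylesheet" href="/css/style.css">'
    '<a href="../lyrics/coldplay/yellow.html">Yellow</a>'
    '<a href="/contact.html">Contact</a>'
)

# [^"]+ matches one or more characters that are not a double quote, so the
# captured group is everything between href=" and the closing quote.
links = re.findall(r'href="([^"]+)"', sample_html)
print(links)
# ['/css/style.css', '../lyrics/coldplay/yellow.html', '/contact.html']
```

Because the pattern has one pair of parentheses, findall returns just the captured part of each match, not the whole href="..." text.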

Now we have to filter those links so that we only get the lyrics pages. Initiate another list that will hold only lyrics links.

songLinks = []

We’ll loop through the links list, selecting everything that contains the string lyrics/coldplay and appending them to the songLinks list.

Still with me?

for x in links:
    if "lyrics/coldplay" in x:
        songLinks.append(x)

Print songLinks outside of the for loop to see what you end up with. Great - only the lyrics links! However, we still have another problem. The links do not have the full URL, only the end. And they have .. in them. Before adding the links to the songLinks list, we need to replace .. with an empty string and put the beginning of the URL in front of the link. Update your if statement to look like this.

for x in links:
    if "lyrics/coldplay" in x:
        x = x.replace("..", "")  # "../lyrics/..." becomes "/lyrics/..."
        x = "https://www.azlyrics.com" + x  # no trailing slash, x already starts with one
        songLinks.append(x)

Print songLinks again and see what happens. Now we have only full lyrics links. Awesome.
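As an alternative to fixing the links by hand, the standard library's urljoin can resolve a relative link against the page it came from. This is a sketch of a different technique from the replace approach above, and the hrefs in the list are made up.

```python
from urllib.parse import urljoin

artist_url = "https://www.azlyrics.com/c/coldplay.html"

# Made-up hrefs in the shapes a scraped page might use
links = [
    "../lyrics/coldplay/yellow.html",
    "/lyrics/coldplay/clocks.html",
    "/contact.html",
]

# Keep only the lyrics links and resolve each one against the artist page URL
songLinks = [urljoin(artist_url, link) for link in links if "lyrics/coldplay" in link]
print(songLinks)
# ['https://www.azlyrics.com/lyrics/coldplay/yellow.html',
#  'https://www.azlyrics.com/lyrics/coldplay/clocks.html']
```

The nice part is that urljoin copes with both "../lyrics/..." and "/lyrics/..." style links, so the code keeps working if the site changes which one it uses.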

Step 4: Scrape the lyrics from the lyrics links

The next step is to get the HTML from each link in the songLinks list, find the actual lyrics and nothing else, and save it to the file we created in Step 2. Alright, let’s go.

Firstly, we’ll need to iterate through our songLinks list and scrape the HTML, converting it into a string. We already did this for the artist page in Step 3.

for x in songLinks:
    songHtml = urllib.request.urlopen(x)
    songHtmlStr = str(songHtml.read())

If we print songHtmlStr here, we’ll get everything on the song page. We don’t want that, or our prediction model will be taking from more text than just the lyrics. It'll be like 'yeah, baby, contact us, join the newsletter.' So we need to find where the lyrics are on the page, and split our string so that we’re only adding the lyrics to the file.

Look at a lyrics page on AZLyrics and right click the part where the lyrics are. When you click inspect (or inspect element depending on your browser), you’ll see that the lyrics start after a disclaimer (‘content by any third-party…’) and end at the end tag </div>. So we need to split our string twice - once to get all HTML after the disclaimer, and another time to get all the HTML before </div>.

Luckily, Python has a super easy way to split a string - split(). The split function takes an argument for the text in the string that marks the end of one split and the beginning of the other. It returns a list of strings, but we can simply get the first or second item in that list.

To split our lyrics after the disclaimer, we pass two arguments - the disclaimer text, and 1 because we only want to split the string once (into two parts), not at every instance of the disclaimer.

split = songHtmlStr.split('content by any third-party lyrics provider is prohibited by our licensing agreement. Sorry about that. -->',1)

Then we split the new string, splitHtml, at the instance of </div>, again only splitting once, and the first item in the list will be our lyrics!

splitHtml = split[1]
split = splitHtml.split('</div>',1)
lyrics = split[0]
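Here is the double split on a tiny made-up string, so you can see what the 1 does. The page text and the "<!-- notice -->" marker are invented stand-ins for songHtmlStr and the long disclaimer comment.

```python
# Made-up page text; "<!-- notice -->" plays the role of the disclaimer
page = "<div>menu</div><!-- notice -->First line<br>Second line</div><div>footer</div>"

# With maxsplit=1 the string is cut at the first match only (at most two parts)
afterNotice = page.split("<!-- notice -->", 1)[1]
lyrics = afterNotice.split("</div>", 1)[0]
print(lyrics)  # First line<br>Second line

# Without the 1, split cuts at every </div> on the page
print(len(page.split("</div>")))  # 4
```

One thing to watch: if the marker is not on the page at all, split returns a one-item list and the [1] raises an IndexError.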

Try printing the lyrics string - you’ll see we have all the scraped lyrics! However, we have another problem. We’ve picked up all of the extra bits like <br> and \n that we don’t want. Use replace methods to replace each of these with empty strings or whatever you like. I used these ones in the video, but they’re not perfect - you can write your own!

lyrics = lyrics.replace('<br>', '\n')
lyrics = lyrics.replace('\\', '')
lyrics = lyrics.replace('\nn', '\n')
lyrics = lyrics.replace('<i>', '')
lyrics = lyrics.replace('</i>', '')
lyrics = lyrics.replace('[Chorus]', '')
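If you do want to write your own cleanup, one possible direction is to decode the page bytes instead of wrapping them in str(), then tidy up with a few regular expressions. This is a sketch of a different approach from the one in the video. The sample bytes are made up, and it assumes the page is UTF-8.

```python
import re

# Made-up bytes standing in for what songHtml.read() returns
raw = b"Look at the stars<br>\r\n<i>[Chorus]</i><br>\r\nAnd it was all yellow\r\n"

# Decoding gives real text, so newlines are real newlines rather than the
# two characters "\" and "n" that str(bytes) produces. UTF-8 is an assumption.
text = raw.decode("utf-8")

text = re.sub(r"<br\s*/?>", "", text)   # drop <br>; the line break after it stays
text = re.sub(r"</?i>", "", text)       # drop <i> and </i>
text = re.sub(r"\[[^\]]*\]", "", text)  # drop section labels such as [Chorus]

# Strip each line and throw away the empty ones
lines = [line.strip() for line in text.splitlines()]
lyrics = "\n".join(line for line in lines if line)
print(lyrics)
```

With the sample above this prints the two lyric lines, one per line, with no tags or labels left over.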

Now all we have to do is write these lyrics to the file we had opened earlier! Once we’ve finished going through the loop, we’ll close the file - we no longer need to write anything to it.

Here’s what the whole for loop should look like.

for x in songLinks:
    songHtml = urllib.request.urlopen(x)
    songHtmlStr = str(songHtml.read()) 
    split = songHtmlStr.split('content by any third-party lyrics provider is prohibited by our licensing agreement. Sorry about that. -->',1)
    splitHtml = split[1]
    split = splitHtml.split('</div>',1)
    lyrics = split[0]
    lyrics = lyrics.replace('<br>', '\n')
    lyrics = lyrics.replace('\\', '')
    lyrics = lyrics.replace('\nn', '\n')
    lyrics = lyrics.replace('<i>', '')
    lyrics = lyrics.replace('</i>', '')
    lyrics = lyrics.replace('[Chorus]', '')
    originalLyrics.write(lyrics)
originalLyrics.close()
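If the site answers with "HTTP Error 403: Forbidden" partway through the loop, it has probably decided you are a bot. Two things that may help are sending a User-Agent header and sleeping for a random interval between requests. Here is a hedged sketch of both. The helper names (build_request, polite_fetch), the User-Agent string and the delay values are all assumptions for illustration, and none of this is guaranteed to keep AZLyrics happy.

```python
import random
import time
import urllib.request

# Hypothetical User-Agent string; this exact value is an assumption
USER_AGENT = "Mozilla/5.0 (lyrics-tutorial)"


def build_request(url):
    """Wrap the URL in a Request that carries a User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def polite_fetch(url, min_delay=5.0, max_delay=15.0):
    """Sleep for a random interval, then download the page as raw bytes."""
    time.sleep(random.uniform(min_delay, max_delay))
    with urllib.request.urlopen(build_request(url)) as response:
        return response.read()


# Building the request does not touch the network, so this is safe to run
request = build_request("https://www.azlyrics.com/c/coldplay.html")
print(request.get_header("User-agent"))
```

In the loop you would then write songHtmlStr = str(polite_fetch(x)) instead of the two urlopen and read lines. It makes the scrape a lot slower, but you only need to run it once.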

Step 5: Generate the new lyrics

This step is where the magic happens, but it's also the easiest part. Python’s markovify does everything for us! We need to create two variables - one that can store the new generated lyrics, and one string variable that markovify can read from. To create this string, we’re going to read() our lyrics.txt file that we created and filled with lyrics. This time, we’ll open it in read-only mode ('r').

generatedlyrics = ''  # placeholder; replaced by the generated sentence below
with open('lyrics.txt', 'r') as file:
    text = file.read()

We’ll pass our text variable to markovify to generate a markovify model. This is super easy:

import markovify
markovifyTextModel = markovify.Text(text)

Now all we need to do is use that markovify model to generate a sentence. There are hundreds of things you can do with this model, but we’ll be using make_sentence, a markovify method that just predicts one sentence.

generatedlyrics = markovifyTextModel.make_sentence()

Print generatedlyrics and you’ll see some predicted Coldplay lyrics! These can be absolutely hilarious, and sometimes quite morbid… my one in the video is 'I took my son.'
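Two things worth knowing here: make_sentence can return None when it fails to come up with something new, and markovify also has a NewlineText class that treats every line as its own sentence, which may suit lyrics better than Text. Here is a sketch that uses both to build a few lines at once. The helper names (load_model, generate_lines) and the default values are made up for illustration.

```python
def load_model(path="lyrics.txt"):
    """Build a markovify model that treats each line of the file as a sentence."""
    import markovify  # third-party: pip install markovify

    with open(path, "r") as lyrics_file:
        return markovify.NewlineText(lyrics_file.read())


def generate_lines(model, count=4, tries=100):
    """Collect up to `count` generated lines, skipping failed attempts."""
    lines = []
    for _ in range(count * 3):  # bounded, so a tiny corpus cannot loop forever
        line = model.make_sentence(tries=tries)
        if line is not None:  # make_sentence returns None when it gives up
            lines.append(line)
        if len(lines) == count:
            break
    return lines


# Usage, once lyrics.txt exists:
# print("\n".join(generate_lines(load_model())))
```

Raising tries gives markovify more attempts per sentence, which helps when the lyrics file is small.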

Top tip: You may want to put all the steps into different files or functions or comment out all the code before Step 5. You don’t need to create a new lyrics.txt file every time.

Now it’s your turn!

To really make the most out of these tutorials, try to build something on top of this project. Here are some ideas:

Try lots of different artists
Make it into a Twitter bot - I actually created one for Led Zeppelin a while ago, but it's inactive now
Create functions so you don’t have to run the scraper every time
Use filter instead of replace (https://www.tutorialspoint.com/lambda-and-filter-in-python-examples)
Ask the user what artist they would like at the beginning
Make a GUI
Generate more than one sentence - check out what else Markovify can do!
Use the same technique to scrape other websites and create Markov Chains for scripts or speeches

And there you have it! A funny, interesting, impressive project you can show employers. DM me on Twitter or comment here when you've built it and let me see some of your results!

Top comments (5)

ianainslie • Aug 19 '20 • Edited

Hi. I love this and have a few ideas for adding my own twists on it.
However, using Jupyter Notebook, I get as far as the end of Step 3 - with a nice list of URLs. When I go into Step 4, I get hit with "HTTPError: HTTP Error 403: Forbidden".
I take this to mean that the site is not happy with the bot activity. How do I incorporate a User-Agent into the code to circumvent this? I know how to do this with the requests library, but this is my first time with the urllib library and I am struggling a little :)
Would love your input on this, as it's such a cool looking little project that I would love to complete
Thanks, Ian

Cat McGee • Aug 19 '20

Hi Ian, yeah unfortunately this happens. If you look at the GitHub repo, somebody actually merged a PR to help with that.

You can use sleep from Python's time module, and sleep for random intervals while you scrape the lyrics. That way AZLyrics doesn't pick up that you are a bot.

Fortunately, after getting the lyrics you don't have to run it again so you can avoid the ban!

ianainslie • Sep 3 '20 • Edited

Hi Cat, I just wanted to pop by and say thanks again. I decided to make a few changes to your original code:

Rewrote most of it to use BeautifulSoup, as I understand that module better
Dropped the Markov chains, as they were not really working for me. Made more use of the random module
Created a GUI (which in turn led to me learning a tonne of stuff about TKinter)

I am now working on a new project and was just wondering how many projects employers like to see in a portfolio for entry level software engineers?
Also, does blogging about projects help in gaining employment? I have my own site (ianainslie.uk), so that would be easy to initiate if worthwhile.
Lastly, thanks again for the whole 2.20 videos thing - it lit a fire under my arse to make the transition from tutorial watcher to project maker.