DEV Community

Leonardo Holanda


How To Find An Artist's Country of Origin?

In this post, I'm gonna talk more about the approach I used to find an artist's country of origin.

A quick note before we start

For this post, I thought that showing the problems I solved, the solutions I found and the mistakes I made would be more interesting than just showing the code and explaining it. If you want to replicate it, at least you know what not to do. What do you think?

Also, my goal with this series of posts is to share what I learned while developing Cartogrify so I thought it would make more sense.

Anyway, if you just want to see the code, it's near the end.


The Problem

Cartogrify fetches the user's 50 top artists from Spotify or Last.fm APIs. In both of them, you will have an array of objects containing the artists' data which includes their names.

To generate the data visualization, you need to know where the artists come from or know that you couldn't find their countries. The aim is to end up with an array of objects containing the artists' names and the country where they come from or undefined.

Since the country detection algorithm spends the hosting free tier resources, it shouldn't run for an artist that it already encountered before. Because of this, every searched artist needs to be saved in a database for future queries.
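As a rough illustration of that "don't search twice" rule, here is a minimal sketch. It assumes a plain `Map` standing in for the real database, and the names `KnownArtist` and `splitByKnown` are hypothetical, not Cartogrify's actual code. Note that an artist stored with an undefined country still counts as known, since the search already ran for them.

```typescript
// Hypothetical shape: a country name, or undefined when detection failed.
type KnownArtist = { name: string; country: string | undefined };

// Splits the requested names into artists that are already stored and names
// that still need country detection. `stored` stands in for the real database
// and is keyed by the lowercased artist name (an assumption of this sketch).
function splitByKnown(
  names: string[],
  stored: Map<string, KnownArtist>
): { known: KnownArtist[]; unknown: string[] } {
  const known: KnownArtist[] = [];
  const unknown: string[] = [];
  for (const name of names) {
    const hit = stored.get(name.toLowerCase());
    if (hit) known.push(hit);
    else unknown.push(name);
  }
  return { known, unknown };
}
```

Only the `unknown` names would then be sent to the country detection, and their results saved afterwards.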

Also, it's kinda boring to look at a spinner and wait for 20+ artists to have their countries discovered. Since it can take a while, I did this loading screen:


Cartogrify country detection loading screen

This means the artists' countries must be detected and shown one at a time, rather than waiting for all of them to be detected before proceeding.

Where to fetch the data?

I already knew by looking at explr.fm source code that using Last.fm API was an option.

However, I didn't want to follow the devs' approach of extracting country data from the artist's tags since it's an unreliable source. Sometimes there are country tags and sometimes there aren't. So I went searching for alternatives.

While I was searching, I stumbled upon Dr. Markus Schedl's paper "Three web-based heuristics to determine a person's or institution's country of origin". Dr. Schedl's approach relies on using a search engine with a specific query to retrieve top-ranked pages and extract the person's country data from their textual content.

This approach might work well in a research environment but I'm not quite sure about a web environment. The Google Custom Search API limit of 100 search queries for free per day seems heavily restrictive.

However, the article also mentions that other authors use a different approach by fetching data directly from specific websites. This approach is more suitable for web environments due to fewer usage restrictions, which is the reason I chose it in Cartogrify.

You could also just ask ChatGPT. I did some tests and the answers were correct most of the time. But since it isn't free, it isn't an option for me, unfortunately.

Besides Last.fm, I searched for more websites that would contain artists' country data. These were the ones I found:

  • Rate Your Music
  • Discogs
  • MediaWiki API

They ended up being the initial "source pool".

How to fetch the data?

A great thing about some of these websites is that they have public APIs where you can send requests and get data about artists.

There are only two options, then: Send a request to the music website API or go to the artist profile page and do web scraping.

Which source to choose?

Rate Your Music ❌

Rate Your Music doesn't have a public API and it blocked my IP when I sent a request to an artist profile page. So neither of the options is available.

Discogs ❌

Discogs does have a public API but the artist search endpoint response doesn't have country data.

Since they are heavily focused on albums, the only country data available is related to albums. But I suppose that is the country where the album was produced, so it isn't reliable.

Web scraping, according to some forum posts, can also result in an IP ban.

MediaWiki API ❌

While taking a deep look at MediaWiki, which is under the Wikipedia umbrella, I noticed that it contains data for the most famous artists but the underground ones are missing.

Because of the equivalence between the data from Last.fm and MediaWiki for famous artists and Last.fm giving better results for underground ones, I decided to stick with Last.fm.

Last.fm ✅

Last.fm has a public API but the country data may only be available indirectly through tags or wiki text.

They do allow web scraping and the artist's page can be reached using the artist's name. Most famous artists have their country available on their pages which makes web scraping a reliable option.

The First Solution

Given this context, I have chosen the web scraping approach using Last.fm as a source. It seemed like a step up from the explr.fm approach so I dived into it.

Caveats

  • CORS
    Since sending a request from the browser to a Last.fm page triggers a CORS error, the request must be made from the backend, which means using an Edge Function from Supabase.

  • Readable Stream
    In the beginning, I only fetched 20 artists. For me, it wouldn't make sense to invoke the Edge Function 20 times, once for each artist, since it would just spend the free tier resources faster. So I tried to make a single request that returns the data for all 20 artists. This is achievable using a Readable Stream.

How does it work?

Here's the idea:

  1. Invoke the Edge Function sending the artists' names array as the request's body
  2. In the Edge Function, fetch the Last.fm profile page for each artist. Return each HTML page in the response stream
  3. In the frontend, read the response stream and concatenate its chunks to an accumulator string
  4. When a chunk is concatenated, check if the accumulator string contains a full artist HTML page. If yes, extract the page from the accumulator string and apply web scraping.
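Steps 3 and 4 can be sketched roughly as follows. This is a minimal sketch, not the code used in Cartogrify: it assumes each page is wrapped in start and end markers, and the function name `drainMessages` and the `<<START>>`/`<<END>>` markers in the usage note are hypothetical.

```typescript
// Pulls every complete message out of `buffer`. A message is whatever sits
// between `start` and `end`. Anything after the last complete message is
// returned as `rest`, so the next chunk can be appended to it.
function drainMessages(
  buffer: string,
  start: string,
  end: string
): { messages: string[]; rest: string } {
  const messages: string[] = [];
  let rest = buffer;
  while (true) {
    const startIndex = rest.indexOf(start);
    if (startIndex === -1) break;
    const endIndex = rest.indexOf(end, startIndex + start.length);
    if (endIndex === -1) break; // message is still incomplete, wait for more chunks
    messages.push(rest.slice(startIndex + start.length, endIndex));
    rest = rest.slice(endIndex + end.length);
  }
  return { messages, rest };
}
```

After each chunk, you would call something like `drainMessages(accumulator + chunk, "<<START>>", "<<END>>")`, scrape every returned message, and keep `rest` as the new accumulator. Looping until no complete message is left matters because a single chunk can carry more than one page.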

How the web scraping works:

  1. Search for the tags whose content you know contains country data
  2. Extract their content as strings
  3. Search for country names in each string
  4. The country with the most matches is associated with the artist

Where do you get the countries' names?

I was already using an amazing map dataset called Natural Earth to generate Cartogrify's world map. Since it already contains the countries' names, it was an easy choice to use it.

Problems

  • String comparison
  • Tags can be misleading
  • Edge Functions CPU time limit

String comparison

Lots of users were complaining that some artists were being associated with strange countries. The ones that attracted more attention were:

  • Michael Jackson was from India.
  • Every folk artist was from Norfolk Island.
  • Lots of artists were from Saint Barthélemy, Caribbean. Lil Peep, for example.
  • Artists from Georgia, USA were from Georgia, a country from Europe/Asia.
  • Artists from New Jersey, USA were from Jersey, Channel Islands.
  • An American artist named Neon Indian was from... India.
  • Gilberto Gil, a fantastic Brazilian artist born in the city of Salvador, was from El Salvador.

It's kinda funny, though. Unacceptable but really funny.

Why?

There are two approaches I used to compare strings. The exact match and the substring match.

I started with the exact match because it's the standard logic, right? If it says "Djavan is an artist from Brazil" you split the string by the whitespaces, match "Brazil" with "Brazil" and that's it.

But it turns out that I was associating lots of artists with plenty of country data with an undefined country. This would happen because "Brazil," or "Brazilian" doesn't match with "Brazil", for example. (Examples are in sentence case but were converted to lowercase before comparison)

I thought that using the substring match would loosen the match criteria and therefore give better results. Oh, boy. What would happen is that a lot of undesired matches would occur.

For example, "india" is a substring of "Michael Jackson is an artist from Indiana". It's also a substring of "Neon Indian". India ended up being the country with the most matches, so these artists were associated with it.
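To make the difference concrete, here is a small sketch of both comparisons. The two-country list and the function names are made up for illustration; the real list came from Natural Earth.

```typescript
// Illustrative country list, already lowercased.
const sampleCountries = ["india", "brazil"];

// Substring match: the country name may appear anywhere in the text.
function substringMatches(text: string): string[] {
  const lower = text.toLowerCase();
  return sampleCountries.filter((country) => lower.indexOf(country) !== -1);
}

// Exact match: split on whitespace and compare whole tokens only.
function exactMatches(text: string): string[] {
  const tokens = text.toLowerCase().split(/\s+/);
  return sampleCountries.filter((country) => tokens.indexOf(country) !== -1);
}

// Too loose: "indiana" contains "india".
console.log(substringMatches("Michael Jackson is an artist from Indiana")); // [ 'india' ]
// Too strict: "brazilian" is not "brazil".
console.log(exactMatches("Djavan is a Brazilian artist")); // []
// Works only when the bare country name shows up as its own word.
console.log(exactMatches("Djavan is an artist from Brazil")); // [ 'brazil' ]
```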

Solution

Use the exact match combined with a demonym's exact match. A demonym, according to Google, is "a noun used to denote the natives or inhabitants of a particular country, state, city, etc". Like Brazilian, American, Italian, etc.

Lots of times the wiki text would contain something like "Luiz Gonzaga do Nascimento (Exu, Pernambuco, December 13, 1912 — Recife, Pernambuco, August 2, 1989) was a prominent Brazilian folk singer, songwriter, musician and poet."

There ain't no "Brazil" string but a "Brazilian" one. Without demonyms, Luiz Gonzaga would be associated with an undefined country. With demonyms, he is associated with Brazil. And no substring match problems.

I got the demonyms list from this Wikipedia page.
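A minimal sketch of the idea, assuming a tiny hand-written lookup instead of the full Natural Earth and Wikipedia lists. The names `termToCountry` and `matchCountries` are hypothetical, and stripping punctuation from the token edges is an extra assumption of this sketch. It also only handles single-word names, so something like "El Salvador" would need additional handling.

```typescript
// Tiny illustrative lookup: country names and demonyms both map to a country.
const termToCountry = new Map<string, string>([
  ["brazil", "Brazil"],
  ["brazilian", "Brazil"],
  ["india", "India"],
  ["indian", "India"],
]);

// Lowercases the text, strips punctuation from the token edges, and returns
// the countries whose name or demonym appears as a whole token.
function matchCountries(text: string): string[] {
  const found: string[] = [];
  for (const raw of text.toLowerCase().split(/\s+/)) {
    const token = raw.replace(/^[^a-z]+|[^a-z]+$/g, "");
    const country = termToCountry.get(token);
    if (country !== undefined && found.indexOf(country) === -1) {
      found.push(country);
    }
  }
  return found;
}

console.log(matchCountries("was a prominent Brazilian folk singer, songwriter, musician and poet.")); // [ 'Brazil' ]
console.log(matchCountries("Michael Jackson is an artist from Indiana")); // []
```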

The idea of using something like Levenshtein Distance instead of an exact match also crossed my mind but I just tried to keep it simple.

Tags can be misleading

There are only 3 tags you need to search for content in a Last.fm artist page.

  • Metadata tag
  • Wiki tag
  • Tags tag


Gilberto Gil profile page in Last.fm

The metadata tag is the most valuable because it may contain the exact country data.

The tags tag may not contain country data at all. But when it does, it can be a demonym, the country name in its own language or things like that.

The wiki tag is less valuable because it can contain country data that isn't associated with the country the artist was born in. For example, "In the 1970s, Gil added new elements of African and North American music to his already broad palette [...]". As "African" and "American" are respectively demonyms for Africa and the USA, they would count as matches for Africa and the USA.

Solution

Instead of counting country matches, use a point approach. Each tag will have a point weight associated with its value. Metadata tags have 5 points, tags tags have 3 points and wiki tags have 1 point.

When matches occur, the country receives the points according to the tag where the match occurred. The country with the most points wins.
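Here is a minimal sketch of that scoring, using the weights described above. The type and function names are hypothetical, and how ties are broken (the first country to reach the top score wins) is an arbitrary choice of this sketch, not something the approach prescribes.

```typescript
type TagKind = "metadata" | "tags" | "wiki";

// Weights as described above: metadata 5, tags 3, wiki 1.
const TAG_WEIGHTS: Record<TagKind, number> = { metadata: 5, tags: 3, wiki: 1 };

type CountryMatch = { country: string; tag: TagKind };

// Adds up the points per country and returns the highest scorer,
// or undefined when there were no matches at all.
function pickCountry(matches: CountryMatch[]): string | undefined {
  const points = new Map<string, number>();
  let best: string | undefined;
  let bestPoints = 0;
  for (const match of matches) {
    const total = (points.get(match.country) || 0) + TAG_WEIGHTS[match.tag];
    points.set(match.country, total);
    if (total > bestPoints) {
      best = match.country;
      bestPoints = total;
    }
  }
  return best;
}
```

With this, a single metadata match for Brazil (5 points) outweighs a few stray wiki mentions of Africa and the USA (1 point each), which is the Gilberto Gil situation.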

Edge Function CPU time limit

The Edge Functions were working fine when I was fetching only 20 artists from Spotify and Last.fm APIs. But when I increased the number, this error started to appear.

CPU time limit reached. isolate: 16597602940236451129
CPU time used: 560ms
hyper::Error(User(Body), hyper::Error(Body, Custom { kind: UnexpectedEof, error: "unexpected EOF during chunk size line" }))

What was annoying me was that this error would sometimes appear and sometimes not. I was aware the number of artists was causing it but I couldn't find why this intermittent behaviour was happening.

I thought that the root of the problem was the unexpected EOF message but after posting a question on Stack Overflow some answers made me realize that it was just the time limit.

Since I was using a timeout of 1s between each request to avoid being blocked by Last.fm, 30 artists means at least 30s, 35 means 35s and so on.

Solution

Migrate the country detection code to AWS Lambda, which can run for up to 15 minutes. Since a user with 50 unknown artists takes roughly 1 minute, it's ok.

The Second Solution

Do you remember the "source pool"? Well, there was an option that didn't appear there. It was the MusicBrainz API.

When I found MusicBrainz API for the first time, I had already decided to follow the Last.fm approach. For that reason, I didn't further explore their API and wasn't aware that there was a resource named Area which holds the artist's location data. That would have solved my problems.

Fortunately, the same lovely person in the Last.fm Discord gave me this hint about the Area resource. Then, I decided to shift the approach to using MusicBrainz API as its main source.

That one goes to the "What I learned" section. That was my biggest mistake in the process of finding the solution to this problem. It cost me so much time.

How does it work?

  1. Invoke the AWS Lambda function passing the artists' names array as the request body
  2. In the Lambda function, request the MusicBrainz API for the data of each artist.
  3. Return the data in the response stream.
  4. Extract the country from the data.

Here's the Lambda function code:

const https = require('https');

// Waits 1 second before each request, then searches MusicBrainz for the
// artist and resolves with the raw JSON response body as a string.
function getArtistData(artistName) {
  return new Promise((resolve, reject) => {
    setTimeout(() => {
      const req = https.get({
        hostname: 'musicbrainz.org',
        path: `/ws/2/artist/?query=artist:${encodeURIComponent(artistName)}&fmt=json&limit=100`,
        headers: {
          // Redacted in the original post. MusicBrainz requires a meaningful User-Agent.
          'User-Agent': ###########
        }
      }, (res) => {
        let body = '';
        res.on('data', (chunk) => body += chunk);
        res.on('end', () => resolve(body));
      });

      // https.get() ends the request automatically, so no req.end() is needed
      req.on('error', reject);
    }, 1000);
  });
}

exports.handler = awslambda.streamifyResponse(async (event, responseStream, _context) => {
  responseStream.setContentType("text/event-stream");

  // Artist names arrive in the request body joined by "###"
  const artistsName = (event.body || "").split("###").filter(Boolean);
  for (const artistName of artistsName) {
    try {
      // Fetch first, so the markers are only written around a complete payload
      const data = await getArtistData(artistName);
      responseStream.write("START_OF_JSON");
      responseStream.write(JSON.stringify({ name: artistName, data }));
      responseStream.write("END_OF_JSON");
    } catch (e) {
      console.log(artistName);
      console.error(e);
      // write() expects a string or Buffer, so send the error text, not the Error object
      responseStream.write(String(e));
    }
  }

  responseStream.end();
});

And here's the code that runs on the Angular app.

 findArtistsCountryOfOrigin(artists: Artist[]): Observable<ScrapedArtist> {
    const artists$ = new Subject<ScrapedArtist>();

    const artistsNames = artists.map((artist) => artist.name);
    fetch(environment.PAGE_FINDER_URL, {
      method: "POST",
      body: artistsNames.join("###"),
    })
      .then(async (response) => {
        const streamReader = response.body?.getReader();
        if (!streamReader) return;

        const textDecoder = new TextDecoder();
        let streamAccumulatedContent = "";

        while (true) {
          const { value, done } = await streamReader.read();

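          // { stream: true } keeps multi-byte characters that are split across chunks intact
          streamAccumulatedContent += textDecoder.decode(value, { stream: true });
          // A single chunk may carry more than one complete artist, so loop instead of checking once
          while (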
            streamAccumulatedContent.includes("START_OF_JSON") &&
            streamAccumulatedContent.includes("END_OF_JSON")
          ) {
            const startIndex =
              streamAccumulatedContent.indexOf("START_OF_JSON") + this.START_INDICATOR_OFFSET;
            const endIndex = streamAccumulatedContent.indexOf("END_OF_JSON");

            const rawArtistData: RawMusicBrainzArtistData = JSON.parse(
              streamAccumulatedContent.slice(startIndex, endIndex)
            );

            streamAccumulatedContent = streamAccumulatedContent.slice(
              endIndex + this.END_INDICATOR_OFFSET
            );

            const artistData = {
              name: rawArtistData.name,
              artistDataFromMusicBrainz: this.musicBrainzService.getArtistData(rawArtistData),
            };

            const { country, secondaryLocation } =
              this.musicBrainzService.getArtistLocation(artistData);

            if (country == undefined && secondaryLocation != undefined) {
              this.countryService
                .findCountryBySecondaryLocation(secondaryLocation)
                .pipe(
                  switchMap((countryFromSecondaryLocation) => {
                    if (countryFromSecondaryLocation.NE_ID == -1)
                      return this.lastFmService.getLastFmArtistCountry(artistData.name);
                    return of(countryFromSecondaryLocation);
                  })
                )
                .subscribe({
                  next: (country) => {
                    artists$.next({
                      name: artistData.name,
                      country: country,
                      secondaryLocation,
                    });
                  },
                  error: () => {
                    artists$.next({
                      name: artistData.name,
                      country: undefined,
                      secondaryLocation: undefined,
                    });
                  },
                });
            } else if (country == undefined && secondaryLocation == undefined) {
              this.lastFmService.getLastFmArtistCountry(artistData.name).subscribe({
                next: (country) => {
                  artists$.next({
                    name: artistData.name,
                    country: country,
                    secondaryLocation,
                  });
                },
                error: () => {
                  artists$.next({
                    name: artistData.name,
                    country: undefined,
                    secondaryLocation: undefined,
                  });
                },
              });
            } else {
              artists$.next({
                name: artistData.name,
                country,
                secondaryLocation,
              });
            }
          }
          if (done) break;
        }
      })
      .catch((err) => {
        console.log(err);
      });

    return artists$.asObservable();
  }

I will refactor it to make it more presentable. I'm in the "make it work" phase.

Problems

  • Missing country data
  • Matching artists' names
  • Missing Last.fm underground artists

Missing country data

Sometimes, MusicBrainz only knows the city or state that an artist comes from. Since we need the country, the full location is necessary.


Thee Sacred Souls data from MusicBrainz API

The solution that I found is to use the Free Geocoding API. You send a request with an address and it returns the full location. Then, it's a matter of finding the country name through string comparison.
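The string comparison step could look roughly like this. It is a minimal sketch that assumes the geocoder returns the full location as one comma-separated string, such as "San Diego, San Diego County, California, United States". The function name is hypothetical, and the country names passed in must be spelled the same way as in the geocoder's output, which is an assumption you would need to check against your own country list.

```typescript
// Reads the comma-separated parts from right to left and returns the first
// one that equals a known country name, or undefined if none does.
function countryFromFullLocation(
  fullLocation: string,
  knownCountries: string[]
): string | undefined {
  const parts = fullLocation.split(",").map((part) => part.trim().toLowerCase());
  for (let i = parts.length - 1; i >= 0; i--) {
    const match = knownCountries.find((c) => c.toLowerCase() === parts[i]);
    if (match !== undefined) return match;
  }
  return undefined;
}
```

Reading from the right is a deliberate choice here: the country usually comes last, so "Atlanta, Fulton County, Georgia, United States" resolves to the United States before the state of Georgia gets a chance to match the country of Georgia.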


Free Geocoding API response to San Diego query

Matching artists' names

One example that I ran into involved Racionais MC's.

When you fetch data from an artist in MusicBrainz, it returns a list with the most likely artists to match the artist name you provided sorted by a "likelihood" score.

At first, I always took the first one because it had a higher chance of being the artist I wanted. But I noticed that that's not always true.

When searching for Tyler, The Creator, for example, he's not the first on the list. Some artists named only as "The Creator" appear first.


MusicBrainz API response to "Tyler, The Creator" query

To fix this, I tried to exactly match the artist name for the first 100 artists and it worked. But then I got the Racionais MC's problem.



The "Racionais MC's" string I got from Last.fm is actually different from the "Racionais MC's" string I got from MusicBrainz, even though they look the same. That's because the ' character is encoded differently in each source, so the strings never match.

In this scenario, I put a fallback that returns the first artist if there's no match in the first 100 artists. That seems to be working for now.
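The exact-match-then-fallback logic can be sketched like this. The `Candidate` type and function names are hypothetical. The `normalizeName` step is not part of the solution described here: it is one possible addition, and it assumes the mismatch comes from a typographic apostrophe (such as U+2019) versus a plain one (U+0027), which I haven't confirmed.

```typescript
// Hypothetical, simplified shape of one entry in the search response.
type Candidate = { name: string; score: number };

// Maps a few apostrophe look-alikes to a plain one and lowercases, so that
// names differing only in apostrophe style compare as equal.
function normalizeName(name: string): string {
  return name.replace(/[\u2018\u2019\u02BC\u0060\u00B4]/g, "'").trim().toLowerCase();
}

// Returns the first candidate whose name matches after normalization,
// falling back to the top-ranked candidate when nothing matches.
function pickArtist(query: string, candidates: Candidate[]): Candidate | undefined {
  const wanted = normalizeName(query);
  const exact = candidates.find((c) => normalizeName(c.name) === wanted);
  return exact !== undefined ? exact : candidates[0];
}
```

For "Tyler, The Creator", this skips any higher-ranked "The Creator" entries and picks the exact name. When no name matches at all, it behaves like the original "take the first one" approach.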

Missing Last.fm underground artists

MusicBrainz has data from a lot of artists but the Last.fm underground artists are a special kind of artist whose data only seems to exist in Last.fm.

When I shifted the approach to use MusicBrainz, I stopped using Last.fm. But due to this missing artist problem, I decided to use Last.fm as a fallback in case MusicBrainz doesn't have the data.

The Final Solution

Finally, we reached the final solution. As I said, it's the MusicBrainz API solution + Last.fm API as a fallback to fix the missing artists problem.

The only difference is that I'm no longer doing web scraping with the Last.fm artist profile page. I received some advice from the same lovely person in the Last.fm Discord to move away from that approach. It's kinda like a "good neighbor" policy.

What changes is that the data from Last.fm now comes from their API and only the artists' wiki text and tags are available. The techniques to extract content and match countries' names stay the same. This is the approach that I've been using for more than a month now.

I haven't seen any significant complaints from the users anymore. Actually, there were some compliments that were very nice from users who saw artists being assigned to the wrong countries before.

Here and there artists are still being assigned to undefined countries and sometimes wrong countries. However, I found that the ones that got undefined usually don't have enough data available to determine their countries.

The already implemented suggestions feature is doing a great job of reassigning the correct countries to the artists that this solution got wrong and providing countries for the ones that don't have enough data.

What I learned from all of this

  • Share early versions of your project with your users and other developers! Valuable advice may come from some lovely people out there.
  • Do not dismiss a development path without exploring it properly.
  • Even if you put a lot of effort into improving something, it may fail in cases you couldn't even imagine. Always have a fallback when this happens.
  • The user's point of view is always the most important one. It doesn't matter that the artist's country of origin was found if the user closes the tab because it took too long.

That's it! I hope you learned something or that this will help you with any problem that you are facing. Tell me in the comments what you think about the solutions. Suggestions and feedback are very welcome!
