Building a Colors API Through Web Scraping, pt. 1

CarlyRaeJepsenStan ・6 min read

For this weekend's first backend project, I thought I'd make a useful API. From one of my previous articles, you can see that my simple site accepts hex codes - what if it could accept color names like darkorchid or peru?

My immediate thought was a colors API! I did some searching, but no luck. What I did find, though, was this URL:
http://www.html-color-names.com/darkorchid.php

Oooh 👀. Do you see that? There's darkorchid in the URL! I fiddled with the URL, and it turns out these were all valid URLs:

  • http://www.html-color-names.com/powderblue.php
  • http://www.html-color-names.com/thistle.php
  • http://www.html-color-names.com/lightseagreen.php

But I wanted a site that could look up both color names AND hex codes. Turns out I was asking for too much. Still, my favorite site, colorhexa, has hex codes in its URL, like https://www.colorhexa.com/8acaff

Great! Looks like I can make my own API now. Here's my basic structure:

  • If input is a hex code, send to colorhexa
  • If input is a name, send to html-color-names
  • Scrape sites for the color info (hex code, rgb, etc)
  • Return in an object, like
{
  input: input,
  hex: hexcode,
  rgb: rgb(r,g,b),
  name: color name (if exists)
}
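
The first two steps of that structure could be sketched roughly like this. The helper name sourceUrlFor and the hex-matching regex are assumptions made up for illustration (the regex only accepts six hex digits with an optional "#"); only the two URL patterns come from the sites themselves.

```javascript
// Hypothetical helper: pick which site to scrape for a given input.
// Assumption: a hex code is exactly six hex digits, optionally prefixed with "#".
const HEX_PATTERN = /^#?[0-9a-f]{6}$/i

function sourceUrlFor(input) {
  if (HEX_PATTERN.test(input)) {
    // colorhexa URLs use the bare hex code, without the "#"
    const hex = input.replace("#", "").toLowerCase()
    return "https://www.colorhexa.com/" + hex
  }
  // anything else is treated as a color name
  return "http://www.html-color-names.com/" + input.toLowerCase() + ".php"
}

console.log(sourceUrlFor("#8ACAFF"))    // https://www.colorhexa.com/8acaff
console.log(sourceUrlFor("darkorchid")) // http://www.html-color-names.com/darkorchid.php
```

This sketch does no validation of color names, so a misspelled name would still produce a URL that may not exist.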

I have some previous experience with web scraping - I've used tools like axios and cheerio before.
Axios is an HTTP client that downloads the entire HTML page for you, and cheerio is like jQuery for the backend - it lets you select and manipulate elements with CSS selectors like .class and #id.

Anyway, before we start building the Node app, let's check out the HTML code behind these two sites.

On html-color-names, I ran into an issue:
[Screenshot]

Hmmm.... I assumed that the RGB, hex code and name would have different IDs.

Anyways, I wrote a few lines to make sure the only elements with class="color-data light" were the ones with the color info.

[Screenshot]

Whew 😅. I'll work this into my API later.

Next is colorhexa. As it turns out, their site organization is much more complex than html-color-names:
[Screenshot]

A quick read seems to show that all the numbers are inside of class="value" elements...
[Screenshot]

Anyways, one good thing is that the colorhexa data pattern is consistent - arr[1] is always the RGB value.

Now, on to the backend coding. We can use axios to fetch the site, and then use cheerio to parse the HTML and get what we want.

So first, we're going to do something like this:

//import our packages
const axios = require("axios")
const cheerio = require("cheerio")

//load html-color-names
axios.get("http://www.html-color-names.com/darkorchid.php")
    .then(html => {
        console.log(html)
})

We can't start parsing our HTML yet - this is the axios output:
[Screenshot]

Whoaaa, what? I expected a plain-text response, but it's actually an object! We need to look for the HTML part, which is actually html.data.

axios.get("http://www.html-color-names.com/darkorchid.php")
    .then(html => {
        console.log(html.data)
})

[Screenshot]

Cool! Now we get HTML plain text. Now, we're going to look at this through our cheerio-tinted lenses:

//import our packages
const axios = require("axios")
const cheerio = require("cheerio")
const data = []

//load html-color-names
axios.get("http://www.html-color-names.com/darkorchid.php")
    .then(html => {
        //console.log(html.data)
        const $ = cheerio.load(html.data)
        var arr = $(".color-data")
        console.log(arr)
})

However, I hit a snag - arr is a large Cheerio object, not a list of strings. I tried arr.text(), but it returned one long concatenated string. arr.toArray() gave me an array, but of element nodes rather than their text. What to do?

After that I tried:

  • .text(): returns the names, but with tons of whitespace.
  • .data(): returns the text of only the first element.
  • this link - I rewrote it as an arrow function, but it just returned ["","",""].

Finally, I tried

//collect the text of each matched element
const a = []
arr.each(function(i, elem) {
  a[i] = $(this).text();
});

Whoa! It worked! I got this as the output:

[
  '\n\t\tDarkOrchid\n\t\t',
  '\n\t\t#9932cc\n\t\t',
  '\n\t\t\trgb(153, 50, 204)\n\t\t'
]

The arrow function version failed because arrow functions don't get their own this, so $(this) never pointed at the current element. And because we're smart developers who don't copy and paste, let's figure out how it works:

  • $(this) -> The current element of the loop, wrapped in a Cheerio object
  • i -> The zero-based index of the current element. Every time the callback runs, i is increased by 1. Basically, it's a compressed for loop.
  • elem -> The current element again, passed as an argument. I deleted it and the code still works, BUT you can also write $(elem) instead of $(this). It works the same way (and it also works inside arrow functions), so I'll use it that way in the future.
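
To see why the arrow function version came back empty, here is a small sketch that runs without Cheerio. The each function below is a hypothetical stand-in that only imitates the calling convention (callback invoked with this set to the current item, and the index and item passed as arguments); it is not Cheerio's actual code, and the plain objects stand in for real elements.

```javascript
// Hypothetical stand-in for Cheerio's .each(): it calls the callback with
// `this` bound to the current item and passes (index, item) as arguments.
function each(items, callback) {
  items.forEach((item, i) => callback.call(item, i, item))
}

const elements = [{ text: "DarkOrchid" }, { text: "#9932cc" }]

// A regular function receives the current item as `this`.
const viaThis = []
each(elements, function (i, elem) {
  viaThis[i] = this.text
})

// An arrow function ignores the bound `this`, so use the second argument.
const viaElem = []
each(elements, (i, elem) => {
  viaElem[i] = elem.text
})

// Inside an arrow function, `this` is not the current item.
const arrowThisIsItem = []
each(elements, (i, elem) => {
  arrowThisIsItem[i] = this === elem
})

console.log(viaThis)         // [ 'DarkOrchid', '#9932cc' ]
console.log(viaElem)         // [ 'DarkOrchid', '#9932cc' ]
console.log(arrowThisIsItem) // [ false, false ]
```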

Getting info from colorhexa is going to be pretty similar; I selected .value, iterated over it, and then got out a result.

[
  'd8bfd8',
  '216, 191, 216',
  '84.7, 74.9, 84.7',
  '0, 12, 0, 15',
  '300°, 24.3, 79.8',
  '300°, 11.6, 84.7',
  'cccccc',
  '80.078, 13.217, -9.237',
  '59.342, 56.819, 72.803',
  '0.314, 0.301, 56.819',
  '80.078, 16.125, 325.053',
  '80.078, 12.716, -16.458',
  '75.378, 8.614, -4.499',
  '11011000, 10111111, 11011000'
]

The second position [1] is the RGB value.

Yay! Looks like I can get the values now.

Ok, now we need to parse our results. As you remember from the html-color-names array, each string in the array is like
'\n\t\tDarkOrchid\n\t\t'
What I'm going to do (and what would be smart) would be to write a quick regex matching \n or \t, strip the matches out, and then use .trim() to remove any whitespace that's left.

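But, because each string has the same whitespace in front and behind (\n\t\t, which is three characters, since each escape counts as one), I could also use str.slice(3, str.length - 3). That would be the fast and hacky way to do it, although it would leave a stray tab on the RGB string, which starts with \n\t\t\t.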

But I have all weekend! I'll write a regex. At first, I tried writing \\n and \\t, but as it turns out, \n and \t are valid escape sequences in a regex - so my code became like this:

var str = "\n\t\tDarkOrchid\n\t\t"
var regex = /(\n|\t)+/g
//match every run of \n or \t characters (the g flag catches both ends)
str.replace(regex, "").trim()
//remove the matched characters, then trim any whitespace that is left
>>> "DarkOrchid"

Anyway, I did end up spending close to an hour debugging this, but my code was much cleaner and simpler.
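
As a rough sketch, the same cleanup can be applied to all three scraped strings at once with .map(). The helper name clean is made up for this example, and the input array is copied from the html-color-names output shown earlier.

```javascript
// The raw strings scraped from html-color-names (copied from the earlier output).
const raw = [
  "\n\t\tDarkOrchid\n\t\t",
  "\n\t\t#9932cc\n\t\t",
  "\n\t\t\trgb(153, 50, 204)\n\t\t"
]

// Hypothetical helper: drop every newline and tab, then trim leftover spaces.
function clean(str) {
  return str.replace(/[\n\t]+/g, "").trim()
}

const cleaned = raw.map(clean)
console.log(cleaned)
// [ 'DarkOrchid', '#9932cc', 'rgb(153, 50, 204)' ]
```

The spaces inside rgb(153, 50, 204) survive because the regex only targets newlines and tabs.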

As for colorhexa, I really only wanted the RGB, so:
a[1].
Yes, that's it. The RGB values are in the second slot in the array, and arrays start from 0, so it's index 1. Perfect! We can get the color info from the URLs.
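
One thing to watch: colorhexa gives the RGB value as bare numbers, while html-color-names wraps it in rgb(...). A minimal sketch for making the two match, assuming we want the rgb(r, g, b) form everywhere (toRgbString is a hypothetical helper, and the array holds just the first few entries from the colorhexa output above):

```javascript
// First entries of the array scraped from colorhexa (copied from the earlier output).
const a = ["d8bfd8", "216, 191, 216", "84.7, 74.9, 84.7"]

// Hypothetical helper: turn "216, 191, 216" into "rgb(216, 191, 216)".
function toRgbString(value) {
  const [r, g, b] = value.split(",").map(part => Number(part.trim()))
  return `rgb(${r}, ${g}, ${b})`
}

console.log(toRgbString(a[1])) // rgb(216, 191, 216)
```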

Now, let's get on with the actual API part.

Serving the API

First, you should be familiar with NodeJS and Express and stuff like this:

const express = require("express")
const app = express()

app.get("/" , (req, res) => {
  res.send("Hello world!")
})

In the above, the path is "/" - I'll be talking about paths a lot later on.

So right now, my workflow looks like this:

  • Get the color
  • If the color starts with "#", send it to colorhexa.
  • If it doesn't, send it to html-color-names.
  • Respond with an object with the desired information.

At this point, I ran into another roadblock. Variables declared inside of the axios.get() callbacks are local to those callbacks and only get their values once the request finishes, so they can't be used outside - so I have to put both of them inside the same path.
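
One common way around this kind of roadblock is to return the promise from a function and await it, instead of reading a variable that was set inside .then(). The sketch below is only an illustration: fakeGet is a made-up stand-in for axios.get() so the example runs without a network connection, and fetchHtml and main are hypothetical names.

```javascript
// Hypothetical stand-in for axios.get(): resolves later with a { data } object.
function fakeGet(url) {
  return new Promise(resolve => {
    setTimeout(() => resolve({ data: "<html>" + url + "</html>" }), 10)
  })
}

// Returning the promise lets the caller receive the value,
// instead of trying to read a variable set inside .then().
function fetchHtml(url) {
  return fakeGet(url).then(response => response.data)
}

async function main() {
  const html = await fetchHtml("http://www.html-color-names.com/darkorchid.php")
  console.log(html.length > 0) // true
  return html
}

const pending = main()
```

With the real library, fakeGet(url) would presumably be swapped for axios.get(url), since both resolve to an object with a data property.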

So, I came up with this:

  • Get the input on the path /input/:color.
  • If :color starts with "#" or matches my color regex from my earlier article, remove the "#" and send to /hex/:color.
  • If :color is a color name, send to /string/:color

And for those who aren't familiar with the colons, using things like :color allows you to get part of the path using req.params.color.

Basically, I would pass the color parameter into the functions I wrote previously, and then send the results.
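
The dispatch step could look something like the sketch below. The helper name routeFor and the six-digit hex regex are assumptions for illustration; the regex from the earlier article may well be different.

```javascript
// Assumed pattern: six hex digits, optionally prefixed with "#".
const HEX_ROUTE_PATTERN = /^#?[0-9a-f]{6}$/i

// Hypothetical helper: decide where /input/:color should forward to.
function routeFor(color) {
  if (HEX_ROUTE_PATTERN.test(color)) {
    // drop the "#" so the path stays a plain hex code
    return "/hex/" + color.replace("#", "")
  }
  return "/string/" + color
}

console.log(routeFor("#8acaff"))    // /hex/8acaff
console.log(routeFor("8acaff"))     // /hex/8acaff
console.log(routeFor("darkorchid")) // /string/darkorchid
```

Inside an Express handler, this might be used as res.redirect(routeFor(req.params.color)). Keep in mind that a literal "#" in a URL starts the fragment and is not sent to the server, so a client would have to encode it as %23 for the first case to ever happen.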

That wraps up the first part of my API! Check out the next article to see how I split up the code and set up the validation.

CarlyRaeJepsenStan

@carlyraejepsenstan

Hey there! I'm Connor, and I like writing long, complicated guides that detail every minute problem and feature I've encountered.
