
Daily Challenge #89 - Extract domain name from URL

dev.to staff (thepracticaldev) ・1 min read

Write a function that, when given a URL as a string, returns only the domain name as a string.

domainName("https://twitter.com/explore") == "twitter"
domainName("https://github.com/thepracticaldev/dev.to") == "github"
domainName("https://www.youtube.com") == "youtube"

Good luck!


Want to propose a challenge idea for a future post? Email yo+challenge@dev.to with your suggestions!

Discussion


Javascript!

function domainName(domain) {
  const a = document.createElement('a');
  a.href = domain;
  const { hostname } = a;
  const hostSplit = hostname.split('.');
  hostSplit.pop();
  if (hostSplit.length > 1) {
    hostSplit.shift();
  }
  // join() defaults to a comma separator; use '.' so multi-label results stay valid
  return hostSplit.join('.');
}

domainName('https://twitter.com/explore') == "twitter"
domainName('https://github.com/thepracticaldev/dev.to') == "github"
domainName('https://www.youtube.com') == "youtube"
 

Can you please explain what

const { hostname } = a;

does and how?

 

It has the same effect as:
const hostname = a.hostname

 

That's called destructuring assignment. "a" is an object (here, the anchor element) that has a hostname property, so the assignment extracts hostname into a variable of the same name.
I'm just writing about this 😅. Hopefully that'll help you.
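
To make the equivalence concrete, here is a minimal sketch that uses a plain object in place of the anchor element. The object `link` and its values are made up for illustration; the only point is that both forms read the same property.

```javascript
// `link` is a hypothetical stand-in for the anchor element `a` in the solution above.
const link = { hostname: "www.youtube.com", protocol: "https:" };

// Destructuring: pull the `hostname` property into a variable of the same name.
const { hostname } = link;

// The long-hand version reads the same property directly.
const sameHostname = link.hostname;

console.log(hostname === sameHostname); // true
```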

 

Actually the domain name is the whole thing - "twitter.com", "github.com", "youtube.com". What do you want to get for something like "a.b.c.ac.il"?

 

Even "com" or whatever other TLD is a domain name. The most specific domain name in "www dot youtube dot com" is "www".

(edited because dev dot to removed the www from my youtube URL.)
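
A small sketch of that point, assuming we simply split the hostname on dots: every trailing run of labels is itself a domain name. The helper name `domainHierarchy` is invented for this example.

```javascript
// List every domain name contained in a hostname, from the TLD down to the full name.
// `domainHierarchy` is a hypothetical helper, not part of the challenge.
const domainHierarchy = hostname => {
  const labels = hostname.split(".");
  return labels.map((_, i) => labels.slice(labels.length - 1 - i).join("."));
};

console.log(domainHierarchy("www.youtube.com"));
// [ 'com', 'youtube.com', 'www.youtube.com' ]
```

Which of these the challenge calls "the domain name" is a convention, which is why the replies here disagree.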

 

JavaScript one-liner

const domainName = url => url.replace(/https?:\/\/(?:www\.)?/, "").split(".")[0]
 

Haskell

Note: This assumes the string always starts with http://, https://, or no protocol at all; other protocols are not supported (out of laziness).

Note2: This won't work for the URIs that have a dot in their name, still working on it.

Note3: Working all good but the end is a mess haha...

import Control.Arrow
import Data.List (isPrefixOf)

removeProtocol :: String -> String
removeProtocol "" = ""
removeProtocol string@(firstCharacter : rest)
    | isPrefixOf https string = removeProtocol (drop (length https) string)
    | isPrefixOf http string = removeProtocol (drop (length http) string)
    | otherwise = firstCharacter : removeProtocol rest
    where
        https = "https://"
        http = "http://"

countDots :: String -> Int
countDots =
    filter (== '.') >>> length

dropUntilWithDot :: String -> String
dropUntilWithDot =
    dropWhile (/= '.') >>> drop 1 

domainName :: String -> String
domainName url
    -- Count dots in the host only, so a dot in the path cannot skew the result
    | hostDots <= 1 = takeWhile (/= '.') host
    | otherwise = takeWhile (/= '.') $ iterate dropUntilWithDot host !! (hostDots - 1)
    where
        host = takeWhile (/= '/') (removeProtocol url)
        hostDots = countDots host

main :: IO ()
main = do
    print $ domainName "domain.com"                                 -- domain
    print $ domainName "http://domain.com"                          -- domain
    print $ domainName "https://domain.com"                         -- domain
    print $ domainName "api.domain.com"                             -- domain
    print $ domainName "http://api.domain.com"                      -- domain
    print $ domainName "https://api.domain.com"                     -- domain
    print $ domainName "dev.api.domain.com"                         -- domain
    print $ domainName "http://dev.api.domain.com"                  -- domain
    print $ domainName "https://dev.api.domain.com"                 -- domain
    print $ domainName "https://dev.api.domain.com/something.cool"  -- domain

Try it online

Hosted on Repl.it.

 

Have you tried it using a .co.uk TLD? :D

 

No, I didn't, and I assume it will fail for this particular case. Do you have any idea how to improve this one?

Unfortunately, I think the only way to improve on it is to use the list of all TLDs to find how much of the end of the domain is TLD.
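
A rough sketch of that idea, written in JavaScript rather than Haskell. The suffix set below is a tiny hand-picked sample and an assumption of this example; a real list such as the Public Suffix List has thousands of entries. The helper name `domainNameWithSuffixes` is also made up.

```javascript
// A tiny, hand-picked sample of suffixes (assumption: a real list is far longer).
const KNOWN_SUFFIXES = new Set(["com", "org", "co.uk", "com.au", "ac.il"]);

// `domainNameWithSuffixes` is a hypothetical helper that expects a bare hostname.
const domainNameWithSuffixes = hostname => {
  const labels = hostname.toLowerCase().split(".");
  // Try the longest candidate suffix first, so "co.uk" is found before "uk" would be.
  for (let i = 1; i < labels.length; i++) {
    const suffix = labels.slice(i).join(".");
    if (KNOWN_SUFFIXES.has(suffix)) {
      // The label immediately before the suffix is the name we want.
      return labels[i - 1];
    }
  }
  return null; // suffix not in the sample list
};

console.log(domainNameWithSuffixes("dev.api.domain.co.uk")); // domain
console.log(domainNameWithSuffixes("www.youtube.com"));      // youtube
```

The lookup only works as well as the list behind it, so anything missing from the sample set returns null here.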

 

Javascript, using npm package "tldjs"

const { parse } = require("tldjs");

const regexify = str => {
  return str.replace(/[|\\{}()[\]^$+*?.]/g, "\\$&");
};

const getDomain = url => {
  const parseResult = parse(url);
  console.log(parseResult);
  if (parseResult.domain) {
    return parseResult.domain.replace(
      RegExp(regexify("." + parseResult.publicSuffix) + "$"),
      ""
    );
  } else {
    return parseResult.hostname;
  }
};

Try it: codesandbox.io/s/affectionate-dawn...

 

Jesus, NO! This is what's wrong with development today. This is equivalent to killing a fly with a Sherman tank.

Please, please, for the love of God and companies everywhere that are sick of obfuscated, confusing, unmanageable, unmaintainable and insecure code, please look at the one-line pure JavaScript code above this answer.

There is absolutely no reason on earth to include a library with hundreds of lines of code to perform a simple operation.

 

Your comment is what's wrong with development today. Taking the shortest solution that seems to somehow solve the vague requirements and declaring it solved and secure.
Each of the JS solutions will fail for one of these test URLs: 'a.b.c.ac.il/', 'news.com.au/', 'youtube.com'.
Except for mine.
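
For anyone who wants to check this, here is a small sketch that runs two of the approaches from this thread against those three URLs. The `attempt` wrapper and the name `urlBased` are additions for this example; `regexOneLiner` is the one-liner posted above.

```javascript
// The regex one-liner posted earlier in this thread.
const regexOneLiner = url => url.replace(/https?:\/\/(?:www\.)?/, "").split(".")[0];

// A URL-based variant, in the style of several other answers here.
const urlBased = url => new URL(url).hostname.split(".")[0];

// Run a solution and report either its result or the error it throws.
const attempt = (fn, url) => {
  try {
    return fn(url);
  } catch (err) {
    return `throws ${err.name}`;
  }
};

const failingUrls = ["a.b.c.ac.il/", "news.com.au/", "youtube.com"];
for (const url of failingUrls) {
  console.log(url, "->", attempt(regexOneLiner, url), "|", attempt(urlBased, url));
}
```

The regex version returns "a" for the first URL, where "c" would be expected if ac.il is treated as the suffix, and the URL-based version throws on all three because none of them has a scheme.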

 

O(N) approach:

#include<iostream>
#include <string>
using namespace std;
string domainName(string url)
{
    string domain = "";
    bool flag = true;
    for(int i = 0; i < url.size(); i++)
    {
        if(flag)
        {
            if(url[i] == '/')
            {
                flag = false;
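                i+=2; // step past the "//"
                // Skip a leading "www." only; testing just for 'w' would mangle hosts like "wikipedia.org"
                if(i < (int)url.size() && url.compare(i, 4, "www.") == 0)
                    i = i+3;
                else
                    i--;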
            }
            continue;
        }
        else if(url[i] == '.')
            return domain;
        domain += url[i];
    }
    return domain;
}
int main()
{
    string url;
    cin >> url;
    cout << domainName(url) << endl;
    return 0;
}

Naive approach:

#include<iostream>
#include <string>
using namespace std;
string domainName(string url)
{
    int x = url.find("www");
    if(x==string::npos)
    {
        x = url.find("//");
        x+=2;
    }
    else
        x+=4;
    return url.substr(x,url.find(".com")-x);
}
int main()
{
    string url;
    cin >> url;
    cout << domainName(url) << endl;
    return 0;
}
 

My solution in js

const domainName = (url) => {
  const match = url.match(/:\/\/(www[0-9]?\.)?(.[^/:]+)/i);
  return (match && match.length > 2 && typeof match[2] === 'string' && match[2].length) ? match[2].split('.')[0] : null;
}
 

JavaScript:

const domainName = url => {
    let hostName = new URL(url).hostname;
    let domain = hostName;
    if (hostName != null) {
        let str = hostName.split('.').reverse();
        if (str != null && str.length > 1) {
            domain = str[1] + '.' + str[0];
            if (hostName.indexOf(/[^/]+((?=\/)|$)/g) != -2 && str.length > 2) {
                domain = str[2] + '.' + domain;
            }
        }
    }
    return domain.split('.')[0];
}

Considering subdomains and second-level domains (e.g. '.bc.ca').

 

Python one-liner:

def domain_name(url):
    return url.split("/")[2].split(".")[-2]
 

One in JS

hostName is based on a quick reading of the spec (see "Uniform Resource Identifier" on Wikipedia) and should cope with usernames and ports.

salientSubdomain is the human-readable part of a hostname, as required. It just clips off the www and the TLD, with a little complication to handle non-matching strings cleanly.

const hostName = url => /^(?:[^:]+:\/\/(?:[^@\/?]+@)?([^:\/?]+))?/.exec(url)[1]
const salientSubdomain = url => /^(?:(?:www\.)?(.+?)(?:\.[a-zA-Z]+)?)?$/.exec(hostName(url)||'')[1]

const testUrls = [
"https://twitter.com/explore",
"https://github.com/thepracticaldev/dev.to",
"https://www.youtube.com",
"https://will:p4ssw0rd@google.com?q=cybersecurity",
"https://a.b.c.d:8080",
"mailto:mr@willsm.art",
"http://192.168.1.60:3000/home"
];

console.log(testUrls.map(hostName))
console.log(testUrls.map(salientSubdomain))

output:

[ "twitter.com", "github.com", "www.youtube.com", "google.com", "a.b.c.d", undefined, "192.168.1.60" ]
[ "twitter", "github", "youtube", "google", "a.b.c", undefined, "192.168.1.60" ]

The hostname extractor regex looks fairly funky but isn't too bad if you break it down into parts...
Regular expression visualization
(vis by Debuggex, which rocks)
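
One way to see the parts without the picture is to assemble the same pattern from named pieces. This is only a sketch: the variable names `scheme`, `userinfo` and `host` are labels chosen for this example, and the assembled pattern is meant to match the hostName regex above.

```javascript
// Each piece of the hostName pattern, in the order it appears.
const scheme   = String.raw`[^:]+:\/\/`;     // e.g. "https://"
const userinfo = String.raw`(?:[^@\/?]+@)?`;  // optional "user:password@"
const host     = String.raw`([^:\/?]+)`;      // captured: everything up to a port, path or query

// The outer optional group lets non-matching strings yield undefined instead of failing.
const hostNameRegex = new RegExp(`^(?:${scheme}${userinfo}${host})?`);
const hostName = url => hostNameRegex.exec(url)[1];

console.log(hostName("https://will:p4ssw0rd@google.com?q=cybersecurity")); // google.com
console.log(hostName("mailto:mr@willsm.art"));                             // undefined
```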

 

Quick & dirty ugly chain if you don't want to research regex:

const domainName = (domain) => domain.split('://')[1].split('/')[0].includes('www.') ? domain.split('://')[1].split('/')[0].split('www.')[1].split('.')[0] : domain.split('://')[1].split('/')[0].split('.')[0]
 

Oneline with javascript

const domainName = d => new URL(d).hostname.replace(/^www\./, "").split(".").shift()
 

No regex needed

 function domainName(url){
     return url.split("/")[2].split(".").slice(-2)[0]
 }
 

It's got to be in my fav, JavaScript!!
