DEV Community

Red

Posted on • Originally published at redthemers.tech

How to extract text from an HTML string using JavaScript

There are various ways to extract text from an HTML string, but we will do it using a regular expression (regex).
First we store the HTML string in a variable, then call the replace method on it, passing the appropriate regular expression and a second parameter with the value that replaces each match.

Example :

        let name = "my name is anzar";
        let newName = name.replace("anzar", "red");
        console.log("the new name is", newName);   // the new name is my name is red

Here, the first parameter is the word to search for in the string, and the second parameter is the value that replaces the matched word.

Simple, right? But wait, what if anzar appears more than once?

Example :

        let name = "hey anzar how are you anzar";
        let newName = name.replace("anzar", "red");
        console.log("the new name is", newName);

If we run this, the result is "hey red how are you anzar".
So it works for the first matching word only. Since an HTML string has many tags, this won't work for us.
To get this working, we write the first parameter as a regular expression and add the g flag at the end, for example /anzar/g. The g stands for global, so every matching word is replaced instead of only the first one.
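
A minimal sketch of the difference, reusing the sample sentence from above (the variable names are only for illustration):

```javascript
const sentence = "hey anzar how are you anzar";

// A plain string pattern replaces only the first match.
const firstOnly = sentence.replace("anzar", "red");

// A regular expression with the g (global) flag replaces every match.
const allMatches = sentence.replace(/anzar/g, "red");

console.log(firstOnly);  // hey red how are you anzar
console.log(allMatches); // hey red how are you red
```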

Great, now let's do our main task. There are over 100 HTML tags, like the p tag, the a tag, etc.
So we would need to remove every tag one by one, the way we did above. Just kidding 😜

Regular expressions come to the rescue. They are one of the most powerful tools you can use in programming, but they can be highly frustrating.
We won't go through how it works here, but don't worry, I will give you the expression for removing HTML.

The regular expression is -

      replace(/<[^>]*(>|$)|&nbsp;|&zwnj;|&raquo;|&laquo;|&gt;/g, ' ');

The second parameter is a single space because we just want to remove the HTML without gluing neighbouring words together. This will work great. Just one more thing remains.
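
As a rough sketch, here is the expression applied to a small sample string. It assumes the alternatives after the tag pattern are the entities &nbsp;, &zwnj;, &raquo;, &laquo; and &gt;. The sample markup and variable names are made up for illustration.

```javascript
// Sample markup; any HTML string would do here.
const html = "<p>Hello&nbsp;<b>world</b></p>";

// Each tag and each listed entity is replaced with one space.
const text = html.replace(/<[^>]*(>|$)|&nbsp;|&zwnj;|&raquo;|&laquo;|&gt;/g, " ");

// JSON.stringify makes the leftover spaces visible.
console.log(JSON.stringify(text)); // " Hello  world  "
```

Every removed tag leaves a space behind, so words from neighbouring elements never run together, but the result can contain extra spaces.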

In HTML, & is represented as &amp;, so if the text in the HTML contains &, the string will most likely contain &amp; instead. To fix this, let's use the replace method again, but this time, instead of passing a space as the second parameter, we pass & because we want to preserve the text.

Example :

replace(/&amp;/g, "&");

Finally, we have removed all the markup and are left with plain text. I hope you understood it completely. Do remember to add the g flag at the end of the regular expression so that every matching instance is replaced.
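
Putting the two steps together could look something like the sketch below. The function name extractText is hypothetical, and the last two calls (collapsing repeated spaces and trimming) are an extra assumption on top of the steps described in this post.

```javascript
// Hypothetical helper; the name is made up for this example.
function extractText(html) {
  return html
    // Step 1: replace tags and a few entities with a space.
    .replace(/<[^>]*(>|$)|&nbsp;|&zwnj;|&raquo;|&laquo;|&gt;/g, " ")
    // Step 2: turn the &amp; entity back into a literal ampersand.
    .replace(/&amp;/g, "&")
    // Extra step (assumption): collapse runs of whitespace and trim the ends.
    .replace(/\s+/g, " ")
    .trim();
}

const sample = "<h1>Tom &amp; Jerry</h1><p>A <a href='#'>cartoon</a></p>";
console.log(extractText(sample)); // Tom & Jerry A cartoon
```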
