Sanchithasr
3 ways to convert HTML text to plain text

I was working with a rich text editor the other day and needed to strip the HTML tags from a string before storing it in the database. Here are a few ways I learned that could come in handy to anyone trying to do the same.
The goal is to remove the tags from the string so that it can be printed as plain text. Let's dive in and see how it works.

1) Using .replace(/<[^>]+>/g, '')

This method is a simple and efficient way to remove the tags from the text. It uses the string method .replace(pattern, replacement) to replace every HTML tag with an empty string. The g flag makes the replacement global: every match in the string is replaced, not just the first one.
The drawback of this method is that it does not decode HTML entities such as &amp;, so they stay in the output as-is. It still works well for simple markup though.

var myHTML= "<div><h1>Jimbo.</h1>\n<p>That's what she said</p></div>";

var strippedHtml = myHTML.replace(/<[^>]+>/g, '');

// Jimbo.
// That's what she said
console.log(strippedHtml);
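Since the regex leaves entities such as &amp; untouched, one option is to decode them in a second step. The sketch below is only an illustration: the helper names (decodeEntities, stripTagsAndDecode) and the small entity table are my own assumptions, and the table covers just a handful of named entities plus numeric ones.

```javascript
// Hypothetical, deliberately tiny entity table (not a complete list).
const NAMED_ENTITIES = { amp: '&', lt: '<', gt: '>', quot: '"', apos: "'", nbsp: '\u00a0' };

// Decode numeric entities and the few named ones listed above.
// Anything unknown is left exactly as it was.
function decodeEntities(text) {
  return text.replace(/&(#x[0-9a-f]+|#[0-9]+|[a-z]+);/gi, (match, body) => {
    if (body[0] === '#') {
      const isHex = body[1] === 'x' || body[1] === 'X';
      const codePoint = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      // Guard against values String.fromCodePoint would reject.
      return codePoint <= 0x10ffff ? String.fromCodePoint(codePoint) : match;
    }
    return Object.prototype.hasOwnProperty.call(NAMED_ENTITIES, body)
      ? NAMED_ENTITIES[body]
      : match;
  });
}

// Strip the tags first, then decode, so a decoded "<" is never mistaken for a tag.
function stripTagsAndDecode(html) {
  return decodeEntities(html.replace(/<[^>]+>/g, ''));
}

console.log(stripTagsAndDecode('<p>Fish &amp; chips &lt;3</p>'));
// Fish & chips <3
```

Because the decoding happens in a single pass, a string like &amp;lt; becomes &lt; and is not decoded twice.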

2) Create a temporary DOM element and retrieve the text

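This approach lets the browser's own HTML parser do the work, so it also decodes HTML entities. Create a temporary element and assign it to a variable, set its innerHTML to the HTML string, and then read the plain text back from the element's textContent (or innerText) property. The element is never attached to the page. Note that this needs a DOM, so it works in the browser but not in plain Node.js.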

function convertToPlain(html){

    // Create a new div element
    var tempDivElement = document.createElement("div");

    // Set the HTML content with the given value
    tempDivElement.innerHTML = html;

    // Retrieve the text property of the element
    return tempDivElement.textContent || tempDivElement.innerText || "";
}

var htmlString= "<div><h1>Bears Beets Battlestar Galactica </h1>\n<p>Quote by Dwight Schrute</p></div>";


console.log(convertToPlain(htmlString));
// Expected Result:
// Bears Beets Battlestar Galactica 
// Quote by Dwight Schrute
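Because this method depends on a DOM, code that may run both in the browser and on the server needs a fallback. Here is a minimal sketch under a few assumptions: the function name htmlToPlainText is hypothetical, it uses the standard DOMParser API with the 'text/html' type when that is available, and it falls back to the regex from method 1 otherwise.

```javascript
// Hypothetical helper: picks a strategy depending on the environment.
function htmlToPlainText(html) {
  if (typeof DOMParser !== 'undefined') {
    // Browser: parse into a separate document and read the text of its body.
    const doc = new DOMParser().parseFromString(html, 'text/html');
    return doc.body.textContent || '';
  }
  // No DOM available (for example plain Node.js): fall back to stripping tags.
  return html.replace(/<[^>]+>/g, '');
}

console.log(htmlToPlainText('<div><h1>Bears Beets</h1>\n<p>Quote</p></div>'));
// Bears Beets
// Quote
```

The two branches are not fully equivalent: the DOM branch decodes HTML entities, while the regex fallback leaves them as they are.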

3) html-to-text npm package

This is a package I discovered recently. It is a converter that parses HTML and returns nicely formatted text. It comes with many options for the conversion, such as wordwrap, tags, whitespaceCharacters, formatters, etc.
The package is installed with npm, so your project should have a package.json. Install the package first and then require it in your file.
You can find the official doc of the package here.

Installation

npm install html-to-text

Usage

const { htmlToText } = require('html-to-text');

const text = htmlToText('<div>Nope, it\'s not Ashton Kutcher. It is Kevin Malone. <p>Equally smart and equally handsome</p></div>', {
    wordwrap: 130
});
console.log(text); // expected result: 
// Nope, it's not Ashton Kutcher. It is Kevin Malone.
//
// Equally smart and equally handsome

Find the example of the project here.

And that sums it up. Thank you!!

Top comments (7)

Etienne Burdet • Edited

There is an alternative way to create a DOM element that doesn't append to the body (actually, the one from the OP doesn't either!):

 

const parser = new DOMParser()
// 'text/html' gives an HTML document with a body; the text is read from that body
const floatingElement = parser.parseFromString(stringWithTags, 'text/html')
const string = floatingElement.body.innerText

You could also use document.createDocumentFragment(), but fragments have no innerText, so in the end it's more complicated.

Matt Ellen

The method presented in the main post doesn't append to the body, either.

Etienne Burdet • Edited

Oh, you're right, I read that too quickly :p

Honza • Edited

The html-to-text module seems great, but I have problems using it in a browser environment. I am running into an issue similar to one described on GitHub. Can anybody recommend a production-ready alternative that plays well with browsers?

Calin Baenen

What is [^ with RegExps (well, I mean, the context clues of the post tell me, but y'know, maybe there's more to it than I think), and where can I learn about similar RE features?

Max Pogonowski

It means "negated set" so [^>] is "a set of any characters except for >"
My favourite tool for learning about and testing regex is regexr.com/
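To make the negated set concrete, here is a small sketch you can paste into a console. The sample string and variable names are only illustrative; it compares the pattern from the post with and without the g flag, and with a greedy dot instead of [^>].

```javascript
const html = '<p>Hello <b>world</b></p>';

// [^>] matches any single character except ">", so each match stops at the
// first ">" it meets and every tag is found separately.
const perTag = html.match(/<[^>]+>/g);
console.log(perTag); // [ '<p>', '<b>', '</b>', '</p>' ]

// Without the g flag only the first match is replaced.
const firstOnly = html.replace(/<[^>]+>/, '');
console.log(firstOnly); // Hello <b>world</b></p>

// With the g flag every match is replaced.
const allTags = html.replace(/<[^>]+>/g, '');
console.log(allTags); // Hello world

// A greedy ".+" instead of "[^>]+" runs on to the last ">" in the string,
// so here it swallows the text between the tags as well.
const greedy = html.match(/<.+>/)[0];
console.log(greedy === html); // true
```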

AbhishekShahasane

Is it possible to parse the images that we insert in the editor? And do we need an HTML viewer to display the data in a Flutter app?