DEV Community

Jaydeep Dave


Golang HTML tokenizer

Photo by Mohammad Rahmani on Unsplash

If you want to parse HTML and extract content in Go the way you would in PHP or JavaScript by creating a new DOM document, there are several packages to choose from, depending on your requirements. Some of the options I found are:

  • gohtml: gohtml (the golang.org/x/net/html package) is an HTML5 tokenizer and parser implementation. Parsing returns a tree of nodes, and the tokenizer lets you extract content by token type, tag name, attributes, and text data.

  • goquery: goquery is built on the gohtml package and the CSS selector library Cascadia, which gives it more power over content selection and extraction. Its syntax is similar to jQuery's.

  • godom: godom is a library that lets you manipulate the DOM from Go much as you would in JavaScript. It compiles Go code to JavaScript using GopherJS.

For this demonstration, I will use gohtml and its tokenizer.

Tokenization is lexical analysis: it splits the input into tokens. HTML tokens include start tags (with their attribute names and values), end tags, text, and comments.

Tokenizing the document is the first step in parsing it into a tree of element and text nodes, similar to the DOM.

Types of HTML Tokens Supported:

  • html.StartTagToken: a start tag such as <html>
  • html.EndTagToken: an end tag such as </html>
  • html.SelfClosingTagToken: a self-closing tag such as <img .../>
  • html.TextToken: text content within a tag
  • html.CommentToken: an HTML comment such as <!-- comment -->
  • html.DoctypeToken: a document type declaration such as <!DOCTYPE html>

Example:

package main

import (
	"fmt"
	"io"
	"strings"

	"golang.org/x/net/html"
)

func main() {
	tokenizer := html.NewTokenizer(strings.NewReader(sampleHtml))
	for {
		tokenType := tokenizer.Next()
		// Check for ErrorToken first: it signals either the end of the
		// input (io.EOF) or a real error.
		if tokenType == html.ErrorToken {
			if tokenizer.Err() == io.EOF {
				return
			}
			fmt.Printf("Error: %v\n", tokenizer.Err())
			return
		}
		// Token.String() escapes text, so unescape it for display.
		token := tokenizer.Token()
		fmt.Printf("Token: %v\n", html.UnescapeString(token.String()))
	}
}

const sampleHtml = `<!DOCTYPE html><html><head><style> body {background-color: powderblue;} h1 {color: red;} p {color: orange;}</style><title>Sample HTML Code</title><script src="my-script.js">abc</script></head><body><h1>Main title</h1><p id="demo"></p><a href="https://dev.to/">Dev Community</a><script>document.getElementById("demo").innerHTML = "Hello JavaScript!";</script></body></html>`


Output:

Token: <!DOCTYPE html>
Token: <html>
Token: <head>
Token: <style>
Token:  body {background-color: powderblue;} h1 {color: red;} p {color: orange;}
Token: </style>
Token: <title>
Token: Sample HTML Code
Token: </title>
Token: <script src="my-script.js">
Token: abc
Token: </script>
Token: </head>
Token: <body>
Token: <h1>
Token: Main title
Token: </h1>
Token: <p id="demo">
Token: </p>
Token: <a href="https://dev.to/">
Token: Dev Community
Token: </a>
Token: <script>
Token: document.getElementById("demo").innerHTML = "Hello JavaScript!";
Token: </script>
Token: </body>
Token: </html>

Here, I simply check for an error token (which also covers EOF) and print every other token as is.

We can also filter on the token type, such as html.StartTagToken or html.EndTagToken, as listed above, or on the element name, such as html, h1, script, or style:

// This loop replaces the body of main in the example above.
tokenizer := html.NewTokenizer(strings.NewReader(sampleHtml))
for {
	tokenType := tokenizer.Next()
	if tokenType == html.ErrorToken {
		if tokenizer.Err() == io.EOF {
			return
		}
		fmt.Printf("Error: %v\n", tokenizer.Err())
		return
	}
	token := tokenizer.Token()
	// For tag tokens, Data is the tag name; for text tokens, it is the text itself.
	switch token.Data {
	case "script": // matches both <script ...> and </script>
		fmt.Printf("Script Token: %v\n", html.UnescapeString(token.String()))
	case "style": // matches both <style> and </style>
		fmt.Printf("Style Token: %v\n", html.UnescapeString(token.String()))
	default: // everything else, including the text inside <script> and <style>
		fmt.Printf("Others: %v\n", html.UnescapeString(token.String()))
	}
}

