Build a HTML parser using struct tags in Golang
As a total beginner to Golang I find struct tags a very interesting idea. With tags, I can separate the data structure from the meta data used by other parties. It’s similar to how HTML and CSS are separated by class names. While doing my side project, I implemented a package to parse HTML content similar to how the native XML parser works. This blog post summarizes the process. I will assume that if you read this you probably have used struct tags before, and have a basic understanding about them.
It all started with a simple idea about an interface to describe how a HTML parser should work without it being tied to the underneath parser logic. The main problem is that each parser implementation provides its own structs and ways to work with HTML. I need a more abstract approach.
Fortunately, HTML parsing at its core is just returning DOM nodes based on a valid CSS selector. HTML parsing packages usually build their own conventions on top of DOM nodes to make traversing through them easier. What I wanted to achieve was an abstract parser that recieves a slice of byte, a struct and returns error if any. Behind the scene, it will fill the struct with appropriate data based on the tags provided. It’s similar to how xml package works, but of course a lot simpler.
Let’s start with an interface
type HTMLParser interface {
Parse(content []byte, structure interface{}) error
}
and how I can use it
type Entry struct {
Title string `html:".title"`
Title string `html:".content"`
ReadTime int `html:".read-time"`
}
type HTMLPage struct {
Date string `html:".articles > .published"`
Entries []Entry `html:".article"`
}
var parser HTMLParser
func main() {
htmlContent, err := GetPage(url) // this returns []bytes
parser = NewParser()
var page HTMLPage
// as a non-native English speaker I have no idea about the differences
// between Unmarshal and Parse so I just go with what I usually use in JS
err = parser.Parse(htmlContent, &page)
}
Similar to the native XML package, I have a struct with some tags to describe how each field should be mapped to a CSS selector. And here is the HTML structure
Next, I need to implement the actual logic to unpack the struct tags and do something with them. I’m gonna build the first version using goquery
func NewParser() *GoQueryHTMLParser {
return &GoQueryHTMLParser{}
}
type GoQueryHTMLParser struct {
}
func (p *GoQueryHTMLParser) Parse(content []byte, structure interface{}) error {
r := bytes.NewReader(content)
doc, err := goquery.NewDocumentFromReader(r)
if err != nil {
return err
}
return recursivelyParseDoc(doc.Find("html"), structure)
}
First step is to convert []byte
into an io.Reader
via bytes.NewReader
because goquery only accepts io.Reader
(or maybe I don’t know how to make it work with []byte
). I decided to go with []byte
instead of io.Reader
as the parameter because byte
is a primitive type and easier to pass around or store somewhere.
The main logic happens inside recursivelyParseDoc
which (as the name suggest) recursively goes through the passed struct and find related DOM nodes to extract data. If you are not familiar with goquery, doc.Find("html")
returns a Selection. I will mostly work with Selection struct and only do simple query/data structure, otherwise this blog post will become a novel due to complex nature of HTML parsing :(
Alright! Here we go
func recursivelyParseDoc(doc *goquery.Selection, structure interface{}) error {
structType := reflect.TypeOf(structure)
if structType.Kind() != reflect.Ptr {
return fmt.Errorf("must pass a pointer")
}
// ...
}
First thing first, I need to make sure that the passed structure is a pointer, otherwise, I won’t be able to write anything. Next I need to enforce it to be a struct so to prevent people (mostly me) from passing anything weird into the function
func recursivelyParseDoc(doc *goquery.Selection, structure interface{}) error {
// ...
elem := structType.Elem()
if elem.Kind() != reflect.Struct {
return fmt.Errorf("must pass a struct")
}
// ...
}
The Elem()
method is an interesting one, from the documentation it says
Elem returns the value that the interface v contains
or that the pointer v points to.
It panics if v's Kind is not Interface or Ptr.
It returns the zero Value if v is nil.
In this case, it returns the type that the pointer points to. Now that I have the type of the struct passed to the function, it’s time to inspect its structure.
const selectorTagName = "html"
func recursivelyParseDoc(doc *goquery.Selection, structure interface{}) error {
// ...
for i := 0; i < elem.NumField(); i++ {
field := elem.Field(i)
if field.Tag == "" {
continue
}
tagValue := field.Tag.Get(selectorTagName)
if tagValue == "" {
continue
}
kind := field.Type.Kind()
targetNode := doc.Find(tagValue)
htmlValue := strings.TrimSpace(targetNode.Text())
switch kind {
case reflect.String:
// ...
case reflect.Int, reflect.Int8, reflect.Int16, reflect.Int32, reflect.Int64:
// ...
case reflect.Float32, reflect.Float64:
// ...
case reflect.Struct:
// ...
case reflect.Slice:
// ...
default:
fmt.Printf("unsupported kind [%s][%s]\n", kind, field.Name)
break
}
}
// ...
}
This process is straightforward, NumField()
returns how many field the struct has, and Field()
returns a StructField which contains information about a field such as name, type, and kind. And by the way, the difference between type
and kind
is kinda tricky to understand, I usually think it this way, if I define a struct named MyStruct
then its type is MyStruct
and its kind is struct.
Then, based on the field’s kind, I need to process the DOM node content differently. For example, if it’s int, I would do something like this
case reflect.Int, reflect.Int8, reflect.Int16, reflect.Int32, reflect.Int64:
if htmlValue == "" {
fieldPointer.SetInt(0)
break
}
intValue, err := strconv.ParseInt(htmlValue, 10, 64)
if err != nil {
fmt.Printf("unable to convert value to [%s][%s]\n", kind, field.Name)
fmt.Println(err)
break
}
fieldPointer.SetInt(intValue)
break
You might wonder, where the heck does fieldPointer
come from, well, I omitted few parts in the for loop. Here is where they are defined
ps := reflect.ValueOf(structure).Elem() // <--- was omitted
for i := 0; i < elem.NumField(); i++ {
// ...
field := elem.Field(i)
// ...
fieldPointer := ps.FieldByName(field.Name) // <--- was omitted
if !fieldPointer.CanSet() {
continue
}
kind := field.Type.Kind()
// ...
I’m gonna have to step a few steps back to explain ValueOf()
because it took me some time to understand it (and TypeOf
) at first. In this example
type Person struct {
Title string
Age int
}
var person Person
field := reflect.TypeOf(&person).Elem().Field(0) // points to Title (the field)
value := reflect.ValueOf(&person).Elem().Field(0) // 0 (the value)
Field is the definition of a field in the struct, it contains meta data such as Name
, Tag
. It has nothing to do with the actual value inside the field. Value on the other hand, refers to the actual value that the field contains in its memory address. Without value, I won’t be able to set the data. Also, it needs to be “writable”, otherwise the code will panic
reflect.ValueOf(person).Field(0).CanSet() // false
reflect.ValueOf(&person).Elem().Field(0).CanSet() // true
The first call uses a struct, and since it’s just a value, I won’t be able to change its content. The second call uses a pointer to a struct (that’s why I have to do Elem()
to get the struct that the pointer is pointing to). With a pointer, I’m now be able to set values.
Alright, with that out of the way, let’s continue with the parsing logic. All the primitive values are pretty much handled the same way, read value, convert it if need, and write it back. struct
on the other hand requires a bit more work
case reflect.Struct:
// create a new struct pointer and recursively extract data from it
nestedStruct := reflect.New(fieldPointer.Type()).Interface()
recursivelyParseDoc(targetNode, nestedStruct)
fieldPointer.Set(reflect.ValueOf(nestedStruct).Elem())
break
Few things here, reflect.New(fieldPointer.Type())
returns a pointer to the nested struct. And because of how I use recursivelyParseDoc
(passing a pointer), I need to call Interface()
to convert the reflection Value
to an actual struct pointer.
After I have got the value from the recursive call, I can just call Set()
to update the value in the main struct. Because of how the struct is defined (value instead of pointer), I need to do an additional call to Elem()
(Elem seems to be the MVP in most cases) to get the value that the pointer points to
As for slice
, it’s mostly the same with few additional changes. Firstly, fieldPointer.Type().Elem()
return the type of the slice’s elements. Then logic to create a new struct to be appended to the slice is similar to the normal struct case. Each struct/element created that way is then passed to recursivelyParseDoc
. The final result is appended to the field (pointer) of the main struct (or the one level above struct of the nested struct contains another struct - struct-ception!) using reflect.Append(fieldPointer, reflect.ValueOf(nestedStruct).Elem())
.
case reflect.Slice:
// first get the Type of the children
childType := fieldPointer.Type().Elem()
// then loop through each matched elements and populate the struct
targetNode.Each(func(i int, selection *goquery.Selection) {
nestedStruct := reflect.New(childType).Interface()
recursivelyParseDoc(selection, nestedStruct)
fieldPointer.Set(reflect.Append(fieldPointer, reflect.ValueOf(nestedStruct).Elem()))
})
One thing to remember here is that the whole process needs to be resolved around reflection and not actual data. Data need to be wrapped inside one of the reflection values before setting it.
And that’s it, I have my HTML parser which doesn’t depend on the actual implementation. This is achieved by using tags to separate the data structure from how the data is retrieved. However, this is just a naive implementation with lots of room for improvement. There are several obvious features that I should add such as
- Support DOM attributes, the current implementation just call
Text()
to extract the data, and it’s common to get the data from the HTML tag attribute - Support more data types. I have not explored all of the primitive data types that Go provides yet, so I’m pretty sure that there are many types that I have missed
- Support slices of primitive types. For now when it’s a slice, the code assumes that it’s a slice of struct. In reality, it’s also common to extract data to a slice of string (for example to get a list of tags)
After this small exercise, I have learned a lot about golang’s reflection system. Reflection is one of my favourite features in any programming language, and it’s one of the many reasons that convinced me to give golang a try!
Updated 2020-07-09 here is the package https://gitlab.com/tanqhnguyen/gohtml I’m using it in my personal project but it’s not by any mean battle tested since my use case is just simple HTML parsing :)