Go Web Scraping: HTML Parsing from Zero to Hero
Daniel Hayes
Full-Stack Engineer · Leapcell

Installation and Usage of Goquery
Installation
Execute:
go get github.com/PuerkitoBio/goquery
Import
import "github.com/PuerkitoBio/goquery"
Load the Page
Take the IMDb Popular Movies page as an example:
package main

import (
	"fmt"
	"log"
	"net/http"

	"github.com/PuerkitoBio/goquery"
)

func main() {
	res, err := http.Get("https://www.imdb.com/chart/moviemeter/")
	if err != nil {
		log.Fatal(err)
	}
	defer res.Body.Close()
	if res.StatusCode != 200 {
		log.Fatalf("status code error: %d %s", res.StatusCode, res.Status)
	}
	// main continues in the next step.
Get the Document Object
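	doc, err := goquery.NewDocumentFromReader(res.Body)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(doc.Find("title").Text()) // doc is now ready to query
}

// Other creation methods:
// doc, err := goquery.NewDocumentFromReader(reader) // from any io.Reader
// doc, err := goquery.NewDocumentFromReader(strings.NewReader("<p>Example content</p>")) // from an HTML string
// doc, err := goquery.NewDocument(url) // deprecated: fetch with net/http, then use NewDocumentFromReader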
Select Elements
Element Selector
Select by HTML element name. For example, dom.Find("p") matches all p tags. Calls can be chained:
ele.Find("h2").Find("a")
Attribute Selector
Filter elements by element attributes and values, with multiple matching methods:
Find("div[my]")     // div elements that have the my attribute
Find("div[my=zh]")  // div elements whose my attribute is zh
Find("div[my!=zh]") // div elements whose my attribute is not equal to zh
Find("div[my|=zh]") // div elements whose my attribute is zh or starts with zh-
Find("div[my*=zh]") // div elements whose my attribute contains the string zh
Find("div[my~=zh]") // div elements whose my attribute contains the word zh
Find("div[my$=zh]") // div elements whose my attribute ends with zh
Find("div[my^=zh]") // div elements whose my attribute starts with zh
parent > child Selector
Select the direct children of an element. For example, dom.Find("div>p") selects the p tags that are direct children of a div tag.
element + next Adjacent Selector
Use it when the target elements have no distinguishing pattern of their own but the element before them does. For example, dom.Find("p[my=a]+p") selects the p tag that immediately follows a p tag whose my attribute is a.
element~next Sibling Selector
Select the following siblings under the same parent element, whether or not they are adjacent. For example, dom.Find("p[my=a]~p") selects every sibling p tag that comes after a p tag whose my attribute is a.
ID Selector
It starts with # and matches the element with that id. For example, dom.Find("#title") matches the element with id=title, and you can also specify the tag: dom.Find("p#title").
ele.Find("#title")
Class Selector
It starts with . and selects the elements with the specified class name. For example, dom.Find(".content1"). You can also specify the tag: dom.Find("div.content1").
ele.Find(".title")
Selector OR (,) Operation
Combine multiple selectors, separated by commas. An element is selected if it matches any one of them. For example, Find("div,span").
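package main

import (
	"fmt"
	"log"
	"strings"

	"github.com/PuerkitoBio/goquery"
)

func main() {
	html := `<body>
<div lang="zh">DIV1</div>
<span>
<div>DIV5</div>
</span>
</body>`

	dom, err := goquery.NewDocumentFromReader(strings.NewReader(html))
	if err != nil {
		log.Fatalln(err)
	}

	// Matches every div and every span in the document.
	dom.Find("div,span").Each(func(i int, selection *goquery.Selection) {
		// Html returns the inner HTML and an error, so handle both.
		inner, err := selection.Html()
		if err != nil {
			log.Println(err)
			return
		}
		fmt.Println(inner)
	})
}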
Filters
:contains Filter
Select elements that contain the specified text. For example, dom.Find("p:contains(a)") selects the p tags whose text contains a.
dom.Find("div:contains(DIV2)").Each(func(i int, selection *goquery.Selection) {
	fmt.Println(selection.Text())
})
:has(selector)
Select elements that contain a descendant matching the given selector.
:empty
Select elements that have no children at all, neither child elements nor text.
:first-child and :first-of-type Filters
Find("p:first-child") selects a p tag only if it is the first child of its parent; Find("p:first-of-type") selects the first p tag among its siblings, even if elements of other types come before it.
:last-child and :last-of-type Filters
The opposite of :first-child and :first-of-type: they match the last child and the last element of its type.
:nth-child(n) and :nth-of-type(n) Filters
:nth-child(n) selects an element that is the nth child of its parent; :nth-of-type(n) selects the nth element of the same type among its siblings. Counting starts at 1.
:nth-last-child(n) and :nth-last-of-type(n) Filters
The same as above, but counted from the end, so the last element is number 1.
:only-child and :only-of-type Filters
Find(":only-child") selects an element that is the only child of its parent; Find(":only-of-type") selects an element that is the only one of its type among its siblings.
Get Content
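ele.Html() // inner HTML of the first matched element, plus an error
ele.Text() // combined text of all matched elements and their descendants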
Traversal
Use the Each method to traverse the selected elements:
ele.Find(".item").Each(func(index int, elA *goquery.Selection) {
	href, _ := elA.Attr("href")
	fmt.Println(href)
})
Built-in Functions
Array Positioning Functions
Eq(index int) *Selection
First() *Selection
Get(index int) *html.Node
Index...() int
Last() *Selection
Slice(start, end int) *Selection
Extended Functions
Add...()
AndSelf()
Union()
Filtering Functions
End()
Filter...()
Has...()
Intersection()
Not...()
Loop Traversal Functions
Each(f func(int, *Selection)) *Selection
EachWithBreak(f func(int, *Selection) bool) *Selection
Map(f func(int, *Selection) string) (result []string)
Document Modification Functions
After...()
Append...()
Before...()
Clone()
Empty()
Prepend...()
Remove...()
ReplaceWith...()
Unwrap()
Wrap...()
WrapAll...()
WrapInner...()
Attribute Manipulation Functions
Attr*(), RemoveAttr(), SetAttr()
AttrOr(e string, d string)
AddClass(), HasClass(), RemoveClass(), ToggleClass()
Html()
Length()
Size()
Text()
Node Search Functions
Contains()
Is...()
Document Tree Traversal Functions
Children...()
Contents()
Find...()
Next...() *Selection
NextAll() *Selection
Parent[s]...()
Prev...() *Selection
Siblings...()
Type Definitions
Document
Selection
Matcher
Helper Functions
NodeName
OuterHtml
Examples
Getting Started Example
package main

import (
	"fmt"
	"log"
	"strings"

	"github.com/PuerkitoBio/goquery"
)

func main() {
	html := `<html>
<body>
<h1 id="title">O Captain! My Captain!</h1>
<p class="content1">
O Captain! my Captain! our fearful trip is done,
The ship has weather’d every rack, the prize we sought is won,
The port is near, the bells I hear, the people all exulting,
While follow eyes the steady keel, the vessel grim and daring;
</p>
</body>
</html>`

	dom, err := goquery.NewDocumentFromReader(strings.NewReader(html))
	if err != nil {
		log.Fatalln(err)
	}

	// Print the text of every p element.
	dom.Find("p").Each(func(i int, selection *goquery.Selection) {
		fmt.Println(selection.Text())
	})
}
Example of Crawling IMDb Popular Movie Information
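package main

import (
	"fmt"
	"log"
	"net/http"

	"github.com/PuerkitoBio/goquery"
)

func main() {
	// goquery.NewDocument(url) is deprecated, so fetch the page with net/http
	// and pass the response body to NewDocumentFromReader, as in "Load the Page".
	res, err := http.Get("https://www.imdb.com/chart/moviemeter/")
	if err != nil {
		log.Fatal(err)
	}
	defer res.Body.Close()
	if res.StatusCode != http.StatusOK {
		log.Fatalf("status code error: %d %s", res.StatusCode, res.Status)
	}

	doc, err := goquery.NewDocumentFromReader(res.Body)
	if err != nil {
		log.Fatal(err)
	}

	// Each link inside an element with class titleColumn is treated as one movie.
	doc.Find(".titleColumn a").Each(func(i int, selection *goquery.Selection) {
		title := selection.Text()
		href, _ := selection.Attr("href")
		fmt.Printf("Movie Name: %s, Link: https://www.imdb.com%s\n", title, href)
	})
}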
The example above extracts movie names and links from the IMDb popular movies page. The .titleColumn selector depends on the page's markup, which may change over time, so adjust the selectors and processing logic to match the page you are scraping.