Which libraries can you choose from for writing a crawler in Go?

Hi everyone, I'm 欧盆索思 (opensource), bringing you an excellent open-source project every day!

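When it comes to crawlers, many people think of Python, but Go now does reasonably well in this area too. Today let's look at the better-known Go libraries for scraping and crawling.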

GoQuery

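Among Go scraping libraries, the earliest and best known is probably goquery. It is modeled on jQuery, so anyone who has used jQuery will find it very familiar, and it is quite powerful.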

Stars: 9.4k+.

Example:

package main

import (
  "fmt"
  "log"
  "net/http"

  "github.com/PuerkitoBio/goquery"
)

func ExampleScrape() {
  // Request the HTML page.
  // Placeholder URL: substitute the page you want to scrape.
  res, err := http.Get("https://example.com/")
  if err != nil {
    log.Fatal(err)
  }
  defer res.Body.Close()
  if res.StatusCode != 200 {
    log.Fatalf("status code error: %d %s", res.StatusCode, res.Status)
  }

  // Load the HTML document
  doc, err := goquery.NewDocumentFromReader(res.Body)
  if err != nil {
    log.Fatal(err)
  }

  // Find the review items
  doc.Find(".sidebar-reviews article .content-block").Each(func(i int, s *goquery.Selection) {
    // For each item found, get the band and title
    band := s.Find("a").Text()
    title := s.Find("i").Text()
    fmt.Printf("Review %d: %s - %s\n", i, band, title)
  })
}

func main() {
  ExampleScrape()
}

colly

By comparison, goquery's API is fairly low-level, whereas colly is a full crawling framework. It describes itself as an elegant scraper and crawler framework for Golang.

Stars: 12.3k+.

It also has a dedicated website.

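package main

import (
 "fmt"

 "github.com/gocolly/colly/v2"
)

func main() {
 c := colly.NewCollector()

 // Find and visit all links
 c.OnHTML("a[href]", func(e *colly.HTMLElement) {
  e.Request.Visit(e.Attr("href"))
 })

 // Log each request before it is sent
 c.OnRequest(func(r *colly.Request) {
  fmt.Println("Visiting", r.URL)
 })

 // Placeholder start URL: substitute the site you want to crawl.
 c.Visit("https://example.com/")
}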

Note that colly is built on top of goquery.
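
To make the difference between a parsing library and a crawling framework a bit more concrete, here is a rough sketch of one piece of bookkeeping a framework takes off your hands: turning the raw href values found on a page into absolute URLs and making sure each one is only queued once. It uses only the standard library. The helper name resolveLinks and the example URLs are made up for illustration, and this is not how colly itself is implemented.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveLinks is a hypothetical helper, not part of colly or goquery.
// It resolves each href against base, drops the fragment, keeps only
// http(s) URLs, and returns each distinct URL once in first-seen order.
func resolveLinks(base string, hrefs []string) ([]string, error) {
	baseURL, err := url.Parse(base)
	if err != nil {
		return nil, err
	}
	seen := make(map[string]bool)
	var out []string
	for _, href := range hrefs {
		ref, err := url.Parse(href)
		if err != nil {
			continue // skip malformed links
		}
		abs := baseURL.ResolveReference(ref)
		abs.Fragment = "" // "#comments" points at the same page
		if abs.Scheme != "http" && abs.Scheme != "https" {
			continue // ignore mailto:, javascript:, and similar
		}
		key := abs.String()
		if seen[key] {
			continue // already queued
		}
		seen[key] = true
		out = append(out, key)
	}
	return out, nil
}

func main() {
	links, err := resolveLinks("https://example.com/blog/", []string{
		"post-1", "/about", "post-1#comments", "mailto:hi@example.com",
	})
	if err != nil {
		fmt.Println("bad base URL:", err)
		return
	}
	for _, l := range links {
		fmt.Println(l)
	}
}
```

With goquery alone you would write and call something like this yourself for every page. A framework handles this kind of link resolution and de-duplication for you, which is a large part of why it is more convenient for multi-page crawls.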

soup

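A web scraping tool for Go, similar to Python's BeautifulSoup. The library is small, with a core of just over 500 lines of code, so if you are curious about how a scraper is implemented, its source is worth studying.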

Stars: 1.4k+.

Example:

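package main

import (
 "fmt"
 "os"

 "github.com/anaskhan96/soup"
)

func main() {
 // Placeholder URL: substitute the page you want to scrape.
 resp, err := soup.Get("https://example.com/")
 if err != nil {
  fmt.Fprintln(os.Stderr, err)
  os.Exit(1)
 }
 // Parse the fetched HTML and list every link inside <div id="comicLinks">
 doc := soup.HTMLParse(resp)
 links := doc.Find("div", "id", "comicLinks").FindAll("a")
 for _, link := range links {
  fmt.Println(link.Text(), "| Link :", link.Attrs()["href"])
 }
}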

Pholcus

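Written by a Chinese developer, Pholcus is distributed, high-concurrency crawler software. It is a complete application rather than a library. It supports three run modes (standalone, server, and client) and three interfaces (Web, GUI, and command line). Its rules are simple and flexible, batch tasks run concurrently, and it offers a wide range of output targets (mysql/mongodb/kafka/csv/excel, etc.). It also supports both horizontal and vertical crawl modes, along with simulated login, pausing and canceling tasks, and other advanced features.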

Stars: 6.6k+.
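
If "batch tasks run concurrently" sounds abstract, the underlying idea can be sketched in plain Go with a small worker pool: a fixed number of goroutines pull URLs from a channel, so the number of fetches in flight stays bounded. This is only an illustration of the general technique and not Pholcus's actual code. The names crawlAll and fetchFunc are hypothetical, and the fetch function here is a stand-in that makes no network requests.

```go
package main

import (
	"fmt"
	"sync"
)

// fetchFunc stands in for a real HTTP fetch (hypothetical type).
type fetchFunc func(url string) (string, error)

// crawlAll runs fetch over urls using at most `workers` goroutines
// and returns the successfully fetched bodies keyed by URL.
func crawlAll(urls []string, workers int, fetch fetchFunc) map[string]string {
	if workers < 1 {
		workers = 1 // always keep at least one worker to drain the queue
	}
	jobs := make(chan string)
	results := make(map[string]string)
	var mu sync.Mutex // guards results
	var wg sync.WaitGroup

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for u := range jobs {
				body, err := fetch(u)
				if err != nil {
					continue // a real crawler would log or retry here
				}
				mu.Lock()
				results[u] = body
				mu.Unlock()
			}
		}()
	}

	for _, u := range urls {
		jobs <- u
	}
	close(jobs) // no more work: lets the workers exit their loops
	wg.Wait()
	return results
}

func main() {
	// Fake fetcher so the sketch runs without network access.
	fake := func(u string) (string, error) { return "body of " + u, nil }
	res := crawlAll([]string{
		"https://example.com/a",
		"https://example.com/b",
		"https://example.com/c",
	}, 2, fake)
	fmt.Println(len(res), "pages fetched")
}
```

Swapping the fake fetcher for one built on net/http would turn this into a very small concurrent downloader. A full application like Pholcus adds rules, scheduling, and output handling on top.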

Summary

Each of these has its own strengths and weaknesses. If you have a scraping task, pick the one that best fits your needs.