47.2 An example of performance profiling with pprof
In the previous sections we looked at how go tool pprof works, the categories of performance data it handles, and its sampling and analysis approaches. Below we walk through a complete example of profiling with pprof. The example is adapted from a talk titled "Go Debugging, Profiling, and Optimization" given by Brad Fitzpatrick at YAPC Asia 2015 (http://yapcasia.org).
1. The program to be optimized (step0)
The program to be optimized is a simple HTTP service. When its /hi endpoint is visited in a browser, the page displays the content shown in Figure 47-9.

Figure 47-9  The page rendered by the demo program
The page contains a counter that tells visitors which visitor number they are. The page also supports customizing the heading color via a color parameter; for example, after visiting the following URL in a browser, the "Welcome!" heading turns red:
http://localhost:8080/hi?color=red
The source code of the program to be optimized is as follows:
// chapter8/sources/go-pprof-optimization-demo/step0/demo.go

var visitors int64

func handleHi(w http.ResponseWriter, r *http.Request) {
	if match, _ := regexp.MatchString(`^\w*$`, r.FormValue("color")); !match {
		http.Error(w, "Optional color is invalid", http.StatusBadRequest)
		return
	}
	visitNum := atomic.AddInt64(&visitors, 1)
	w.Header().Set("Content-Type", "text/html; charset=utf-8")
	w.Write([]byte("<h1 style='color: " + r.FormValue("color") +
		"'>Welcome!</h1>You are visitor number " + fmt.Sprint(visitNum) + "!"))
}

func main() {
	log.Printf("Starting on port 8080")
	http.HandleFunc("/hi", handleHi)
	log.Fatal(http.ListenAndServe("127.0.0.1:8080", nil))
}
Our experimental environment is Go 1.14 on macOS 10.14.6.
2. Sampling and analyzing CPU profile data (step1)
As mentioned earlier, go tool pprof supports collecting and analyzing several types of performance data. In most cases we start with the CPU profile. Here we collect CPU profile data by writing a performance benchmark for the example program:
// chapter8/sources/go-pprof-optimization-demo/step1/demo_test.go
...
func BenchmarkHi(b *testing.B) {
	req, err := http.ReadRequest(bufio.NewReader(strings.NewReader("GET /hi HTTP/1.0\r\n\r\n")))
	if err != nil {
		b.Fatal(err)
	}

	rw := httptest.NewRecorder()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		handleHi(rw, req)
	}
}
...
Establish a baseline by running the benchmark and recording the initial results:
$go test -v -run=^$ -bench=.
goos: darwin
goarch: amd64
pkg: chapter8/sources/go-pprof-optimization-demo/step1
BenchmarkHi
BenchmarkHi-8     365084    3218 ns/op
PASS
ok    chapter8/sources/go-pprof-optimization-demo/step1    2.069s
Next, use the benchmark to sample CPU profile data:
$go test -v -run=^$ -bench=^BenchmarkHi$ -benchtime=2s -cpuprofile=cpu.prof
After the command finishes, two new files appear in the step1 directory: step1.test and cpu.prof. We feed both files to go tool pprof to analyze the profile data:
$go tool pprof step1.test cpu.prof
File: step1.test
Type: cpu
Time: xx
Duration: 2.35s, Total samples = 2.31s (98.44%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top -cum
Showing nodes accounting for 0.18s, 7.79% of 2.31s total
Dropped 43 nodes (cum <= 0.01s)
Showing top 10 nodes out of 121
      flat  flat%   sum%        cum   cum%
         0     0%     0%      1.90s 82.25%  chapter8/sources/go-pprof-optimization-demo/step1.BenchmarkHi
         0     0%     0%      1.90s 82.25%  chapter8/sources/go-pprof-optimization-demo/step1.handleHi
         0     0%     0%      1.90s 82.25%  testing.(*B).launch
         0     0%     0%      1.90s 82.25%  testing.(*B).runN
         0     0%     0%      1.31s 56.71%  regexp.MatchString
         0     0%     0%      1.26s 54.55%  regexp.Compile (inline)
     0.01s  0.43%  0.43%      1.26s 54.55%  regexp.compile
     0.16s  6.93%  7.36%      0.75s 32.47%  runtime.mallocgc
     0.01s  0.43%  7.79%      0.49s 21.21%  regexp/syntax.Parse
         0     0%  7.79%      0.48s 20.78%  bytes.(*Buffer).Write
(pprof)
The top -cum output shows that, among user-level code, handleHi has the largest cumulative CPU consumption. Use the list command to expand handleHi:
(pprof) list handleHi
Total: 2.31s
ROUTINE ======================== chapter8/sources/go-pprof-optimization-demo/step1.handleHi in chapter8/sources/go-pprof-optimization-demo/step1/demo.go
         0      1.90s (flat, cum) 82.25% of Total
         .          .      9:)
         .          .     10:
         .          .     11:var visitors int64 // must be accessed atomically
         .          .     12:
         .          .     13:func handleHi(w http.ResponseWriter, r *http.Request) {
         .      1.31s     14:	if match, _ := regexp.MatchString(`^\w*$`, r.FormValue("color")); !match {
         .          .     15:		http.Error(w, "Optional color is invalid", http.StatusBadRequest)
         .          .     16:		return
         .          .     17:	}
         .          .     18:	visitNum := atomic.AddInt64(&visitors, 1)
         .       30ms     19:	w.Header().Set("Content-Type", "text/html; charset=utf-8")
         .      500ms     20:	w.Write([]byte("<h1 style='color: " + r.FormValue("color") +
         .       60ms     21:		"'>Welcome!</h1>You are visitor number " + fmt.Sprint(visitNum) + "!"))
         .          .     22:}
         .          .     23:
         .          .     24:func main() {
         .          .     25:	log.Printf("Starting on port 8080")
         .          .     26:	http.HandleFunc("/hi", handleHi)
(pprof)
We can see that within handleHi, the regexp.MatchString call takes the longest (1.31s).
3. The first optimization (step2)
The CPU profile analysis above shows that MatchString is relatively expensive. Reading the code, we find that every time the HTTP service receives a request it validates the value of the color parameter with a regular expression, using the regexp package's MatchString function. That function recompiles the regular expression on every call, which makes it slow. Our optimization is to compile the regular expression only once. The optimized code is shown below:
// chapter8/sources/go-pprof-optimization-demo/step2/demo.go
...
var visitors int64

var rxOptionalID = regexp.MustCompile(`^\w*$`)

func handleHi(w http.ResponseWriter, r *http.Request) {
	if !rxOptionalID.MatchString(r.FormValue("color")) {
		http.Error(w, "Optional color is invalid", http.StatusBadRequest)
		return
	}
	...
}
...
In the optimized code, the call to the MatchString function, which recompiles the regular expression on every invocation, is replaced with the MatchString method of rxOptionalID, an object representing the regular expression compiled once up front.
Rerun the performance benchmark:
$go test -v -run=^$ -bench=.
goos: darwin
goarch: amd64
pkg: chapter8/sources/go-pprof-optimization-demo/step2
BenchmarkHi
BenchmarkHi-8    2624650     457 ns/op
PASS
ok    chapter8/sources/go-pprof-optimization-demo/step2    1.734s
Compared with the pre-optimization result (3218 ns/op), handleHi is now roughly 7 times faster (457 ns/op).
4. Analyzing memory allocation profile data
After profiling and optimizing the program's CPU usage, let's collect another commonly used kind of profile data — memory allocation samples — and see whether there is still room for optimization on the memory front. Once a Go program allocates memory too often and in too large a volume, the workload of the Go GC grows substantially; this not only burns more CPU on garbage collection but also increases GC latency, degrading the application's overall performance. Optimizing allocation behavior is therefore, to a degree, also a means of improving application performance.
In the go-pprof-optimization-demo/step2 directory, add a ReportAllocs call to BenchmarkHi in demo_test.go so that the benchmark reports allocation statistics.
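The book does not list the modified benchmark at this point; a minimal sketch of what it might look like, assuming only the b.ReportAllocs() call is added to the step1 benchmark shown earlier, is:

// demo_test.go (sketch): BenchmarkHi with allocation reporting enabled
func BenchmarkHi(b *testing.B) {
	req, err := http.ReadRequest(bufio.NewReader(strings.NewReader("GET /hi HTTP/1.0\r\n\r\n")))
	if err != nil {
		b.Fatal(err)
	}

	rw := httptest.NewRecorder()
	b.ReportAllocs() // report B/op and allocs/op in the benchmark output
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		handleHi(rw, req)
	}
}

(Passing -benchmem to go test reports the same per-op allocation statistics without touching the code.) With this change in place, run the benchmark to collect allocation profile data: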
$go test -v -run=^$ -bench=^BenchmarkHi$ -benchtime=2s -memprofile=mem.prof
goos: darwin
goarch: amd64
pkg: chapter8/sources/go-pprof-optimization-demo/step2
BenchmarkHi
BenchmarkHi-8    5243474     455 ns/op     364 B/op     5 allocs/op
PASS
ok    chapter8/sources/go-pprof-optimization-demo/step2    3.052s
Next, use the pprof tool to analyze the allocation samples that were written out (mem.prof):
$go tool pprof step2.test mem.prof
File: step2.test
Type: alloc_space
Entering interactive mode (type "help" for commands, "o" for options)
(pprof)
One line of the go tool pprof output reads Type: alloc_space. It means that pprof will present samples of all memory allocated while the program ran, even if that memory had already been freed by the time of the last sample. pprof can also be switched to the inuse_space type, which covers only the memory still in use when sampling ended.
The presentation type can be specified when launching the pprof tool:
$go tool pprof --alloc_space step2.test mem.prof              // legacy form
$go tool pprof -sample_index=alloc_space step2.test mem.prof  // current form
Alternatively, switch after entering pprof interactive mode with the sample_index command:
(pprof) sample_index = inuse_space
Now enter the pprof interactive interface with the alloc_space type and run the top command:
$go tool pprof -sample_index=alloc_space step2.test mem.prof
File: step2.test
Type: alloc_space
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top -cum
Showing nodes accounting for 2084.53MB, 99.45% of 2096.03MB total
Showing top 10 nodes out of 11
      flat  flat%   sum%        cum   cum%
         0     0%     0%  2096.03MB   100%  chapter8/sources/go-pprof-optimization-demo/step2.BenchmarkHi
  840.55MB 40.10% 40.10%  2096.03MB   100%  chapter8/sources/go-pprof-optimization-demo/step2.handleHi
         0     0% 40.10%  2096.03MB   100%  testing.(*B).launch
         0     0% 40.10%  2096.03MB   100%  testing.(*B).runN
         0     0% 40.10%  1148.98MB 54.82%  bytes.(*Buffer).Write
         0     0% 40.10%  1148.98MB 54.82%  bytes.(*Buffer).grow
 1148.98MB 54.82% 94.92%  1148.98MB 54.82%  bytes.makeSlice
         0     0% 94.92%  1148.98MB 54.82%  net/http/httptest.(*ResponseRecorder).Write
         0     0% 94.92%       95MB  4.53%  net/http.Header.Set (inline)
      95MB  4.53% 99.45%       95MB  4.53%  net/textproto.MIMEHeader.Set (inline)
(pprof)
We can see that handleHi allocates a large amount of memory. Expand handleHi with the list command:
(pprof) list handleHi
Total: 2.05GB
ROUTINE ======================== chapter8/sources/go-pprof-optimization-demo/step2.handleHi in chapter8/sources/go-pprof-optimization-demo/step2/demo.go
  840.55MB     2.05GB (flat, cum)   100% of Total
         .          .     17:		http.Error(w, "Optional color is invalid", http.StatusBadRequest)
         .          .     18:		return
         .          .     19:	}
         .          .     20:
         .          .     21:	visitNum := atomic.AddInt64(&visitors, 1)
         .       95MB     22:	w.Header().Set("Content-Type", "text/html; charset=utf-8")
  365.52MB     1.48GB     23:	w.Write([]byte("<h1 style='color: " + r.FormValue("color") +
  475.02MB   486.53MB     24:		"'>Welcome!</h1>You are visitor number " + fmt.Sprint(visitNum) + "!"))
         .          .     25:}
         .          .     26:
         .          .     27:func main() {
         .          .     28:	log.Printf("Starting on port 8080")
         .          .     29:	http.HandleFunc("/hi", handleHi)
(pprof)
The list output shows (see the flat column on the left) that lines 23-24 of handleHi account for most of the allocations.
5. The second optimization (step3)
The memory allocation optimizations applied here are:
- remove the w.Header().Set call;
- replace w.Write with fmt.Fprintf.
The optimized handleHi code is as follows:
// go-pprof-optimization-demo/step3/demo.go
...
func handleHi(w http.ResponseWriter, r *http.Request) {
	if !rxOptionalID.MatchString(r.FormValue("color")) {
		http.Error(w, "Optional color is invalid", http.StatusBadRequest)
		return
	}
	visitNum := atomic.AddInt64(&visitors, 1)
	fmt.Fprintf(w, "<html><h1 style='color: %s'>Welcome!</h1>You are visitor number %d!",
		r.FormValue("color"), visitNum)
}
...
Run the performance benchmark again to collect memory allocation samples:
$go test -v -run=^$ -bench=^BenchmarkHi$ -benchtime=2s -memprofile=mem.prof
goos: darwin
goarch: amd64
pkg: github.com/bigwhite/books/effective-go/chapters/chapter8/sources/go-pprof-optimization-demo/step3
BenchmarkHi
BenchmarkHi-8    7090537     346 ns/op     173 B/op     1 allocs/op
PASS
ok    chapter8/sources/go-pprof-optimization-demo/step3    2.925s
Compared with the numbers before this optimization, allocations drop from 5 allocs/op to 1 alloc/op, and bytes allocated per op drop from 364 B to 173 B.
Analyze the new memory samples with pprof once more to look at the allocations made while running BenchmarkHi:
$go tool pprof step3.test mem.prof
File: step3.test
Type: alloc_space
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) list handleHi
Total: 1.27GB
ROUTINE ======================== chapter8/sources/go-pprof-optimization-demo/step3.handleHi in chapter8/sources/go-pprof-optimization-demo/step3/demo.go
   51.50MB     1.27GB (flat, cum)   100% of Total
         .          .     17:		http.Error(w, "Optional color is invalid", http.StatusBadRequest)
         .          .     18:		return
         .          .     19:	}
         .          .     20:
         .          .     21:	visitNum := atomic.AddInt64(&visitors, 1)
   51.50MB     1.27GB     22:	fmt.Fprintf(w, "<html><h1 style='color: %s'>Welcome!</h1>You are visitor number %d!", r.FormValue("color"), visitNum)
         .          .     23:}
         .          .     24:
         .          .     25:func main() {
         .          .     26:	log.Printf("Starting on port 8080")
         .          .     27:	http.HandleFunc("/hi", handleHi)
(pprof)
We can see that, compared with the previous version, handleHi's own allocations (flat column) have indeed dropped sharply: 365MB + 475MB → 51.5MB.
6. Zero memory allocation (step4)
After this round of memory optimization, handleHi's remaining allocations are concentrated in the following line:
fmt.Fprintf(w, "<html><h1 stype='color: %s'>Welcome!</h1>You are visitor number %d!", r.FormValue("color"), visitNum)
fmt.Fprintf的原型如下:
$ go doc fmt.Fprintf
func Fprintf(w io.Writer, format string, a ...interface{}) (n int, err error)
We can see that Fprintf's variadic parameters are all of type interface{}. As mentioned earlier, an interface value occupies two words, which on a 64-bit architecture amounts to 16 bytes. This means that on every fmt.Fprintf call the program has to allocate a 16-byte interface value for each variadic argument and initialize it with the value passed in. That is why this line of code allocates a relatively large amount of memory.
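The boxing cost can be confirmed with a small benchmark of our own (not part of the book's code); the sketch below assumes the stated Go 1.14 environment and uses ioutil.Discard as the output sink, and the file, package, and benchmark names are ours:

// boxing_test.go (sketch): measuring the allocation caused by ...interface{} boxing
package main

import (
	"fmt"
	"io/ioutil"
	"strconv"
	"testing"
)

func BenchmarkFprintfBoxing(b *testing.B) {
	b.ReportAllocs()
	var visitNum int64 = 12345 // integers below 256 are interned by the runtime, so use a larger value
	for i := 0; i < b.N; i++ {
		// visitNum is boxed into an interface{} argument on every call,
		// which shows up as non-zero allocs/op.
		fmt.Fprintf(ioutil.Discard, "You are visitor number %d!", visitNum)
	}
}

func BenchmarkAppendInt(b *testing.B) {
	b.ReportAllocs()
	var visitNum int64 = 12345
	buf := make([]byte, 0, 64) // preallocated, large enough for the whole message
	for i := 0; i < b.N; i++ {
		// appending the digits directly avoids the interface{} boxing, so allocs/op is 0
		out := append(buf[:0], "You are visitor number "...)
		out = strconv.AppendInt(out, visitNum, 10)
		_ = append(out, '!')
	}
}

Running these two benchmarks should show the Fprintf variant allocating on every iteration and the AppendInt variant allocating nothing, which is exactly the direction the next optimization takes.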
To achieve zero allocations, the code can be optimized as follows:
// chapter8/sources/go-pprof-optimization-demo/step4/demo.go
...
var visitors int64 // must be accessed atomically

var rxOptionalID = regexp.MustCompile(`^\w*$`)

var bufPool = sync.Pool{
	New: func() interface{} {
		return bytes.NewBuffer(make([]byte, 128))
	},
}

func handleHi(w http.ResponseWriter, r *http.Request) {
	if !rxOptionalID.MatchString(r.FormValue("color")) {
		http.Error(w, "Optional color is invalid", http.StatusBadRequest)
		return
	}
	visitNum := atomic.AddInt64(&visitors, 1)
	buf := bufPool.Get().(*bytes.Buffer)
	defer bufPool.Put(buf)
	buf.Reset()
	buf.WriteString("<h1 style='color: ")
	buf.WriteString(r.FormValue("color"))
	buf.WriteString("'>Welcome!</h1>You are visitor number ")
	b := strconv.AppendInt(buf.Bytes(), visitNum, 10)
	b = append(b, '!')
	w.Write(b)
}
The main optimizations here are:
- use a sync.Pool to reduce how often a bytes.Buffer has to be reallocated;
- build the output in a bytes.Buffer whose underlying storage is preallocated;
- use strconv.AppendInt to append the integer into the bytes.Buffer's storage. strconv.AppendInt is implemented as follows:
// $GOROOT/src/strconv/itoa.go
func AppendInt(dst []byte, i int64, base int) []byte {
	if fastSmalls && 0 <= i && i < nSmalls && base == 10 {
		return append(dst, small(int(i))...)
	}
	dst, _ = formatBits(dst, uint64(i), base, i < 0, true)
	return dst
}
We can see that AppendInt has a built-in fast path for base-10 numbers. For our code, the net effect is that no new memory is allocated: AppendInt appends the digits into the bytes.Buffer instance we passed in, so the slice b returned by strconv.AppendInt is simply a view over that buffer's underlying storage (the preallocated buffer leaves enough spare capacity for the appends to work in place).
Run the benchmark for the latest optimized code and collect memory allocation samples:
$go test -v -run=^$ -bench=^BenchmarkHi$ -benchtime=2s -memprofile=mem.prof
goos: darwin
goarch: amd64
pkg: chapter8/sources/go-pprof-optimization-demo/step4
BenchmarkHi
BenchmarkHi-8   10765006     234 ns/op     199 B/op     0 allocs/op
PASS
ok    chapter8/sources/go-pprof-optimization-demo/step4    2.884s
As the benchmark output shows, allocations per op are now 0, and performance has improved as well (346 ns/op → 234 ns/op). Now let's analyze the memory samples that were written out:
$go tool pprof step4.test mem.prof
File: step4.test
Type: alloc_space
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) list handleHi
Total: 2.12GB
ROUTINE ======================== chapter8/sources/go-pprof-optimization-demo/step4.handleHi in chapter8/sources/go-pprof-optimization-demo/step4/demo.go
         0     2.12GB (flat, cum)   100% of Total
         .          .     33:	buf.WriteString("<h1 style='color: ")
         .          .     34:	buf.WriteString(r.FormValue("color"))
         .          .     35:	buf.WriteString("'>Welcome!</h1>You are visitor number ")
         .          .     36:	b := strconv.AppendInt(buf.Bytes(), visitNum, 10)
         .          .     37:	b = append(b, '!')
         .     2.12GB     38:	w.Write(b)
         .          .     39:}
         .          .     40:
         .          .     41:func main() {
         .          .     42:	log.Printf("Starting on port 8080")
         .          .     43:	http.HandleFunc("/hi", handleHi)
(pprof)
The expanded listing of handleHi no longer attributes any allocations to handleHi itself (the flat column is empty).
7. Examining blocking under concurrency (step5)
The benchmarks so far all run handleHi sequentially, so they cannot reveal how goroutines block when handleHi runs concurrently — for example, whether some step waits for too long. To see how handleHi behaves under concurrency, we write a parallel benchmark for it:
// chapter8/sources/go-pprof-optimization-demo/step5/demo_test.go
...
func BenchmarkHiParallel(b *testing.B) {
	r, err := http.ReadRequest(bufio.NewReader(strings.NewReader("GET /hi HTTP/1.0\r\n\r\n")))
	if err != nil {
		b.Fatal(err)
	}

	b.ResetTimer()
	b.RunParallel(func(pb *testing.PB) {
		rw := httptest.NewRecorder()
		for pb.Next() {
			handleHi(rw, r)
		}
	})
}
...
Run this benchmark, then sample and analyze the blocking-time data (block.prof):
$go test -bench=Parallel -blockprofile=block.prof
goos: darwin
goarch: amd64
pkg: chapter8/sources/go-pprof-optimization-demo/step5
BenchmarkHiParallel-8   15029988     118 ns/op
PASS
ok    chapter8/sources/go-pprof-optimization-demo/step5    2.092s

$go tool pprof step5.test block.prof
File: step5.test
Type: delay
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 3.70s, 100% of 3.70s total
Dropped 18 nodes (cum <= 0.02s)
Showing top 10 nodes out of 15
      flat  flat%   sum%        cum   cum%
     1.85s 50.02% 50.02%      1.85s 50.02%  runtime.chanrecv1
     1.85s 49.98%   100%      1.85s 49.98%  sync.(*WaitGroup).Wait
         0     0%   100%      1.85s 49.98%  chapter8/sources/go-pprof-optimization-demo/step5.BenchmarkHiParallel
         0     0%   100%      1.85s 50.02%  main.main
         0     0%   100%      1.85s 50.02%  runtime.main
         0     0%   100%      1.85s 50.02%  testing.(*B).Run
         0     0%   100%      1.85s 49.98%  testing.(*B).RunParallel
         0     0%   100%      1.85s 50.01%  testing.(*B).doBench
         0     0%   100%      1.85s 49.98%  testing.(*B).launch
         0     0%   100%      1.85s 50.01%  testing.(*B).run
(pprof) list handleHi
Total: 3.70s
ROUTINE ======================== chapter8/sources/go-pprof-optimization-demo/step5.handleHi in chapter8/sources/go-pprof-optimization-demo/step5/demo.go
         0    18.78us (flat, cum) 0.00051% of Total
         .          .     19:		return bytes.NewBuffer(make([]byte, 128))
         .          .     20:	},
         .          .     21:}
         .          .     22:
         .          .     23:func handleHi(w http.ResponseWriter, r *http.Request) {
         .    18.78us     24:	if !rxOptionalID.MatchString(r.FormValue("color")) {
         .          .     25:		http.Error(w, "Optional color is invalid", http.StatusBadRequest)
         .          .     26:		return
         .          .     27:	}
         .          .     28:
         .          .     29:	visitNum := atomic.AddInt64(&visitors, 1)
(pprof)
handleHi does not appear in the top-10 list at all. Expanding handleHi further shows that no step in the function blocks goroutines for any significant time, so no optimization is needed on this front. This is also thanks to the Go standard library: the regexp package's Regexp.MatchString method has already been optimized for concurrent use (again with sync.Pool); we won't go into the details of that optimization here.
Summary
In this item we learned how to profile a Go program: we explained the principles and usage of pprof-based profiling for Go applications and used an example to demonstrate how to carry out performance optimization.
Key points of this item:
- use performance benchmarks to determine whether a program has a performance bottleneck; if it does, profile the program with pprof from the Go toolchain;
- performance profiling consists of two phases: data collection and data analysis;
- go tool pprof supports several collection methods, such as sampling through performance benchmarks and collecting profile data from a standalone program (see the sketch after this list);
- go tool pprof supports several sample types, such as CPU (-cpuprofile), heap allocation (-memprofile), mutex contention (-mutexprofile), and blocking time (-blockprofile);
- go tool pprof supports two main ways of analyzing profile data: an interactive command line and a graphical web UI;
- when the cause of a bottleneck is unclear, start by analyzing the CPU profile and the heap allocation profile.
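For the standalone-program collection method mentioned above, a common approach is to expose pprof's HTTP endpoints inside the service itself via net/http/pprof. The sketch below is not part of the book's demo; the blank import and the /debug/pprof endpoints are the standard net/http/pprof mechanism, while the listen address is an arbitrary choice:

// a minimal sketch: expose pprof endpoints in a long-running service
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// the service's own handlers would be registered here
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}

With the service running, a CPU profile can then be pulled and analyzed directly, for example with go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30.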