pg电子寻龙探宝

書名： Go語言精進之路：從新手到高手的編程思想、方法和技巧（2）
作者名：白明
本章字數： 5081字
更新時間： 2022-01-04 17:42:27

47.1　pprof的工作原理

使用pprof對程序進行性能剖析的工作一般分為兩個階段：數據采集和數據剖析，如圖47-1所示。

圖47-1　pprof工作原理

1. 采樣數據類型

在數據采集階段，Go運行時會定期對剖析階段所需的不同類型數據進行采樣記錄。當前主要支持的采樣數據類型有如下幾種。

（1）CPU數據（對應圖47-1中的cpu.prof）

CPU類型采樣數據是性能剖析中十分常見的采樣數據類型，它能幫助我們識別出代碼關鍵路徑上消耗CPU最多的函數。一旦啟用CPU數據采樣，Go運行時會每隔一段短暫的時間（10ms）就中斷一次（由SIGPROF信號引發）并記錄當前所有goroutine的函數棧信息（存入cpu.prof）。

（2）堆內存分配數據（對應圖47-1中的mem.prof）

堆內存分配采樣數據和CPU采樣數據一樣，也是性能剖析中十分常見的采樣數據類型，它能幫助我們了解Go程序的當前和歷史內存使用情況。堆內存分配的采樣頻率可配置，默認每1000次堆內存分配會做一次采樣（存入mem.prof）。

（3）鎖競爭數據（對應圖47-1中的mutex.prof）

鎖競爭采樣數據記錄了當前Go程序中互斥鎖爭用導致延遲的操作。如果你認為很大可能是互斥鎖爭用導致CPU利用率不高，那么你可以為go tool pprof工具提供此類采樣文件以供性能剖析階段使用。該類型采樣數據在默認情況下是不啟用的，請參見runtime.SetMutexProfileFraction或go test -bench . xxx_test.go -mutexprofile mutex.out啟用它。

（4）阻塞時間數據（對應圖47-1中的block.prof）

該類型采樣數據記錄的是goroutine在某共享資源（一般是由同步原語保護）上的阻塞時間，包括從無緩沖channel收發數據、阻塞在一個已經被其他goroutine鎖住的互斥鎖、向一個滿了的channel發送數據或從一個空的channel接收數據等。該類型采樣數據在默認情況下也是不啟用的，請參見runtime.SetBlockProfileRate或go test -bench . xxx_test.go -blockprofile block.out啟用它。

注意，采樣不是免費的，因此一次采樣盡量僅采集一種類型的數據，不要同時采樣多種類型的數據，避免相互干擾采樣結果。

2. 性能數據采集的方式

在圖47-1中我們看到，Go目前主要支持兩種性能數據采集方式：通過性能基準測試進行數據采集和獨立程序的性能數據采集。

（1）通過性能基準測試進行數據采集

為應用中的關鍵函數/方法建立起性能基準測試之后，我們便可以通過執行性能基準測試采集到整個測試執行過程中有關被測方法的各類性能數據。這種方式尤其適用于對應用中關鍵路徑上關鍵函數/方法性能的剖析。

我們僅需為go test增加一些命令行選項即可在執行性能基準測試的同時進行性能數據采集。以CPU采樣數據類型為例：

$go test -bench . xxx_test.go -cpuprofile=cpu.prof
$ls
cpu.prof xxx.test* xxx_test.go

一旦開啟性能數據采集（比如傳入-cpuprofile），go test的-c命令選項便會自動開啟，go test命令執行后會自動編譯出一個與該測試對應的可執行文件（這里是xxx.test）。該可執行文件可以在性能數據剖析過程中提供剖析所需的符號信息（如果沒有該可執行文件，go tool pprof的disasm命令將無法給出對應符號的匯編代碼）。而cpu.prof就是存儲CPU性能采樣數據的結果文件，后續將作為數據剖析過程的輸入。

對于其他類型的采樣數據，也可以采用同樣的方法開啟采集并設置輸出文件:

$go test -bench . xxx_test.go -memprofile=mem.prof
$go test -bench . xxx_test.go -blockprofile=block.prof
$go test -bench . xxx_test.go -mutexprofile=mutex.prof

（2）獨立程序的性能數據采集

可以通過標準庫runtime/pprof和runtime包提供的低級API對獨立程序進行性能數據采集。下面是一個獨立程序性能數據采集的例子：

// chapter8/sources/pprof_standalone1.go

var cpuprofile = flag.String("cpuprofile", "", "write cpu profile to `file`")
var memprofile = flag.String("memprofile", "", "write memory profile to `file`")
var mutexprofile = flag.String("mutexprofile", "", "write mutex profile to `file`")
var blockprofile = flag.String("blockprofile", "", "write block profile to `file`")

func main() {
    flag.Parse()
    if *cpuprofile != "" {
        f, err := os.Create(*cpuprofile)
        if err != nil {
            log.Fatal("could not create CPU profile: ", err)
        }
        defer f.Close() // 該例子中暫忽略錯誤處理
        if err := pprof.StartCPUProfile(f); err != nil {
            log.Fatal("could not start CPU profile: ", err)
        }
        defer pprof.StopCPUProfile()
    }

    if *memprofile != "" {
        f, err := os.Create(*memprofile)
        if err != nil {
            log.Fatal("could not create memory profile: ", err)
        }
        defer f.Close()
        if err := pprof.WriteHeapProfile(f); err != nil {
            log.Fatal("could not write memory profile: ", err)
        }
    }

    if *mutexprofile != "" {
        runtime.SetMutexProfileFraction(1)
        defer runtime.SetMutexProfileFraction(0)
        f, err := os.Create(*mutexprofile)
        if err != nil {
            log.Fatal("could not create mutex profile: ", err)
        }
        defer f.Close()

        if mp := pprof.Lookup("mutex"); mp != nil {
            mp.WriteTo(f, 0)
        }
    }

    if *blockprofile != "" {
        runtime.SetBlockProfileRate(1)
        defer runtime.SetBlockProfileRate(0)
        f, err := os.Create(*blockprofile)
        if err != nil {
            log.Fatal("could not create block profile: ", err)
        }
        defer f.Close()

        if mp := pprof.Lookup("mutex"); mp != nil {
            mp.WriteTo(f, 0)
        }
    }

    var wg sync.WaitGroup
    c := make(chan os.Signal, 1)
    signal.Notify(c, syscall.SIGINT, syscall.SIGTERM)
    wg.Add(1)
    go func() {
        for {
            select {
            case <-c:
                wg.Done()
                return
            default:
                s1 := "hello,"
                s2 := "gopher"
                s3 := "!"
                _ = s1 + s2 + s3
            }

            time.Sleep(10 * time.Millisecond)
        }
    }()
    wg.Wait()
    fmt.Println("program exit")
}

可以通過指定命令行參數的方式選擇要采集的性能數據類型：

$go run pprof_standalone1.go -help
Usage of /var/folders/cz/sbj5kg2d3m3c6j650z0qfm800000gn/T/go-build221652171/b001/exe/pprof_standalone1:
  -blockprofile file
        write block profile to file
  -cpuprofile file
        write cpu profile to file
  -memprofile file
        write memory profile to file
  -mutexprofile file
        write mutex profile to file

以CPU類型性能數據為例，執行下面的命令：

$go run pprof_standalone1.go -cpuprofile cpu.prof
^Cprogram exit


$ls -l cpu.prof
-rw-r--r--  1 tonybai  staff  734  5 19 13:02 cpu.prof

程序退出后，我們在當前目錄下看到采集后的CPU類型性能數據結果文件cpu.prof，該文件將被提供給go tool pprof工具用作后續剖析。

從上述示例我們看到，這種獨立程序的性能數據采集方式對業務代碼侵入較多，還要自己編寫一些采集邏輯：定義flag變量、創建輸出文件、關閉輸出文件等。每次采集都要停止程序才能獲取結果。（當然可以重新定義更復雜的控制采集時間窗口的邏輯，實現不停止程序也能獲取采集數據結果。）

Go在net/http/pprof包中還提供了一種更為高級的針對獨立程序的性能數據采集方式，這種方式尤其適合那些內置了HTTP服務的獨立程序。net/http/pprof包可以直接利用已有的HTTP服務對外提供用于性能數據采集的服務端點（endpoint）。例如，一個已有的提供HTTP服務的獨立程序代碼如下：

// chapter8/sources/pprof_standalone2.go

func main() {
    http.Handle("/hello", http.HandlerFunc(func(w http.ResponseWriter,
        r *http.Request) {
        fmt.Println(*r)
        w.Write([]byte("hello"))
    }))
    s := http.Server{
        Addr: "localhost:8080",
    }
    c := make(chan os.Signal, 1)
    signal.Notify(c, syscall.SIGINT, syscall.SIGTERM)
    go func() {
        <-c
        s.Shutdown(context.Background())
    }()
    log.Println(s.ListenAndServe())
}

如果要采集該HTTP服務的性能數據，我們僅需在該獨立程序的代碼中像下面這樣導入net/http/pprof包即可：

// chapter8/sources/pprof_standalone2.go

import (
    _ "net/http/pprof"
)

下面是net/http/pprof包的init函數，這就是空導入net/http/pprof的“副作用”：

//$GOROOT/src/net/http/pprof/pprof.go

func init() {
    http.HandleFunc("/debug/pprof/", Index)
    http.HandleFunc("/debug/pprof/cmdline", Cmdline)
    http.HandleFunc("/debug/pprof/profile", Profile)
    http.HandleFunc("/debug/pprof/symbol", Symbol)
    http.HandleFunc("/debug/pprof/trace", Trace)
}

我們看到該包的init函數向http包的默認請求路由器DefaultServeMux注冊了多個服務端點和對應的處理函數。而正是通過這些服務端點，我們可以在該獨立程序運行期間獲取各種類型的性能采集數據。現在打開瀏覽器，訪問http://localhost:8080/debug/pprof/，我們就可以看到如圖47-2所示的頁面。

圖47-2　net/http/pprof提供的性能采集頁面

這個頁面里列出了多種類型的性能采集數據，點擊其中任何一個即可完成該種類型性能數據的一次采集。profile是CPU類型數據的服務端點，點擊該端點后，該服務默認會發起一次持續30秒的性能采集，得到的數據文件會由瀏覽器自動下載到本地。如果想自定義采集時長，可以通過為服務端點傳遞時長參數實現，比如下面就是一個采樣60秒的請求：

http://localhost:8080/debug/pprof/profile?seconds=60

如果獨立程序的代碼中沒有使用http包的默認請求路由器DefaultServeMux，那么我們就需要重新在新的路由器上為pprof包提供的性能數據采集方法注冊服務端點，就像下面的示例一樣：

// chapter8/sources/pprof_standalone3.go
...
func main() {
    mux := http.NewServeMux()
    mux.HandleFunc("/debug/pprof/", pprof.Index)
    mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
    ...
    mux.HandleFunc("/hello", http.HandlerFunc(func(w http.ResponseWriter,
        r *http.Request) {
        fmt.Println(*r)
        w.Write([]byte("hello"))
    }))
    s := http.Server{
        Addr:    "localhost:8080",
        Handler: mux,
    }
    c := make(chan os.Signal, 1)
    signal.Notify(c, syscall.SIGINT, syscall.SIGTERM)
    go func() {
        <-c
        s.Shutdown(context.Background())
    }()
    log.Println(s.ListenAndServe())
}

如果是非HTTP服務程序，則在導入包的同時還需單獨啟動一個用于性能數據采集的goroutine，像下面這樣：

// chapter8/sources/pprof_standalone4.go

...
func main() {
    go func() {
        // 單獨啟動一個HTTP server用于性能數據采集
        fmt.Println(http.ListenAndServe("localhost:8080", nil))
    }()
    ...
}

通過上面幾個示例我們可以看出，相比第一種方式，導入net/http/pprof包進行獨立程序性能數據采集的方式侵入性更小，代碼也更為獨立，并且無須停止程序，通過預置好的各類性能數據采集服務端點即可隨時進行性能數據采集。

3. 性能數據的剖析

Go工具鏈通過pprof子命令提供了兩種性能數據剖析方法：命令行交互式和Web圖形化。命令行交互式的剖析方法更常用，也是基本的性能數據剖析方法；而基于Web圖形化的剖析方法在剖析結果展示上更為直觀。

（1）命令行交互方式

可以通過下面三種方式執行go tool pprof以進入采用命令行交互式的性能數據剖析環節：

$go tool pprof xxx.test cpu.prof // 剖析通過性能基準測試采集的數據
$go tool pprof standalone_app cpu.prof // 剖析獨立程序輸出的性能采集數據
// 通過net/http/pprof注冊的性能采集數據服務端點獲取數據并剖析
$go tool pprof http://localhost:8080/debug/pprof/profile

下面以pprof_standalone1.go這個示例的性能采集數據為例，看一下在命令行交互式的剖析環節，有哪些常用命令可用。首先生成CPU類型性能采集數據：

$go build -o pprof_standalone1 pprof_standalone1.go
$./pprof_standalone1 -cpuprofile pprof_standalone1_cpu.prof
^Cprogram exit

通過go tool pprof命令進入命令行交互模式：

$ go tool pprof pprof_standalone1 pprof_standalone1_cpu.prof
File: pprof_standalone1
Type: cpu
Time: ...
Duration: 16.14s, Total samples = 240ms ( 1.49%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof)

從pprof子命令的輸出中我們看到：程序運行16.14s，采樣總時間為240ms，占總時間的1.49%。

命令行交互方式下最常用的命令是topN（N為數字，如果不指定，默認等于10）：

(pprof) top
Showing nodes accounting for 240ms, 100% of 240ms total
Showing top 10 nodes out of 29
      flat  flat%   sum%        cum   cum%
      90ms 37.50% 37.50%       90ms 37.50%  runtime.nanotime1
      50ms 20.83% 58.33%       50ms 20.83%  runtime.pthread_cond_wait
      40ms 16.67% 75.00%       40ms 16.67%  runtime.usleep
      20ms  8.33% 83.33%       20ms  8.33%  runtime.asmcgocall
      20ms  8.33% 91.67%       20ms  8.33%  runtime.kevent
      10ms  4.17% 95.83%       10ms  4.17%  runtime.pthread_cond_signal
      10ms  4.17%   100%       10ms  4.17%  runtime.pthread_cond_timedwait_
                                                relative_np
         0     0%   100%       10ms  4.17%  main.main.func1
         0     0%   100%       30ms 12.50%  runtime.checkTimers
         0     0%   100%      130ms 54.17%  runtime.findrunnable
(pprof)

topN命令的輸出結果默認按flat(flat%)從大到小的順序輸出。

flat列的值表示函數自身代碼在數據采樣過程中的執行時長。
flat%列的值表示函數自身代碼在數據采樣過程中的執行時長占總采樣執行時長的百分比。
sum%列的值是當前行flat%值與排在該值前面所有行的flat%值的累加和。以第三行的sum%值75.00%為例，該值由前三行flat%累加而得，即16.67% + 20.83% + 37.50% = 75.00%。
cum列的值表示函數自身在數據采樣過程中出現的時長，這個時長是其自身代碼執行時長及其等待其調用的函數返回所用時長的總和。越是接近函數調用棧底層的代碼，其cum列的值越大。
cum%列的值表示該函數cum值占總采樣時長的百分比。比如：runtime.findrunnable函數的cum值為130ms，總采樣時長為240ms，則其cum%值為兩者的比值百分化后的值。

命令行交互模式也支持按cum值從大到小的順序輸出采樣結果：

(pprof) top -cum
Showing nodes accounting for 90ms, 37.50% of 240ms total
Showing top 10 nodes out of 29
      flat  flat%   sum%        cum   cum%
         0     0%     0%      140ms 58.33%  runtime.mcall
         0     0%     0%      140ms 58.33%  runtime.park_m
         0     0%     0%      140ms 58.33%  runtime.schedule
         0     0%     0%      130ms 54.17%  runtime.findrunnable
         0     0%     0%       90ms 37.50%  runtime.nanotime (inline)
      90ms 37.50% 37.50%       90ms 37.50%  runtime.nanotime1
         0     0% 37.50%       70ms 29.17%  runtime.mstart
         0     0% 37.50%       70ms 29.17%  runtime.mstart1
         0     0% 37.50%       70ms 29.17%  runtime.sysmon
         0     0% 37.50%       60ms 25.00%  runtime.semasleep
(pprof)

在命令行交互模式下，可以通過list命令列出函數對應的源碼，比如列出main.main函數的源碼：

(pprof) list main.main
Total: 240ms
ROUTINE ======================== main.main.func1 in chapter8/sources/pprof_standalone1.go
         0       10ms (flat, cum)  4.17% of Total
         .          .     86:                       s2 := "gopher"
         .          .     87:                       s3 := "!"
         .          .     88:                       _ = s1 + s2 + s3
         .          .     89:                   }
         .          .     90:
         .       10ms     91:                   time.Sleep(10 * time.Millisecond)
         .          .     92:           }
         .          .     93:    }()
         .          .     94:    wg.Wait()
         .          .     95:    fmt.Println("program exit")
         .          .     96:}
(pprof)

我們看到，在展開源碼的同時，pprof還列出了代碼中對應行的消耗時長（基于采樣數據）。可以選擇耗時較長的函數，進一步向下展開，這個過程類似一個對代碼進行向下鉆取的過程，直到找到令我們滿意的結果（某個導致性能瓶頸的函數中的某段代碼）：

(pprof) list time.Sleep
Total: 240ms
ROUTINE ======================== time.Sleep in go1.14/src/runtime/time.go
         0       10ms (flat, cum)  4.17% of Total
         .          .    192:           t = new(timer)
         .          .    193:           gp.timer = t
         .          .    194:    }
         .          .    195:    t.f = goroutineReady
         .          .    196:    t.arg = gp
         .       10ms    197:    t.nextwhen = nanotime() + ns
         .          .    198:    gopark(resetForSleep, unsafe.Pointer(t), waitReasonSleep, traceEvGoSleep, 1)
         .          .    199:}
         .          .    200:
(pprof)

在命令行交互模式下，還可以生成CPU采樣數據的函數調用圖，且可以導出為多種格式，如PDF、PNG、JPG、GIF、SVG等。不過要做到這一點，前提是本地已安裝圖片生成所依賴的插件graphviz。

如下導出一幅PNG格式的圖片：

(pprof) png
Generating report in profile001.png

png命令在當前目錄下生成了一幅名為profile001.png的圖片文件，如圖47-3所示。

圖47-3　pprof_standalone1采樣數據生成的函數調用圖

在圖47-3中，我們可以清晰地看到cum%較大的葉子節點（用黑色粗體標出，葉子節點的cum%值與flat%值相等），它們就是我們需要重點關注的優化點。

在命令行交互模式下，通過web命令還可以在輸出SVG格式圖片的同時自動打開本地瀏覽器展示該圖片。要實現這個功能也有一個前提，那就是本地SVG文件的默認打開應用為瀏覽器，否則生成的SVG文件很可能會以其他文本形式被其他應用打開。

（2）Web圖形化方式

對于喜好通過圖形化交互（GUI）方式剖析程序性能的開發者，go tool pprof提供了基于Web的圖形化呈現所采集性能數據的方式。對于已經生成好的各類性能采集數據文件，我們可以通過下面的命令行啟動一個Web服務并自動打開本地瀏覽器、進入圖形化剖析頁面（見圖47-4）：

$go tool pprof -http=:9090 pprof_standalone1_cpu.prof
Serving web UI on http://localhost:9090

圖形化剖析頁面的默認視圖（VIEW）是Graph，即函數調用圖。在圖47-4左上角的VIEW下拉菜單中，還可以看到Top、Source、Flame Graph等菜單項。

圖47-4　go tool pprof自動打開本地瀏覽器并進入圖形化剖析頁面

Top視圖等價于命令行交互模式下的topN命令輸出（見圖47-5）。

圖47-5　圖形化剖析頁面的Top視圖

Source視圖等價于命令行交互模式下的list命令輸出（見圖47-6），只是這里將所有采樣到的函數相關源碼全部在一個頁面列出了。

圖47-6　圖形化剖析頁面的Source視圖

Flame Graph視圖即火焰圖，該類型視圖由性能架構師Brendan Gregg發明，并在近幾年被廣大開發人員接受。Go 1.10版本在go工具鏈中添加了對火焰圖的支持。通過火焰圖（見圖47-7），我們可以快速、準確地識別出執行最頻繁的代碼路徑，因此它多用于對CPU類型采集數據的輔助剖析（其他類型的性能采樣數據也有對應的火焰圖，比如內存分配）。

圖47-7　圖形化剖析頁面的Flame Graph視圖

go tool pprof在瀏覽器中呈現出的火焰圖與標準火焰圖有些差異：它是倒置的，即調用棧最頂端的函數在最下方。在這樣一幅倒置火焰圖中，y軸表示函數調用棧，每一層都是一個函數。調用棧越深，火焰越高。倒置火焰圖每個函數調用棧的最下方就是正在執行的函數，上方都是它的父函數。

火焰圖的x軸表示抽樣數量，如果一個函數在x軸上占據的寬度越寬，就表示它被抽樣到的次數越多，即執行的時間越長。倒置火焰圖就是看最下面的哪個函數占據的寬度最大，這樣的函數可能存在性能問題。

當鼠標懸浮在火焰圖中的任意一個函數上時，圖上方會顯示該函數的性能采樣詳細信息。在火焰圖中任意點擊某個函數棧上的函數，火焰圖都會水平局部放大，該函數會占據所在層的全部寬度，顯示更為詳細的信息。再點擊root層或REFINE下拉菜單中的Reset可恢復火焰圖原來的樣子。

對于通過net/http/pprof暴露性能數據采樣端點的獨立程序，同樣可以采用基于Web的圖形化頁面進行性能剖析。以pprof_standalone4.go的剖析為例：

// 啟動pprof_standalone4.go
$go run pprof_standalone4.go

// 啟動Web圖形化剖析
$go tool pprof -http=:9090 http://localhost:8080/debug/pprof/profile
Fetching profile over HTTP from http://localhost:8080/debug/pprof/profile
Saved profile in /Users/tonybai/pprof/pprof.samples.cpu.001.pb.gz
Serving web UI on http://localhost:9090

執行go tool pprof時，pprof會對pprof_standalone4.go進行默認30秒的CPU類型性能數據采樣，然后將采集的數據下載到本地，存為pprof.samples.cpu.001.pb.gz，之后go tool pprof加載pprof.samples.cpu.001.pb.gz并自動啟動瀏覽器進入性能剖析默認頁面（函數調用圖），如圖47-8所示。

圖47-8　針對采用net/http/pprof的獨立程序的Web圖形化剖析頁面

剩下的操作和之前描述的完全一致，這里就不贅述了。

官术网_书友最值得收藏!

Go語言精進之路：從新手到高手的編程思想、方法和技巧（2）

47.1 pprof的工作原理

47.1　pprof的工作原理