eBPF Study Notes 2: Tool Usage (bpftrace)

Overview

bpftrace is a dynamic tracing tool built on eBPF. Programs are written in a DSL (Domain Specific Language) that resembles awk, compiled to eBPF bytecode with LLVM, and loaded via BCC, which handles the interaction with the Linux BPF subsystem. A finished DSL script can simply be executed; there is no need to compile and load anything into the kernel by hand. For dynamic tracing inside the kernel, bpftrace mainly relies on kprobes and tracepoints. With bpftrace you can dig much deeper into problems on a running system, such as counting calls to a particular function and measuring its latency, tracing system OOM kills, or tracing dropped TCP connections, all with custom scripts.
There is also a related project called BCC. The difference from bpftrace is that BCC lets you develop eBPF programs in high-level languages such as Java, Python, Lua, and so on.

https://github.com/iovisor/bpftrace
https://github.com/iovisor/bcc
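
To get a feel for the awk-like DSL, here is a minimal sketch (not from the original post; the file name count_syscalls.bt is just an example) that counts system calls per process until Ctrl-C:

// count_syscalls.bt (hypothetical example): BEGIN block, probe block, map aggregation
BEGIN { printf("Tracing syscalls... Hit Ctrl-C to end.\n"); }
tracepoint:raw_syscalls:sys_enter { @calls[comm] = count(); }
END { print(@calls); }

Run it with: sudo bpftrace count_syscalls.bt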

Installation and Basic Usage

System environment:
OS: Ubuntu 20.04
Kernel: 5.4.0-125-generic

Follow the official installation guide:
https://github.com/iovisor/bpftrace/blob/master/INSTALL.md
It covers the package-based install for each distribution as well as a Docker-based option.
My system is Ubuntu 20.04, so I use the Ubuntu install method. Because bpftrace depends on eBPF features, the available functionality varies with the kernel version; for example, kprobe support arrived in 4.1 and tracepoint support in 4.7. The project ships a feature-check script that verifies the current environment: https://github.com/iovisor/bpftrace/blob/master/scripts/check_kernel_features.sh
Running it:

./check_kernel_features.sh

All required features present!

Install bpftrace:

sudo apt-get install -y bpftrace bpfcc-tools

After installation, check the version:

bpftrace --version

bpftrace v0.9.4

List the kprobes supported by the current kernel:

bpftrace -l 'kprobe:tcp*'

kprobe:tcp_mmap
kprobe:tcp_get_info_chrono_stats
kprobe:tcp_init_sock
kprobe:tcp_splice_data_recv
kprobe:tcp_push
kprobe:tcp_send_mss
kprobe:tcp_cleanup_rbuf
kprobe:tcp_set_rcvlowat
kprobe:tcp_recv_timestamp
kprobe:tcp_enter_memory_pressure
kprobe:tcp_leave_memory_pressure
kprobe:tcp_ioctl
kprobe:tcp_get_info
kprobe:tcp_get_md5sig_pool
kprobe:tcp_set_state
kprobe:tcp_shutdown
...
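
Once an interesting kprobe is found, you can attach to it directly. As a hedged sketch (not from the original post), the one-liner below counts tcp_connect() calls per process, and the script after it measures tcp_sendmsg() latency with a matching kretprobe and a histogram map:

bpftrace -e 'kprobe:tcp_connect { @connects[comm] = count(); }'

// latency sketch: record entry time per thread, histogram the delta on return
kprobe:tcp_sendmsg { @start[tid] = nsecs; }
kretprobe:tcp_sendmsg /@start[tid]/ {
  @send_ns = hist(nsecs - @start[tid]);
  delete(@start[tid]);
}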

Kernel Static Probes: Tracepoints

bpftrace -l 'tracepoint:*'

tracepoint:nfsd:nfsd_compound
tracepoint:nfsd:nfsd_compound_status
tracepoint:nfsd:nfsd_read_start
tracepoint:nfsd:nfsd_read_splice
tracepoint:nfsd:nfsd_read_vector
tracepoint:nfsd:nfsd_read_io_done
tracepoint:nfsd:nfsd_read_done
tracepoint:nfsd:nfsd_write_start
tracepoint:nfsd:nfsd_write_opened
tracepoint:nfsd:nfsd_write_io_done
tracepoint:nfsd:nfsd_write_done
tracepoint:nfsd:nfsd_read_err
tracepoint:nfsd:nfsd_write_err
tracepoint:nfsd:nfsd_layoutstate_alloc
tracepoint:nfsd:nfsd_layoutstate_unhash
...

For example, list the files being opened by every process:

bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%s %s\n", comm, str(args->filename)); }'

kubelet /sys/fs/cgroup/memory/kubepods.slice/memory.numa_stat
kubelet /sys/fs/cgroup/cpu,cpuacct/kubepods.slice/cpu.stat
kubelet /sys/fs/cgroup/cpu,cpuacct/kubepods.slice/cpuacct.stat
kubelet /sys/fs/cgroup/cpu,cpuacct/kubepods.slice/cpuacct.usage
kubelet /sys/fs/cgroup/cpu,cpuacct/kubepods.slice/cpuacct.usage_percpu
kubelet /sys/fs/cgroup/cpu,cpuacct/kubepods.slice/cpuacct.usage_all
kubelet /sys/fs/cgroup/pids/kubepods.slice/pids.current
kubelet /sys/fs/cgroup/pids/kubepods.slice/pids.max
kubelet /sys/fs/cgroup/blkio/kubepods.slice/blkio.bfq.sectors_recursive
kubelet /sys/fs/cgroup/blkio/kubepods.slice/blkio.bfq.io_serviced_recur
kubelet /sys/fs/cgroup/blkio/kubepods.slice/blkio.sectors_recursive
kubelet /sys/fs/cgroup/blkio/kubepods.slice/blkio.throttle.io_serviced_
...
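
To see which fields a tracepoint exposes (the args->filename member used above, for instance), bpftrace can print the tracepoint's argument list with the verbose listing flag; a small sketch:

bpftrace -lv tracepoint:syscalls:sys_enter_openat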

Press Ctrl-C to stop.
More complex logic can also be written as a script. After installation, a number of ready-made scripts are already available under /usr/sbin/:

ls /usr/sbin/|grep ".*.bt"

bashreadline.bt
biolatency.bt
biosnoop.bt
biostacks.bt
bitesize.bt
capable.bt
cpuwalk.bt
dcsnoop.bt
ebtables
ebtables-nft
ebtables-nft-restore
ebtables-nft-save
ebtables-restore
ebtables-save
execsnoop.bt
gethostlatency.bt
killsnoop.bt
loads.bt
mdflush.bt
naptime.bt
oomkill.bt
opensnoop.bt
pidpersec.bt
runqlat.bt
runqlen.bt
setuids.bt
statsnoop.bt
swapin.bt
syncsnoop.bt
syscount.bt
tcpaccept.bt
tcpconnect.bt
tcpdrop.bt
tcplife.bt
tcpretrans.bt
tcpsynbl.bt
threadsnoop.bt
vfscount.bt
vfsstat.bt
writeback.bt
xfsdist.bt

There are also many more scripts and test cases at https://github.com/iovisor/bpftrace/tree/master/tools.

For example, running tcpconnect.bt shows every TCP connection made from this host:

tcpconnect.bt

Attaching 2 probes...
Tracing tcp connections. Hit Ctrl-C to end.
TIME PID COMM SADDR SPORT DADDR DPORT
23:02:09 1686607 coredns 127.0.0.1 42216 127.0.0.1 13429
23:02:10 1686607 coredns 127.0.0.1 42218 127.0.0.1 13429
23:02:11 1680193 kubelet 10.0.1.11 34732 10.0.1.15 13429
23:02:11 1680193 kubelet 10.0.1.11 41570 10.0.1.244 13429
23:02:11 1686607 coredns 127.0.0.1 42224 127.0.0.1 13429
23:02:11 1680193 kubelet 127.0.0.1 55010 127.0.0.1 13429
23:02:11 1680193 kubelet 10.0.1.11 33098 10.0.1.145 13429
23:02:12 1686607 coredns 127.0.0.1 42230 127.0.0.1 13429
23:02:13 1680193 kubelet 127.0.0.1 34214 127.0.0.1 13429
23:02:13 1686607 coredns 127.0.0.1 42234 127.0.0.1 13429

Trace open() calls system-wide:

opensnoop.bt

Attaching 6 probes...
Tracing open syscalls... Hit Ctrl-C to end.
PID COMM FD ERR PATH
1686817 AsyncBlockInput 2 0 /var/lib/clickhouse_storage/store/087/0872950b-6ca9-420e-8872-9
1686817 AsyncBlockInput 2 0 /var/lib/clickhouse_storage/store/087/0872950b-6ca9-420e-8872-9
1686817 AsyncBlockInput 2 0 /var/lib/clickhouse_storage/store/435/435debae-6288-4661-835d-e
1686817 AsyncBlockInput 2 0 /var/lib/clickhouse_storage/store/435/435debae-6288-4661-835d-e
1686817 AsyncBlockInput 2 0 /var/lib/clickhouse_storage/store/092/09246824-dd69-4f4c-8924-6
1686817 AsyncBlockInput 2 0 /var/lib/clickhouse_storage/store/092/09246824-dd69-4f4c-8924-6
1680193 kubelet 2 0 /sys/fs/cgroup/memory/kubepods.slice/kubepods-besteffort.slice/
1680193 kubelet 2 0 /sys/fs/cgroup/memory/kubepods.slice/kubepods-besteffort.slice/
1680193 kubelet 2 0 /sys/fs/cgroup/memory/kubepods.slice/kubepods-besteffort.slice/
1680193 kubelet 2 0 /sys/fs/cgroup/memory/kubepods.slice/kubepods-besteffort.slice/
1680193 kubelet 2 0 /sys/fs/cgroup/memory/kubepods.slice/kubepods-besteffort.slice/
1680193 kubelet 2 0 /sys/fs/cgroup/memory/kubepods.slice/kubepods-besteffort.slice/
...

Show the processes generating the most block I/O and the amount of data written:

biotop-bpfcc

20:53:20 loadavg: 0.96 1.40 1.26 2/2392 3295655

PID COMM D MAJ MIN DISK I/O Kbytes AVGms
337 jbd2/vda1-8 W 252 0 vda 2 412.0 5.16
1680139 etcd W 252 0 vda 14 60.0 2.61
3295622 rancher-system- R 252 0 vda 1 56.0 2.07

Show the command line and arguments of each newly executed process:

execsnoop-bpfcc -T

TIME PCOMM PID PPID RET ARGS
21:22:05 rancher-system- 3310507 1 0 /usr/local/bin/rancher-system-agent sentinel
21:22:06 cilium-cni 3310516 1680193 0 /opt/cni/bin/cilium-cni
21:22:06 iptables 3310525 1680164 0 /usr/sbin/iptables -w 5 -W 100000 -S KUBE-PROXY-CANARY -t mangle
21:22:06 ip6tables 3310524 1680164 0 /usr/sbin/ip6tables -w 5 -W 100000 -S KUBE-PROXY-CANARY -t mangle
21:22:06 nsenter 3310526 1680193 0 /usr/bin/nsenter --net=/proc/1688099/ns/net -F -- ip -o -4 addr show dev eth0 scope global
21:22:06 ip 3310526 1680193 0 /usr/sbin/ip -o -4 addr show dev eth0 scope global
21:22:06 nsenter 3310527 1680193 0 /usr/bin/nsenter --net=/proc/1688099/ns/net -F -- ip -o -6 addr show dev eth0 scope global
21:22:06 ip 3310527 1680193 0 /usr/sbin/ip -o -6 addr show dev eth0 scope global
21:22:06 runc 3310528 1679752 0 /usr/bin/runc --version
21:22:06 docker-init 3310534 1679752 0 /usr/bin/docker-init --version

What some common scripts do

  • killsnoop.bt - trace signals sent via the kill() syscall
  • tcpconnect.bt - trace all TCP connections
  • pidpersec.bt - count new processes created per second (via fork)
  • opensnoop.bt - trace open() syscalls
  • vfsstat.bt - count common VFS calls, reported per second
  • bashreadline.bt - print bash commands typed in any running shell
  • tcplife.bt - trace TCP connection lifespans
  • biotop-bpfcc - show block I/O per process

How bpftrace Works


User space
1. The user writes an eBPF program, in eBPF assembly or in eBPF's restricted C dialect.
2. LLVM/Clang compiles the program into eBPF bytecode.
3. The bpf() system call loads the bytecode into the kernel.

Kernel space
1. When the bytecode is loaded via bpf(), the kernel first runs it through the verifier to check that it is safe.
2. A JIT (Just-In-Time) compiler translates the eBPF bytecode into native machine code.
3. Depending on the program type, the native code is attached to a particular execution path in the kernel (a tracing program, for example, is attached to a kprobe); whenever the kernel reaches that path, the attached eBPF code runs.
4. Results are exchanged with user-space programs through BPF maps.
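
A quick way to watch this flow from user space (an illustration, not from the original post): trace the bpf() syscalls that bpftrace itself issues while compiling and loading a trivial one-liner.

# Watch bpftrace issue bpf() syscalls (map creation, program load) as it starts up
sudo strace -f -e trace=bpf bpftrace -e 'BEGIN { printf("hi\n"); exit(); }'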

Summary
bpftrace and BCC give a very concrete feel for what eBPF offers: the kernel's functionality can be extended without modifying its source or rebuilding it. Besides tracing tools like bpftrace, there are other eBPF-based open-source products such as Falco for pod security threat detection and the eBPF-based load balancer Katran. The ebpf_exporter component can also run custom eBPF scripts and export the results to Prometheus for monitoring. The eBPF ecosystem keeps getting richer.

Using perf

perf (Performance Events) has been part of the Linux kernel since the 2.6 series. It mainly uses the CPU's PMU (Performance Monitoring Unit) together with kernel tracepoints for sampling and performance analysis. Strictly speaking, perf has little to do with eBPF; it is covered here because it can also trace applications dynamically and also builds on tracepoints. The difference is that perf's capabilities are fixed, while bpftrace, being based on eBPF, lets you hook in flexible custom scripts.

Installation
The OS used here is Ubuntu 20.04 with kernel 5.4.0-125-generic.

sudo apt-get install linux-tools-common linux-tools-$(uname -r) linux-cloud-tools-$(uname -r) linux-tools-generic linux-cloud-tools-generic


Check the version:

perf -v

perf version 5.4.195

Sampling Events

perf list

List of pre-defined events (to be used in -e):

alignment-faults [Software event]
bpf-output [Software event]
context-switches OR cs [Software event]
cpu-clock [Software event]
cpu-migrations OR migrations [Software event]
dummy [Software event]
emulation-faults [Software event]
major-faults [Software event]
minor-faults [Software event]
page-faults OR faults [Software event]
task-clock [Software event]
duration_time [Tool event]
msr/tsc/ [Kernel PMU event]
rNNN [Raw hardware event descriptor]
cpu/t1=v1[,t2=v2,t3 ...]/modifier [Raw hardware event descriptor]
(see 'man perf-list' on how to encode it)
mem:<addr>[/len][:access] [Hardware breakpoint]
alarmtimer:alarmtimer_cancel [Tracepoint event]
alarmtimer:alarmtimer_fired [Tracepoint event]
...

Events fall into three main categories:
Hardware events: CPU events read from the PMU, such as cpu-cycles and cache hits/misses.
Software events: kernel software counters such as context switches and page faults.
Tracepoint events: kernel tracepoints covering things like block I/O and filesystem activity.
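
As a quick illustration (not from the original post), perf stat -e can count one event from each category at the same time; the event names below are examples and their availability depends on the CPU and kernel:

# Count a hardware, a software, and a tracepoint event system-wide for 5 seconds
sudo perf stat -e cycles,context-switches,block:block_rq_issue -a -- sleep 5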

perf top shows resource usage per process and function; -g records call graphs, and pressing e in the TUI expands or collapses the selected entry's call chain.

perf top -g

Samples: 284K of event 'cpu-clock:pppH', 4000 Hz, Event count (approx.): 33836570425 lost: 0/0 drop: 0/0
Children Self Shared Object Symbol
- 20.27% 0.09% perf [.] __ordered_events__flush.part.0
- 2.20% __ordered_events__flush.part.0
- 2.56% deliver_event
- 3.39% hist_entry_iter__add
- 3.79% iter_add_next_cumulative_entry
- 3.03% __hists__add_entry.constprop.0
3.79% hists__findnew_entry
- 1.54% callchain_append
- 2.64% append_chain_children
- 2.22% append_chain_children
- 1.73% append_chain_children
- 1.34% append_chain_children
1.07% append_chain_children
+ 20.13% 0.18% perf [.] deliver_event
+ 18.62% 0.04% perf [.] hist_entry_iter__add
+ 14.47% 0.80% perf [.] iter_add_next_cumulative_entry
+ 12.05% 0.96% [kernel] [k] do_syscall_64
+ 8.99% 0.00% perf [.] process_thread
+ 8.93% 0.22% [kernel] [k] do_idle
+ 8.83% 1.06% [kernel] [k] __softirqentry_text_start
+ 8.24% 6.28% perf [.] append_chain_children
+ 7.11% 0.06% perf [.] callchain_append
+ 5.96% 4.75% libc-2.31.so [.] pthread_attr_setschedparam
+ 5.74% 0.25% perf [.] __hists__add_entry.constprop.0


[.] means the symbol runs in user space.
[k] means the symbol runs in kernel space.

perf stat summarizes how a program (here, an existing PID) is behaving:

perf stat -p 1679752    # press Ctrl-C to print the results

Performance counter stats for process id '1679752':

682.90 msec task-clock # 0.066 CPUs utilized
3154 context-switches # 0.005 M/sec
36 cpu-migrations # 0.053 K/sec
3275 page-faults # 0.005 M/sec
<not supported> cycles
<not supported> instructions
<not supported> branches
<not supported> branch-misses

task-clock: CPU time consumed (and utilization)
context-switches: number of context switches
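
perf stat can also wrap a one-off command instead of attaching to a PID; for example (the dd invocation is just a stand-in workload):

perf stat -- dd if=/dev/zero of=/dev/null bs=1M count=1000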

perf can also sample tracepoint events. The view below shows samples of the block:block_rq_issue tracepoint (as produced by, for example, perf top -e block:block_rq_issue):

Samples: 1K of event 'block:block_rq_issue', 1 Hz, Event count (approx.): 136 lost: 0/0 drop: 0/0
Children Self Trace output
+ 16.91% 16.91% 252,0 FF 0 () 0 + 0 [kworker/0:1H]
+ 15.44% 15.44% 252,0 FF 0 () 0 + 0 [kworker/7:1H]
+ 9.56% 9.56% 252,0 FF 0 () 0 + 0 [kworker/3:1H]
+ 8.09% 8.09% 252,0 FF 0 () 0 + 0 [kworker/4:1H]
+ 6.62% 6.62% 252,0 WS 4096 () 18340064 + 8 [etcd]
+ 6.62% 6.62% 252,0 WS 4096 () 18340072 + 8 [etcd]
+ 5.15% 5.15% 252,0 FF 0 () 0 + 0 [kworker/2:1H]
+ 4.41% 4.41% 252,0 FF 0 () 0 + 0 [kworker/6:1H]
+ 2.94% 2.94% 252,0 WS 4096 () 122005280 + 8 [etcd]
2.21% 2.21% 252,0 WS 4096 () 115952144 + 8 [etcd]
2.21% 2.21% 252,0 WS 4096 () 122005272 + 8 [etcd]
+ 1.47% 1.47% 252,0 FF 0 () 0 + 0 [kworker/1:1H]
+ 1.47% 1.47% 252,0 WS 4096 () 116164552 + 8 [etcd]
1.47% 1.47% 252,0 WS 4096 () 116173824 + 8 [etcd]
+ 1.47% 1.47% 252,0 WS 4096 () 122005256 + 8 [etcd]
1.47% 1.47% 252,0 WS 4096 () 122005288 + 8 [etcd]
1.47% 1.47% 252,0 WS 4096 () 122005296 + 8 [etcd]
0.74% 0.74% 252,0 FF 0 () 0 + 0 [kworker/5:1H]
0.74% 0.74% 252,0 WS 516096 () 2388520 + 1008 [jbd2/vda1-8]
0.74% 0.74% 252,0 WS 372736 () 2389528 + 728 [jbd2/vda1-8]
0.74% 0.74% 252,0 WS 4096 () 115700160 + 8 [etcd]
0.74% 0.74% 252,0 WS 4096 () 115948608 + 8 [etcd]

To profile CPU usage, sample all CPUs at 99 Hz for 60 seconds with call graphs (-g). When sampling finishes, a perf.data file is written to the current directory; if you run it again, the previous file is renamed to perf.data.old. Add -p to restrict sampling to a given PID.

perf record -F 99 -a -g -- sleep 60

View the report:

perf report

Generating a Flame Graph

Download the flame graph tooling:

git clone https://github.com/brendangregg/FlameGraph.git

Convert perf.data to text:

perf script -i perf.data &> perf.unfold

Fold the stacks (symbol processing):

FlameGraph/stackcollapse-perf.pl perf.unfold &> perf.folded

Generate the flame graph:

FlameGraph/flamegraph.pl perf.folded > perf.svg
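
The three processing steps can also be chained into a single pipeline (a small convenience sketch, equivalent to the commands above):

perf script -i perf.data | FlameGraph/stackcollapse-perf.pl | FlameGraph/flamegraph.pl > perf.svg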

Open the resulting perf.svg in a browser such as Chrome.

For how to read and analyze flame graphs, see:
https://www.infoq.cn/article/a8kmnxdhbwmzxzsytlga

References
https://github.com/iovisor/bpftrace/blob/master/docs/reference_guide.md
http://blog.nsfocus.net/bpftrace-dynamic-tracing-0828/
https://www.cnblogs.com/arnoldlu/p/6241297.html
https://access.redhat.com/documentation/zh-cn/red_hat_enterprise_linux/8/html/monitoring_and_managing_system_status_and_performance/counting-events-during-process-execution-with-perf-stat_monitoring-and-managing-system-status-and-performance