Overview: Thanos is a tool for running Prometheus at large scale and making it highly available. It was open-sourced by the Improbable team. Project: https://github.com/thanos-io/thanos
Architecture:
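Components:
thanos-sidecar: runs next to Prometheus as a sidecar. It talks to Prometheus over HTTP and implements the Store API on top of Prometheus's remote-read API, so the query component can read monitoring data directly from it. It can also upload data to object storage.
thanos-query: fetches monitoring data from the gRPC Store API exposed by thanos-sidecar, then aggregates and deduplicates it.
thanos-store-gateway: fronts the backend object storage. When historical data in object storage needs to be queried, query connects to it.
thanos-compact: continuously merges the many small blocks in object storage into larger ones. This significantly reduces the total storage size in the bucket and improves query efficiency.
thanos-rules: evaluates alerting rules against the monitoring data and notifies Alertmanager. It can also precompute frequently needed or expensive expressions and save the results as new time series, which are served to query and written to object storage.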
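To make the thanos-rules description concrete, here is a minimal sketch of a rule file holding one recording rule and one alerting rule. The file name, group name, rule names and thresholds are all hypothetical; only the `node` job name comes from this walkthrough. The file uses the standard Prometheus rule format, which the ruler is assumed to load through its `--rule-file` flag.

```shell
# Write an illustrative rule file (all rule and group names are hypothetical).
# The heredoc delimiter is quoted so that $labels is not expanded by the shell.
cat > thanos-rules.yaml <<'EOF'
groups:
  - name: node-example
    rules:
      # Recording rule: precompute the average 1-minute load per job.
      - record: job:node_load1:avg
        expr: avg by (job) (node_load1)
      # Alerting rule: fire when a node-exporter target has been down for 5 minutes.
      - alert: NodeExporterDown
        expr: up{job="node"} == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "node-exporter target {{ $labels.instance }} is down"
EOF
```

The recording rule covers the "precompute expensive expressions" case, and the alerting rule covers the Alertmanager notification case.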
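What problems does Thanos solve?
Large-scale cluster deployment
Prometheus itself only supports single-node deployment and has no built-in cluster mode, so large-scale monitoring today relies mainly on Prometheus federation or on splitting metrics by service. The hardest part to accept is the storage of historical data: Prometheus cannot keep data locally for long, so the data has to go through the remote storage interface into a database that supports it, and every new component adds operational work.
Prometheus high availability and scaling
The official way to make Prometheus highly available is to run several Prometheus instances that scrape the same targets, with a load balancer in front as the single entry point. The problem is that the data stored in the instances diverges. In particular, when one instance goes down and another takes over, the failed instance is missing the data for its downtime, so requests forwarded by the LB can return inconsistent results.
Thanos addresses these problems. It aggregates and deduplicates data from multiple Prometheus instances, which supports horizontal scaling and improves availability. It also stores historical monitoring data in object storage, which improves data reliability and lowers the operational burden.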
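Deployment architecture
Deploy multiple Prometheus instances that scrape the same or different targets.
thanos-sidecar is deployed alongside Prometheus as a sidecar. It serves data to query and uploads the blocks persisted on local disk to S3-compatible object storage.
Query aggregates and deduplicates the data and acts as the single entry point for queries. Grafana displays dashboards through the Query API.
Historical monitoring data is queried through the Store Gateway.
The compactor compacts and downsamples the data in object storage, which makes later queries over long time ranges more efficient.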
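Deployment: This is a quick demo that shows how to deploy Thanos and what it can do. For production, deploying together with prometheus-operator is the better choice.
Prerequisites: create a StorageClass for the Prometheus instances to use.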
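The manifests in this walkthrough reference a StorageClass named managed-nfs-storage but do not show its definition. A minimal sketch could look like the following. The provisioner value is a placeholder (an assumption, not taken from the original setup) and must match whatever dynamic provisioner is actually installed in the cluster.

```shell
# Write a hypothetical StorageClass manifest to a file.
cat > storageclass.yaml <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: managed-nfs-storage    # name used by the volumeClaimTemplates in this walkthrough
provisioner: example.com/nfs   # placeholder: replace with your provisioner's name
reclaimPolicy: Delete
EOF
# Apply it once the provisioner is running:
# kubectl apply -f storageclass.yaml
```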
Deploy Prometheus
Create the ServiceAccount and bind its permissions
    cat <<EOF | kubectl apply -f -
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: prometheus
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      name: prometheus
    rules:
    - apiGroups:
      - ""
      resources:
      - nodes
      - services
      - endpoints
      - pods
      - nodes/proxy
      verbs:
      - get
      - list
      - watch
    - apiGroups:
      - "extensions"
      - "networking.k8s.io"   # ingresses live here on Kubernetes 1.22+
      resources:
      - ingresses
      verbs:
      - get
      - list
      - watch
    - apiGroups:
      - ""
      resources:
      - configmaps
      - nodes/metrics
      verbs:
      - get
    - nonResourceURLs:
      - /metrics
      verbs:
      - get
    ---
    # rbac.authorization.k8s.io/v1beta1 was removed in Kubernetes 1.22, so v1 is used here
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: prometheus
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: prometheus
    subjects:
    - kind: ServiceAccount
      name: prometheus
      namespace: default
    EOF
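Create the ConfigMap: save the following as prometheus-configmap.yaml and apply it with kubectl apply -f prometheus-configmap.yaml
Change targets to the address of the node that actually runs node-exporter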
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: prometheus-config
    data:
      prometheus.yaml.tmpl: |
        global:
          scrape_interval: 15s
          scrape_timeout: 15s
          external_labels:
            cluster: test-thanos
            replica: $(POD_NAME)
        scrape_configs:
        - job_name: "node"
          static_configs:
          - targets: ["172.16.1.6:9100"]
Create the Prometheus StatefulSet with the Thanos sidecar
    cat <<EOF | kubectl apply -f -
    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: prometheus
      labels:
        app: prometheus
    spec:
      serviceName: "prometheus"
      replicas: 2
      selector:
        matchLabels:
          app: prometheus
          thanos-store-api: "true"
      template:
        metadata:
          labels:
            app: prometheus
            thanos-store-api: "true"
        spec:
          serviceAccountName: prometheus
          volumes:
          - name: prometheus-config
            configMap:
              name: prometheus-config
          - name: prometheus-config-shared
            emptyDir: {}
          containers:
          - name: prometheus
            image: prom/prometheus:v2.14.0
            args:
            - "--config.file=/etc/prometheus-shared/prometheus.yaml"
            - "--storage.tsdb.path=/prometheus"
            - "--storage.tsdb.retention.time=6h"
            - "--storage.tsdb.no-lockfile"
            - "--storage.tsdb.min-block-duration=2h"  # cut a block and persist it to disk every 2 hours
            - "--storage.tsdb.max-block-duration=2h"
            - "--web.enable-admin-api"  # lets Thanos manage data through the Prometheus admin API
            - "--web.enable-lifecycle"  # enables hot reload via localhost:9090/-/reload
            ports:
            - name: http
              containerPort: 9090
            volumeMounts:
            - name: prometheus-config-shared
              mountPath: /etc/prometheus-shared/
            - name: data
              mountPath: "/prometheus"
          - name: thanos
            image: thanosio/thanos:v0.11.0
            args:
            - sidecar
            - --tsdb.path=/prometheus
            - --prometheus.url=http://localhost:9090
            - --reloader.config-file=/etc/prometheus/prometheus.yaml.tmpl  # path of the prometheus.yaml.tmpl template
            - --reloader.config-envsubst-file=/etc/prometheus-shared/prometheus.yaml  # config file generated from the template
            ports:
            - name: http-sidecar
              containerPort: 10902
            - name: grpc
              containerPort: 10901
            env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            volumeMounts:
            - name: prometheus-config
              mountPath: /etc/prometheus
            - name: prometheus-config-shared
              mountPath: /etc/prometheus-shared/
            - name: data
              mountPath: "/prometheus"
      volumeClaimTemplates:
      # the volumeClaimTemplate gives each Prometheus instance its own PVC
      - metadata:
          name: data
          labels:
            app: prometheus
        spec:
          storageClassName: managed-nfs-storage
          accessModes:
          - ReadWriteOnce
          resources:
            requests:
              storage: 2Gi
    EOF
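Note the following parameters:
    --storage.tsdb.min-block-duration=2h
    --storage.tsdb.max-block-duration=2h
    # With Thanos these two values must be equal. This disables Prometheus's own compaction of the data, because the Thanos compactor component compacts it instead.
    --web.enable-admin-api   # lets Thanos manage data through the Prometheus admin API
    --web.enable-lifecycle   # enables hot reload of the Prometheus configuration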
Create the Service
    cat <<EOF | kubectl apply -f -
    apiVersion: v1
    kind: Service
    metadata:
      name: thanos-store-gateway
    spec:
      type: ClusterIP
      clusterIP: None
      ports:
      - name: grpc
        port: 10901
        targetPort: grpc
      selector:
        thanos-store-api: "true"
    EOF
Deploy an exporter to collect monitoring data
Deploy node-exporter
Deploy node-exporter on the node to collect host resource information
    docker run -d --net="host" --pid="host" -v "/:/host:ro,rslave" bitnami/node-exporter --path.rootfs=/host
Deploy Query
    cat <<EOF | kubectl apply -f -
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: thanos-querier
      labels:
        app: thanos-querier
    spec:
      selector:
        matchLabels:
          app: thanos-querier
      template:
        metadata:
          labels:
            app: thanos-querier
        spec:
          containers:
          - name: thanos
            image: thanosio/thanos:v0.11.0
            args:
            - query
            - --log.level=debug
            - --query.replica-label=replica
            # Discover local store APIs using DNS SRV.
            - --store=dnssrv+thanos-store-gateway:10901
            ports:
            - name: http
              containerPort: 10902
            - name: grpc
              containerPort: 10901
            resources:
              requests:
                memory: "2Gi"
                cpu: "1"
              limits:
                memory: "2Gi"
                cpu: "1"
            livenessProbe:
              httpGet:
                path: /-/healthy
                port: http
              initialDelaySeconds: 10
            readinessProbe:
              httpGet:
                path: /-/healthy
                port: http
              initialDelaySeconds: 15
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: thanos-querier
      labels:
        app: thanos-querier
    spec:
      ports:
      - port: 9090
        protocol: TCP
        targetPort: http
        nodePort: 30001
        name: http
      selector:
        app: thanos-querier
      type: NodePort
    EOF
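Open http://ip:30001 on a node to reach the Query UI
You can see the data fetched through the Prometheus interface
You can see the monitoring metrics and the corresponding thanos-sidecar nodes
The Stores page shows the status of each component
Take node load as an example
By default two values appear, because two Prometheus instances are collecting the data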
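Ticking the deduplication option deduplicates the data. Query deduplicates based on the external_labels defined in prometheus.yaml.tmpl: series that differ only in the replica label (set with --query.replica-label=replica) are merged into one, so each point in time is shown once instead of once per Prometheus instance.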
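The same comparison can be made without the UI. The sketch below assumes the Query HTTP API accepts a `dedup` request parameter that mirrors the checkbox, and it reuses the `http://ip:30001` NodePort address from above. The `query_url` helper is a hypothetical name introduced only for this example.

```shell
# Base URL of the Query NodePort service; replace "ip" with a real node address.
QUERY_URL="${QUERY_URL:-http://ip:30001}"

# Print the instant-query URL for node_load1 with deduplication on ("true") or off ("false").
query_url() {
  printf '%s/api/v1/query?query=node_load1&dedup=%s\n' "$QUERY_URL" "$1"
}

query_url false   # expect one series per Prometheus replica
query_url true    # expect replicas merged into a single series
# Against a running Query instance:
# curl -s "$(query_url true)"
```

With two Prometheus replicas, the response for dedup=false should still carry the replica label on each series, while the response for dedup=true should not.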
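Deploy object storage
Thanos currently supports many object storage backends, including Alibaba Cloud OSS, Tencent COS, OpenStack Swift, and S3-compatible stores such as MinIO. The figure below shows the officially supported object stores.
Here we use MinIO as the example
Create the PVC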
    cat <<EOF | kubectl apply -f -
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      # This name uniquely identifies the PVC. This is used in deployment.
      name: minio-pv-claim
    spec:
      # Read more about access modes here: http://kubernetes.io/docs/user-guide/persistent-volumes/#access-modes
      storageClassName: managed-nfs-storage
      accessModes:
        # The volume is mounted as read-write by a single node
        - ReadWriteOnce
      resources:
        # This is the request for storage. Should be available in the cluster.
        requests:
          storage: 10Gi
    EOF
Create MinIO
    kubectl create -f https://raw.githubusercontent.com/minio/minio/master/docs/orchestration/kubernetes/minio-standalone-deployment.yaml
Create a Service that exposes MinIO externally
    cat <<EOF | kubectl apply -f -
    apiVersion: v1
    kind: Service
    metadata:
      name: minio-service
    spec:
      type: NodePort
      ports:
      - port: 9000
        targetPort: 9000
        protocol: TCP
        nodePort: 30002
      selector:
        app: minio
    EOF
Access MinIO
http://ip:30002
Log in to MinIO with minio/minio123
Create a bucket named prometheus
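The bucket can also be created from the command line instead of the web UI. This is a sketch that assumes the MinIO client `mc` is installed and recent enough to provide `mc alias set`. The alias name `thanos-minio` and the function name are hypothetical; the port, credentials and bucket name are the ones used in this walkthrough.

```shell
# Create the "prometheus" bucket on the MinIO NodePort service.
# $1 is the address of a cluster node (the "ip" in http://ip:30002).
create_prometheus_bucket() {
  mc alias set thanos-minio "http://$1:30002" minio minio123 &&
    mc mb --ignore-existing thanos-minio/prometheus
}

# Example call (needs mc and a reachable MinIO instance):
# create_prometheus_bucket 172.16.1.6
```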
Store the persisted monitoring data in object storage
Create the Secret
    cat <<EOF | kubectl apply -f -
    apiVersion: v1
    kind: Secret
    metadata:
      name: minio-secret
    type: Opaque
    stringData:
      thanos-secret.yaml: |-
        type: S3
        config:
          bucket: "prometheus"
          endpoint: "172.16.1.6:30002"
          access_key: "minio"
          insecure: true
          secret_key: "minio123"
    EOF
Reference this Secret in the Prometheus StatefulSet YAML by adding the following parts
        spec:
          serviceAccountName: prometheus
          volumes:
          - name: minio-secret
            secret:
              secretName: minio-secret
          containers:
          - name: thanos
            image: thanosio/thanos:v0.11.0
            args:
            - sidecar
            - --tsdb.path=/prometheus
            - --prometheus.url=http://localhost:9090
            - --reloader.config-file=/etc/prometheus/prometheus.yaml.tmpl  # path of the prometheus.yaml.tmpl template
            - --reloader.config-envsubst-file=/etc/prometheus-shared/prometheus.yaml  # config file generated from the template
            - --objstore.config-file=/etc/secret/thanos-secret.yaml
            volumeMounts:
            - name: minio-secret
              mountPath: "/etc/secret/"
After a short wait, the Prometheus monitoring data uploaded by thanos-sidecar appears in MinIO
Deploy the Store Gateway
Deploy the Store Gateway so that historical data in object storage can be queried
    cat <<EOF | kubectl apply -f -
    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      labels:
        app.kubernetes.io/name: thanos-store
      name: thanos-store
    spec:
      replicas: 1
      selector:
        matchLabels:
          app.kubernetes.io/name: thanos-store
      serviceName: thanos-store-gateway
      template:
        metadata:
          labels:
            thanos-store-api: "true"
            app.kubernetes.io/name: thanos-store
        spec:
          containers:
          - args:
            - store
            - --data-dir=/var/thanos/store
            - --grpc-address=0.0.0.0:10901
            - --http-address=0.0.0.0:10902
            - --objstore.config-file=/etc/secret/thanos-secret.yaml
            - --experimental.enable-index-header
            image: quay.io/thanos/thanos:v0.11.0
            livenessProbe:
              failureThreshold: 8
              httpGet:
                path: /-/healthy
                port: 10902
                scheme: HTTP
              periodSeconds: 30
            name: thanos-store
            ports:
            - containerPort: 10901
              name: grpc
            - containerPort: 10902
              name: http
            readinessProbe:
              failureThreshold: 20
              httpGet:
                path: /-/ready
                port: 10902
                scheme: HTTP
              periodSeconds: 5
            volumeMounts:
            - mountPath: /var/thanos/store
              name: data
              readOnly: false
            - mountPath: /etc/secret/
              name: minio-secret
          volumes:
          - name: minio-secret
            secret:
              secretName: minio-secret
      volumeClaimTemplates:
      - metadata:
          labels:
            app.kubernetes.io/name: thanos-store
          name: data
        spec:
          storageClassName: managed-nfs-storage
          accessModes:
          - ReadWriteOnce
          resources:
            requests:
              storage: 10Gi
    EOF
Note the label thanos-store-api: "true". It is what ties the pod to the thanos-store-gateway headless Service, and only then can the Querier attached to that Service discover it.
Deploy Compact
compact needs local disk space to hold intermediate data during processing. About 100GB is recommended to keep it working properly.
    cat <<EOF | kubectl apply -f -
    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      labels:
        app.kubernetes.io/component: database-compactor
        app.kubernetes.io/instance: thanos-compact
        app.kubernetes.io/name: thanos-compact
        app.kubernetes.io/version: v0.11.0
      name: thanos-compact
    spec:
      replicas: 1
      selector:
        matchLabels:
          app.kubernetes.io/component: database-compactor
          app.kubernetes.io/instance: thanos-compact
          app.kubernetes.io/name: thanos-compact
      serviceName: thanos-compact
      template:
        metadata:
          labels:
            app.kubernetes.io/component: database-compactor
            app.kubernetes.io/instance: thanos-compact
            app.kubernetes.io/name: thanos-compact
            app.kubernetes.io/version: v0.11.0
        spec:
          containers:
          - args:
            - compact
            - --wait
            - --objstore.config-file=/etc/secret/thanos-secret.yaml
            - --data-dir=/var/thanos/compact
            - --debug.accept-malformed-index
            image: quay.io/thanos/thanos:v0.11.0
            livenessProbe:
              failureThreshold: 4
              httpGet:
                path: /-/healthy
                port: 10902
                scheme: HTTP
              periodSeconds: 30
            name: thanos-compact
            ports:
            - containerPort: 10902
              name: http
            readinessProbe:
              failureThreshold: 20
              httpGet:
                path: /-/ready
                port: 10902
                scheme: HTTP
              periodSeconds: 5
            terminationMessagePolicy: FallbackToLogsOnError
            volumeMounts:
            - mountPath: /var/thanos/compact
              name: data
              readOnly: false
            - mountPath: /etc/secret/
              name: minio-secret
          volumes:
          - name: minio-secret
            secret:
              secretName: minio-secret
      volumeClaimTemplates:
      - metadata:
          labels:
            app.kubernetes.io/component: database-compactor
            app.kubernetes.io/instance: thanos-compact
            app.kubernetes.io/name: thanos-compact
          name: data
        spec:
          storageClassName: managed-nfs-storage
          accessModes:
          - ReadWriteOnce
          resources:
            requests:
              storage: 10Gi
    EOF
The 10Gi claim in the manifest above is sized for this demo only. The data on the compactor's disk can be safely deleted between restarts.
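About Receiver
Receiver targets large-scale setups, where having Query fan out to every thanos-sidecar consumes a lot of resources. Instead, each Prometheus pushes its data to thanos-receiver through the remote-write interface, and Query reads the data from thanos-receiver. Design proposal: https://thanos.io/proposals/201812_thanos-remote-receive.md/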
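On the Prometheus side, the push path described above comes down to a remote_write entry. The sketch below writes such a snippet to a file. The service name thanos-receive is hypothetical (no Receiver is deployed in this demo), and port 19291 with path /api/v1/receive is assumed to be the Receiver's default remote-write endpoint, so check both against your Receiver's flags.

```shell
# Write an illustrative remote_write snippet for prometheus.yaml.tmpl.
cat > remote-write-snippet.yaml <<'EOF'
remote_write:
  # "thanos-receive" is a hypothetical Service name for the Receiver;
  # 19291 and /api/v1/receive are assumed to be its default remote-write address.
  - url: "http://thanos-receive:19291/api/v1/receive"
EOF
```

The snippet would sit at the top level of the Prometheus configuration, next to global and scrape_configs.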
Grafana integration
With Thanos in place, Grafana only needs to be pointed at Query.
Start Grafana
    docker run -d --name=grafana -p 3000:3000 grafana/grafana
Add a data source and enter the address of Query
Add a dashboard. Because we only have node-exporter data, use a host-node dashboard directly: import the dashboard with ID 8919
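The data source step can also be done declaratively with Grafana's provisioning files instead of the UI. A minimal sketch follows; the data source name Thanos and the file name are arbitrary choices, and the URL is the Query NodePort address used earlier. Query is registered as a prometheus-type data source because it serves a Prometheus-compatible query API.

```shell
# Write a Grafana data source provisioning file that points at Thanos Query.
cat > thanos-datasource.yaml <<'EOF'
apiVersion: 1
datasources:
  - name: Thanos            # arbitrary display name
    type: prometheus        # Query is added as a Prometheus-type data source
    access: proxy
    url: http://ip:30001    # replace "ip" with a node address
    isDefault: true
EOF
# Mount it into Grafana's default provisioning directory when starting the container:
# docker run -d --name=grafana -p 3000:3000 \
#   -v "$PWD/thanos-datasource.yaml:/etc/grafana/provisioning/datasources/thanos.yaml:ro" \
#   grafana/grafana
```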
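Summary
As an add-on to Prometheus, Thanos solves several problems that come up when running Prometheus today: high availability, historical data storage, and large-scale clusters. The community is also developing it actively.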
Reference links:
https://github.com/thanos-io/thanos/blob/master/docs/quick-tutorial.md
https://engineering.hellofresh.com/monitoring-at-hellofresh-part-1-architecture-677b4bd6b728
https://github.com/thanos-io/kube-thanos/tree/master/manifests
https://github.com/thanos-io/thanos/blob/master/docs/components/sidecar.md
https://www.qikqiak.com/k8strain/monitor/thanos/#ruler
https://github.com/thanos-io/kube-thanos/tree/release-0.11/examples/all/manifests