Prometheus-based Alert Monitoring

Prometheus system architecture diagram

Manual walkthrough:

1. Deploy a simple webhook service


apiVersion: apps/v1
kind: Deployment
metadata:
  name:  alert-webhook
  namespace: default
  labels:
    app:  alert-webhook
spec:
  selector:
    matchLabels:
      app: alert-webhook
  replicas: 1
  template:
    metadata:
      labels:
        app:  alert-webhook
    spec:
      containers:
      - name:  alert-webhook
        image:  172.32.1.175:10043/demo/alert-webhook:1.0
        ports:
        - containerPort:  8080
          name:  alert-webhook
      imagePullSecrets:
      - name: secret-wangsu

---
apiVersion: v1
kind: Service
metadata:
  name: alert-webhook
  namespace: default
spec:
  selector:
    app: alert-webhook
  type: NodePort
  ports:
  - name: alert-webhook
    protocol: TCP
    port: 8080
    targetPort: 8080

  2. Edit the alertmanager.yaml configuration held in the Alertmanager secret in the monitoring namespace (list the secrets with kubectl -n monitoring get secrets), as shown in the commands below.
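
A sketch of how the configuration can be pulled out of the secret, edited, and written back; the secret name alertmanager-zstack-prometheus below is an assumption, use whatever name the get secrets command actually returns:

# decode the current configuration from the secret
kubectl -n monitoring get secret alertmanager-zstack-prometheus \
  -o jsonpath='{.data.alertmanager\.yaml}' | base64 -d > alertmanager.yaml

# edit alertmanager.yaml locally, then write it back into the secret
kubectl -n monitoring create secret generic alertmanager-zstack-prometheus \
  --from-file=alertmanager.yaml --dry-run=client -o yaml | kubectl -n monitoring replace -f -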

Reference template:
global:
  resolve_timeout: '5m'
  smtp_smarthost: 'smtp.126.com:25'   
  smtp_from: 'looknicemm@126.com' 
  smtp_auth_username: 'looknicemm@126.com' 
  smtp_auth_password: 'KNPYYGBKYJKKOQQA'  # mailbox authorization password (for 163 mail, enable it under Settings -> General Settings -> client authorization password)
templates:
- '*.tmpl' # load all notification message templates
inhibit_rules:
- equal:   # if a critical alert fires, warning|info alerts with the same namespace and alertname are inhibited
  - namespace
  - alertname
  source_matchers:
  - severity = critical
  target_matchers:
  - severity =~ warning|info
- equal:    # inhibition only applies when the alerts share the same values for the labels listed under equal
  - namespace
  - alertname
  source_matchers:
  - severity = warning
  target_matchers:
  - severity = info
route:
  receiver: 'default-receiver'
  group_by: ['alertname','namespace']  # labels used to group incoming alerts; e.g. multiple alerts with cluster=A and alertname=LatencyHigh are batched into a single group
  group_wait: 30s  # when incoming alerts create a new alert group, wait at least 30s before sending the initial notification
  group_interval: 5m  # after the first notification, wait 5m before sending a batch of new alerts that started firing for that group (a new value of a group_by label, e.g. alertname=1 and alertname=2, is sent immediately as separate mails; changes outside group_by wait for the 5m interval)
  repeat_interval: 4h # if a notification has already been sent successfully, wait 4h before resending it
  routes: # alerts that match none of the sub-routes below stay at the root node and are dispatched to 'default-receiver'
  - receiver: 'database-pager'
    group_wait: 10s
    group_by: [service]
    match_re:
      service: mysql|cassandra  # all alerts with service=mysql or service=cassandra are sent to 'database-pager'
receivers:
- name: default-receiver # different alerts are delivered to different receivers
  webhook_configs:
      - url: http://alert-webhook.default:8080/webhook
        send_resolved: true # also send a notification when the alert resolves
- name: database-pager
  email_configs:
  - to: '690081650@qq.com'
    html: '{{ template "alertemp.html" . }}'     # which template to apply
    headers: { Subject: "[INFO] alert mail test" }   # mail header fields
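
The templates entry above loads every *.tmpl file, and the database-pager receiver renders {{ template "alertemp.html" . }}. A minimal sketch of what such a template file (e.g. alertemp.tmpl) might contain, assuming the standard Alertmanager notification data model; shipping the file next to alertmanager.yaml (for example inside the same secret) is not covered here:

{{ define "alertemp.html" }}
{{- range .Alerts }}
<p>
  <b>Alert:</b> {{ .Labels.alertname }} ({{ .Labels.severity }})<br>
  <b>Status:</b> {{ .Status }}<br>
  <b>Summary:</b> {{ .Annotations.summary }}<br>
  <b>Description:</b> {{ .Annotations.description }}<br>
  <b>Started:</b> {{ .StartsAt.Format "2006-01-02 15:04:05" }}
</p>
{{- end }}
{{ end }}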

  3. Delete the pod so the new configuration is picked up, then check the status under the zstack-prometheus-alertmanager service (http://172.31.12.6:32140/#/alerts) to confirm the change took effect.

  4. Receive the corresponding alert notifications.

Alertmanager: the request data POSTed to the webhook


{
    "version": "4",
    "groupKey": "{}:{alertname=\"CPUThrottlingHigh\", namespace=\"zstack-middleware\"}",
    "status": "firing",
    "receiver": "default-receiver",
    "groupLabels": {
        "alertname": "CPUThrottlingHigh",
        "namespace": "zstack-middleware"
    },
    "commonLabels": {
        "alertname": "CPUThrottlingHigh",
        "container": "metrics",
        "namespace": "zstack-middleware",
        "prometheus": "monitoring/zstack-prometheus-prometheus",
        "severity": "info"
    },
    "commonAnnotations": {
        "runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-cputhrottlinghigh",
        "summary": "Processes experience elevated CPU throttling."
    },
    "externalURL": "http://zstack-prometheus-alertmanager.monitoring:9093",
    "alerts": [
        {
            "labels": {
                "alertname": "CPUThrottlingHigh",
                "container": "metrics",
                "namespace": "zstack-middleware",
                "pod": "postgres-postgresql-0",
                "prometheus": "monitoring/zstack-prometheus-prometheus",
                "severity": "info"
            },
            "annotations": {
                "description": "36.43% throttling of CPU in namespace zstack-middleware for container metrics in pod postgres-postgresql-0.",
                "runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-cputhrottlinghigh",
                "summary": "Processes experience elevated CPU throttling."
            },
            "startsAt": "2022-10-26T06:44:50.751Z",
            "endsAt": "0001-01-01T00:00:00Z"
        },
        {
            "labels": {
                "alertname": "CPUThrottlingHigh",
                "container": "metrics",
                "namespace": "zstack-middleware",
                "pod": "postgres-postgresql-1",
                "prometheus": "monitoring/zstack-prometheus-prometheus",
                "severity": "info"
            },
            "annotations": {
                "description": "36.28% throttling of CPU in namespace zstack-middleware for container metrics in pod postgres-postgresql-1.",
                "runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-cputhrottlinghigh",
                "summary": "Processes experience elevated CPU throttling."
            },
            "startsAt": "2022-10-25T21:18:20.751Z",
            "endsAt": "0001-01-01T00:00:00Z"
        }
    ]
}
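
The alert-webhook image used in the Deployment above is internal and its source is not shown; a minimal sketch of what such a receiver could look like, written in Go with only the standard library and with struct fields mirroring the payload above (forwarding to mail/DingTalk/WeChat is reduced to a log line):

package main

import (
	"encoding/json"
	"log"
	"net/http"
	"time"
)

// Alert mirrors one entry of the "alerts" array in the Alertmanager payload.
type Alert struct {
	Labels      map[string]string `json:"labels"`
	Annotations map[string]string `json:"annotations"`
	StartsAt    time.Time         `json:"startsAt"`
	EndsAt      time.Time         `json:"endsAt"`
}

// Notification mirrors the top-level webhook payload.
type Notification struct {
	Version      string            `json:"version"`
	GroupKey     string            `json:"groupKey"`
	Status       string            `json:"status"`
	Receiver     string            `json:"receiver"`
	GroupLabels  map[string]string `json:"groupLabels"`
	CommonLabels map[string]string `json:"commonLabels"`
	Alerts       []Alert           `json:"alerts"`
}

func main() {
	http.HandleFunc("/webhook", func(w http.ResponseWriter, r *http.Request) {
		var n Notification
		if err := json.NewDecoder(r.Body).Decode(&n); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		// Here a real converter would call the mail / DingTalk / WeChat APIs.
		for _, a := range n.Alerts {
			log.Printf("[%s] %s pod=%s since %s", n.Status, a.Labels["alertname"], a.Labels["pod"], a.StartsAt)
		}
		w.WriteHeader(http.StatusOK)
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}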

Custom alerting rules

1. Reference Prometheus rules (covering most common cases): https://awesome-prometheus-alerts.grep.to/


apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    custom-alerting-rule-level: cluster
    role: alert-rules
  name: custom-alerting-rule
  namespace: monitoring
spec:
  groups:
  - name: alerting.custom.defaults.0
    rules:
    - alert: fe
      annotations:
        kind: Node
        message: ""
        resources: '["172-20-17-188"]'
        rule_update_time: "2022-10-09T10:14:48Z"
        rules: '[{"_metricType":"instance:node_cpu_utilisation:rate5m{$1}","condition_type":"<","thresholds":"90","unit":"%"}]'
        summary: Node 172.32.1.174 CPU usage < 90%
      expr:  instance:node_cpu_utilisation:rate5m{ instance=~"172.32.1.174.*|172.32.1.179.*"}  <0.9
      for: 1m
      labels:
        alerttype: metric
        severity: warning
        service: mysql

2. Detection based on a metric of a given workload; a PrometheusRule sketch using the expression below follows.

node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate * ON(pod) GROUP_LEFT() namespace_workload_pod:kube_pod_owner:relabel{ workload_type="deployment",workload=~"alert-webhook|asd"} >0

workload_type can be one of: deployment, statefulset, daemonset
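
A hedged sketch of wrapping the expression above in a PrometheusRule; the rule name, severity, and annotations are placeholders, not taken from a real deployment:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    role: alert-rules
  name: workload-cpu-usage
  namespace: monitoring
spec:
  groups:
  - name: alerting.workload.cpu
    rules:
    - alert: WorkloadCPUUsage
      annotations:
        summary: CPU usage detected for deployment alert-webhook
      expr: node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate * ON(pod) GROUP_LEFT() namespace_workload_pod:kube_pod_owner:relabel{workload_type="deployment",workload=~"alert-webhook|asd"} > 0
      for: 1m
      labels:
        severity: warning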

3. Detection based on a metric across all workloads (sts, deploy, ds) of a Helm application. This requires an additional, self-developed helm-exporter collector.

node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate  * ON(pod) GROUP_LEFT() (namespace_workload_pod:kube_pod_owner:relabel{} * on(workload,workload_type) GROUP_LEFT() helm_chart_workload_list{chart=~"helm-exporter|zstack-edge"})

Custom collector (exporter):

1. Develop the exporter with the client_golang SDK. Reference demo: https://github.com/prometheus/client_golang/blob/main/examples/gocollector/main.go
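
A minimal sketch of such an exporter built on client_golang, exposing a hypothetical helm_chart_workload_list metric with the chart / workload / workload_type labels used in the rules above (a real helm-exporter would discover this data from Helm releases; here one sample series is hard-coded just to show the shape):

package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// helm_chart_workload_list{chart, workload, workload_type} = 1 for every workload owned by a chart.
var helmWorkloads = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "helm_chart_workload_list",
		Help: "Workloads (deployment/statefulset/daemonset) that belong to a Helm chart.",
	},
	[]string{"chart", "workload", "workload_type"},
)

func main() {
	prometheus.MustRegister(helmWorkloads)

	// Sample series; a real collector would list Helm releases and their workloads.
	helmWorkloads.WithLabelValues("helm-exporter", "helm-exporter", "deployment").Set(1)

	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}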

2. Configure Prometheus to scrape the exporter via a ServiceMonitor (this is eventually rendered into the /etc/prometheus/config_out/prometheus.env.yaml configuration file inside Prometheus).


apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  annotations:
    meta.helm.sh/release-name: prometheus
    meta.helm.sh/release-namespace: monitoring
  labels:
    app.kubernetes.io/instance: helm-exporter
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: helm-exporter
    app.kubernetes.io/version: 1.2.3
    helm.sh/chart: helm-exporter-1.2.3_4dc0dfc
  name: helm-exporter
  namespace: monitoring
spec:
  endpoints:
  - path: /metrics
    port: http
  namespaceSelector:
    matchNames:
    - monitoring
  selector:
    matchLabels:
      app.kubernetes.io/name: helm-exporter
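
The ServiceMonitor above selects a Service by the app.kubernetes.io/name: helm-exporter label and scrapes the Service port named http. A minimal sketch of a matching Service (the pod selector and port numbers are assumptions, not taken from the actual helm-exporter chart):

apiVersion: v1
kind: Service
metadata:
  name: helm-exporter
  namespace: monitoring
  labels:
    app.kubernetes.io/name: helm-exporter
spec:
  selector:
    app.kubernetes.io/name: helm-exporter
  ports:
  - name: http          # must match spec.endpoints[].port in the ServiceMonitor
    port: 8080
    targetPort: 8080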

  3. After this is configured, open http://172.32.1.210:10006/targets to check whether the target was picked up successfully.

Design approach:

The end-to-end alerting flow:

  1. Prometheus fires alerts according to its alerting rules; to add, remove, or change rules, see PrometheusRule (https://blog.csdn.net/wzy_168/article/details/103290235).
  2. Alerts from Prometheus are delivered to Alertmanager, which applies silencing, inhibition, grouping, and the rest of its configuration.
  3. Alertmanager is configured with a webhook receiver whose URL points to the Edge custom converter.
  4. The Edge custom converter forwards the alert to other notification channels (e.g. email, DingTalk, WeChat, ...).

Glossary:

  1. Alert states:

Inactive: the alert is not active;

Pending: the alert expression is satisfied, but it has not yet held for the duration specified by for;

Firing: the alert expression is satisfied and has held for the duration specified by for.

References:

  1. https://blog.csdn.net/wzy_168/article/details/103295971 prometheus-operator usage (understanding the Alertmanager notification mechanism)
  2. https://yunlzheng.gitbook.io/prometheus-book/parti-prometheus-ji-chu/alert/alert-manager-route label-based alert routing
  3. https://github.com/prometheus/client_golang Prometheus development SDK, supports building custom exporters
  4. https://github.com/sstarcher/helm-exporter Helm application exporter, suitable as a base for further development
  5. https://prometheus.io/docs/prometheus/latest/configuration/configuration/ official reference for configuration fields
  6. https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/ official reference for alerting rule fields
