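1. Straight to the point
This continues the previous post: installing the Prometheus monitoring stack with a script.
Access URL for each component: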
Grafana: http://IP:3000/
Prometheus: http://IP:9090/
Alertmanager: http://IP:9093/
Prometheus Webhook Dingtalk: http://IP:8060/UI
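2. Configuration file examples for each component
I installed the components with docker-compose; below is the directory tree of the files mounted from outside Docker.
Note: the screenshot above is only there so you can check against it if something looks wrong; this one is the screenshot that explains the configuration files.
The files above are all the configuration needed for monitoring and DingTalk alerting.
3. Configuration examples:
From top to bottom: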
alertmanager.yml
global:
  resolve_timeout: 5m
route:
  receiver: webhook
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 5m
  group_by: [alertname]
  routes:
    - receiver: webhook
      group_wait: 10s
receivers:
  - name: webhook
    webhook_configs:
      - url: http://IP:8060/dingtalk/webhook1/send
        send_resolved: true
# Replace IP with your own IP
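To check the route above without waiting for a real failure, one option is to push a hand-made alert into Alertmanager through its v2 HTTP API (`POST /api/v2/alerts`). The sketch below is only an illustration: the function names, the `manual-test` alert name and the label values are my own assumptions, not part of the original setup.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def build_test_alert(alertname="manual-test", minutes=5):
    """Build one alert in the JSON shape accepted by Alertmanager's v2 API.

    The alert name and label values are hypothetical; adjust them freely.
    """
    now = datetime.now(timezone.utc)
    return [{
        "labels": {"alertname": alertname, "severity": "info"},
        "annotations": {"summary": "manual test alert"},
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(minutes=minutes)).isoformat(),
    }]


def send_alerts(base_url, alerts, timeout=5):
    """POST the alerts to <base_url>/api/v2/alerts and return the HTTP status."""
    req = urllib.request.Request(
        base_url.rstrip("/") + "/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status
```

Calling `send_alerts("http://IP:9093", build_test_alert())` should make the alert appear in the Alertmanager UI and, after `group_wait`, be forwarded to the webhook receiver.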
dingtalk
config.yml
## Request timeout
# timeout: 5s

## Customizable templates path
templates:
  - /etc/prometheus-webhook-dingtalk/templates/legacy/template.tmpl

## You can also override default template using `default_message`
## The following example to use the 'legacy' template from v0.3.0
# default_message:
#   title: '{{ template "legacy.title" . }}'
#   text: '{{ template "legacy.content" . }}'

## Targets, previously was known as "profiles"
targets:
  webhook1:
    url: https://oapi.dingtalk.com/robot/send?access_token=xxxx
    # secret for signature
    secret: xxxx
# Replace xxxx with the values of your own DingTalk robot; search online for how to apply for one :P
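The `secret` above is what the webhook uses to sign each request to DingTalk. If you want to test the robot directly, the signing scheme DingTalk documents for custom robots can be sketched roughly as follows: HMAC-SHA256 over `timestamp + "\n" + secret`, keyed with the secret, then Base64 and URL encoding. The helper names are my own, and `SECxxxx` in the usage note is a placeholder.

```python
import base64
import hashlib
import hmac
import time
import urllib.parse


def dingtalk_sign(secret, timestamp_ms):
    """Return the URL-encoded signature for a robot protected by a signing secret."""
    string_to_sign = f"{timestamp_ms}\n{secret}"
    digest = hmac.new(
        secret.encode("utf-8"),
        string_to_sign.encode("utf-8"),
        hashlib.sha256,
    ).digest()
    return urllib.parse.quote_plus(base64.b64encode(digest))


def signed_url(webhook_url, secret, timestamp_ms=None):
    """Append timestamp and sign parameters to a robot URL that already has access_token."""
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)  # DingTalk expects milliseconds
    sign = dingtalk_sign(secret, timestamp_ms)
    return f"{webhook_url}&timestamp={timestamp_ms}&sign={sign}"
```

For example, `signed_url("https://oapi.dingtalk.com/robot/send?access_token=xxxx", "SECxxxx")` gives a URL you can POST a test message to. In normal operation prometheus-webhook-dingtalk does this for you.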
template.tmpl
A template someone else wrote. It looks like this, and I used it as-is.
{{ define "__subject" }}[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .GroupLabels.SortedPairs.Values | join " " }} {{ if gt (len .CommonLabels) (len .GroupLabels) }}({{ with .CommonLabels.Remove .GroupLabels.Names }}{{ .Values | join " " }}{{ end }}){{ end }}{{ end }}
{{ define "__alertmanagerURL" }}{{ .ExternalURL }}/#/alerts?receiver={{ .Receiver }}{{ end }}

{{ define "__text_alert_list" }}{{ range . }}
**Labels**
{{ range .Labels.SortedPairs }}> - {{ .Name }}: {{ .Value | markdown | html }}
{{ end }}
**Annotations**
{{ range .Annotations.SortedPairs }}> - {{ .Name }}: {{ .Value | markdown | html }}
{{ end }}
**Source:** [{{ .GeneratorURL }}]({{ .GeneratorURL }})
{{ end }}{{ end }}

{{ define "default.__text_alert_list" }}{{ range . }}
---
**Severity:** {{ .Labels.severity | upper }}

**Team:** {{ .Labels.team | upper }}

**Fired at:** {{ dateInZone "2006.01.02 15:04:05" (.StartsAt) "Asia/Shanghai" }}

**Details:**
{{ range .Annotations.SortedPairs }}> - {{ .Name }}: {{ .Value | markdown | html }}
{{ end }}

**Labels:**
{{ range .Labels.SortedPairs }}{{ if and (ne (.Name) "severity") (ne (.Name) "summary") (ne (.Name) "team") }}> - {{ .Name }}: {{ .Value | markdown | html }}
{{ end }}{{ end }}
{{ end }}
{{ end }}

{{ define "default.__text_alertresovle_list" }}{{ range . }}
---
**Severity:** {{ .Labels.severity | upper }}

**Team:** {{ .Labels.team | upper }}

**Fired at:** {{ dateInZone "2006.01.02 15:04:05" (.StartsAt) "Asia/Shanghai" }}

**Resolved at:** {{ dateInZone "2006.01.02 15:04:05" (.EndsAt) "Asia/Shanghai" }}

**Details:**
{{ range .Annotations.SortedPairs }}> - {{ .Name }}: {{ .Value | markdown | html }}
{{ end }}

**Labels:**
{{ range .Labels.SortedPairs }}{{ if and (ne (.Name) "severity") (ne (.Name) "summary") (ne (.Name) "team") }}> - {{ .Name }}: {{ .Value | markdown | html }}
{{ end }}{{ end }}
{{ end }}
{{ end }}

{{/* Default */}}
{{ define "default.title" }}{{ template "__subject" . }}{{ end }}
{{ define "default.content" }}[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] **[{{ index .GroupLabels "alertname" }}]({{ template "__alertmanagerURL" . }})**
{{ if gt (len .Alerts.Firing) 0 -}}

**====Fault detected====**
{{ template "default.__text_alert_list" .Alerts.Firing }}
{{- end }}
{{ if gt (len .Alerts.Resolved) 0 -}}
{{ template "default.__text_alertresovle_list" .Alerts.Resolved }}
{{- end }}
{{- end }}

{{/* Legacy */}}
{{ define "legacy.title" }}{{ template "__subject" . }}{{ end }}
{{ define "legacy.content" }}[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] **[{{ index .GroupLabels "alertname" }}]({{ template "__alertmanagerURL" . }})**
{{ template "__text_alert_list" .Alerts.Firing }}
{{- end }}

{{/* Following names for compatibility */}}
{{ define "ding.link.title" }}{{ template "default.title" . }}{{ end }}
{{ define "ding.link.content" }}{{ template "default.content" . }}{{ end }}
dingtalk-compose.yml
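The source package, written in Go:
This is the docker-compose file for the DingTalk webhook. The script in the first post generated everything, but the result did not suit me later on, so I rewrote this file and did not bother updating the script (perfectionists, look away). For this step, first delete the existing webhook container, then recreate it with docker-compose.
version: '3.7'
services:
  # DingTalk plugin
  webhook:
    image: timonwong/prometheus-webhook-dingtalk
    restart: "always"
    ports:
      - 8060:8060
    container_name: "webhook"
    # DingTalk token (set in the mounted config.yml)
    volumes:
      - "/data/PGAWD/dingtalk/config.yml:/etc/prometheus-webhook-dingtalk/config.yml"
      - "/data/PGAWD/dingtalk/template.tmpl:/etc/prometheus-webhook-dingtalk/templates/legacy/template.tmpl"
    command:
      - '--web.enable-ui'
      - '--log.format=logfmt'
      - '--config.file=/etc/prometheus-webhook-dingtalk/config.yml'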
grafana
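grafana.ini: the script copied this file out of the container.
prometheus
One note here: monitoring jobs can be split into several groups. Configured this way, if every node under a job goes down, the alerts are grouped together and sent as one notification.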
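To make the grouping idea concrete, here is a tiny simulation of what `group_by: [alertname]` in alertmanager.yml does with incoming alerts. This is a simplified model for illustration, not Alertmanager's actual code; the function name is my own and the sample label sets are taken from the jobs and rules in this post.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alert label sets by the values of the group_by labels."""
    groups = defaultdict(list)
    for labels in alerts:
        # Alerts with identical values for every group_by label share one notification.
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)


alerts = [
    {"alertname": "original-up", "instance": "zkjy-data-original-yc-142"},
    {"alertname": "original-up", "instance": "zkjy-data-original-yc-148"},
    {"alertname": "statistic-up", "instance": "zkjy-data-statistic-yc-142"},
]
grouped = group_alerts(alerts, ["alertname"])
```

Here both `original-up` alerts land in the same group, so DingTalk receives one message listing both instances, while `statistic-up` goes out separately.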
prometheus.yml
global:
  scrape_interval: 15s      # how often to scrape metrics
  evaluation_interval: 15s  # how often to evaluate rules
  scrape_timeout: 10s       # timeout for each scrape

# List of scrape configurations
scrape_configs:
  - job_name: prometheus  # required; attached automatically as the job label; must be unique
    static_configs:
      - targets: ['IP:9090']  # Prometheus IP and port
        labels:
          instance: prometheus  # label

  #=========================== Server node monitoring ===========================
  - job_name: node133-linux
    static_configs:
      - targets: ['IP:9100']
        labels:
          instance: node133-node-exporter

  #=========================== Front-end service node monitoring ===========================
  - job_name: zkjy-data-statistic
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets: ['IP:38082']
        labels:
          instance: 'zkjy-data-statistic-yc-142'
          name: 'zkjy-data-statistic-yc-142'
      - targets: ['IP:38082']
        labels:
          instance: 'zkjy-data-statistic-yc-148'
          name: 'zkjy-data-statistic-yc-148'

  #=========================== Front-end service node monitoring ===========================
  - job_name: zkjy-data-original
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets: ['IP:38081']
        labels:
          instance: 'zkjy-data-original-yc-142'
          name: 'zkjy-data-original-yc-142'
      - targets: ['IP:38081']
        labels:
          instance: 'zkjy-data-original-yc-148'
          name: 'zkjy-data-original-yc-148'

alerting:  # Alertmanager settings
  alertmanagers:
    - static_configs:
        - targets:
            - IP:9093  # the alerting component

rule_files:  # alert rule files; wildcards are allowed
  - "/etc/prometheus/rules/*.yml"
rules
alert-rules.yml
groups:
  - name: prometheus-alert
    rules:
      - alert: prometheus-down
        expr: prometheus:up == 0
        for: 1m
        labels:
          severity: "critical"
        annotations:
          summary: "instance: {{ $labels.instance }} is down"
          description: "instance: {{ $labels.instance }} \n- job: {{ $labels.job }} has been down for 1 minute."
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: prometheus-cpu-high
        expr: prometheus:cpu:total:percent > 80
        for: 3m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} CPU usage above {{ $value }}"
          description: "instance: {{ $labels.instance }} \n- job: {{ $labels.job }} CPU usage has stayed above 80% for three minutes."
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: prometheus-cpu-iowait-high
        expr: prometheus:cpu:iowait:percent >= 12
        for: 3m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} CPU iowait above {{ $value }}"
          description: "instance: {{ $labels.instance }} \n- job: {{ $labels.job }} CPU iowait has stayed above 12% for three minutes."
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: prometheus-load-load1-high
        expr: (prometheus:load:load1) > (prometheus:cpu:count) * 1.2
        for: 3m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} load1 above {{ $value }}"
          description: ""
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: prometheus-memory-high
        expr: prometheus:memory:used:percent > 85
        for: 3m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} memory usage above {{ $value }}"
          description: ""
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: prometheus-disk-high
        expr: prometheus:disk:used:percent > 80
        for: 10m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} disk usage above {{ $value }}"
          description: ""
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: prometheus-disk-read-count-high
        expr: prometheus:disk:read:count:rate > 2000
        for: 2m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} read IOPS above {{ $value }}"
          description: ""
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: prometheus-disk-write-count-high
        expr: prometheus:disk:write:count:rate > 2000
        for: 2m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} write IOPS above {{ $value }}"
          description: ""
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: prometheus-disk-read-mb-high
        expr: prometheus:disk:read:mb:rate > 60
        for: 2m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} read throughput above {{ $value }}"
          description: ""
          instance: "{{ $labels.instance }}"
          value: "{{ $value }}"
      - alert: prometheus-disk-write-mb-high
        expr: prometheus:disk:write:mb:rate > 60
        for: 2m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} write throughput above {{ $value }}"
          description: ""
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: prometheus-filefd-allocated-percent-high
        expr: prometheus:filefd_allocated:percent > 80
        for: 10m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} open file descriptors above {{ $value }}"
          description: ""
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: prometheus-network-netin-error-rate-high
        expr: prometheus:network:netin:error:rate > 4
        for: 1m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} inbound packet error rate above {{ $value }}"
          description: ""
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: prometheus-network-netin-packet-rate-high
        expr: prometheus:network:netin:packet:rate > 35000
        for: 1m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} inbound packet rate above {{ $value }}"
          description: ""
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: prometheus-network-netout-packet-rate-high
        expr: prometheus:network:netout:packet:rate > 35000
        for: 1m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} outbound packet rate above {{ $value }}"
          description: ""
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: prometheus-network-tcp-total-count-high
        expr: prometheus:network:tcp:total:count > 40000
        for: 1m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} TCP connection count above {{ $value }}"
          description: ""
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: prometheus-process-zoom-total-count-high
        expr: prometheus:process:zoom:total:count > 10
        for: 10m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} zombie process count above {{ $value }}"
          description: ""
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: prometheus-time-offset-high
        expr: prometheus:time:offset > 0.03
        for: 2m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} {{ $labels.desc }} {{ $value }} {{ $labels.unit }}"
          description: ""
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
node-up.yml
groups:
  - name: Service uptime alerts
    rules:
      - alert: original-up
        expr: up{job="zkjy-data-original"} == 0
        for: 15s
        labels:
          severity: "critical"
          team: "Raw data service"
        annotations:
          summary: "{{ $labels.instance }} has been down for more than 15s!"
      - alert: statistic-up
        expr: up{job="zkjy-data-statistic"} == 0
        for: 15s
        labels:
          severity: "critical"
          team: "Statistics data service"
        annotations:
          summary: "{{ $labels.instance }} has been down for more than 15s!"
# expr: up{job="node-exporter"} == 0 means the service is offline
# for: 15s means the condition must hold for 15s
# annotations.summary is the message text
# So a rule like this fires an alert once the job (node-exporter) has been offline for 15s
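The interaction between `for: 15s` and `evaluation_interval: 15s` is easy to get wrong, so here is a small model of it. As far as I understand Prometheus, an alert turns pending the first time the expression is true and fires at the first evaluation where it has been continuously true for at least the `for` duration. The function below is my own simplification of that behaviour, not Prometheus code.

```python
def first_firing_time(samples, for_seconds):
    """Return the evaluation timestamp at which the alert would fire, or None.

    samples: list of (timestamp_seconds, condition_is_true), one per rule evaluation.
    """
    pending_since = None
    for ts, active in samples:
        if not active:
            pending_since = None  # condition cleared, the pending state is dropped
            continue
        if pending_since is None:
            pending_since = ts  # first evaluation where the condition holds
        if ts - pending_since >= for_seconds:
            return ts
    return None


# Service goes down at t=0 and stays down, rules evaluated every 15s.
fires_at = first_firing_time([(0, True), (15, True), (30, True)], for_seconds=15)
```

With these numbers the alert fires at the second evaluation (t=15). A single failed scrape followed by a recovery never gets past pending.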
record-rules.yml
groups:
  - name: prometheus-record
    rules:
      - expr: up{job!="prometheus"}
        record: prometheus:up
        labels:
          desc: "Whether the node is online: 1 = online, 0 = offline"
          unit: " "
          job: "prometheus"
      - expr: time() - node_boot_time_seconds{}
        record: prometheus:node_uptime
        labels:
          desc: "Node uptime"
          unit: "s"
          job: "prometheus"
      ##########################################################################
      #                                 cpu                                    #
      - expr: (1 - avg by (environment,instance) (irate(node_cpu_seconds_total{job!="prometheus",mode="idle"}[5m]))) * 100
        record: prometheus:cpu:total:percent
        labels:
          desc: "Total CPU usage percentage of the node"
          unit: "%"
          job: "prometheus"
      - expr: (avg by (environment,instance) (irate(node_cpu_seconds_total{job!="prometheus",mode="idle"}[5m]))) * 100
        record: prometheus:cpu:idle:percent
        labels:
          desc: "CPU idle percentage of the node"
          unit: "%"
          job: "prometheus"
      - expr: (avg by (environment,instance) (irate(node_cpu_seconds_total{job!="prometheus",mode="iowait"}[5m]))) * 100
        record: prometheus:cpu:iowait:percent
        labels:
          desc: "CPU iowait percentage of the node"
          unit: "%"
          job: "prometheus"
      - expr: (avg by (environment,instance) (irate(node_cpu_seconds_total{job!="prometheus",mode="system"}[5m]))) * 100
        record: prometheus:cpu:system:percent
        labels:
          desc: "CPU system percentage of the node"
          unit: "%"
          job: "prometheus"
      - expr: (avg by (environment,instance) (irate(node_cpu_seconds_total{job!="prometheus",mode="user"}[5m]))) * 100
        record: prometheus:cpu:user:percent
        labels:
          desc: "CPU user percentage of the node"
          unit: "%"
          job: "prometheus"
      - expr: (avg by (environment,instance) (irate(node_cpu_seconds_total{job!="prometheus",mode=~"softirq|nice|irq|steal"}[5m]))) * 100
        record: prometheus:cpu:other:percent
        labels:
          desc: "CPU percentage of the node spent in other modes"
          unit: "%"
          job: "prometheus"
      ##########################################################################
      #                                memory                                  #
      - expr: node_memory_MemTotal_bytes{job!="prometheus"}
        record: prometheus:memory:total
        labels:
          desc: "Total memory of the node"
          unit: byte
          job: "prometheus"
      - expr: node_memory_MemFree_bytes{job!="prometheus"}
        record: prometheus:memory:free
        labels:
          desc: "Free memory of the node"
          unit: byte
          job: "prometheus"
      - expr: node_memory_MemTotal_bytes{job!="prometheus"} - node_memory_MemFree_bytes{job!="prometheus"}
        record: prometheus:memory:used
        labels:
          desc: "Used memory of the node"
          unit: byte
          job: "prometheus"
      - expr: node_memory_MemTotal_bytes{job!="prometheus"} - node_memory_MemAvailable_bytes{job!="prometheus"}
        record: prometheus:memory:actualused
        labels:
          desc: "Memory actually used by user processes on the node"
          unit: byte
          job: "prometheus"
      - expr: (1-(node_memory_MemAvailable_bytes{job!="prometheus"} / (node_memory_MemTotal_bytes{job!="prometheus"})))* 100
        record: prometheus:memory:used:percent
        labels:
          desc: "Memory usage percentage of the node"
          unit: "%"
          job: "prometheus"
      - expr: ((node_memory_MemAvailable_bytes{job!="prometheus"} / (node_memory_MemTotal_bytes{job!="prometheus"})))* 100
        record: prometheus:memory:free:percent
        labels:
          desc: "Free memory percentage of the node"
          unit: "%"
          job: "prometheus"
      ##########################################################################
      #                                 load                                   #
      - expr: sum by (instance) (node_load1{job!="prometheus"})
        record: prometheus:load:load1
        labels:
          desc: "1-minute system load"
          unit: " "
          job: "prometheus"
      - expr: sum by (instance) (node_load5{job!="prometheus"})
        record: prometheus:load:load5
        labels:
          desc: "5-minute system load"
          unit: " "
          job: "prometheus"
      - expr: sum by (instance) (node_load15{job!="prometheus"})
        record: prometheus:load:load15
        labels:
          desc: "15-minute system load"
          unit: " "
          job: "prometheus"
      ##########################################################################
      #                                 disk                                   #
      - expr: node_filesystem_size_bytes{job!="prometheus" ,fstype=~"ext4|xfs"}
        record: prometheus:disk:usage:total
        labels:
          desc: "Total disk size of the node"
          unit: byte
          job: "prometheus"
      - expr: node_filesystem_avail_bytes{job!="prometheus",fstype=~"ext4|xfs"}
        record: prometheus:disk:usage:free
        labels:
          desc: "Free disk space of the node"
          unit: byte
          job: "prometheus"
      - expr: node_filesystem_size_bytes{job!="prometheus",fstype=~"ext4|xfs"} - node_filesystem_avail_bytes{job!="prometheus",fstype=~"ext4|xfs"}
        record: prometheus:disk:usage:used
        labels:
          desc: "Used disk space of the node"
          unit: byte
          job: "prometheus"
      - expr: (1 - node_filesystem_avail_bytes{job!="prometheus",fstype=~"ext4|xfs"} / node_filesystem_size_bytes{job!="prometheus",fstype=~"ext4|xfs"}) * 100
        record: prometheus:disk:used:percent
        labels:
          desc: "Disk usage percentage of the node"
          unit: "%"
          job: "prometheus"
      - expr: irate(node_disk_reads_completed_total{job!="prometheus"}[1m])
        record: prometheus:disk:read:count:rate
        labels:
          desc: "Disk read rate of the node"
          unit: "ops/s"
          job: "prometheus"
      - expr: irate(node_disk_writes_completed_total{job!="prometheus"}[1m])
        record: prometheus:disk:write:count:rate
        labels:
          desc: "Disk write rate of the node"
          unit: "ops/s"
          job: "prometheus"
      - expr: (irate(node_disk_read_bytes_total{job!="prometheus"}[1m]))/1024/1024
        record: prometheus:disk:read:mb:rate
        labels:
          desc: "Device read rate of the node in MB"
          unit: "MB/s"
          job: "prometheus"
      - expr: (irate(node_disk_written_bytes_total{job!="prometheus"}[1m]))/1024/1024
        record: prometheus:disk:write:mb:rate
        labels:
          desc: "Device write rate of the node in MB"
          unit: "MB/s"
          job: "prometheus"
      ##########################################################################
      #                              filesystem                                #
      - expr: (1 -node_filesystem_files_free{job!="prometheus",fstype=~"ext4|xfs"} / node_filesystem_files{job!="prometheus",fstype=~"ext4|xfs"}) * 100
        record: prometheus:filesystem:used:percent
        labels:
          desc: "Inode usage percentage of the node"
          unit: "%"
          job: "prometheus"
      ##########################################################################
      #                                filefd                                  #
      - expr: node_filefd_allocated{job!="prometheus"}
        record: prometheus:filefd_allocated:count
        labels:
          desc: "Number of open file descriptors on the node"
          unit: "count"
          job: "prometheus"
      - expr: node_filefd_allocated{job!="prometheus"}/node_filefd_maximum{job!="prometheus"} * 100
        record: prometheus:filefd_allocated:percent
        labels:
          desc: "Percentage of file descriptors in use on the node"
          unit: "%"
          job: "prometheus"
      ##########################################################################
      #                               network                                  #
      # bytes/s * 8 = bit/s
      - expr: avg by (environment,instance,device) (irate(node_network_receive_bytes_total{device=~"eth0|eth1|ens33|ens37"}[1m])) * 8
        record: prometheus:network:netin:bit:rate
        labels:
          desc: "Bits received per second on the node's network interface"
          unit: "bit/s"
          job: "prometheus"
      - expr: avg by (environment,instance,device) (irate(node_network_transmit_bytes_total{device=~"eth0|eth1|ens33|ens37"}[1m])) * 8
        record: prometheus:network:netout:bit:rate
        labels:
          desc: "Bits sent per second on the node's network interface"
          unit: "bit/s"
          job: "prometheus"
      - expr: avg by (environment,instance,device) (irate(node_network_receive_packets_total{device=~"eth0|eth1|ens33|ens37"}[1m]))
        record: prometheus:network:netin:packet:rate
        labels:
          desc: "Packets received per second on the node's network interface"
          unit: "count/s"
          job: "prometheus"
      - expr: avg by (environment,instance,device) (irate(node_network_transmit_packets_total{device=~"eth0|eth1|ens33|ens37"}[1m]))
        record: prometheus:network:netout:packet:rate
        labels:
          desc: "Packets sent per second on the node's network interface"
          unit: "count/s"
          job: "prometheus"
      - expr: avg by (environment,instance,device) (irate(node_network_receive_errs_total{device=~"eth0|eth1|ens33|ens37"}[1m]))
        record: prometheus:network:netin:error:rate
        labels:
          desc: "Receive errors detected by the node's device driver"
          unit: "count/s"
          job: "prometheus"
      - expr: avg by (environment,instance,device) (irate(node_network_transmit_errs_total{device=~"eth0|eth1|ens33|ens37"}[1m]))
        record: prometheus:network:netout:error:rate
        labels:
          desc: "Transmit errors detected by the node's device driver"
          unit: "count/s"
          job: "prometheus"
      - expr: node_tcp_connection_states{job!="prometheus", state="established"}
        record: prometheus:network:tcp:established:count
        labels:
          desc: "Number of established connections on the node"
          unit: "count"
          job: "prometheus"
      - expr: node_tcp_connection_states{job!="prometheus", state="time_wait"}
        record: prometheus:network:tcp:timewait:count
        labels:
          desc: "Number of time_wait connections on the node"
          unit: "count"
          job: "prometheus"
      - expr: sum by (environment,instance) (node_tcp_connection_states{job!="prometheus"})
        record: prometheus:network:tcp:total:count
        labels:
          desc: "Total number of TCP connections on the node"
          unit: "count"
          job: "prometheus"
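If the `prometheus:cpu:total:percent` expression looks opaque, it helps to redo the arithmetic by hand. The sketch below mirrors `(1 - avg(irate(idle))) * 100` for one instance: `irate` is the per-second increase between the last two samples of each core's idle counter. The function names and the sample numbers are made up for illustration, and counter resets are ignored.

```python
def irate(prev, last):
    """Per-second rate from the last two (timestamp_seconds, counter_value) samples."""
    (t0, v0), (t1, v1) = prev, last
    return (v1 - v0) / (t1 - t0)


def cpu_total_percent(idle_samples_per_core):
    """(1 - average idle rate across cores) * 100 for one instance."""
    rates = [irate(prev, last) for prev, last in idle_samples_per_core]
    return (1 - sum(rates) / len(rates)) * 100


# Two cores scraped 15s apart: core 0 was idle for 12s, core 1 for 9s.
usage = cpu_total_percent([
    ((0, 100.0), (15, 112.0)),  # idle rate 0.8
    ((0, 200.0), (15, 209.0)),  # idle rate 0.6
])
```

The average idle rate is 0.7, so the recorded CPU usage comes out at roughly 30%, well below the 80% threshold in alert-rules.yml.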
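4. Notes on the configuration
All the component configuration files are shown above. Replace IP with your own addresses: in some places it is the IP of the Prometheus host, in others the IP of the monitored service. Once everything works, spend some time learning PromQL and alerting rules and the rest will make sense.
5. The most important part
Service probe configuration. First, the versions I use:
<registry.prometheus.version>1.8.3</registry.prometheus.version>
<micrometer.version>1.5.1</micrometer.version>
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-dependencies</artifactId>
    <version>2.2.1.RELEASE</version>
    <type>pom</type>
    <scope>import</scope>
</dependency>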
POM details for the packages used:
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
    <version>${registry.prometheus.version}</version>
</dependency>
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-core</artifactId>
    <version>${micrometer.version}</version>
</dependency>
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
Add these to your own project.
Then add a YAML configuration file:
application-prom.yml
management:
  metrics:
    export:
      prometheus:
        enabled: true
    tags:
      application: ${spring.application.name}
  endpoint:
    metrics:
      enabled: true
    prometheus:
      enabled: true
  endpoints:
    web:
      exposure:
        include: ["prometheus","health"]
Once the project is running, the probe output is available at the following URL:
http://IP:38082/actuator/prometheus
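If you would rather check the endpoint from a script than from a browser, something like the following can fetch and parse it. This is a deliberately naive sketch: it assumes each sample line is `series value` with no trailing timestamp, and the function names are my own.

```python
import urllib.request


def parse_metrics(text):
    """Parse Prometheus text-format lines into {series_string: float_value}."""
    series = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name, _, value = line.rpartition(" ")
        series[name] = float(value)
    return series


def fetch_metrics(url, timeout=5):
    """Download the actuator output and parse it."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_metrics(resp.read().decode("utf-8"))
```

Calling `fetch_metrics("http://IP:38082/actuator/prometheus")` should return a non-empty dict when the probe is wired up correctly; an HTTP 404 usually means the `prometheus` endpoint is not exposed.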