
Spring Boot: Prometheus monitoring with DingTalk alerts

Docker · 悠扬


1. Straight to the point

This post continues from the previous one: installing the Prometheus monitoring stack with a script.

Address of each component:

Grafana:    http://IP:3000/

Prometheus:    http://IP:9090/

Alertmanager:    http://IP:9093/

Prometheus Webhook Dingtalk:    http://IP:8060/UI

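2. Example configuration files for each component

I installed the components with docker-compose. Below is the directory tree of the files mounted from outside the containers.

[Screenshot: directory tree of the mounted files]

The tree above is only a reference to check against if something goes wrong. The next screenshot annotates the configuration files.

[Screenshot: annotated configuration files]

The files above are all the configuration needed for monitoring and DingTalk alerting.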

3. Configuration examples:

From top to bottom:

alertmanager.yml

global:
  resolve_timeout: 5m
route:
  receiver: webhook
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 5m
  group_by: [alertname]
  routes:
  - receiver: webhook
    group_wait: 10s
receivers:
- name: webhook
  webhook_configs:
  - url: http://IP:8060/dingtalk/webhook1/send
    send_resolved: true
# Replace IP with your own IP
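
To check the route above without waiting for a real outage, one option is to push a hand-made alert into Alertmanager through its v2 HTTP API (`POST /api/v2/alerts`). The following is a minimal sketch, not part of the original setup: the function names, the `manual-test` alert name and the label values are assumptions chosen for illustration.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def build_test_alert(alertname, instance, minutes=5):
    """Build a one-element alert list in the JSON shape used by POST /api/v2/alerts."""
    now = datetime.now(timezone.utc)
    return [{
        "labels": {"alertname": alertname, "instance": instance, "severity": "info"},
        "annotations": {"summary": f"test alert for {instance}"},
        # RFC 3339 timestamps; the alert is treated as resolved after endsAt
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(minutes=minutes)).isoformat(),
    }]


def post_alert(base_url, alerts):
    """Send the alerts to Alertmanager; base_url is for example http://IP:9093."""
    req = urllib.request.Request(
        base_url.rstrip("/") + "/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status
```

Calling `post_alert("http://IP:9093", build_test_alert("manual-test", "demo-host"))` should make a message appear in the DingTalk group after `group_wait` has elapsed, provided the webhook container is reachable.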

dingtalk

config.yml

## Request timeout
# timeout: 5s

## Customizable templates path
templates:
  - /etc/prometheus-webhook-dingtalk/templates/legacy/template.tmpl

## You can also override default template using `default_message`
## The following example to use the 'legacy' template from v0.3.0
# default_message:
#   title: '{{ template "legacy.title" . }}'
#   text: '{{ template "legacy.content" . }}'

## Targets, previously was known as "profiles"
targets:
  webhook1:
    url: https://oapi.dingtalk.com/robot/send?access_token=xxxx
    # secret for signature
    secret: xxxx
# Replace xxxx with the access token and secret of your own DingTalk robot (search online for how to create one)
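
The `secret` above is used for DingTalk's signed-request mode. prometheus-webhook-dingtalk computes the signature itself, so the sketch below is only useful for debugging the robot directly. It follows the scheme described in DingTalk's custom robot documentation (HMAC-SHA256 over `timestamp + "\n" + secret`, then Base64, then URL encoding); the function names and the demo secret are assumptions for illustration.

```python
import base64
import hashlib
import hmac
import time
import urllib.parse


def dingtalk_sign(secret, timestamp_ms):
    """Return the URL-encoded signature for a given millisecond timestamp."""
    string_to_sign = f"{timestamp_ms}\n{secret}"
    digest = hmac.new(
        secret.encode("utf-8"),
        string_to_sign.encode("utf-8"),
        hashlib.sha256,
    ).digest()
    return urllib.parse.quote_plus(base64.b64encode(digest))


def signed_url(webhook_url, secret, timestamp_ms=None):
    """Append timestamp and sign parameters to a robot webhook URL."""
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    sign = dingtalk_sign(secret, timestamp_ms)
    return f"{webhook_url}&timestamp={timestamp_ms}&sign={sign}"
```

If a manual request to the signed URL is rejected, the usual suspects are a clock that is too far off or a secret copied with stray whitespace.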

template.tmpl

A template written by someone else, which I use as-is:

{{ define "__subject" }}
[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}]
{{ .GroupLabels.SortedPairs.Values | join " " }}
{{ if gt (len .CommonLabels) (len .GroupLabels) }}
({{ with .CommonLabels.Remove .GroupLabels.Names }}{{ .Values | join " " }}{{ end }}){{ end }}{{ end }}{{ define "__alertmanagerURL" }}{{ .ExternalURL }}/#/alerts?receiver={{ .Receiver }}{{ end }}
{{ define "__text_alert_list" }}{{ range . }}

**Labels**

{{ range .Labels.SortedPairs }}> - {{ .Name }}: {{ .Value | markdown | html }}

{{ end }}

**Annotations**

{{ range .Annotations.SortedPairs }}> - {{ .Name }}: {{ .Value | markdown | html }}

{{ end }}

**Source:** [{{ .GeneratorURL }}]({{ .GeneratorURL }})

{{ end }}{{ end }}
{{ define "default.__text_alert_list" }}{{ range . }}

---

**Severity:** {{ .Labels.severity | upper }}

**Team:** {{ .Labels.team | upper }}

**Triggered at:** {{ dateInZone "2006.01.02 15:04:05" (.StartsAt) "Asia/Shanghai" }}

**Details:**

{{ range .Annotations.SortedPairs }}

> - {{ .Name }}: {{ .Value | markdown | html }}

{{ end }}

**Labels:**

{{ range .Labels.SortedPairs }}{{ if and (ne (.Name) "severity") (ne (.Name) "summary") (ne (.Name) "team") }}> - {{ .Name }}: {{ .Value | markdown | html }}

{{ end }}{{ end }}

{{ end }}

{{ end }}

{{ define "default.__text_alertresovle_list" }}{{ range . }}

---

**Severity:** {{ .Labels.severity | upper }}

**Team:** {{ .Labels.team | upper }}

**Triggered at:** {{ dateInZone "2006.01.02 15:04:05" (.StartsAt) "Asia/Shanghai" }}

**Resolved at:** {{ dateInZone "2006.01.02 15:04:05" (.EndsAt) "Asia/Shanghai" }}

**Details:**

{{ range .Annotations.SortedPairs }}

> - {{ .Name }}: {{ .Value | markdown | html }}

{{ end }}

**Labels:**

{{ range .Labels.SortedPairs }}{{ if and (ne (.Name) "severity") (ne (.Name) "summary") (ne (.Name) "team") }}> - {{ .Name }}: {{ .Value | markdown | html }}

{{ end }}{{ end }}

{{ end }}

{{ end }}
{{/* Default */}}

{{ define "default.title" }}{{ template "__subject" . }}{{ end }}

{{ define "default.content" }}
 [{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}]
 **[{{ index .GroupLabels "alertname" }}]({{ template "__alertmanagerURL" . }})**

{{ if gt (len .Alerts.Firing) 0 -}}
![Alert icon](https://ss0.bdstatic.com/70cFuHSh_Q1YnxGkpoWK1HF6hhy/it/u=3626076420,1196179712&fm=15&gp=0.jpg)

**==== Fault detected ====**

{{ template "default.__text_alert_list" .Alerts.Firing }}
{{- end }}
{{ if gt (len .Alerts.Resolved) 0 -}}

{{ template "default.__text_alertresovle_list" .Alerts.Resolved }}
{{- end }}

{{- end }}
{{/* Legacy */}}

{{ define "legacy.title" }}{{ template "__subject" . }}{{ end }}

{{ define "legacy.content" }}
[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}]
 **[{{ index .GroupLabels "alertname" }}]({{ template "__alertmanagerURL" . }})**

{{ template "__text_alert_list" .Alerts.Firing }}

{{- end }}
{{/* Following names for compatibility */}}

{{ define "ding.link.title" }}{{ template "default.title" . }}{{ end }}

{{ define "ding.link.content" }}{{ template "default.content" . }}{{ end }}

dingtalk-compose.yml

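Source package, written in Go:

Source package download

This is the docker-compose file for the DingTalk webhook. The script in the first post generated everything, but the result did not suit my setup, so I wrote a new compose file and did not update the script. For this step, first delete the existing webhook container, then recreate it with docker-compose.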

version: '3.7'
services:
  # DingTalk plugin
  webhook:
    image: timonwong/prometheus-webhook-dingtalk
    restart: "always"
    ports:
      - 8060:8060
    container_name: "webhook"           # the DingTalk token is set in config.yml
    volumes:
      - "/data/PGAWD/dingtalk/config.yml:/etc/prometheus-webhook-dingtalk/config.yml"
      - "/data/PGAWD/dingtalk/template.tmpl:/etc/prometheus-webhook-dingtalk/templates/legacy/template.tmpl"
    command:
      - '--web.enable-ui'
      - '--log.format=logfmt'
      - '--config.file=/etc/prometheus-webhook-dingtalk/config.yml'

grafana

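grafana.ini: the script copied this file out of the container.

prometheus

A note on scrape jobs: they can be split into groups, each with several targets. Configured this way, if every node under a job goes down, the alerts are grouped and sent together in one message.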

prometheus.yml

global:
  scrape_interval:     15s    # how often to scrape metrics
  evaluation_interval: 15s    # how often to evaluate rules
  scrape_timeout:      10s    # timeout for each scrape

# List of scrape configurations
scrape_configs:
  - job_name: prometheus            # required; attached automatically as the job label; must be unique
    static_configs:
      - targets: ['IP:9090']       # IP and port of Prometheus
        labels:
          instance: prometheus                 # label
#=========================== Server node monitoring ===========================================================
  - job_name: node133-linux
    static_configs:
      - targets: ['IP:9100']
        labels:
          instance: node133-node-exporter
#=========================== Front-end service node monitoring ===================================================
  - job_name: zkjy-data-statistic
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets: ['IP:38082']
        labels:
          instance: 'zkjy-data-statistic-yc-142'
          name: 'zkjy-data-statistic-yc-142'
      - targets: ['IP:38082']
        labels:
          instance: 'zkjy-data-statistic-yc-148'
          name: 'zkjy-data-statistic-yc-148'

#=========================== Front-end service node monitoring ===================================================
  - job_name: zkjy-data-original
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets: ['IP:38081']
        labels:
          instance: 'zkjy-data-original-yc-142'
          name: 'zkjy-data-original-yc-142'
      - targets: ['IP:38081']
        labels:
          instance: 'zkjy-data-original-yc-148'
          name: 'zkjy-data-original-yc-148'


alerting:                         # Alertmanager settings
  alertmanagers:
    - static_configs:
        - targets:
            - IP:9093         # address of the alerting component

rule_files:                      # alerting rule files; wildcards are allowed
  - "/etc/prometheus/rules/*.yml"

rules

alert-rules.yml

groups:
  - name: prometheus-alert
    rules:
      - alert: prometheus-down
        expr: prometheus:up == 0
        for: 1m
        labels:
          severity: "critical"
        annotations:
          summary: "instance: {{ $labels.instance }} is down"
          description: "instance: {{ $labels.instance }} \n- job: {{ $labels.job }} has been down for 1 minute."
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: prometheus-cpu-high
        expr: prometheus:cpu:total:percent > 80
        for: 3m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} CPU usage is above {{ $value }}"
          description: "instance: {{ $labels.instance }} \n- job: {{ $labels.job }} CPU usage has stayed above 80% for three minutes."
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: prometheus-cpu-iowait-high
        expr: prometheus:cpu:iowait:percent >= 12
        for: 3m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} CPU iowait is above {{ $value }}"
          description: "instance: {{ $labels.instance }} \n- job: {{ $labels.job }} CPU iowait has stayed at or above 12% for three minutes."
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: prometheus-load-load1-high
        # Note: prometheus:cpu:count is not defined in the record-rules.yml shown in this post;
        # until a rule records it, this expression returns nothing and the alert never fires.
        expr: (prometheus:load:load1) > (prometheus:cpu:count) * 1.2
        for: 3m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} load1 is above {{ $value }}"
          description: ""
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: prometheus-memory-high
        expr: prometheus:memory:used:percent > 85
        for: 3m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} memory usage is above {{ $value }}"
          description: ""
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: prometheus-disk-high
        expr: prometheus:disk:used:percent > 80
        for: 10m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} disk usage is above {{ $value }}"
          description: ""
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: prometheus-disk-read-count-high
        expr: prometheus:disk:read:count:rate > 2000
        for: 2m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} read IOPS is above {{ $value }}"
          description: ""
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: prometheus-disk-write-count-high
        expr: prometheus:disk:write:count:rate > 2000
        for: 2m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} write IOPS is above {{ $value }}"
          description: ""
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: prometheus-disk-read-mb-high
        expr: prometheus:disk:read:mb:rate > 60
        for: 2m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} disk read throughput (MB/s) is above {{ $value }}"
          description: ""
          instance: "{{ $labels.instance }}"
          value: "{{ $value }}"
      - alert: prometheus-disk-write-mb-high
        expr: prometheus:disk:write:mb:rate > 60
        for: 2m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} disk write throughput (MB/s) is above {{ $value }}"
          description: ""
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: prometheus-filefd-allocated-percent-high
        expr: prometheus:filefd_allocated:percent > 80
        for: 10m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} open file descriptors (%) are above {{ $value }}"
          description: ""
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: prometheus-network-netin-error-rate-high
        expr: prometheus:network:netin:error:rate > 4
        for: 1m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} inbound packet error rate is above {{ $value }}"
          description: ""
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: prometheus-network-netin-packet-rate-high
        expr: prometheus:network:netin:packet:rate > 35000
        for: 1m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} inbound packet rate is above {{ $value }}"
          description: ""
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: prometheus-network-netout-packet-rate-high
        expr: prometheus:network:netout:packet:rate > 35000
        for: 1m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} outbound packet rate is above {{ $value }}"
          description: ""
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: prometheus-network-tcp-total-count-high
        expr: prometheus:network:tcp:total:count > 40000
        for: 1m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} TCP connection count is above {{ $value }}"
          description: ""
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: prometheus-process-zoom-total-count-high
        # Note: prometheus:process:zoom:total:count is not defined in the record-rules.yml shown in this post.
        expr: prometheus:process:zoom:total:count > 10
        for: 10m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} zombie process count is above {{ $value }}"
          description: ""
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: prometheus-time-offset-high
        # Note: prometheus:time:offset is not defined in the record-rules.yml shown in this post.
        expr: prometheus:time:offset > 0.03
        for: 2m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} {{ $labels.desc }}  {{ $value }} {{ $labels.unit }}"
          description: ""
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
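
Every rule above combines a threshold in `expr` with a `for` duration. As a rough illustration of how that pair behaves, the sketch below replays a series of rule evaluations: an alert turns pending when the expression first matches and turns firing once it has matched continuously for at least the `for` duration. This is a simplified model written for this post, not Prometheus code, and the function name is an assumption.

```python
def alert_states(evaluations, for_seconds):
    """Replay rule evaluations and return the alert state after each one.

    evaluations: list of (timestamp_seconds, condition_is_true) tuples,
    one per evaluation_interval tick, in chronological order.
    """
    states = []
    active_since = None  # timestamp at which the condition first became true
    for ts, is_true in evaluations:
        if not is_true:
            # A single non-matching evaluation resets the timer
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = ts
        held_long_enough = ts - active_since >= for_seconds
        states.append("firing" if held_long_enough else "pending")
    return states
```

With the 15s `evaluation_interval` used in prometheus.yml, `for: 3m` means the condition has to match on every evaluation across three minutes before a notification leaves Prometheus, so a short spike never reaches DingTalk.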

node-up.yml

groups:
  - name: service-availability-alerts
    rules:
      - alert: original-up
        expr: up{job="zkjy-data-original"} == 0
        for: 15s
        labels:
          severity: "critical"
          team: "raw data service"
        annotations:
          summary: "{{ $labels.instance }} has been down for more than 15s!"
      - alert: statistic-up
        expr: up{job="zkjy-data-statistic"} == 0
        for: 15s
        labels:
          severity: "critical"
          team: "statistics service"
        annotations:
          summary: "{{ $labels.instance }} has been down for more than 15s!"

# expr: up{job="node-exporter"} == 0 means the service is offline
# for: 15s means the condition must hold for 15s
# annotations.summary is the message text
# Together: if the job (node-exporter in this example) stays offline for 15s, the alert fires
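
Because alertmanager.yml sets `group_by: [alertname]`, all instances that trip the same rule end up in one DingTalk message. The sketch below mimics that grouping on plain label dictionaries so the effect is easy to see. It is an illustration only: the function name is an assumption, and the sample labels reuse the instance names from prometheus.yml.

```python
from collections import defaultdict


def group_alerts(alerts, group_by=("alertname",)):
    """Group alert label sets by the values of the group_by labels.

    Returns a dict mapping each group key to the list of instances in it.
    """
    groups = defaultdict(list)
    for labels in alerts:
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels.get("instance", ""))
    return dict(groups)


firing = [
    {"alertname": "original-up", "instance": "zkjy-data-original-yc-142"},
    {"alertname": "original-up", "instance": "zkjy-data-original-yc-148"},
    {"alertname": "statistic-up", "instance": "zkjy-data-statistic-yc-142"},
]
grouped = group_alerts(firing)
```

Here `grouped` holds two groups: both `zkjy-data-original` instances share one notification, while the single `statistic-up` alert is sent on its own.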

record-rules.yml

groups:
  - name: prometheus-record
    rules:
    - expr: up{job!="prometheus"}
      record: prometheus:up
      labels:
        desc: "whether the node is online: 1 = online, 0 = offline"
        unit: " "
        job: "prometheus"
    - expr: time() - node_boot_time_seconds{}
      record: prometheus:node_uptime
      labels:
        desc: "node uptime"
        unit: "s"
        job: "prometheus"
##############################################################################################
#                              cpu                                                           #
    - expr: (1 - avg by (environment,instance) (irate(node_cpu_seconds_total{job!="prometheus",mode="idle"}[5m])))  * 100
      record: prometheus:cpu:total:percent
      labels:
        desc: "total CPU usage of the node, in percent"
        unit: "%"
        job: "prometheus"

    - expr: (avg by (environment,instance) (irate(node_cpu_seconds_total{job!="prometheus",mode="idle"}[5m])))  * 100
      record: prometheus:cpu:idle:percent
      labels:
        desc: "CPU idle percentage of the node"
        unit: "%"
        job: "prometheus"

    - expr: (avg by (environment,instance) (irate(node_cpu_seconds_total{job!="prometheus",mode="iowait"}[5m])))  * 100
      record: prometheus:cpu:iowait:percent
      labels:
        desc: "CPU iowait percentage of the node"
        unit: "%"
        job: "prometheus"

    - expr: (avg by (environment,instance) (irate(node_cpu_seconds_total{job!="prometheus",mode="system"}[5m])))  * 100
      record: prometheus:cpu:system:percent
      labels:
        desc: "CPU system percentage of the node"
        unit: "%"
        job: "prometheus"

    - expr: (avg by (environment,instance) (irate(node_cpu_seconds_total{job!="prometheus",mode="user"}[5m])))  * 100
      record: prometheus:cpu:user:percent
      labels:
        desc: "CPU user percentage of the node"
        unit: "%"
        job: "prometheus"

    - expr: (avg by (environment,instance) (irate(node_cpu_seconds_total{job!="prometheus",mode=~"softirq|nice|irq|steal"}[5m])))  * 100
      record: prometheus:cpu:other:percent
      labels:
        desc: "CPU percentage of the node spent in other modes"
        unit: "%"
        job: "prometheus"
##############################################################################################

##############################################################################################
#                                    memory                                                  #
    - expr: node_memory_MemTotal_bytes{job!="prometheus"}
      record: prometheus:memory:total
      labels:
        desc: "total memory of the node"
        unit: byte
        job: "prometheus"

    - expr: node_memory_MemFree_bytes{job!="prometheus"}
      record: prometheus:memory:free
      labels:
        desc: "free memory of the node"
        unit: byte
        job: "prometheus"

    - expr: node_memory_MemTotal_bytes{job!="prometheus"} - node_memory_MemFree_bytes{job!="prometheus"}
      record: prometheus:memory:used
      labels:
        desc: "used memory of the node"
        unit: byte
        job: "prometheus"

    - expr: node_memory_MemTotal_bytes{job!="prometheus"} - node_memory_MemAvailable_bytes{job!="prometheus"}
      record: prometheus:memory:actualused
      labels:
        desc: "memory actually in use by user processes on the node"
        unit: byte
        job: "prometheus"

    - expr: (1-(node_memory_MemAvailable_bytes{job!="prometheus"} / (node_memory_MemTotal_bytes{job!="prometheus"})))* 100
      record: prometheus:memory:used:percent
      labels:
        desc: "memory usage of the node, in percent"
        unit: "%"
        job: "prometheus"

    - expr: ((node_memory_MemAvailable_bytes{job!="prometheus"} / (node_memory_MemTotal_bytes{job!="prometheus"})))* 100
      record: prometheus:memory:free:percent
      labels:
        desc: "available memory of the node, in percent"
        unit: "%"
        job: "prometheus"
##############################################################################################
#                                   load                                                     #
    - expr: sum by (instance) (node_load1{job!="prometheus"})
      record: prometheus:load:load1
      labels:
        desc: "1-minute system load"
        unit: " "
        job: "prometheus"

    - expr: sum by (instance) (node_load5{job!="prometheus"})
      record: prometheus:load:load5
      labels:
        desc: "5-minute system load"
        unit: " "
        job: "prometheus"

    - expr: sum by (instance) (node_load15{job!="prometheus"})
      record: prometheus:load:load15
      labels:
        desc: "15-minute system load"
        unit: " "
        job: "prometheus"

##############################################################################################
#                                 disk                                                       #
    - expr: node_filesystem_size_bytes{job!="prometheus" ,fstype=~"ext4|xfs"}
      record: prometheus:disk:usage:total
      labels:
        desc: "total disk size of the node"
        unit: byte
        job: "prometheus"

    - expr: node_filesystem_avail_bytes{job!="prometheus",fstype=~"ext4|xfs"}
      record: prometheus:disk:usage:free
      labels:
        desc: "free disk space of the node"
        unit: byte
        job: "prometheus"

    - expr: node_filesystem_size_bytes{job!="prometheus",fstype=~"ext4|xfs"} - node_filesystem_avail_bytes{job!="prometheus",fstype=~"ext4|xfs"}
      record: prometheus:disk:usage:used
      labels:
        desc: "used disk space of the node"
        unit: byte
        job: "prometheus"

    - expr:  (1 - node_filesystem_avail_bytes{job!="prometheus",fstype=~"ext4|xfs"} / node_filesystem_size_bytes{job!="prometheus",fstype=~"ext4|xfs"}) * 100
      record: prometheus:disk:used:percent
      labels:
        desc: "disk usage of the node, in percent"
        unit: "%"
        job: "prometheus"

    - expr: irate(node_disk_reads_completed_total{job!="prometheus"}[1m])
      record: prometheus:disk:read:count:rate
      labels:
        desc: "disk read operations per second on the node"
        unit: "ops/s"
        job: "prometheus"

    - expr: irate(node_disk_writes_completed_total{job!="prometheus"}[1m])
      record: prometheus:disk:write:count:rate
      labels:
        desc: "disk write operations per second on the node"
        unit: "ops/s"
        job: "prometheus"

    # The original listing had the two byte counters swapped; read now uses read_bytes and write uses written_bytes.
    - expr: (irate(node_disk_read_bytes_total{job!="prometheus"}[1m]))/1024/1024
      record: prometheus:disk:read:mb:rate
      labels:
        desc: "device read throughput of the node in MB/s"
        unit: "MB/s"
        job: "prometheus"

    - expr: (irate(node_disk_written_bytes_total{job!="prometheus"}[1m]))/1024/1024
      record: prometheus:disk:write:mb:rate
      labels:
        desc: "device write throughput of the node in MB/s"
        unit: "MB/s"
        job: "prometheus"

##############################################################################################
#                                filesystem                                                  #
    - expr:   (1 -node_filesystem_files_free{job!="prometheus",fstype=~"ext4|xfs"} / node_filesystem_files{job!="prometheus",fstype=~"ext4|xfs"}) * 100
      record: prometheus:filesystem:used:percent
      labels:
        desc: "inode usage of the node, in percent"
        unit: "%"
        job: "prometheus"
#############################################################################################
#                                filefd                                                     #
    - expr: node_filefd_allocated{job!="prometheus"}
      record: prometheus:filefd_allocated:count
      labels:
        desc: "number of open file descriptors on the node"
        unit: "count"
        job: "prometheus"

    - expr: node_filefd_allocated{job!="prometheus"}/node_filefd_maximum{job!="prometheus"} * 100
      record: prometheus:filefd_allocated:percent
      labels:
        desc: "open file descriptors of the node, as a percentage of the maximum"
        unit: "%"
        job: "prometheus"

#############################################################################################
#                                network                                                    #
    # The source counters are in bytes, so they are multiplied by 8 to match the bit/s unit.
    - expr: avg by (environment,instance,device) (irate(node_network_receive_bytes_total{device=~"eth0|eth1|ens33|ens37"}[1m])) * 8
      record: prometheus:network:netin:bit:rate
      labels:
        desc: "bits received per second on the node's network interfaces"
        unit: "bit/s"
        job: "prometheus"

    - expr: avg by (environment,instance,device) (irate(node_network_transmit_bytes_total{device=~"eth0|eth1|ens33|ens37"}[1m])) * 8
      record: prometheus:network:netout:bit:rate
      labels:
        desc: "bits sent per second on the node's network interfaces"
        unit: "bit/s"
        job: "prometheus"

    - expr: avg by (environment,instance,device) (irate(node_network_receive_packets_total{device=~"eth0|eth1|ens33|ens37"}[1m]))
      record: prometheus:network:netin:packet:rate
      labels:
        desc: "packets received per second on the node's network interfaces"
        unit: "packets/s"
        job: "prometheus"


    - expr: avg by (environment,instance,device) (irate(node_network_transmit_packets_total{device=~"eth0|eth1|ens33|ens37"}[1m]))
      record: prometheus:network:netout:packet:rate
      labels:
        desc: "packets sent per second on the node's network interfaces"
        unit: "packets/s"
        job: "prometheus"

    - expr: avg by (environment,instance,device) (irate(node_network_receive_errs_total{device=~"eth0|eth1|ens33|ens37"}[1m]))
      record: prometheus:network:netin:error:rate
      labels:
        desc: "receive errors per second detected by the device driver"
        unit: "packets/s"
        job: "prometheus"

    - expr: avg by (environment,instance,device) (irate(node_network_transmit_errs_total{device=~"eth0|eth1|ens33|ens37"}[1m]))
      record: prometheus:network:netout:error:rate
      labels:
        desc: "transmit errors per second detected by the device driver"
        unit: "packets/s"
        job: "prometheus"

    - expr: node_tcp_connection_states{job!="prometheus", state="established"}
      record: prometheus:network:tcp:established:count
      labels:
        desc: "number of established TCP connections on the node"
        unit: "count"
        job: "prometheus"

    - expr: node_tcp_connection_states{job!="prometheus", state="time_wait"}
      record: prometheus:network:tcp:timewait:count
      labels:
        desc: "number of TCP connections in time_wait on the node"
        unit: "count"
        job: "prometheus"

    - expr: sum by (environment,instance) (node_tcp_connection_states{job!="prometheus"})
      record: prometheus:network:tcp:total:count
      labels:
        desc: "total number of TCP connections on the node"
        unit: "count"
        job: "prometheus"
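
Once Prometheus has loaded the rule files, a quick way to confirm that a recording rule produces data is to run an instant query against the HTTP API (`/api/v1/query`). The sketch below is one possible helper for that, with assumed function names. It keys the result by the `instance` label, so it only suits rules that yield one series per instance, such as `prometheus:memory:used:percent`.

```python
import json
import urllib.parse
import urllib.request


def query_url(base_url, promql):
    """Build the instant-query URL for a PromQL expression."""
    query = urllib.parse.urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + query


def parse_instant_vector(body):
    """Turn an instant-vector response body into {instance: value}."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error')}")
    return {
        sample["metric"].get("instance", ""): float(sample["value"][1])
        for sample in payload["data"]["result"]
    }


def fetch(base_url, promql):
    """Run the query against a live Prometheus, e.g. base_url = http://IP:9090."""
    with urllib.request.urlopen(query_url(base_url, promql), timeout=5) as resp:
        return parse_instant_vector(resp.read())
```

An empty dictionary from `fetch("http://IP:9090", "prometheus:up")` usually means the rule file was not loaded or no target matches the selector, which is worth checking before debugging the alert rules that depend on it.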

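4. Notes on the configuration

All component configuration files are shown above. Replace IP with your own addresses: in some places it is the IP of the Prometheus host, in others the IP of the monitored service. Once the setup works, spend some time learning PromQL and the alerting rule syntax, and the rest will make sense.

5. The most important part

Configuring the metrics endpoint in the service. First, the versions I use: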

<registry.prometheus.version>1.8.3</registry.prometheus.version>
<micrometer.version>1.5.1</micrometer.version>
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-dependencies</artifactId>
    <version>2.2.1.RELEASE</version>
    <type>pom</type>
    <scope>import</scope>
</dependency>

The dependencies in the pom:

<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
    <version>${registry.prometheus.version}</version>
</dependency>
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-core</artifactId>
    <version>${micrometer.version}</version>
</dependency>
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>

Add these to your own project.

Then add a YAML configuration file:

application-prom.yml

management:
    metrics:
        export:
            prometheus:
                enabled: true
        tags:
            application: ${spring.application.name}
    endpoint:
        metrics:
            enabled: true
        prometheus:
            enabled: true
    endpoints:
        web:
            exposure:
                include: ["prometheus","health"]

After the project starts, the metrics can be fetched from the following URL:

http://IP:38082/actuator/prometheus
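
The endpoint returns plain text in the Prometheus exposition format: comment lines starting with `#`, then one `series value` pair per line. As a small sanity check, the sketch below parses such text into a dictionary. It is a simplified reader written for this post, and it assumes the lines carry no trailing timestamp; the sample metric lines are illustrative, not output captured from the author's service.

```python
def parse_metrics(text):
    """Parse Prometheus text exposition lines into {series: value}.

    The series key keeps the label block, e.g. 'metric{label="x",}'.
    Assumes each sample line has no trailing timestamp.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The value never contains spaces, so split on the last one
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


sample_text = """\
# HELP jvm_threads_live_threads The current number of live threads
# TYPE jvm_threads_live_threads gauge
jvm_threads_live_threads{application="demo",} 42.0
process_uptime_seconds{application="demo",} 12.5
"""
parsed = parse_metrics(sample_text)
```

If the real endpoint returns an HTTP 404 instead of text like this, the `prom` profile is probably not active or `prometheus` is missing from the exposed endpoints.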

 


All rights reserved. Unless otherwise noted, all content is original and licensed under BY-NC-SA. When reposting, please credit "springboot 配置普罗米修斯监控钉钉预警".