The Prometheus Monitoring System

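Prometheus is an open-source monitoring system and time series database. One of its most important features is a multi-dimensional data model with a query language to match. Alertmanager handles alerts and is a separate component from Prometheus. PromDash is no longer maintained upstream, so Grafana is used for graphing.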

A minimal Prometheus configuration file:

global:
  scrape_interval: 10s
  scrape_timeout: 10s
  evaluation_interval: 10s
rule_files:
  - '/etc/alertmanager/alert.rules'
scrape_configs:
  - job_name: 'elk'
    static_configs:
      - targets: [ '10.202.129.101:9100',  '10.202.129.60:9100',  '10.202.129.106:9100' ]
  # https://github.com/prometheus/prometheus/blob/master/documentation/examples/prometheus-kubernetes.yml#L37
  - job_name: 'kubernetes-nodes'
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    kubernetes_sd_configs:
      - role: node
    relabel_configs:
      - source_labels: [__address__]
        regex: '(.*):10250'
        replacement: '${1}:10255'
        target_label: __address__

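global is the global configuration block. rule_files points to the alerting rule files. Because Prometheus works on a pull model, the entries under scrape_configs tell Prometheus which endpoints to pull data from.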

job_name: 'elk' is a static configuration that tells Prometheus directly which endpoints to pull metrics from. Each of these endpoints runs the corresponding Prometheus exporter, listening on the given port.

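job_name: 'kubernetes-nodes' is a service discovery configuration. Prometheus currently supports service discovery for k8s, AWS, and others. With the configuration above, Prometheus discovers every node in the k8s cluster automatically and pulls metrics from it directly.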
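
The relabel rule in the kubernetes-nodes job can be hard to read at first. The Python sketch below mimics what it does to a single __address__ value. It assumes that the regex has to match the whole value, which is how Prometheus anchors relabel regexes. The function name relabel_address is only illustrative and is not part of Prometheus.

```python
import re

# Values copied from the relabel_configs entry above.
REGEX = r"(.*):10250"
# Prometheus writes the first capture group as ${1}; Python's re uses \g<1>.
REPLACEMENT = r"\g<1>:10255"


def relabel_address(address: str) -> str:
    """Return the rewritten __address__, or the input unchanged if the regex does not match."""
    # The regex must cover the entire value, hence fullmatch.
    match = re.fullmatch(REGEX, address)
    if match is None:
        return address
    return match.expand(REPLACEMENT)


print(relabel_address("10.202.129.101:10250"))
print(relabel_address("10.202.129.101:9100"))
```

Only addresses ending in :10250 are rewritten to port 10255. Any other address passes through untouched.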

See the Prometheus configuration documentation for the details of each option.

Open http://your_server_ip:port/targets to see the status of every target.

Prometheus exporter

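Much like a zabbix agent, an exporter runs on every monitored node, collects that node's metrics, and listens on a specific port, waiting for Prometheus to connect and pull them. Many mature third-party exporters can be used as they are. The Prometheus project also provides client libraries for common languages, so you can write your own exporter. See the official documentation.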
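
To make the pull model concrete, here is a minimal sketch of a hand-written exporter that uses only the Python standard library. It is not one of the official client libraries. The metric name demo_load1, the default port 8000, and the helper names are all assumptions made for this illustration.

```python
import os
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(load1: float) -> str:
    """Format one gauge in the Prometheus text exposition format."""
    return (
        "# HELP demo_load1 1-minute load average.\n"
        "# TYPE demo_load1 gauge\n"
        f"demo_load1 {load1}\n"
    )


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Serve metrics on /metrics only; everything else is a 404.
        if self.path != "/metrics":
            self.send_error(404)
            return
        # os.getloadavg() is only available on Unix-like systems.
        body = render_metrics(os.getloadavg()[0]).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Block forever, answering scrapes on the given (hypothetical) port."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling serve() and adding the host and port to a static_configs targets list would let Prometheus scrape it. For real use, an existing exporter or an official client library is the safer choice.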

Querying metrics in Prometheus

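Prometheus stores all data as time series. Every time series is identified by a metric name together with a set of key-value pairs, which Prometheus calls labels. The metric name should be meaningful, for example http_requests_total for the total number of HTTP requests received since the service started. Labels distinguish sub-dimensions of a metric: method="POST" selects the requests that used POST, and path="/api/foo" selects the requests for one path. The full notation is http_requests_total{method="POST"} or http_requests_total{method="POST",path="/api/foo"}. Each series then consists of a sequence of samples. Every sample carries a timestamp with millisecond precision and a 64-bit floating-point value.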
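
The data model described above can be sketched in a few lines of Python. This is only an illustration of the concepts, not how Prometheus stores data internally. The names Sample and series_selector, and the sample values, are made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    timestamp_ms: int  # milliseconds since the Unix epoch
    value: float       # Python floats are 64-bit


def series_selector(name: str, labels: dict) -> str:
    """Render a metric name and label set in the notation used above."""
    if not labels:
        return name
    pairs = ",".join(f'{key}="{value}"' for key, value in sorted(labels.items()))
    return f"{name}{{{pairs}}}"


series = series_selector("http_requests_total", {"method": "POST", "path": "/api/foo"})
# Two made-up samples, ten seconds apart.
samples = [Sample(1500000000000, 1027.0), Sample(1500000010000, 1031.0)]
print(series)
print(samples)
```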

Queries can be run from the Prometheus web UI at http://your_server_ip:port/graph

node_load1 shows the 1-minute load average of every endpoint.

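Queries can also select more specific series by label. Filtering on job="elk", for example node_load1{job="elk"}, returns only the series for the elk hosts.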

The load of a single elk host can be shown as well, for example by adding a matcher on the instance label.

Besides equality (=), label matchers also support inequality (!=), regular expression match (=~), and negated regular expression match (!~).
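
The four matcher types behave roughly as in the Python sketch below. It assumes that regex matchers must match the entire label value. Prometheus uses RE2 syntax, so Python's re module is only an approximation. label_matches is a hypothetical helper, not a Prometheus API.

```python
import re


def label_matches(value: str, op: str, pattern: str) -> bool:
    """Evaluate one label matcher against a label value."""
    if op == "=":
        return value == pattern
    if op == "!=":
        return value != pattern
    if op == "=~":
        # The regex has to cover the whole value, not just a substring.
        return re.fullmatch(pattern, value) is not None
    if op == "!~":
        return re.fullmatch(pattern, value) is None
    raise ValueError(f"unknown matcher operator: {op}")


instance = "10.202.129.101:9100"
print(label_matches(instance, "=", "10.202.129.101:9100"))
print(label_matches(instance, "=~", r"10\.202\.129\..*"))
print(label_matches(instance, "!~", r".*:9100"))
```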

Calculating Rates and Other Derivatives

In this section, we will learn how to calculate rates or deltas of a metric over time.

One of the most frequent functions you will use in Prometheus is rate(). Instead of calculating event rates directly in the instrumented service, in Prometheus it is common to track events using raw counters and to let the Prometheus server calculate rates ad-hoc during query time (this has a number of advantages, such as not losing rate spikes between scrapes, as well as being able to choose dynamic averaging windows at query time). Counters start at 0 when a monitored service starts and get continuously incremented over the service process’s lifetime. Occasionally, when a monitored process restarts, its counters reset to 0 and begin climbing again from there. Graphing raw counters is usually not very useful, as you will only see an ever-increasing line with occasional resets. You can see that by graphing the demo service’s API request counts:

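http_requests_total{job="elk"} returns the most recent value of each series. Because this value is a counter that only ever increases, graphing it directly tells you little: the graph is just a rising line.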

http_requests_total{job="elk"}[5m] returns the values from the last five minutes.

The average per-second rate of increase over the last five minutes:

rate(http_requests_total{job="elk"}[5m])
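
As a rough idea of what rate() computes, the Python sketch below derives a per-second rate from raw counter samples and compensates for counter resets. It is a simplification: the real function also extrapolates to the edges of the window, which is left out here. simple_rate is a hypothetical name.

```python
def simple_rate(samples):
    """Per-second rate from (timestamp_seconds, counter_value) pairs, oldest first.

    Returns None when there are fewer than two samples or no time has elapsed.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        if cur < prev:
            # Counter reset: the process restarted from 0, so the whole
            # current value counts as new increase.
            increase += cur
        else:
            increase += cur - prev
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None
    return increase / elapsed


# One request per second, no resets.
print(simple_rate([(0, 0), (60, 60), (120, 120)]))
# The counter resets between the second and third sample.
print(simple_rate([(0, 100), (60, 160), (120, 20)]))
```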

Prometheus also supports functions such as irate and deriv, and aggregation operators such as sum.

Starting Prometheus

Whether you run the binary directly or start a Docker container, you can pass startup flags:

-storage.local.retention=6h -storage.local.memory-chunks=500000 -config.file=/etc/prometheus/prometheus.yml -alertmanager.url=http://alertmanager:9093

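storage.local.retention sets how long data is kept, storage.local.memory-chunks sets the number of chunks kept in memory, config.file points to the configuration file, and alertmanager.url gives the Alertmanager address. Prometheus passes the alerts it needs to send to that address, and Alertmanager handles them from there.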

Alertmanager

After Alertmanager receives an alert from Prometheus, it routes the alert to the matching receiver, such as email, PagerDuty, OpsGenie, or Slack.

Alertmanager can also apply grouping, inhibition, and silences to alerts. See the official documentation.

The alerting rules themselves are referenced from the Prometheus configuration file:

   rule_files:
      - '/etc/alertmanager/alert.rules'

An alerting rule has the following format:

ALERT <alert name>
  IF <expression>
  [ FOR <duration> ]
  [ LABELS <label set> ]
  [ ANNOTATIONS <label set> ]

alert.rules:

    ALERT InstanceHighCpu # alert name
    IF 100-(avg by (instance) (irate(node_cpu{mode="idle"}[5m]))*100) > 5
    # Prometheus query expression; the alert condition is met while it is true
    FOR 10m # the condition must hold for 10 minutes before the alert is sent
    LABELS {severity = "page"}
    ANNOTATIONS {
      summary = "Instance {{$labels.instance}}: cpu high", # short summary
      description = "{{$labels.instance}} has high cpu activity" # alert details
    }

    ALERT InstanceLowMemory
    IF node_memory_MemAvailable < 268435456
    FOR 10m
    LABELS {severity = "page"}
    ANNOTATIONS {
      summary = "Instance {{$labels.instance}}: memory low",
      description = "{{$labels.instance}} has less than 256M memory available"
    }

    ALERT InstanceLowDisk
    IF node_filesystem_avail{mountpoint="/etc/hosts"} < 10737418240
    FOR 10m
    LABELS {severity = "page"}
    ANNOTATIONS {
      summary = "Instance {{$labels.instance}}: low disk space",
      description = "{{$labels.instance}} has less than 10G space"
    }

See the official documentation.
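
The byte thresholds in these rules are easier to check when written out. The short Python sketch below reproduces the arithmetic behind the memory and disk limits and the CPU expression. The helper cpu_usage_percent and the idle fraction passed to it are illustrative assumptions.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3

# 256 MiB, the limit used by InstanceLowMemory.
low_memory_threshold = 256 * MIB
# 10 GiB, the limit used by the low disk rule.
low_disk_threshold = 10 * GIB


def cpu_usage_percent(idle_fraction: float) -> float:
    """Mirror 100 - (average idle rate * 100) from the CPU rule."""
    return 100 - idle_fraction * 100


print(low_memory_threshold)
print(low_disk_threshold)
# A node that is idle 75% of the time is 25% busy, well above the threshold of 5.
print(cpu_usage_percent(0.75))
```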

Once everything is running, the alerts appear in the Alertmanager UI at http://your_alertmanager:port/#/alerts

The Prometheus UI also shows the state of each alert at http://your_prometheus:port/alerts

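PENDING means the condition is true but has not yet held for the FOR duration, so the alert has not been sent. FIRING means the alert has been sent.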
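
The way an alert moves between these states can be sketched as a small state machine. The Python below is a simplified model and not the Prometheus implementation. It assumes the rule is evaluated at a fixed interval, and the function name alert_states and the "inactive" label for a false condition are choices made for this example.

```python
def alert_states(condition_by_eval, eval_interval_s, for_s):
    """Return 'inactive', 'pending' or 'firing' for each rule evaluation."""
    states = []
    active_since = None  # time at which the condition first became true
    for i, condition in enumerate(condition_by_eval):
        now = i * eval_interval_s
        if not condition:
            # The condition cleared, so the FOR timer starts over.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held_for = now - active_since
        states.append("firing" if held_for >= for_s else "pending")
    return states


# Evaluate every 10 seconds with a 20-second FOR duration.
print(alert_states([False, True, True, True, False], 10, 20))
```

With evaluation_interval: 10s and FOR 10m as in the configuration above, the condition would have to stay true across roughly sixty consecutive evaluations before the alert fires.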

Because the official PromDash is no longer maintained, Grafana is generally used for graphing.
