Notes on the Prometheus + Grafana Monitoring I Built for My Company (Part 1) - 周刊幻想zZZ

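Over the past month I built my company's network device monitoring from scratch. This post covers how I set it up and some of the technical difficulties I ran into.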

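Early stage: testing in a small virtual environment

I had never set up Prometheus or Grafana myself before, so building this system was also a learning process. Since it will eventually run in production, every step had to be taken carefully.

So, as a first step, I deployed a few virtual machines on my own laptop to try out the recommended practices hands-on: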

Prometheus host: Ubuntu 20.04

Install dependencies

apt install -y wget curl vim net-tools telnet htop

Create the user

useradd -r -M -s /sbin/nologin prometheus

Create directories

mkdir -p /data/prometheus/{data,config,rules,exporters}
mkdir -p /data/grafana/{data,plugins,dashboards}
mkdir -p /data/alertmanager/{data,templates}
mkdir -p /var/log/prometheus

Set permissions

chown -R prometheus:prometheus /data/prometheus
chown -R prometheus:prometheus /var/log/prometheus

Download the components

cd /usr/local/src

# Prometheus
PROM_VERSION="3.5.0"
wget "https://github.com/prometheus/prometheus/releases/download/v${PROM_VERSION}/prometheus-${PROM_VERSION}.linux-amd64.tar.gz"

# Alertmanager
AM_VERSION="0.29.0"
wget "https://github.com/prometheus/alertmanager/releases/download/v${AM_VERSION}/alertmanager-${AM_VERSION}.linux-amd64.tar.gz"

# Node Exporter
NE_VERSION="1.10.2"
wget "https://github.com/prometheus/node_exporter/releases/download/v${NE_VERSION}/node_exporter-${NE_VERSION}.linux-amd64.tar.gz"

Extract and install

tar -zxvf prometheus-${PROM_VERSION}.linux-amd64.tar.gz
tar -zxvf alertmanager-${AM_VERSION}.linux-amd64.tar.gz
tar -zxvf node_exporter-${NE_VERSION}.linux-amd64.tar.gz

mv ./prometheus-${PROM_VERSION}.linux-amd64 /usr/local/prometheus
mv ./alertmanager-${AM_VERSION}.linux-amd64 /usr/local/alertmanager
mv ./node_exporter-${NE_VERSION}.linux-amd64 /usr/local/node_exporter

Create symlinks

ln -s /usr/local/prometheus/prometheus /usr/local/bin/prometheus
ln -s /usr/local/prometheus/promtool /usr/local/bin/promtool
ln -s /usr/local/alertmanager/alertmanager /usr/local/bin/alertmanager
ln -s /usr/local/node_exporter/node_exporter /usr/local/bin/node_exporter

Verify the installation

prometheus --version
alertmanager --version
node_exporter --version

Expected output:

root@ubuntu:/usr/local/src# alertmanager --version
alertmanager, version 0.29.0 (branch: HEAD, revision: 2f0cff51fd1cc761eeb671db43736341ca2ab511)
  build user:       root@f4d6cb29d2f5
  build date:       20251104-13:09:23
  go version:       go1.25.3
  platform:         linux/amd64
  tags:             netgo
root@ubuntu:/usr/local/src# prometheus --version
prometheus, version 3.5.0 (branch: HEAD, revision: 8be3a9560fbdd18a94dedec4b747c35178177202)
  build user:       root@4451b64cb451
  build date:       20250714-16:15:23
  go version:       go1.24.5
  platform:         linux/amd64
  tags:             netgo,builtinassets
root@ubuntu:/usr/local/src# node_exporter --version
node_exporter, version 1.10.2 (branch: HEAD, revision: 654f19dee6a0c41de78a8d6d870e8c742cdb43b9)
  build user:       root@b29b4019149a
  build date:       20251025-20:05:32
  go version:       go1.25.3
  platform:         linux/amd64
  tags:             unknown

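Minimal instance

The steps above installed the components and verified that the binaries run. The next step is to configure them and bring up a minimal working instance.

The instance covers:

A Prometheus configuration file (scraping its own metrics plus one local Node Exporter)
Starting Prometheus and Alertmanager
Basic alert rules
Verifying data collection and alerting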

Configuration file

vim /data/prometheus/config/prometheus.yml

global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "/data/prometheus/rules/*.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["localhost:9093"]

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  - job_name: "node"
    static_configs:
      - targets: ["localhost:9100"]

Alert rules

vim /data/prometheus/rules/host_alerts.yml

groups:
- name: host-monitoring
  rules:

  - alert: HighPerCoreCPUUsage
    expr: 100 - (min by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "High per-core CPU usage on {{ $labels.instance }}"
      description: "At least one CPU core is above 80% usage for more than 2 minutes."

  - alert: HostDown
    expr: up == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Host {{ $labels.instance }} is down"
      description: "Scrape target is no longer reachable."
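
To make the HighPerCoreCPUUsage expression easier to read, here is a minimal Python sketch of the same arithmetic. The function names and the sample idle rates are made up for illustration; Prometheus evaluates the real expression itself.

```python
def busiest_core_usage(idle_rates):
    """Return the usage (in percent) of the busiest core.

    idle_rates maps a cpu label to its idle rate, i.e. the fraction of each
    second the core spent idle over the window. This is what
    rate(node_cpu_seconds_total{mode="idle"}[5m]) yields per core.
    """
    # min by (instance): the core with the lowest idle rate is the busiest one
    lowest_idle = min(idle_rates.values())
    return 100 - lowest_idle * 100


def should_alert(idle_rates, threshold=80):
    # Mirrors the "> 80" comparison in the rule
    return busiest_core_usage(idle_rates) > threshold


# Hypothetical 4-core host where core "2" is almost fully busy
sample = {"0": 0.875, "1": 0.75, "2": 0.125, "3": 0.5}
print(busiest_core_usage(sample))
print(should_alert(sample))
```

Because the rule takes the minimum idle rate, a single saturated core is enough to trigger it even when the host as a whole is mostly idle.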

Create the node_exporter user

useradd --no-create-home --shell /bin/false node_exporter

Start the local node_exporter

vim /etc/systemd/system/node_exporter.service

[Unit]
Description=Node Exporter
After=network.target

[Service]
User=node_exporter
ExecStart=/usr/local/bin/node_exporter
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target

sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter

Verify:

root@ubuntu:/data/prometheus# systemctl status node_exporter
● node_exporter.service - Node Exporter
   Loaded: loaded (/etc/systemd/system/node_exporter.service; enabled; vendor preset: enabled)
   Active: active (running) since Tue 2025-11-11 19:03:40 PST; 1min 27s ago
 Main PID: 94856 (node_exporter)
   CGroup: /system.slice/node_exporter.service
           └─94856 /usr/local/bin/node_exporter

Nov 11 19:03:40 ubuntu node_exporter[94856]: time=2025-11-11T19:03:40.773-08:00 level=INFO source=node_exporter.go:141 msg=ti
Nov 11 19:03:40 ubuntu node_exporter[94856]: time=2025-11-11T19:03:40.773-08:00 level=INFO source=node_exporter.go:141 msg=ti
Nov 11 19:03:40 ubuntu node_exporter[94856]: time=2025-11-11T19:03:40.773-08:00 level=INFO source=node_exporter.go:141 msg=ud
Nov 11 19:03:40 ubuntu node_exporter[94856]: time=2025-11-11T19:03:40.773-08:00 level=INFO source=node_exporter.go:141 msg=un
Nov 11 19:03:40 ubuntu node_exporter[94856]: time=2025-11-11T19:03:40.773-08:00 level=INFO source=node_exporter.go:141 msg=vm
Nov 11 19:03:40 ubuntu node_exporter[94856]: time=2025-11-11T19:03:40.773-08:00 level=INFO source=node_exporter.go:141 msg=wa
Nov 11 19:03:40 ubuntu node_exporter[94856]: time=2025-11-11T19:03:40.773-08:00 level=INFO source=node_exporter.go:141 msg=xf
Nov 11 19:03:40 ubuntu node_exporter[94856]: time=2025-11-11T19:03:40.773-08:00 level=INFO source=node_exporter.go:141 msg=zf
Nov 11 19:03:40 ubuntu node_exporter[94856]: time=2025-11-11T19:03:40.774-08:00 level=INFO source=tls_config.go:346 msg="List
Nov 11 19:03:40 ubuntu node_exporter[94856]: time=2025-11-11T19:03:40.774-08:00 level=INFO source=tls_config.go:349 msg="TLS

root@ubuntu:/data/prometheus# curl http://localhost:9100/metrics | head
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0# HELP go_gc_duration_seconds A summary of the wall-time pause (stop-the-world) duration in garbage collection cycles.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 5.4281e-05
go_gc_duration_seconds{quantile="0.25"} 5.4281e-05
go_gc_duration_seconds{quantile="0.5"} 5.4281e-05
go_gc_duration_seconds{quantile="0.75"} 5.4281e-05
go_gc_duration_seconds{quantile="1"} 5.4281e-05
go_gc_duration_seconds_sum 5.4281e-05
go_gc_duration_seconds_count 1
# HELP go_gc_gogc_percent Heap size target percentage configured by the user, otherwise 100. This value is set by the GOGC environment variable, and the runtime/debug.SetGCPercent function. Sourced from /gc/gogc:percent.
100 24332    0 24332    0     0  1382k      0 --:--:-- --:--:-- --:--:-- 1485k
curl: (23) Failed writing body (0 != 2048)

Configure Alertmanager and start it temporarily

vim /data/alertmanager/alertmanager.yml

global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'instance']
  group_wait: 10s
  group_interval: 30s
  repeat_interval: 1h
  receiver: 'default'

receivers:
- name: 'default'
  # No notification channel is configured yet; alerts are only viewed in the web UI

alertmanager --config.file=/data/alertmanager/alertmanager.yml

Start Prometheus

sudo -u prometheus prometheus --config.file=/data/prometheus/config/prometheus.yml --storage.tsdb.path=/data/prometheus --web.listen-address=":9090" --web.enable-lifecycle

Set it up as a service:

vim /etc/systemd/system/prometheus.service

[Unit]
Description=Prometheus Time Series Collection and Processing Server
Documentation=https://prometheus.io/docs/
Wants=network-online.target
After=network-online.target

[Service]
Type=simple
User=prometheus
Group=prometheus
ExecStart=/usr/local/bin/prometheus \
  --config.file=/data/prometheus/config/prometheus.yml \
  --storage.tsdb.path=/data/prometheus \
  --web.listen-address=:9090 \
  --web.enable-lifecycle
ExecReload=/bin/kill -HUP $MAINPID
TimeoutStopSec=20s
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target

Open the web UI

Raise the CPU load to trigger a test alert

yes > /dev/null &

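In the monitoring UI, the state of the CPU usage alert changes from Inactive to Firing

The alert also shows up on the Alertmanager page:

Once the test is done, kill the yes process: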

root@ubuntu:~# killall yes
[1]+  Terminated              yes > /dev/null

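Instance 2: adding a host

Create an Alpine virtual machine as the test target. This is the host whose metrics we want to monitor.

Host specifications:

Version	3.21
CPU cores	1
RAM	768 MB
Storage	8 GB

Preparation

The Alpine VM runs from RAM by default, so the network interface has to be configured first (alternatively, run setup-alpine to install it to disk):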

vi /etc/network/interfaces

auto eth0
iface eth0 inet static
address 192.168.172.134
netmask 255.255.255.0
gateway 192.168.172.2

Then set the hostname:

echo 'slave1' > /etc/hostname
hostname -F /etc/hostname # takes effect immediately

Configure the DNS servers:

vi /etc/resolv.conf

nameserver 114.114.114.114
nameserver 8.8.8.8

Finally, restart the networking service:

/etc/init.d/networking restart

Ping the Prometheus host to confirm connectivity:

alpine3:~# ping 192.168.172.133
PING 192.168.172.133 (192.168.172.133): 56 data bytes
64 bytes from 192.168.172.133: seq=0 ttl=64 time=2.049 ms
64 bytes from 192.168.172.133: seq=1 ttl=64 time=0.544 ms
64 bytes from 192.168.172.133: seq=2 ttl=64 time=0.471 ms
64 bytes from 192.168.172.133: seq=3 ttl=64 time=0.417 ms
64 bytes from 192.168.172.133: seq=4 ttl=64 time=0.496 ms

Install node_exporter

Download the package

NE_VERSION="1.10.2"
wget "https://github.com/prometheus/node_exporter/releases/download/v${NE_VERSION}/node_exporter-${NE_VERSION}.linux-amd64.tar.gz"

Install

# Extract
tar xvfz node_exporter-${NE_VERSION}.linux-amd64.tar.gz
# Move the binary into the system path (for example /usr/local/bin)
mv node_exporter-${NE_VERSION}.linux-amd64/node_exporter /usr/local/bin/
# Clean up
rm -rf node_exporter-${NE_VERSION}.linux-amd64*

Create the node_exporter account

adduser -D -s /sbin/nologin node_exporter

Create the service

vi /etc/init.d/node_exporter

#!/sbin/openrc-run

name="node_exporter"
description="Prometheus Node Exporter"
command="/usr/local/bin/node_exporter"
command_args=""
command_background=true
pidfile="/var/run/node_exporter.pid"
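# Run as the unprivileged account (openrc-run reads command_user, not user/group)
command_user="node_exporter:node_exporter"
# Redirect stdout and stderr to a log file
output_log="/var/log/node_exporter.log"
error_log="/var/log/node_exporter.log"

depend() {
    need net
    after firewall
}

start_pre() {
    # Make sure the log file exists with the right owner and mode
    checkpath --file --owner "$command_user" --mode 0644 "$output_log"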
}

Start

chmod +x /etc/init.d/node_exporter
chown root:root /etc/init.d/node_exporter

# Add to the default runlevel so it starts on boot
rc-update add node_exporter default

# Start the service
rc-service node_exporter start

Verify

# Query the metrics (locally, via curl)
curl http://localhost:9100/metrics|head

Add the Alpine node on the Prometheus server

scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets:
          - 'localhost:9100'             # local Node Exporter (if present)
          - '192.168.172.134:9100'       # newly added Alpine host

Trigger the alert

yes >/dev/null &

The alert fires as expected:

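Instance 3: monitoring a switch with snmp_exporter

Here I used the eNSP virtual environment to build a typical Layer 3 network:

Return route problem

After configuring the switch, PC1 through PC4 could not ping 192.168.172.1, 192.168.172.133, or 192.168.172.134. Packet captures on GE1/0/2 and GE1/0/0 of the CE12800 confirmed that the forward path was fine and that the return packets were being dropped at the CE12800.

The analysis: the virtual network created by VMware sits behind a single virtual switch whose default gateway is 192.168.172.2. That gateway only provides NAT and DHCP. It does not answer ICMP and has no routing function. As a result, only hosts in the same subnet can reach each other, at Layer 2. For Layer 3 reachability, static routes have to be added by hand on each host, with the next hop pointing to GE1/0/2, the Layer 3 interface of the CE12800 in the topology diagram.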

# Alpine: add persistent static routes
vi /etc/network/interfaces

auto lo
iface lo inet loopback

auto eth0
iface eth0 inet static
        address 192.168.172.134
        netmask 255.255.255.0
        gateway 192.168.172.2
        up route add -net 192.168.10.0/24 gw 192.168.172.3
        up route add -net 192.168.20.0/24 gw 192.168.172.3

# Restart the networking service
rc-service networking restart

# Windows: add persistent static routes
route add 192.168.10.0 mask 255.255.255.0 192.168.172.3 -p
route add 192.168.20.0 mask 255.255.255.0 192.168.172.3 -p

# ubuntu16: add persistent static routes
vi /etc/network/interfaces

auto lo
iface lo inet loopback

auto ens38
iface ens38 inet static
    address 192.168.172.133
    netmask 255.255.255.0
    gateway 192.168.172.2
    dns-nameservers 114.114.114.114 8.8.8.8
    # Add static routes
    up ip route add 192.168.10.0/24 via 192.168.172.3 dev ens38
    up ip route add 192.168.20.0/24 via 192.168.172.3 dev ens38

# Restart the networking service
/etc/init.d/networking restart

All devices can now reach each other.
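
The effect of the static routes can be illustrated with a small longest-prefix-match lookup. This is only a sketch of how a host picks a next hop, not how the kernel implements it. The route entries mirror the Alpine configuration above, while the destination 192.168.10.5 and the "on-link" marker are made up for illustration.

```python
import ipaddress


def next_hop(routes, destination):
    """Pick the next hop for destination using longest-prefix match."""
    dest = ipaddress.ip_address(destination)
    candidates = [
        (ipaddress.ip_network(prefix), hop)
        for prefix, hop in routes
        if dest in ipaddress.ip_network(prefix)
    ]
    if not candidates:
        return None
    # The most specific (longest) matching prefix wins
    return max(candidates, key=lambda item: item[0].prefixlen)[1]


# Before the change: only the default gateway and the directly connected subnet
before = [
    ("0.0.0.0/0", "192.168.172.2"),
    ("192.168.172.0/24", "on-link"),
]
# After the change: two static routes via the switch's Layer 3 interface
after = before + [
    ("192.168.10.0/24", "192.168.172.3"),
    ("192.168.20.0/24", "192.168.172.3"),
]

print(next_hop(before, "192.168.10.5"))  # goes to the NAT gateway, which cannot route it
print(next_hop(after, "192.168.10.5"))   # goes to the switch
```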

Enable SNMP on the switch

[SNMP-Switch] snmp-agent
[SNMP-Switch] snmp-agent sys-info version v2c
[SNMP-Switch] snmp-agent community read <a complex community string of at least 8 characters>
[SNMP-Switch] snmp-agent community write <a complex community string of at least 8 characters>

# Create the ACL
[SNMP-Switch] acl 2000
[SNMP-Switch-acl-basic-2000] rule 5 permit source 192.168.172.133 0
[SNMP-Switch-acl-basic-2000] rule 10 deny source any
[SNMP-Switch-acl-basic-2000] quit

[SNMP-Switch] snmp-agent community read <community string> acl 2000
[SNMP-Switch] snmp-agent community write <community string> acl 2000

Test from the Ubuntu host:

root@ubuntu:/usr/local/bin# snmpwalk -v 2c -c <community string> 192.168.172.3 1.3.6.1.2.1.1.1.0
iso.3.6.1.2.1.1.1.0 = STRING: "Huawei Versatile Routing Platform Software
VRP (R) software, Version 8.130 (CE12800 V800R013C00SPC560B560)
Copyright (C) 2012-2016 Huawei Technologies Co., Ltd.
HUAWEI CE12800
"

Install SNMP Exporter

Install SNMP Exporter on the Prometheus server:

cd /usr/local/src

# Download SNMP Exporter
SNMP_VERSION="0.29.0"
wget "https://github.com/prometheus/snmp_exporter/releases/download/v${SNMP_VERSION}/snmp_exporter-${SNMP_VERSION}.linux-amd64.tar.gz"

# Extract and install
tar -zxvf snmp_exporter-${SNMP_VERSION}.linux-amd64.tar.gz
mv snmp_exporter-${SNMP_VERSION}.linux-amd64 /usr/local/snmp_exporter
ln -s /usr/local/snmp_exporter/snmp_exporter /usr/local/bin/snmp_exporter

# Create the user
useradd -r -M -s /sbin/nologin snmp_exporter

Configure SNMP Exporter

Create the SNMP Exporter configuration file /data/prometheus/config/snmp.yml:

auths:
  public_v1:
    version: 1
  public_v2:
    version: 2

modules:
  # Interface information
  if_mib:
    walk:
      - "IF-MIB::interfaces"

  # System information
  system:
    walk:
      - "SNMPv2-MIB::system"

chown snmp_exporter:snmp_exporter /data/prometheus/config/snmp.yml
chmod 644 /data/prometheus/config/snmp.yml
chmod 755 /usr/local/bin/snmp_exporter

Create the SNMP Exporter service

Create the systemd unit file /etc/systemd/system/snmp_exporter.service:

[Unit]
Description=SNMP Exporter
After=network.target

[Service]
User=snmp_exporter
ExecStart=/usr/local/bin/snmp_exporter --config.file=/data/prometheus/config/snmp.yml
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target

Start the service

systemctl daemon-reload
systemctl enable --now snmp_exporter
systemctl status snmp_exporter

root@ubuntu:/data/prometheus/config# systemctl status snmp_exporter
● snmp_exporter.service - SNMP Exporter
   Loaded: loaded (/etc/systemd/system/snmp_exporter.service; enabled; vendor preset: enabled)
   Active: active (running) since Sun 2025-11-16 16:00:40 CST; 1min 48s ago
 Main PID: 10879 (snmp_exporter)
   CGroup: /system.slice/snmp_exporter.service
           └─10879 /usr/local/bin/snmp_exporter --config.file=/data/prometheus/config/snmp.yml

Add it to the Prometheus configuration

global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "/data/prometheus/rules/*.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["localhost:9093"]

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  - job_name: "node"
    static_configs:
      - targets: 
          - "localhost:9100"
          - "192.168.172.134:9100"

  # Add SNMP monitoring
  - job_name: "snmp-switch"
    static_configs:
      - targets:
          - "192.168.172.3"  # IP address of the switch
    metrics_path: /snmp
    params:
      module: [if_mib]  # use the if_mib module; more modules can be added later
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: localhost:9116  # address of the SNMP Exporter
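
The three relabel rules are easier to follow once you see the URL Prometheus ends up scraping. The sketch below only builds that URL from the values in the job above; the function name and its defaults are my own, and the order of the query parameters Prometheus actually sends may differ.

```python
from urllib.parse import urlencode


def snmp_scrape_url(target, module, exporter="localhost:9116", metrics_path="/snmp"):
    """Build the URL that results from the relabel rules in the snmp-switch job.

    1. __address__ (the switch IP) is copied into the `target` URL parameter.
    2. The same value becomes the `instance` label (not part of the URL).
    3. __address__ is replaced with the exporter's own address.
    """
    params = {"module": module, "target": target}
    return f"http://{exporter}{metrics_path}?{urlencode(params)}"


print(snmp_scrape_url("192.168.172.3", "if_mib"))
```

In other words, Prometheus never talks to the switch directly. It asks the exporter on localhost:9116 to query the switch, and the instance label keeps the series attributed to the switch's IP.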

Reload the Prometheus configuration through its HTTP API:

curl -X POST http://localhost:9090/-/reload

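SNMP metrics now show up, but they are only the exporter's own metrics. We still need to run the generator against the switch's MIB files to produce a snmp.yml that replaces the current one.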

# Download the Go toolchain:
wget https://golang.google.cn/dl/go1.21.0.linux-amd64.tar.gz
# Extract
tar -C /usr/local -xzf go1.21.0.linux-amd64.tar.gz
# Clean up
rm go1.21.0.linux-amd64.tar.gz

# Configure environment variables
vim ~/.bashrc
# Append at the end:
export PATH=$PATH:/usr/local/go/bin
export GOPATH=$HOME/go
export PATH=$PATH:$GOPATH/bin
# Apply the changes
source ~/.bashrc

# Switch to a mirror proxy (run this after PATH is set, so the go command is found)
go env -w GOPROXY=https://goproxy.cn,direct

# Enter the generator directory (download it from GitHub if it is missing)
cd /usr/local/snmp_exporter/generator
# Fetch dependencies with Go modules
go mod init snmp_exporter_generator  # only if there is no go.mod file yet
go mod tidy
# Install the SNMP development dependencies
apt install libsnmp-dev snmp-mibs-downloader

# Build
go build

# If the build fails, the C/C++ toolchain may be missing:
apt install gcc g++

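A new executable now appears in the directory:

Run the generator, ignoring the parse errors reported during generation:

./generator generate -m mibs/MIBs --no-fail-on-parse-errors

Replace the snmp.yml in the Prometheus config directory with the generated one:

Restart snmp_exporter:

systemctl restart snmp_exporter

Verify: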

root@ubuntu:/data/prometheus/config# curl "http://ubuntu:9116/snmp?auth=public_v2&module=huawei_common&module=huawei_core&target=192.168.172.3"
# HELP hwStorageSpace Specifies the total size of the storage devices indexed by hwStorageTable. - 1.3.6.1.4.1.2011.6.9.1.4.2.1.3
# TYPE hwStorageSpace gauge
hwStorageSpace{hwStorageIndex="1",hwStorageName="cfcard:"} 1.75414e+06
hwStorageSpace{hwStorageIndex="2",hwStorageName="cfcard2:"} 1.75414e+06
...

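SNMP metrics are collected successfully.
However, the interface traffic counters stayed at 0. I suspected a simulator bug and decided to continue testing on a physical switch.

H3C S5130 switch

Enable SNMP:

snmp-agent
snmp-agent sys-info version all
snmp-agent community read <community string>
snmp-agent community write <community string>
save

Configure a Vlan-interface and give the computer an IP address in the same subnet: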

interface Vlan-interface 100
ip address 192.168.100.254 255.255.255.0

On the Ubuntu host:

root@ubuntu:~# snmpwalk -v 2c -c <community string> 192.168.100.254 1.3.6.1.2.1.1.5.0
iso.3.6.1.2.1.1.5.0 = STRING: "H3C"

Edit generator.yml:

---
auths:
  # name of the auth entry
  public_v2:
    # SNMP v2c
    version: 2
    # SNMP community string
    community: ******

modules:
  H3C:
    walk:
      - 1.3.6.1.2.1.1.1                     #sysDescr
      - 1.3.6.1.2.1.1.3                     #sysUpTimeInstance
      - 1.3.6.1.2.1.1.5                     #sysName
      - 1.3.6.1.2.1.2.2.1.1                 #ifIndex
      - 1.3.6.1.2.1.2.2.1.2                 #IfDescr
      - 1.3.6.1.2.1.31.1.1.1.1              #ifName
      - 1.3.6.1.2.1.31.1.1.1.6              #ifHCInOctets
      - 1.3.6.1.2.1.31.1.1.1.10             #ifHCOutOctets
      - 1.3.6.1.2.1.47.1.1.1.1.2            #entPhysicalDescr
      - 1.3.6.1.2.1.47.1.1.1.1.5            #entPhysicalClass
      - 1.3.6.1.2.1.47.1.1.1.1.7            #entPhysicalName
      - 1.3.6.1.4.1.25506.2.6.1.1.1.1.6     #hh3cEntityExtCpuUsage
      - 1.3.6.1.4.1.25506.2.6.1.1.1.1.8     #hh3cEntityExtMemUsage
      - 1.3.6.1.4.1.25506.2.6.1.1.1.1.12    #hh3cEntityExtTemperature
      - 1.3.6.1.4.1.25506.8.35.9.1.1.1.2    #hh3cDevMFanStatus
      - 1.3.6.1.4.1.25506.8.35.9.1.2.1.2    #hh3cDevMPowerStatus
    max_repetitions: 3
    retries: 3
    timeout: 25s
    # The SNMP version and credentials come from the top-level auths entry (public_v2);
    # recent snmp_exporter releases no longer accept version/auth inside a module.

    lookups:
      - source_indexes: [ifIndex]
        lookup: 1.3.6.1.2.1.2.2.1.2 # ifDescr
      - source_indexes: [ifIndex]
        lookup: 1.3.6.1.2.1.31.1.1.1.1 # ifName
      - source_indexes: [hh3cEntityExtPhysicalIndex]
        lookup: 1.3.6.1.2.1.47.1.1.1.1.2  #entPhysicalDescr
      - source_indexes: [hh3cEntityExtPhysicalIndex]
        lookup: 1.3.6.1.2.1.47.1.1.1.1.5  #entPhysicalClass
      - source_indexes: [hh3cEntityExtPhysicalIndex]
        lookup: 1.3.6.1.2.1.47.1.1.1.1.7  #entPhysicalName

    overrides:
      ifAlias:
        ignore: true # Lookup metric
      ifDescr:
        ignore: true # Lookup metric
      ifName:
        ignore: true # Lookup metric
      entPhysicalDescr:
        ignore: true # Lookup metric
      entPhysicalName:
        ignore: true # Lookup metric
      entPhysicalClass:
        ignore: true # Lookup metric

Download the matching MIB files and generate snmp.yml with the generator

./generator generate -m mibs/MIBs --no-fail-on-parse-errors

Copy the generated content to /data/prometheus/config/snmp.yml
Edit /data/prometheus/config/prometheus.yml

global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "/data/prometheus/rules/*.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["localhost:9093"]

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  - job_name: "node"
    static_configs:
      - targets: 
          - "localhost:9100"
          - "192.168.172.134:9100"

  - job_name: "snmp-switch"
    static_configs:
      - targets:
          - "192.168.100.254"  # changed: IP address of the switch
    metrics_path: /snmp
    params:
      module: [H3C]  # changed: module name
      auth: [public_v2]
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: localhost:9116

Restart snmp_exporter and Prometheus, then check the corresponding metrics:

The traffic counters now show real values.

Grafana dashboard setup

Install dependencies:

apt-get install -y adduser libfontconfig1 musl

Download the .deb package:

wget https://dl.grafana.com/enterprise/release/grafana-enterprise_12.0.0_amd64.deb

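Install the .deb package:

dpkg -i grafana-enterprise_12.0.0_amd64.deb

Start the Grafana service:

systemctl start grafana-server

Enable it at boot:

systemctl enable grafana-server.service

Open http://ubuntu:3000
Add Prometheus as a data source:
Add a dashboard
The unit of the vertical axis can be changed here. With bytes selected, for example, Grafana automatically scales to larger units such as MiB.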

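Further Grafana configuration

Switch the language to Chinese and change the theme
- Top right corner - Profile - Language - 中文（简体）
- UI theme: Sapphire dusk
Drilldown: here Grafana analyzes the metrics you have added and builds panels automatically, which can serve as templates:

Configuring Grafana alerts

Taking the Alpine host's CPU usage as an example, the Prometheus query is: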

100 * (1 - sum(increase(node_cpu_seconds_total{mode="idle",instance="192.168.172.134:9100"}[1m])) / sum(increase(node_cpu_seconds_total{instance="192.168.172.134:9100"}[1m])))
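
The arithmetic behind this query is simply "one minus the idle share of all CPU time in the window". A minimal Python sketch, with a hypothetical function name and made-up numbers:

```python
def cpu_usage_percent(idle_increase, total_increase):
    """100 * (1 - idle / total), the same arithmetic as the query above.

    idle_increase:  CPU-seconds spent in mode="idle" during the window
    total_increase: CPU-seconds spent in all modes during the window
    """
    if total_increase <= 0:
        raise ValueError("total_increase must be positive")
    return 100 * (1 - idle_increase / total_increase)


# Hypothetical 1-core host: 60 CPU-seconds in a 1m window, 45 of them idle
print(cpu_usage_percent(45.0, 60.0))
```

Unlike the per-core alert rule used earlier, this form sums over all cores, so it reports the host-wide average.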

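First, add a dashboard:

Add a contact point
Email is used as the contact point, so that later firewall policies for secure outbound access can simply be based on the SMTP egress port:
- Open the configuration file under the Grafana installation directory:
  -> vim /etc/grafana/grafana.ini
- Edit the smtp section. Note that the from_address and user fields must match.
- Restart Grafana:
  -> systemctl restart grafana-server
The test succeeded:

Then add an alert rule:

Grafana's visualization is very good and makes it fairly easy to add and modify alert rules, so from here on I will use Grafana alerting in place of Prometheus Alertmanager.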

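Production deployment

The production environment cannot be opened to the public internet casually, so I decided to run a canary test internally for a while and improve the features gradually.

The basic installation and configuration are much the same as in the local tests and are not repeated here.

Query performance optimization

Production requires collecting a large number of metrics. Walking everything indiscriminately, as before, makes collection slow, and simply raising the timeout means real alert conditions cannot be responded to in time. The collected metrics therefore have to be optimized.

We split the collected OIDs into two modules by their base index: an interface module indexed by ifIndex and an entity module indexed by entPhysicalIndex. Each module is defined as its own job:

HUAWEI_Entity_OPTIMIZED

Below is a fragment of the generator's generator.yml. It adds a filtering mechanism: for example, snmp_exporter collects the temperature metric only when the value of 1.3.6.1.2.1.47.1.1.1.1.5 is '9', that is, when the physical entity class is a module (board):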

modules:
  HUAWEI_Entity_OPTIMIZED:
    walk:
      - 1.3.6.1.2.1.47.1.1.1.1.5            #entPhysicalClass
      - 1.3.6.1.4.1.2011.5.25.31.1.1.1.1.11 #hwEntityTemperature
      - 1.3.6.1.4.1.2011.5.25.31.1.1.3.1.5  #hwEntityOpticalTemperature
      - 1.3.6.1.4.1.2011.5.25.31.1.1.1.1.7  #hwEntityMemUsage
      - 1.3.6.1.4.1.2011.5.25.31.1.1.1.1.5  #hwEntityCpuUsage
      - 1.3.6.1.4.1.2011.5.25.31.1.1.10.1.7 #hwEntityFanState
      - 1.3.6.1.4.1.2011.6.157.1.6          #hwCurrentPower
    lookups:
      - source_indexes: [hwEntityTemperature]
        lookup: 1.3.6.1.2.1.47.1.1.1.1.5
      - source_indexes: [entPhysicalIndex]
        lookup: 1.3.6.1.2.1.47.1.1.1.1.7            #entPhysicalName
    overrides:
      entPhysicalClass: 
        ignore: true
        regex_extracts:
          entity_class:
            - regex: '(.*)'
              value: '$1'
    filters:
      dynamic:
        - oid: 1.3.6.1.2.1.47.1.1.1.1.5
          targets:
            - 1.3.6.1.4.1.2011.5.25.31.1.1.1.1.11
          values: ["9"]
        - oid: 1.3.6.1.2.1.47.1.1.1.1.5
          targets:
            - 1.3.6.1.4.1.2011.5.25.31.1.1.1.1.7
          values: ["9"]
        - oid: 1.3.6.1.2.1.47.1.1.1.1.5
          targets:
            - 1.3.6.1.4.1.2011.5.25.31.1.1.1.1.5
          values: ["9"]
        - oid: 1.3.6.1.2.1.47.1.1.1.1.5
          targets:
            - 1.3.6.1.4.1.2011.5.25.31.1.1.10.1.7
          values: ["7"]
        - oid: 1.3.6.1.2.1.47.1.1.1.1.5
          targets:
            - 1.3.6.1.4.1.2011.5.25.31.1.1.3.1.5
          values: ["10"]

This greatly reduces the time a scrape takes:

snmp_scrape_walk_duration_seconds{module="HUAWEI_Entity_OPTIMIZED"} 1.359563477
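
Why the filters save time: as I understand it, the exporter first walks the filter OID (entPhysicalClass) and then requests the target OIDs only for the indexes whose value matches. The sketch below mirrors that idea in plain Python. It is not the exporter's actual code, and the entity indexes are made up.

```python
def filter_indexes(class_by_index, allowed_classes):
    """Return the entity indexes whose entPhysicalClass is in allowed_classes."""
    return sorted(idx for idx, cls in class_by_index.items() if cls in allowed_classes)


def oids_to_fetch(target_oid, class_by_index, allowed_classes):
    """Expand one target column into per-index OIDs for matching entities only."""
    return [f"{target_oid}.{idx}" for idx in filter_indexes(class_by_index, allowed_classes)]


# Hypothetical entPhysicalClass walk result: index -> class
# (3 = chassis, 7 = fan, 9 = module, 10 = port)
classes = {16777216: "3", 16842753: "9", 16842754: "9", 16908289: "7", 17000001: "10"}
temperature_oid = "1.3.6.1.4.1.2011.5.25.31.1.1.1.1.11"  # hwEntityTemperature

# Only the two module entities are queried for temperature
print(oids_to_fetch(temperature_oid, classes, {"9"}))
```

On a chassis with hundreds of ports and optical modules, skipping every entity that cannot report a given value removes most of the SNMP round trips.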

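Legend optimization

Many ports carry no traffic, for example some Vlanif interfaces, and too many legend entries make it hard to find anything. How can the chart show only the series that have actual data?

Filter with != 0 in the expression: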

(rate(ifHCInOctets{instance="172.1.0.51", job="HUAWEI_IF_OPTIMIZED"}[5m])!=0)/1000000/(ifHighSpeed{instance="172.1.0.51", job="HUAWEI_IF_OPTIMIZED"}!=0)*8

This way only non-zero series appear in the chart.
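
The unit conversion in that expression is worth spelling out: ifHCInOctets counts bytes, while ifHighSpeed is reported in units of 1,000,000 bits per second, so the result is the fraction of link capacity in use. A minimal Python sketch of the same calculation, where the function name is my own and returning None stands in for PromQL dropping the series:

```python
def link_utilization(bytes_per_second, if_high_speed_mbps):
    """Fraction of link capacity in use, or None when the series is filtered out.

    Mirrors: (rate(ifHCInOctets[5m]) != 0) / 1000000 / (ifHighSpeed != 0) * 8
    """
    # "!= 0" drops the series instead of returning 0, which is what hides idle ports
    if bytes_per_second == 0 or if_high_speed_mbps == 0:
        return None
    megabits_per_second = bytes_per_second * 8 / 1_000_000
    return megabits_per_second / if_high_speed_mbps


# Hypothetical 1 Gbit/s port receiving 12.5 MB/s
print(link_utilization(12_500_000, 1000))
# Idle port: no value at all, so no legend entry
print(link_utilization(0, 1000))
```

Multiply by 100 in the panel, or choose a percent (0.0-1.0) unit in Grafana, to display it as a percentage.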

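SNMP Exporter collection problems

Two devices showed up as down in Prometheus target health:

Querying them directly from the command line showed that one of them returned the basic OID while the other did not: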

root@prometheus:~# snmpwalk -v 2c -c ****** 172.1.0.59 1.3.6.1.2.1.1.1.0
Timeout: No Response from 172.1.0.59
root@prometheus:~# snmpwalk -v 2c -c ****** 172.1.0.58 1.3.6.1.2.1.1.1.0
iso.3.6.1.2.1.1.1.0 = STRING: "S5720-36C-EI-28S-AC
Huawei Versatile Routing Platform Software
 VRP (R) software,Version 5.170 (S5720 V200R019C10SPC500)
 Copyright (C) 2007 Huawei Technologies Co., Ltd."
root@prometheus:~#

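Same S5720 model, same SNMP configuration, so what gives?

Investigation showed the cause was the switch patch version. After installing the latest switch patch, everything returned to normal.

What if devices use different community strings

In fact, multiple community-based auth entries can be defined: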

---
auths:
  # name of the auth entry
  huawei_switch:
    # SNMP v2c
    version: 2
    # SNMP community string
    community: <community string>

  # for OLTs
  huawei_olt:
    # SNMP v2c
    version: 2
    # SNMP community string
    community: public

In Prometheus, simply select a different auth in each job.

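OID documentation for optical access devices (OLT)

These vendors are unbelievable. Why can't the OID documentation simply be published openly?

So far I have only found the files for the S7700 series switches on Huawei's website. For the ZTE C300 and the Huawei MA5680T, I have no account that is allowed to download them from the vendors' sites.

I asked the equipment resellers and they do not have them either. Is this really information that needs to be kept secret? I can't be bothered to rant. Juniper, a foreign vendor, simply offers an archive for users to download and even provides a lookup for every OID, and H3C does well here too. What are ZTE and Huawei worried about? Surely not that they know their own network management systems are too poor to sell?

With things as they were, I had one more trick: walk the entire OID tree, hand the output to an AI, and let it analyze and summarize.

That did produce some results, but in the end I still put in the effort and obtained the MIB files I needed by other means.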

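Slow SNMP reads on a single S5720 switch

Investigation showed this was caused by the switch patch version. After updating the patch, the read speed returned to normal.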

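The Prometheus "implicit metric" problem

The information you want often ends up in a label of the metric rather than in the returned sample value. This usually happens because the exposed value has to be numeric while the actual value is a string, for example:
zxGponUnCfgSnOntSN{zxGponMgmtPonOltId="268633344",zxGponUnCfgSnIdx="1",zxGponUnCfgSnOntSN="0x52544B4711111111"} 1

However, ZTE's definition here appears to use a Unicode encoding, while the default conversion treats the value as ASCII and otherwise falls back to a hex string, so the result is garbled. There is also no built-in way to convert the hexadecimal value directly, so I plan to write a Grafana plugin to convert it in the future.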
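
As a starting point for that conversion, here is a minimal Python sketch. It assumes the value follows the common GPON serial number layout (a 4-byte ASCII vendor ID followed by a 4-byte vendor-specific part), which I have not verified against ZTE's MIB definition. The function name is hypothetical.

```python
def decode_gpon_serial(hex_value):
    """Decode a hex-rendered GPON serial number such as "0x52544B4711111111"."""
    digits = hex_value[2:] if hex_value.lower().startswith("0x") else hex_value
    raw = bytes.fromhex(digits)
    if len(raw) != 8:
        raise ValueError("expected 8 bytes, got %d" % len(raw))
    vendor_id = raw[:4].decode("ascii")  # first 4 bytes: vendor ID as ASCII text
    serial = raw[4:].hex().upper()       # last 4 bytes: kept as hex digits
    return vendor_id + serial


print(decode_gpon_serial("0x52544B4711111111"))  # RTKG11111111
```

If the assumption holds, the same logic could later be ported to a Grafana transformation or plugin so the label is readable directly in the panel.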