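feat(docker): PP-04 completion: Grafana HMS overview dashboard + postgres/redis exporters + channel docs
Continues the PP-04 MVP and closes the observability loop:
- grafana/provisioning/dashboards/json/hms-overview.json: HMS overview dashboard
  (service status / DB connection pool / EventBus backlog / memory and CPU / API 5xx error rate, based on app metrics)
- postgres-exporter + redis-exporter services: prometheus.yml already listed them as targets, but the
  services were never deployed (so alerts such as pg_stat_activity / redis_memory could never fire); now added
- alertmanager now runs with --config.expand-env: channel tokens can be injected from .env via ${VAR}
  (avoids repeating the PP-03 mistake of committing the Redis password to git in plaintext)
- alertmanager/README.md: configuration docs for the DingTalk / WeCom / email channels (to be filled in before go-live)
nginx-exporter is skipped (alerts.yml has no nginx rules, and it would require changing nginx.conf to enable stub_status)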
63
docker/alertmanager/README.md
Normal file
@@ -0,0 +1,63 @@
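# Alertmanager Alert Notification Configuration

> PP-04 observability. `config.yml` currently uses a placeholder webhook (`http://placeholder.invalid/alert`), so every alert POST fails, although the failure is logged.
> **Before going live, you must** replace it with a real notification channel; otherwise the 11 alert rules can fire without anyone being notified.

Alertmanager is started with `--config.expand-env=true`, so `${VAR}` references in the config are expanded from environment variables.
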
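
As a rough local preview of what this substitution produces, the sketch below expands `${VAR}` references with `awk`. It only approximates the behavior described above and is not Alertmanager's own implementation; the function name `expand_env` is hypothetical.

```bash
# expand_env: read text on stdin and replace every ${NAME} with the value
# of the exported environment variable NAME (unset names become empty).
# Hypothetical helper for local previews only.
expand_env() {
  awk '{
    out = ""
    rest = $0
    # Consume the line left to right so substituted values are never re-expanded.
    while (match(rest, /[$][{][A-Za-z_][A-Za-z0-9_]*[}]/)) {
      name = substr(rest, RSTART + 2, RLENGTH - 3)
      out = out substr(rest, 1, RSTART - 1) ENVIRON[name]
      rest = substr(rest, RSTART + RLENGTH)
    }
    print out rest
  }'
}
```

Running `expand_env < config.yml` in a shell where the `.env` values have been exported prints the config with the variables filled in. The output contains the real secrets, so it should not be redirected into a tracked file.
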
## Option A: DingTalk / WeCom webhook (recommended)

1. In `config.yml`, change the receiver to reference an environment variable:

   ```yaml
   receivers:
     - name: "default"
       webhook_configs:
         - url: "${ALERT_WEBHOOK_URL}"
           send_resolved: true
   ```

2. Add to `.env` (not tracked by git):

   ```
   # DingTalk robot
   ALERT_WEBHOOK_URL=https://oapi.dingtalk.com/robot/send?access_token=XXX
   # Or a WeCom group robot
   # ALERT_WEBHOOK_URL=https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=XXX
   ```

> The token must live in `.env`, never in `config.yml` (which is tracked by git), to avoid repeating the PP-03 leak of the Redis password in plaintext.

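
One way to enforce this rule is a small check run before committing. The sketch below is a minimal guard, assuming secrets appear only as `access_token=`/`key=` query parameters or as `smtp_auth_password`; the function name `check_no_inline_secrets` and the pattern are illustrative assumptions, not part of this repository.

```bash
# check_no_inline_secrets FILE
# Returns 0 when FILE contains no hard-coded webhook token or SMTP password.
# Otherwise prints the offending lines (with line numbers) to stderr and returns 1.
# Values written as ${VAR} references are accepted.
check_no_inline_secrets() {
  file="$1"
  # A token/key parameter or password followed by a literal value
  # (anything other than "$", a quote, or whitespace).
  pattern='(access_token=|[?&]key=)[^$"[:space:]]|smtp_auth_password:[[:space:]]*"?[^$"[:space:]]'
  if grep -nE "$pattern" "$file" >&2; then
    return 1
  fi
  return 0
}
```

For example, `check_no_inline_secrets docker/alertmanager/config.yml || exit 1` could be placed in a pre-commit hook so that a pasted token blocks the commit.
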
## Option B: Email via SMTP

```yaml
global:
  smtp_smarthost: "smtp.exmail.qq.com:465"
  smtp_from: "alert@hms.example.com"
  smtp_auth_username: "alert@hms.example.com"
  smtp_auth_password: "${SMTP_PASSWORD}"
receivers:
  - name: "default"
    email_configs:
      - to: "ops@hms.example.com"
        send_resolved: true
```

Add `SMTP_PASSWORD=...` to `.env`.

## Verification

After deployment, trigger a test alert through the Alertmanager API:

```bash
curl -XPOST http://<host>:9093/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[{"labels":{"alertname":"test","severity":"critical"}}]'
```

A notification should arrive on the configured channel (DingTalk / WeCom / email). Alertmanager UI: `http://<host>:9093`.

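
The manual check above can be wrapped in a small helper that also confirms Alertmanager accepted the alert. This is a sketch under assumptions: the function names `build_test_alert` and `send_test_alert` and the `summary` annotation are hypothetical, and label values are assumed to be plain identifiers without quotes.

```bash
# build_test_alert NAME SEVERITY
# Prints a one-element JSON alert list for the given alertname and severity.
build_test_alert() {
  printf '[{"labels":{"alertname":"%s","severity":"%s"},"annotations":{"summary":"manual test alert"}}]' "$1" "$2"
}

# send_test_alert HOST
# POSTs the test alert to Alertmanager on HOST:9093, then lists the alerts
# matching alertname="test" so the caller can see that it was accepted.
send_test_alert() {
  host="$1"
  build_test_alert test critical |
    curl -fsS -XPOST "http://${host}:9093/api/v2/alerts" \
      -H "Content-Type: application/json" -d @- &&
    curl -fsS -G --data-urlencode 'filter=alertname="test"' \
      "http://${host}:9093/api/v2/alerts"
}
```

Calling `send_test_alert <host>` only shows that Alertmanager received the alert; whether the channel delivered it still has to be checked in DingTalk, WeCom, or the mailbox.
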
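## Current routing policy

- Alerts are grouped by `alertname + service`
- `severity=critical` (DB down / 5xx spike / Redis unreachable) notifies immediately and repeats every 5 minutes
- All other alerts are aggregated for 30s and repeat every 4 hours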