feat(docker): PP-04 完善 — Grafana HMS 概览 dashboard + postgres/redis exporter + 渠道文档

延续 PP-04 MVP,补全可观测性闭环:
- grafana/provisioning/dashboards/json/hms-overview.json: HMS 概览 dashboard
  (服务状态/DB 连接池/EventBus 积压/内存 CPU/API 5xx 错误率,基于 app metrics)
- postgres-exporter + redis-exporter 服务: 之前 prometheus.yml 配了 target 但
  服务未部署(pg_stat_activity/redis_memory 等告警永不触发),现补齐
- alertmanager 启用 --config.expand-env: 支持渠道 token 用 \${VAR} 从 .env 注入
  (避免重蹈 PP-03 Redis 密码明文入 git 覆辙)
- alertmanager/README.md: 钉钉/企微/邮件渠道配置文档(上线前填)

nginx-exporter 跳过(alerts.yml 无 nginx 规则 + 需改 nginx.conf 配 stub_status)
This commit is contained in:
iven
2026-06-26 10:03:21 +08:00
parent 6457c53d9c
commit ffbe5a797f
4 changed files with 204 additions and 0 deletions

View File

@@ -0,0 +1,63 @@
# Alertmanager 告警通知配置
> PP-04 可观测性。当前 `config.yml` 使用占位 webhook`http://placeholder.invalid/alert`),告警会 POST 失败但记日志。
> **上线前必须**替换为真实通知渠道,否则 11 条告警规则触发了也没人收到。
alertmanager 已启用 `--config.expand-env=true`,支持 `${VAR}` 从环境变量展开。
## 方案 A钉钉 / 企业微信 webhook推荐
1. `config.yml` 的 receiver 改为环境变量引用:
```yaml
receivers:
- name: "default"
webhook_configs:
- url: "${ALERT_WEBHOOK_URL}"
send_resolved: true
```
2. `.env`(不入 git
```
# 钉钉机器人
ALERT_WEBHOOK_URL=https://oapi.dingtalk.com/robot/send?access_token=XXX
# 或企业微信群机器人
# ALERT_WEBHOOK_URL=https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=XXX
```
> token 必须放 `.env`,不能写进 `config.yml`git 追踪)——避免重蹈 PP-03 Redis 密码明文泄露覆辙。
## 方案 B邮件 SMTP
```yaml
global:
smtp_smarthost: "smtp.exmail.qq.com:465"
smtp_from: "alert@hms.example.com"
smtp_auth_username: "alert@hms.example.com"
smtp_auth_password: "${SMTP_PASSWORD}"
receivers:
- name: "default"
email_configs:
- to: "ops@hms.example.com"
send_resolved: true
```
`.env` 加 `SMTP_PASSWORD=...`。
## 验证
部署后用 Alertmanager API 触发测试告警:
```bash
curl -XPOST http://<host>:9093/api/v2/alerts \
-H "Content-Type: application/json" \
-d '[{"labels":{"alertname":"test","severity":"critical"}}]'
```
应收到渠道通知(钉钉/企微/邮件。Alertmanager UI`http://<host>:9093`。
## 当前路由策略
- 按 `alertname + service` 分组
- `severity=critical`DB 宕机/5xx 飙升/Redis 不可达即时通知5 分钟重复
- 其他告警 30s 聚合4 小时重复

View File

@@ -145,6 +145,7 @@ services:
- alertmanager_data:/alertmanager
command:
- "--config.file=/etc/alertmanager/config.yml"
- "--config.expand-env=true"
- "--storage.path=/alertmanager"
expose:
- "9093"
@@ -171,6 +172,30 @@ services:
networks:
- hms-internal
# ── Prometheus exportersPP-04之前 prometheus.yml 配了 target 但服务未部署,告警永不触发)──
postgres-exporter:
image: prometheuscommunity/postgres-exporter:v0.15.0
container_name: hms-postgres-exporter
restart: unless-stopped
environment:
DATA_SOURCE_NAME: "postgresql://${POSTGRES_USER:-erp}:${POSTGRES_PASSWORD}@postgres:${POSTGRES_PORT:-5432}/${POSTGRES_DB:-erp}?sslmode=disable"
expose:
- "9187"
networks:
- hms-internal
redis-exporter:
image: oliver006/redis_exporter:v1.66.0
container_name: hms-redis-exporter
restart: unless-stopped
environment:
REDIS_ADDR: "redis://redis:${REDIS_PORT:-6379}"
REDIS_PASSWORD: "${REDIS_PASSWORD:-erp_redis_dev}"
expose:
- "9121"
networks:
- hms-internal
volumes:
app-uploads:
driver: local

View File

@@ -0,0 +1,115 @@
{
"uid": "hms-overview",
"title": "HMS 概览",
"tags": ["HMS", "overview"],
"timezone": "browser",
"schemaVersion": 39,
"version": 1,
"refresh": "30s",
"time": { "from": "now-1h", "to": "now" },
"panels": [
{
"id": 1,
"type": "stat",
"title": "HMS 服务状态",
"gridPos": { "h": 4, "w": 6, "x": 0, "y": 0 },
"datasource": { "type": "prometheus", "uid": "prometheus" },
"targets": [{ "refId": "A", "expr": "up{job=\"hms\"}", "legendFormat": "" }],
"options": {
"reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
"colorMode": "background",
"mappings": [
{
"type": "value",
"options": {
"0": { "text": "DOWN", "color": "red", "index": 0 },
"1": { "text": "UP", "color": "green", "index": 1 }
}
}
]
}
},
{
"id": 2,
"type": "stat",
"title": "EventBus 积压 (pending)",
"gridPos": { "h": 4, "w": 6, "x": 6, "y": 0 },
"datasource": { "type": "prometheus", "uid": "prometheus" },
"targets": [{ "refId": "A", "expr": "eventbus_pending_total", "legendFormat": "" }],
"options": {
"reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
"colorMode": "value",
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 100 },
{ "color": "red", "value": 500 }
]
}
}
},
{
"id": 3,
"type": "stat",
"title": "API 5xx 错误率 (5m)",
"gridPos": { "h": 4, "w": 12, "x": 12, "y": 0 },
"datasource": { "type": "prometheus", "uid": "prometheus" },
"targets": [
{
"refId": "A",
"expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) / clamp_min(sum(rate(http_requests_total[5m])), 1)",
"legendFormat": "5xx ratio"
}
],
"options": {
"reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
"colorMode": "background",
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 0.01 },
{ "color": "red", "value": 0.05 }
]
},
"mappings": []
},
"fieldConfig": { "defaults": { "unit": "percentunit" } }
},
{
"id": 4,
"type": "timeseries",
"title": "DB 连接池(活跃 / 空闲)",
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 4 },
"datasource": { "type": "prometheus", "uid": "prometheus" },
"targets": [
{ "refId": "A", "expr": "db_pool_connections_active", "legendFormat": "活跃" },
{ "refId": "B", "expr": "db_pool_connections_idle", "legendFormat": "空闲" }
],
"fieldConfig": { "defaults": { "unit": "short" } }
},
{
"id": 5,
"type": "timeseries",
"title": "进程内存 / CPU",
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 4 },
"datasource": { "type": "prometheus", "uid": "prometheus" },
"targets": [
{ "refId": "A", "expr": "process_resident_memory_bytes", "legendFormat": "内存 (bytes)" },
{ "refId": "B", "expr": "rate(process_cpu_seconds_total[5m])", "legendFormat": "CPU (cores/s)" }
]
},
{
"id": 6,
"type": "timeseries",
"title": "EventBus 积压趋势",
"gridPos": { "h": 8, "w": 24, "x": 0, "y": 12 },
"datasource": { "type": "prometheus", "uid": "prometheus" },
"targets": [
{ "refId": "A", "expr": "eventbus_pending_total", "legendFormat": "pending events" }
],
"fieldConfig": { "defaults": { "unit": "short" } }
}
]
}

View File

@@ -4,6 +4,7 @@ apiVersion: 1
datasources:
- name: Prometheus
uid: prometheus
type: prometheus
access: proxy
url: http://prometheus:9090