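Setting up a highly available container monitoring system on Linux means accounting for the reliability of the monitoring tools, data durability, and failover. The following is a complete high-availability monitoring setup based on Prometheus, Grafana, and Alertmanager.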
# Deploy two Prometheus instances with Docker Compose
version: '3.7'
services:
  prometheus-primary:
    image: prom/prometheus:latest
    container_name: prometheus-primary
    restart: always
    volumes:
      - ./prometheus-primary.yml:/etc/prometheus/prometheus.yml
      - prometheus-primary-data:/prometheus
    ports:
      - "9090:9090"
    networks:
      - monitor-net
  prometheus-secondary:
    image: prom/prometheus:latest
    container_name: prometheus-secondary
    restart: always
    volumes:
      - ./prometheus-secondary.yml:/etc/prometheus/prometheus.yml
      - prometheus-secondary-data:/prometheus
    ports:
      - "9091:9090"
    networks:
      - monitor-net
volumes:
  prometheus-primary-data:
  prometheus-secondary-data:
networks:
  monitor-net:
    driver: bridge
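Use the same configuration file for both Prometheus instances, changing only the replica label:
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  # Prometheus has no built-in HA or peering setting. High availability comes
  # from running identical replicas and telling them apart with external labels.
  external_labels:
    cluster: "monitor-cluster"
    replica: "primary"   # in prometheus-primary.yml
    # replica: "secondary"   # in prometheus-secondary.yml
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  # The node-exporter and cadvisor targets must be reachable on monitor-net
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']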
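For alert delivery to survive the loss of one Alertmanager, each Prometheus replica is normally pointed at every Alertmanager instance directly rather than at a load balancer. The fragment below is a minimal sketch of that block, assuming the container names `alertmanager-primary` and `alertmanager-secondary` and the default HTTP port 9093. The `replica` label name is an assumption: if the replicas carry a distinguishing external label, it has to be dropped before alerts are sent, otherwise Alertmanager sees two different alerts and cannot deduplicate them.

```yaml
# Fragment for prometheus-primary.yml and prometheus-secondary.yml (identical in both).
alerting:
  # Drop the per-replica label (assumed to be called "replica") so that both
  # Prometheus instances emit alerts with identical label sets.
  alert_relabel_configs:
    - action: labeldrop
      regex: replica
  alertmanagers:
    - static_configs:
        - targets:
            - 'alertmanager-primary:9093'
            - 'alertmanager-secondary:9093'
```

With this in place both replicas send every alert to both Alertmanagers, and the Alertmanager cluster is responsible for sending each notification only once.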
# -retentionPeriod without a unit suffix is counted in months
docker run -d --name victoriametrics \
  -p 8428:8428 \
  -v victoria-data:/storage \
  victoriametrics/victoria-metrics:latest \
  -retentionPeriod=12 -storageDataPath=/storage
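Starting VictoriaMetrics does not by itself move any data into it; Prometheus has to be told to forward samples. A minimal sketch of the `remote_write` block follows. It assumes the VictoriaMetrics container is attached to the same Docker network as the Prometheus containers and resolves under the hostname `victoriametrics`, which the `docker run` command above does not set up on its own.

```yaml
# Fragment for prometheus-primary.yml and prometheus-secondary.yml.
remote_write:
  # VictoriaMetrics accepts the Prometheus remote write protocol on this path.
  - url: 'http://victoriametrics:8428/api/v1/write'
```

Because both replicas write the same series, the long-term store receives every sample twice. VictoriaMetrics offers a `-dedup.minScrapeInterval` flag for this case; whether to enable it and with what value depends on the scrape interval in use.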
# Thanos sidecar configuration (add under services: in the Compose file)
services:
  thanos-sidecar-primary:
    image: thanosio/thanos:latest
    container_name: thanos-sidecar-primary
    command:
      - "sidecar"
      - "--prometheus.url=http://prometheus-primary:9090"
      - "--tsdb.path=/prometheus"
      - "--objstore.config-file=/etc/thanos/minio-bucket.yaml"
    volumes:
      - ./minio-bucket.yaml:/etc/thanos/minio-bucket.yaml
      - prometheus-primary-data:/prometheus
    networks:
      - monitor-net
    depends_on:
      - prometheus-primary
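The sidecar reads its object storage settings from `minio-bucket.yaml`, which is mounted above but not shown. The following is a sketch of what such a file could look like for an S3-compatible MinIO server. The bucket name, the `minio:9000` endpoint, and the credentials are all placeholders, not values taken from this setup.

```yaml
# minio-bucket.yaml: Thanos object storage configuration (all values are placeholders).
type: S3
config:
  bucket: "thanos-metrics"        # hypothetical bucket; must already exist
  endpoint: "minio:9000"          # hypothetical MinIO host and port on monitor-net
  access_key: "CHANGE_ME"
  secret_key: "CHANGE_ME"
  insecure: true                  # plain HTTP; set to false when MinIO serves TLS
```

For block uploads to work, Thanos expects Prometheus to carry external labels and to run with `--storage.tsdb.min-block-duration=2h` and `--storage.tsdb.max-block-duration=2h`, so that local compaction does not interfere with the sidecar.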
services:
  alertmanager-primary:
    image: prom/alertmanager:latest
    container_name: alertmanager-primary
    restart: always
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    ports:
      - "9093:9093"
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      # Peers gossip over the cluster port (9094 by default), not the HTTP port 9093
      - '--cluster.listen-address=0.0.0.0:9094'
      - '--cluster.peer=alertmanager-secondary:9094'
    networks:
      - monitor-net
  alertmanager-secondary:
    image: prom/alertmanager:latest
    container_name: alertmanager-secondary
    restart: always
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    ports:
      - "9094:9093"   # host port 9094 maps to the HTTP port inside the container
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--cluster.listen-address=0.0.0.0:9094'
      - '--cluster.peer=alertmanager-primary:9094'
    networks:
      - monitor-net
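Both Alertmanager containers mount the same `alertmanager.yml`, and the cluster only deduplicates notifications correctly when the instances share identical routing. A minimal sketch of such a file is shown below. The receiver name and the webhook URL are hypothetical, and the timing values are common starting points rather than recommendations from this guide.

```yaml
# alertmanager.yml: shared by both Alertmanager instances.
route:
  receiver: 'default'
  group_by: ['alertname', 'cluster']
  group_wait: 30s        # wait before sending the first notification for a new group
  group_interval: 5m     # wait before notifying about new alerts in an existing group
  repeat_interval: 4h    # re-send interval for alerts that are still firing
receivers:
  - name: 'default'
    webhook_configs:
      - url: 'http://alert-webhook:5001/'   # hypothetical notification endpoint
```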
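services:
  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    restart: always
    volumes:
      - grafana-data:/var/lib/grafana
    ports:
      - "3000:3000"
    networks:
      - monitor-net
    depends_on:
      - prometheus-primary
      - prometheus-secondary
# The named volume must also be declared at the top level of the Compose file
volumes:
  grafana-data: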
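Grafana does not discover the Prometheus replicas automatically, so the data sources have to be added by hand or provisioned. The sketch below provisions both replicas, assuming the file is mounted into the Grafana container under `/etc/grafana/provisioning/datasources/`. The data source names are arbitrary, and the extra volume mount is not part of the Compose fragment above.

```yaml
# datasources.yml: Grafana data source provisioning file.
apiVersion: 1
datasources:
  - name: Prometheus-Primary
    type: prometheus
    access: proxy                      # Grafana's backend performs the queries
    url: http://prometheus-primary:9090
    isDefault: true
  - name: Prometheus-Secondary
    type: prometheus
    access: proxy
    url: http://prometheus-secondary:9090
```

With two separate data sources, switching to the secondary after a failure is a manual step in the dashboard. Putting a load balancer or a deduplicating query layer in front of the replicas removes that step.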
Data persistence:
Service discovery:
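scrape_configs:
  - job_name: 'docker'
    # Requires Swarm mode and the Docker socket mounted into the Prometheus container
    dockerswarm_sd_configs:
      - host: unix:///var/run/docker.sock
        role: tasks
    relabel_configs:
      # The tasks role exposes the owning service's name; there is no task name label
      - source_labels: [__meta_dockerswarm_service_name]
        regex: (.*)
        target_label: service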
Alert rule high availability:
Load balancing:
Failover test:
docker stop prometheus-primary
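# Verify that prometheus-secondary keeps scraping and answering queries
# (both replicas are always active, so there is no takeover step)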
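A stopped replica is easy to miss because the surviving one keeps working. One way to make the failure visible is an alert rule that fires when a replica stops answering scrapes. The sketch below assumes that the `prometheus` scrape job lists both replicas (`prometheus-primary:9090` and `prometheus-secondary:9090`) as targets instead of only `localhost:9090`, and that the file is referenced under `rule_files` in both configurations. The file name, group name, and alert name are hypothetical.

```yaml
# ha-rules.yml: load the same file on both Prometheus replicas.
groups:
  - name: prometheus-ha
    rules:
      - alert: PrometheusReplicaDown
        # up is 0 when the most recent scrape of the target failed
        expr: up{job="prometheus"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Prometheus replica {{ $labels.instance }} is not responding to scrapes"
```

After `docker stop prometheus-primary`, the surviving replica should show this alert as pending and then firing about a minute later.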
Check data consistency:
# Quote the URLs so the shell does not treat ? as a glob character
curl -s 'http://localhost:9090/api/v1/query?query=up' | jq
curl -s 'http://localhost:9091/api/v1/query?query=up' | jq
Test alerting:
# Trigger an alert condition on purpose and verify that both Alertmanager instances receive it
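A simple way to produce an alert condition on demand is a rule whose expression is always true. The sketch below uses `vector(1)`, which always returns a sample, so the alert fires as soon as the rule is evaluated. The file name and alert name are hypothetical, and the file is assumed to be listed under `rule_files` in both Prometheus configurations.

```yaml
# test-alert.yml: temporary rule for exercising the alert pipeline.
groups:
  - name: ha-test
    rules:
      - alert: AlwaysFiringTest
        expr: vector(1)      # always returns one sample, so the alert always fires
        for: 0m
        labels:
          severity: test
        annotations:
          summary: "Synthetic alert for verifying Alertmanager delivery"
```

Once Prometheus has reloaded its configuration, the alert should appear in both Alertmanager UIs, at `http://localhost:9093` and `http://localhost:9094` with the port mappings used here. Remove the rule again after the test.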
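Container orchestration integration:
Log monitoring:
Performance tuning:
This setup provides a complete high-availability solution for container monitoring and can be adjusted to the scale and requirements of your environment.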