How to Configure Highly Available Container Monitoring on Linux

2025-04-16

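Overview

A highly available container monitoring system on Linux has to account for the reliability of the monitoring tools themselves, data persistence, and failover. The following is a high-availability monitoring setup built on Prometheus, Grafana, and Alertmanager.

Core Components

  1. Prometheus - the main monitoring system, responsible for collecting metrics
  2. Grafana - dashboards for data visualization
  3. Alertmanager - alert management
  4. cAdvisor - container metrics collection
  5. Node Exporter - host metrics collection
  6. VictoriaMetrics or Thanos - long-term storage and querying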

High-Availability Deployment Architecture

1. Prometheus High-Availability Setup

# Deploy two Prometheus instances with Docker Compose
version: '3.7'

services:
  prometheus-primary:
    image: prom/prometheus:latest
    container_name: prometheus-primary
    restart: always
    volumes:
      - ./prometheus-primary.yml:/etc/prometheus/prometheus.yml
      - prometheus-primary-data:/prometheus
    ports:
      - "9090:9090"
    networks:
      - monitor-net

  prometheus-secondary:
    image: prom/prometheus:latest
    container_name: prometheus-secondary
    restart: always
    volumes:
      - ./prometheus-secondary.yml:/etc/prometheus/prometheus.yml
      - prometheus-secondary-data:/prometheus
    ports:
      - "9091:9090"
    networks:
      - monitor-net

volumes:
  prometheus-primary-data:
  prometheus-secondary-data:

networks:
  monitor-net:
    driver: bridge
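
The scrape configuration in step 2 targets node-exporter:9100 and cadvisor:8080, but the Compose file above does not define those services. A minimal sketch of the two missing entries follows. The images, mounts, and flags follow the projects' commonly published Docker examples and are assumptions here, not part of the original setup.

```yaml
# Sketch: merge these entries into the `services:` section of the Compose file above
services:
  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    restart: always
    pid: host
    command:
      - '--path.rootfs=/host'   # read host metrics from the mounted root filesystem
    volumes:
      - /:/host:ro,rslave
    networks:
      - monitor-net             # on a bridge network, network metrics reflect the container's namespace

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    container_name: cadvisor
    restart: always
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    networks:
      - monitor-net
```

Both services join monitor-net so that the two Prometheus instances can reach them by service name.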

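2. Configure the Prometheus HA Pair

Prometheus has no built-in clustering or peer setting. High availability comes from running two identically configured instances that scrape the same targets independently. In each instance's configuration file, set a distinct replica label so that a downstream query or storage layer (such as Thanos or VictoriaMetrics) can deduplicate the two copies:

global:
  scrape_interval: 15s
  evaluation_interval: 15s
  # external_labels are attached to alerts and to data sent via remote write or federation
  external_labels:
    cluster: "monitor-cluster"
    replica: "primary"      # in prometheus-primary.yml
    # replica: "secondary"  # in prometheus-secondary.yml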

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']
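
The configuration above does not yet tell Prometheus where to send alerts or which rule files to load. A minimal sketch of those sections is shown below. The rule file path is hypothetical, and the labeldrop rule assumes the two instances are distinguished by a `replica` external label.

```yaml
# Sketch: add to both prometheus-primary.yml and prometheus-secondary.yml
rule_files:
  - /etc/prometheus/rules/*.yml   # hypothetical path; mount the same rule files into both instances

alerting:
  # Drop the per-instance label so both replicas emit identical alerts
  # and Alertmanager can deduplicate them (assumes a `replica` external label)
  alert_relabel_configs:
    - action: labeldrop
      regex: replica
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager-primary:9093
            - alertmanager-secondary:9093
```

Each Prometheus instance sends every alert to both Alertmanager instances directly, and the Alertmanager cluster takes care of sending a single notification.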

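3. Deploy a Long-Term Storage Solution

Option A: VictoriaMetrics

# -retentionPeriod without a suffix is counted in months (12 = one year);
# an "m" suffix is not accepted because it is ambiguous with minutes
docker run -d --name victoriametrics \
  -p 8428:8428 \
  -v victoria-data:/storage \
  victoriametrics/victoria-metrics:latest \
  -retentionPeriod=12 -storageDataPath=/storage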
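
For VictoriaMetrics to receive any data, each Prometheus instance needs a remote write target. A minimal sketch follows. The hostname victoriametrics is an assumption: the container started above is not attached to the Compose network, so it resolves only if the container is also joined to that network or defined as a Compose service.

```yaml
# Sketch: add to both Prometheus configuration files
remote_write:
  - url: http://victoriametrics:8428/api/v1/write   # assumes this hostname resolves from the Prometheus containers
```

Because both instances write the same series, VictoriaMetrics would store two copies. Its -dedup.minScrapeInterval flag is intended for this case; setting it to the scrape interval (15s here) is a reasonable starting point.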

Option B: Thanos

# Thanos sidecar service (add under services: in the Compose file)
  thanos-sidecar-primary:
    image: thanosio/thanos:latest
    container_name: thanos-sidecar-primary
    command:
      - "sidecar"
      - "--prometheus.url=http://prometheus-primary:9090"
      - "--tsdb.path=/prometheus"
      - "--objstore.config-file=/etc/thanos/minio-bucket.yaml"
    volumes:
      - ./minio-bucket.yaml:/etc/thanos/minio-bucket.yaml
      - prometheus-primary-data:/prometheus
    networks:
      - monitor-net
    depends_on:
      - prometheus-primary
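
The sidecar reads its object storage settings from minio-bucket.yaml, which the article does not show. A minimal sketch in the Thanos object store format is given below. The bucket name, endpoint, and credentials are placeholders, not values from the original setup.

```yaml
# minio-bucket.yaml (sketch; every value is a placeholder)
type: S3
config:
  bucket: thanos            # hypothetical bucket name
  endpoint: minio:9000      # hypothetical MinIO host:port, without a scheme
  access_key: CHANGE_ME
  secret_key: CHANGE_ME
  insecure: true            # plain HTTP; set to false when MinIO serves TLS
```

For the sidecar to upload blocks, Thanos expects Prometheus to run with --storage.tsdb.min-block-duration and --storage.tsdb.max-block-duration set to the same value (typically 2h). A Thanos Query component is also needed to read from both sidecars and deduplicate on the replica label.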

4. Deploy the Alertmanager Cluster

  alertmanager-primary:
    image: prom/alertmanager:latest
    container_name: alertmanager-primary
    restart: always
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    ports:
      - "9093:9093"
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--cluster.peer=alertmanager-secondary:9094'
    networks:
      - monitor-net

  alertmanager-secondary:
    image: prom/alertmanager:latest
    container_name: alertmanager-secondary
    restart: always
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    ports:
      - "9094:9093"
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      # Cluster gossip uses port 9094; 9093 is the HTTP port
      - '--cluster.peer=alertmanager-primary:9094'
    networks:
      - monitor-net
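
Both instances mount the same alertmanager.yml, which the article does not show. A minimal sketch follows; the receiver name and webhook URL are hypothetical and should be replaced with a real notification target.

```yaml
# alertmanager.yml (sketch; receiver name and webhook URL are hypothetical)
route:
  receiver: default-webhook
  group_by: ['alertname', 'cluster']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

receivers:
  - name: default-webhook
    webhook_configs:
      - url: http://alert-receiver:5001/
```

Using an identical file on both instances matters: the cluster deduplicates notifications only when both members route and group alerts the same way.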

5. Deploy Grafana

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    restart: always
    volumes:
      - grafana-data:/var/lib/grafana
    ports:
      - "3000:3000"
    networks:
      - monitor-net
    depends_on:
      - prometheus-primary
      - prometheus-secondary
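
Grafana can be pointed at both Prometheus instances through a provisioning file rather than manual setup in the UI. The sketch below assumes the file is mounted into /etc/grafana/provisioning/datasources/ in the Grafana container; the data source names are illustrative.

```yaml
# datasources.yml (sketch; data source names are illustrative)
apiVersion: 1
datasources:
  - name: Prometheus-Primary
    type: prometheus
    access: proxy
    url: http://prometheus-primary:9090
    isDefault: true
  - name: Prometheus-Secondary
    type: prometheus
    access: proxy
    url: http://prometheus-secondary:9090
```

With two separate data sources, switching after a failure is manual. Pointing a single data source at a load balancer or a deduplicating query layer avoids that step.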

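Key Configuration Points

  1. Data persistence:

    • Configure persistent volumes for all critical components
    • Keep the Prometheus TSDB on local or block storage (the Prometheus documentation states that NFS is not supported for it); use network storage (NFS, Ceph, etc.) or cloud storage for the other components and for long-term data
  2. Service discovery: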

    scrape_configs:
     - job_name: 'docker'
       dockerswarm_sd_configs:
         - host: unix:///var/run/docker.sock
           role: tasks
       relabel_configs:
         # The tasks role does not expose a task-name label; use the service name instead
         - source_labels: [__meta_dockerswarm_service_name]
           regex: (.*)
           target_label: service
    
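  3. Alert rule high availability:

    • Use centralized configuration management (or a ConfigMap on Kubernetes) so that every Prometheus instance loads the same alert rules
  4. Load balancing:

    • Use Nginx or HAProxy to load-balance Prometheus queries. Do not load-balance traffic from Prometheus to Alertmanager; list every Alertmanager instance in each Prometheus configuration so that each alert reaches all of them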
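
To illustrate point 3, a shared rule file that both Prometheus instances load might look like the sketch below. The file name, thresholds, and severity labels are illustrative assumptions; the memory rule relies on the cAdvisor metric container_memory_working_set_bytes.

```yaml
# rules/containers.yml (sketch; names and thresholds are illustrative)
groups:
  - name: container-alerts
    rules:
      - alert: TargetDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Target {{ $labels.instance }} of job {{ $labels.job }} is down"

      - alert: ContainerHighMemory
        expr: container_memory_working_set_bytes{name!=""} > 1e9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.name }} is using more than 1 GB of memory"
```

Mounting the same file into both instances keeps their alerts identical, which is what lets the Alertmanager cluster deduplicate them.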

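Verifying High Availability

  1. Test failover:

    docker stop prometheus-primary
    # Confirm that prometheus-secondary keeps scraping and answering queries
    # (it runs independently, so there is no takeover step)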
    
  2. Check data consistency:

    # Quote the URL so the shell does not interpret the "?"
    curl -s 'http://localhost:9090/api/v1/query?query=up' | jq .
    curl -s 'http://localhost:9091/api/v1/query?query=up' | jq .
    
  3. Test alerting:

    # Deliberately trigger an alert condition and verify that both Alertmanager instances receive it
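
One simple way to create such a condition is a rule whose expression is always true. The sketch below is a hypothetical test rule; the group and alert names are made up, and it should be removed after the test.

```yaml
# rules/ha-test.yml (sketch; temporary rule for testing only)
groups:
  - name: ha-test
    rules:
      - alert: AlwaysFiringTest
        expr: vector(1)        # always returns a sample, so the alert fires immediately
        labels:
          severity: test
        annotations:
          summary: "Synthetic alert for verifying the Alertmanager cluster"
```

After Prometheus reloads its rules, the alert should appear in the /api/v2/alerts endpoint of both Alertmanager instances (host ports 9093 and 9094 in this setup), while only one notification is sent.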
    

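Further Recommendations

  1. Container orchestration integration:

    • On Kubernetes, consider using the Prometheus Operator
    • On Docker Swarm, use service labels to discover monitoring targets automatically
  2. Log monitoring:

    • Integrate Loki or the ELK stack for log monitoring
    • Configure Fluentd or Filebeat to collect container logs
  3. Performance tuning:

    • Adjust the Prometheus scrape_interval according to load
    • Configure an appropriate retention policy and storage compaction

This setup covers the main pieces of a highly available container monitoring stack and can be adapted to the scale and requirements of your environment.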