How to Configure Highly Available Monitoring for Container Orchestration Tools on Linux

Date: 2025-04-15

A High-Availability Monitoring Solution for Container Orchestration Tools on Linux

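Overview

Building a highly available monitoring system for a container orchestration tool (such as Kubernetes or Docker Swarm) on Linux means keeping monitoring reliable and continuous at several layers: collection, storage, alerting, and visualization. The complete approach is described below.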

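1. Monitoring architecture design

1.1 Choosing the core components

  • Prometheus: the primary monitoring system, responsible for collecting and storing metrics
  • Alertmanager: routes alert notifications and deduplicates them
  • Grafana: visualizes the monitoring data
  • Node Exporter: collects host-level metrics
  • cAdvisor: collects container metrics
  • kube-state-metrics (Kubernetes only): collects the state of Kubernetes resources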

1.2 High-availability architecture

[Container orchestration cluster]
    │
    ├── [Prometheus A]───[Alertmanager cluster]───[Notification channels]
    │        │
    │        └── [Long-term storage]
    └── [Prometheus B]

2. Implementation steps

2.1 Prometheus high-availability configuration

# prometheus-ha.yaml (Kubernetes example)
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: prometheus
spec:
  serviceName: prometheus
  replicas: 2
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      containers:
      - name: prometheus
        image: prom/prometheus
        args:
        - "--config.file=/etc/prometheus/prometheus.yml"
        - "--storage.tsdb.path=/prometheus"
        - "--web.console.templates=/etc/prometheus/consoles"
        - "--web.console.libraries=/etc/prometheus/console_libraries"
        - "--storage.tsdb.retention.time=30d"
        - "--web.enable-lifecycle"
        ports:
        - containerPort: 9090
        volumeMounts:
        - name: config-volume
          mountPath: /etc/prometheus
        - name: prometheus-storage
          mountPath: /prometheus
      volumes:
      - name: config-volume
        configMap:
          name: prometheus-config
  volumeClaimTemplates:
  - metadata:
      name: prometheus-storage
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 50Gi

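2.2 Alertmanager cluster configuration

# alertmanager-cluster.yaml
# This file only defines routing and receivers. Clustering itself is enabled
# on each Alertmanager instance with the --cluster.listen-address and
# --cluster.peer command-line flags.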
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'
  smtp_auth_username: 'user'
  smtp_auth_password: 'password'

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 3h
  receiver: 'team-email'

receivers:
- name: 'team-email'
  email_configs:
  - to: 'team@example.com'
    send_resolved: true
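
To make the `group_by` setting above concrete, the following Python sketch shows how alerts that share the same `alertname`, `cluster`, and `service` values end up in one notification group. It only illustrates the idea and is not Alertmanager's implementation; the `group_alerts` helper and the sample alerts are hypothetical.

```python
from collections import defaultdict

# Mirrors route.group_by in the configuration above.
GROUP_BY = ("alertname", "cluster", "service")


def group_alerts(alerts, group_by=GROUP_BY):
    """Bucket alert label sets by the values of the group_by labels."""
    groups = defaultdict(list)
    for labels in alerts:
        # In this sketch a missing label counts as an empty value.
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)


# Hypothetical alerts, for illustration only.
sample_alerts = [
    {"alertname": "HighCPU", "cluster": "prod", "service": "api", "instance": "node-1"},
    {"alertname": "HighCPU", "cluster": "prod", "service": "api", "instance": "node-2"},
    {"alertname": "DiskFull", "cluster": "prod", "service": "db", "instance": "node-3"},
]
```

Calling `group_alerts(sample_alerts)` yields two groups: the two `HighCPU` alerts differ only in `instance`, which is not a grouping label, so they would be delivered in a single email.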

2.3 Long-term storage and a global view with Thanos

# Install the Thanos components
helm install thanos bitnami/thanos \
  --set objstore.config.type=GCS \
  --set objstore.config.gcs.bucket=<BUCKET_NAME> \
  --set objstore.config.gcs.service_account_key=<BASE64_ENCODED_SA_KEY>

3. Configuring monitoring targets

3.1 Kubernetes monitoring configuration example

# prometheus-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s

    scrape_configs:
    - job_name: 'kubernetes-apiservers'
      kubernetes_sd_configs:
      - role: endpoints
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https

    - job_name: 'kubernetes-nodes'
      kubernetes_sd_configs:
      - role: node
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - target_label: __address__
        replacement: kubernetes.default.svc:443
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __metrics_path__
        replacement: /api/v1/nodes/${1}/proxy/metrics

    - job_name: 'kubernetes-pods'
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name
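
The `__address__` rewrite in the `kubernetes-pods` job is the least obvious rule above, so here is a small Python sketch of what it does. Prometheus joins the source label values with `;`, matches the fully anchored regex against the result, and writes the replacement into the target label; when the regex does not match, the label is left unchanged. The `rewrite_address` helper is hypothetical and only mimics that one rule.

```python
import re

# Same pattern as the relabel rule; Prometheus anchors it at both ends,
# which fullmatch() reproduces here.
ADDRESS_RE = re.compile(r"([^:]+)(?::\d+)?;(\d+)")


def rewrite_address(address, port_annotation):
    """Mimic the replace action on __address__ for one pod."""
    # Source labels are concatenated with the default ';' separator.
    joined = f"{address};{port_annotation}"
    match = ADDRESS_RE.fullmatch(joined)
    if match is None:
        # No match: a replace action leaves the target label as it was.
        return address
    # Equivalent of "replacement: $1:$2".
    return f"{match.group(1)}:{match.group(2)}"
```

For a pod discovered at `10.0.0.5:8080` with the annotation `prometheus.io/port: "9102"`, the helper returns `10.0.0.5:9102`. Without the annotation the joined string ends in `;`, the regex fails, and the discovered address is kept.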

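4. Measures for ensuring high availability

4.1 Data persistence

  • Give Prometheus a persistent storage volume, as the volumeClaimTemplates in the StatefulSet do
  • Set a retention policy that matches the volume size (the example keeps 30 days via --storage.tsdb.retention.time=30d)
  • Remote-write to long-term storage (for example Thanos, Cortex, or VictoriaMetrics)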
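
Whether a retention policy fits the volume can be estimated with the rule of thumb from the Prometheus storage documentation: needed disk space = retention time in seconds × ingested samples per second × bytes per sample, with roughly 1 to 2 bytes per sample. The sketch below applies it to the 30-day retention, 50Gi claim, and 15s scrape interval used in this article. The 150,000 active series figure and both helper names are assumptions for illustration; measure your own ingestion rate before sizing a volume.

```python
RETENTION_DAYS = 30      # --storage.tsdb.retention.time=30d
VOLUME_GIB = 50          # storage request in volumeClaimTemplates
SCRAPE_INTERVAL_S = 15   # global.scrape_interval
SECONDS_PER_DAY = 86_400


def samples_per_second(active_series, scrape_interval_s=SCRAPE_INTERVAL_S):
    """Each active series contributes one sample per scrape interval."""
    return active_series / scrape_interval_s


def needed_disk_bytes(retention_days, ingested_samples_per_s, bytes_per_sample=2):
    """Rule-of-thumb estimate; 2 bytes per sample is the pessimistic end."""
    return retention_days * SECONDS_PER_DAY * ingested_samples_per_s * bytes_per_sample


# Assumed workload: 150,000 active series (hypothetical).
rate = samples_per_second(150_000)
estimate = needed_disk_bytes(RETENTION_DAYS, rate)
fits = estimate <= VOLUME_GIB * 2**30
```

With these assumed inputs the rate is 10,000 samples per second and the estimate is 51,840,000,000 bytes (about 48.3 GiB), which only just fits the 50Gi claim. Doubling the scrape interval halves the estimate, which is one reason to review the interval alongside the retention setting.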

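4.2 Automatic recovery

  • Configure health checks for every monitoring component
  • Set Pod anti-affinity rules so that the replicas land on different nodes and no single node failure takes down all of them:
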
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: app
          operator: In
          values:
          - prometheus
      topologyKey: "kubernetes.io/hostname"

4.3 Resource limits

resources:
  limits:
    cpu: 2
    memory: 4Gi
  requests:
    cpu: 1
    memory: 2Gi

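5. Verification and testing

  1. Functional test

    curl http://prometheus-service:9090/api/v1/targets

    Check that every target reports a healthy state.

  2. Failover test

    • Manually stop one Prometheus instance
    • Verify that alerting and queries keep working
  3. Load test

    • Use the prombench tool to simulate high load
    • Monitor system resource usage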
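
The functional and failover checks above can be scripted. The sketch below reads the JSON returned by `/api/v1/targets`, lists every active target whose `health` is not `up`, and marks a replica as unreachable when the request fails. The helper names are hypothetical, and any replica URLs you pass in depend on how the Prometheus Service is defined in your cluster.

```python
import json
import urllib.request


def unhealthy_targets(payload):
    """Return (job, scrapeUrl, lastError) for each active target that is not up."""
    active = payload.get("data", {}).get("activeTargets", [])
    return [
        (
            target.get("labels", {}).get("job", ""),
            target.get("scrapeUrl", ""),
            target.get("lastError", ""),
        )
        for target in active
        if target.get("health") != "up"
    ]


def fetch_targets(base_url, timeout=5):
    """Query one Prometheus instance's targets endpoint and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=timeout) as resp:
        return json.load(resp)


def check_replicas(base_urls, fetch=fetch_targets):
    """Map each replica URL to its unhealthy targets, or to None if unreachable."""
    report = {}
    for url in base_urls:
        try:
            report[url] = unhealthy_targets(fetch(url))
        except OSError:
            # URLError and socket timeouts are OSError subclasses.
            report[url] = None
    return report
```

During a failover test, run `check_replicas` against both replicas after stopping one of them: the stopped replica should map to `None`, while the surviving one should still return an empty list of unhealthy targets. The `fetch` parameter exists so the logic can be exercised with canned responses, without a live cluster.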

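6. Maintenance and optimization

  1. Regular maintenance

    • Check storage usage
    • Review the alerting rules
    • Update the monitoring configuration
  2. Performance tuning

    • Adjust the scrape interval
    • Optimize PromQL queries
    • Consider recording rules for expensive or frequently used queries
  3. Security hardening

    • Enable TLS for communication between components
    • Configure appropriate RBAC rules
    • Rotate credentials regularly

With the configuration above, you can build a highly available monitoring system for container orchestration tools on Linux, so that monitoring keeps working even when some components fail.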