在Linux环境中配置高可用的容器编排工具(如Kubernetes、Docker Swarm等)监控系统,需要从多个层面确保监控的可靠性和持续性。以下是完整的解决方案:
[容器编排集群]
│
├── [Prometheus A]───[Alertmanager集群]───[通知渠道]
│ │
│ └── [长期存储]
└── [Prometheus B]
# prometheus-ha.yaml (Kubernetes示例)
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: prometheus
spec:
serviceName: prometheus
replicas: 2
selector:
matchLabels:
app: prometheus
template:
metadata:
labels:
app: prometheus
spec:
containers:
- name: prometheus
image: prom/prometheus
args:
- "--config.file=/etc/prometheus/prometheus.yml"
- "--storage.tsdb.path=/prometheus"
- "--web.console.templates=/etc/prometheus/consoles"
- "--web.console.libraries=/etc/prometheus/console_libraries"
- "--storage.tsdb.retention.time=30d"
- "--web.enable-lifecycle"
ports:
- containerPort: 9090
volumeMounts:
- name: config-volume
mountPath: /etc/prometheus
- name: prometheus-storage
mountPath: /prometheus
volumes:
- name: config-volume
configMap:
name: prometheus-config
volumeClaimTemplates:
- metadata:
name: prometheus-storage
spec:
accessModes: [ "ReadWriteOnce" ]
resources:
requests:
storage: 50Gi
# alertmanager-cluster.yaml
global:
resolve_timeout: 5m
smtp_smarthost: 'smtp.example.com:587'
smtp_from: 'alertmanager@example.com'
smtp_auth_username: 'user'
smtp_auth_password: 'password'
route:
group_by: ['alertname', 'cluster', 'service']
group_wait: 30s
group_interval: 5m
repeat_interval: 3h
receiver: 'team-email'
receivers:
- name: 'team-email'
email_configs:
- to: 'team@example.com'
send_resolved: true
# 安装Thanos组件
helm install thanos bitnami/thanos \
--set objstore.config.type=GCS \
--set objstore.config.gcs.bucket=<BUCKET_NAME> \
--set objstore.config.gcs.service_account_key=<BASE64_ENCODED_SA_KEY>
# prometheus-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-config
data:
prometheus.yml: |
global:
scrape_interval: 15s
evaluation_interval: 15s
scrape_configs:
- job_name: 'kubernetes-apiservers'
kubernetes_sd_configs:
- role: endpoints
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
action: keep
regex: default;kubernetes;https
- job_name: 'kubernetes-nodes'
kubernetes_sd_configs:
- role: node
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
- target_label: __address__
replacement: kubernetes.default.svc:443
- source_labels: [__meta_kubernetes_node_name]
regex: (.+)
target_label: __metrics_path__
replacement: /api/v1/nodes/${1}/proxy/metrics
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
- action: labelmap
regex: __meta_kubernetes_pod_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
action: replace
target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_pod_name]
action: replace
target_label: kubernetes_pod_name
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values:
- prometheus
topologyKey: "kubernetes.io/hostname"
resources:
limits:
cpu: 2
memory: 4Gi
requests:
cpu: 1
memory: 2Gi
功能测试:
curl http://prometheus-service:9090/api/v1/targets
检查所有目标是否健康
故障转移测试:
负载测试:
定期维护:
性能优化:
安全加固:
通过以上配置,您可以在Linux环境中建立一个高可用的容器编排工具监控系统,确保即使在部分组件故障时,监控功能仍能持续工作。