What Are the Key Dimensions of Docker Container Monitoring?

Tags: container monitoring, container, Docker. Date: 2025-04-29

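Monitoring Docker containers is essential to keeping applications healthy, and it requires systematic observation across several dimensions. The core dimensions are described below.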

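CPU
- Monitor container CPU utilization, per-core utilization, and throttled time (time during which the container could not run because it had hit its CPU limit).
- Example tools: docker stats, cAdvisor, and Prometheus with node_exporter (note that node_exporter reports host-level metrics, not per-container ones).
- Key metrics: container_cpu_usage_seconds_total and, for throttling, container_cpu_cfs_throttled_seconds_total (both exposed by cAdvisor and scraped by Prometheus).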
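Because container_cpu_usage_seconds_total is a cumulative counter, utilization has to be derived from the difference between two readings. The following is a minimal sketch of that arithmetic; the function name and the sample readings are illustrative assumptions, not part of any monitoring tool.

```python
def cpu_usage_percent(prev_seconds: float, curr_seconds: float,
                      interval_seconds: float) -> float:
    """CPU usage over the interval, as a percentage of one core.

    prev_seconds and curr_seconds are two readings of a cumulative
    CPU-seconds counter such as container_cpu_usage_seconds_total.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    delta = curr_seconds - prev_seconds
    if delta < 0:
        # The counter went backwards (for example after a container restart),
        # so treat the current reading as the usage since the reset.
        delta = curr_seconds
    return delta / interval_seconds * 100.0


# Hypothetical readings taken 15 seconds apart.
usage = cpu_usage_percent(prev_seconds=120.0, curr_seconds=123.0,
                          interval_seconds=15.0)
print(f"{usage:.1f}% of one core")
```

A value above 100% simply means the container used more than one core during the interval.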
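Memory
- Monitor memory usage, cache, swap usage, and OOM (out-of-memory) events.
- Key metrics: container_memory_usage_bytes, container_memory_swap
- Note: distinguish RSS (memory actually held by the container's processes) from page cache. container_memory_usage_bytes includes cache, much of which the kernel can reclaim under memory pressure.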
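One way to see the RSS/cache split is to read the container's cgroup memory.stat file. The sketch below parses its "key value" lines; it assumes the cgroup v1 key names rss and cache and falls back to the cgroup v2 names anon and file. The sample text is illustrative, not real output.

```python
def parse_memory_stat(text: str) -> dict[str, int]:
    """Parse 'key value' lines as found in a cgroup memory.stat file."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def rss_and_cache(stats: dict[str, int]) -> tuple[int, int]:
    """Return (rss_bytes, cache_bytes) using cgroup v1 or v2 key names."""
    rss = stats.get("rss", stats.get("anon", 0))
    cache = stats.get("cache", stats.get("file", 0))
    return rss, cache


# Illustrative sample in cgroup v1 style.
SAMPLE = """cache 314572800
rss 104857600
swap 0
"""

rss, cache = rss_and_cache(parse_memory_stat(SAMPLE))
print(f"RSS: {rss / 2**20:.0f} MiB, cache: {cache / 2**20:.0f} MiB")
```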
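Disk I/O
- Read/write throughput (MB/s), IOPS, and latency (ms).
- Tools: docker stats --no-stream (the BLOCK I/O column shows cumulative bytes read and written), iostat (host-level).
- Key metrics: container_fs_reads_bytes_total, container_fs_writes_bytes_total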
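The BLOCK I/O figures from docker stats can also be consumed programmatically. This sketch assumes the output was produced with docker stats --no-stream --format '{{json .}}' and that each JSON line carries a BlockIO field such as "12.5MB / 4kB"; the sample line and helper names are hypothetical. Since the figures are cumulative, throughput is the difference between two samples divided by the sampling interval.

```python
import json
import re

# Decimal and binary unit multipliers for sizes such as "12.5MB" or "1.5GiB".
_UNITS = {
    "B": 1,
    "kB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12,
    "KiB": 2**10, "MiB": 2**20, "GiB": 2**30, "TiB": 2**40,
}


def parse_size(text: str) -> float:
    """Convert a human-readable size string to bytes."""
    match = re.fullmatch(r"\s*([\d.]+)\s*([A-Za-z]+)\s*", text)
    if not match or match.group(2) not in _UNITS:
        raise ValueError(f"unrecognized size: {text!r}")
    return float(match.group(1)) * _UNITS[match.group(2)]


def block_io_bytes(stats_json_line: str) -> tuple[float, float]:
    """Return (read_bytes, written_bytes) from one JSON line of stats output."""
    record = json.loads(stats_json_line)
    read_text, write_text = record["BlockIO"].split("/")
    return parse_size(read_text), parse_size(write_text)


# Hypothetical sample line; real output contains more fields.
SAMPLE_LINE = '{"Name": "web", "BlockIO": "12.5MB / 4kB"}'
print(block_io_bytes(SAMPLE_LINE))
```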
Network
- Inbound/outbound bandwidth (bps), packet counts, packet drop rate, and connection counts.
- Key metrics: container_network_receive_bytes_total, container_network_transmit_packets_dropped_total
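Bandwidth and drop rate both come from deltas of cumulative counters. The sketch below shows one plausible way to compute them; defining the drop ratio as dropped packets over total packets in the same interval is an assumption, and the function names and sample numbers are illustrative.

```python
def bandwidth_bps(bytes_prev: int, bytes_curr: int,
                  interval_seconds: float) -> float:
    """Average bandwidth in bits per second between two byte-counter readings."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return (bytes_curr - bytes_prev) * 8 / interval_seconds


def drop_ratio(dropped_prev: int, dropped_curr: int,
               packets_prev: int, packets_curr: int) -> float:
    """Dropped packets divided by packets seen in the same interval."""
    packets = packets_curr - packets_prev
    if packets <= 0:
        return 0.0  # no traffic in the interval, so nothing to report
    return (dropped_curr - dropped_prev) / packets


# Hypothetical counter readings taken 10 seconds apart.
print(bandwidth_bps(1_000_000, 1_500_000, 10.0))
print(drop_ratio(10, 15, 5000, 6000))
```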

Running state
- Up (running), Exited, and Restarting (alert on frequent restarts).
- Tool: docker ps -a --filter "status=exited"
Restart count
- Watch for abnormal restarts (for example, crashes or configuration errors).
- Command: docker inspect --format='{{.RestartCount}}' <container>
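The state and restart-count checks can be combined by reading the JSON that docker inspect prints. The following sketch assumes the usual layout of that output (a JSON array whose entries carry Name, RestartCount, and a State object with Status, ExitCode, and OOMKilled). The threshold and the sample document are hypothetical.

```python
import json


def restart_findings(inspect_json: str, max_restarts: int = 3) -> list[str]:
    """Return human-readable findings for each container in the inspect output."""
    findings = []
    for container in json.loads(inspect_json):
        name = container.get("Name", "").lstrip("/")
        state = container.get("State", {})
        restarts = container.get("RestartCount", 0)
        if restarts > max_restarts:
            findings.append(f"{name}: restarted {restarts} times")
        if state.get("OOMKilled"):
            findings.append(f"{name}: last exit was an OOM kill")
        if state.get("Status") == "exited" and state.get("ExitCode", 0) != 0:
            findings.append(f"{name}: exited with code {state['ExitCode']}")
    return findings


# Hypothetical, heavily trimmed inspect output.
SAMPLE = (
    '[{"Name": "/web", "RestartCount": 7, '
    '"State": {"Status": "exited", "ExitCode": 137, "OOMKilled": true}}]'
)
for finding in restart_findings(SAMPLE):
    print(finding)
```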
Start time
- For long-running containers, track uptime; for short-lived containers, measure startup duration.
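Both figures can be approximated from the timestamps in docker inspect output. This sketch assumes the Created and State.StartedAt fields are RFC 3339 UTC strings with up to nanosecond precision; the helper names are illustrative. The difference between the two timestamps only measures how long the container took to start its process, not how long the application took to become ready.

```python
import re
from datetime import datetime, timezone

_TS_RE = re.compile(r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.(\d+))?Z")


def parse_docker_time(value: str) -> datetime:
    """Parse a UTC timestamp, truncating the fraction to microseconds."""
    match = _TS_RE.fullmatch(value)
    if not match:
        raise ValueError(f"unsupported timestamp: {value!r}")
    base = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S")
    micros = int((match.group(2) or "0")[:6].ljust(6, "0"))
    return base.replace(microsecond=micros, tzinfo=timezone.utc)


def uptime_seconds(started_at: str, now: datetime) -> float:
    """Seconds elapsed between the container start and `now`."""
    return (now - parse_docker_time(started_at)).total_seconds()


def startup_seconds(created: str, started_at: str) -> float:
    """Seconds between container creation and process start."""
    delta = parse_docker_time(started_at) - parse_docker_time(created)
    return delta.total_seconds()


# Hypothetical timestamps.
print(uptime_seconds("2025-04-29T08:15:30.123456789Z",
                     datetime.now(timezone.utc)))
```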

Service response time
- HTTP request latency (P99, P95) and throughput (RPS).
- Tools: Prometheus + Grafana, New Relic, Datadog APM
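To make P95 and P99 concrete, here is a small sketch that computes them from raw latency samples using the nearest-rank method. Monitoring systems typically estimate percentiles from histograms instead, so this illustrates the definition rather than how any particular tool implements it.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds: 1, 2, ..., 100.
latencies_ms = [float(i) for i in range(1, 101)]
print("P95:", percentile(latencies_ms, 95))
print("P99:", percentile(latencies_ms, 99))
```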
Application logs
- Error logs (such as HTTP 5xx), exception stack traces, and keywords in business logs (such as ERROR).
- Tools: ELK, Fluentd, Loki
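As a rough illustration of keyword and 5xx detection, the sketch below scans log lines and counts both. It assumes an access-log style format in which the status code follows the quoted request line; the regular expression and the sample lines are hypothetical and would need adjusting for a real log format.

```python
import re
from collections import Counter

# Assumed format: ... "GET /path HTTP/1.1" 502 ...
STATUS_RE = re.compile(r'"\s(\d{3})\s')


def summarize_logs(lines: list[str], keyword: str = "ERROR") -> Counter:
    """Count lines containing the keyword and lines with an HTTP 5xx status."""
    counts = Counter()
    for line in lines:
        if keyword in line:
            counts["keyword"] += 1
        match = STATUS_RE.search(line)
        if match and match.group(1).startswith("5"):
            counts["http_5xx"] += 1
    return counts


# Hypothetical log lines.
SAMPLE_LINES = [
    '10.0.0.1 - - [29/Apr/2025:08:15:30 +0000] "GET /api HTTP/1.1" 200 512',
    '10.0.0.2 - - [29/Apr/2025:08:15:31 +0000] "GET /api HTTP/1.1" 502 0',
    "2025-04-29 08:15:31 ERROR upstream timed out",
]
print(summarize_logs(SAMPLE_LINES))
```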
JVM/runtime metrics (for example, Java or Python applications)
- For Java applications: GC count, heap memory usage, and thread pool state.

Scenario	Example toolchain
Basic monitoring	cAdvisor + Prometheus + Grafana
Log analysis	Fluentd + Elasticsearch + Kibana
End-to-end APM	SkyWalking + OpenTelemetry
Commercial solutions	Datadog, New Relic, Sysdig

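CPU throttling: the container hits its CPU limit and is throttled, which degrades performance.
Troubleshooting: check cpu.cfs_quota_us in the container's directory under /sys/fs/cgroup/cpu/docker/ (this path applies to cgroup v1; on cgroup v2 the quota is stored in cpu.max).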
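Next to cpu.cfs_quota_us, the kernel keeps throttling counters in a cpu.stat file in the same cgroup directory. The sketch below turns the quota into a core count and computes the fraction of enforcement periods that were throttled. It assumes the nr_periods and nr_throttled keys; the sample text and function names are illustrative.

```python
from typing import Optional


def cpu_limit_cores(quota_us: int, period_us: int) -> Optional[float]:
    """Convert a CFS quota/period pair to a core count; None means unlimited."""
    if quota_us < 0:
        return None  # cgroup v1 uses -1 for "no limit"
    return quota_us / period_us


def throttle_ratio(cpu_stat_text: str) -> float:
    """Fraction of CFS enforcement periods in which the cgroup was throttled."""
    stats = {}
    for line in cpu_stat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    periods = stats.get("nr_periods", 0)
    if periods == 0:
        return 0.0
    return stats.get("nr_throttled", 0) / periods


# Illustrative cpu.stat content in cgroup v1 style.
SAMPLE = "nr_periods 2000\nnr_throttled 500\nthrottled_time 42000000000\n"
print(cpu_limit_cores(200000, 100000), throttle_ratio(SAMPLE))
```

A persistently high ratio suggests the CPU limit is too low for the workload, even when average utilization looks moderate.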
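Memory leak: container memory grows steadily until the container is OOM-killed.
Troubleshooting: correlate docker stats output with jmap analysis (for Java applications).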
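Sustained growth can be spotted by fitting a line through periodic memory samples and looking at the slope. This is a minimal sketch using an ordinary least-squares fit; the sample data is hypothetical, and a positive slope alone does not prove a leak, since caches and warm-up can also cause growth.

```python
def growth_rate(samples: list[tuple[float, float]]) -> float:
    """Least-squares slope of (timestamp_seconds, memory_bytes) samples, in bytes/s."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    numerator = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    denominator = sum((t - mean_t) ** 2 for t, _ in samples)
    if denominator == 0:
        raise ValueError("timestamps must not all be equal")
    return numerator / denominator


# Hypothetical samples: memory grows by 1 MB every 60 seconds.
SAMPLES = [(0.0, 100e6), (60.0, 101e6), (120.0, 102e6), (180.0, 103e6)]
print(f"{growth_rate(SAMPLES):.0f} bytes/s")
```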
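Zombie containers: containers that have exited but were never removed still consume resources such as disk space.
Cleanup command: docker container prune (removes all stopped containers).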

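With monitoring across these dimensions, you can quickly pinpoint performance bottlenecks (such as CPU contention), resource leaks, and application failures. Combined with alerting rules (for example, Prometheus Alertmanager), this enables proactive operations.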
