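Managing a cluster on Linux combines tools, protocols, and best practices to achieve high availability, load balancing, and efficient resource use. A detailed approach follows: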
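Tool | Purpose |
---|---|
Pacemaker | Resource management and failover |
Corosync | Cluster communication (heartbeat detection) |
Keepalived | IP failover (VIP management) |
Ansible | Batch configuration management (agentless) |
Kubernetes | Container orchestration and automation |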
# Ubuntu/Debian
sudo apt install pacemaker corosync pcs
# RHEL/CentOS
sudo yum install pacemaker corosync pcs
sudo systemctl enable --now pcsd
Edit /etc/corosync/corosync.conf:
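totem {
    version: 2
    cluster_name: my_cluster
    transport: udpu
    interface {
        ringnumber: 0
        # Network address of the node subnet, e.g. 192.168.1.0
        bindnetaddr: <node subnet address>
    }
}
# The udpu (unicast) transport needs an explicit list of cluster members
nodelist {
    node {
        ring0_addr: node1
        nodeid: 1
    }
    node {
        ring0_addr: node2
        nodeid: 2
    }
}
quorum {
    provider: corosync_votequorum
    # Number of nodes in the cluster
    expected_votes: 2
    # Lets a two-node cluster keep quorum when one node fails
    two_node: 1
}
logging {
    to_logfile: yes
    logfile: /var/log/cluster/corosync.log
}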
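The quorum settings above rest on majority voting: a partition stays quorate only while it holds more than half of the expected votes. The following is a minimal sketch of that arithmetic, assuming one vote per node; the function names quorum_votes and tolerated_failures are illustrative and not part of Corosync.

```shell
#!/bin/sh
# Votes required for quorum under a simple majority: floor(n / 2) + 1.
quorum_votes() {
    echo $(( $1 / 2 + 1 ))
}

# How many single-vote nodes may fail before the cluster loses quorum.
tolerated_failures() {
    echo $(( $1 - ($1 / 2 + 1) ))
}

# Print the figures for a few common cluster sizes.
for n in 2 3 5; do
    echo "nodes=$n quorum=$(quorum_votes "$n") tolerated=$(tolerated_failures "$n")"
done
```

With two nodes the majority rule tolerates no failures at all, which is why two-node clusters usually rely on a special case such as votequorum's two_node option or on a third vote from a quorum device.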
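# Option A: corosync.conf was written by hand; start the services on each node
sudo systemctl start corosync pacemaker
# Option B: let pcs generate corosync.conf (set the hacluster password on every node first)
sudo pcs cluster auth node1 node2 # authenticate the nodes; on pcs 0.10+: pcs host auth node1 node2
sudo pcs cluster setup --name my_cluster node1 node2 # on pcs 0.10+: pcs cluster setup my_cluster node1 node2
sudo pcs cluster start --all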
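# Pacemaker will not start resources until fencing (STONITH) is configured for the cluster
sudo pcs resource create nginx_ip ocf:heartbeat:IPaddr2 ip=192.168.1.100 cidr_netmask=24 op monitor interval=30s
sudo pcs resource create nginx_svc systemd:nginx op monitor interval=5s
# Keep the service on the same node as the VIP, and bring the VIP up first
sudo pcs constraint colocation add nginx_svc with nginx_ip INFINITY
sudo pcs constraint order nginx_ip then nginx_svc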
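Once the resources are defined, it helps to rehearse a failover rather than wait for a real outage. The sketch below is one possible drill, not a prescribed procedure: it puts a node in standby so its resources move away, shows where they landed, then restores the node. The helper names run and failover_drill are hypothetical, and the pcs node standby / unstandby subcommands assume pcs 0.10 or later (older releases use pcs cluster standby). Setting DRY_RUN=1 only prints the commands.

```shell
#!/bin/sh
# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# failover_drill NODE [WAIT_SECONDS]
# Moves resources off NODE, reports their new placement, then restores NODE.
failover_drill() {
    node=$1
    wait_s=${2:-10}
    run pcs node standby "$node"    # assumption: pcs 0.10+ syntax
    run sleep "$wait_s"             # give Pacemaker time to relocate resources
    run pcs status resources        # nginx_ip and nginx_svc should now run elsewhere
    run pcs node unstandby "$node"
}
```

A cautious first run would be DRY_RUN=1 failover_drill node1, followed by the real drill during a maintenance window.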
sudo apt install ipvsadm keepalived # or: sudo yum install ipvsadm keepalived
Edit /etc/keepalived/keepalived.conf (master node):
vrrp_instance VI_1 {
    state MASTER                 # use BACKUP on the standby node
    interface eth0
    virtual_router_id 51
    priority 100                 # set to 50 on the backup node
    virtual_ipaddress {
        192.168.1.100/24
    }
}
virtual_server 192.168.1.100 80 {
    delay_loop 6
    lb_algo wrr                  # weighted round robin
    lb_kind DR                   # direct routing mode
    protocol TCP
    real_server 192.168.1.101 80 {
        weight 1
        TCP_CHECK {
            connect_timeout 3
        }
    }
}
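The backup node uses the same file with a different state and a lower priority. As a rough sketch, the backup variant can be derived from the master file instead of being edited by hand; the function name to_backup_conf is hypothetical, and the default priority of 50 is taken from the comment in the configuration above. Other settings, such as the interface name, may still need adjusting per host.

```shell
#!/bin/sh
# to_backup_conf [PRIORITY]
# Reads a master keepalived.conf on stdin and writes the backup variant to stdout.
# Only lines whose first word is "state" or "priority" are changed.
to_backup_conf() {
    awk -v prio="${1:-50}" '
        $1 == "state" && $2 == "MASTER" { sub(/MASTER/, "BACKUP") }
        $1 == "priority"                { sub(/[0-9]+/, prio) }
        { print }
    '
}
```

For example, to_backup_conf < /etc/keepalived/keepalived.conf > keepalived.backup.conf produces a file that can be reviewed and then copied to the standby node.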
sudo kubeadm init --pod-network-cidr=10.244.0.0/16
mkdir -p $HOME/.kube
sudo cp /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
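# Install a network plugin such as Calico (check the Calico documentation for the current manifest URL)
kubectl apply -f https://docs.projectcalico.org/manifests/calico.yaml
# Run on each worker node to join it to the cluster
sudo kubeadm join <MASTER_IP>:6443 --token <TOKEN> --discovery-token-ca-cert-hash <HASH>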
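After the workers join, they stay NotReady until the network plugin is running on them. The following sketch polls kubectl get nodes until every node reports Ready. The function names count_not_ready and wait_for_nodes, and the default of 30 tries at 10-second intervals, are assumptions made for illustration.

```shell
#!/bin/sh
# count_not_ready: reads `kubectl get nodes --no-headers` output on stdin and
# prints how many nodes have a STATUS column that does not start with "Ready".
count_not_ready() {
    awk 'NF >= 2 && $2 !~ /^Ready/ { n++ } END { print n + 0 }'
}

# wait_for_nodes [TRIES] [DELAY_SECONDS]
# Returns 0 once all nodes are Ready, 1 if the tries run out.
wait_for_nodes() {
    tries=${1:-30}
    delay=${2:-10}
    while [ "$tries" -gt 0 ]; do
        # An empty or failed listing counts as "not ready yet".
        nodes=$(kubectl get nodes --no-headers 2>/dev/null) || nodes=""
        if [ -n "$nodes" ] && [ "$(printf '%s\n' "$nodes" | count_not_ready)" -eq 0 ]; then
            return 0
        fi
        tries=$((tries - 1))
        sleep "$delay"
    done
    return 1
}
```

A typical call would be wait_for_nodes 30 10 || echo "some nodes are still not Ready".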
pcs status # show cluster status
crm_verify -L -V # check the Pacemaker configuration
kubectl get nodes # list Kubernetes nodes
ipvsadm -Ln # list LVS rules
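These checks can be bundled into a single report. Since a given host rarely has all four tools installed, the sketch below skips any command that is missing instead of failing; the helper names run_if_present and cluster_report are hypothetical.

```shell
#!/bin/sh
# run_if_present CMD [ARGS...]
# Runs CMD when it is installed; otherwise prints a skip notice.
run_if_present() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "== $*"
        "$@"
    else
        echo "skip: $1 not installed"
    fi
}

# Collect the status checks listed above in one pass.
cluster_report() {
    run_if_present pcs status
    run_if_present crm_verify -L -V
    run_if_present kubectl get nodes
    run_if_present ipvsadm -Ln
}
```

Running cluster_report as root on a cluster node prints each section under an "==" header, which makes the output easy to attach to an incident ticket.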
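For troubleshooting, check connectivity between nodes (ping, and telnet to the relevant port). Use top or htop to check CPU and memory usage. Inspect the logs with journalctl -u corosync or under /var/log/cluster/*.log. Review the quorum policy.
With the methods above, you can build a stable and efficient Linux cluster environment. Choose the technology stack that fits your actual needs, and test failover scenarios regularly to ensure reliability.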