<h2 align="center">GPU Environment Standardized Deployment Scripts: Usage Guide</h2>
<p align="center">
<img src="https://img.shields.io/github/languages/code-size/nanchengcyu/TechMindWave-frontend" alt="code size"/>
<img src="https://img.shields.io/badge/ofed-17.0.2-blue" alt="ofed"/>
<img src="https://img.shields.io/badge/NVIDIA-565.57.01-brightgreen" alt="NVIDIA"/>
<img src="https://img.shields.io/badge/fabricmanager-565.57.01-blue" alt="fabricmanager"/>
<img src="https://img.shields.io/badge/CUDA-12.6.3-brightgreen" alt="CUDA"/>
<br>
<img src="https://img.shields.io/badge/Author-王云龙-orange" alt="Author" />
</p>
<hr>
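
### I. Script Overview
These scripts simplify the installation of GPU-related software and are intended for situations where a GPU environment has to be deployed quickly.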
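- **Core features**
```bash
The scripts batch-install and batch-uninstall the core components: NIC driver, GPU driver, fabricmanager (GPU interconnect manager), CUDA Toolkit, NVIDIA DCGM, DCGM-Exporter, and Node-Exporter.
```
- **Configuration notes**
```bash
User management: to remove the ubuntu user, run the relevant user-deletion commands manually and take care of the data and permissions associated with that user.
Disk management: expand partitions with a disk management tool, adjusting and extending them as needed to meet the application's storage requirements.
Network configuration: renaming NICs requires editing the network configuration files by hand; rename the interfaces to suit the actual network environment and confirm that connectivity still works.
```
- **Recommendations**
```bash
On a fresh system, the one-step automatic installation script is recommended; it deploys the full GPU stack quickly (see the end of this document for usage). If related components are already installed, or the components need to be deployed individually or customized, use the standalone per-component scripts instead.
```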
### II. Usage
#### (1) System initialization
```bash
cd /opt/ && wget -qO- http://116.205.97.109/scripts/system_optimize.sh|bash
# Disk expansion (already integrated into the initialization script; no need to run it again)
#lvresize --extents +100%FREE --resizefs /dev/mapper/ubuntu--vg-ubuntu--lv
# Hostname change (already integrated into the initialization script; no need to run it again)
#IP=$(ip addr | awk '/^[0-9]+: / {}; /inet.*global/ {print gensub(/(.*)\/(.*)/, "\\1", "g", $2)}' | grep `ip route | grep default | awk '{print $3}' | awk -F. '{print $1"."$2}' | head -1` | head -1 | sed 's/\./-/g')
#hostnamectl set-hostname $IP
#bash
# Kernel pinning (already integrated into the initialization script; no need to run it again)
#apt-mark hold $(dpkg -l | grep -E "linux-(headers|image|unsigned|modules|modules-extra)" | grep "6.8.0-53" | awk '{print $2}')
#dpkg --get-selections | grep hold # list held packages
```
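The commented-out hostname step derives the host name from an IP address by replacing its dots with hyphens. A minimal sketch of that conversion alone, assuming the address is already known; `ip_to_hostname` is a hypothetical helper name and is not part of `system_optimize.sh`:

```bash
# Hypothetical helper (not part of system_optimize.sh):
# turn a dotted IPv4 address into a hyphenated host name.
ip_to_hostname() {
  # Reject anything that is not four dot-separated groups of 1-3 digits.
  if ! printf '%s\n' "$1" | grep -Eq '^[0-9]{1,3}(\.[0-9]{1,3}){3}$'; then
    echo "not an IPv4 address: $1" >&2
    return 1
  fi
  # Replace every dot with a hyphen.
  printf '%s\n' "$1" | tr '.' '-'
}

# Example (needs root, so left commented out):
# hostnamectl set-hostname "$(ip_to_hostname 192.168.61.131)"
```

Validating the input first avoids setting an empty or malformed host name when address detection fails.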
#### (2) MLNX_OFED network stack install/uninstall
```bash
# Supported versions: [23.10-1.1.9.0]
cd /opt/ && wget -qO- http://116.205.97.109/scripts/ib-drive.sh|bash -s -- --install --version "24.10-2.1.8.0" --distro "ubuntu24.04"
```
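The commands in this guide stream a script from the server straight into `bash`, so an interrupted download can end up partially executed. One alternative, sketched here under the assumption that the script is first saved to a local file, is to syntax-check it before running it. `run_checked` is a hypothetical helper, not one of the provided scripts, and `bash -n` only catches truncation that leaves an unterminated construct:

```bash
# Hypothetical helper: run a previously downloaded script only if it is
# non-empty and parses cleanly.
run_checked() {
  local script=$1
  shift
  # -s: the file exists and has a size greater than zero.
  if [ ! -s "$script" ]; then
    echo "missing or empty script: $script" >&2
    return 1
  fi
  # bash -n parses the file without executing any command.
  if ! bash -n "$script"; then
    echo "syntax check failed: $script" >&2
    return 1
  fi
  bash "$script" "$@"
}

# Example (requires network access and root, so left commented out):
# wget -q -O /opt/ib-drive.sh http://116.205.97.109/scripts/ib-drive.sh &&
#   run_checked /opt/ib-drive.sh --install --version "24.10-2.1.8.0" --distro "ubuntu24.04"
```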
#### (3) IB NIC ordering
```bash
# Supported versions: [23.10-1.1.9.0]
cd /opt/ && wget -qO- http://116.205.97.109/scripts/ib.sh|bash -s -- --install
```
#### (4) NVIDIA GPU driver install/uninstall
```bash
# Supported versions: [565.57.01] [570.124.06]
cd /opt/ && wget -qO- http://116.205.97.109/scripts/nvidia-driver.sh | bash -s -- --install --version '565.57.01'
cd /opt/ && wget -qO- http://116.205.97.109/scripts/nvidia-driver.sh | bash -s -- --install --version '570.124.06'
```
#### (5) GPU interconnect manager (fabricmanager) install/uninstall
```bash
# Supported versions: [565_565.57.01-1] [570_570.124.06-1]
cd /opt/ && wget -qO- http://116.205.97.109/scripts/nvidia-fabricmanager.sh|bash -s -- --install --distro ubuntu22.04 --version 565_565.57.01-1
cd /opt/ && wget -qO- http://116.205.97.109/scripts/nvidia-fabricmanager.sh|bash -s -- --install --distro ubuntu24.04 --version 570_570.124.06-1
```
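In both supported pairs, the fabricmanager package version is the driver version prefixed with its major branch and suffixed with `-1` (driver `565.57.01` pairs with `565_565.57.01-1`). A small sketch of that mapping, assuming the pattern also holds for any other version you use; `fm_version_for` is a hypothetical helper:

```bash
# Hypothetical helper: derive the fabricmanager version string
# from an NVIDIA driver version, following the two pairs listed above.
fm_version_for() {
  local driver="$1"
  # The major branch is everything before the first dot, e.g. 565.
  local major="${driver%%.*}"
  printf '%s_%s-1\n' "$major" "$driver"
}

# Example (left commented out):
# bash nvidia-fabricmanager.sh --install --version "$(fm_version_for 570.124.06)"
```

Deriving the value from the driver version keeps the two from drifting apart when only one of them is edited.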
#### (6) NVIDIA CUDA Toolkit install/uninstall
```bash
# Supported versions: [12.6.3_560.35.05] [12.8.1_570.124.06]
cd /opt/ && wget -qO- http://116.205.97.109/scripts/cuda.sh | bash -s -- --install --version '12.6.3_560.35.05'
cd /opt/ && wget -qO- http://116.205.97.109/scripts/cuda.sh | bash -s -- --install --version '12.8.1_570.124.06'
```
#### (7) DCGM / node exporter install/uninstall
```bash
cd /opt/ && wget -qO- http://116.205.97.109/scripts/nvidia-dcgm.sh | bash -s -- --install
cd /opt/ && wget -qO- http://116.205.97.109/scripts/dcgm-exporter.sh | bash -s -- --install
cd /opt/ && wget -qO- http://116.205.97.109/scripts/node-exporter.sh | bash -s -- --install
cd /opt/ && wget -qO- http://116.205.97.109/scripts/deploy_gpu_monitor.sh | bash -s -- --install # custom extension for dcgm-exporter; to be integrated into dcgm later
cd /opt/ && wget -qO- http://116.205.97.109/scripts/nvidia-dcgm.sh | bash -s -- --uninstall
cd /opt/ && wget -qO- http://116.205.97.109/scripts/dcgm-exporter.sh | bash -s -- --uninstall
cd /opt/ && wget -qO- http://116.205.97.109/scripts/node-exporter.sh | bash -s -- --uninstall
cd /opt/ && wget -qO- http://116.205.97.109/scripts/deploy_gpu_monitor.sh | bash -s -- --uninstall
```
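The eight commands above differ only in the script name and the action flag. A sketch of a loop that prints them for review instead of running them; `exporter_commands` and the two variables are hypothetical names introduced for illustration:

```bash
# Assumed from the commands above: base URL and the four script names.
BASE_URL="http://116.205.97.109/scripts"
EXPORTER_SCRIPTS="nvidia-dcgm.sh dcgm-exporter.sh node-exporter.sh deploy_gpu_monitor.sh"

# Print (do not run) one pipeline per exporter script for the given action.
exporter_commands() {
  local action="$1" script
  case "$action" in
    --install|--uninstall) ;;
    *) echo "usage: exporter_commands --install|--uninstall" >&2; return 1 ;;
  esac
  for script in $EXPORTER_SCRIPTS; do
    printf 'cd /opt/ && wget -qO- %s/%s | bash -s -- %s\n' "$BASE_URL" "$script" "$action"
  done
}
```

Printing first makes it easy to check the order and the flags before anything touches the machine.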
#### (8) Docker install/uninstall
```bash
# Supported versions: [5:28.4.0-1~ubuntu.24.04~noble]
cd /opt/ && wget -qO- http://116.205.97.109/scripts/docker.sh | bash -s -- --install --version '5:28.4.0-1~ubuntu.24.04~noble'
cd /opt/ && wget -qO- http://116.205.97.109/scripts/docker.sh | bash -s -- --uninstall --version '5:28.4.0-1~ubuntu.24.04~noble'
```
#### (9) nvidia-container-toolkit install/uninstall
```bash
# Supported versions: [1.17.6-1,1.17.7-1,1.17.8-1.....]
cd /opt/ && wget -qO- http://116.205.97.109/scripts/nvidia-container-toolkit.sh | bash -s -- --install --version '1.17.6-1'
cd /opt/ && wget -qO- http://116.205.97.109/scripts/nvidia-container-toolkit.sh | bash -s -- --uninstall --version '1.17.6-1'
# Check the version: nvidia-container-runtime --version | head -1
```
#### (10) NFS install/uninstall
```bash
cd /opt/ && wget -qO- http://116.205.97.109/scripts/nfs.sh | bash -s -- --install --share-dirs=/opt/data
cd /opt/ && wget -qO- http://116.205.97.109/scripts/nfs.sh | bash -s -- --uninstall --share-dirs=/opt/data
```
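#### (10) Clonezilla source-machine enhancement
```bash
# Before cloning a system with Clonezilla, apply this enhancement to the source machine so that the restored system works out of the box: no manual hostname or in-band IP changes, and the missing udev rules file problem is repaired.
cd /opt/ && wget -qO- http://116.205.97.109/scripts/clonezilla_config.sh | bash
# Note: configure the IP mapping
# After the script finishes, fill in /opt/ip.txt on the source machine with the in-band IP to out-of-band IP mapping of the restored hosts, in the format below (after cloning to a target machine, the system matches and applies its entry automatically). For example:
cat /opt/ip.txt
# Column 1: in-band IP; column 2: subnet mask (prefix length); column 3: in-band gateway; column 4: out-of-band IP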
172.51.4.1 26 172.51.4.126 172.51.2.50
172.51.4.2 26 172.51.4.126 172.51.2.50
.......
```
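Because the restored hosts configure themselves from `/opt/ip.txt`, a malformed row is worth catching before cloning. A sketch of a format check for the four-column layout described above; `check_ip_map` is a hypothetical helper and is not part of `clonezilla_config.sh`:

```bash
# Hypothetical checker for the four-column /opt/ip.txt layout:
# in-band IP, prefix length, in-band gateway, out-of-band IP.
check_ip_map() {
  awk '
    # True when s is four dot-separated numbers, each at most 255.
    function is_ip(s,   n, p, i) {
      n = split(s, p, ".")
      if (n != 4) return 0
      for (i = 1; i <= 4; i++)
        if (p[i] !~ /^[0-9]+$/ || p[i] > 255) return 0
      return 1
    }
    /^[ \t]*(#|$)/ { next }   # skip comments and blank lines
    {
      if (NF != 4 || !is_ip($1) || $2 !~ /^[0-9]+$/ || $2 > 32 || !is_ip($3) || !is_ip($4)) {
        printf "line %d is malformed: %s\n", NR, $0
        bad = 1
      }
    }
    END { exit (bad ? 1 : 0) }
  ' "$1"
}

# Example: check_ip_map /opt/ip.txt && echo "ip.txt looks well-formed"
```

The check only validates the shape of each row; it does not verify that the gateway is reachable or that the addresses are unique.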
#### (11) Kubernetes cluster deployment
```bash
# Configure passwordless SSH:
# Note: ip.txt
cat > /opt/ip.txt << EOF
192.168.61.131
192.168.61.132
192.168.61.133
192.168.61.134
EOF
cd /opt/ && wget -qO- http://116.205.97.109/scripts/auto_ssh_auth_setup.sh |bash -s -- --file=/opt/ip.txt --user=root --passwd=xxxx
cd /opt/ && wget -qO- http://116.205.97.109/scripts/k8s-ubuntu-init.sh | bash # system initialization
cd /opt/ && wget -qO- http://116.205.97.109/scripts/containerd.sh |bash -s -- --install --version '1.7.28-1' # install containerd; run on all nodes
cd /opt/ && wget -qO- http://116.205.97.109/scripts/k8s-base-setup.sh |bash -s -- --install --version '1.30.5' # Kubernetes base components; run on all nodes
cd /opt/ && wget -qO- http://116.205.97.109/scripts/haproxy.sh |bash -s -- --install --backend 192.168.61.131:6443,192.168.61.132:6443,192.168.61.133:6443 --port 36443
# Master nodes: 3, 5, 7, ...
cd /opt/ && wget -qO- http://116.205.97.109/scripts/keepalived.sh |bash -s -- --install --vip 192.168.61.200/24 --priority 150 # run on the primary node
cd /opt/ && wget -qO- http://116.205.97.109/scripts/keepalived.sh |bash -s -- --install --vip 192.168.61.200/24 --priority 140 # run on a backup node
cd /opt/ && wget -qO- http://116.205.97.109/scripts/keepalived.sh |bash -s -- --install --vip 192.168.61.200/24 --priority 130 # run on a backup node
# Generate and distribute the kubeadm configuration files
cd /opt/ && wget -qO- http://116.205.97.109/scripts/k8s-config-deploy.sh |bash -s -- --local-ip=192.168.61.131 --hostname=master-01 --k8s-version=1.30.5 --cluster-vip=192.168.61.200 --cluster-port=36443 --master1-ip=192.168.61.131 --master2-ip=192.168.61.132 --master3-ip=192.168.61.133 #master01
cd /opt/ && wget -qO- http://116.205.97.109/scripts/k8s-config-deploy.sh |bash -s -- --local-ip=192.168.61.132 --hostname=master-02 --k8s-version=1.30.5 --cluster-vip=192.168.61.200 --cluster-port=36443 --master1-ip=192.168.61.131 --master2-ip=192.168.61.132 --master3-ip=192.168.61.133 #master02
cd /opt/ && wget -qO- http://116.205.97.109/scripts/k8s-config-deploy.sh |bash -s -- --local-ip=192.168.61.133 --hostname=master-03 --k8s-version=1.30.5 --cluster-vip=192.168.61.200 --cluster-port=36443 --master1-ip=192.168.61.131 --master2-ip=192.168.61.132 --master3-ip=192.168.61.133 #master03
# Cluster initialization; take a snapshot at this point for debugging
# Initialize the cluster from k8s-master
cd /opt/ && wget -qO- http://116.205.97.109/scripts/k8s-cluster-deploy.sh |bash -s -- --master-ips=192.168.61.132,192.168.61.133 --node-ips=192.168.61.134
# The script has a bug; run the initialization manually: kubeadm init --config kubeadm-init.yaml --upload-certs
# Common tooling: install Helm
curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
# Install the NFS StorageClass
# Prerequisite: every node must have the NFS client installed: apt install -y nfs-common
# If there are no worker nodes, remove the control-plane taint so the master becomes schedulable: kubectl taint nodes master-03 node-role.kubernetes.io/control-plane:NoSchedule-
cd /opt/ && wget -qO- http://116.205.97.109/scripts/nfs.sh | bash -s -- --install --share-dirs=/opt/data # install NFS
cd /opt/ && wget -qO- http://116.205.97.109/scripts/install-nfs-storageclass-pro.sh | bash -s -- --nfs-server 192.168.61.131 --share-dirs /opt/data # specify the NFS server and share
# Install the Metrics Server component
cd /opt/k8s-install-conf/ && wget http://116.205.97.109/scripts/metrics-server.yaml && kubectl apply -f /opt/k8s-install-conf/metrics-server.yaml
# Test: kubectl top node
# Install the ingress-nginx component
cd /opt/k8s-install-conf/ && wget http://116.205.97.109/scripts/ingress.yaml && kubectl apply -f /opt/k8s-install-conf/ingress.yaml
# Install the network plugin
#wget -q -c -O /opt/k8s-install-conf/calico.yaml http://116.205.97.109/scripts/calico.yaml --show-progress && kubectl apply -f /opt/k8s-install-conf/calico.yaml
# Install the Kuboard UI as a static pod
# Run the following commands on a Kubernetes master node, then follow the prompts to complete the Kuboard installation. Default username/password: admin/Kuboard123
cd /opt/k8s-install-conf && curl -fsSL https://addons.kuboard.cn/kuboard/kuboard-static-pod.sh -o kuboard.sh
sed -i 's#eipwork/kuboard:v3#swr.cn-east-2.myhuaweicloud.com/kuboard/kuboard:v3#g' kuboard.sh
sh kuboard.sh
# Argo CD deployment [in-cluster deployment, dev environment]
kubectl create namespace argocd && cd /opt/k8s-install-conf && wget https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml
kubectl apply -n argocd -f install.yaml
# Get the password: kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d
```
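The `--backend` value passed to `haproxy.sh` in the block above is the list of master IPs, each followed by the API server port 6443. A sketch that builds that string from a list of addresses so it does not have to be typed by hand; `backend_list` is a hypothetical helper:

```bash
# Hypothetical helper: join master IPs into the comma-separated
# ip:port list that haproxy.sh takes via --backend.
backend_list() {
  local port="$1" ip out=""
  shift
  for ip in "$@"; do
    # ${out:+$out,} adds a comma only when out is already non-empty.
    out="${out:+$out,}${ip}:${port}"
  done
  printf '%s\n' "$out"
}

# Example (left commented out):
# bash haproxy.sh --install --backend "$(backend_list 6443 192.168.61.131 192.168.61.132 192.168.61.133)" --port 36443
```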
#### (12) AI multi-tenant platform: GPU resource benchmarks
```bash
cd /opt/ && wget -qO- http://116.205.97.109/scripts/gpu_bench_auto.sh|bash -s -- --tests=bandwidthTest,deviceQuery,gpu_burn,p2pBandwidthLatencyTest # GPU benchmarks
cd /opt/ && wget -qO- http://116.205.97.109/scripts/dcgm_diag_bg_log.sh | bash # stress test runs in the background by default at level 4; expected duration 2-3 hours
cd /opt/ && wget -qO- http://116.205.97.109/scripts/mpi_allreduce_perf_test.sh|bash -s -- --host=node01-ip:8,node02-ip:8 # allreduce test
```
#### (13) Quick monitoring setup with Docker
```bash
# Prerequisites: docker, docker-compose
# Prometheus configuration: 1. dcgm-exporter and node-exporter support Consul service discovery and file_sd discovery; 2. ipmi-exporter supports file_sd discovery
# Services deployed automatically: ipmi-exporter, prometheus, alertmanager, consul, grafana, prometheus-alert
wget -O - http://116.205.97.109/scripts/prometheus-monitor.tgz | tar -xvz && cd prometheus-monitor && docker-compose up -d
# The following commands can be run in bulk through Ansible, or registration can be looped from any node by sending PUT requests
bash /opt/prometheus-monitor/dcgm-consul.sh --register/deregister # register/deregister dcgm-exporter with Consul
bash /opt/prometheus-monitor/node-consul.sh --register/deregister # register/deregister node-exporter with Consul
```
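The comment above mentions registering exporters by sending PUT requests in a loop. A rough sketch of what such a loop could look like against Consul's agent API (`PUT /v1/agent/service/register`). The Consul address, the service name, and port 9400 (dcgm-exporter's usual default) are assumptions, both function names are hypothetical, and the bundled `dcgm-consul.sh` may work differently:

```bash
# Build the JSON body for Consul's service registration endpoint.
consul_payload() {
  local name="$1" addr="$2" port="$3"
  printf '{"ID":"%s-%s","Name":"%s","Address":"%s","Port":%s}\n' \
    "$name" "$addr" "$name" "$addr" "$port"
}

# Register one dcgm-exporter service per address in a file
# (assumed layout: one IP per line).
register_all() {
  local consul="$1" file="$2" ip
  while read -r ip; do
    [ -n "$ip" ] || continue
    # curl reads the request body from stdin via --data @-.
    consul_payload dcgm-exporter "$ip" 9400 |
      curl -fsS -X PUT --data @- "http://${consul}/v1/agent/service/register"
  done < "$file"
}

# Example (needs a reachable Consul agent, so left commented out):
# register_all 127.0.0.1:8500 /opt/nodes.txt
```

Giving each service an ID made of the name and the address keeps repeated registrations idempotent: registering the same ID again updates the entry instead of duplicating it.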
#### (14) Batch install/uninstall
![Static Badge](https://img.shields.io/badge/组件[1]-orange?style=flat-square)
![Static Badge](https://img.shields.io/badge/mlnx_ofed-23.10.1.1.9.0-brightgreen?style=plastic)
![Static Badge](https://img.shields.io/badge/nvidia_drive-565.57.01-brightgreen?style=plastic)
![Static Badge](https://img.shields.io/badge/cuda-12.6.3.560.35.05-brightgreen?style=plastic)
![Static Badge](https://img.shields.io/badge/fabricmanager-565_565.57.01.1-brightgreen?style=plastic)
![Static Badge](https://img.shields.io/badge/推荐一键安装脚本-orange?style=flat-square)
```bash
# Install/uninstall services (both take a long time; running them in the background is recommended.)
# Combination [1]-----------------------------------------------------------------------------------------------------------------------------------
screen -dmS install_script bash -c "cd /opt/ && wget -qO- http://116.205.97.109/scripts/gpu-manager.sh|bash -s -- --install --version 1 --include=exporter > /opt/gpu-manager.log 2>&1";
tail -f /opt/gpu-manager.log
screen -dmS uninstall_script bash -c "cd /opt/ && wget -qO- http://116.205.97.109/scripts/gpu-manager.sh|bash -s -- --uninstall --version 1 --include=exporter > /opt/gpu-manager.log 2>&1";
tail -f /opt/gpu-manager.log
# Combination [2]-----------------------------------------------------------------------------------------------------------------------------------
screen -dmS install_script bash -c "cd /opt/ && wget -qO- http://116.205.97.109/scripts/gpu-manager.sh|bash -s -- --install --version 2 --include=exporter > /opt/gpu-manager.log 2>&1";
tail -f /opt/gpu-manager.log
screen -dmS uninstall_script bash -c "cd /opt/ && wget -qO- http://116.205.97.109/scripts/gpu-manager.sh|bash -s -- --uninstall --version 2 --include=exporter > /opt/gpu-manager.log 2>&1";
tail -f /opt/gpu-manager.log
# Notes:
# --version 1 installs/uninstalls combination [1]: mlnx_ofed-23.10.1.1.9.0 + nvidia_drive-565.57.01 + cuda-12.6.3.560.35.05 + fabricmanager-565_565.57.01.1
# --version 2 installs/uninstalls combination [2]: mlnx_ofed-23.10.1.1.9.0 + nvidia_drive-570.124.06 + cuda-12.8.1.570.124.06 + fabricmanager-570.124.06.1
# --include=exporter: when given, the script also installs/uninstalls the exporter services [dcgm-exporter, node-exporter, nvidia-dcgm]; by default they are not installed/uninstalled.
```
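When a long background run is captured in a log file, the order of the two redirections decides where error output ends up: `> file 2>&1` sends both streams to the file, whereas `2>&1 > file` points stderr at the original stdout before stdout is moved, so errors never reach the file. A small demonstration with a stand-in function (`emit` is illustrative only):

```bash
# Stand-in for an installer that writes to both streams.
emit() {
  echo "progress on stdout"
  echo "failure on stderr" >&2
}

log_both=$(mktemp)
log_stdout_only=$(mktemp)

# stdout goes to the file first, then stderr follows it:
# both lines end up in the log.
emit > "$log_both" 2>&1

# stderr is pointed at the current stdout (here the command substitution)
# before stdout moves to the file: the error line is captured in
# $escaped and is missing from the log.
escaped=$(emit 2>&1 > "$log_stdout_only")
```

With `tail -f` on the log as the only view into a detached `screen` session, the first form is the one that keeps error messages visible.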
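**Important notes**
```bash
# GPU: B200 series
# (1) On the B200 series, nvlsm must be installed when installing fabricmanager; otherwise fabricmanager cannot start.
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/nvlsm_2025.03.1-1_amd64.deb
dpkg -i nvlsm_2025.03.1-1_amd64.deb
# Supermicro machines:
# (1) Supermicro machines need GRUB_CMDLINE_LINUX="quiet splash nokaslr" added to /etc/default/grub; otherwise CUDA initialization fails.
# (2) Supermicro machines need GRUB_CMDLINE_LINUX_DEFAULT="intel_iommu=off" added to /etc/default/grub # the correct way to disable the IOMMU
# (3) Supermicro image mounting: HTTP protocol; file_server: http://10.51.151.201; image path: /iso/ubu22043.iso
# Clonezilla image restore:
# (1) ASRock B200: Clonezilla image NFS 10.102.35.99/nfs/clone.iso; backup path: /nfs/2025-05-26-09-B200-960g-img # ASRock machines have no requirement on the Clonezilla boot image version
# (2) Supermicro B200: Clonezilla image: 10.102.35.99/nfs/clone.iso; backup path: /nfs/chaowei-B200-1.7T-img # Note: Supermicro machines require a specific Clonezilla boot image version; the latest version fails to boot.
# (3) Gigabyte A100: Clonezilla image: 10.101.0.86:/nfs/; backup path: /nfs/2025-07-15-03-Jijia-A100-960G-img # Gigabyte A100, 960G disk, CX7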
```
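The GRUB changes for Supermicro machines can be scripted so that repeated runs do not add the same parameter twice. A sketch assuming GNU `sed` and the usual `VAR="..."` layout of `/etc/default/grub`; `add_grub_param` is a hypothetical helper, and it takes the file path as an argument so it can be tried on a copy first:

```bash
# Hypothetical helper: append a kernel parameter to a GRUB variable
# unless it is already present.
add_grub_param() {
  local file="$1" var="$2" param="$3"
  # Already present as a whole word inside the quoted value: nothing to do.
  if grep -Eq "^${var}=\"([^\"]* )?${param}( [^\"]*)?\"" "$file"; then
    return 0
  fi
  if grep -q "^${var}=\"" "$file"; then
    # Append inside the existing quotes, then drop the leading space
    # that appears when the value was empty.
    sed -i -E "s|^(${var}=\")([^\"]*)\"|\1\2 ${param}\"|; s|^(${var}=\") |\1|" "$file"
  else
    # Variable not defined yet: add it at the end of the file.
    printf '%s="%s"\n' "$var" "$param" >> "$file"
  fi
}

# Example (needs root, so left commented out):
# add_grub_param /etc/default/grub GRUB_CMDLINE_LINUX nokaslr
# add_grub_param /etc/default/grub GRUB_CMDLINE_LINUX_DEFAULT intel_iommu=off
# update-grub && reboot
```

On Ubuntu the edited file only takes effect after `update-grub` regenerates the boot configuration and the machine is rebooted.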
**Ubuntu 24.04 (temporary)**
```bash
cd /opt/
wget https://content.mellanox.com/ofed/MLNX_OFED-24.10-2.1.8.0/MLNX_OFED_LINUX-24.10-2.1.8.0-ubuntu24.04-x86_64.tgz #[ubuntu24.04]
wget https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2404/x86_64/nvidia-fabricmanager-570_570.124.06-1_amd64.deb #[ubuntu24.04]
wget https://cn.download.nvidia.com/tesla/570.124.06/NVIDIA-Linux-x86_64-570.124.06.run #[no version requirement]
wget https://developer.download.nvidia.com/compute/cuda/12.8.1/local_installers/cuda_12.8.1_570.124.06_linux.run #[no version requirement]
cd /opt/ && git clone http://116.205.97.109:3000/yindun/ansible-devops.git
cd /opt/ansible-devops/scripts/
#----- Temporary substitutions to adapt the scripts to Ubuntu 24.04
sed -i -e 's/5.8-6.0.4.2/24.10-2.1.8.0/g' -e 's/22.04/24.04/g' ib-drive.sh && sed -i 's/2204/2404/g' nvidia-fabricmanager.sh
bash system_optimize.sh --install
bash ib-drive.sh --install --version "24.10-2.1.8.0"
bash nvidia-driver.sh --install --version '570.124.06'
bash nvidia-fabricmanager.sh --install --version "570_570.124.06-1"
bash cuda.sh --install --version "12.8.1_570.124.06"
# Install the exporters
cd /opt/ && wget -qO- http://116.205.97.109/scripts/nvidia-dcgm.sh | bash -s -- --install
cd /opt/ && wget -qO- http://116.205.97.109/scripts/dcgm-exporter.sh | bash -s -- --install
cd /opt/ && wget -qO- http://116.205.97.109/scripts/node-exporter.sh | bash -s -- --install
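cd /opt/ && wget -qO- http://116.205.97.109/scripts/deploy_gpu_monitor.sh | bash -s -- --install # custom extension for dcgm-exporter; to be integrated into dcgm later
# Hostname change, kernel version pinning, and root partition expansion are already integrated into the initialization script; no need to run them again.
# Pinghu boot image: 10.101.0.34 /nfs/iso/gpu550.iso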
```
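After the steps above, a quick check that the expected command-line tools are present can catch a failed install early. A sketch assuming the tools named in the example are what each component provides on your system (`nvcc` in particular is often under `/usr/local/cuda/bin` rather than on `PATH`); `missing_tools` is a hypothetical helper:

```bash
# Report which of the given commands are missing from PATH.
# Returns 0 when all are found, 1 otherwise.
missing_tools() {
  local tool status=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool"
      status=1
    fi
  done
  return "$status"
}

# Example (left commented out):
# missing_tools nvidia-smi ofed_info nvcc dcgmi
```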