GPU Environment Standardized Deployment Scripts: Usage Guide



### 1. Script Overview

These scripts simplify the installation of GPU-related software and are intended for situations where a GPU environment has to be deployed quickly.

- **Core functions**:

```text
The scripts install and uninstall the core components in batch: NIC driver, GPU driver, fabricmanager (GPU interconnect manager), CUDA Toolkit, NVIDIA DCGM, DCGM-Exporter, and Node-Exporter.
```

- **Configuration notes**:

```text
User management: to delete the ubuntu user, run the user deletion commands manually and take care of the data and permissions associated with that user.
Disk management: to expand a disk partition, use a disk management tool to adjust and grow the partition according to the storage needs of the application.
Network configuration: to rename a NIC, edit the network configuration files manually and redefine the NIC name to match the actual network environment, then confirm that connectivity still works.
```

- **Usage recommendations**:

```text
On a fresh system, use the one-step automatic installation script, which deploys the full GPU stack quickly; see "(13) Batch install/uninstall" for how to run it.
If related components are already installed, or if components need to be deployed independently or with custom versions, use the per-component scripts instead.
```

### 2. Usage

#### (1) System initialization

```bash
cd /opt/ && wget -qO- http://116.205.97.109/scripts/system_optimize.sh | bash

# Disk expansion (already integrated into the initialization script; no need to run it again)
#lvresize --extents +100%FREE --resizefs /dev/mapper/ubuntu--vg-ubuntu--lv

# Hostname change (already integrated into the initialization script; no need to run it again)
#IP=$(ip addr | awk '/^[0-9]+: / {}; /inet.*global/ {print gensub(/(.*)\/(.*)/, "\\1", "g", $2)}' | grep `ip route | grep default | awk '{print $3}' | awk -F. '{print $1"."$2}' | head -1` | head -1 | sed 's/\./-/g')
#hostnamectl set-hostname $IP
#bash

# Kernel hold (already integrated into the initialization script; no need to run it again)
#apt-mark hold $(dpkg -l | grep -E "linux-(headers|image|unsigned|modules|modules-extra)" | grep "6.8.0-53" | awk '{print $2}')
#dpkg --get-selections | grep hold   # check the held packages
```

#### (2) MLNX_OFED network suite install/uninstall

```bash
# Supported versions: [23.10-1.1.9.0]
# The example below passes 24.10-2.1.8.0 for ubuntu24.04.
cd /opt/ && wget -qO- http://116.205.97.109/scripts/ib-drive.sh | bash -s -- --install --version "24.10-2.1.8.0" --distro "ubuntu24.04"
```

#### (3) IB NIC ordering

```bash
# Supported versions: [23.10-1.1.9.0]
cd /opt/ && wget -qO- http://116.205.97.109/scripts/ib.sh | bash -s -- --install
```

#### (4) NVIDIA GPU driver install/uninstall

```bash
# Supported versions: [565.57.01] [570.124.06]
cd /opt/ && wget -qO- http://116.205.97.109/scripts/nvidia-driver.sh | bash -s -- --install --version '565.57.01'
cd /opt/ && wget -qO- http://116.205.97.109/scripts/nvidia-driver.sh | bash -s -- --install --version '570.124.06'
```

#### (5) GPU interconnect manager (fabricmanager) install/uninstall

```bash
# Supported versions: [565_565.57.01-1] [570_570.124.06-1]
cd /opt/ && wget -qO- http://116.205.97.109/scripts/nvidia-fabricmanager.sh | bash -s -- --install --distro ubuntu22.04 --version 565_565.57.01-1
cd /opt/ && wget -qO- http://116.205.97.109/scripts/nvidia-fabricmanager.sh | bash -s -- --install --distro ubuntu24.04 --version 570_570.124.06-1
```

#### (6) NVIDIA CUDA Toolkit install/uninstall

```bash
# Supported versions: [12.6.3_560.35.05] [12.8.1_570.124.06]
cd /opt/ && wget -qO- http://116.205.97.109/scripts/cuda.sh | bash -s -- --install --version '12.6.3_560.35.05'
cd /opt/ && wget -qO- http://116.205.97.109/scripts/cuda.sh | bash -s -- --install --version '12.8.1_570.124.06'
```

#### (7) dcgm/node exporter install/uninstall

```bash
# Install
cd /opt/ && wget -qO- http://116.205.97.109/scripts/nvidia-dcgm.sh | bash -s -- --install
cd /opt/ && wget -qO- http://116.205.97.109/scripts/dcgm-exporter.sh | bash -s -- --install
cd /opt/ && wget -qO- http://116.205.97.109/scripts/node-exporter.sh | bash -s -- --install
cd /opt/ && wget -qO- http://116.205.97.109/scripts/deploy_gpu_monitor.sh | bash -s -- --install   # custom extension for dcgm-exporter; to be integrated into dcgm later

# Uninstall
cd /opt/ && wget -qO- http://116.205.97.109/scripts/nvidia-dcgm.sh | bash -s -- --uninstall
cd /opt/ && wget -qO- http://116.205.97.109/scripts/dcgm-exporter.sh | bash -s -- --uninstall
cd /opt/ && wget -qO- http://116.205.97.109/scripts/node-exporter.sh | bash -s -- --uninstall
cd /opt/ && wget -qO- http://116.205.97.109/scripts/deploy_gpu_monitor.sh | bash -s -- --uninstall
```

#### (8) Docker install/uninstall

```bash
# Supported versions: [5:28.4.0-1~ubuntu.24.04~noble]
cd /opt/ && wget -qO- http://116.205.97.109/scripts/docker.sh | bash -s -- --install --version '5:28.4.0-1~ubuntu.24.04~noble'
cd /opt/ && wget -qO- http://116.205.97.109/scripts/docker.sh | bash -s -- --uninstall --version '5:28.4.0-1~ubuntu.24.04~noble'
```

#### (9) nvidia-container-toolkit install/uninstall

```bash
# Supported versions: [1.17.6-1,1.17.7-1,1.17.8-1.....]
cd /opt/ && wget -qO- http://116.205.97.109/scripts/nvidia-container-toolkit.sh | bash -s -- --install --version '1.17.6-1'
cd /opt/ && wget -qO- http://116.205.97.109/scripts/nvidia-container-toolkit.sh | bash -s -- --uninstall --version '1.17.6-1'
# Check the installed version: nvidia-container-runtime --version | head -1
```

#### (10) Clonezilla source machine enhanced configuration

```bash
# Before cloning a system with Clonezilla, apply this enhanced configuration to the source (golden) machine
# so that the restored system works out of the box: no manual hostname change, no manual in-band IP change,
# no manual repair of missing udev rule files, and so on.
cd /opt/ && wget -qO- http://116.205.97.109/scripts/clonezilla_config.sh | bash

# Note: configure the IP mapping.
# After the script finishes, fill in /opt/ip.txt on the source machine with the in-band IP to out-of-band IP
# mapping of the hosts to be restored, in the format below. After the image is cloned to a target machine,
# the system matches its entry and configures itself automatically. For example:
cat /opt/ip.txt
# Column 1: in-band IP; column 2: subnet mask (as prefix length); column 3: in-band gateway; column 4: out-of-band IP
172.51.4.1 26 172.51.4.126 172.51.2.50
172.51.4.2 26 172.51.4.126 172.51.2.50
.......
```

#### (11) k8s cluster deployment

```bash
# Set up passwordless SSH
# Note: ip.txt lists the target hosts, one per line
cat > /opt/ip.txt << EOF
192.168.61.132
192.168.61.133
192.168.61.134
EOF
cd /opt/ && wget -qO- http://116.205.97.109/scripts/auto_ssh_auth_setup.sh | bash -s -- --file=/opt/ip.txt --user=root --passwd=xxxx

cd /opt/ && wget -qO- http://116.205.97.109/scripts/k8s-ubuntu-init.sh | bash   # system initialization
cd /opt/ && wget -qO- http://116.205.97.109/scripts/containerd.sh | bash -s -- --install --version '1.7.28-1'   # install containerd (run on all nodes)
cd /opt/ && wget -qO- http://116.205.97.109/scripts/k8s-base-setup.sh | bash -s -- --install --version '1.30.5'   # k8s base components (run on all nodes)
cd /opt/ && wget -qO- http://116.205.97.109/scripts/haproxy.sh | bash -s -- --install --backend 192.168.61.131:6443,192.168.61.132:6443,192.168.61.133:6443 --port 36443   # master nodes (3, 5, 7, ...)
cd /opt/ && wget -qO- http://116.205.97.109/scripts/keepalived.sh | bash -s -- --install --vip 192.168.61.200/24 --priority 150   # run on the primary node
cd /opt/ && wget -qO- http://116.205.97.109/scripts/keepalived.sh | bash -s -- --install --vip 192.168.61.200/24 --priority 140   # run on a backup node
cd /opt/ && wget -qO- http://116.205.97.109/scripts/keepalived.sh | bash -s -- --install --vip 192.168.61.200/24 --priority 130   # run on a backup node

# Generate and distribute the kubeadm configuration file
cd /opt/ && wget -qO- http://116.205.97.109/scripts/k8s-config-deploy.sh | bash -s -- --local-ip=192.168.61.131 --hostname=master-01 --k8s-version=1.30.5 --cluster-vip=192.168.61.200 --cluster-port=36443 --master1-ip=192.168.61.131 --master2-ip=192.168.61.132 --master3-ip=192.168.61.133   # (run on all nodes)

# Initialize the cluster
# k8s-master: initialize the cluster
cd /opt/ && wget -qO- http://116.205.97.109/scripts/k8s-cluster-deploy.sh | bash -s -- --install --master-ips 192.168.61.10,192.168.61.11,192.168.61.12 --node-ips 192.168.61.20,192.168.61.21

# Install the network plugin
# Status check
```

#### (12) AL multi-tenant platform: allreduce performance test / dcgm stress test

```bash
cd /opt/ && wget -qO- http://116.205.97.109/scripts/dcgm_diag_bg_log.sh | bash   # runs the stress test in the background by default; test level 4; expected duration 2-3 hours
cd /opt/ && wget -qO- http://116.205.97.109/scripts/mpi_allreduce_perf_test.sh | bash -s -- --host=node01-ip:8,node02-ip:8   # allreduce test
```

#### (13) Batch install/uninstall

![Static Badge](https://img.shields.io/badge/组件[1]-orange?style=flat-square)
![Static Badge](https://img.shields.io/badge/mlnx_ofed-23.10.1.1.9.0-brightgreen?style=plastic)
![Static Badge](https://img.shields.io/badge/nvidia_drive-565.57.01-brightgreen?style=plastic)
![Static Badge](https://img.shields.io/badge/cuda-12.6.3.560.35.05-brightgreen?style=plastic)
![Static Badge](https://img.shields.io/badge/fabricmanager-565_565.57.01.1-brightgreen?style=plastic)
![Static Badge](https://img.shields.io/badge/推荐一键安装脚本-orange?style=flat-square)

```bash
# Install/uninstall services (both take a long time; running them in the background is recommended):
# Set [1] -----------------------------------------------------------------------------------------
screen -dmS install_script bash -c "cd /opt/ && wget -qO- http://116.205.97.109/scripts/gpu-manager.sh | bash -s -- --install --version 1 --include=exporter > /opt/gpu-manager.log 2>&1"; tail -f /opt/gpu-manager.log
screen -dmS uninstall_script bash -c "cd /opt/ && wget -qO- http://116.205.97.109/scripts/gpu-manager.sh | bash -s -- --uninstall --version 1 --include=exporter > /opt/gpu-manager.log 2>&1"; tail -f /opt/gpu-manager.log
# Set [2] -----------------------------------------------------------------------------------------
screen -dmS install_script bash -c "cd /opt/ && wget -qO- http://116.205.97.109/scripts/gpu-manager.sh | bash -s -- --install --version 2 --include=exporter > /opt/gpu-manager.log 2>&1"; tail -f /opt/gpu-manager.log
screen -dmS uninstall_script bash -c "cd /opt/ && wget -qO- http://116.205.97.109/scripts/gpu-manager.sh | bash -s -- --uninstall --version 2 --include=exporter > /opt/gpu-manager.log 2>&1"; tail -f /opt/gpu-manager.log

# Notes:
# --version 1 installs/uninstalls component set [1]: mlnx_ofed-23.10.1.1.9.0 + nvidia_drive-565.57.01 + cuda-12.6.3.560.35.05 + fabricmanager-565_565.57.01.1
# --version 2 installs/uninstalls component set [2]: mlnx_ofed-23.10.1.1.9.0 + nvidia_drive-570.124.06 + cuda-12.8.1.570.124.06 + fabricmanager-570.124.06.1
# --include=exporter also installs/uninstalls the exporter services [dcgm-exporter, node-exporter, nvidia-dcgm]; without it they are left untouched.
```

**Special notes**:

```bash
# GPU: B200 series
# (1) When installing fabricmanager on B200-series machines, nvlsm must be installed too; otherwise fabricmanager cannot start.
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/nvlsm_2025.03.1-1_amd64.deb
dpkg -i nvlsm_2025.03.1-1_amd64.deb

# Supermicro machines
# (1) Add GRUB_CMDLINE_LINUX="quiet splash nokaslr" to /etc/default/grub; otherwise CUDA initialization fails.
# (2) Add GRUB_CMDLINE_LINUX_DEFAULT="intel_iommu=off" to /etc/default/grub to disable the IOMMU correctly.
# (3) Mounting an image on Supermicro machines: HTTP protocol: file_server: http://10.51.151.201  image path: /iso/ubu22043.iso

# Clonezilla image restore
# (1) ASRock B200: Clonezilla image: NFS: 10.102.35.99:/nfs/clone.iso  backup path: /nfs/2025-05-26-09-B200-960g-img   # ASRock machines have no requirement on the Clonezilla boot image version.
# (2) Supermicro B200: Clonezilla image: 10.102.35.99:/nfs/clone.iso  backup path: /nfs/chaowei-B200-1.7T-img   # Supermicro machines are sensitive to the Clonezilla boot image version; the latest version fails to boot.
# (3) Gigabyte A100: Clonezilla image: 10.101.0.86:/nfs/  backup path: /nfs/2025-07-15-03-Jijia-A100-960G-img   # Gigabyte A100, 960G disk, CX7
```

**ubuntu2404 (temporary)**:

```bash
cd /opt/
wget https://content.mellanox.com/ofed/MLNX_OFED-24.10-2.1.8.0/MLNX_OFED_LINUX-24.10-2.1.8.0-ubuntu24.04-x86_64.tgz   # [ubuntu24.04]
wget https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2404/x86_64/nvidia-fabricmanager-570_570.124.06-1_amd64.deb   # [ubuntu24.04]
wget https://cn.download.nvidia.com/tesla/570.124.06/NVIDIA-Linux-x86_64-570.124.06.run   # [no OS version requirement]
wget https://developer.download.nvidia.com/compute/cuda/12.8.1/local_installers/cuda_12.8.1_570.124.06_linux.run   # [no OS version requirement]

cd /opt/ && git clone http://116.205.97.109:3000/yindun/ansible-devops.git
cd /opt/ansible-devops/scripts/
# ----- Temporary substitutions to adapt the scripts to ubuntu24.04
sed -i -e 's/5.8-6.0.4.2/24.10-2.1.8.0/g' -e 's/22.04/24.04/g' ib-drive.sh && sed -i 's/2204/2404/g' nvidia-fabricmanager.sh
bash system_optimize.sh --install
bash ib-drive.sh --install --version "24.10-2.1.8.0"
bash nvidia-driver.sh --install --version '570.124.06'
bash nvidia-fabricmanager.sh --install --version "570_570.124.06-1"
bash cuda.sh --install --version "12.8.1_570.124.06"

# Install the exporters
cd /opt/ && wget -qO- http://116.205.97.109/scripts/nvidia-dcgm.sh | bash -s -- --install
cd /opt/ && wget -qO- http://116.205.97.109/scripts/dcgm-exporter.sh | bash -s -- --install
cd /opt/ && wget -qO- http://116.205.97.109/scripts/node-exporter.sh | bash -s -- --install
cd /opt/ && wget -qO- http://116.205.97.109/scripts/deploy_gpu_monitor.sh | bash -s -- --install   # custom extension for dcgm-exporter; to be integrated into dcgm later

# Hostname change, kernel version hold, and root partition expansion are integrated into the initialization script; no need to repeat them.
# Pinghu boot image: 10.101.0.34 /nfs/iso/gpu550.iso
```
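
The IP mapping file from (10) is filled in by hand, so a quick format check before cloning may save a failed restore. The following is a minimal sketch, not part of the deployment scripts: the function names `is_ipv4` and `validate_ip_map` are hypothetical, and it only assumes the four-column layout described in (10) (in-band IP, prefix length, in-band gateway, out-of-band IP). How `clonezilla_config.sh` actually parses the file is not documented here, so skipping blank lines and `#` comment lines is also an assumption.

```bash
#!/usr/bin/env bash
# Hypothetical helper: checks that every data line of an ip.txt-style file
# has exactly four columns: in-band IP, prefix length, gateway, out-of-band IP.

# Return 0 if $1 is a dotted-quad IPv4 address with each octet in 0..255.
is_ipv4() {
    local ip=$1 octet
    [[ $ip =~ ^[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}$ ]] || return 1
    local IFS=.
    for octet in $ip; do
        # 10# forces base 10 so that octets such as "08" are not read as octal.
        (( 10#$octet <= 255 )) || return 1
    done
    return 0
}

# Report each malformed line on stderr; return 1 if any line is malformed.
validate_ip_map() {
    local file=$1 lineno=0 bad=0 inband prefix gateway oob extra
    # The "|| [[ -n $inband ]]" part also processes a last line without a newline.
    while read -r inband prefix gateway oob extra || [[ -n $inband ]]; do
        lineno=$((lineno + 1))
        # Assumption: blank lines and lines starting with '#' are ignored.
        [[ -z $inband || $inband == \#* ]] && continue
        if [[ -n $extra || -z $oob ]] \
            || ! is_ipv4 "$inband" \
            || ! [[ $prefix =~ ^[0-9]{1,2}$ ]] || (( 10#$prefix > 32 )) \
            || ! is_ipv4 "$gateway" \
            || ! is_ipv4 "$oob"; then
            echo "line $lineno: malformed entry" >&2
            bad=1
        fi
    done < "$file"
    return "$bad"
}
```

After sourcing these functions, `validate_ip_map /opt/ip.txt && echo "ip.txt format OK"` checks the file on the source machine. A placeholder row such as `.......` is reported as malformed, so it should be removed from the real file first.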