<h2 align="center">GPU Environment Standardized Deployment Scripts: Usage Guide</h2>
<p align="center">
<img src="https://img.shields.io/github/languages/code-size/nanchengcyu/TechMindWave-frontend" alt="code size"/>
<img src="https://img.shields.io/badge/ofed-17.0.2-blue" alt="ofed"/>
<img src="https://img.shields.io/badge/NVIDIA-565.57.01-brightgreen" alt="NVIDIA"/>
<img src="https://img.shields.io/badge/fabricmanager-565.57.01-blue" alt="fabricmanager"/>
<img src="https://img.shields.io/badge/CUDA-12.6.3-brightgreen" alt="CUDA"/>
<br>
<img src="https://img.shields.io/badge/Author-王云龙-orange" alt="Author" />
</p>
<hr>
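### I. Script overview
These scripts simplify the installation of GPU-related software and are intended for cases where a GPU environment has to be deployed quickly.
- **Core features**
  - The scripts install and uninstall the core components in batch: the NIC driver (MLNX_OFED), the NVIDIA GPU driver, the fabricmanager interconnect manager, the CUDA toolkit, nvidia-dcgm, DCGM-Exporter, and Node-Exporter.
- **Configuration notes**
  - User management: to delete the ubuntu user, run the user deletion commands manually and take care of the data and permissions tied to that user.
  - Disk management: expand partitions with a disk management tool, resizing them to fit the storage needs of the workload.
  - Network configuration: renaming a NIC requires editing the network configuration file by hand; choose names that fit the actual network environment and confirm that connectivity still works afterwards.
- **Recommendations**
  - On a fresh system, use the one-click install script, which deploys the whole GPU stack in one pass; see the one-click section later in this document.
  - If related components are already installed, or if components need to be deployed individually or with custom versions, use the per-component scripts instead.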
### II. Usage
#### 1. System initialization
```bash
cd /opt/ && wget -qO- http://116.205.97.109/scripts/system_optimize.sh|bash
# Disk expansion (already built into the initialization script; no need to run it again)
lvresize --extents +100%FREE --resizefs /dev/mapper/ubuntu--vg-ubuntu--lv
# Change the hostname (already built into the initialization script; no need to run it again)
#IP=$(ip addr | awk '/^[0-9]+: / {}; /inet.*global/ {print gensub(/(.*)\/(.*)/, "\\1", "g", $2)}' | grep `ip route | grep default | awk '{print $3}' | awk -F. '{print $1"."$2}' | head -1` | head -1 | sed 's/\./-/g')
#hostnamectl set-hostname $IP
#bash
# Kernel hold (already built into the initialization script; no need to run it again)
#apt-mark hold $(dpkg -l | grep -E "linux-(headers|image|unsigned|modules|modules-extra)" | grep "6.8.0-53" | awk '{print $2}')
#dpkg --get-selections | grep hold  # list held packages
```
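The commented-out hostname step above names the host after its primary IPv4 address, with the dots replaced by dashes. The sketch below shows that transformation in isolation. `ip_to_hostname` and `primary_ip` are hypothetical helper names, not part of `system_optimize.sh`, and the `ip route get` lookup is only one assumed way to find the primary address.

```bash
# Hypothetical helper: turn a dotted IPv4 address into a dash-separated hostname.
ip_to_hostname() {
    printf '%s\n' "$1" | tr '.' '-'
}

# Assumed lookup: source address the kernel would use for outbound traffic (iproute2).
primary_ip() {
    ip -4 route get 1.1.1.1 2>/dev/null |
        awk '{for (i = 1; i < NF; i++) if ($i == "src") {print $(i + 1); exit}}'
}

ip_to_hostname 10.51.151.201   # prints 10-51-151-201
```

On a real host the two helpers could be combined as `hostnamectl set-hostname "$(ip_to_hostname "$(primary_ip)")"`.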
#### 2. MLNX_OFED network stack install/uninstall
```bash
# Supported versions: [23.10-1.1.9.0] (the example below passes 24.10-2.1.8.0 for ubuntu24.04)
cd /opt/ && wget -qO- http://116.205.97.109/scripts/ib-drive.sh|bash -s -- --install --version "24.10-2.1.8.0" --distro "ubuntu24.04"
```
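After the OFED install finishes, it can help to confirm which version actually landed. MLNX_OFED ships an `ofed_info` tool whose `-s` flag prints a short version string; the sketch below assumes that string has the form `MLNX_OFED_LINUX-<version>:`. `parse_ofed_version` is a hypothetical helper, not part of `ib-drive.sh`.

```bash
# Hypothetical helper: strip the product prefix and trailing colon
# from a string such as "MLNX_OFED_LINUX-23.10-1.1.9.0:".
parse_ofed_version() {
    local v="${1#MLNX_OFED_LINUX-}"   # drop the leading product name
    printf '%s\n' "${v%:}"            # drop the trailing colon, if any
}

parse_ofed_version 'MLNX_OFED_LINUX-23.10-1.1.9.0:'   # prints 23.10-1.1.9.0
```

On an installed host the input would come from `ofed_info -s`, for example `parse_ofed_version "$(ofed_info -s)"`.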
#### 3. NVIDIA GPU driver install/uninstall
```bash
# Supported versions: [565.57.01] [570.124.06]
cd /opt/ && wget -qO- http://116.205.97.109/scripts/nvidia-driver.sh | bash -s -- --install --version '565.57.01'
cd /opt/ && wget -qO- http://116.205.97.109/scripts/nvidia-driver.sh | bash -s -- --install --version '570.124.06'
```
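A quick post-install check is to compare the requested driver version with what `nvidia-smi` reports. The sketch below is one way to do that; `driver_version_ok` is a hypothetical helper, not something `nvidia-driver.sh` provides. The optional second argument supplies the reported version directly, so the comparison can be tried on a machine without a GPU.

```bash
# Hypothetical check: succeed only if the reported driver version equals the expected one.
driver_version_ok() {
    local expected="$1"
    # Use the second argument if given; otherwise ask nvidia-smi for the first GPU's driver version.
    local actual="${2:-$(nvidia-smi --query-gpu=driver_version --format=csv,noheader 2>/dev/null | head -n1)}"
    [ "$actual" = "$expected" ]
}

if driver_version_ok 565.57.01 565.57.01; then
    echo "driver version matches"
fi
```

On a deployed node, `driver_version_ok 570.124.06 || echo "unexpected driver version"` would query the live driver.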
#### 4. GPU interconnect manager (fabricmanager) install/uninstall
```bash
# Supported versions: [565_565.57.01-1] [570_570.124.06-1]
cd /opt/ && wget -qO- http://116.205.97.109/scripts/nvidia-fabricmanager.sh|bash -s -- --install --distro ubuntu22.04 --version 565_565.57.01-1
cd /opt/ && wget -qO- http://116.205.97.109/scripts/nvidia-fabricmanager.sh|bash -s -- --install --distro ubuntu24.04 --version 570_570.124.06-1
```
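The fabricmanager version string packs three parts together: the driver branch, the driver version, and the package revision (`565_565.57.01-1`). Fabric Manager is expected to match the installed driver version, so a small pre-flight check can catch mismatched pairs. The helpers below are hypothetical and only parse the strings shown in this section.

```bash
# Extract the driver version from a fabricmanager version such as 565_565.57.01-1.
fm_driver_version() {
    local v="${1#*_}"          # drop the leading branch prefix, e.g. "565_"
    printf '%s\n' "${v%-*}"    # drop the trailing package revision, e.g. "-1"
}

# Succeed only if the fabricmanager version targets the given driver version.
fm_matches_driver() {
    [ "$(fm_driver_version "$1")" = "$2" ]
}

fm_driver_version 570_570.124.06-1   # prints 570.124.06
```

For example, `fm_matches_driver 565_565.57.01-1 570.124.06` fails, which flags a pairing that should not be installed together.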
#### 5. NVIDIA CUDA toolkit install/uninstall
```bash
# Supported versions: [12.6.3_560.35.05] [12.8.1_570.124.06]
cd /opt/ && wget -qO- http://116.205.97.109/scripts/cuda.sh | bash -s -- --install --version '12.6.3_560.35.05'
cd /opt/ && wget -qO- http://116.205.97.109/scripts/cuda.sh | bash -s -- --install --version '12.8.1_570.124.06'
```
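The CUDA version string follows the naming of NVIDIA's runfile installers: the toolkit version, an underscore, then the driver version bundled in that runfile (which is why `12.6.3_560.35.05` can be paired with driver 565.57.01). The sketch below rebuilds the runfile URL from such a string, using the URL pattern that appears in the Ubuntu 24.04 notes at the end of this README. Whether `cuda.sh` downloads from that location is an assumption, and `cuda_runfile_url` is a hypothetical helper.

```bash
# Hypothetical helper: build the runfile URL for a version string like 12.8.1_570.124.06.
cuda_runfile_url() {
    local spec="$1"
    local cuda_version="${spec%%_*}"   # toolkit version only, e.g. 12.8.1
    printf 'https://developer.download.nvidia.com/compute/cuda/%s/local_installers/cuda_%s_linux.run\n' \
        "$cuda_version" "$spec"
}

cuda_runfile_url 12.8.1_570.124.06
```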
#### 6. dcgm/node exporter install/uninstall
```bash
# Install
cd /opt/ && wget -qO- http://116.205.97.109/scripts/nvidia-dcgm.sh | bash -s -- --install
cd /opt/ && wget -qO- http://116.205.97.109/scripts/dcgm-exporter.sh | bash -s -- --install
cd /opt/ && wget -qO- http://116.205.97.109/scripts/node-exporter.sh | bash -s -- --install
cd /opt/ && wget -qO- http://116.205.97.109/scripts/deploy_gpu_monitor.sh | bash -s -- --install  # custom extension of dcgm-exporter; to be merged into dcgm later
# Uninstall
cd /opt/ && wget -qO- http://116.205.97.109/scripts/nvidia-dcgm.sh | bash -s -- --uninstall
cd /opt/ && wget -qO- http://116.205.97.109/scripts/dcgm-exporter.sh | bash -s -- --uninstall
cd /opt/ && wget -qO- http://116.205.97.109/scripts/node-exporter.sh | bash -s -- --uninstall
cd /opt/ && wget -qO- http://116.205.97.109/scripts/deploy_gpu_monitor.sh | bash -s -- --uninstall
```
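The four monitoring scripts take the same `--install` / `--uninstall` flag, so they can be driven from one loop. The sketch below is a hypothetical wrapper, not part of the repository. By default it only prints the commands (`DRY_RUN=1`); setting `DRY_RUN=0` would actually run them. The script order is copied from the listing above.

```bash
BASE_URL="http://116.205.97.109/scripts"

# Print (default) or run the given action for each monitoring script.
exporter_action() {
    local action="$1" script
    for script in nvidia-dcgm.sh dcgm-exporter.sh node-exporter.sh deploy_gpu_monitor.sh; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "cd /opt/ && wget -qO- ${BASE_URL}/${script} | bash -s -- --${action}"
        else
            # Subshell keeps the caller's working directory unchanged.
            ( cd /opt/ && wget -qO- "${BASE_URL}/${script}" | bash -s -- "--${action}" )
        fi
    done
}

exporter_action install
```

Reviewing the dry-run output first is a cheap safeguard, since each command pipes a downloaded script straight into `bash`.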
#### 7. Batch component install/uninstall
![Bundle [1]](https://img.shields.io/badge/组件[1]-orange?style=flat-square)
![Static Badge](https://img.shields.io/badge/mlnx_ofed-23.10.1.1.9.0-brightgreen?style=plastic)
![Static Badge](https://img.shields.io/badge/nvidia_drive-565.57.01-brightgreen?style=plastic)
![Static Badge](https://img.shields.io/badge/cuda-12.6.3.560.35.05-brightgreen?style=plastic)
![Static Badge](https://img.shields.io/badge/fabricmanager-565_565.57.01.1-brightgreen?style=plastic)
```bash
# Install ------------------------------------------------------------------------------------------------------------------------------------
cd /opt/ && wget -qO- http://116.205.97.109/scripts/system_optimize.sh|bash
cd /opt/ && wget -qO- http://116.205.97.109/scripts/ib-drive.sh|bash -s -- --install --version '23.10-1.1.9.0'
cd /opt/ && wget -qO- http://116.205.97.109/scripts/nvidia-driver.sh | bash -s -- --install --version '565.57.01'
cd /opt/ && wget -qO- http://116.205.97.109/scripts/cuda.sh | bash -s -- --install --version '12.6.3_560.35.05'
cd /opt/ && wget -qO- http://116.205.97.109/scripts/nvidia-fabricmanager.sh|bash -s -- --install --version '565_565.57.01-1'
# Uninstall ----------------------------------------------------------------------------------------------------------------------------------
cd /opt/ && wget -qO- http://116.205.97.109/scripts/ib-drive.sh |bash -s -- --uninstall --version '23.10-1.1.9.0'
cd /opt/ && wget -qO- http://116.205.97.109/scripts/nvidia-driver.sh | bash -s -- --uninstall --version '565.57.01'
cd /opt/ && wget -qO- http://116.205.97.109/scripts/cuda.sh | bash -s -- --uninstall --version '12.6.3_560.35.05'
cd /opt/ && wget -qO- http://116.205.97.109/scripts/nvidia-fabricmanager.sh|bash -s -- --uninstall --version '565_565.57.01-1'
```
![Bundle [2]](https://img.shields.io/badge/组件[2]-orange?style=flat-square)
![Static Badge](https://img.shields.io/badge/mlnx_ofed-23.10.1.1.9.0-brightgreen?style=plastic)
![Static Badge](https://img.shields.io/badge/nvidia_drive-570.124.06-brightgreen?style=plastic)
![Static Badge](https://img.shields.io/badge/cuda-12.8.1.570.124.06-brightgreen?style=plastic)
![Static Badge](https://img.shields.io/badge/fabricmanager-570.124.06.1-brightgreen?style=plastic)
```bash
# Install ------------------------------------------------------------------------------------------------------------------------------------
cd /opt/ && wget -qO- http://116.205.97.109/scripts/system_optimize.sh|bash
cd /opt/ && wget -qO- http://116.205.97.109/scripts/ib-drive.sh|bash -s -- --install --version '23.10-1.1.9.0'
cd /opt/ && wget -qO- http://116.205.97.109/scripts/nvidia-driver.sh | bash -s -- --install --version '570.124.06'
cd /opt/ && wget -qO- http://116.205.97.109/scripts/cuda.sh | bash -s -- --install --version '12.8.1_570.124.06'
cd /opt/ && wget -qO- http://116.205.97.109/scripts/nvidia-fabricmanager.sh|bash -s -- --install --version '570_570.124.06-1'
# Uninstall ----------------------------------------------------------------------------------------------------------------------------------
cd /opt/ && wget -qO- http://116.205.97.109/scripts/ib-drive.sh | bash -s -- --uninstall --version '23.10-1.1.9.0'
cd /opt/ && wget -qO- http://116.205.97.109/scripts/nvidia-driver.sh | bash -s -- --uninstall --version '570.124.06'
cd /opt/ && wget -qO- http://116.205.97.109/scripts/cuda.sh | bash -s -- --uninstall --version '12.8.1_570.124.06'
cd /opt/ && wget -qO- http://116.205.97.109/scripts/nvidia-fabricmanager.sh|bash -s -- --uninstall --version '570_570.124.06-1'
```
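The two bundles differ only in their version strings, so the per-bundle command lists can be generated from a small lookup table. The sketch below is a hypothetical helper, not one of the repository scripts. It prints the four component commands for a bundle and an action without running anything; the version strings are copied from the listings above.

```bash
# Hypothetical lookup table: OFED, driver, CUDA, and fabricmanager versions per bundle.
bundle_versions() {
    case "$1" in
        1) echo "23.10-1.1.9.0 565.57.01 12.6.3_560.35.05 565_565.57.01-1" ;;
        2) echo "23.10-1.1.9.0 570.124.06 12.8.1_570.124.06 570_570.124.06-1" ;;
        *) echo "unknown bundle: $1" >&2; return 1 ;;
    esac
}

# Print the commands for an action (install or uninstall) on a bundle (1 or 2).
print_bundle_commands() {
    local action="$1" base="http://116.205.97.109/scripts"
    local versions ofed driver cuda fm
    versions="$(bundle_versions "$2")" || return 1
    read -r ofed driver cuda fm <<< "$versions"
    echo "wget -qO- ${base}/ib-drive.sh | bash -s -- --${action} --version '${ofed}'"
    echo "wget -qO- ${base}/nvidia-driver.sh | bash -s -- --${action} --version '${driver}'"
    echo "wget -qO- ${base}/cuda.sh | bash -s -- --${action} --version '${cuda}'"
    echo "wget -qO- ${base}/nvidia-fabricmanager.sh | bash -s -- --${action} --version '${fm}'"
}

print_bundle_commands install 1
```

Keeping the versions in one place makes it harder for the install and uninstall lists to drift apart when a bundle is updated.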
![Recommended one-click install script](https://img.shields.io/badge/推荐一键安装脚本-orange?style=flat-square)
```bash
# Install/uninstall services (both take a long time, so running them in the background is recommended)
# Bundle [1] ---------------------------------------------------------------------------------------------------------------------------------
screen -dmS install_script bash -c "cd /opt/ && wget -qO- http://116.205.97.109/scripts/gpu-manager.sh|bash -s -- --install --version 1 --include=exporter > /opt/gpu-manager.log 2>&1";
tail -f /opt/gpu-manager.log
screen -dmS uninstall_script bash -c "cd /opt/ && wget -qO- http://116.205.97.109/scripts/gpu-manager.sh|bash -s -- --uninstall --version 1 --include=exporter > /opt/gpu-manager.log 2>&1";
tail -f /opt/gpu-manager.log
# Bundle [2] ---------------------------------------------------------------------------------------------------------------------------------
screen -dmS install_script bash -c "cd /opt/ && wget -qO- http://116.205.97.109/scripts/gpu-manager.sh|bash -s -- --install --version 2 --include=exporter > /opt/gpu-manager.log 2>&1";
tail -f /opt/gpu-manager.log
screen -dmS uninstall_script bash -c "cd /opt/ && wget -qO- http://116.205.97.109/scripts/gpu-manager.sh|bash -s -- --uninstall --version 2 --include=exporter > /opt/gpu-manager.log 2>&1";
tail -f /opt/gpu-manager.log
# Notes:
# --version 1 installs/uninstalls bundle [1] from section 7: mlnx_ofed 23.10-1.1.9.0 + nvidia driver 565.57.01 + cuda 12.6.3_560.35.05 + fabricmanager 565_565.57.01-1
# --version 2 installs/uninstalls bundle [2] from section 7: mlnx_ofed 23.10-1.1.9.0 + nvidia driver 570.124.06 + cuda 12.8.1_570.124.06 + fabricmanager 570_570.124.06-1
# --include=exporter also installs/uninstalls the exporter services [dcgm-exporter, node-exporter, nvidia-dcgm]; without it they are left untouched.
```
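When a long-running installer writes to a log file, the order of the redirections decides whether error output ends up in that file. `> file 2>&1` sends both streams to the file, while `2>&1 > file` duplicates stderr onto the original stdout before stdout is moved, so errors never reach the log. The demonstration below uses a hypothetical `emit_both` function and a temporary file in place of the real installer and `/opt/gpu-manager.log`.

```bash
# Emit one line on stdout and one on stderr.
emit_both() {
    echo "to stdout"
    echo "to stderr" >&2
}

log="$(mktemp)"

# stdout goes to the file first, then stderr is pointed at the same file.
emit_both > "$log" 2>&1
echo "lines captured with '> file 2>&1': $(wc -l < "$log")"

# Reversed order: stderr follows the old stdout (here a pipe), only stdout reaches the file.
emit_both 2>&1 > "$log" | cat > /dev/null
echo "lines captured with '2>&1 > file': $(wc -l < "$log")"

rm -f "$log"
```

The first form captures two lines and the second only one, which is why the log-friendly form is the one to use inside the `screen` command.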
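**Special notes**
```bash
# GPU: B200 series
# 1. On the B200 series, nvlsm must be installed along with fabricmanager; otherwise fabricmanager fails to start.
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/nvlsm_2025.03.1-1_amd64.deb
dpkg -i nvlsm_2025.03.1-1_amd64.deb
# Supermicro machines:
# 1. Add GRUB_CMDLINE_LINUX="quiet splash nokaslr" to /etc/default/grub; otherwise CUDA initialization fails.
# 2. Add GRUB_CMDLINE_LINUX_DEFAULT="intel_iommu=off" to /etc/default/grub; this is the correct way to turn off the IOMMU.
# 3. Mounting an ISO on Supermicro machines: protocol HTTP, file server http://10.51.151.201, image path /iso/ubu22043.iso
# Restoring from Clonezilla images:
# 1. ASRock B200: Clonezilla image on NFS 10.102.35.99/nfs/clone.iso, backup path /nfs/2025-05-26-09-B200-960g-img (ASRock machines have no requirement on the Clonezilla boot image version)
# 2. Supermicro B200: Clonezilla image 10.102.35.99/nfs/clone.iso, backup path /nfs/chaowei-B200-1.7T-img (Supermicro machines are sensitive to the Clonezilla boot image version; the latest version fails to boot)
# 3. Gigabyte A100: Clonezilla image 10.101.0.86:/nfs/, backup path /nfs/2025-07-15-03-Jijia-A100-960G-img (Gigabyte A100, 960G disk, CX7)
```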
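The two GRUB settings for Supermicro machines can be applied by hand, but a small idempotent helper avoids duplicate parameters when the step is repeated. The sketch below is a hypothetical helper that assumes GNU `sed` and double-quoted values, as in a stock Ubuntu `/etc/default/grub`. It runs against a temporary copy here rather than the real file.

```bash
# Hypothetical helper: add a kernel parameter to a GRUB_CMDLINE_* variable unless already present.
# Assumes the value is double-quoted; other quoting styles are not handled.
add_grub_param() {
    local file="$1" var="$2" param="$3" current word
    # Current value between the double quotes (empty if the variable is missing or empty).
    current="$(sed -n "s/^${var}=\"\(.*\)\"\$/\1/p" "$file" | tail -n1)"
    for word in $current; do
        [ "$word" = "$param" ] && return 0   # already present, nothing to do
    done
    if grep -q "^${var}=" "$file"; then
        # Rewrite the line as the old value plus the new parameter.
        sed -i "s|^${var}=.*|${var}=\"${current:+$current }${param}\"|" "$file"
    else
        printf '%s="%s"\n' "$var" "$param" >> "$file"
    fi
}

tmp="$(mktemp)"
printf 'GRUB_CMDLINE_LINUX="quiet splash"\nGRUB_CMDLINE_LINUX_DEFAULT=""\n' > "$tmp"
add_grub_param "$tmp" GRUB_CMDLINE_LINUX nokaslr
add_grub_param "$tmp" GRUB_CMDLINE_LINUX_DEFAULT intel_iommu=off
add_grub_param "$tmp" GRUB_CMDLINE_LINUX nokaslr   # repeated call changes nothing
cat "$tmp"
rm -f "$tmp"
```

After changing the real `/etc/default/grub` on Ubuntu, run `update-grub` and reboot for the new kernel command line to take effect.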
**ubuntu2404 (temporary):**
```bash
cd /opt/
wget https://content.mellanox.com/ofed/MLNX_OFED-24.10-2.1.8.0/MLNX_OFED_LINUX-24.10-2.1.8.0-ubuntu24.04-x86_64.tgz #[ubuntu24.04]
wget https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2404/x86_64/nvidia-fabricmanager-570_570.124.06-1_amd64.deb #[ubuntu24.04]
wget https://cn.download.nvidia.com/tesla/570.124.06/NVIDIA-Linux-x86_64-570.124.06.run #[no OS version requirement]
wget https://developer.download.nvidia.com/compute/cuda/12.8.1/local_installers/cuda_12.8.1_570.124.06_linux.run #[no OS version requirement]
cd /opt/ && git clone http://116.205.97.109:3000/yindun/ansible-devops.git
cd /opt/ansible-devops/scripts/
# Temporary substitutions to adapt the scripts to ubuntu24.04
sed -i -e 's/5.8-6.0.4.2/24.10-2.1.8.0/g' -e 's/22.04/24.04/g' ib-drive.sh && sed -i 's/2204/2404/g' nvidia-fabricmanager.sh
bash system_optimize.sh --install
bash ib-drive.sh --install --version "24.10-2.1.8.0"
bash nvidia-driver.sh --install --version '570.124.06'
bash nvidia-fabricmanager.sh --install --version "570_570.124.06-1"
bash cuda.sh --install --version "12.8.1_570.124.06"
# Install the exporters
cd /opt/ && wget -qO- http://116.205.97.109/scripts/nvidia-dcgm.sh | bash -s -- --install
cd /opt/ && wget -qO- http://116.205.97.109/scripts/dcgm-exporter.sh | bash -s -- --install
cd /opt/ && wget -qO- http://116.205.97.109/scripts/node-exporter.sh | bash -s -- --install
cd /opt/ && wget -qO- http://116.205.97.109/scripts/deploy_gpu_monitor.sh | bash -s -- --install  # custom extension of dcgm-exporter; to be merged into dcgm later
# Hostname change, kernel version hold, and root partition expansion are built into the initialization script and need not be run again.
```
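Before rewriting the scripts in place with `sed -i`, the substitutions can be previewed on a sample string. The sketch below applies the same two replacements as the `ib-drive.sh` command above, with the dots escaped so they match literal dots only. The sample line is hypothetical and merely stands in for a version reference inside `ib-drive.sh`.

```bash
# Hypothetical sample standing in for a line of ib-drive.sh.
sample='MLNX_OFED_LINUX-5.8-6.0.4.2-ubuntu22.04-x86_64.tgz'

# Same substitutions as above, reading stdin and writing stdout instead of editing a file.
adapt_for_2404() {
    sed -e 's/5\.8-6\.0\.4\.2/24.10-2.1.8.0/g' -e 's/22\.04/24.04/g'
}

printf '%s\n' "$sample" | adapt_for_2404
# prints MLNX_OFED_LINUX-24.10-2.1.8.0-ubuntu24.04-x86_64.tgz
```

Running `adapt_for_2404 < ib-drive.sh | diff ib-drive.sh -` shows exactly which lines the in-place edit would change.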