2025-07-05 15:49:53 +08:00
<h2 align="center">GPU Environment Standardized Deployment Scripts: User Guide</h2>

<p align="center">
<img src="https://img.shields.io/github/languages/code-size/nanchengcyu/TechMindWave-frontend" alt="code size"/>
<img src="https://img.shields.io/badge/ofed-17.0.2-blue" alt="ofed"/>
<img src="https://img.shields.io/badge/NVIDIA-565.57.01-brightgreen" alt="NVIDIA"/>
<img src="https://img.shields.io/badge/fabricmanager-565.57.01-blue" alt="fabricmanager"/>
<img src="https://img.shields.io/badge/CUDA-12.6.3-brightgreen" alt="CUDA"/>
<br>
<img src="https://img.shields.io/badge/Author-王云龙-orange" alt="Author" />
</p>

<hr>

### 1. Script Overview

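These scripts simplify the installation of GPU-related software and are intended for situations where a GPU environment needs to be deployed quickly.

- **Core features**:

```text
The scripts install and uninstall the core components in batch: the NIC driver (MLNX_OFED), the NVIDIA GPU driver, the fabricmanager interconnect manager, the CUDA Toolkit, nvidia-dcgm, dcgm-exporter, and node-exporter.
```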
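
- **Configuration notes**:

```text
User management: to remove the ubuntu user, run the user deletion commands manually and take care of the data and permissions associated with that account.
Disk management: to grow a partition, use a disk management tool to resize and extend it according to the storage the applications need.
Network configuration: to rename a network interface, edit the network configuration files manually to match the actual network environment, then confirm that connectivity still works.
```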

- **Usage recommendations**:

```text
On a fresh system, use the one-step automatic installation script, which deploys the full GPU stack quickly; its usage is described near the end of this document, in section (7). If related components are already installed, or each component needs to be deployed independently or customized, use the per-component scripts instead.
```

### 2. Usage

#### (1) System initialization

```bash
# Run the system initialization script
cd /opt/ && wget -qO- http://116.205.97.109/scripts/system_optimize.sh | bash

# Grow the root logical volume and its filesystem into all free space (optional)
lvresize --extents +100%FREE --resizefs /dev/mapper/ubuntu--vg-ubuntu--lv

# Set the hostname from the primary IPv4 address (optional).
# Picks the global address that matches the first two octets of the default
# gateway, strips the prefix length, and replaces dots with dashes.
IP=$(ip addr | awk '/inet.*global/ {sub(/\/.*/, "", $2); print $2}' | grep "$(ip route | grep default | awk '{print $3}' | awk -F. '{print $1"."$2}' | head -1)" | head -1 | sed 's/\./-/g')
hostnamectl set-hostname "$IP"
bash   # start a new shell so the prompt picks up the new hostname

# Pin the 6.8.0-53 kernel packages so upgrades do not replace them (optional)
apt-mark hold $(dpkg -l | grep -E "linux-(headers|image|unsigned|modules|modules-extra)" | grep "6.8.0-53" | awk '{print $2}')
dpkg --get-selections | grep hold   # list the held packages
```
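The hostname step derives a name from the host's primary IPv4 address. The sketch below shows the same idea split into small functions so that each stage can be checked on its own. The function names `gateway_prefix`, `pick_address`, and `ip_to_hostname` are illustrative and are not part of the deployment scripts; the matching is also slightly stricter than the one-liner, since it only accepts addresses that start with the gateway prefix.

```bash
# Sketch only: helper names are hypothetical, not part of system_optimize.sh.

# Read `ip route` output on stdin and print the first two octets of the default gateway.
gateway_prefix() {
  awk '$1 == "default" { split($3, o, "."); print o[1] "." o[2]; exit }'
}

# Read `ip -o -4 addr show scope global` output on stdin and print the first
# address (without its prefix length) that starts with the prefix given as $1.
pick_address() {
  awk -v p="$1." '{ split($4, a, "/"); if (index(a[1], p) == 1) { print a[1]; exit } }'
}

# Replace dots with dashes so the address is usable as a hostname.
ip_to_hostname() {
  printf '%s\n' "$1" | tr '.' '-'
}

# Example wiring (commented out because it changes the hostname):
# prefix=$(ip route | gateway_prefix)
# addr=$(ip -o -4 addr show scope global | pick_address "$prefix")
# hostnamectl set-hostname "$(ip_to_hostname "$addr")"
```

Keeping the parsing separate from `hostnamectl` makes it possible to print the candidate name first and confirm it before applying it.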

#### (2) MLNX_OFED network suite: install/uninstall

```bash
# Supported versions: [23.10-1.1.9.0]
# Install
cd /opt/ && wget -qO- http://116.205.97.109/scripts/ib-drive.sh | bash -s -- --install --version '23.10-1.1.9.0'
# Uninstall
cd /opt/ && wget -qO- http://116.205.97.109/scripts/ib-drive.sh | bash -s -- --uninstall --version '23.10-1.1.9.0'
```
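After the install finishes, it can be useful to confirm that the expected OFED release is active. The sketch below assumes that `ofed_info -s` prints a single line of the form `MLNX_OFED_LINUX-<version>:`; the helper names `ofed_version_from` and `ofed_version_ok` are hypothetical and not provided by `ib-drive.sh`.

```bash
# Sketch only: assumes `ofed_info -s` prints e.g. "MLNX_OFED_LINUX-23.10-1.1.9.0:".

# Extract the version from an `ofed_info -s` style line passed as $1.
ofed_version_from() {
  printf '%s\n' "$1" | sed -e 's/^MLNX_OFED_LINUX-//' -e 's/:$//'
}

# Return 0 when the installed OFED version equals the expected one ($1).
ofed_version_ok() {
  expected=$1
  installed=$(ofed_version_from "$(ofed_info -s 2>/dev/null)")
  [ "$installed" = "$expected" ]
}

# Usage after installation:
# ofed_version_ok '23.10-1.1.9.0' && echo "OFED version matches"
```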

#### (3) NVIDIA GPU driver: install/uninstall

```bash
# Supported versions: [565.57.01] [570.124.06]
# Install
cd /opt/ && wget -qO- http://116.205.97.109/scripts/nvidia-driver.sh | bash -s -- --install --version '565.57.01'
# Uninstall
cd /opt/ && wget -qO- http://116.205.97.109/scripts/nvidia-driver.sh | bash -s -- --uninstall --version '565.57.01'
```
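A quick way to confirm the driver install is to compare the version reported by `nvidia-smi` with the one that was requested. The sketch below is a minimal check under the assumption that every GPU in the node should report the same driver version; `driver_matches` is a hypothetical helper, not part of `nvidia-driver.sh`.

```bash
# Sketch only: driver_matches is a hypothetical helper.

# Compare an expected driver version ($1) with nvidia-smi output ($2).
# nvidia-smi prints one line per GPU; every line must equal the expected version.
driver_matches() {
  expected=$1
  reported=$2
  [ -n "$reported" ] || return 1
  printf '%s\n' "$reported" | while IFS= read -r line; do
    [ "$line" = "$expected" ] || exit 1
  done
}

# Usage after installation:
# driver_matches '565.57.01' "$(nvidia-smi --query-gpu=driver_version --format=csv,noheader)" \
#   && echo "driver version matches on all GPUs"
```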

#### (4) GPU interconnect manager (fabricmanager): install/uninstall

```bash
# Supported versions: [565_565.57.01-1] [570_570.124.06-1]
# Install
cd /opt/ && wget -qO- http://116.205.97.109/scripts/nvidia-fabricmanager.sh | bash -s -- --install --version '565_565.57.01-1'
# Uninstall
cd /opt/ && wget -qO- http://116.205.97.109/scripts/nvidia-fabricmanager.sh | bash -s -- --uninstall --version '565_565.57.01-1'
```
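Both supported fabricmanager versions follow the same pattern: the driver branch, an underscore, the full driver version, and a `-1` suffix. The sketch below derives the fabricmanager version string from a driver version so that the two stay in step. It is inferred only from the two pairs listed in this document, and the `-1` revision may not hold for other releases; `fm_version_for_driver` is a hypothetical helper.

```bash
# Sketch only: pattern inferred from the two supported pairs in this document:
#   565.57.01  -> 565_565.57.01-1
#   570.124.06 -> 570_570.124.06-1
fm_version_for_driver() {
  driver=$1
  branch=${driver%%.*}          # text before the first dot, e.g. 565
  printf '%s_%s-1\n' "$branch" "$driver"
}

# Usage:
# fm_version_for_driver '570.124.06'
# systemctl is-active nvidia-fabricmanager   # assumed service name; check after installation
```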

#### (5) NVIDIA CUDA Toolkit: install/uninstall

```bash
# Supported versions: [12.6.3_560.35.05] [12.8.1_570.124.06]
# Install
cd /opt/ && wget -qO- http://116.205.97.109/scripts/cuda.sh | bash -s -- --install --version '12.6.3_560.35.05'
# Uninstall
cd /opt/ && wget -qO- http://116.205.97.109/scripts/cuda.sh | bash -s -- --uninstall --version '12.6.3_560.35.05'
```
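The CUDA version strings combine two parts separated by an underscore. The sketch below assumes the first part is the CUDA release and the second is the driver version that the toolkit package was built against, and that an installed driver at or above that version is acceptable. That reading is an assumption, as are the helper names `cuda_part`, `driver_part`, and `version_ge`.

```bash
# Sketch only: the meaning of the two parts is an assumption, see the text above.

# Split a version string such as 12.6.3_560.35.05 into its two parts.
cuda_part()   { printf '%s\n' "${1%%_*}"; }
driver_part() { printf '%s\n' "${1#*_}"; }

# Return 0 when version $1 is greater than or equal to version $2 (uses sort -V).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Usage: check an installed driver against the driver part of a CUDA version string.
# installed=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
# version_ge "$installed" "$(driver_part '12.6.3_560.35.05')" || echo "driver is older than expected"
```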

#### (6) DCGM and node exporter: install/uninstall

```bash
# Install
cd /opt/ && wget -qO- http://116.205.97.109/scripts/nvidia-dcgm.sh | bash -s -- --install
cd /opt/ && wget -qO- http://116.205.97.109/scripts/dcgm-exporter.sh | bash -s -- --install
cd /opt/ && wget -qO- http://116.205.97.109/scripts/node-exporter.sh | bash -s -- --install

# Uninstall
cd /opt/ && wget -qO- http://116.205.97.109/scripts/nvidia-dcgm.sh | bash -s -- --uninstall
cd /opt/ && wget -qO- http://116.205.97.109/scripts/dcgm-exporter.sh | bash -s -- --uninstall
cd /opt/ && wget -qO- http://116.205.97.109/scripts/node-exporter.sh | bash -s -- --uninstall
```
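Once the exporters are installed, a simple health check is to request their metrics endpoints. The sketch below assumes the upstream default ports, 9400 for dcgm-exporter and 9100 for node-exporter, and the usual `/metrics` path; the deployment scripts may configure different values, so adjust as needed. `metrics_url` and `exporter_up` are hypothetical helpers.

```bash
# Sketch only: ports and path are upstream defaults, assumed rather than taken from the scripts.

# Build the metrics URL for a host ($1) and port ($2).
metrics_url() {
  printf 'http://%s:%s/metrics\n' "$1" "$2"
}

# Return 0 when the endpoint answers with an HTTP success status within 5 seconds.
exporter_up() {
  curl -fsS --max-time 5 -o /dev/null "$(metrics_url "$1" "$2")"
}

# Usage:
# exporter_up localhost 9400 && echo "dcgm-exporter is serving metrics"
# exporter_up localhost 9100 && echo "node-exporter is serving metrics"
```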
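
#### (7) Batch component install/uninstall

![Combination [1]](https://img.shields.io/badge/组合-[1]-brightgreen)
![ofed](https://img.shields.io/badge/ofed-23.10.1.1.9.0-blue)
![NVIDIA](https://img.shields.io/badge/NVIDIA-565.57.01-blue)
![fabricmanager](https://img.shields.io/badge/fabricmanager-565.57.01-blue)
![CUDA](https://img.shields.io/badge/CUDA-12.6.3-blue)
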
```bash
# Install ------------------------------------------------------------
cd /opt/ && wget -qO- http://116.205.97.109/scripts/system_optimize.sh | bash
cd /opt/ && wget -qO- http://116.205.97.109/scripts/ib-drive.sh | bash -s -- --install --version '23.10-1.1.9.0'
cd /opt/ && wget -qO- http://116.205.97.109/scripts/nvidia-driver.sh | bash -s -- --install --version '565.57.01'
cd /opt/ && wget -qO- http://116.205.97.109/scripts/cuda.sh | bash -s -- --install --version '12.6.3_560.35.05'
cd /opt/ && wget -qO- http://116.205.97.109/scripts/nvidia-fabricmanager.sh | bash -s -- --install --version '565_565.57.01-1'

# Uninstall ----------------------------------------------------------
cd /opt/ && wget -qO- http://116.205.97.109/scripts/ib-drive.sh | bash -s -- --uninstall --version '23.10-1.1.9.0'
cd /opt/ && wget -qO- http://116.205.97.109/scripts/nvidia-driver.sh | bash -s -- --uninstall --version '565.57.01'
cd /opt/ && wget -qO- http://116.205.97.109/scripts/cuda.sh | bash -s -- --uninstall --version '12.6.3_560.35.05'
cd /opt/ && wget -qO- http://116.205.97.109/scripts/nvidia-fabricmanager.sh | bash -s -- --uninstall --version '565_565.57.01-1'
```
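
![Combination [2]](https://img.shields.io/badge/组合-[2]-brightgreen)
![ofed](https://img.shields.io/badge/ofed-23.10.1.1.9.0-blue)
![NVIDIA](https://img.shields.io/badge/NVIDIA-570.124.06-blue)
![fabricmanager](https://img.shields.io/badge/fabricmanager-570.124.06-blue)
![CUDA](https://img.shields.io/badge/CUDA-12.8.1-blue)
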
```bash
# Install ------------------------------------------------------------
cd /opt/ && wget -qO- http://116.205.97.109/scripts/system_optimize.sh | bash
cd /opt/ && wget -qO- http://116.205.97.109/scripts/ib-drive.sh | bash -s -- --install --version '23.10-1.1.9.0'
cd /opt/ && wget -qO- http://116.205.97.109/scripts/nvidia-driver.sh | bash -s -- --install --version '570.124.06'
cd /opt/ && wget -qO- http://116.205.97.109/scripts/cuda.sh | bash -s -- --install --version '12.8.1_570.124.06'
cd /opt/ && wget -qO- http://116.205.97.109/scripts/nvidia-fabricmanager.sh | bash -s -- --install --version '570_570.124.06-1'

# Uninstall ----------------------------------------------------------
cd /opt/ && wget -qO- http://116.205.97.109/scripts/ib-drive.sh | bash -s -- --uninstall --version '23.10-1.1.9.0'
cd /opt/ && wget -qO- http://116.205.97.109/scripts/nvidia-driver.sh | bash -s -- --uninstall --version '570.124.06'
cd /opt/ && wget -qO- http://116.205.97.109/scripts/cuda.sh | bash -s -- --uninstall --version '12.8.1_570.124.06'
cd /opt/ && wget -qO- http://116.205.97.109/scripts/nvidia-fabricmanager.sh | bash -s -- --uninstall --version '570_570.124.06-1'
```
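The two combinations differ only in the driver, CUDA, and fabricmanager versions, so the per-component commands can be generated from a small table. The sketch below prints the commands without running them, which is handy for reviewing what will be executed. The helper names `combo_versions` and `print_combo_commands` are hypothetical, and `system_optimize.sh` is left out because it takes no version.

```bash
# Sketch only: prints commands for review; it does not download or run anything.

# Print "script-name version" lines for combination 1 or 2.
combo_versions() {
  case $1 in
    1) driver='565.57.01';  cuda='12.6.3_560.35.05';  fm='565_565.57.01-1' ;;
    2) driver='570.124.06'; cuda='12.8.1_570.124.06'; fm='570_570.124.06-1' ;;
    *) echo "unknown combination: $1" >&2; return 1 ;;
  esac
  printf 'ib-drive %s\n' '23.10-1.1.9.0'
  printf 'nvidia-driver %s\n' "$driver"
  printf 'cuda %s\n' "$cuda"
  printf 'nvidia-fabricmanager %s\n' "$fm"
}

# Print the per-component commands for an action ($1: install or uninstall)
# and a combination ($2: 1 or 2), in the order used in this document.
print_combo_commands() {
  action=$1
  versions=$(combo_versions "$2") || return 1
  printf '%s\n' "$versions" | while read -r name version; do
    printf "cd /opt/ && wget -qO- http://116.205.97.109/scripts/%s.sh | bash -s -- --%s --version '%s'\n" \
      "$name" "$action" "$version"
  done
}

# Usage:
# print_combo_commands install 2
```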

![One-step install/uninstall script](https://img.shields.io/badge/一键安装/卸载脚本-brightgreen)

```bash
# One-step install/uninstall. Both take a long time, so running them in the background is recommended.

# Combination [1] ----------------------------------------------------
screen -dmS install_script bash -c "cd /opt/ && wget -qO- http://116.205.97.109/scripts/gpu-manager.sh | bash -s -- --install --version 1 --include=exporter > /opt/gpu-manager.log 2>&1"
tail -f /opt/gpu-manager.log

screen -dmS uninstall_script bash -c "cd /opt/ && wget -qO- http://116.205.97.109/scripts/gpu-manager.sh | bash -s -- --uninstall --version 1 --include=exporter > /opt/gpu-manager.log 2>&1"
tail -f /opt/gpu-manager.log

# Combination [2] ----------------------------------------------------
screen -dmS install_script bash -c "cd /opt/ && wget -qO- http://116.205.97.109/scripts/gpu-manager.sh | bash -s -- --install --version 2 --include=exporter > /opt/gpu-manager.log 2>&1"
tail -f /opt/gpu-manager.log

screen -dmS uninstall_script bash -c "cd /opt/ && wget -qO- http://116.205.97.109/scripts/gpu-manager.sh | bash -s -- --uninstall --version 2 --include=exporter > /opt/gpu-manager.log 2>&1"
tail -f /opt/gpu-manager.log

# Notes:
# --version 1 installs/uninstalls the component versions of combination [1] in section (7): mlnx_ofed 23.10-1.1.9.0 + nvidia driver 565.57.01 + cuda 12.6.3_560.35.05 + fabricmanager 565_565.57.01-1
# --version 2 installs/uninstalls the component versions of combination [2] in section (7): mlnx_ofed 23.10-1.1.9.0 + nvidia driver 570.124.06 + cuda 12.8.1_570.124.06 + fabricmanager 570_570.124.06-1
# --include=exporter: when given, the script also installs/uninstalls the exporter services [dcgm-exporter, node-exporter, nvidia-dcgm]; by default they are skipped.
```
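When a long-running job is logged to a file and followed with `tail -f`, the order of the redirections decides whether error messages reach the log. The sketch below demonstrates the difference with a stand-in command, `emit_both`, which is hypothetical and simply writes one line to stdout and one to stderr.

```bash
# Sketch only: emit_both stands in for the real installer.
emit_both() {
  echo "to stdout"
  echo "to stderr" >&2
}

log_dir=$(mktemp -d)

# stdout is sent to the file first, then stderr is pointed at the same place:
# both lines end up in the log.
emit_both > "$log_dir/both.log" 2>&1

# stderr is pointed at the current stdout first (here /dev/null, standing in for
# the terminal), and only afterwards is stdout moved to the file:
# the stderr line never reaches the log.
{ emit_both 2>&1 > "$log_dir/stdout_only.log"; } > /dev/null

cat "$log_dir/both.log"          # contains both lines
cat "$log_dir/stdout_only.log"   # contains only the stdout line
```

To capture errors from the installer in `/opt/gpu-manager.log`, the file redirection needs to come before `2>&1`.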

**Important notes**:

```bash
# GPU: B200 series
# (1) On B200-series systems, nvlsm must be installed together with fabricmanager; otherwise fabricmanager cannot start.
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/nvlsm_2025.03.1-1_amd64.deb
dpkg -i nvlsm_2025.03.1-1_amd64.deb

# Supermicro servers
# (1) Add GRUB_CMDLINE_LINUX="quiet splash nokaslr" to /etc/default/grub; otherwise CUDA initialization fails.
# (2) Add GRUB_CMDLINE_LINUX_DEFAULT="intel_iommu=off" to /etc/default/grub so that the IOMMU is properly disabled.

# Clonezilla image restore
# (1) ASRock B200: Clonezilla image on NFS: 10.102.35.99:/nfs/clone.iso; backup path: /nfs/2025-05-26-09-B200-960g-img. ASRock machines have no requirement on the Clonezilla boot image version.
# (2) Supermicro B200: Clonezilla image: 10.102.35.99:/nfs/clone.iso; backup path: /nfs/chaowei-B200-1.7T-img. Supermicro machines are sensitive to the Clonezilla boot image version; the latest version fails to boot.
# (3) Gigabyte A100: Clonezilla image: 10.101.0.86:/nfs/; backup path: /nfs/
```
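The GRUB changes for Supermicro servers can be applied by hand or scripted. The sketch below sets a variable in a grub defaults file, replacing the line if the variable is already assigned and appending it otherwise. `set_grub_var` is a hypothetical helper; it overwrites the existing value rather than merging parameters, and it assumes GNU `sed` and values without `|` or `&` characters.

```bash
# Sketch only: set_grub_var is a hypothetical helper; try it on a copy of the file first.

# Set KEY="VALUE" in the file given as $1 ($2 = key, $3 = value).
set_grub_var() {
  file=$1 key=$2 value=$3
  if grep -q "^${key}=" "$file"; then
    # The key is already assigned: rewrite that line in place.
    sed -i "s|^${key}=.*|${key}=\"${value}\"|" "$file"
  else
    # The key is missing: append a new assignment.
    printf '%s="%s"\n' "$key" "$value" >> "$file"
  fi
}

# Usage on Ubuntu (run as root, then regenerate the GRUB config and reboot):
# cp /etc/default/grub /etc/default/grub.bak
# set_grub_var /etc/default/grub GRUB_CMDLINE_LINUX "quiet splash nokaslr"
# set_grub_var /etc/default/grub GRUB_CMDLINE_LINUX_DEFAULT "intel_iommu=off"
# update-grub
```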