GPU Environment Standardized Deployment Scripts: Usage Guide

### 1. Script Overview

These scripts simplify the installation of GPU-related software and are intended for situations where a GPU environment must be deployed quickly.

- **Core functionality**: the scripts install and uninstall the following core components in batch: the network adapter driver (MLNX_OFED), the NVIDIA GPU driver, the fabricmanager GPU interconnect manager, the CUDA toolkit, Nvidia-dcgm, DCGM-Exporter, and Node-Exporter.
- **Configuration notes** (manual steps):
  - User management: to delete the `ubuntu` user, run the user-deletion commands manually and take care of the data and permissions associated with that user.
  - Disk management: to extend a disk partition, use a disk management tool to adjust and grow the partitions according to the storage your applications need.
  - Network configuration: to rename a network interface, edit the network configuration files manually, choose names that fit your network environment, and confirm that connectivity still works.
- **Recommendations**: on a fresh system, use the one-click automatic install script, which deploys the full GPU stack quickly; see the end of this document for usage. If related software is already installed, or if components need to be deployed independently or customized, use the individual deployment scripts.

### 2. Usage

#### (1) System initialization

```bash
cd /opt/ && wget -qO- http://10.101.0.51:3000/yindun/ansible-devops/raw/branch/main/scripts/system_optimize.sh | bash
```

#### (2) MLNX_OFED network suite install/uninstall

```bash
# Supported versions: [23.10-1.1.9.0]
cd /opt/ && wget -qO- http://10.101.0.51:3000/yindun/ansible-devops/raw/branch/main/scripts/ib-drive.sh | bash -s -- --install --version '23.10-1.1.9.0'
cd /opt/ && wget -qO- http://10.101.0.51:3000/yindun/ansible-devops/raw/branch/main/scripts/ib-drive.sh | bash -s -- --uninstall --version '23.10-1.1.9.0'
```

#### (3) Nvidia GPU driver install/uninstall

```bash
# Supported versions: [565.57.01] [570.124.06]
cd /opt/ && wget -qO- http://10.101.0.51:3000/yindun/ansible-devops/raw/branch/main/scripts/nvidia-driver.sh | bash -s -- --install --version '565.57.01'
cd /opt/ && wget -qO- http://10.101.0.51:3000/yindun/ansible-devops/raw/branch/main/scripts/nvidia-driver.sh | bash -s -- --uninstall --version '565.57.01'
```

#### (4) GPU interconnect manager (fabricmanager) install/uninstall

```bash
# Supported versions: [565_565.57.01-1] [570_570.124.06-1]
cd /opt/ && wget -qO- http://10.101.0.51:3000/yindun/ansible-devops/raw/branch/main/scripts/nvidia-fabricmanager.sh | bash -s -- --install --version '565_565.57.01-1'
cd /opt/ && wget -qO- http://10.101.0.51:3000/yindun/ansible-devops/raw/branch/main/scripts/nvidia-fabricmanager.sh | bash -s -- --uninstall --version '565_565.57.01-1'
```

#### (5) NVIDIA CUDA toolkit install/uninstall

```bash
# Supported versions: [12.6.3_560.35.05] [12.8.1_570.124.06]
cd /opt/ && wget -qO- http://10.101.0.51:3000/yindun/ansible-devops/raw/branch/main/scripts/cuda.sh | bash -s -- --install --version '12.6.3_560.35.05'
cd /opt/ && wget -qO- http://10.101.0.51:3000/yindun/ansible-devops/raw/branch/main/scripts/cuda.sh | bash -s -- --uninstall --version '12.6.3_560.35.05'
```

#### (6) dcgm/node exporter install/uninstall

```bash
# Install
cd /opt/ && wget -qO- http://10.101.0.51:3000/yindun/ansible-devops/raw/branch/main/scripts/nvidia-dcgm.sh | bash -s -- --install
cd /opt/ && wget -qO- http://10.101.0.51:3000/yindun/ansible-devops/raw/branch/main/scripts/dcgm-exporter.sh | bash -s -- --install
cd /opt/ && wget -qO- http://10.101.0.51:3000/yindun/ansible-devops/raw/branch/main/scripts/node-exporter.sh | bash -s -- --install

# Uninstall
cd /opt/ && wget -qO- http://10.101.0.51:3000/yindun/ansible-devops/raw/branch/main/scripts/nvidia-dcgm.sh | bash -s -- --uninstall
cd /opt/ && wget -qO- http://10.101.0.51:3000/yindun/ansible-devops/raw/branch/main/scripts/dcgm-exporter.sh | bash -s -- --uninstall
cd /opt/ && wget -qO- http://10.101.0.51:3000/yindun/ansible-devops/raw/branch/main/scripts/node-exporter.sh | bash -s -- --uninstall
```

#### (7) Batch component install/uninstall

![Component set [1]](https://img.shields.io/badge/组件[1]-orange?style=flat-square) ![Static Badge](https://img.shields.io/badge/mlnx_ofed-23.10.1.1.9.0-brightgreen?style=plastic) ![Static Badge](https://img.shields.io/badge/nvidia_drive-565.57.01-brightgreen?style=plastic) ![Static Badge](https://img.shields.io/badge/cuda-12.6.3.560.35.05-brightgreen?style=plastic) ![Static Badge](https://img.shields.io/badge/fabricmanager-565_565.57.01.1-brightgreen?style=plastic)

```bash
# Install
cd /opt/ && wget -qO- http://10.101.0.51:3000/yindun/ansible-devops/raw/branch/main/scripts/system_optimize.sh | bash
cd /opt/ && wget -qO- http://10.101.0.51:3000/yindun/ansible-devops/raw/branch/main/scripts/ib-drive.sh | bash -s -- --install --version '23.10-1.1.9.0'
cd /opt/ && wget -qO- http://10.101.0.51:3000/yindun/ansible-devops/raw/branch/main/scripts/nvidia-driver.sh | bash -s -- --install --version '565.57.01'
cd /opt/ && wget -qO- http://10.101.0.51:3000/yindun/ansible-devops/raw/branch/main/scripts/cuda.sh | bash -s -- --install --version '12.6.3_560.35.05'
cd /opt/ && wget -qO- http://10.101.0.51:3000/yindun/ansible-devops/raw/branch/main/scripts/nvidia-fabricmanager.sh | bash -s -- --install --version '565_565.57.01-1'

# Uninstall
cd /opt/ && wget -qO- http://10.101.0.51:3000/yindun/ansible-devops/raw/branch/main/scripts/ib-drive.sh | bash -s -- --uninstall --version '23.10-1.1.9.0'
cd /opt/ && wget -qO- http://10.101.0.51:3000/yindun/ansible-devops/raw/branch/main/scripts/nvidia-driver.sh | bash -s -- --uninstall --version '565.57.01'
cd /opt/ && wget -qO- http://10.101.0.51:3000/yindun/ansible-devops/raw/branch/main/scripts/cuda.sh | bash -s -- --uninstall --version '12.6.3_560.35.05'
cd /opt/ && wget -qO- http://10.101.0.51:3000/yindun/ansible-devops/raw/branch/main/scripts/nvidia-fabricmanager.sh | bash -s -- --uninstall --version '565_565.57.01-1'
```

![Component set [2]](https://img.shields.io/badge/组件[2]-orange?style=flat-square) ![Static Badge](https://img.shields.io/badge/mlnx_ofed-23.10.1.1.9.0-brightgreen?style=plastic) ![Static Badge](https://img.shields.io/badge/nvidia_drive-570.124.06-brightgreen?style=plastic) ![Static Badge](https://img.shields.io/badge/cuda-12.8.1.570.124.06-brightgreen?style=plastic) ![Static Badge](https://img.shields.io/badge/fabricmanager-570.124.06.1-brightgreen?style=plastic)

```bash
# Install
cd /opt/ && wget -qO- http://10.101.0.51:3000/yindun/ansible-devops/raw/branch/main/scripts/system_optimize.sh | bash
cd /opt/ && wget -qO- http://10.101.0.51:3000/yindun/ansible-devops/raw/branch/main/scripts/ib-drive.sh | bash -s -- --install --version '23.10-1.1.9.0'
cd /opt/ && wget -qO- http://10.101.0.51:3000/yindun/ansible-devops/raw/branch/main/scripts/nvidia-driver.sh | bash -s -- --install --version '570.124.06'
cd /opt/ && wget -qO- http://10.101.0.51:3000/yindun/ansible-devops/raw/branch/main/scripts/cuda.sh | bash -s -- --install --version '12.8.1_570.124.06'
cd /opt/ && wget -qO- http://10.101.0.51:3000/yindun/ansible-devops/raw/branch/main/scripts/nvidia-fabricmanager.sh | bash -s -- --install --version '570_570.124.06-1'

# Uninstall
cd /opt/ && wget -qO- http://10.101.0.51:3000/yindun/ansible-devops/raw/branch/main/scripts/ib-drive.sh | bash -s -- --uninstall --version '23.10-1.1.9.0'
cd /opt/ && wget -qO- http://10.101.0.51:3000/yindun/ansible-devops/raw/branch/main/scripts/nvidia-driver.sh | bash -s -- --uninstall --version '570.124.06'
cd /opt/ && wget -qO- http://10.101.0.51:3000/yindun/ansible-devops/raw/branch/main/scripts/cuda.sh | bash -s -- --uninstall --version '12.8.1_570.124.06'
cd /opt/ && wget -qO- http://10.101.0.51:3000/yindun/ansible-devops/raw/branch/main/scripts/nvidia-fabricmanager.sh | bash -s -- --uninstall --version '570_570.124.06-1'
```

![Recommended one-click install script](https://img.shields.io/badge/推荐一键安装脚本-orange?style=flat-square)

```bash
# Install/uninstall services (this takes a while, so running it in the background is recommended):

# Set [1]
screen -dmS install_script bash -c "cd /opt/ && wget -qO- http://10.101.0.51:3000/yindun/ansible-devops/raw/branch/main/scripts/gpu-manager.sh | bash -s -- --install --version 1 --include=exporter > /opt/gpu-manager.log 2>&1"; tail -f /opt/gpu-manager.log
screen -dmS uninstall_script bash -c "cd /opt/ && wget -qO- http://10.101.0.51:3000/yindun/ansible-devops/raw/branch/main/scripts/gpu-manager.sh | bash -s -- --uninstall --version 1 --include=exporter > /opt/gpu-manager.log 2>&1"; tail -f /opt/gpu-manager.log

# Set [2]
screen -dmS install_script bash -c "cd /opt/ && wget -qO- http://10.101.0.51:3000/yindun/ansible-devops/raw/branch/main/scripts/gpu-manager.sh | bash -s -- --install --version 2 --include=exporter > /opt/gpu-manager.log 2>&1"; tail -f /opt/gpu-manager.log
screen -dmS uninstall_script bash -c "cd /opt/ && wget -qO- http://10.101.0.51:3000/yindun/ansible-devops/raw/branch/main/scripts/gpu-manager.sh | bash -s -- --uninstall --version 2 --include=exporter > /opt/gpu-manager.log 2>&1"; tail -f /opt/gpu-manager.log

# Notes:
# --version 1 installs/uninstalls component set [1] from section (7): mlnx_ofed-23.10.1.1.9.0 + nvidia_drive-565.57.01 + cuda-12.6.3.560.35.05 + fabricmanager-565_565.57.01.1
# --version 2 installs/uninstalls component set [2] from section (7): mlnx_ofed-23.10.1.1.9.0 + nvidia_drive-570.124.06 + cuda-12.8.1.570.124.06 + fabricmanager-570.124.06.1
# --include=exporter also installs/uninstalls the exporter services [dcgm-exporter, node-exporter, nvidia-dcgm]; without this flag they are left untouched.
```
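The one-click commands send the script's output to `/opt/gpu-manager.log`, and in a shell the order of the redirections decides whether error messages reach that file. The sketch below illustrates this with a hypothetical `emit` function standing in for the installer and a scratch directory from `mktemp`; neither is part of the deployment scripts. `> file 2>&1` logs both streams, while `2>&1 > file` logs only stdout.

```bash
#!/usr/bin/env bash
# Hypothetical stand-in for an installer that writes to both streams.
emit() {
  echo "progress message"      # goes to stdout
  echo "error message" >&2     # goes to stderr
}

workdir="$(mktemp -d)"         # scratch directory for the demo logs

# stdout is moved to the file first, then stderr is pointed at the same place:
# both lines end up in the log.
emit > "$workdir/both.log" 2>&1

# stderr is pointed at the current stdout (here, the command substitution)
# before stdout is moved to the file, so the error line never reaches the log.
leaked="$(emit 2>&1 > "$workdir/stdout_only.log")"

printf 'both.log:\n%s\n' "$(cat "$workdir/both.log")"
printf 'stdout_only.log:\n%s\n' "$(cat "$workdir/stdout_only.log")"
printf 'not logged: %s\n' "$leaked"
```

When following a long install with `tail -f`, the first form is the one that makes failures visible in the log.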