<h2 align="center">GPU Environment Standardized Deployment Scripts: Usage Guide</h2>
<p align="center">
<img src="https://img.shields.io/github/languages/code-size/nanchengcyu/TechMindWave-frontend" alt="code size"/>
<img src="https://img.shields.io/badge/ofed-17.0.2-blue" alt="ofed"/>
<img src="https://img.shields.io/badge/NVIDIA-565.57.01-brightgreen" alt="NVIDIA"/>
<img src="https://img.shields.io/badge/fabricmanager-565.57.01-blue" alt="fabricmanager"/>
<img src="https://img.shields.io/badge/CUDA-12.6.3-brightgreen" alt="CUDA"/>
<br>
<img src="https://img.shields.io/badge/Author-王云龙-orange" alt="Author" />
</p>
<hr>
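### I. Script Overview
These scripts simplify the installation of GPU-related software and are intended for scenarios where a GPU environment must be deployed quickly.
- **Core features**
```text
The scripts install and uninstall the following core components in batch: NIC driver, GPU driver, fabricmanager (GPU interconnect manager), CUDA toolkit, Nvidia-dcgm, DCGM-Exporter, and Node-Exporter.
```
- **Configuration notes**
```text
User management: to delete the ubuntu user, run the user-deletion commands manually and take care of the data and permissions associated with that user.
Disk management: partition expansion must be done with a disk management tool; adjust and extend partitions according to the storage the applications need.
Network configuration: renaming NICs requires editing the network configuration files by hand; rename the interfaces to suit the actual network environment and confirm that connectivity is preserved.
```
- **Recommendations**
```text
On a fresh system, use the one-click automatic installation script, which deploys the GPU-related software quickly and completely; see the end of this document for usage. If related components are already installed, or components need to be deployed independently or customized, use the individual deployment scripts instead.
```
### II. Usage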
#### (1) System initialization
```bash
cd /opt/ && wget -qO- http://116.205.97.109/scripts/system_optimize.sh | bash

# Disk expansion (already integrated into the initialization script; no need to run it again)
#lvresize --extents +100%FREE --resizefs /dev/mapper/ubuntu--vg-ubuntu--lv

# Set the hostname (already integrated into the initialization script; no need to run it again)
#IP=$(ip addr | awk '/^[0-9]+: / {}; /inet.*global/ {print gensub(/(.*)\/(.*)/, "\\1", "g", $2)}' | grep `ip route | grep default | awk '{print $3}' | awk -F. '{print $1"."$2}' | head -1` | head -1 | sed 's/\./-/g')
#hostnamectl set-hostname $IP
#bash

# Kernel pinning (already integrated into the initialization script; no need to run it again)
#apt-mark hold $(dpkg -l | grep -E "linux-(headers|image|unsigned|modules|modules-extra)" | grep "6.8.0-53" | awk '{print $2}')
#dpkg --get-selections | grep hold   # list held packages
```
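The commented hostname step above derives the hostname from the host's primary IPv4 address by replacing the dots with dashes. A minimal sketch of just that conversion follows; the function name `ip_to_hostname` is hypothetical and is not part of `system_optimize.sh`.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of system_optimize.sh): convert an IPv4
# address into the dash-separated hostname form, e.g. 172.51.4.1 -> 172-51-4-1.
ip_to_hostname() {
    local ip="$1"
    # Accept only four dot-separated groups of 1-3 digits.
    if [[ ! "$ip" =~ ^[0-9]{1,3}(\.[0-9]{1,3}){3}$ ]]; then
        echo "invalid IPv4 address: $ip" >&2
        return 1
    fi
    # Replace every dot with a dash.
    echo "${ip//./-}"
}
```

It could be used as `hostnamectl set-hostname "$(ip_to_hostname 172.51.4.1)"` once the address has been determined.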
#### (2) MLNX_OFED network stack install/uninstall
```bash
# Supported version: [23.10-1.1.9.0]
cd /opt/ && wget -qO- http://116.205.97.109/scripts/ib-drive.sh | bash -s -- --install --version "24.10-2.1.8.0" --distro "ubuntu24.04"
```
#### (3) IB NIC ordering
```bash
# Supported version: [23.10-1.1.9.0]
cd /opt/ && wget -qO- http://116.205.97.109/scripts/ib.sh | bash -s -- --install
```
#### (4) NVIDIA GPU driver install/uninstall
```bash
# Supported versions: [565.57.01] [570.124.06]
cd /opt/ && wget -qO- http://116.205.97.109/scripts/nvidia-driver.sh | bash -s -- --install --version '565.57.01'
cd /opt/ && wget -qO- http://116.205.97.109/scripts/nvidia-driver.sh | bash -s -- --install --version '570.124.06'
```
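After the driver script finishes, it can be worth confirming that the loaded driver is the version that was requested. The sketch below is not part of `nvidia-driver.sh`; the function name and the `NVIDIA_SMI` override variable are hypothetical, while `--query-gpu=driver_version` and `--format=csv,noheader` are standard `nvidia-smi` options.

```bash
#!/usr/bin/env bash
# Hypothetical check: compare the driver version reported by nvidia-smi
# with the version that was passed to --version.
# NVIDIA_SMI can point at another command (useful for testing).
check_driver_version() {
    local expected="$1" actual
    local smi="${NVIDIA_SMI:-nvidia-smi}"
    # nvidia-smi prints one line per GPU; collapse identical lines.
    actual="$("$smi" --query-gpu=driver_version --format=csv,noheader | sort -u)"
    if [[ "$actual" == "$expected" ]]; then
        echo "OK: driver $actual"
    else
        echo "MISMATCH: expected $expected, got ${actual:-none}" >&2
        return 1
    fi
}
```

For example, `check_driver_version 565.57.01` prints an `OK` line when every GPU reports that version and returns a non-zero status otherwise.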
#### (5) GPU interconnect manager (fabricmanager) install/uninstall
```bash
# Supported versions: [565_565.57.01-1] [570_570.124.06-1]
cd /opt/ && wget -qO- http://116.205.97.109/scripts/nvidia-fabricmanager.sh | bash -s -- --install --distro ubuntu22.04 --version 565_565.57.01-1
cd /opt/ && wget -qO- http://116.205.97.109/scripts/nvidia-fabricmanager.sh | bash -s -- --install --distro ubuntu24.04 --version 570_570.124.06-1
```
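The two supported fabricmanager versions mirror the two supported driver versions: each is the driver branch, an underscore, the full driver version, and a package revision. A small sketch of that mapping, assuming the revision is always `-1` as in the versions listed above (the function name is hypothetical):

```bash
#!/usr/bin/env bash
# Hypothetical helper: derive the fabricmanager --version string from a
# driver version, e.g. 565.57.01 -> 565_565.57.01-1.
fm_version_for_driver() {
    local driver="$1"
    local branch="${driver%%.*}"    # driver branch, e.g. 565
    # The "-1" package revision is an assumption taken from the listed versions.
    echo "${branch}_${driver}-1"
}
```

This keeps the driver and fabricmanager versions in step, for example `--version "$(fm_version_for_driver 570.124.06)"`.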
#### (6) NVIDIA CUDA toolkit install/uninstall
```bash
# Supported versions: [12.6.3_560.35.05] [12.8.1_570.124.06]
cd /opt/ && wget -qO- http://116.205.97.109/scripts/cuda.sh | bash -s -- --install --version '12.6.3_560.35.05'
cd /opt/ && wget -qO- http://116.205.97.109/scripts/cuda.sh | bash -s -- --install --version '12.8.1_570.124.06'
```
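The CUDA version string combines the toolkit version with the driver version bundled in the runfile (`<cuda>_<driver>`). The sketch below splits the string and builds a runfile URL following the pattern of the NVIDIA download link in the Ubuntu 24.04 notes of this README. Whether `cuda.sh` downloads from this URL is an assumption, and the function name is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: build the NVIDIA runfile URL for a version string
# of the form <cuda>_<driver>, e.g. 12.8.1_570.124.06.
cuda_runfile_url() {
    local version="$1"
    local cuda="${version%%_*}"     # toolkit version, e.g. 12.8.1
    local driver="${version#*_}"    # bundled driver version, e.g. 570.124.06
    if [[ "$version" != *_* || -z "$cuda" || -z "$driver" ]]; then
        echo "expected <cuda>_<driver>, got: $version" >&2
        return 1
    fi
    echo "https://developer.download.nvidia.com/compute/cuda/${cuda}/local_installers/cuda_${version}_linux.run"
}
```

For instance, `wget "$(cuda_runfile_url 12.8.1_570.124.06)"` would fetch the runfile for manual inspection or offline installation.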
#### (7) DCGM and node exporter install/uninstall
```bash
# Install
cd /opt/ && wget -qO- http://116.205.97.109/scripts/nvidia-dcgm.sh | bash -s -- --install
cd /opt/ && wget -qO- http://116.205.97.109/scripts/dcgm-exporter.sh | bash -s -- --install
cd /opt/ && wget -qO- http://116.205.97.109/scripts/node-exporter.sh | bash -s -- --install
cd /opt/ && wget -qO- http://116.205.97.109/scripts/deploy_gpu_monitor.sh | bash -s -- --install   # custom extension for dcgm-exporter; to be integrated into dcgm later

# Uninstall
cd /opt/ && wget -qO- http://116.205.97.109/scripts/nvidia-dcgm.sh | bash -s -- --uninstall
cd /opt/ && wget -qO- http://116.205.97.109/scripts/dcgm-exporter.sh | bash -s -- --uninstall
cd /opt/ && wget -qO- http://116.205.97.109/scripts/node-exporter.sh | bash -s -- --uninstall
cd /opt/ && wget -qO- http://116.205.97.109/scripts/deploy_gpu_monitor.sh | bash -s -- --uninstall
```
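Once the exporters are installed, a quick way to confirm they are serving data is to request their metrics endpoints and look for Prometheus `# HELP` lines. The sketch below is not part of the scripts above. The function name and the `CURL` override are hypothetical, and the ports in the usage note are the upstream defaults (9400 for dcgm-exporter, 9100 for node-exporter), which these scripts may or may not keep.

```bash
#!/usr/bin/env bash
# Hypothetical health check: succeed if the URL returns Prometheus-style
# metrics text. CURL can point at another command (useful for testing).
check_metrics() {
    local name="$1" url="$2"
    local curl_cmd="${CURL:-curl}"
    # -f: fail on HTTP errors, -sS: quiet but show errors, 5 s timeout.
    if "$curl_cmd" -fsS --max-time 5 "$url" | grep -q '^# HELP'; then
        echo "$name: OK"
    else
        echo "$name: FAILED ($url)" >&2
        return 1
    fi
}
```

Typical calls would be `check_metrics dcgm-exporter http://localhost:9400/metrics` and `check_metrics node-exporter http://localhost:9100/metrics`, adjusted to whatever ports the deployment actually uses.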
#### (8) Docker install/uninstall
```bash
# Supported version: [5:28.4.0-1~ubuntu.24.04~noble]
cd /opt/ && wget -qO- http://116.205.97.109/scripts/docker.sh | bash -s -- --install --version '5:28.4.0-1~ubuntu.24.04~noble'
cd /opt/ && wget -qO- http://116.205.97.109/scripts/docker.sh | bash -s -- --uninstall --version '5:28.4.0-1~ubuntu.24.04~noble'
```
#### (9) nvidia-container-toolkit install/uninstall
```bash
# Supported versions: [1.17.6-1, 1.17.7-1, 1.17.8-1, ...]
cd /opt/ && wget -qO- http://116.205.97.109/scripts/nvidia-container-toolkit.sh | bash -s -- --install --version '1.17.6-1'
cd /opt/ && wget -qO- http://116.205.97.109/scripts/nvidia-container-toolkit.sh | bash -s -- --uninstall --version '1.17.6-1'
# Check the installed version: nvidia-container-runtime --version | head -1
```
#### (10) NFS install/uninstall
```bash
cd /opt/ && wget -qO- http://116.205.97.109/scripts/nfs.sh | bash -s -- --install --share-dirs=/opt/data
```
#### (11) Jenkins (Docker version; to be optimized later)
```bash
cd /opt/ && wget -qO- http://116.205.97.109/scripts/jenkins-install.sh | bash
```
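#### (10) Clonezilla source machine enhanced configuration
```bash
# Before cloning a system with Clonezilla, apply this enhanced configuration to the source machine so that
# restored systems work out of the box: no manual changes to the hostname or in-band IP, and no missing
# udev rules files to repair.
cd /opt/ && wget -qO- http://116.205.97.109/scripts/clonezilla_config.sh | bash

# Note: configure the IP mapping.
# After the script finishes, fill in /opt/ip.txt on the source machine with the in-band IP to out-of-band IP
# mapping of the hosts to be restored, in the format below. After cloning to a target machine, the system
# matches its entry and applies the configuration automatically. For example:
cat /opt/ip.txt
# Column 1: in-band IP; column 2: subnet prefix length; column 3: in-band gateway; column 4: out-of-band IP
172.51.4.1 26 172.51.4.126 172.51.2.50
172.51.4.2 26 172.51.4.126 172.51.2.50
.......
```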
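Because every restored host depends on its row in `/opt/ip.txt`, it may help to validate the file before cloning. The sketch below checks the four-column format described above (in-band IP, prefix length, in-band gateway, out-of-band IP). The function name is hypothetical, and how `clonezilla_config.sh` itself parses the file is not known here.

```bash
#!/usr/bin/env bash
# Hypothetical validator for the /opt/ip.txt mapping file.
# Prints every invalid line and returns non-zero if any was found.
validate_ip_map() {
    local file="$1"
    awk '
        # True if s is a dotted quad with every octet in 0..255.
        function is_ip(s,    n, p, i) {
            n = split(s, p, ".")
            if (n != 4) return 0
            for (i = 1; i <= 4; i++)
                if (p[i] !~ /^[0-9]+$/ || p[i] + 0 > 255) return 0
            return 1
        }
        /^[ \t]*(#|$)/ { next }     # skip comments and blank lines
        {
            if (NF != 4 || !is_ip($1) || $2 !~ /^[0-9]+$/ || $2 + 0 > 32 || !is_ip($3) || !is_ip($4)) {
                printf "line %d invalid: %s\n", NR, $0
                bad = 1
            }
        }
        END { exit bad + 0 }
    ' "$file"
}
```

Running `validate_ip_map /opt/ip.txt` before starting the clone reports malformed rows, such as a missing column or an out-of-range octet.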
#### (11) Kubernetes cluster deployment
```bash
# Set up passwordless SSH.
# Note: list the node IPs in ip.txt.
cat > /opt/ip.txt << EOF
192.168.61.131
192.168.61.132
192.168.61.133
192.168.61.134
EOF
cd /opt/ && wget -qO- http://116.205.97.109/scripts/auto_ssh_auth_setup.sh | bash -s -- --file=/opt/ip.txt --user=root --passwd=xxxx

cd /opt/ && wget -qO- http://116.205.97.109/scripts/k8s-ubuntu-init.sh | bash   # system initialization
cd /opt/ && wget -qO- http://116.205.97.109/scripts/containerd.sh | bash -s -- --install --version '1.7.28-1'   # install containerd (run on all nodes)
cd /opt/ && wget -qO- http://116.205.97.109/scripts/k8s-base-setup.sh | bash -s -- --install --version '1.30.5'   # Kubernetes base components (run on all nodes)
cd /opt/ && wget -qO- http://116.205.97.109/scripts/haproxy.sh | bash -s -- --install --backend 192.168.61.131:6443,192.168.61.132:6443,192.168.61.133:6443 --port 36443

# Master nodes (3, 5, 7, ... of them)
cd /opt/ && wget -qO- http://116.205.97.109/scripts/keepalived.sh | bash -s -- --install --vip 192.168.61.200/24 --priority 150   # run on the primary node
cd /opt/ && wget -qO- http://116.205.97.109/scripts/keepalived.sh | bash -s -- --install --vip 192.168.61.200/24 --priority 140   # run on a backup node
cd /opt/ && wget -qO- http://116.205.97.109/scripts/keepalived.sh | bash -s -- --install --vip 192.168.61.200/24 --priority 130   # run on a backup node

# Generate and distribute the kubeadm configuration files
cd /opt/ && wget -qO- http://116.205.97.109/scripts/k8s-config-deploy.sh | bash -s -- --local-ip=192.168.61.131 --hostname=master-01 --k8s-version=1.30.5 --cluster-vip=192.168.61.200 --cluster-port=36443 --master1-ip=192.168.61.131 --master2-ip=192.168.61.132 --master3-ip=192.168.61.133   # master01

cd /opt/ && wget -qO- http://116.205.97.109/scripts/k8s-config-deploy.sh | bash -s -- --local-ip=192.168.61.132 --hostname=master-02 --k8s-version=1.30.5 --cluster-vip=192.168.61.200 --cluster-port=36443 --master1-ip=192.168.61.131 --master2-ip=192.168.61.132 --master3-ip=192.168.61.133   # master02

cd /opt/ && wget -qO- http://116.205.97.109/scripts/k8s-config-deploy.sh | bash -s -- --local-ip=192.168.61.133 --hostname=master-03 --k8s-version=1.30.5 --cluster-vip=192.168.61.200 --cluster-port=36443 --master1-ip=192.168.61.131 --master2-ip=192.168.61.132 --master3-ip=192.168.61.133   # master03

# Initialize the cluster (take a VM snapshot at this point to make debugging easier)
# Run the cluster initialization from the Kubernetes master
cd /opt/ && wget -qO- http://116.205.97.109/scripts/k8s-cluster-deploy.sh | bash -s -- --master-ips=192.168.61.132,192.168.61.133 --node-ips=192.168.61.134
# The script currently has a bug; run the initialization manually instead: kubeadm init --config kubeadm-init.yaml --upload-certs

# Common tools: install helm
curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash

# Install the NFS storage class
# Prerequisite: every node must have the NFS client installed: apt install -y nfs-common
# If there are no worker nodes, remove the master taint so that pods can be scheduled on it: kubectl taint nodes master-03 node-role.kubernetes.io/control-plane:NoSchedule-
cd /opt/ && wget -qO- http://116.205.97.109/scripts/nfs.sh | bash -s -- --install --share-dirs=/opt/data   # install NFS
cd /opt/ && wget -qO- http://116.205.97.109/scripts/install-nfs-storageclass-pro.sh | bash -s -- --nfs-server 192.168.61.131 --share-dirs /opt/data   # point the storage class at the NFS server and export

# Install Metrics Server
cd /opt/k8s-install-conf/ && wget http://116.205.97.109/scripts/metrics-server.yaml && kubectl apply -f /opt/k8s-install-conf/metrics-server.yaml
# Test: kubectl top node

# Install ingress-nginx
cd /opt/k8s-install-conf/ && wget http://116.205.97.109/scripts/ingress.yaml && kubectl apply -f /opt/k8s-install-conf/ingress.yaml

# Install the network plugin
#wget -q -c -O /opt/k8s-install-conf/calico.yaml http://116.205.97.109/scripts/calico.yaml --show-progress && kubectl apply -f /opt/k8s-install-conf/calico.yaml

# Install the Kuboard UI as a static pod
# On a Kubernetes master node, run the following commands and follow the prompts to complete the Kuboard installation.
# Default username/password: admin/Kuboard123
cd /opt/k8s-install-conf && curl -fsSL https://addons.kuboard.cn/kuboard/kuboard-static-pod.sh -o kuboard.sh
sed -i 's#eipwork/kuboard:v3#swr.cn-east-2.myhuaweicloud.com/kuboard/kuboard:v3#g' kuboard.sh
sh kuboard.sh

# Argo CD deployment (inside the Kubernetes cluster, dev environment)
kubectl create namespace argocd && cd /opt/k8s-install-conf && wget https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml
kubectl apply -n argocd -f install.yaml
# Get the initial admin password: kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d

# Jenkins is deployed outside the Kubernetes cluster (with Docker)
cd /opt/ && wget -qO- http://116.205.97.109/scripts/jenkins-install.sh | bash

# GitLab
cd /opt/ && wget -qO- http://116.205.97.109/scripts/gitlab.sh | bash -s -- --install     # install
cd /opt/ && wget -qO- http://116.205.97.109/scripts/gitlab.sh | bash -s -- --uninstall   # uninstall

# ELK
# Prometheus
# Ceph
# MySQL
# Kafka

# Redis (Sentinel cluster mode)
cat > /opt/ip.txt << EOF
192.168.61.131
192.168.61.132
192.168.61.133
EOF
cd /opt/ && wget -qO- http://116.205.97.109/scripts/ssh.sh | bash -s -- --file=/opt/ip.txt --user=root --passwd=XXXXXX   # passwordless SSH
# In the two commands below, replace IP1,IP2,IP3 with the node addresses; IP1 is the master.
cd /opt/ && wget -qO- http://116.205.97.109/scripts/redis.sh | bash -s -- --install --ip=IP1,IP2,IP3 --passwd=XXXXXX     # install
cd /opt/ && wget -qO- http://116.205.97.109/scripts/redis.sh | bash -s -- --uninstall --ip=IP1,IP2,IP3 --passwd=XXXXXX   # uninstall

# Harbor
# Nacos
```
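The three `k8s-config-deploy.sh` invocations differ only in `--local-ip` and `--hostname`, and the haproxy backend list repeats the same master addresses. The sketch below generates both from a single list so the values cannot drift apart. It only prints the commands; the variable and function names are hypothetical, and the addresses are the sample values used in this section.

```bash
#!/usr/bin/env bash
# Sketch: derive the per-master commands from one list of master IPs.
# All values are the sample addresses from this README.
BASE_URL="http://116.205.97.109/scripts"
MASTERS=(192.168.61.131 192.168.61.132 192.168.61.133)
K8S_VERSION="1.30.5"
CLUSTER_VIP="192.168.61.200"
CLUSTER_PORT="36443"

# Join the master IPs as ip:6443,ip:6443,... for haproxy.sh --backend.
haproxy_backends() {
    local ip out=""
    for ip in "${MASTERS[@]}"; do
        out+="${out:+,}${ip}:6443"
    done
    echo "$out"
}

# Print (do not run) the config-deploy command for every master.
config_deploy_commands() {
    local i
    for i in "${!MASTERS[@]}"; do
        printf 'wget -qO- %s/k8s-config-deploy.sh | bash -s -- --local-ip=%s --hostname=master-%02d --k8s-version=%s --cluster-vip=%s --cluster-port=%s --master1-ip=%s --master2-ip=%s --master3-ip=%s\n' \
            "$BASE_URL" "${MASTERS[$i]}" "$((i + 1))" "$K8S_VERSION" "$CLUSTER_VIP" "$CLUSTER_PORT" \
            "${MASTERS[0]}" "${MASTERS[1]}" "${MASTERS[2]}"
    done
}
```

Calling `config_deploy_commands` prints one line per master, which can be reviewed and then executed on the matching node; `haproxy_backends` yields the value for `--backend`.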
#### (12) AL multi-tenant platform: GPU resource benchmarks
```bash
cd /opt/ && wget -qO- http://116.205.97.109/scripts/gpu_bench_auto.sh | bash -s -- --tests=bandwidthTest,deviceQuery,gpu_burn,p2pBandwidthLatencyTest   # GPU benchmarks
cd /opt/ && wget -qO- http://116.205.97.109/scripts/dcgm_diag_bg_log.sh | bash   # stress test; runs in the background by default at level 4 and takes about 2-3 hours
cd /opt/ && wget -qO- http://116.205.97.109/scripts/local_allreduce_gpu.sh | bash   # single-node allreduce test
cd /opt/ && wget -qO- http://116.205.97.109/scripts/mpi_allreduce_perf_test.sh | bash -s -- --host=node01-ip:8,node02-ip:8   # multi-node allreduce test
```
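The `--host` value for the multi-node allreduce test pairs each node address with a slot count. A sketch of building that value is shown below, assuming the slot count is the number of GPUs per node and that all nodes have the same number of GPUs as the local one. The function names and the `NVIDIA_SMI` override are hypothetical; `nvidia-smi -L` is the standard option that lists one GPU per line.

```bash
#!/usr/bin/env bash
# Count local GPUs; nvidia-smi -L prints one "GPU <n>: ..." line per device.
# NVIDIA_SMI can point at another command (useful for testing).
gpu_count() {
    "${NVIDIA_SMI:-nvidia-smi}" -L | grep -c '^GPU '
}

# Build "--host=a:N,b:N" for the given node addresses, where N is the
# local GPU count (assumes homogeneous nodes).
mpi_host_arg() {
    local n host out=""
    n="$(gpu_count)"
    for host in "$@"; do
        out+="${out:+,}${host}:${n}"
    done
    echo "--host=${out}"
}
```

On two 8-GPU nodes, `mpi_host_arg node01-ip node02-ip` yields the same form as the `--host` argument used above.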
#### (13) Quick monitoring setup with Docker
```bash
# Dependencies: docker, docker-compose
# Prometheus configuration:
#   1. dcgm-exporter and node-exporter support both Consul service discovery and file_sd discovery.
#   2. ipmi-exporter supports file_sd discovery.
# Services deployed automatically: ipmi-exporter, prometheus, alertmanager, consul, grafana, prometheus-alert
cd /opt/ && wget -O - http://116.205.97.109/scripts/prometheus-monitor.tgz | tar -xvz && cd prometheus-monitor && docker-compose up -d
# The commands below can be run in batch with Ansible; alternatively, registration can be looped from any node by sending PUT requests.
bash /opt/prometheus-monitor/dcgm-consul.sh --register/deregister   # register/deregister dcgm-exporter with Consul
bash /opt/prometheus-monitor/node-consul.sh --register/deregister   # register/deregister node-exporter with Consul
```
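For the PUT-request route mentioned above, Consul's agent API accepts a service definition at `/v1/agent/service/register`. The sketch below builds such a payload and sends it with `curl`. It is not taken from `dcgm-consul.sh` or `node-consul.sh`; the function names, the ID scheme, and the use of Consul's default HTTP port 8500 are assumptions.

```bash
#!/usr/bin/env bash
# Build a minimal Consul service definition (JSON) for one exporter.
# The "<name>-<address>" ID scheme is an assumption for illustration.
consul_service_json() {
    local name="$1" addr="$2" port="$3"
    printf '{"ID":"%s-%s","Name":"%s","Address":"%s","Port":%d}\n' \
        "$name" "$addr" "$name" "$addr" "$port"
}

# Register the service with the Consul agent at <consul-host>:8500.
# Usage: register_service <consul-host> <name> <address> <port>
register_service() {
    local consul="$1"
    shift
    consul_service_json "$@" |
        curl -fsS -X PUT --data-binary @- "http://${consul}:8500/v1/agent/service/register"
}
```

A loop such as `while read -r ip; do register_service consul-host dcgm-exporter "$ip" 9400; done < /opt/ip.txt` would register every node listed in a file, where `consul-host` and port 9400 are placeholders to adapt.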
#### (13) Batch install/uninstall
![Static Badge](https://img.shields.io/badge/Component_set_[1]-orange?style=flat-square)
![Static Badge](https://img.shields.io/badge/mlnx_ofed-23.10.1.1.9.0-brightgreen?style=plastic)
![Static Badge](https://img.shields.io/badge/nvidia_drive-565.57.01-brightgreen?style=plastic)
![Static Badge](https://img.shields.io/badge/cuda-12.6.3.560.35.05-brightgreen?style=plastic)
![Static Badge](https://img.shields.io/badge/fabricmanager-565_565.57.01.1-brightgreen?style=plastic)

![Static Badge](https://img.shields.io/badge/Recommended_one--click_install_script-orange?style=flat-square)
```bash
# Install/uninstall services (both take a long time, so running them in the background is recommended).
# Set [1] -----------------------------------------------------------------------------------------------------------------------------------
screen -dmS install_script bash -c "cd /opt/ && wget -qO- http://116.205.97.109/scripts/gpu-manager.sh | bash -s -- --install --version 1 --include=exporter > /opt/gpu-manager.log 2>&1";
tail -f /opt/gpu-manager.log
screen -dmS uninstall_script bash -c "cd /opt/ && wget -qO- http://116.205.97.109/scripts/gpu-manager.sh | bash -s -- --uninstall --version 1 --include=exporter > /opt/gpu-manager.log 2>&1";
tail -f /opt/gpu-manager.log
# Set [2] -----------------------------------------------------------------------------------------------------------------------------------
screen -dmS install_script bash -c "cd /opt/ && wget -qO- http://116.205.97.109/scripts/gpu-manager.sh | bash -s -- --install --version 2 --include=exporter > /opt/gpu-manager.log 2>&1";
tail -f /opt/gpu-manager.log
screen -dmS uninstall_script bash -c "cd /opt/ && wget -qO- http://116.205.97.109/scripts/gpu-manager.sh | bash -s -- --uninstall --version 2 --include=exporter > /opt/gpu-manager.log 2>&1";
tail -f /opt/gpu-manager.log
# Notes:
# --version 1 installs/uninstalls component set [1]: mlnx_ofed-23.10.1.1.9.0 + nvidia_drive-565.57.01 + cuda-12.6.3.560.35.05 + fabricmanager-565_565.57.01.1
# --version 2 installs/uninstalls component set [2]: mlnx_ofed-23.10.1.1.9.0 + nvidia_drive-570.124.06 + cuda-12.8.1.570.124.06 + fabricmanager-570_570.124.06.1
# --include=exporter: when given, the script also installs/uninstalls the exporter services [dcgm-exporter, node-exporter, nvidia-dcgm]; by default they are left untouched.
```
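Since progress is followed with `tail -f /opt/gpu-manager.log`, the log is only complete if stderr ends up in the file as well. In the shell the order of redirections matters: `> file 2>&1` sends both streams to the file, while `2>&1 > file` leaves stderr on the terminal. A small self-contained demonstration, with `noisy` as a hypothetical stand-in for the installer:

```bash
#!/usr/bin/env bash
# Stand-in for an installer: writes one line to stdout and one to stderr.
noisy() {
    echo "progress message"
    echo "error message" >&2
}

log_both="$(mktemp)"
log_stdout="$(mktemp)"

# stdout is sent to the file first, then stderr is pointed at the same file:
# both lines are logged.
noisy > "$log_both" 2>&1

# stderr is pointed at the current stdout (the terminal) first, then stdout
# alone is moved to the file: the error line never reaches the log.
noisy 2>&1 > "$log_stdout"

echo "lines logged with '> file 2>&1': $(wc -l < "$log_both")"
echo "lines logged with '2>&1 > file': $(wc -l < "$log_stdout")"
```

The first log holds both lines, while the second holds only the stdout line.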
**Important notes**
```bash
# GPU: B200 series
# (1) On B200 systems, nvlsm must be installed along with fabricmanager; otherwise fabricmanager cannot start.
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/nvlsm_2025.03.1-1_amd64.deb
dpkg -i nvlsm_2025.03.1-1_amd64.deb

# Supermicro servers:
# (1) Add GRUB_CMDLINE_LINUX="quiet splash nokaslr" to /etc/default/grub; otherwise CUDA initialization fails.
# (2) Add GRUB_CMDLINE_LINUX_DEFAULT="intel_iommu=off" to /etc/default/grub to disable the IOMMU properly.
# (3) Mounting the image on Supermicro servers: protocol: HTTP; file server: http://10.51.151.201; image path: /iso/ubu22043.iso

# Clonezilla image restore:
# (1) ASRock B200: Clonezilla image (NFS): 10.102.35.99/nfs/clone.iso; backup path: /nfs/2025-05-26-09-B200-960g-img   (ASRock servers have no requirement on the Clonezilla boot image version)
# (2) Supermicro B200: Clonezilla image: 10.102.35.99/nfs/clone.iso; backup path: /nfs/chaowei-B200-1.7T-img   (note: Supermicro servers are sensitive to the Clonezilla boot image version; the latest version fails to boot)
# (3) Gigabyte A100: Clonezilla image: 10.101.0.86:/nfs/; backup path: /nfs/2025-07-15-03-Jijia-A100-960G-img   (Gigabyte A100, 960G disk, CX7)
```
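The two GRUB settings for Supermicro servers can be applied idempotently so that repeated runs do not duplicate parameters. The sketch below edits a file passed as an argument; the function name is hypothetical, and it assumes GNU `sed` as shipped with Ubuntu.

```bash
#!/usr/bin/env bash
# Hypothetical helper: append PARAM to the quoted value of KEY in FILE
# unless it is already present; add the KEY line if it does not exist.
add_grub_param() {
    local file="$1" key="$2" param="$3"
    # Already present as a whole word inside the quoted value: nothing to do.
    if grep -Eq "^${key}=\"([^\"]* )?${param}( [^\"]*)?\"" "$file"; then
        return 0
    fi
    if grep -q "^${key}=" "$file"; then
        # Append the parameter, then drop the leading space left behind
        # when the original value was empty.
        sed -i -E \
            -e "s|^(${key}=\")([^\"]*)\"|\1\2 ${param}\"|" \
            -e "s|^(${key}=\") |\1|" "$file"
    else
        echo "${key}=\"${param}\"" >> "$file"
    fi
}
```

For example, `add_grub_param /etc/default/grub GRUB_CMDLINE_LINUX nokaslr` followed by `add_grub_param /etc/default/grub GRUB_CMDLINE_LINUX_DEFAULT intel_iommu=off`. On Ubuntu the change takes effect after running `update-grub` and rebooting.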
**Ubuntu 24.04 (temporary)**
```bash
cd /opt/
wget https://content.mellanox.com/ofed/MLNX_OFED-24.10-2.1.8.0/MLNX_OFED_LINUX-24.10-2.1.8.0-ubuntu24.04-x86_64.tgz   # [ubuntu24.04]
wget https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2404/x86_64/nvidia-fabricmanager-570_570.124.06-1_amd64.deb   # [ubuntu24.04]
wget https://cn.download.nvidia.com/tesla/570.124.06/NVIDIA-Linux-x86_64-570.124.06.run   # [no OS version requirement]
wget https://developer.download.nvidia.com/compute/cuda/12.8.1/local_installers/cuda_12.8.1_570.124.06_linux.run   # [no OS version requirement]
cd /opt/ && git clone http://116.205.97.109:3000/yindun/ansible-devops.git
cd /opt/ansible-devops/scripts/
# ----- Temporary substitutions to adapt the scripts to Ubuntu 24.04
sed -i -e 's/5.8-6.0.4.2/24.10-2.1.8.0/g' -e 's/22.04/24.04/g' ib-drive.sh && sed -i 's/2204/2404/g' nvidia-fabricmanager.sh
bash system_optimize.sh --install
bash ib-drive.sh --install --version "24.10-2.1.8.0"
bash nvidia-driver.sh --install --version '570.124.06'
bash nvidia-fabricmanager.sh --install --version "570_570.124.06-1"
bash cuda.sh --install --version "12.8.1_570.124.06"
# Install the exporters
cd /opt/ && wget -qO- http://116.205.97.109/scripts/nvidia-dcgm.sh | bash -s -- --install
cd /opt/ && wget -qO- http://116.205.97.109/scripts/dcgm-exporter.sh | bash -s -- --install
cd /opt/ && wget -qO- http://116.205.97.109/scripts/node-exporter.sh | bash -s -- --install
cd /opt/ && wget -qO- http://116.205.97.109/scripts/deploy_gpu_monitor.sh | bash -s -- --install   # custom extension for dcgm-exporter; to be integrated into dcgm later

# Hostname change, kernel version pinning, and root partition expansion are integrated into the initialization script; no need to run them again.
# Pinghu boot image: 10.101.0.34 /nfs/iso/gpu550.iso
```