<h2 align="center">Standardized Deployment of GPU Server Components with Ansible</h2>

<p align="center">
<img src="https://img.shields.io/github/languages/code-size/nanchengcyu/TechMindWave-frontend" alt="code size"/>
<img src="https://img.shields.io/badge/ofed-17.0.2-blue" alt="ofed"/>
<img src="https://img.shields.io/badge/NVIDIA-565.57.01-brightgreen" alt="NVIDIA"/>
<img src="https://img.shields.io/badge/fabricmanager-565.57.01-blue" alt="fabricmanager"/>
<img src="https://img.shields.io/badge/CUDA-12.6.3-brightgreen" alt="CUDA"/>
<br>
<img src="https://img.shields.io/badge/Author-王云龙-orange" alt="Author" />
</p>

<hr>

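### I. Playbook Overview

This playbook combines **standalone scripts with Ansible batch execution** to deploy server components in a standardized way. Its core features are:

- **Component scope**: system initialization, GPU driver, NIC driver, Node Exporter, and DCGM Exporter (extensible)
- **Operation types**: `--install` and `--uninstall`
- **Version management**: a default version defined in the variable files, or a version passed explicitly at run time
- **Maintainability**: installation logic lives in the scripts, so it can be changed without touching the Ansible playbooks, which keeps the two decoupled.
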
### II. Directory Structure

```plaintext
prod-ansible/
├── inventory/                      # Host inventory
│   └── prod.ini                    # Host groups
│
├── group_vars/                     # Global shared variables
│   └── all.yaml                    # SSH settings, script paths, log directory
│
├── roles/                          # Component roles (own variables/tasks/scripts)
│   ├── gpu_driver/                 # GPU driver role
│   │   ├── vars/                   # Role-specific variables
│   │   │   └── main.yaml           # Driver version, download URL, etc.
│   │   ├── tasks/                  # Role tasks (install/uninstall logic)
│   │   │   └── main.yml            # Calls the script to perform the operation
│   │   └── files/                  # Role-specific scripts
│   │       └── install.sh          # GPU driver install/uninstall script
│   │
│   ├── node_exporter/              # Node Exporter role
│   │   ├── vars/                   # Role-specific variables
│   │   └── tasks/                  # Role-specific tasks
│   │
│   └── dcgm_exporter/              # DCGM Exporter role
│
├── playbooks/                      # Per-component playbooks
│   ├── deploy_gpu.yml              # Deploys the GPU driver only
│   ├── deploy_node_exporter.yml    # Deploys Node Exporter only
│   └── all_components_deploy.yml   # Deploys all components
│
├── scripts/                        # Helper scripts (for use when an Ansible run fails)
│   ├── ib-drive.sh                 # NIC driver install script
│   ├── nvidia-driver.sh            # GPU driver install script
│   ├── nvidia-fabricmanager.sh     # GPU interconnect manager (Fabric Manager) install script
│   ├── cuda.sh                     # CUDA toolkit install script
│   └── system_optimize.sh          # System initialization script
│
├── ansible.cfg                     # Global Ansible configuration
└── README.md
```

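
The role layout above (`vars/`, `tasks/`, `files/`) can be reproduced for a new component with a few shell commands. The following is a minimal sketch under assumptions: `scaffold_role` is a hypothetical helper that does not exist in this repository, and the file names are simply copied from the `gpu_driver` role in the tree.

```bash
#!/usr/bin/env bash
# Sketch only: scaffold_role is a hypothetical helper, not part of this repository.

# Create an empty role skeleton that mirrors roles/gpu_driver/ in the tree above.
# $1: project root (for example prod-ansible), $2: role name (for example dcgm_exporter)
scaffold_role() {
  local project_root="$1"
  local role_name="$2"
  local role_dir="${project_root}/roles/${role_name}"

  mkdir -p "${role_dir}/vars" "${role_dir}/tasks" "${role_dir}/files" || return 1

  # Empty placeholders; fill them with the role's variables, tasks, and script.
  touch "${role_dir}/vars/main.yaml" \
        "${role_dir}/tasks/main.yml" \
        "${role_dir}/files/install.sh" || return 1
  chmod +x "${role_dir}/files/install.sh" || return 1

  # Print the created role directory so callers can inspect it.
  printf '%s\n' "${role_dir}"
}

# Demo in a throwaway directory so nothing in a real checkout is touched.
demo_root="$(mktemp -d)"
scaffold_role "${demo_root}" "dcgm_exporter"
rm -rf "${demo_root}"
```

In a real checkout the first argument would be the project root, and the generated placeholders would still need real content before the role can run.
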
### III. Example Commands

#### Scenario 1: Batch-install Node Exporter (example)

```bash
ansible-playbook -i inventory/prod.ini site.yml --extra-vars "component=node-exporter-install operation=install"
```

#### Scenario 2: Install a specific NVIDIA driver version on a single server

```bash
ansible-playbook -i inventory/staging.ini site.yml --limit server-01 --extra-vars "component=gpu-install operation=install version=535.104.05"
```

#### Scenario 3: Batch-uninstall DCGM Exporter

```bash
ansible-playbook -i inventory/prod.ini site.yml --extra-vars "component=dcgm-exporter-install operation=uninstall"
```

#### Scenario 4: Run the scripts manually (when Ansible fails)

```bash
# Go to the scripts directory and run the relevant script by hand:
# http://10.101.0.51:3000/yindun/ansible-devops/src/branch/main/scripts
```

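
The Ansible scenarios in this section differ only in the inventory, the optional `--limit` host, and the `component`, `operation`, and `version` extra variables. A small wrapper can make that pattern explicit. This is a sketch under assumptions: `build_playbook_cmd` is a hypothetical function, not something the repository provides, and it only prints the command for review instead of running it.

```bash
#!/usr/bin/env bash
# Sketch only: build_playbook_cmd is a hypothetical wrapper, not part of this repository.

# Print (do not run) the ansible-playbook command for one component operation.
# $1: inventory file, $2: component, $3: operation (install|uninstall),
# $4: optional version, $5: optional host pattern for --limit
build_playbook_cmd() {
  local inventory="$1" component="$2" operation="$3"
  local version="${4:-}" limit="${5:-}"

  # Only the two operation types described in this README are accepted.
  case "${operation}" in
    install|uninstall) ;;
    *) echo "unsupported operation: ${operation}" >&2; return 1 ;;
  esac

  local extra_vars="component=${component} operation=${operation}"
  if [[ -n "${version}" ]]; then
    extra_vars+=" version=${version}"
  fi

  local cmd="ansible-playbook -i ${inventory} site.yml"
  if [[ -n "${limit}" ]]; then
    cmd+=" --limit ${limit}"
  fi
  cmd+=" --extra-vars \"${extra_vars}\""

  printf '%s\n' "${cmd}"
}

# Reproduces the Scenario 2 command line.
build_playbook_cmd inventory/staging.ini gpu-install install 535.104.05 server-01
```

Printing the command keeps the sketch free of side effects; values containing spaces or quotes would need proper quoting before the output could be executed safely.
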
### IV. Extension and Maintenance

#### 1. Adding a component

1. Write a standalone script (for example `node-exporter-install.sh`) that supports the `--install`, `--uninstall`, and `--version` arguments.
2. Upload the script to the designated storage location.
3. Add the component's default version to `group_vars/all.yaml`.
4. Invoke it directly through Ansible:

```bash
ansible-playbook -i inventory/prod.ini site.yml --extra-vars "component=node-exporter-install operation=install"
```

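
Step 1 asks for a script that accepts `--install`, `--uninstall`, and `--version`. One possible shape for that argument handling is sketched below. Everything here is an assumption for illustration: `component_main`, `do_install`, `do_uninstall`, and the `DEFAULT_VERSION` placeholder are hypothetical names, and the real install and removal steps are left as stubs.

```bash
#!/usr/bin/env bash
# Sketch only: a skeleton for a component script such as node-exporter-install.sh.
# All names below are hypothetical; the do_* bodies are stubs, not real install logic.

DEFAULT_VERSION="0.0.0"   # placeholder; the real default would come from the variable file

do_install()   { echo "installing version $1"; }   # replace with real install steps
do_uninstall() { echo "uninstalling"; }            # replace with real removal steps

usage() { echo "usage: $0 (--install|--uninstall) [--version X.Y.Z]" >&2; }

component_main() {
  local operation="" version="${DEFAULT_VERSION}"

  # Walk the arguments; --version consumes the value that follows it.
  while [[ $# -gt 0 ]]; do
    case "$1" in
      --install)   operation="install" ;;
      --uninstall) operation="uninstall" ;;
      --version)
        [[ $# -ge 2 ]] || { usage; return 2; }
        version="$2"
        shift
        ;;
      *) usage; return 2 ;;
    esac
    shift
  done

  # Dispatch to the requested operation; reject calls without one.
  case "${operation}" in
    install)   do_install "${version}" ;;
    uninstall) do_uninstall ;;
    *)         usage; return 2 ;;
  esac
}

# Demo call; a standalone script would end with: component_main "$@"
component_main --install --version 1.2.3
```

Keeping the parsing in one function and the work in `do_install` and `do_uninstall` means only those two stubs change per component.
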
---