# CUDA Installation and Configuration Guide with TensorFlow on NVIDIA GPU (G5K)
(This guide has been tested on the `chifflot` cluster in Lille.)
## 1. Check GPU Status
```bash
nvidia-smi
```
The result will look like this:
```
Fri Apr 25 14:53:18 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.06 Driver Version: 535.183.06 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla V100-PCIE-32GB On | 00000000:3B:00.0 Off | 0 |
| N/A 33C P0 26W / 250W | 0MiB / 32768MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 Tesla V100-PCIE-32GB On | 00000000:D8:00.0 Off | 0 |
| N/A 30C P0 27W / 250W | 0MiB / 32768MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
```
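If you prefer a machine-readable summary (for scripts or logs), `nvidia-smi` also accepts query flags; a minimal example:
```bash
# Query GPU name, total memory, and driver version as CSV instead of the table view
nvidia-smi --query-gpu=name,memory.total,driver_version --format=csv
```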
## 2. Check Current CUDA Version
```bash
nvcc --version
```
The result will look like this:
```
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Feb_14_21:12:58_PST_2021
Cuda compilation tools, release 11.2, V11.2.152
Build cuda_11.2.r11.2/compiler.29618528_0
```
### Note:
Compare the CUDA version supported by the GPU driver (reported by `nvidia-smi`) with the CUDA toolkit version currently in use (reported by `nvcc`).
If you need a different CUDA version, switch to another one as described below.
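For example, the driver-supported CUDA version appears in the `nvidia-smi` header, while `nvcc` reports the toolkit currently on your `PATH`; a quick way to compare the two:
```bash
# Highest CUDA version supported by the installed driver (from the nvidia-smi header)
nvidia-smi | grep "CUDA Version"
# CUDA toolkit version currently on the PATH
nvcc --version | grep "release"
```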
## 3. Check Available CUDA Versions with `module`
```bash
module av cuda
```
The result will look like this:
```
-------------------- /grid5000/spack/v1/share/spack/modules/linux-debian11-x86_64_v2 --------------------
cuda/11.4.0_gcc-10.4.0 cuda/11.8.0_gcc-10.4.0 cuda/12.2.1_gcc-10.4.0 (D)
cuda/11.6.2_gcc-10.4.0 cuda/12.0.0_gcc-10.4.0 mpich/4.1_gcc-10.4.0-ofi-cuda
cuda/11.7.1_gcc-10.4.0 cuda/12.1.1_gcc-10.4.0 mpich/4.1_gcc-10.4.0-ucx-cuda
```
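Before loading a module, you can inspect what it will change (binaries added to `PATH`, library paths, etc.) with `module show`; for example:
```bash
# Show the environment changes the CUDA 12.2.1 module applies when loaded
module show cuda/12.2.1_gcc-10.4.0
```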
## 4. Load the Desired CUDA Version
To load CUDA 12.2.1 (the version supported by the GPU driver in this example), use the following command:
```bash
module load cuda/12.2.1_gcc-10.4.0
```
The default CUDA version will be replaced by CUDA 12.2.1.
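You can confirm that the module was loaded (and that it replaced the previous CUDA module) with:
```bash
# List the modules currently loaded in this session
module list
```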
## 5. Set CUDA Environment Variables (If Needed)
Set up the environment variables to use the newly loaded CUDA version:
```bash
export PATH=/usr/local/cuda-12.2.1/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-12.2.1/lib64:$LD_LIBRARY_PATH
```
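Note that the exact install prefix can differ between systems (module-managed installs live under `/grid5000/spack/...` rather than `/usr/local`), so it is worth checking which `nvcc` is actually picked up before hard-coding paths:
```bash
# Locate the nvcc that is first on the PATH
which nvcc
# List any CUDA-related entries currently on the library path
echo "$LD_LIBRARY_PATH" | tr ':' '\n' | grep -i cuda
```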
## 6. Check CUDA Version Again
After loading the new CUDA version, check the version again:
```bash
nvcc --version
```
The result will show:
```
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Jul_11_02:20:44_PDT_2023
Cuda compilation tools, release 12.2, V12.2.128
Build cuda_12.2.r12.2/compiler.33053471_0
```
## 7. Install TensorFlow with CUDA Support
```bash
python3 -m pip install 'tensorflow[and-cuda]'
```
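Installing into a virtual environment keeps TensorFlow and its bundled CUDA libraries separate from the system Python; a minimal sketch (the environment name `tf-gpu` is just an example):
```bash
# Create and activate an isolated environment (name is arbitrary)
python3 -m venv ~/tf-gpu
source ~/tf-gpu/bin/activate
python3 -m pip install --upgrade pip
python3 -m pip install 'tensorflow[and-cuda]'
```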
## 8. Verify Available GPUs in TensorFlow
Verify that TensorFlow recognizes the available GPUs:
```bash
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
```
The result will show a list of available GPUs:
```
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU')]
```
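As an extra sanity check, you can run a small computation and ask TensorFlow where it was placed; if the setup is correct, the printed device string should mention `GPU:0`:
```bash
# Multiply two random matrices and print the device the result was computed on
python3 -c "import tensorflow as tf; x = tf.random.uniform((1000, 1000)); y = tf.matmul(x, x); print(y.device)"
```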
---
Congratulations, you have successfully installed and configured TensorFlow with GPU support on G5K! (Or not? :stuck_out_tongue_winking_eye:)
@@ -39,7 +39,6 @@ This framework requires:
```bash
pip install -r requirements.txt
```
*Note:* `requirements.txt` includes `tensorflow`, `tensorflow-datasets`, `scikit-learn`, and `numpy`, used for the provided Flower example.
Navigate to `Run` directory:
@@ -198,7 +197,7 @@ Choose only one in 3 settings:
## Quickstart
### Step 1. Reserve the Hosts in G5K
### Step 0. Reserve the Hosts in G5K
Reserve the required number of hosts (*See the [document of G5K](https://www.grid5000.fr/w/Getting_Started#Reserving_resources_with_OAR:_the_basics) for more details*)
<u>For example</u>:
@@ -209,17 +208,29 @@ oarsub -I -l host=4,walltime=2
```
Reserve 4 GPU hosts (1 server + 3 clients) for 2 hours:
```bash
oarsub -I -t exotic -p "gpu_count>0" -l {"cluster='drac'"}/host=4 # grenoble
oarsub -I -p "gpu_count>0" -l {"cluster='chifflot'"}/host=4 # lille
oarsub -I -p "gpu_count>0" -l {"cluster='chifflot'"}/host=4,walltime=2 # lille
```
**Remark**: for now only 2 clusters, `chifflot` in Lille and `drac` in Grenoble are available for testing in more than 3 GPU nodes, maximum is 8 (`chifflot`) or 12 (`drac`) nodes.
**Remark**: for now, only one cluster, `chifflot` in Lille, is available for testing with more than 3 GPU nodes and the required setup; the maximum is 8 nodes on `chifflot`. CUDA must be configured before using the GPUs; check out the quick guide [here](./GPU_cuda.md) or the [G5K website](https://www.grid5000.fr/w/GPUs_on_Grid5000).
Make sure you are in `eflwr/Run/`:
```bash
cd Run
```
### Step 1. Install requirements
If you use CPU nodes:
```bash
pip install -r requirements.txt # further requirements needed for the Flower example
```
If you use GPU nodes:
```bash
pip install -r requirement_GPU.txt # further requirements needed for the Flower example
```
*Note:* the further requirements include `tensorflow` or `tensorflow[and-cuda]`, `tensorflow-datasets`, `scikit-learn`, and `numpy`, used for the provided Flower example.
### Step 2. Configure
Two JSON configuration files (e.g. `config_instances_CPU.json` for CPU and `config_instances_GPU.json` for GPU) specify the experiment details; each includes one or more instances.
......
flwr==1.13.0
flwr-datasets==0.4.0
expetator==0.3.25
tensorflow[and-cuda]>=2.16.1,<2.17.0
tensorflow-datasets==4.4.0
tensorboard>=2.16.2,<2.17.0
scikit-learn==1.1.3
numpy>=1.23.0,<1.24.0