From 75b3b627245eae93e7a06f5fc019431ee6d80efc Mon Sep 17 00:00:00 2001 From: huongdm1896 <domaihuong1451997@gmail.com> Date: Sat, 3 May 2025 01:52:27 +0200 Subject: [PATCH] Add GPU instructions --- GPU_cuda.md | 131 ++++++++++++++++++++++++++++++++++++++++++++ README.md | 21 +++++-- requirement_GPU.txt | 8 +++ 3 files changed, 155 insertions(+), 5 deletions(-) create mode 100644 GPU_cuda.md create mode 100644 requirement_GPU.txt diff --git a/GPU_cuda.md b/GPU_cuda.md new file mode 100644 index 0000000..180e188 --- /dev/null +++ b/GPU_cuda.md @@ -0,0 +1,131 @@ +# CUDA Installation and Configuration Guide with TensorFlow on NVIDIA GPU (G5K) +(This guide has been tested on chifflot - Lille) +## 1. Check GPU Status + +```bash +nvidia-smi +``` + +The result will look like this: + +``` +Fri Apr 25 14:53:18 2025 ++---------------------------------------------------------------------------------------+ +| NVIDIA-SMI 535.183.06 Driver Version: 535.183.06 CUDA Version: 12.2 | +|-----------------------------------------+----------------------+----------------------+ +| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | +| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | +| | | MIG M. 
|
+|=========================================+======================+======================|
+|   0  Tesla V100-PCIE-32GB           On  | 00000000:3B:00.0 Off |                    0 |
+| N/A   33C    P0              26W / 250W |      0MiB / 32768MiB |      0%      Default |
+|                                         |                      |                  N/A |
++-----------------------------------------+----------------------+----------------------+
+|   1  Tesla V100-PCIE-32GB           On  | 00000000:D8:00.0 Off |                    0 |
+| N/A   30C    P0              27W / 250W |      0MiB / 32768MiB |      0%      Default |
+|                                         |                      |                  N/A |
++-----------------------------------------+----------------------+----------------------+
++---------------------------------------------------------------------------------------+
+| Processes:                                                                            |
+|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
+|        ID   ID                                                             Usage      |
+|=======================================================================================|
+|  No running processes found                                                           |
++---------------------------------------------------------------------------------------+
+```
+
+## 2. Check Current CUDA Version
+
+```bash
+nvcc --version
+```
+
+The result will look like this:
+
+```
+nvcc: NVIDIA (R) Cuda compiler driver
+Copyright (c) 2005-2021 NVIDIA Corporation
+Built on Sun_Feb_14_21:12:58_PST_2021
+Cuda compilation tools, release 11.2, V11.2.152
+Build cuda_11.2.r11.2/compiler.29618528_0
+```
+
+### Note:
+You need to check which CUDA versions the GPU supports and which version is currently in use.
+If you need a different CUDA version, you will have to switch to it, as described in the next steps.
+
+## 3. Check Available CUDA Versions with `module`
+
+```bash
+module av cuda
+```
+
+The result will look like this:
+
+```
+-------------------- /grid5000/spack/v1/share/spack/modules/linux-debian11-x86_64_v2 --------------------
+   cuda/11.4.0_gcc-10.4.0    cuda/11.8.0_gcc-10.4.0    cuda/12.2.1_gcc-10.4.0 (D)
+   cuda/11.6.2_gcc-10.4.0    cuda/12.0.0_gcc-10.4.0    mpich/4.1_gcc-10.4.0-ofi-cuda
+   cuda/11.7.1_gcc-10.4.0    cuda/12.1.1_gcc-10.4.0    mpich/4.1_gcc-10.4.0-ucx-cuda
+```
+
+## 4. 
Load the Desired CUDA Version
+
+To load CUDA 12.2.1 (the version the GPU driver reports), use the following command:
+
+```bash
+module load cuda/12.2.1_gcc-10.4.0
+```
+
+The default CUDA version will be replaced by CUDA 12.2.1.
+
+## 5. Set CUDA Environment Variables (If Needed)
+
+Set up the environment variables to use the newly loaded CUDA version:
+
+```bash
+export PATH=/usr/local/cuda-12.2.1/bin:$PATH
+export LD_LIBRARY_PATH=/usr/local/cuda-12.2.1/lib64:$LD_LIBRARY_PATH
+```
+
+## 6. Check CUDA Version Again
+
+After loading the new CUDA version, check the version again:
+
+```bash
+nvcc --version
+```
+
+The result will show:
+
+```
+nvcc: NVIDIA (R) Cuda compiler driver
+Copyright (c) 2005-2023 NVIDIA Corporation
+Built on Tue_Jul_11_02:20:44_PDT_2023
+Cuda compilation tools, release 12.2, V12.2.128
+Build cuda_12.2.r12.2/compiler.33053471_0
+```
+
+## 7. Install TensorFlow with CUDA Support
+
+```bash
+python3 -m pip install 'tensorflow[and-cuda]'
+```
+
+## 8. Verify Available GPUs in TensorFlow
+
+Verify that TensorFlow recognizes the available GPUs:
+
+```bash
+python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
+```
+
+The result will show a list of available GPUs:
+
+```
+[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU')]
+```
+
+---
+
+Congratulations, you have successfully installed and configured TensorFlow with GPU support on G5K! (Or not? :stuck_out_tongue_winking_eye:)
\ No newline at end of file
diff --git a/README.md b/README.md
index 818f918..1facfef 100644
--- a/README.md
+++ b/README.md
@@ -39,7 +39,6 @@ This framework requires:
 ```bash
 pip install -r requirements.txt
 ```
-*Note:* `requirements.txt` includes `tensorflow`, `tensorflow-datasets` `scikit-learn` and `numpy` using for the provided Flower example.
 
 Navigate to `Run` directory:
 
@@ -198,7 +197,7 @@ Choose only one in 3 settings:
 
 ## Quickstart
 
-### Step 1. 
Reserve the Hosts in G5K
+### Step 0. Reserve the Hosts in G5K
 Reserve the required number of hosts (*See the [document of G5K](https://www.grid5000.fr/w/Getting_Started#Reserving_resources_with_OAR:_the_basics) for more details*)
 <u>For example</u>:
@@ -209,17 +208,29 @@
 oarsub -I -l host=4,walltime=2
 ```
 Reserve 4 hosts (GPU) (1 server + 3 clients) for 2 hours:
 ```bash
-oarsub -I -t exotic -p "gpu_count>0" -l {"cluster='drac'"}/host=4 # grenoble
-oarsub -I -p "gpu_count>0" -l {"cluster='chifflot'"}/host=4 # lille
+oarsub -I -p "gpu_count>0" -l {"cluster='chifflot'"}/host=4,walltime=2 # lille
 ```
-**Remark**: for now only 2 clusters, `chifflot` in Lille and `drac` in Grenoble are available for testing in more than 3 GPU nodes, maximum is 8 (`chifflot`) or 12 (`drac`) nodes.
+**Remark**: for now only one cluster, `chifflot` in Lille, is available for testing with more than 3 GPU nodes (and supports the required setup); the maximum is 8 nodes.
+You need to configure CUDA before using the GPUs; check out the quick guide [here](./GPU_cuda.md) or the [G5K website](https://www.grid5000.fr/w/GPUs_on_Grid5000).
 
 Make sure you are in `eflwr/Run/`:
 ```bash
 cd Run
 ```
+### Step 1. Install requirements
+
+If you use CPU nodes:
+ ```bash
+ pip install -r requirements.txt # needed for the provided Flower example
+ ```
+
+If you use GPU nodes:
+ ```bash
+ pip install -r requirement_GPU.txt # needed for the provided Flower example
+ ```
+*Note:* the requirements include `tensorflow` (or `tensorflow[and-cuda]` for GPU), `tensorflow-datasets`, `scikit-learn`, and `numpy`, used by the provided Flower example.
+
 ### Step 2. Configure
 Two JSON configuration files (e.g. `config_instances_CPU.json` for CPU and `config_instances_GPU.json` for GPU) specify the experiment details; each includes one or more instances. 
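After Step 1's `pip install`, it can be worth confirming that the pinned packages actually landed before launching an experiment. The sketch below is an illustration, not part of the patch: it uses only the standard library's `importlib.metadata`, and the package names are those pinned in `requirement_GPU.txt`.

```python
# Sketch: report installed versions of the packages pinned for the GPU setup.
# Package names follow requirement_GPU.txt; None means "not installed".
from importlib import metadata


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    out = {}
    for name in names:
        try:
            out[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            out[name] = None  # missing: the pip install step likely failed
    return out


if __name__ == "__main__":
    pins = ["flwr", "flwr-datasets", "tensorflow", "tensorflow-datasets",
            "scikit-learn", "numpy"]
    for pkg, ver in installed_versions(pins).items():
        print(f"{pkg}: {ver or 'MISSING'}")
```

Running this on the reserved node immediately shows which of the CPU or GPU requirement sets is in effect.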
diff --git a/requirement_GPU.txt b/requirement_GPU.txt new file mode 100644 index 0000000..bbf1a72 --- /dev/null +++ b/requirement_GPU.txt @@ -0,0 +1,8 @@ +flwr==1.13.0 +flwr-datasets==0.4.0 +expetator==0.3.25 +tensorflow[and-cuda]>=2.16.1,<2.17.0 +tensorflow-datasets==4.4.0 +tensorboard>=2.16.2,<2.17.0 +scikit-learn==1.1.3 +numpy>=1.23.0,<1.24.0 -- GitLab
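The `export` lines in step 5 of GPU_cuda.md are easy to get subtly wrong (a missing `bin` or `lib64`, or a mistyped prefix). As an illustration, not part of the patch, this Python sketch checks whether a given CUDA root appears on `PATH` and `LD_LIBRARY_PATH`; the default root is the path assumed in step 5, so adjust it to wherever your CUDA module actually lives.

```python
# Sketch: verify that a CUDA root's bin/ and lib64/ directories are on the
# environment variables set in step 5. The default path is an assumption
# taken from the guide, not a guaranteed install location.
import os


def cuda_on_path(cuda_root="/usr/local/cuda-12.2.1", env=None):
    """Return {var: bool} indicating whether each variable contains the CUDA dir."""
    env = os.environ if env is None else env
    checks = {
        "PATH": os.path.join(cuda_root, "bin"),
        "LD_LIBRARY_PATH": os.path.join(cuda_root, "lib64"),
    }
    return {
        var: wanted in env.get(var, "").split(":")
        for var, wanted in checks.items()
    }


# Example with a synthetic environment, so the result is deterministic:
fake_env = {
    "PATH": "/usr/local/cuda-12.2.1/bin:/usr/bin",
    "LD_LIBRARY_PATH": "/usr/local/cuda-12.2.1/lib64",
}
print(cuda_on_path(env=fake_env))  # {'PATH': True, 'LD_LIBRARY_PATH': True}
```

Calling `cuda_on_path()` with no arguments inspects the real environment, which is useful right after running the exports.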