Measure Energy of Flower FL in G5K
This project provides tools to measure the energy consumption of Flower-based federated learning (FL) experiments on the Grid'5000 (G5K) testbed. It includes scripts to manage distributed nodes, run FL experiments, and monitor energy usage.
Getting Started
The repository includes an example of Flower (using TensorFlow) in the Flower_v1 directory and the source of the measurement framework in Run. This example demonstrates how to use the framework to measure energy consumption.
Installation
Clone the repository and navigate to the eflwr directory:
git clone https://gitlab.irit.fr/sepia-pub/delight/eflwr.git
cd eflwr
This framework requires:
- Python 3.9.2 or higher.
- Additional dependencies listed in requirements.txt. Install them with:
pip install -r requirements.txt
Note: requirements.txt includes tensorflow, tensorflow-datasets, scikit-learn, and numpy, which are used by the provided Flower example.
Navigate to the Run directory:
cd Run
Usage
FL Framework
FL scripts (including server and client scripts) can be modified as needed, for example in the Flower_v1 directory.
Important Notes on Flower Client Configuration
When deploying Flower on distributed servers with this framework, the client script must not require manual input of the server's IP:PORT. The framework is designed to handle this automatically.
Key points:
- The Flower client must be started with the correct server_address, which the framework configures automatically.
- Users should not manually enter the IP:PORT in the configuration file, as the framework already passes this information automatically.
- The client script must be structured to accept the server address as an argument, ensuring compatibility with the framework.
Example:
# Client usage: python3 client.py <other_args> <IP:PORT>
fl.client.start_client(server_address=sys.argv[<last_position>])
Consequently, the configuration file does not include the server's IP:PORT in the client command:
"clients": [
  {
    "name": "client1",
    "command": "python3",
    "args": [
      "./Flower_v1/client_1.py",
      "cifar10",
      "0",
      "3"
    ],
    "ip": "172.16.66.77"
  }
]
By following this structure, the deployment will function as expected without requiring manual intervention for the server address configuration.
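As a hypothetical sketch (the parse_client_args helper is not part of the repo), the argument handling described above can look like this in a client script:

```python
import sys

def parse_client_args(argv):
    # The framework appends the server IP:PORT as the last argument;
    # everything before it belongs to the client script itself
    # (e.g. dataset name, client id, total number of clients).
    *other_args, server_address = argv[1:]
    return other_args, server_address

# In a real client script you would then do something like:
#   other_args, server_address = parse_client_args(sys.argv)
#   fl.client.start_client(server_address=server_address, client=...)
```

Because the address is always taken from the last position, the same client script works no matter which node the framework assigns as the server.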
Configure instance for CPU
Configure the experiment instances in JSON format; the structure is shown below.
- instances: contains "1", "2", ..., the identifiers of each instance.
- instance: name of the instance.
- output_dir: location where the output files are stored (experiment log and energy monitoring output).
- dvfs_cpu: choose exactly one of the 3 settings.
  - dummy: test at the min and max CPU frequencies (false or true).
  - baseline: test at the max CPU frequency only (false or true).
  - frequencies: limit testing to the provided list of frequencies (null or an int list []).
  Remark: check the available frequencies before using the frequencies option.
- Set the permissions and disable Turbo Boost first:
bash "$(python3 -c "import expetator, os; print(os.path.join(os.path.dirname(expetator.__file__), 'leverages', 'dvfs_pct.sh'))")" init
- Run this command to get the 4 available frequencies (min, max, and 2 in the middle):
python3 get_freq.py
- Update the extracted frequency values in the configuration files.
- Structure of the JSON config (remark: the "modules": ["logger"] entry is supported by the Flower_v1 example provided in this repo to easily log FL performance values; for your own FL framework leave it blank: "modules": []).
{
  "instances": {
    "1": {
      "instance": "",
      "output_dir": "",
      "dvfs_cpu": { "dummy": true, "baseline": false, "frequencies": null },
      "server": {
        "command": "python3",
        "args": [],
        "ip": "",
        "modules": ["logger"],
        "port": 8080
      },
      "clients": [
        { "name": "client1", "command": "python3", "args": [], "ip": "" },
        {...},
        {...}
      ]
    },
    "2": { "instance": "", ... }
  }
}
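The "choose only one setting" rule for dvfs_cpu can be sanity-checked before launching. This is an illustrative helper (not part of the repo), assuming the three keys shown in the structure above:

```python
def active_dvfs_settings(dvfs):
    """Return the names of the enabled dvfs settings (exactly one should be active)."""
    # dummy/baseline are booleans; frequencies is null or a list.
    active = [name for name in ("dummy", "baseline") if dvfs.get(name)]
    if dvfs.get("frequencies") is not None:
        active.append("frequencies")
    return active

# Example: the structure above enables only "dummy".
dvfs = {"dummy": True, "baseline": False, "frequencies": None}
assert active_dvfs_settings(dvfs) == ["dummy"]
```

If the returned list has more or fewer than one entry, the instance configuration is ambiguous and should be fixed before running.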
Configure instance for GPU (*in development)
- The configuration is the same as for CPU, except for the dvfs entry: in the GPU config it is dvfs_gpu. Choose exactly one of the 3 settings:
  - dummy: test at the min and max GPU frequencies (false or true).
  - baseline: test at the max GPU frequency only (false or true).
  - frequencies: test all frequencies in a range, controlled by 3 parameters. To disable this setting, set zoomfrom and zoomto to the same value.
    - steps: step size to jump within the range/window of frequencies (int).
    - zoomfrom: start frequency.
    - zoomto: stop frequency.
Example:
- list of available frequencies: [1, 1.1, 1.2, 1.9, 2.5, 2.7] GHz
- with zoomfrom = 1.1, zoomto = 2.7, and steps = 2
- the list of tested frequencies is [1.1, 1.9, 2.7]
Remark: check the available frequencies before using the frequencies option. Run the command below:
nvidia-smi -i 0 --query-supported-clocks=gr --format=csv,noheader,nounits | tr '\n' ' '
"dvfs_gpu": { "dummy": true, "baseline": false, "steps": 2, "zoomfrom": 0, "zoomto": 0 },
Run exp
There are two ways to run an experiment: run a single instance, or run all instances (a campaign).
Run single instance:
python3 measure_instance.py -c [config_file] -i [instance] -x [experiment_name] -r [repetitions]
- [config_file]: the instances configuration file.
- [instance]: the identifier of the single instance.
- [experiment_name]: the name used to identify your experiment.
- [repetitions]: the number of repetitions for the experiment.
Run campaign:
python3 measure_campaign.py -x [experiment_name] -c [config_file] -r [repetitions]
For a campaign run, all instances defined in [config_file] will be used.
Quickstart
Step 1. Reserve the Hosts in G5K
Reserve the required number of hosts (see the G5K documentation for more details).
For example:
Reserve 4 hosts (CPU) (1 server + 3 clients) for 2 hours:
oarsub -I -l host=4,walltime=2
Reserve 4 hosts (GPU) (1 server + 3 clients) for 2 hours:
oarsub -I -t exotic -p "gpu_count>0" -l {"cluster='drac'"}/host=4 # grenoble
oarsub -I -p "gpu_count>0" -l {"cluster='chifflot'"}/host=4 # lille
Remark: for now only 2 clusters, chifflot in Lille and drac in Grenoble, are available for testing with more than 3 GPU nodes; the maximum is 8 nodes (chifflot) or 12 nodes (drac).
Make sure you are in eflwr/Run/:
cd Run
Step 2. Configure
Two JSON configuration files (e.g. config_instances_CPU.json for CPU and config_instances_GPU.json for GPU) specify the experiment details, including one or more instances.
cat config_instances_CPU.json
For example, config_instances_CPU.json provides two examples of instance configuration:
- instance "1": fedAvg, cifar10, dvfs with min and max CPU freq, 1 round.
- instance "2": fedAvg2Clients, cifar10, dvfs with min and max CPU freq, 1 round.
Step 3. Collect IP
Run the following command to collect/generate a node list:
uniq $OAR_NODEFILE > nodelist
Automatically populate missing IP addresses in the JSON file:
python3 collect_ip.py -n nodelist -c config_instances_CPU.json
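The idea behind this step can be sketched as follows. The fill_ips helper below is hypothetical and only illustrates the assumed behaviour (assigning addresses in order, server first, then clients); the real collect_ip.py additionally resolves the hostnames listed in nodelist:

```python
def fill_ips(config, addresses):
    # Assign addresses in order: the server first, then each client,
    # restarting from the top of the address list for every instance
    # (each instance reuses the same reserved nodes).
    for inst in config["instances"].values():
        addrs = iter(addresses)
        inst["server"]["ip"] = next(addrs)
        for client in inst["clients"]:
            client["ip"] = next(addrs)
    return config

# Illustrative usage with one server and two clients:
config = {"instances": {"1": {"server": {"ip": ""},
                              "clients": [{"ip": ""}, {"ip": ""}]}}}
filled = fill_ips(config, ["172.16.66.75", "172.16.66.76", "172.16.66.77"])
```

This is why the reservation must contain at least one host per server and client defined in the configuration file.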
Step 4. Run the Campaign or Single Instance
Run a single instance with instance 1 and 2 repetitions:
python3 measure_instance.py -x SingleTest -c config_instances_CPU.json -i 1 -r 2
Run a campaign with all instances (1 and 2) and 2 repetitions:
python3 measure_campaign.py -x CampaignTest -c config_instances_CPU.json -r 2
Note: running a single instance takes about 6 minutes (1 round (80 s) * 2 repetitions * 2 frequencies = 320 s). Running a campaign (2 instances) takes about 12 minutes.
Step 5. Output
The logs and energy monitoring data will be saved in the directory specified in the JSON configuration.
Output dir structure for demo single instance: Log/Flower_SingleTest/Flower_instance_Flower_instance_fedAvg_cifar10
Log/Flower_SingleTest
├── Flower_instance_Flower_instance_fedAvg_cifar10
│ ├── Expetator
| | ├── config_instance_1.json
│ ├── Expetator_<host_info>_<timestamp>_mojitos: mojitos outputs
│ ├── Expetator_<host_info>_<timestamp>_power: wattmeter outputs
│ ├── Expetator_<host_info>_<timestamp>: measurement log
│ ├── Flwr_<timestamp>: Flower log
│ │ ├── Client_<ip>
│ │ ├── Client_<ip>
│ │ ├── Server_<ip>
│ │ ├── training_results_<instance_name>_<time>.csv
│ ├── Flwr_<timestamp>
│ │ ├── Client_<ip>
│ │ ├── Server_<ip>
...
The output dir structure for the demo campaign includes 4 folders for 4 instances:
Log/Flower_CampaignTest
├── Flower_instance_fedAvg_cifar10_epoch1
├── Flower_instance_fedAvg_cifar10_epoch2
├── Flower_instance_fedAvg2Clients_cifar10_epoch1
├── Flower_instance_fedAvg2Clients_cifar10_epoch2
...
Step 6. Clean Up
After the experiment, exit the host and kill the job if needed:
exit
oardel <job_id>
Results analysis
A first version of the results analysis can be found in this repo: https://gitlab.irit.fr/sepia-pub/delight/fedeator_results_analysis
License
This project is licensed under GPLv3.