
Measure Energy of Flower FL in G5K

This project provides tools to measure the energy consumption of Flower-based federated learning (FL) experiments on the Grid'5000 (G5K) testbed. It includes scripts to manage distributed nodes, run FL experiments, and monitor energy usage.

Table of Contents

  • Getting Started
  • Installation
  • Usage
  • Quickstart
  • License

Getting Started

The repository includes an example of Flower (using TensorFlow) in the Flower_v1 directory and the source of the measuring framework in the Run directory. The example demonstrates how to use the framework to measure energy consumption.

Installation

Clone the repository and navigate to the eflwr directory:

git clone https://gitlab.irit.fr/sepia-pub/delight/eflwr.git
cd eflwr

This framework requires:

  • Python 3.9.2 or higher.
  • Additional dependencies listed in requirements.txt. Install them with:
    pip install -r requirements.txt

Note: requirements.txt includes tensorflow, tensorflow-datasets, scikit-learn, and numpy, which are used by the provided Flower example.

Navigate to the Run directory:

cd Run

Usage

FL framework

The FL scripts (including server and client scripts) can be modified as needed; an example is provided in the Flower_v1 directory.

Configure instances for CPU

Configure the experiment instances in JSON format; the structure is shown below.

  • instances: contains entries "1", "2", ..., which are the identifiers of the instances.
  • instance: the name of the instance.
  • output_dir: the location where the output files are stored (experiment log and energy monitoring output).
  • dvfs_cpu: choose exactly one of the three settings (a validation sketch is shown after the JSON structure below).
    • dummy: test with the min and max CPU frequencies (false or true).

    • baseline: test with the max CPU frequency only (false or true).

    • frequencies: limit testing to the provided list of frequencies (null or an int list []).

      Remark: check the available frequencies before using the frequencies option.

      • Set the permissions and disable Turbo Boost first:
      bash "$(python3 -c "import expetator, os; print(os.path.join(os.path.dirname(expetator.__file__), 'leverages', 'dvfs_pct.sh'))")" init
      • Run this command to get 4 available frequencies (min, max, and 2 in the middle):
      python3 get_freq.py
      • Update the configuration files with the extracted frequency values.
  • Structure of the JSON config:
    {
        "instances": {
            "1": {
                "instance": "",
                "output_dir": "",
                "dvfs_cpu": {
                    "dummy": true,
                    "baseline": false,
                    "frequencies": null
                },
                "server": {
                    "command": "python3",
                    "args": [
                    ],
                    "ip": "",
                    "modules": ["logger"],
                    "port": 8080
                    },
                "clients": [
                {
                    "name": "client1",
                    "command": "python3",
                    "args": [
                    ],
                    "ip": ""
                },
                {...},
                {...}
                ]
            },
            "2": {
                "instance": "",
                ...
            }
        }
    }
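
To make the "choose exactly one" rule for dvfs_cpu concrete, here is a minimal, hypothetical sketch (not part of the framework) that loads a configuration file and reports any instance where more or fewer than one setting is active. It assumes the structure shown above and the config_instances_CPU.json file used later in the Quickstart:

# Minimal sketch (an assumption, not part of the framework): check that each
# instance enables exactly one dvfs_cpu setting.
import json

def active_dvfs_settings(dvfs_cpu):
    # dummy/baseline count when set to true; frequencies counts when it is a non-empty list
    return [name for name in ("dummy", "baseline", "frequencies") if dvfs_cpu.get(name)]

with open("config_instances_CPU.json") as f:
    cfg = json.load(f)

for key, inst in cfg["instances"].items():
    active = active_dvfs_settings(inst["dvfs_cpu"])
    if len(active) != 1:
        print(f"instance {key}: expected exactly one dvfs_cpu setting, got {active or 'none'}")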

Configure instances for GPU (under development)

  • The configuration is the same as for CPU, except for the DVFS key: in the GPU config, the key is dvfs_gpu.
    Choose exactly one of the three settings:

    • dummy: test with the min and max GPU frequencies (false or true).
    • baseline: test with the max GPU frequency only (false or true).
    • the last setting has 3 parameters and tests all frequencies in a range. To disable this setting, set zoomfrom and zoomto to the same value.
      • steps: step size used to jump within the range/window of frequencies (int).
      • zoomfrom: start frequency.
      • zoomto: stop frequency.
        Example (see the sketch after the dvfs_gpu snippet below):
        • available frequencies: [1, 1.1, 1.2, 1.9, 2.5, 2.7] GHz
        • with zoomfrom = 1.1, zoomto = 2.7, and steps = 2
        • the list of tested frequencies is [1.1, 1.9, 2.7]

    Remark: check the available frequencies before using this option. Run the command below:

    nvidia-smi -i 0 --query-supported-clocks=gr --format=csv,noheader,nounits | tr '\n' ' '
    "dvfs_gpu": {
                  "dummy": true,
                  "baseline": false,
                  "steps": 2,
                  "zoomfrom": 0,
                  "zoomto": 0
              },
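
To illustrate the worked example above, here is a small sketch of the window-and-step selection, assuming frequencies from zoomfrom to zoomto (inclusive) are kept and every steps-th value is taken; the framework's actual selection code may differ:

# Illustrative sketch of the zoomfrom/zoomto/steps selection (an assumption,
# not the framework's actual code).
def select_gpu_frequencies(available, zoomfrom, zoomto, steps):
    # Keep only the frequencies inside the zoom window (inclusive).
    window = [f for f in sorted(available) if zoomfrom <= f <= zoomto]
    # Take every `steps`-th frequency, starting from the lowest in the window.
    return window[::steps]

# Example from the list above; prints [1.1, 1.9, 2.7]
print(select_gpu_frequencies([1, 1.1, 1.2, 1.9, 2.5, 2.7], zoomfrom=1.1, zoomto=2.7, steps=2))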

Run exp

There are 2 ways to run an experiment: run a single instance or run all instances (a campaign).

Run single instance:

python3 measure_instance.py -c [config_file] -i [instance] -x [experiment_name] -r [repetitions]
  • [config_file]: The instances configuration file.
  • [instance]: The identifier number of the instance to run.
  • [experiment_name]: The name you use to identify your experiment.
  • [repetitions]: Number of repetitions for the experiment.

Run campaign:

python3 measure_campaign.py -x [experiment_name] -c [config_file] -r [repetitions]

When running a campaign, all instances defined in [config_file] will be used.

Quickstart

Step 1. Reserve the Hosts in G5K

Reserve the required number of hosts (see the Grid'5000 documentation for more details).
For example:

Reserve 4 hosts (CPU) (1 server + 3 clients) for 2 hours:

oarsub -I -l host=4,walltime=2

Reserve 4 hosts (GPU) (1 server + 3 clients) for 2 hours:

oarsub -I -t exotic -p "gpu_count>0" -l {"cluster='drac'"}/host=4 # grenoble

Remark: for now, only 2 clusters (chifflot in Lille and drac in Grenoble) are available for testing with more than 3 GPU nodes; the maximum is 8 nodes (chifflot) or 12 nodes (drac).

Make sure you are in eflwr/Run/:

cd Run

Step 2. Configure

Two JSON configuration files (e.g. config_instances_CPU.json for CPU and config_instances_GPU.json for GPU) specify the experiment details; each includes one or more instances.

cat config_instances_CPU.json

For example: config_instances_CPU.json provides two examples of instance configuration.

  • instance "1": fedAvg, cifar10, dvfs with min and max CPU freq, 1 round.
  • instance "2": fedAvg2Clients, cifar10, dvfs with min and max CPU freq, 1 round.

Step 3. Collect IP

Run the following command to collect/generate a node list:

uniq $OAR_NODEFILE > nodelist

Automatically populate missing IP addresses in the JSON file:

python3 collect_ip.py -n nodelist -c config_instances_CPU.json
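
Conceptually, this step resolves the hostnames listed in nodelist to IP addresses and fills them into the empty "ip" fields of the configuration. A rough, hypothetical sketch of the resolution part (the actual collect_ip.py may assign nodes to server and clients differently):

# Hypothetical sketch (not the actual collect_ip.py): resolve the hostnames
# in `nodelist` to the IP addresses that populate the "ip" fields.
import socket

with open("nodelist") as f:
    hosts = [line.strip() for line in f if line.strip()]

for host in hosts:
    print(host, socket.gethostbyname(host))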

Step 4. Run the Campaign or Single Instance

Run a single instance (instance 1) with 2 repetitions:

python3 measure_instance.py -x SingleTest -c config_instances_CPU.json -i 1 -r 2

Run a campaign with all instances (1 and 2), and 2 repetitions:

python3 measure_campaign.py -x CampaignTest -c config_instances_CPU.json -r 2

Note: Running a single instance takes about 6 minutes (1 round (80 s) * 2 repetitions * 2 frequencies = 320 s). Running a campaign (2 instances) takes about 12 minutes.

Step 5. Output

The logs and energy monitoring data will be saved in the directory specified in the JSON configuration.

Output directory structure for the single-instance demo: Log/Flower_SingleTest/Flower_instance_Flower_instance_fedAvg_cifar10

Log/Flower_SingleTest
├── Flower_instance_Flower_instance_fedAvg_cifar10
│   ├── Expetator
│   │   ├── config_instance_1.json
│   ├── Expetator_<host_info>_<timestamp>_mojitos: mojitos outputs
│   ├── Expetator_<host_info>_<timestamp>_power: wattmeter outputs
│   ├── Expetator_<host_info>_<timestamp>: measurement log
│   ├── Flwr_<timestamp>: Flower log
│   │   ├── Client_<ip>
│   │   ├── Server_<ip>
│   │   ├── training_results_<instance_name>_<time>.csv
│   ├── Flwr_<timestamp>
│   │   ├── Client_<ip>
│   │   ├── Server_<ip>
...

Output directory structure for the campaign demo, which includes 2 folders for the 2 instances:

Log/Flower_CampaignTest
├── Flower_instance_Flower_instance_fedAvg_cifar10
├── Flower_instance_Flower_instance_fedAvg2Clients_cifar10
...
Step 6. Clean Up

After the experiment, exit the host and kill the job if needed:

exit
oardel <job_id>

License

This project is licensed under GPLv3.