
Measure Energy of Flower FL in G5K

This project provides tools to measure the energy consumption of Flower-based federated learning (FL) experiments on the Grid'5000 (G5K) testbed. It includes scripts to manage distributed nodes, run FL experiments, and monitor energy usage.

Getting Started

The repository includes a Flower example (using TensorFlow) in the Flower_v1 directory and the source of the measuring framework in the Run directory. This example demonstrates how to use the framework to measure energy consumption.

Installation

Clone the repository and navigate to the eflwr directory:

git clone https://gitlab.irit.fr/sepia-pub/delight/eflwr.git
cd eflwr

This framework requires:

  • Python 3.9.2 or higher.
  • Additional dependencies listed in requirements.txt. Install them with:
    pip install -r requirements.txt

Note: requirements.txt includes tensorflow, tensorflow-datasets, scikit-learn, and numpy, which are used by the provided Flower example.

Navigate to the Run directory:

cd Run

Usage

FL Framework

FL scripts (including server and client scripts) can be updated as needed, for example those in the Flower_v1 directory.

Important Notes on Flower Client Configuration

When using this framework for Flower deployment on distributed servers, the client script should not require manual input of the IP:PORT for the Flower server. The framework is already designed to handle this automatically.

Key Points:

  • The Flower client must be started with the correct server_address, which is automatically configured in the framework.
  • Users should not manually specify the IP:PORT in the configuration file, as the framework passes this information automatically.
  • The client script must be structured to accept the server address as an argument, ensuring compatibility with the framework.

Example:

# Client usage: python3 client.py <other_args> <IP:PORT>
# The framework appends the server's IP:PORT as the last argument.
fl.client.start_client(server_address=sys.argv[-1])
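
A fuller sketch of a compatible client is shown below. It is only an illustration, assuming a recent flwr release and numpy; DummyClient is a placeholder name, not the provided TensorFlow example.

# Hypothetical minimal client; DummyClient is illustrative only.
import sys

import flwr as fl
import numpy as np

class DummyClient(fl.client.NumPyClient):
    """Trains nothing; it only shows how the server address is received."""

    def get_parameters(self, config):
        return [np.zeros(1)]

    def fit(self, parameters, config):
        return parameters, 1, {}

    def evaluate(self, parameters, config):
        return 0.0, 1, {}

if __name__ == "__main__":
    # The framework appends the server's IP:PORT as the last CLI argument,
    # so the script reads it from sys.argv instead of hard-coding it.
    fl.client.start_client(
        server_address=sys.argv[-1],
        client=DummyClient().to_client(),
    )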

Accordingly, the configuration file does not include the server's IP:PORT in the client command:

"clients": [
           {
               "name": "client1",
               "command": "python3",
               "args": [
                   "./Flower_v1/client_1.py",
                   "cifar10",
                   "0",
                   "3"
               ],
               "ip": "172.16.66.77"
           },]

By following this structure, the deployment will function as expected without requiring manual intervention for the server address configuration.

Configure instance for CPU

Configure the experiment instances in JSON format; the structure is shown below.

  • instances contains the keys "1", "2", ..., which identify each instance.
  • instance: name of the instance.
  • output_dir: location where the output files are stored (experiment log and energy monitoring output).
  • dvfs_cpu: choose exactly one of the 3 settings (a small validation sketch follows the config structure below).
    • dummy: test only the minimum and maximum CPU frequencies (false or true).

    • baseline: test only the maximum CPU frequency (false or true).

    • frequencies: limit testing to the provided list of frequencies (null or an int list []).

      Remark: check the available frequencies before using the option frequencies.

      • Set the permissions and disable Turbo Boost first:
      bash "$(python3 -c "import expetator, os; print(os.path.join(os.path.dirname(expetator.__file__), 'leverages', 'dvfs_pct.sh'))")" init
      • Run this command to get 4 available frequencies (min, max, and 2 in the middle):
      python3 get_freq.py
      • Update the extracted frequency values in the configuration files.
  • Structure of the JSON config (remark: the "modules": ["logger"] entry is supported by the Flower_v1 example provided in this repo to easily log FL performance values; for your own FL framework, just leave it blank: "modules": []):
    {
        "instances": {
            "1": {
                "instance": "",
                "output_dir": "",
                "dvfs_cpu": {
                    "dummy": true,
                    "baseline": false,
                    "frequencies": null
                },
                "server": {
                    "command": "python3",
                    "args": [
                    ],
                    "ip": "",
                    "modules": ["logger"],
                    "port": 8080
                    },
                "clients": [
                {
                    "name": "client1",
                    "command": "python3",
                    "args": [
                    ],
                    "ip": ""
                },
                {...},
                {...}
                ]
            },
            "2": {
                "instance": "",
                ...
            }
        }
    }
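
As a rough check of this structure, the sketch below (not part of the framework) loads such a file and verifies that exactly one dvfs_cpu setting is active per instance; the file name config_instances_CPU.json is taken from the Quickstart and is an assumption here.

# Illustrative sketch only; not part of the framework.
import json

with open("config_instances_CPU.json") as f:
    config = json.load(f)

for key, inst in config["instances"].items():
    dvfs = inst["dvfs_cpu"]
    # Exactly one of dummy / baseline / frequencies should be active.
    active = [dvfs["dummy"] is True,
              dvfs["baseline"] is True,
              dvfs["frequencies"] is not None]
    status = "OK" if sum(active) == 1 else "check dvfs_cpu settings"
    print(f'instance {key} ({inst["instance"]}): {status}')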

Configure instance for GPU (under development)

  • The configuration is the same as for CPU, except for the DVFS entry: in the GPU config, the key is dvfs_gpu.
    Choose exactly one of the 3 settings:

    • dummy: test only the minimum and maximum GPU frequencies (false or true).
    • baseline: test only the maximum GPU frequency (false or true).
    • the last setting is frequencies, which takes 3 parameters and tests all frequencies in the given range. To disable this setting, set zoomfrom and zoomto to the same value.
      • steps: step size used to walk through the frequency range/window (int).
      • zoomfrom: start frequency.
      • zoomto: stop frequency.
        Example (see the sketch after this list):
        • available frequencies: [1, 1.1, 1.2, 1.9, 2.5, 2.7] GHz
        • with zoomfrom = 1.1, zoomto = 2.7 and steps = 2
        • the tested frequencies are [1.1, 1.9, 2.7]

    Remark: check the available frequencies before using the frequencies option. Run the command below:

    nvidia-smi -i 0 --query-supported-clocks=gr --format=csv,noheader,nounits | tr '\n' ' '
    "dvfs_gpu": {
                  "dummy": true,
                  "baseline": false,
                  "steps": 2,
                  "zoomfrom": 0,
                  "zoomto": 0
              },
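
The example above can be reproduced with the short sketch below. It assumes that steps is an index stride over the sorted list of available frequencies within the zoom window; this matches the example but is not taken from the framework's source.

# Illustrative only: one possible reading of steps / zoomfrom / zoomto.
def tested_frequencies(available, zoomfrom, zoomto, steps):
    """Return frequencies from zoomfrom to zoomto, striding by `steps` entries."""
    window = [f for f in sorted(available) if zoomfrom <= f <= zoomto]
    selected = window[::steps]
    if window and window[-1] not in selected:
        selected.append(window[-1])  # keep the upper bound of the window
    return selected

# Reproduces the example above: [1.1, 1.9, 2.7]
print(tested_frequencies([1, 1.1, 1.2, 1.9, 2.5, 2.7], 1.1, 2.7, 2))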

Run exp

There are 2 options for running an experiment: run a single instance, or run all instances (a campaign).

Run single instance:

python3 measure_instance.py -c [config_file] -i [instance] -x [experiment_name] -r [repetitions]
  • [config_file]: The instances configuration file.
  • [instance]: The identifier of the single instance.
  • [experiment_name]: The name you use to identify your experiment.
  • [repetitions]: Number of repetitions for the experiment.

Run campaign:

python3 measure_campaign.py -x [experiment_name] -c [config_file] -r [repetitions]

When running a campaign, all instances defined in [config_file] will be used.

Quickstart

Step 1. Reserve the Hosts in G5K

Reserve the required number of hosts (see the Grid'5000 documentation for more details).
For example:

Reserve 4 hosts (CPU) (1 server + 3 clients) for 2 hours:

oarsub -I -l host=4,walltime=2

Reserve 4 hosts (GPU) (1 server + 3 clients) for 2 hours:

oarsub -I -t exotic -p "gpu_count>0" -l {"cluster='drac'"}/host=4 # grenoble
oarsub -I -p "gpu_count>0" -l {"cluster='chifflot'"}/host=4 # lille

Remark: for now, only 2 clusters, chifflot in Lille and drac in Grenoble, are available for testing with more than 3 GPU nodes; the maximum is 8 nodes (chifflot) or 12 nodes (drac).

Make sure you are in eflwr/Run/:

cd Run

Step 2. Configure

Two JSON configuration files (e.g. config_instances_CPU.json for CPU and config_instances_GPU.json for GPU) specify the experiment details and include one or more instances.

cat config_instances_CPU.json

For example: config_instances_CPU.json provides two examples of instance configuration.

  • instance "1": fedAvg, cifar10, dvfs with min and max CPU freq, 1 round.
  • instance "2": fedAvg2Clients, cifar10, dvfs with min and max CPU freq, 1 round.

Step 3. Collect IP

Run the following command to collect/generate a node list:

uniq $OAR_NODEFILE > nodelist

Automatically populate missing IP addresses in the JSON file:

python3 collect_ip.py -n nodelist -c config_instances_CPU.json
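
Conceptually, this step resolves each hostname in nodelist to an IP address and fills the empty "ip" fields of the configuration. The sketch below is a rough, hypothetical illustration of that idea, not the actual collect_ip.py implementation; assigning the first IP to the server and the rest to the clients is an assumption.

# Rough illustration only; use collect_ip.py for real runs.
import json
import socket

with open("nodelist") as f:
    ips = [socket.gethostbyname(line.strip()) for line in f if line.strip()]

with open("config_instances_CPU.json") as f:
    config = json.load(f)

for inst in config["instances"].values():
    inst["server"]["ip"] = ips[0]  # first node acts as the server
    for client, ip in zip(inst["clients"], ips[1:]):
        client["ip"] = ip          # remaining nodes are clients

with open("config_instances_CPU.json", "w") as f:
    json.dump(config, f, indent=4)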

Step 4. Run the Campaign or Single Instance

Run a single instance (instance 1) with 2 repetitions:

python3 measure_instance.py -x SingleTest -c config_instances_CPU.json -i 1 -r 2

Run a campaign with all instances (1 and 2) and 2 repetitions:

python3 measure_campaign.py -x CampaignTest -c config_instances_CPU.json -r 2

Note: Running a single instance takes about 6 minutes (1 round (80 s) * 2 repetitions * 2 frequencies = 320 s). Running a campaign (2 instances) takes about 12 minutes.

Step 5. Output

The logs and energy monitoring data will be saved in the directory specified in the JSON configuration.

Output directory structure for the single-instance demo: Log/Flower_SingleTest/Flower_instance_Flower_instance_fedAvg_cifar10

Log/Flower_SingleTest
├── Flower_instance_Flower_instance_fedAvg_cifar10
│   ├── Expetator
│   │   ├── config_instance_1.json
│   ├── Expetator_<host_info>_<timestamp>_mojitos: mojitos outputs
│   ├── Expetator_<host_info>_<timestamp>_power: wattmeter outputs
│   ├── Expetator_<host_info>_<timestamp>: measurement log
│   ├── Flwr_<timestamp>: Flower log
│   │   ├── Client_<ip>
│   │   ├── Client_<ip>
│   │   ├── Server_<ip>
│   │   ├── training_results_<instance_name>_<time>.csv
│   ├── Flwr_<timestamp>
│   │   ├── Client_<ip>
│   │   ├── Server_<ip>
...

Output directory structure for the campaign demo, which includes 4 folders for 4 instances:

Log/Flower_CampaignTest
├── Flower_instance_fedAvg_cifar10_epoch1
├── Flower_instance_fedAvg_cifar10_epoch2
├── Flower_instance_fedAvg2Clients_cifar10_epoch1
├── Flower_instance_fedAvg2Clients_cifar10_epoch2
...

Step 6. Clean Up

After the experiment, exit the host and delete the job if needed:

exit
oardel <job_id>

Results analysis

The first version of the results analysis can be found in this repository: https://gitlab.irit.fr/sepia-pub/delight/fedeator_results_analysis

License

This project is licensed under [GPLv3].