From 3397fddf018b36e1676084f534d4abc1eafdac53 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Ma=C3=ABl=20Madon?= <mael.madon@irit.fr>
Date: Wed, 26 Jan 2022 09:26:51 +0100
Subject: [PATCH] first draft of 0_prepare_workload

---
 .gitignore                           |   3 +
 0_prepare_workload.ipynb             |  94 ++++++++++++++
 0_prepare_workload/swf.py            |  29 +++++
 0_prepare_workload/swf_moulinette.py | 180 +++++++++++++++++++++++++++
 README.md                            |   6 +-
 platform/average_metacentrum.xml     |  28 +++++
 sched_input/.gitkeep                 |   0
 workload/.gitkeep                    |   0
 8 files changed, 339 insertions(+), 1 deletion(-)
 create mode 100644 .gitignore
 create mode 100644 0_prepare_workload.ipynb
 create mode 100644 0_prepare_workload/swf.py
 create mode 100755 0_prepare_workload/swf_moulinette.py
 create mode 100644 platform/average_metacentrum.xml
 create mode 100644 sched_input/.gitkeep
 create mode 100644 workload/.gitkeep

diff --git a/.gitignore b/.gitignore
new file mode 100644
index 0000000..526d74f
--- /dev/null
+++ b/.gitignore
@@ -0,0 +1,3 @@
+sched_input/*.json
+workload/*.swf
+workload/*.json
\ No newline at end of file
diff --git a/0_prepare_workload.ipynb b/0_prepare_workload.ipynb
new file mode 100644
index 0000000..686833d
--- /dev/null
+++ b/0_prepare_workload.ipynb
@@ -0,0 +1,94 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "forced-resolution",
+   "metadata": {},
+   "source": [
+    "# Downloading and preparing the workload and platform\n",
+    "## Workload\n",
+    "We use the reconverted log `METACENTRUM-2013-3.swf` available in the [Parallel Workloads Archive](https://www.cs.huji.ac.il/labs/parallel/workload/l_metacentrum2/index.html)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f66eb756",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Downloading and unzipping the workload\n",
+    "!curl -o workload/METACENTRUM-2013-3.swf.gz http://www.cs.huji.ac.il/labs/parallel/workload/l_metacentrum2/METACENTRUM-2013-3.swf.gz\n",
+    "!gunzip workload/METACENTRUM-2013-3.swf.gz"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "graphic-rabbit",
+   "metadata": {},
+   "source": [
+    "As mentioned in the [original paper releasing the log](https://www.cs.huji.ac.il/~feit/parsched/jsspp15/p5-klusacek.pdf), the platform is **very heterogeneous**. For the purpose of our study, we perform the following selection:\n",
+    "- we remove from the workload all the clusters whose nodes have **more than 16 cores**\n",
+    "- we remove the jobs with **an execution time greater than one day**\n",
+    "- we remove the jobs **requesting more than 16 cores**\n",
+    "\n",
+    "To do so, we use a home-made SWF parser."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "6ec15ee8",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "! ./0_prepare_workload/swf_moulinette.py workload/METACENTRUM-2013-3.swf -o workload/MC_selection_article.swf \\\n",
+    "    --keep_only=\"nb_res <= 16 and run_time <= 24*3600\" \\\n",
+    "    --partitions_to_select 1 2 3 5 7 8 9 10 11 12 14 15 18 19 20 21 22 23 25 26 31"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "afde35e8",
+   "metadata": {},
+   "source": [
+    "## Platform\n",
+    "According to the system specifications given on the [corresponding page of the Parallel Workloads Archive](https://www.cs.huji.ac.il/labs/parallel/workload/l_metacentrum2/index.html), if we exclude the nodes with more than 16 cores, there were $\\#cores_{total} = 6416$ cores on May 1st, 2014.(1)\n",
+    "\n",
+    "We build a platform file adapted to the remaining workload. We choose to make it homogeneous, with 16-core nodes. To obtain a consistent number of nodes, we compute (with $\\%kept_{core.hour} = 29.4\\%$ the share of core-hours kept by the selection above):\n",
+    "\n",
+    "$\\#nodes = \\frac{\\#cores_{total} \\times \\%kept_{core.hour}}{\\#corePerNode} = \\frac{6416 \\times 0.294}{16} \\approx 118$\n",
+    "\n",
+    "In the SimGrid platform description language, this corresponds to the following cluster:\n",
+    "```xml\n",
+    "<cluster id=\"cluster_MC\" prefix=\"MC_\" suffix=\"\" radical=\"0-117\" core=\"16\">\n",
+    "```\n",
+    "\n",
+    "The corresponding SimGrid platform file can be found in `platform/average_metacentrum.xml`.\n",
+    "\n",
+    "(1) Clusters decommissioned before or commissioned after May 1st, 2014 have also been removed: $8+480+160+1792+256+576+88+416+108+168+752+112+588+152+160+160+192+24+224 = 6416$"
+   ]
+  },
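+  {
+   "cell_type": "markdown",
+   "id": "node-count-check-md",
+   "metadata": {},
+   "source": [
+    "A quick sanity check of the node count above (illustrative cell only; it just redoes the arithmetic):"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "node-count-check",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "total_cores = 6416     # cores on May 1st, 2014 (nodes with at most 16 cores)\n",
+    "kept_fraction = 0.294  # share of core-hours kept by the selection\n",
+    "cores_per_node = 16\n",
+    "print(round(total_cores * kept_fraction / cores_per_node))  # 118 nodes"
+   ]
+  }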
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.10"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/0_prepare_workload/swf.py b/0_prepare_workload/swf.py
new file mode 100644
index 0000000..d7384be
--- /dev/null
+++ b/0_prepare_workload/swf.py
@@ -0,0 +1,29 @@
+#!/usr/bin/env python3
+
+"""SWF types and functions."""
+
+from enum import Enum, unique
+
+
+@unique
+class SwfField(Enum):
+    """Maps SWF columns to their meaning."""
+
+    JOB_ID = 1
+    SUBMIT_TIME = 2
+    WAIT_TIME = 3
+    RUN_TIME = 4
+    ALLOCATED_PROCESSOR_COUNT = 5
+    AVERAGE_CPU_TIME_USED = 6
+    USED_MEMORY = 7
+    REQUESTED_NUMBER_OF_PROCESSORS = 8
+    REQUESTED_TIME = 9
+    REQUESTED_MEMORY = 10
+    STATUS = 11
+    USER_ID = 12
+    GROUP_ID = 13
+    APPLICATION_ID = 14
+    QUEUE_ID = 15
+    PARTITION_ID = 16
+    PRECEDING_JOB_ID = 17
+    THINK_TIME_FROM_PRECEDING_JOB = 18
diff --git a/0_prepare_workload/swf_moulinette.py b/0_prepare_workload/swf_moulinette.py
new file mode 100755
index 0000000..e24a4e5
--- /dev/null
+++ b/0_prepare_workload/swf_moulinette.py
@@ -0,0 +1,180 @@
+#!/usr/bin/env python3
+
+"""
+SWF parser to make selections and obtain some stats.
+Inspired by https://gitlab.inria.fr/batsim/batsim/-/blob/master/tools/swf_to_batsim_workload_compute_only.py
+"""
+
+import argparse
+import json
+import re
+
+from swf import SwfField
+
+
+def generate_workload(input_swf, output_swf=None,
+                      partitions_to_select=None,
+                      job_walltime_factor=2,
+                      given_walltime_only=False,
+                      job_grain=1,
+                      platform_size=None,
+                      indent=None,
+                      translate_submit_times=False,
+                      keep_only=None,
+                      verbose=False,
+                      quiet=False,
+                      job_size_function_string='1*nb_res'):
+    """Makes a selection from an SWF trace, optionally outputting it as SWF."""
+    element = r'([-+]?\d+(?:\.\d+)?)'
+    r = re.compile(r'\s*' + (element + r'\s+') * 17 + element + r'\s*')
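+    # A data line in SWF has 18 whitespace-separated numeric fields (see
+    # SwfField in swf.py); header comments starting with ';' do not match
+    # this regex and are counted in not_line_match_format below.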
+
+    current_id = 0
+
+    # Some counters...
+    not_selected = {"nb": 0, "coreh": 0}
+    selected = {"nb": 0, "coreh": 0}
+    not_valid = 0
+    not_line_match_format = 0
+    users = []
+
+    minimum_observed_submit_time = float('inf')
+
+    # Let's loop over the lines of the input file
+    i = 0
+    for line in input_swf:
+        i += 1
+        if i % 100000 == 0:
+            print("Processing SWF line", i)
+
+        res = r.match(line)
+
+        if res:
+            # Parsing...
+            job_id = int(float(res.group(SwfField.JOB_ID.value)))
+            nb_res = int(
+                float(res.group(SwfField.REQUESTED_NUMBER_OF_PROCESSORS.value)))
+            run_time = float(res.group(SwfField.RUN_TIME.value))
+            submit_time = max(0, float(res.group(SwfField.SUBMIT_TIME.value)))
+            walltime = max(job_walltime_factor * run_time,
+                           float(res.group(SwfField.REQUESTED_TIME.value)))
+            user_id = str(res.group(SwfField.USER_ID.value))
+            partition_id = int(res.group(SwfField.PARTITION_ID.value))
+
+            # nb_res may be changed by evaluating the user-given expression
+            nb_res = eval(job_size_function_string)
+
+            if given_walltime_only:
+                walltime = float(res.group(SwfField.REQUESTED_TIME.value))
+
+            # Select the jobs to keep
+            is_valid_job = (nb_res > 0 and walltime > run_time and
+                            run_time > 0 and submit_time >= 0)
+            select_partition = ((partitions_to_select is None) or
+                                (partition_id in partitions_to_select))
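+            # keep_only is an arbitrary Python expression evaluated with the
+            # fields parsed above in scope (nb_res, run_time, submit_time,
+            # walltime, user_id, ...), e.g. "nb_res <= 16 and run_time <= 24*3600".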
+            use_job = select_partition and (
+                (keep_only is None) or eval(keep_only))
+
+            if not is_valid_job:
+                not_valid += 1
+            if not use_job:
+                not_selected["nb"] += 1
+                not_selected["coreh"] += run_time * nb_res
+            else:
+                # Increment counters
+                selected["nb"] += 1
+                selected["coreh"] += run_time * nb_res
+                if user_id not in users:
+                    users.append(user_id)
+
+                # Output in the swf
+                if output_swf is not None:
+                    output_swf.write(line)
+
+        else:
+            not_line_match_format += 1
+
+    print('-------------------\nEnd parsing')
+    print('Total: {} jobs and {} users have been selected.'.format(
+        selected["nb"], len(users)))
+    print(
+        'Total number of core-hours: {:.0f}'.format(selected["coreh"] / 3600))
+    print('{} valid jobs were not selected (keep_only) for {:.0f} core-hours'.format(
+        not_selected["nb"], not_selected["coreh"] / 3600))
+    print("Jobs not selected: {:.1f}% in number, {:.1f}% in core-hours"
+          .format(not_selected["nb"] / (not_selected["nb"]+selected["nb"]) * 100,
+                  not_selected["coreh"] / (selected["coreh"]+not_selected["coreh"]) * 100))
+    print('{} out of {} lines in the file did not match the SWF format'.format(
+        not_line_match_format, i))
+    print('{} jobs were not valid'.format(not_valid))
+
+
+def main():
+    """
+    Program entry point.
+
+    Parses the input arguments then calls generate_workload.
+    """
+    parser = argparse.ArgumentParser(
+        description='Reads an SWF (Standard Workload Format) file and outputs some stats')
+    parser.add_argument('input_swf', type=argparse.FileType('r'),
+                        help='The input SWF file')
+    parser.add_argument('-o', '--output_swf',
+                        type=argparse.FileType('w'), default=None,
+                        help='The optional output SWF file')
+    parser.add_argument('-sp', '--partitions_to_select',
+                        type=int, nargs='+', default=None,
+                        help='List of partitions to keep in the input trace. '
+                             'Jobs running in the other partitions will be discarded.')
+
+    parser.add_argument('-jsf', '--job-size-function',
+                        type=str,
+                        default='1*nb_res',
+                        help='The function to apply to the job sizes. '
+                             'The identity is used by default.')
+    parser.add_argument('-jwf', '--job_walltime_factor',
+                        type=float, default=2,
+                        help='Job walltimes are computed by the formula '
+                             'max(givenWalltime, jobWalltimeFactor*givenRuntime)')
+    parser.add_argument('-gwo', '--given_walltime_only',
+                        action="store_true",
+                        help='If set, only the walltime given in the trace will be used')
+    parser.add_argument('-jg', '--job_grain',
+                        type=int, default=1,
+                        help='Selects the level of detail we want for jobs. '
+                             'This parameter is used to group jobs that have '
+                             'close running times.')
+    parser.add_argument('-pf', '--platform_size', type=int, default=None,
+                        help='If set, the number of machines to put in the output JSON '
+                             'files is set by this parameter instead of taking the '
+                             'maximum job size')
+    parser.add_argument('-i', '--indent', type=int, default=None,
+                        help='If set to a non-negative integer, JSON array elements and '
+                             'object members will be pretty-printed with that indent '
+                             'level. An indent level of 0, or negative, will only insert '
+                             'newlines. The default value (None) selects the most '
+                             'compact representation.')
+    parser.add_argument('-t', '--translate_submit_times',
+                        action="store_true",
+                        help="If set, the jobs' submit times will be translated towards 0")
+    parser.add_argument('--keep_only',
+                        type=str,
+                        default=None,
+                        help='If set, this parameter is evaluated to choose which jobs '
+                             'should be kept')
+
+    group = parser.add_mutually_exclusive_group()
+    group.add_argument("-v", "--verbose", action="store_true")
+    group.add_argument("-q", "--quiet", action="store_true")
+
+    args = parser.parse_args()
+
+    generate_workload(input_swf=args.input_swf,
+                      output_swf=args.output_swf,
+                      partitions_to_select=args.partitions_to_select,
+                      job_walltime_factor=args.job_walltime_factor,
+                      given_walltime_only=args.given_walltime_only,
+                      job_grain=args.job_grain,
+                      platform_size=args.platform_size,
+                      indent=args.indent,
+                      translate_submit_times=args.translate_submit_times,
+                      keep_only=args.keep_only,
+                      verbose=args.verbose,
+                      quiet=args.quiet,
+                      job_size_function_string=args.job_size_function)
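+
+
+# Example invocation (the one used in 0_prepare_workload.ipynb):
+#   ./0_prepare_workload/swf_moulinette.py workload/METACENTRUM-2013-3.swf \
+#       -o workload/MC_selection_article.swf \
+#       --keep_only="nb_res <= 16 and run_time <= 24*3600"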
+
+
+if __name__ == "__main__":
+    main()
diff --git a/README.md b/README.md
index d601053..65be6a1 100644
--- a/README.md
+++ b/README.md
@@ -9,12 +9,16 @@ Clone the repository
 ```
 git clone https://gitlab.irit.fr/sepia-pub/open-science/demand-response-user.git
 ```
 
-For reproducible experiments, all the dependancies for these experiments and their version (batsim, batshed, python3 with packages, ...) are managed with `nix` package manager. To install `nix`:
+For reproducibility, all the dependencies for these experiments and their versions are managed with the `nix` package manager. To install `nix`:
 
 ```bash
 curl -L https://nixos.org/nix/install | sh
 ```
+The main software used (and configured in the file `default.nix`) includes:
+- [batsim](https://batsim.org/) for the infrastructure simulation
+- [batmen](https://gitlab.irit.fr/sepia-pub/mael/batmen): our set of schedulers for batsim and a plugin to simulate users
+- python3 with packages (TODO) for the data analysis
 
 ## Start the simulation environment
 TODO
diff --git a/platform/average_metacentrum.xml b/platform/average_metacentrum.xml
new file mode 100644
index 0000000..7e78d83
--- /dev/null
+++ b/platform/average_metacentrum.xml
@@ -0,0 +1,28 @@
+<?xml version='1.0'?>
+<!DOCTYPE platform SYSTEM "http://simgrid.gforge.inria.fr/simgrid/simgrid.dtd">
+<platform version="4.1">
+
+  <zone id="AS0" routing="Full">
+
+    <cluster id="cluster_MC" prefix="MC_" suffix="" radical="0-117"
+             core="16" bw="10Gbps" lat="0.0"
+             speed="11.77Gf, 1e-9Mf, 0.166666666666667f, 0.006666666666667f">
+      <prop id="wattage_per_state" value="100:100:217, 9.75:9.75:9.75, 100:100:100, 125:125:125" />
+      <prop id="wattage_off" value="10" />
+      <prop id="sleep_pstates" value="1:2:3" />
+    </cluster>
+
+    <cluster id="cluster_master" prefix="master_host" suffix="" radical="0-0"
+             bw="125MBps" lat="0.0" speed="100.0Mf">
+      <prop id="wattage_per_state" value="100:100:200" />
+      <prop id="wattage_off" value="10" />
+      <prop id="role" value="master" />
+    </cluster>
+
+    <link id="backbone" bandwidth="10Gbps" latency="0.0" />
+
+    <zoneRoute src="cluster_MC" dst="cluster_master" gw_src="MC_cluster_MC_router" gw_dst="master_hostcluster_master_router">
+      <link_ctn id="backbone" />
+    </zoneRoute>
+  </zone>
+</platform>
diff --git a/sched_input/.gitkeep b/sched_input/.gitkeep
new file mode 100644
index 0000000..e69de29
diff --git a/workload/.gitkeep b/workload/.gitkeep
new file mode 100644
index 0000000..e69de29
-- 
GitLab