Commit 3397fddf authored by Maël Madon's avatar Maël Madon

first draft of 0_prepare_workload

parent f6d73bf6
sched_input/*.json
workload/*.swf
workload/*.json
%% Cell type:markdown id:forced-resolution tags:
# Downloading and preparing the workload and platform
## Workload
We use the reconverted log `METACENTRUM-2013-3.swf` available on [Parallel Workload Archive](https://www.cs.huji.ac.il/labs/parallel/workload/l_metacentrum2/index.html).
%% Cell type:code id:f66eb756 tags:
``` python
# Downloading and unzipping the workload
!curl -o workload/METACENTRUM-2013-3.swf.gz http://www.cs.huji.ac.il/labs/parallel/workload/l_metacentrum2/METACENTRUM-2013-3.swf.gz
!gunzip workload/METACENTRUM-2013-3.swf.gz
```
%% Cell type:markdown id:graphic-rabbit tags:
As mentioned in the [original paper releasing the log](https://www.cs.huji.ac.il/~feit/parsched/jsspp15/p5-klusacek.pdf), the platform is **very heterogeneous**. For the purpose of our study, we perform the following selection:

- we remove from the workload all the clusters whose nodes have **more than 16 cores**
- we remove from the workload the jobs with an **execution time greater than one day**
- we remove from the workload the jobs with a **number of requested cores greater than 16**

To do so, we use a home-made SWF parser.
%% Cell type:code id:6ec15ee8 tags:
``` python
! ./0_prepare_workload/swf_moulinette.py workload/METACENTRUM-2013-3.swf -o workload/MC_selection_article.swf \
--keep_only="nb_res <= 16 and run_time <= 24*3600" \
--partitions_to_select 1 2 3 5 7 8 9 10 11 12 14 15 18 19 20 21 22 23 25 26 31
```
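The `--keep_only` string is evaluated as a Python expression against the variables parsed from each SWF line (here `nb_res` and `run_time`; the values below are made up for illustration):

``` python
# Hypothetical values parsed from one SWF job line
nb_res = 16        # requested cores
run_time = 7200    # seconds

# The filter expression passed via --keep_only
keep_only = "nb_res <= 16 and run_time <= 24*3600"

# eval() sees the local variables, so the string acts as a per-job predicate
print(eval(keep_only))  # True
```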
%% Cell type:markdown id:afde35e8 tags:
## Platform
According to the system specifications given in the [corresponding page in the Parallel Workload Archive](https://www.cs.huji.ac.il/labs/parallel/workload/l_metacentrum2/index.html), if we exclude nodes with more than 16 cores, there are $\#cores_{total} = 6416$ cores on May 1st, 2014 (1).
We build a platform file adapted to the remaining workload. We choose to make it homogeneous with 16-core nodes. To obtain a consistent number of nodes, we compute:
$\#nodes = \frac{\#cores_{total} \times \%kept_{core.hour}}{\#corePerNode} = \frac{6416 \times 0.294}{16} \approx 118$
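As a quick sanity check of this arithmetic (a standalone sketch; 0.294 is the share of core-hours kept by the selection above):

``` python
cores_total = 6416      # cores in clusters with <= 16-core nodes (May 1st, 2014)
kept_fraction = 0.294   # share of core-hours kept by the workload selection
cores_per_node = 16

# Number of homogeneous 16-core nodes matching the kept workload
nodes = cores_total * kept_fraction / cores_per_node
print(round(nodes))  # 118
```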
In SimGrid platform language, this corresponds to such a cluster:
```xml
<cluster id="cluster_MC" prefix="MC_" suffix="" radical="0-117" core="16">
```
The corresponding SimGrid platform file can be found in `platform/average_metacentrum.xml`.
(1) clusters decommissioned before or commissioned after May 1st, 2014 have also been removed: $8+480+160+1792+256+576+88+416+108+168+752+112+588+152+160+160+192+24+224 = 6416$
#!/usr/bin/env python3
"""SWF types and functions."""
from enum import Enum, unique


@unique
class SwfField(Enum):
    """Maps SWF columns and their meaning."""
    JOB_ID = 1
    SUBMIT_TIME = 2
    WAIT_TIME = 3
    RUN_TIME = 4
    ALLOCATED_PROCESSOR_COUNT = 5
    AVERAGE_CPU_TIME_USED = 6
    USED_MEMORY = 7
    REQUESTED_NUMBER_OF_PROCESSORS = 8
    REQUESTED_TIME = 9
    REQUESTED_MEMORY = 10
    STATUS = 11
    USER_ID = 12
    GROUP_ID = 13
    APPLICATION_ID = 14
    QUEUE_ID = 15
    PARTITION_ID = 16
    PRECEDING_JOB_ID = 17
    THINK_TIME_FROM_PRECEDING_JOB = 18
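The field numbering can be exercised against a synthetic SWF record (a standalone sketch, not part of the repository; `SwfField` is redefined minimally here so the snippet runs on its own, and the job values are made up):

``` python
import re
from enum import Enum, unique

@unique
class SwfField(Enum):  # minimal copy of the mapping above
    JOB_ID = 1
    SUBMIT_TIME = 2
    RUN_TIME = 4
    REQUESTED_NUMBER_OF_PROCESSORS = 8
    PARTITION_ID = 16

# One synthetic 18-field SWF record (whitespace-separated numbers, -1 = unused)
line = " ".join(str(v) for v in
                [1, 0, 5, 3600, 16, -1, -1, 16, 7200, -1, 1, 42, -1, -1, 1, 3, -1, -1])

# Same regex as in swf_moulinette.py: 18 capture groups, one per field
element = r'([-+]?\d+(?:\.\d+)?)'
r = re.compile(r'\s*' + (element + r'\s+') * 17 + element + r'\s*')
m = r.match(line)
print(int(float(m.group(SwfField.RUN_TIME.value))))      # 3600
print(int(float(m.group(SwfField.PARTITION_ID.value))))  # 3
```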
#!/usr/bin/env python3
"""
SWF parser to make selections and obtain some stats.

Inspired by https://gitlab.inria.fr/batsim/batsim/-/blob/master/tools/swf_to_batsim_workload_compute_only.py
"""
import argparse
import json
import re

from swf import SwfField


def generate_workload(input_swf, output_swf=None,
                      partitions_to_select=None,
                      job_walltime_factor=2,
                      given_walltime_only=False,
                      job_grain=1,
                      platform_size=None,
                      indent=None,
                      translate_submit_times=False,
                      keep_only=None,
                      verbose=False,
                      quiet=False,
                      job_size_function_string='1*nb_res'):
    """Makes a selection from a SWF trace, optionally outputting it as SWF."""
    # One SWF record is 18 whitespace-separated numeric fields
    element = r'([-+]?\d+(?:\.\d+)?)'
    r = re.compile(r'\s*' + (element + r'\s+') * 17 + element + r'\s*')

    current_id = 0

    # Some counters...
    not_selected = {"nb": 0, "coreh": 0}
    selected = {"nb": 0, "coreh": 0}
    not_valid = 0
    not_line_match_format = 0
    users = []

    minimum_observed_submit_time = float('inf')

    # Let's loop over the lines of the input file
    i = 0
    for line in input_swf:
        i += 1
        if i % 100000 == 0:
            print("Processing swf line", i)

        res = r.match(line)

        if res:
            # Parsing...
            job_id = int(float(res.group(SwfField.JOB_ID.value)))
            nb_res = int(
                float(res.group(SwfField.REQUESTED_NUMBER_OF_PROCESSORS.value)))
            run_time = float(res.group(SwfField.RUN_TIME.value))
            submit_time = max(0, float(res.group(SwfField.SUBMIT_TIME.value)))
            walltime = max(job_walltime_factor * run_time,
                           float(res.group(SwfField.REQUESTED_TIME.value)))
            user_id = str(res.group(SwfField.USER_ID.value))
            partition_id = int(res.group(SwfField.PARTITION_ID.value))

            # nb_res may be changed by evaluating a user-given expression
            nb_res = eval(job_size_function_string)

            if given_walltime_only:
                walltime = float(res.group(SwfField.REQUESTED_TIME.value))

            # Select jobs to keep
            is_valid_job = (nb_res > 0 and walltime > run_time
                            and run_time > 0 and submit_time >= 0)
            select_partition = ((partitions_to_select is None) or
                                (partition_id in partitions_to_select))
            use_job = select_partition and (
                (keep_only is None) or eval(keep_only))

            if not is_valid_job:
                not_valid += 1

            if not use_job:
                not_selected["nb"] += 1
                not_selected["coreh"] += run_time * nb_res
            else:
                # Increment counters
                selected["nb"] += 1
                selected["coreh"] += run_time * nb_res
                if user_id not in users:
                    users.append(user_id)

                # Output in the swf
                if output_swf is not None:
                    output_swf.write(line)
        else:
            not_line_match_format += 1

    print('-------------------\nEnd parsing')
    print('Total {} jobs and {} users have been created.'.format(
        selected["nb"], len(users)))
    print('Total number of core-hours: {:.0f}'.format(selected["coreh"] / 3600))
    print('{} valid jobs were not selected (keep_only) for {:.0f} core-hours'.format(
        not_selected["nb"], not_selected["coreh"] / 3600))
    print("Jobs not selected: {:.1f}% in number, {:.1f}% in core-hours"
          .format(not_selected["nb"] / (not_selected["nb"] + selected["nb"]) * 100,
                  not_selected["coreh"] / (selected["coreh"] + not_selected["coreh"]) * 100))
    print('{} out of {} lines in the file did not match the swf format'.format(
        not_line_match_format, i))
    print('{} jobs were not valid'.format(not_valid))
def main():
    """
    Program entry point.

    Parses the input arguments then calls generate_workload.
    """
    parser = argparse.ArgumentParser(
        description='Reads a SWF (Standard Workload Format) file and outputs some stats')
    parser.add_argument('input_swf', type=argparse.FileType('r'),
                        help='The input SWF file')
    parser.add_argument('-o', '--output_swf',
                        type=argparse.FileType('w'), default=None,
                        help='The optional output SWF file')
    parser.add_argument('-sp', '--partitions_to_select',
                        type=int, nargs='+', default=None,
                        help='List of partitions to only consider in the input trace. '
                             'The jobs running in the other partitions will be discarded.')
    parser.add_argument('-jsf', '--job-size-function',
                        type=str,
                        default='1*nb_res',
                        help='The function to apply on the job sizes. '
                             'The identity is used by default.')
    parser.add_argument('-jwf', '--job_walltime_factor',
                        type=float, default=2,
                        help='Job walltimes are computed by the formula '
                             'max(givenWalltime, jobWalltimeFactor*givenRuntime)')
    parser.add_argument('-gwo', '--given_walltime_only',
                        action="store_true",
                        help='If set, only the walltime given in the trace will be used')
    parser.add_argument('-jg', '--job_grain',
                        type=int, default=1,
                        help='Selects the level of detail we want for jobs. '
                             'This parameter is used to group jobs that have close running times')
    parser.add_argument('-pf', '--platform_size', type=int, default=None,
                        help='If set, the number of machines to put in the output JSON files '
                             'is set by this parameter instead of taking the maximum job size')
    parser.add_argument('-i', '--indent', type=int, default=None,
                        help='If set to a non-negative integer, JSON array elements and object '
                             'members will be pretty-printed with that indent level. An indent '
                             'level of 0, or negative, will only insert newlines. The default '
                             'value (None) selects the most compact representation.')
    parser.add_argument('-t', '--translate_submit_times',
                        action="store_true",
                        help="If set, the jobs' submit times will be translated towards 0")
    parser.add_argument('--keep_only',
                        type=str,
                        default=None,
                        help='If set, this parameter is evaluated as a Python expression '
                             'to choose which jobs should be kept')
    group = parser.add_mutually_exclusive_group()
    group.add_argument("-v", "--verbose", action="store_true")
    group.add_argument("-q", "--quiet", action="store_true")
    args = parser.parse_args()

    generate_workload(input_swf=args.input_swf,
                      output_swf=args.output_swf,
                      partitions_to_select=args.partitions_to_select,
                      job_walltime_factor=args.job_walltime_factor,
                      given_walltime_only=args.given_walltime_only,
                      job_grain=args.job_grain,
                      platform_size=args.platform_size,
                      indent=args.indent,
                      translate_submit_times=args.translate_submit_times,
                      keep_only=args.keep_only,
                      verbose=args.verbose,
                      quiet=args.quiet,
                      job_size_function_string=args.job_size_function)


if __name__ == "__main__":
    main()
@@ -9,12 +9,16 @@ Clone the repository
git clone https://gitlab.irit.fr/sepia-pub/open-science/demand-response-user.git
```
For reproducibility, all the dependencies for these experiments and their versions are managed with the `nix` package manager. To install `nix`:
```bash
curl -L https://nixos.org/nix/install | sh
```
The main software used (and configured in the file `default.nix`) is:
- [batsim](https://batsim.org/) for the infrastructure simulation
- [batmen](https://gitlab.irit.fr/sepia-pub/mael/batmen): our set of schedulers for batsim and a plugin to simulate users
- python3, ... TODO for the data analysis

## Start the simulation environment
TODO
<?xml version='1.0'?>
<!DOCTYPE platform SYSTEM "http://simgrid.gforge.inria.fr/simgrid/simgrid.dtd">
<platform version="4.1">
<zone id="AS0" routing="Full">
<cluster id="cluster_MC" prefix="MC_" suffix="" radical="0-117"
core="16" bw="10Gbps" lat="0.0"
speed="11.77Gf, 1e-9Mf, 0.166666666666667f, 0.006666666666667f">
<prop id="wattage_per_state" value="100:100:217, 9.75:9.75:9.75, 100:100:100, 125:125:125" />
<prop id="wattage_off" value="10" />
<prop id="sleep_pstates" value="1:2:3" />
</cluster>
<cluster id="cluster_master" prefix="master_host" suffix="" radical="0-0"
bw="125MBps" lat="0.0" speed="100.0Mf">
<prop id="wattage_per_state" value="100:100:200" />
<prop id="wattage_off" value="10" />
<prop id="role" value="master" />
</cluster>
<link id="backbone" bandwidth="10Gbps" latency="0.0" />
<zoneRoute src="cluster_MC" dst="cluster_master" gw_src="MC_cluster_MC_router" gw_dst="master_hostcluster_master_router">
<link_ctn id="backbone" />
</zoneRoute>
</zone>
</platform>