" -D, --domains=LIST comma-separated list of accepted domains\r\n",
" --exclude-domains=LIST comma-separated list of rejected domains\r\n",
" --follow-ftp follow FTP links from HTML documents\r\n",
" --follow-tags=LIST comma-separated list of followed HTML tags\r\n",
" --ignore-tags=LIST comma-separated list of ignored HTML tags\r\n",
" -H, --span-hosts go to foreign hosts when recursive\r\n",
" -L, --relative follow relative links only\r\n",
" -I, --include-directories=LIST list of allowed directories\r\n",
" --trust-server-names use the name specified by the redirection\r\n",
" URL's last component\r\n",
" -X, --exclude-directories=LIST list of excluded directories\r\n",
" -np, --no-parent don't ascend to the parent directory\r\n",
"\r\n",
"Email bug reports, questions, discussions to <bug-wget@gnu.org>\r\n",
"and/or open issues at https://savannah.gnu.org/bugs/?func=additem&group=wget.\r\n"
]
}
],
"source": [
"!wget --help"
]
},
{
{
"cell_type": "markdown",
"cell_type": "markdown",
"id": "graphic-rabbit",
"id": "graphic-rabbit",
...
@@ -687,7 +229,7 @@
...
@@ -687,7 +229,7 @@
],
],
"metadata": {
"metadata": {
"kernelspec": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"display_name": "Python 3",
"language": "python",
"language": "python",
"name": "python3"
"name": "python3"
},
},
...
@@ -701,7 +243,7 @@
...
@@ -701,7 +243,7 @@
"name": "python",
"name": "python",
"nbconvert_exporter": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"pygments_lexer": "ipython3",
"version": "3.9.6"
"version": "3.10.4"
}
}
},
},
"nbformat": 4,
"nbformat": 4,
...
...
%% Cell type:markdown id:forced-resolution tags:

# Downloading and preparing the workload and platform

## Workload

We use the reconverted log `METACENTRUM-2013-3.swf` available on the [Parallel Workload Archive](https://www.cs.huji.ac.il/labs/parallel/workload/l_metacentrum2/index.html).
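The log first has to be fetched locally with `wget`; a minimal sketch, assuming the archive's usual gzipped file naming (the exact URL and file name are assumptions):

``` python
# Assumed file name/URL on the Parallel Workload Archive; adjust if the
# archive names the file differently.
!wget https://www.cs.huji.ac.il/labs/parallel/workload/l_metacentrum2/METACENTRUM-2013-3.swf.gz
!gunzip METACENTRUM-2013-3.swf.gz
```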
%% Cell type:markdown id:graphic-rabbit tags:

It is a two-year-long trace from MetaCentrum, the national grid of the Czech Republic. As mentioned in the [original paper releasing the log](https://www.cs.huji.ac.il/~feit/parsched/jsspp15/p5-klusacek.pdf), the platform is **very heterogeneous** and underwent major changes during the logging period. For the purpose of our study, we perform the following selection.
First:

- we remove from the workload all the clusters whose nodes have **more than 16 cores**
- we truncate the workload to keep only six months (June to November 2014) during which no major change was made to the infrastructure (no cluster with nodes of at most 16 cores was added or removed, and the scheduling system was not reconfigured)

Second:

- we remove from the workload the jobs with an **execution time greater than one day**
- we remove from the workload the jobs with a **number of requested cores greater than 16**

To do so, we use the home-made SWF parser `swf_moulinette.py`:
%% Cell type:code id:ff40dcdd tags:

``` python
# First selection:
# create a SWF with only the selected clusters and the 6 selected months.
from time import mktime, strptime

begin_trace = 1356994806  # according to the original SWF header
jun1_unix_time = mktime(strptime('Sun Jun 1 00:00:00 2014'))
nov30_unix_time = mktime(strptime('Sun Nov 30 23:59:59 2014'))

# Second selection: filter expression passed to swf_moulinette.py
# (only this fragment of the command line is shown here):
# --keep_only="nb_res <= 16 and run_time <= 24*3600"
```
%% Output

Processing swf line 100000
Processing swf line 200000
Processing swf line 300000
Processing swf line 400000
Processing swf line 500000
Processing swf line 600000
Processing swf line 700000
Processing swf line 800000
Processing swf line 900000
Processing swf line 1000000
Processing swf line 1100000
Processing swf line 1200000
Processing swf line 1300000
Processing swf line 1400000
Processing swf line 1500000
Processing swf line 1600000
-------------------
End parsing
Total 1604201 jobs and 546 users have been created.
Total number of core-hours: 4785357
44828 valid jobs were not selected (keep_only) for 13437365 core-hour
Jobs not selected: 2.7% in number, 73.7% in core-hour
0 out of 1649030 lines in the file did not match the swf format
1 jobs were not valid
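The reported shares can be re-derived from the counts above; a quick sanity check, not part of the original notebook:

``` python
# Figures taken from the parser output above.
kept_jobs, cut_jobs = 1604201, 44828
kept_ch, cut_ch = 4785357, 13437365  # core-hours

print(f"{cut_jobs / (kept_jobs + cut_jobs):.1%} in number")  # 2.7%
print(f"{cut_ch / (kept_ch + cut_ch):.1%} in core-hour")     # 73.7%
```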
%% Cell type:markdown id:afde35e8 tags:

## Platform

According to the system specifications given on the [corresponding page of the Parallel Workload Archive](https://www.cs.huji.ac.il/labs/parallel/workload/l_metacentrum2/index.html), from June 1st 2014 to November 30th 2014 there was no change in the platform for the clusters considered in our study (nodes with at most 16 cores). There is a total of **6304 cores**.(1)
We build a platform file adapted to the remaining workload. We see above that the second selection cuts 73.7% of core-hours from the original workload. We choose to model a homogeneous cluster with 16-core nodes, sized to a number of nodes coherent with the remaining workload (a sketch of one such computation is given below).
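The exact sizing rule is not spelled out here; a minimal sketch, assuming the original 6304 cores are simply scaled by the share of core-hours kept after the second selection:

``` python
total_cores = 6304       # cores in the selected clusters (see footnote (1))
cut_fraction = 0.737     # share of core-hours removed by the second selection
cores_per_node = 16

# Assumption: scale the platform by the share of core-hours that remain.
kept_cores = total_cores * (1 - cut_fraction)  # ~1658 cores
nb_nodes = round(kept_cores / cores_per_node)  # ~104 nodes
print(nb_nodes)
```

Whether this matches the node count actually used should be checked against `platform/average_metacentrum.xml`.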
The corresponding SimGrid platform file can be found in `platform/average_metacentrum.xml`.
(1) clusters decommissioned before or commissioned after the 6-month period have been removed: $8+480+160+1792+256+576+88+416+108+168+752+112+588+48+152+160+192+24+224 = 6304$