-
title: "Frugal prediction of the load of HPC centers"
author:
firstname: Millian
lastname: Poquet
title: Maître de conférences
mail: millian.poquet@irit.fr
date: 28 September 2022
bibliography: biblio.bib
Context {-}
High Performance Computing (HPC) centers are large-scale computational platforms composed of many nodes and cores. Many different users execute their applications on them, notably to conduct scientific simulation studies. Users do not access such platforms directly but go through a middleware called a resource manager to reserve computational resources and to execute their applications on them. Examples of such resource managers include OAR, Slurm, PBS and Flux.
Resource managers can make many decisions about the execution of applications (when to execute them, on which resources) but also about the resources themselves (the frequency the processors should run at, when to shut down or boot resources). Resource managers are therefore a key component when one wants to optimize the whole behavior of an HPC center, as tuning resource management policies can lead to significant gains.
The power consumption of HPC centers is substantial (e.g., Frontier consumes more than 20 MW) and we are interested in reducing this energy footprint. In particular, node shutdown policies are rarely implemented on HPC centers as they can be detrimental to performance if they are not well adapted to the present and future load of the center.
Objective of the internship {-}
The main objective of this internship is to develop a system that predicts the load of an HPC center. Here, the load can be defined as the amount of computation (in core×hour) in the applications that are being executed and in the applications that are queued (i.e., waiting to be executed). The quality of the developed system will be evaluated on several criteria: (1) precision of the predictions on various time horizons (10/30/60 minutes), (2) time and energy cost of the system, both in the learning phase and when asking for predictions at runtime, and (3) understandability of the method, of the trained model and of the results.
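To make the notion of load concrete, here is a minimal Python sketch of one possible reading of the definition above. It is an illustration under our own assumptions, not a specification of the expected system: the `Job` record and its fields are hypothetical, runtimes are assumed to be known exactly, and a running job is assumed to contribute only its remaining work.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass
class Job:
    """Hypothetical job record; times are in seconds since the trace start."""
    cores: int
    submit_s: float
    runtime_s: float            # assumed known here; in practice it must be estimated
    start_s: Optional[float]    # None if the job has not started yet


def load_core_hours(jobs: Iterable[Job], now_s: float) -> Tuple[float, float]:
    """Return (running_load, queued_load) in core x hour at time now_s."""
    running = 0.0
    queued = 0.0
    for job in jobs:
        if job.submit_s > now_s:
            continue  # not submitted yet: invisible to the resource manager
        if job.start_s is None or job.start_s > now_s:
            # Waiting in the queue: the whole job is still to be executed.
            queued += job.cores * job.runtime_s / 3600
        else:
            # Running (or finished): only the remaining work counts.
            remaining_s = max(0.0, job.start_s + job.runtime_s - now_s)
            running += job.cores * remaining_s / 3600
    return running, queued
```

For instance, a 4-core job halfway through a 2-hour execution contributes 4 core×hour to the running load, and a queued 2-core job expected to run for 1 hour contributes 2 core×hour to the queued load.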
We are interested in the tradeoffs enabled by various machine learning methods on this problem. In particular, methods that can associate an uncertainty value with each prediction interest us the most, as they should enable us to develop more robust node shutdown algorithms in the long run. Ideally, the system developed during the internship will be able to estimate the probability that the load lies within a given interval (e.g., between 10 and 50 core×hour) at a given time (e.g., in 10 minutes).
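As an illustration of what such a probabilistic output could look like, the sketch below implements a deliberately naive baseline, not a method we prescribe: a persistence forecast (the future load equals the current load) corrected by the load changes observed in the past over the same horizon. The function name and its parameters are hypothetical, and the load history is assumed to be sampled at regular intervals.

```python
from typing import Sequence


def interval_probability(history: Sequence[float], horizon_steps: int,
                         low: float, high: float) -> float:
    """Estimate P(low <= load <= high) horizon_steps samples after the last one.

    history holds past load values (core x hour) at a fixed sampling period.
    """
    if horizon_steps <= 0:
        raise ValueError("horizon_steps must be positive")
    if len(history) <= horizon_steps:
        raise ValueError("history is too short for this horizon")
    current = history[-1]
    # Load changes observed in the past over the same horizon.
    deltas = [history[i + horizon_steps] - history[i]
              for i in range(len(history) - horizon_steps)]
    # Each past change yields one plausible future load value.
    scenarios = [current + delta for delta in deltas]
    hits = sum(1 for value in scenarios if low <= value <= high)
    return hits / len(scenarios)
```

With one sample per 10 minutes, `interval_probability(history, 1, 10, 50)` estimates the probability that the load is between 10 and 50 core×hour in 10 minutes. Such a baseline is cheap and easy to interpret, which makes it a possible reference point for the precision, cost and understandability criteria.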
Traces from the Parallel Workloads Archive will be used as data sources for this work. The prediction of the execution time of applications has been studied in the literature [ZCLT22] [GGRT15] and could be used to design the load prediction system, as the prediction of execution times can be seen as a subproblem.
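As a starting point for working with these traces, the following Python sketch reads the fields needed to compute a load. It assumes the traces are in the Standard Workload Format (SWF) used by the archive, where comment lines start with `;`, each job is a line of whitespace-separated fields, and `-1` marks an unknown value. The function name and the returned dictionary keys are our own choices.

```python
from typing import Dict, Iterable, List


def parse_swf(lines: Iterable[str]) -> List[Dict[str, float]]:
    """Extract submit time, start time, runtime and core count from SWF lines."""
    jobs = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";"):
            continue  # skip blank lines and header comments
        fields = line.split()
        # SWF fields (1-based): 2 = submit time, 3 = wait time,
        # 4 = run time, 5 = number of allocated processors.
        submit_s = float(fields[1])
        wait_s = float(fields[2])
        run_s = float(fields[3])
        cores = int(fields[4])
        if wait_s < 0 or run_s < 0 or cores <= 0:
            continue  # incomplete record (-1 means "unknown" in SWF)
        jobs.append({
            "submit_s": submit_s,
            "start_s": submit_s + wait_s,
            "runtime_s": run_s,
            "cores": cores,
        })
    return jobs
```

A trace file can then be loaded with `parse_swf(open(path))`, where `path` points to an uncompressed SWF file. Note that this sketch silently drops incomplete records, a choice that should be revisited when studying a specific trace.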
\pagebreak
Expected skills and profile {-}
- Machine learning skills, especially on time series
- Programming skills in Python or R (C++ is a plus)
- A taste for experimental methods (a taste for chocolate is a plus)
- Fluent French or English
Practical details {-}
The internship will take place at IRIT, the largest computer science research institute in Toulouse, France. Our team SEPIA works on resource management on various distributed systems (cloud datacenters, HPC centers, edge architectures, IoT, etc.) and is especially interested in the ecological transition, notably reducing energy consumption and CO2 emissions and using renewable energy.
The internship will be supervised by Millian Poquet and Georges Da Costa in a convivial atmosphere :)
A computer and an office will be provided, as well as a monthly internship stipend of 591 €.
Internship duration is 5-6 months.
You can send us your application (cover letter + resume / short curriculum vitæ) by email to millian.poquet@irit.fr or georges.da-costa@irit.fr.
Bibliography {-}
- [ZCLT22] Salah Zrigui, Raphael Y. de Camargo, Arnaud Legrand and Denis Trystram. Improving the Performance of Batch Schedulers Using Online Job Runtime Classification. Journal of Parallel and Distributed Computing, Elsevier, 2022, 164, pp. 83-95.
- [GGRT15] Eric Gaussier, David Glesser, Valentin Reis and Denis Trystram. Improving Backfilling by using Machine Learning to predict Running Times. SuperComputing 2015, Nov 2015, Austin, TX, United States.

[ZCLT22]: https://hal.archives-ouvertes.fr/hal-03023222
[GGRT15]: https://hal.archives-ouvertes.fr/hal-01221186