From 2ac9c5a88b650f251ffbceba7d415a16ab5f32ad Mon Sep 17 00:00:00 2001
From: Millian Poquet <millian.poquet@irit.fr>
Date: Wed, 15 May 2024 17:16:49 +0200
Subject: [PATCH] prediction: jupyter->Rmd, update guide

---
 artifact-overview.typ                         |  49 ++-
 flake.nix                                     |  10 +-
 .../m100_process_prediction_results.ipynb     | 387 ------------------
 notebooks/prediction-results-analysis.Rmd     | 265 ++++++++++++
 4 files changed, 311 insertions(+), 400 deletions(-)
 delete mode 100644 notebooks/m100_process_prediction_results.ipynb
 create mode 100644 notebooks/prediction-results-analysis.Rmd

diff --git a/artifact-overview.typ b/artifact-overview.typ
index 388c1de..f84abd2 100644
--- a/artifact-overview.typ
+++ b/artifact-overview.typ
@@ -281,12 +281,11 @@ The step-by-step instructions of this document can be used in several ways *depe
 + You can *check* the final analyses (code + plots) done in Article @lightpredenergy by reading the provided pre-rendered notebooks available on #link(zenodo-url)[Zenodo].
 + You can *reproduce* the *final analyses* by first downloading the provided aggregated results of the experiments from #link(zenodo-url)[Zenodo], and then by running the notebooks yourself.
  This enables you to *edit* our notebooks before running them, so that you can modify the analyses done or add your own.
-  // - Refer to #todo[link to Danilo's notebook section] for the machine learning experiment.
+  - Refer to @sec-analyze-prediction-results for instructions to analyze the results of the machine learning experiment.
   - Refer to @sec-analyze-simu-campaign-results for instructions to analyze the results of the scheduling experiment.
 + You can *reproduce* our *experimental campaigns* by downloading the provided experiment input files from #link(zenodo-url)[Zenodo],
   and then by running the experiment yourself.
   This can enable you to make sure that our experiment can be reproduced with the *exact same parameters and configuration*.
-  //- Refer to #todo[link to Danilo's expe section?] for the machine learning experiment.
   - Refer to @sec-run-simu-campaign for instructions to reproduce the scheduling experiment.
 + You can *fully reproduce* our *experimental campaigns* by downloading original traces of the Marconi100,
   by generating the experimental campaigns parameters yourself (enabling you to hack provided command-line parameters or provided code),
@@ -453,7 +452,7 @@ nix develop .#py-scripts --command m100-pred-merge-jobfiles -d ./m100-data/
 ]
 
 === Compressing prediction output into single files
-The expected output data of has been stored on #link(zenodo-url)[Zenodo].
+The expected output data has been stored on #link(zenodo-url)[Zenodo].
 
 #fullbox(footer:[Disk: 82 Mo.])[
   ```sh
@@ -469,19 +468,45 @@ The expected output data of has been stored on #link(zenodo-url)[Zenodo].
   ))
 ]
 
-== Analyzing prediction results
+== Analyzing prediction results <sec-analyze-prediction-results>
+This analysis requires the two job power prediction archives (outputs of @sec-job-power-pred, available on #link(zenodo-url)[Zenodo]) to be present on your disk in the `./user-power-predictions` directory.
+The following commands populate the `./user-power-predictions/data` directory by extracting the archives and uncompressing all the required files.
 
-=== Required data
-Output from the previous section 
+#fullbox(footer: [Disk: 519 Mo. Time: 00:00:05.])[
+  ```sh
+  mkdir ./user-power-predictions/data
+  nix develop .#merge-m100-power-predictions --command \
+    tar xf ./user-power-predictions/*mean.tar.gz --directory ./user-power-predictions/data
+  nix develop .#merge-m100-power-predictions --command \
+    tar xf ./user-power-predictions/*max.tar.gz --directory ./user-power-predictions/data
+  nix develop .#merge-m100-power-predictions --command \
+    gunzip ./user-power-predictions/data/*/*.gz
+  ```
+]
 
-- `m100-data/power_pred_users_allmethods_mean.tar.gz`, the jobs mean power predictions.
-- `m100-data/power_pred_users_allmethods_max.tar.gz`, the jobs maximum power predictions.
+The analysis of the predictions, which also generates Figures 2 and 3 of Article @lightpredenergy, can be reproduced with the following command.
 
-=== Reproducing the paper's plots
+#fullbox(footer:[Time (laptop): 00:00:20.])[
+  ```sh
+  nix develop .#r-py-notebook --command \
+              Rscript notebooks/run-rmarkdown-notebook.R \
+                      notebooks/prediction-results-analysis.Rmd
+  ```
 
-Please refer to this #link("./notebooks/m100_process_prediction_results.ipynb")[Notebook] for
-the scripts to reproduce the paper's plots, notably Figures 2 and 3.
+  #filehashes((
+    "6ce534dd1bf017444f05c354a1b3b767", "notebooks/fig2a-distrib-job-power-mean.svg",
+    "0b1cdfcf017c2cba7057a544e19cd698", "notebooks/fig2b-distrib-job-power-max.svg",
+    "0bc88e65ae495a8d6ec7d3cbcfca12ae", "notebooks/fig3a-pred-mape-mean-power.svg",
+    "a19b1a7c5dc72ec73a5349d85fc68fa3", "notebooks/fig3b-pred-mape-max-power.svg",
+    "04c2d5ef412b791a4d5515ec0719b3d0", "notebooks/prediction-results-analysis.html",
+  ), fill: (x, y) => {
+    if y > 0 { red.lighten(80%) }
+  },
+  )
 
+  Despite our best efforts, we could not make the HTML notebook and the Python-generated images bit-for-bit reproducible.
+  Their content should nevertheless be fully reproducible.
+]
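+
+If you want to compare your own results against the hashes above, assuming the listed values are MD5 checksums and that `md5sum` is available on your machine, a command such as the following can be used.
+
+```sh
+md5sum notebooks/fig2a-distrib-job-power-mean.svg \
+       notebooks/fig2b-distrib-job-power-max.svg \
+       notebooks/fig3a-pred-mape-mean-power.svg \
+       notebooks/fig3b-pred-mape-max-power.svg \
+       notebooks/prediction-results-analysis.html
+```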
 
 == Job scheduling with power prediction <sec-sched>
 This section shows how to reproduce Sections 6.4 and 6.5 of article @lightpredenergy.
@@ -601,7 +626,7 @@ Required input files.
 - The `/tmp/wlds` directory (#emph-overhead[1.4 Go]) that contains all the workload files (output of @sec-gen-workloads).
 
 Please note that all input files can be downloaded from #link(zenodo-url)[Zenodo] if you have not generated them yourself.
-In particular to populate the `/tmp/wlds`directory you can *download file* `workloads.tar.xz` and then *extract it* into `/tmp/` via a command such as the following. `tar xf workloads.tar.xz --directory=/tmp/`
+In particular, to populate the `/tmp/wlds` directory you can *download file* `workloads.tar.xz` and then *extract it* into `/tmp/` via a command such as `tar xf workloads.tar.xz --directory=/tmp/`.
 
 #fullbox(footer: [#emph-overhead[Disk: 7.6 Go.] Time: 00:06:00.])[
   ```sh
diff --git a/flake.nix b/flake.nix
index bec15ed..80340f5 100644
--- a/flake.nix
+++ b/flake.nix
@@ -85,7 +85,7 @@
           });
           easypower-sched-lib = easy-powercap-pkgs.easypower;
         };
-        devShells = {
+        devShells = rec {
           download-m100-months = pkgs.mkShell {
             buildInputs = [
               packages.python-scripts
@@ -127,6 +127,14 @@
               pkgs.rPackages.viridis
             ];
           };
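+          # Shell for rendering the Rmd notebook: the R notebook shell plus
+          # reticulate and the Python libraries imported by the analysis.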
+          r-py-notebook = pkgs.mkShell {
+            buildInputs = r-notebook.buildInputs ++ [
+              pkgs.rPackages.reticulate
+              pyPkgs.pandas
+              pyPkgs.seaborn
+              pyPkgs.scikit-learn
+            ];
+          };
           typst-shell = pkgs.mkShell {
             buildInputs = [
               typstpkgs.typst-dev
diff --git a/notebooks/m100_process_prediction_results.ipynb b/notebooks/m100_process_prediction_results.ipynb
deleted file mode 100644
index 6540dc6..0000000
--- a/notebooks/m100_process_prediction_results.ipynb
+++ /dev/null
@@ -1,387 +0,0 @@
-{
- "cells": [
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## Processing the mean power prediction results (script `run_prediction_per_user_allmethods_mean.py`)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "import pandas as pd\n",
-    "import seaborn as sns\n",
-    "\n",
-    "import os\n",
-    "\n",
-    "RESULTS_PATH = \"../m100-data/total_power_mean_predictions_users_allmethods_mean/\"\n",
-    "PRED_COLS = [\"hist_pred_total_power_mean\",\n",
-    "            \"LinearRegression_total_power_mean_watts\",\n",
-    "            \"RandomForestRegressor_total_power_mean_watts\", \n",
-    "            \"LinearSVR_total_power_mean_watts\", \n",
-    "            \"SGDRegressor_total_power_mean_watts\"]\n",
-    "\n",
-    "\n",
-    "result_filenames = os.listdir(RESULTS_PATH)\n",
-    "\n",
-    "df_all_results = pd.concat([pd.read_csv(RESULTS_PATH+filename, low_memory=False) for filename in result_filenames])\n",
-    "\n",
-    "df_all_results = df_all_results.dropna(subset=PRED_COLS)\n",
-    "df_all_results\n"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "from sklearn.metrics import mean_absolute_error, mean_squared_error, mean_absolute_percentage_error\n",
-    "\n",
-    "lst_users = df_all_results[\"user_id\"].drop_duplicates().to_list()\n",
-    "#print(lst_users)\n",
-    "\n",
-    "df_results_user_group = df_all_results.groupby(\"user_id\")\n",
-    "\n",
-    "lst_stats_per_user = []\n",
-    "\n",
-    "for user in lst_users:\n",
-    "    results_user = df_results_user_group.get_group(user)\n",
-    "    hist_mape = mean_absolute_percentage_error(results_user[\"total_power_mean_watts\"], results_user[\"hist_pred_total_power_mean\"])\n",
-    "    LR_mape = mean_absolute_percentage_error(results_user[\"total_power_mean_watts\"], results_user[\"LinearRegression_total_power_mean_watts\"])\n",
-    "    RF_mape = mean_absolute_percentage_error(results_user[\"total_power_mean_watts\"], results_user[\"RandomForestRegressor_total_power_mean_watts\"])\n",
-    "    LSVR_mape = mean_absolute_percentage_error(results_user[\"total_power_mean_watts\"], results_user[\"LinearSVR_total_power_mean_watts\"])\n",
-    "    SGD_mape = mean_absolute_percentage_error(results_user[\"total_power_mean_watts\"], results_user[\"SGDRegressor_total_power_mean_watts\"])\n",
-    "    res = {\"user_id\": user, \n",
-    "           \"hist_mape\": hist_mape, \n",
-    "           \"LinearRegression_mape\": LR_mape, \n",
-    "           \"RandomForestRegressor_mape\": RF_mape, \n",
-    "           \"LinearSVR_mape\": LSVR_mape,\n",
-    "           \"SGDRegressor_mape\": SGD_mape}\n",
-    "    lst_stats_per_user.append(res)\n",
-    "    #break\n",
-    "\n",
-    "df_stats_per_user = pd.DataFrame(lst_stats_per_user)\n",
-    "df_stats_per_user\n"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "COLS = [\"hist_mape\",\"LinearRegression_mape\",\"RandomForestRegressor_mape\",\"LinearSVR_mape\",\"SGDRegressor_mape\"]\n",
-    "\n",
-    "df_stats_per_user[COLS].describe()"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "COLS = [\"hist_mape\",\"LinearRegression_mape\",\"RandomForestRegressor_mape\",\"LinearSVR_mape\",\"SGDRegressor_mape\"]\n",
-    "\n",
-    "df_stats_per_user_pivot = pd.melt(df_stats_per_user, id_vars=\"user_id\")\n",
-    "df_stats_per_user_pivot"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### Figure 3 A"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "\n",
-    "import matplotlib.pyplot as plt\n",
-    "\n",
-    "TINY_SIZE = 2\n",
-    "SMALL_SIZE = 5\n",
-    "MEDIUM_SIZE = 20\n",
-    "BIGGER_SIZE = 50\n",
-    "FIG_WIDTH = 40\n",
-    "FIG_HEIGHT = 10\n",
-    "\n",
-    "\n",
-    "#plt.rc('font', size=16)          # controls default text sizes\n",
-    "plt.rc('font', size=20)          # controls default text sizes\n",
-    "plt.rc('axes', titlesize=MEDIUM_SIZE)     # fontsize of the axes title\n",
-    "plt.rc('axes', labelsize=MEDIUM_SIZE)     # fontsize of the x and y labels\n",
-    "plt.rc('xtick', labelsize=MEDIUM_SIZE)    # fontsize of the tick labels\n",
-    "plt.rc('ytick', labelsize=MEDIUM_SIZE)    # fontsize of the tick labels\n",
-    "plt.rc('legend', fontsize=MEDIUM_SIZE)    # legend fontsize\n",
-    "plt.rc('figure', titlesize=MEDIUM_SIZE)  # fontsize of the figure title\n",
-    "\n",
-    "#g = sns.boxplot(x=\"variable\", y=\"value\", data=df_stats_per_user_pivot, showfliers=False)\n",
-    "#plt.xticks(ticks=[0,1,2,3,4],labels=[\"History\", \"LinearRegression\", \"RandomForest\", \"LinearSVR\", \"SGDRegressor\"],rotation=30)\n",
-    "g = sns.boxplot(y=\"variable\", x=\"value\", data=df_stats_per_user_pivot, showfliers=False)\n",
-    "plt.yticks(ticks=[0,1,2,3,4],labels=[\"History\", \"LinearRegression\", \"RandomForest\", \"LinearSVR\", \"SGDRegressor\"],rotation=0)\n",
-    "\n",
-    "g.set_ylabel(\"Prediction Method\")\n",
-    "g.set_xlabel(\"Mean Absolute Percentage Error (MAPE)     \")"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## Processing the max power prediction results (script `run_prediction_per_user_allmethods_max.py`)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "import pandas as pd\n",
-    "import seaborn as sns\n",
-    "\n",
-    "import os\n",
-    "\n",
-    "RESULTS_PATH = \"./m100-data/total_power_mean_predictions_users_allmethods_max/\"\n",
-    "\n",
-    "PRED_COLS = [\"hist_pred_total_power_max\",\n",
-    "            \"LinearRegression_total_power_max_watts\",\n",
-    "            \"RandomForestRegressor_total_power_max_watts\", \n",
-    "            \"LinearSVR_total_power_max_watts\", \n",
-    "            \"SGDRegressor_total_power_max_watts\"]\n",
-    "\n",
-    "\n",
-    "result_filenames = os.listdir(RESULTS_PATH)\n",
-    "\n",
-    "df_all_results = pd.concat([pd.read_csv(RESULTS_PATH+filename, low_memory=False) for filename in result_filenames])\n",
-    "\n",
-    "df_all_results = df_all_results.dropna(subset=PRED_COLS)\n",
-    "df_all_results"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "from sklearn.metrics import mean_absolute_error, mean_squared_error, mean_absolute_percentage_error\n",
-    "\n",
-    "lst_users = df_all_results[\"user_id\"].drop_duplicates().to_list()\n",
-    "#print(lst_users)\n",
-    "\n",
-    "df_results_user_group = df_all_results.groupby(\"user_id\")\n",
-    "\n",
-    "lst_stats_per_user = []\n",
-    "\n",
-    "for user in lst_users:\n",
-    "    results_user = df_results_user_group.get_group(user)\n",
-    "    hist_mape = mean_absolute_percentage_error(results_user[\"total_power_max_watts\"], results_user[\"hist_pred_total_power_max\"])\n",
-    "    LR_mape = mean_absolute_percentage_error(results_user[\"total_power_max_watts\"], results_user[\"LinearRegression_total_power_max_watts\"])\n",
-    "    RF_mape = mean_absolute_percentage_error(results_user[\"total_power_max_watts\"], results_user[\"RandomForestRegressor_total_power_max_watts\"])\n",
-    "    LSVR_mape = mean_absolute_percentage_error(results_user[\"total_power_max_watts\"], results_user[\"LinearSVR_total_power_max_watts\"])\n",
-    "    SGD_mape = mean_absolute_percentage_error(results_user[\"total_power_max_watts\"], results_user[\"SGDRegressor_total_power_max_watts\"])\n",
-    "    res = {\"user_id\": user, \n",
-    "           \"hist_mape\": hist_mape, \n",
-    "           \"LinearRegression_mape\": LR_mape, \n",
-    "           \"RandomForestRegressor_mape\": RF_mape, \n",
-    "           \"LinearSVR_mape\": LSVR_mape,\n",
-    "           \"SGDRegressor_mape\": SGD_mape}\n",
-    "    lst_stats_per_user.append(res)\n",
-    "    #break\n",
-    "\n",
-    "df_stats_per_user = pd.DataFrame(lst_stats_per_user)\n",
-    "df_stats_per_user"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "COLS = [\"hist_mape\",\"LinearRegression_mape\",\"RandomForestRegressor_mape\",\"LinearSVR_mape\",\"SGDRegressor_mape\"]\n",
-    "\n",
-    "df_stats_per_user[COLS].describe()"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "COLS = [\"hist_mape\",\"LinearRegression_mape\",\"RandomForestRegressor_mape\",\"LinearSVR_mape\",\"SGDRegressor_mape\"]\n",
-    "\n",
-    "df_stats_per_user_pivot = pd.melt(df_stats_per_user, id_vars=\"user_id\")\n",
-    "df_stats_per_user_pivot"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### Figure 3 B"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "\n",
-    "import matplotlib.pyplot as plt\n",
-    "\n",
-    "TINY_SIZE = 2\n",
-    "SMALL_SIZE = 5\n",
-    "MEDIUM_SIZE = 20\n",
-    "BIGGER_SIZE = 50\n",
-    "FIG_WIDTH = 40\n",
-    "FIG_HEIGHT = 10\n",
-    "\n",
-    "\n",
-    "plt.rc('font', size=20)          # controls default text sizes\n",
-    "plt.rc('axes', titlesize=MEDIUM_SIZE)     # fontsize of the axes title\n",
-    "plt.rc('axes', labelsize=MEDIUM_SIZE)     # fontsize of the x and y labels\n",
-    "plt.rc('xtick', labelsize=MEDIUM_SIZE)    # fontsize of the tick labels\n",
-    "plt.rc('ytick', labelsize=MEDIUM_SIZE)    # fontsize of the tick labels\n",
-    "plt.rc('legend', fontsize=MEDIUM_SIZE)    # legend fontsize\n",
-    "plt.rc('figure', titlesize=MEDIUM_SIZE)  # fontsize of the figure title\n",
-    "\n",
-    "#g = sns.boxplot(x=\"variable\", y=\"value\", data=df_stats_per_user_pivot, showfliers=False)\n",
-    "#plt.xticks(ticks=[0,1,2,3,4],labels=[\"History\", \"LinearRegression\", \"RandomForest\", \"LinearSVR\", \"SGDRegressor\"],rotation=30)\n",
-    "#g.set_xlabel(\"Prediction Method\")\n",
-    "#g.set_ylabel(\"Mean Absolute Percentage Error (MAPE)            \")\n",
-    "\n",
-    "g = sns.boxplot(y=\"variable\", x=\"value\", data=df_stats_per_user_pivot, showfliers=False)\n",
-    "plt.yticks(ticks=[0,1,2,3,4],labels=[\"History\", \"LinearRegression\", \"RandomForest\", \"LinearSVR\", \"SGDRegressor\"],rotation=0)\n",
-    "g.set_ylabel(\"Prediction Method\")\n",
-    "g.set_xlabel(\"Mean Absolute Percentage Error (MAPE)\")\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## Getting the actual mean and max power distributions"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### Mean (Figure 2 A)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "import matplotlib.pyplot as plt\n",
-    "import seaborn as sns\n",
-    "\n",
-    "TINY_SIZE = 2\n",
-    "SMALL_SIZE = 5\n",
-    "MEDIUM_SIZE = 20\n",
-    "BIGGER_SIZE = 50\n",
-    "FIG_WIDTH = 40\n",
-    "FIG_HEIGHT = 10\n",
-    "\n",
-    "plt.clf()\n",
-    "\n",
-    "plt.rc('figure', figsize=(8, 6))\n",
-    "plt.rc('font', size=MEDIUM_SIZE)          # controls default text sizes\n",
-    "plt.rc('axes', titlesize=MEDIUM_SIZE)     # fontsize of the axes title\n",
-    "plt.rc('axes', labelsize=MEDIUM_SIZE)     # fontsize of the x and y labels\n",
-    "plt.rc('xtick', labelsize=MEDIUM_SIZE)    # fontsize of the tick labels\n",
-    "plt.rc('ytick', labelsize=MEDIUM_SIZE)    # fontsize of the tick labels\n",
-    "plt.rc('legend', fontsize=MEDIUM_SIZE)    # legend fontsize\n",
-    "plt.rc('figure', titlesize=MEDIUM_SIZE)  # fontsize of the figure title\n",
-    "\n",
-    "g = sns.histplot(x=\"total_power_mean_watts\", data=df_all_results, bins=25, fill=False)\n",
-    "#g.ax.set_yscale('log')\n",
-    "g.set_xlabel(\"Total Power (watts)\")\n",
-    "g.set_ylabel(\"Number of Jobs\")\n",
-    "plt.xticks(ticks=[0,250,500,750,1000,1250,1500], rotation=30)\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### Max (Figure 2 B)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "import matplotlib.pyplot as plt\n",
-    "import seaborn as sns\n",
-    "\n",
-    "TINY_SIZE = 2\n",
-    "SMALL_SIZE = 5\n",
-    "MEDIUM_SIZE = 20\n",
-    "BIGGER_SIZE = 50\n",
-    "FIG_WIDTH = 40\n",
-    "FIG_HEIGHT = 10\n",
-    "\n",
-    "plt.clf()\n",
-    "\n",
-    "plt.rc('figure', figsize=(8, 6))\n",
-    "plt.rc('font', size=MEDIUM_SIZE)          # controls default text sizes\n",
-    "plt.rc('axes', titlesize=MEDIUM_SIZE)     # fontsize of the axes title\n",
-    "plt.rc('axes', labelsize=MEDIUM_SIZE)     # fontsize of the x and y labels\n",
-    "plt.rc('xtick', labelsize=MEDIUM_SIZE)    # fontsize of the tick labels\n",
-    "plt.rc('ytick', labelsize=MEDIUM_SIZE)    # fontsize of the tick labels\n",
-    "plt.rc('legend', fontsize=MEDIUM_SIZE)    # legend fontsize\n",
-    "plt.rc('figure', titlesize=MEDIUM_SIZE)  # fontsize of the figure title\n",
-    "\n",
-    "#g = sns.displot(x=\"total_power_max_watts\", data=df_all_results)\n",
-    "g = sns.histplot(x=\"total_power_max_watts\", data=df_all_results, bins=25, fill=False)\n",
-    "\n",
-    "#g.ax.set_yscale('log')\n",
-    "g.set_xlabel(\"Total Power (watts)\")\n",
-    "g.set_ylabel(\"Number of Jobs\")\n",
-    "plt.xticks(ticks=[0,250,500,750,1000,1250,1500,1750,2000], rotation=30)\n"
-   ]
-  }
- ],
- "metadata": {
-  "kernelspec": {
-   "display_name": "Python 3 (ipykernel)",
-   "language": "python",
-   "name": "python3"
-  },
-  "language_info": {
-   "codemirror_mode": {
-    "name": "ipython",
-    "version": 3
-   },
-   "file_extension": ".py",
-   "mimetype": "text/x-python",
-   "name": "python",
-   "nbconvert_exporter": "python",
-   "pygments_lexer": "ipython3",
-   "version": "3.10.9"
-  },
-  "orig_nbformat": 4
- },
- "nbformat": 4,
- "nbformat_minor": 2
-}
diff --git a/notebooks/prediction-results-analysis.Rmd b/notebooks/prediction-results-analysis.Rmd
new file mode 100644
index 0000000..e20a77b
--- /dev/null
+++ b/notebooks/prediction-results-analysis.Rmd
@@ -0,0 +1,265 @@
+---
+title: "Job power prediction result analysis"
+author: "Danilo Carastan-Santos"
+date: "2024-05-15"
+output:
+  rmdformats::readthedown
+---
+
+## Processing the mean power prediction results
+Outputs of script `run_prediction_per_user_allmethods_mean.py`.
+
+```{python}
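+# Load every per-user CSV of *mean* job power predictions, drop rows with
+# missing predictions, then compute the per-user MAPE of each prediction method.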
+import pandas as pd
+import seaborn as sns
+
+import os
+
+RESULTS_PATH = "../user-power-predictions/data/total_power_mean_predictions_users_allmethods_mean/"
+PRED_COLS = ["hist_pred_total_power_mean",
+            "LinearRegression_total_power_mean_watts",
+            "RandomForestRegressor_total_power_mean_watts", 
+            "LinearSVR_total_power_mean_watts", 
+            "SGDRegressor_total_power_mean_watts"]
+
+
+result_filenames = os.listdir(RESULTS_PATH)
+
+df_all_results = pd.concat([pd.read_csv(RESULTS_PATH+filename, low_memory=False) for filename in result_filenames])
+
+df_all_results = df_all_results.dropna(subset=PRED_COLS)
+df_all_results
+
+
+from sklearn.metrics import mean_absolute_error, mean_squared_error, mean_absolute_percentage_error
+
+lst_users = df_all_results["user_id"].drop_duplicates().to_list()
+#print(lst_users)
+
+df_results_user_group = df_all_results.groupby("user_id")
+
+lst_stats_per_user = []
+
+for user in lst_users:
+    results_user = df_results_user_group.get_group(user)
+    hist_mape = mean_absolute_percentage_error(results_user["total_power_mean_watts"], results_user["hist_pred_total_power_mean"])
+    LR_mape = mean_absolute_percentage_error(results_user["total_power_mean_watts"], results_user["LinearRegression_total_power_mean_watts"])
+    RF_mape = mean_absolute_percentage_error(results_user["total_power_mean_watts"], results_user["RandomForestRegressor_total_power_mean_watts"])
+    LSVR_mape = mean_absolute_percentage_error(results_user["total_power_mean_watts"], results_user["LinearSVR_total_power_mean_watts"])
+    SGD_mape = mean_absolute_percentage_error(results_user["total_power_mean_watts"], results_user["SGDRegressor_total_power_mean_watts"])
+    res = {"user_id": user, 
+           "hist_mape": hist_mape, 
+           "LinearRegression_mape": LR_mape, 
+           "RandomForestRegressor_mape": RF_mape, 
+           "LinearSVR_mape": LSVR_mape,
+           "SGDRegressor_mape": SGD_mape}
+    lst_stats_per_user.append(res)
+    #break
+
+df_stats_per_user = pd.DataFrame(lst_stats_per_user)
+df_stats_per_user
+
+COLS = ["hist_mape","LinearRegression_mape","RandomForestRegressor_mape","LinearSVR_mape","SGDRegressor_mape"]
+df_stats_per_user[COLS].describe()
+
+COLS = ["hist_mape","LinearRegression_mape","RandomForestRegressor_mape","LinearSVR_mape","SGDRegressor_mape"]
+df_stats_per_user_pivot = pd.melt(df_stats_per_user, id_vars="user_id")
+df_stats_per_user_pivot
+```
+
+### Figure 3 (a)
+```{python}
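+# Figure 3 (a): horizontal boxplots of the per-user MAPE, one box per prediction method.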
+import matplotlib.pyplot as plt
+
+TINY_SIZE = 2
+SMALL_SIZE = 5
+MEDIUM_SIZE = 20
+BIGGER_SIZE = 50
+FIG_WIDTH = 40
+FIG_HEIGHT = 10
+
+#plt.rc('font', size=16)          # controls default text sizes
+plt.rc('font', size=20)          # controls default text sizes
+plt.rc('axes', titlesize=MEDIUM_SIZE)     # fontsize of the axes title
+plt.rc('axes', labelsize=MEDIUM_SIZE)     # fontsize of the x and y labels
+plt.rc('xtick', labelsize=MEDIUM_SIZE)    # fontsize of the tick labels
+plt.rc('ytick', labelsize=MEDIUM_SIZE)    # fontsize of the tick labels
+plt.rc('legend', fontsize=MEDIUM_SIZE)    # legend fontsize
+plt.rc('figure', titlesize=MEDIUM_SIZE)  # fontsize of the figure title
+plt.rc('figure', figsize=(8,4))
+
+#g = sns.boxplot(x="variable", y="value", data=df_stats_per_user_pivot, showfliers=False)
+#plt.xticks(ticks=[0,1,2,3,4],labels=["History", "LinearRegression", "RandomForest", "LinearSVR", "SGDRegressor"],rotation=30)
+g = sns.boxplot(y="variable", x="value", data=df_stats_per_user_pivot, showfliers=False)
+plt.yticks(ticks=[0,1,2,3,4],labels=["History", "LinearRegression", "RandomForest", "LinearSVR", "SGDRegressor"],rotation=0)
+
+g.set_ylabel("Prediction Method")
+g.set_xlabel("Mean Absolute Percentage Error (MAPE)     ")
+plt.tight_layout(pad=0)
+plt.savefig("./fig3a-pred-mape-mean-power.svg")
+```
+
+## Processing the max power prediction results
+Outputs of script `run_prediction_per_user_allmethods_max.py`.
+
+```{python}
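+# Same processing as above, but on the *maximum* job power predictions.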
+import pandas as pd
+import seaborn as sns
+
+import os
+
+RESULTS_PATH = "../user-power-predictions/data/total_power_mean_predictions_users_allmethods_max/"
+
+PRED_COLS = ["hist_pred_total_power_max",
+            "LinearRegression_total_power_max_watts",
+            "RandomForestRegressor_total_power_max_watts", 
+            "LinearSVR_total_power_max_watts", 
+            "SGDRegressor_total_power_max_watts"]
+
+
+result_filenames = os.listdir(RESULTS_PATH)
+
+df_all_results = pd.concat([pd.read_csv(RESULTS_PATH+filename, low_memory=False) for filename in result_filenames])
+
+df_all_results = df_all_results.dropna(subset=PRED_COLS)
+#df_all_results
+
+
+from sklearn.metrics import mean_absolute_error, mean_squared_error, mean_absolute_percentage_error
+
+lst_users = df_all_results["user_id"].drop_duplicates().to_list()
+#print(lst_users)
+
+df_results_user_group = df_all_results.groupby("user_id")
+
+lst_stats_per_user = []
+
+for user in lst_users:
+    results_user = df_results_user_group.get_group(user)
+    hist_mape = mean_absolute_percentage_error(results_user["total_power_max_watts"], results_user["hist_pred_total_power_max"])
+    LR_mape = mean_absolute_percentage_error(results_user["total_power_max_watts"], results_user["LinearRegression_total_power_max_watts"])
+    RF_mape = mean_absolute_percentage_error(results_user["total_power_max_watts"], results_user["RandomForestRegressor_total_power_max_watts"])
+    LSVR_mape = mean_absolute_percentage_error(results_user["total_power_max_watts"], results_user["LinearSVR_total_power_max_watts"])
+    SGD_mape = mean_absolute_percentage_error(results_user["total_power_max_watts"], results_user["SGDRegressor_total_power_max_watts"])
+    res = {"user_id": user, 
+           "hist_mape": hist_mape, 
+           "LinearRegression_mape": LR_mape, 
+           "RandomForestRegressor_mape": RF_mape, 
+           "LinearSVR_mape": LSVR_mape,
+           "SGDRegressor_mape": SGD_mape}
+    lst_stats_per_user.append(res)
+    #break
+
+df_stats_per_user = pd.DataFrame(lst_stats_per_user)
+#df_stats_per_user
+
+COLS = ["hist_mape","LinearRegression_mape","RandomForestRegressor_mape","LinearSVR_mape","SGDRegressor_mape"]
+df_stats_per_user[COLS].describe()
+
+COLS = ["hist_mape","LinearRegression_mape","RandomForestRegressor_mape","LinearSVR_mape","SGDRegressor_mape"]
+df_stats_per_user_pivot = pd.melt(df_stats_per_user, id_vars="user_id")
+df_stats_per_user_pivot
+```
+
+### Figure 3 (b)
+```{python}
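+# Figure 3 (b): horizontal boxplots of the per-user MAPE on max power, one box per prediction method.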
+import matplotlib.pyplot as plt
+
+TINY_SIZE = 2
+SMALL_SIZE = 5
+MEDIUM_SIZE = 20
+BIGGER_SIZE = 50
+FIG_WIDTH = 40
+FIG_HEIGHT = 10
+
+
+plt.rc('font', size=20)          # controls default text sizes
+plt.rc('axes', titlesize=MEDIUM_SIZE)     # fontsize of the axes title
+plt.rc('axes', labelsize=MEDIUM_SIZE)     # fontsize of the x and y labels
+plt.rc('xtick', labelsize=MEDIUM_SIZE)    # fontsize of the tick labels
+plt.rc('ytick', labelsize=MEDIUM_SIZE)    # fontsize of the tick labels
+plt.rc('legend', fontsize=MEDIUM_SIZE)    # legend fontsize
+plt.rc('figure', titlesize=MEDIUM_SIZE)  # fontsize of the figure title
+plt.rc('figure', figsize=(8,4))
+
+#g = sns.boxplot(x="variable", y="value", data=df_stats_per_user_pivot, showfliers=False)
+#plt.xticks(ticks=[0,1,2,3,4],labels=["History", "LinearRegression", "RandomForest", "LinearSVR", "SGDRegressor"],rotation=30)
+#g.set_xlabel("Prediction Method")
+#g.set_ylabel("Mean Absolute Percentage Error (MAPE)            ")
+
+g = sns.boxplot(y="variable", x="value", data=df_stats_per_user_pivot, showfliers=False)
+plt.yticks(ticks=[0,1,2,3,4],labels=["History", "LinearRegression", "RandomForest", "LinearSVR", "SGDRegressor"],rotation=0)
+g.set_ylabel("Prediction Method")
+g.set_xlabel("Mean Absolute Percentage Error (MAPE)")
+plt.tight_layout(pad=0)
+plt.savefig("./fig3b-pred-mape-max-power.svg")
+```
+
+## Getting the actual mean and max power distributions
+### Mean: Figure 2 (a)
+```{python}
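+# Figure 2 (a): histogram of the jobs' mean power consumption.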
+import matplotlib.pyplot as plt
+import seaborn as sns
+
+TINY_SIZE = 2
+SMALL_SIZE = 5
+MEDIUM_SIZE = 20
+BIGGER_SIZE = 50
+FIG_WIDTH = 40
+FIG_HEIGHT = 10
+
+plt.clf()
+
+plt.rc('figure', figsize=(8, 6))
+plt.rc('font', size=MEDIUM_SIZE)          # controls default text sizes
+plt.rc('axes', titlesize=MEDIUM_SIZE)     # fontsize of the axes title
+plt.rc('axes', labelsize=MEDIUM_SIZE)     # fontsize of the x and y labels
+plt.rc('xtick', labelsize=MEDIUM_SIZE)    # fontsize of the tick labels
+plt.rc('ytick', labelsize=MEDIUM_SIZE)    # fontsize of the tick labels
+plt.rc('legend', fontsize=MEDIUM_SIZE)    # legend fontsize
+plt.rc('figure', titlesize=MEDIUM_SIZE)  # fontsize of the figure title
+plt.rc('figure', figsize=(6,4))
+
+g = sns.histplot(x="total_power_mean_watts", data=df_all_results, bins=25, fill=False)
+#g.ax.set_yscale('log')
+g.set_xlabel("Total Power (watts)")
+g.set_ylabel("Number of Jobs")
+plt.xticks(ticks=[0,250,500,750,1000,1250,1500], rotation=30)
+plt.tight_layout(pad=0)
+plt.savefig("./fig2a-distrib-job-power-mean.svg")
+```
+
+### Max: Figure 2 (b)
+```{python}
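+# Figure 2 (b): histogram of the jobs' maximum power consumption.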
+import matplotlib.pyplot as plt
+import seaborn as sns
+
+TINY_SIZE = 2
+SMALL_SIZE = 5
+MEDIUM_SIZE = 20
+BIGGER_SIZE = 50
+FIG_WIDTH = 40
+FIG_HEIGHT = 10
+
+plt.clf()
+
+plt.rc('figure', figsize=(8, 6))
+plt.rc('font', size=MEDIUM_SIZE)          # controls default text sizes
+plt.rc('axes', titlesize=MEDIUM_SIZE)     # fontsize of the axes title
+plt.rc('axes', labelsize=MEDIUM_SIZE)     # fontsize of the x and y labels
+plt.rc('xtick', labelsize=MEDIUM_SIZE)    # fontsize of the tick labels
+plt.rc('ytick', labelsize=MEDIUM_SIZE)    # fontsize of the tick labels
+plt.rc('legend', fontsize=MEDIUM_SIZE)    # legend fontsize
+plt.rc('figure', titlesize=MEDIUM_SIZE)  # fontsize of the figure title
+plt.rc('figure', figsize=(6,4))
+
+#g = sns.displot(x="total_power_max_watts", data=df_all_results)
+g = sns.histplot(x="total_power_max_watts", data=df_all_results, bins=25, fill=False)
+
+#g.ax.set_yscale('log')
+g.set_xlabel("Total Power (watts)")
+g.set_ylabel("Number of Jobs")
+plt.xticks(ticks=[0,250,500,750,1000,1250,1500,1750,2000], rotation=30)
+plt.tight_layout(pad=0)
+plt.savefig("./fig2b-distrib-job-power-max.svg")
+```
-- 
GitLab