Commit 2c2224a8 authored by Pierre LOTTE

A few modifications

parent 2d7c775a
@@ -12,7 +12,7 @@ This repository contains a few important directories and files:
## Installation
To use this program, you can use poetry.
To use this program, we recommend using poetry, as it makes installing the required libraries easy. If you do not want to use poetry, you can find all the required libraries and their respective versions in the `pyproject.toml` file.
First, install all the required dependencies using the following command:
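```bash
# Standard poetry command; assumes poetry is already installed on your machine.
poetry install
```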
@@ -22,24 +22,51 @@ poetry install
Then, to run the program you can run:
## Usage
Our program is designed to perform all the steps required to test our method.
### Data Generation
The first thing to do is to generate synthetic data. To do so, we tell the program which configuration to use and where to write the generated data.
```bash
poetry run python paradise/main.py <args>
poetry run python paradise/main.py generate -c <config_path> -o <output_dir>
```
## Usage
### Training
The main use of our program is to generate some data, test the PARADISE methodology on it and extract the results. All of that can be done using a single command:
Then, we tell the program which data configuration(s) to train our algorithm on, which algorithm(s) to train, and where to find the data. The `-s` option tells the program to use the data partitioning step. It can be omitted if you only want to train the algorithm on non-partitioned data.
```bash
poetry run python paradise/main.py all -c <config_path> -a <algorithm_name> -o <output_dir> -s
poetry run python paradise/main.py train -c <config_path> -a <algorithm_name> -i <input_dir> [-s]
```
In this command we do a few different things. First, we tell the program to perform the data generation, training and result extraction steps one after another. Then, we tell it where to find the configuration file, which algorithm to use (the algorithm name is the name of the directory containing the algorithm implementation) and where to put the results. Finally, we tell it to perform the partitioning step. The splitting argument can be omitted if the partitioning has already been performed and you want to save some time.
### Result Extraction
Last but not least, we want to extract the results. To do so, we need to tell the program where to find them.
```bash
poetry run python paradise/main.py all -i <input_dir>
```
> For more information on the possible arguments and their values, please refer to the following command's output:
### All in one
If you want the program to do everything at once, you can use the command shown just below. It will perform all the tasks discussed above.
```bash
poetry run python paradise/main.py all -c <config_path> -a <algorithm_name> -o <output_dir> [-s]
```
> For more information on the possible arguments, please refer to the following command's output:
> ```bash
> poetry run python paradise/main.py --help
> ```
## Algorithms to test
\ No newline at end of file
## Algorithms to test
As stated earlier, the algorithm implementations we used are taken from [TimeEval](https://github.com/TimeEval/TimeEval-algorithms). Many of these implementations are usable without much modification. Some of them require a few fixes to work with the library versions we use, but these should not take much time.
Because we build on those implementations, we expect a few things if you want to test your own algorithm. First, the CSV format your model accepts should contain the timestamp as the first column and the label as the last column, for both train and test data. Second, it should use the same call interface; for more information on this interface, refer to the TimeEval repository linked above. Third and last, your algorithm should output its results in a file readable with numpy's `loadtxt` function.
If your algorithm follows those principles, it should be easy to run it as part of our pipeline. The last thing you need to do is add an entry for your algorithm in the `algorithm_params.json` file with the proper hyperparameters and their values.
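For illustration, here is a minimal sketch of these conventions. All file names, column names and values below are hypothetical; only the column order (timestamp first, label last) and the `loadtxt`-readable output format are the actual requirements.
```python
import numpy as np
import pandas as pd

# Hypothetical dataset: timestamp first, feature columns in between, label last.
df = pd.DataFrame({
    "timestamp": range(6),
    "value_0": [0.10, 0.12, 0.11, 0.95, 0.13, 0.12],
    "value_1": [1.00, 1.05, 0.98, 3.40, 1.02, 0.99],
    "is_anomaly": [0, 0, 0, 1, 0, 0],
})
df.to_csv("example.test.csv", index=False)

# The algorithm is expected to write one anomaly score per time point to a
# plain-text file that numpy's loadtxt can parse.
np.savetxt("example_scores.txt", np.random.rand(len(df)))
scores = np.loadtxt("example_scores.txt")
assert scores.shape[0] == len(df)
```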
\ No newline at end of file
@@ -39,7 +39,7 @@ class VFAnomaly(BaseAnomaly):
frequencies = np.tile(frqs, self.data.shape[1]//term["period"])
missing_values = self.data.shape[1] - len(frequencies)
frequencies = np.concatenate((frequencies, frequencies[:missing_values]))
frequencies[start:end] *= 3
frequencies[start:end] *= 0.7
# Find the differences between two consecutive points of the data
signal = np.zeros(self.data.shape[1])
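For context, the snippet above tiles a base frequency profile over the whole series, pads it to the exact series length, and then scales it inside the anomalous window. Below is a standalone sketch of that pattern; the series length, period and window are hypothetical values chosen for illustration.
```python
import numpy as np

n_points = 1000        # hypothetical series length (self.data.shape[1] in the class)
period = 64            # hypothetical base period of the anomaly term
start, end = 300, 400  # hypothetical anomalous window

# Base frequency profile over one period, tiled to cover the series.
frqs = np.linspace(1.0, 2.0, period)
frequencies = np.tile(frqs, n_points // period)

# Pad with the beginning of the profile so it matches the series length exactly.
missing_values = n_points - len(frequencies)
frequencies = np.concatenate((frequencies, frequencies[:missing_values]))

# Slow the oscillation down inside the anomalous window.
frequencies[start:end] *= 0.7
```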
@@ -24,7 +24,7 @@ if __name__ == "__main__":
# =================================================================================================================
# Fetch arguments from CLI
# =================================================================================================================
# Create the parser
# Create the parser
parser = ArgumentParser(prog="Time Series Generator", description="Create time series.")
# Add arguments
@@ -5,6 +5,7 @@ import os
import numpy as np
import pandas as pd
from sklearn.preprocessing import minmax_scale
from tqdm import tqdm
from utils import exec_cmd, reverse_window
@@ -138,8 +139,14 @@ class ResultExtractor():
"Algorithm",
"Dataset",
"F1",
"Precision",
"Recall",
"F1 paradise",
"Precision paradise",
"Recall paradise",
"F1 ideal",
"Precision ideal",
"Recall ideal",
"Resp. subset paradise",
"Resp. subset ideal",
"ROC",
@@ -149,7 +156,8 @@ class ResultExtractor():
]
df = pd.DataFrame(columns=columns)
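# One row is appended to this DataFrame per (algorithm, dataset) pair in the loop below.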
for algo, datasets in self.results.items():
for algo, datasets in tqdm(self.results.items()):
for dataset_name, dataset_results in datasets.items():
f1_results = F1Metric().compute(dataset_results, self.ground_truth[dataset_name])
roc_results = ROCMetric().compute(dataset_results, self.ground_truth[dataset_name])
@@ -159,9 +167,15 @@ class ResultExtractor():
df.loc[len(df)] = [
algo,
dataset_name,
f1_results[0],
f1_results[1],
f1_results[2],
f1_results[0]["f1"],
f1_results[0]["prec"],
f1_results[0]["rec"],
f1_results[1]["f1"],
f1_results[1]["prec"],
f1_results[1]["rec"],
f1_results[2]["f1"],
f1_results[2]["prec"],
f1_results[2]["rec"],
f1_results[3],
f1_results[4],
roc_results[0],
@@ -4,12 +4,14 @@ This module contains the code needed to compute the F1 score.
from typing import Iterable, Tuple
import numpy as np
from sklearn.metrics import precision_recall_curve
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve, roc_curve, f1_score, precision_score, recall_score
class F1Metric:
"""
This class defines the code needed to compute the F1 score.
This class defines the code needed to compute the F1 score. The precision and recall metrics are
computed as a by-product, since they are needed to compute the F1 score.
"""
def compute(self, results: dict, labels: Iterable) -> Tuple[float, float, float]:
"""
@@ -39,12 +41,20 @@ class F1Metric:
else:
final_scores = predictions
prec, rec, thresholds = precision_recall_curve(truth, final_scores)
fscore = (2 * prec * rec) / (prec + rec)
fscore = np.nan_to_num(fscore)
idx = np.argmax(fscore)
fpr, tpr, thresholds = roc_curve(truth, final_scores)
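# Pick the decision threshold whose ROC point lies closest to the ideal corner (FPR = 0, TPR = 1).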
distances = np.zeros(len(thresholds))
for idx, (x, y) in enumerate(zip(fpr, tpr)):
distances[idx] = np.sqrt((0 - x)**2 + (1 - y)**2)
# fscore = (2 * prec * rec) / (prec + rec)
# fscore = np.nan_to_num(fscore)
idx = np.argmin(distances)
print(thresholds[idx], idx)
pred_labels = (final_scores > thresholds[idx]).astype(int)
fscore = f1_score(truth, pred_labels)
prec = precision_score(truth, pred_labels)
rec = recall_score(truth, pred_labels)
return fscore[idx], final_scores, thresholds[idx]
return fscore, prec, rec, final_scores, thresholds[idx]
def __find_responsible_subset(global_scores, local_scores, threshold):
"""
@@ -74,14 +84,21 @@ class F1Metric:
return responsibles
# Compute F1 score for classic approach
classic_f1, _, _ = __find_optimal_f1(labels, results["classic"])
classic_f1, classic_prec, classic_rec, _, _ = __find_optimal_f1(labels, results["classic"])
# Compute F1 score for paradise approach
paradise_f1, paradise_scores, threshold = __find_optimal_f1(labels, results["paradise"])
paradise_f1, paradise_prec, paradise_rec, paradise_scores, threshold = __find_optimal_f1(labels, results["paradise"])
responsible_subset_paradise = __find_responsible_subset(paradise_scores, results["paradise"], threshold)
# Compute F1 score for ideal approach
ideal_f1, ideal_scores, threshold = __find_optimal_f1(labels, results["ideal"])
ideal_f1, ideal_prec, ideal_rec, ideal_scores, threshold = __find_optimal_f1(labels, results["ideal"])
responsible_subset_ideal = __find_responsible_subset(ideal_scores, results["ideal"], threshold)
return classic_f1, paradise_f1, ideal_f1, responsible_subset_paradise, responsible_subset_ideal
return (
{"f1": classic_f1, "prec": classic_prec, "rec": classic_rec},
{"f1": paradise_f1, "prec": paradise_prec, "rec": paradise_rec},
{"f1": ideal_f1, "prec": ideal_prec, "rec": ideal_rec},
responsible_subset_paradise,
responsible_subset_ideal
)
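For reference, here is a self-contained sketch of the threshold-selection scheme used in `__find_optimal_f1` above, with synthetic labels and scores for illustration only; the vectorised distance computation is equivalent to the loop above.
```python
import numpy as np
from sklearn.metrics import roc_curve, f1_score, precision_score, recall_score

rng = np.random.default_rng(0)
truth = rng.integers(0, 2, size=200)          # synthetic ground-truth labels
scores = 0.6 * truth + 0.5 * rng.random(200)  # synthetic anomaly scores

# Threshold whose ROC point is closest to the ideal corner (FPR = 0, TPR = 1).
fpr, tpr, thresholds = roc_curve(truth, scores)
distances = np.sqrt(fpr ** 2 + (1 - tpr) ** 2)
threshold = thresholds[np.argmin(distances)]

pred_labels = (scores > threshold).astype(int)
print(f1_score(truth, pred_labels), precision_score(truth, pred_labels), recall_score(truth, pred_labels))
```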