Commit 5330bb9c authored by laura.riviere

add new documentation

parent 732c34ef
- Keep commas as in the templates to avoid errors in the JSON format.
## Usecase 1 : **Discourse Segmentation**
- `"usecase_description":` [string] This field is not a functional one. You can describe your project or keep the default text. e.g. ```"Config file for usecase_1 : from a text, get the same text but with EDU bracket using ToNy segmenter."```
- `"input":` { These fields are mandatory for every usecase.
- `"name":` [string] The name of your input dataset, without the extension. This is also the name of the directory where you put your input dataset. e.g. ```"my.cool.dataset"```
- `"exte":` [string] The extension of your input dataset that reflects its format.
- OPTIONS : [".conllu", ".conll", ".txt", ".tok", ".ttok", ".ss"]
- `"language":` [string] Language ID of your dataset following the ISO 639-1 Code. e.g. ```"en"```
- `"existing_metadata":` [boolean] Set to true if your input text contains metadata. Each line of metadata will start with `#`.
- `"steps":` {
- `"main":` [string] **"annotation"**
- `"pre-processing":` {
- `"to_do":` [boolean] Set to true if you need at least one pre-processing step to be done.
- `"syntactic_tool":` **"stanza"** For now, [stanza](https://stanfordnlp.github.io/stanza/) is the only tool available.
- `"sentence_split":` [boolean]
- `"tokenization":` [boolean]
- `"syntactic_parsing":` [boolean]
- `"create_metadata":` { Create labels following line and/or sentence splits.
- `"to_do":` [boolean] Set to true if you want at least one metadata creation.
- `"line":` [string] Assign a label, e.g. ```paragraph```. A numeral count will be added automatically.
- `"sent":` [string] Assign a label, e.g. ```paragraph```. A numeral count will be added automatically.
- `"discourse_segmenter":` {
- `"model":` [string] Here the name or the path to the existing model you want to use. e.g. `"tony"`, `"/moredata/andiamo/discut/Results_split.tok/results_eng.rst.gum_bert/model.tar.gz"`
- `"train_data_path":` **null**
- `"validation_data_path":` **null**
- `"post-processing":` { The AllenNLP toolkit outputs a JSON file.
- `"json_to_tab":` [boolean] Set to true if you also want a conll-style output with predictions as the last column.
- `"tab_to_bracket":` [boolean] Set to true if you also want an output of the raw text with brackets as EDU delimiters. If so, `"json_to_tab"` has to be set to true too.
- `"evaluation":` [boolean] **false** Cannot be done without a gold dataset.
- `"gold_test_data_path":` [string] **null**
- `"output":` { The AllenNLP toolkit outputs a JSON file. You can choose to add other output files.
- `"conll_file":` {
- `"to_do":` [boolean] Set to true if you want to output a file tokenized with the predictions.
- `"metadata":` [boolean] Set to true if you want all metadata (from input or pre-processing) to appear.
- `"with_gold_labels":` [boolean] Set to true if you want to keep a column with gold labels.
- `"txt_file":` {
- `"to_do":` [boolean] Set to true if you want to output a file as plain text with EDU in between brackets.
- `"metadata":` [boolean] Set to true if you want all metadata (from input or pre-processing) to appear.
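Assembled from the fields above, a complete usecase 1 config might look like the sketch below. All values are illustrative, and the exact nesting of the `training` and `evaluation` fields is an assumption based on the field order above; check it against the shipped templates.

```json
{
    "usecase_description": "Config file for usecase_1 : from a text, get the same text but with EDU bracket using ToNy segmenter.",
    "input": {
        "name": "my.cool.dataset",
        "exte": ".txt",
        "language": "en",
        "existing_metadata": false
    },
    "steps": {
        "main": "annotation",
        "pre-processing": {
            "to_do": true,
            "syntactic_tool": "stanza",
            "sentence_split": true,
            "tokenization": true,
            "syntactic_parsing": false,
            "create_metadata": {
                "to_do": true,
                "line": "paragraph",
                "sent": "sentence"
            }
        },
        "discourse_segmenter": {
            "model": "tony",
            "training": {
                "train_data_path": null,
                "validation_data_path": null
            }
        },
        "post-processing": {
            "json_to_tab": true,
            "tab_to_bracket": true
        },
        "evaluation": false,
        "gold_test_data_path": null
    },
    "output": {
        "conll_file": {
            "to_do": true,
            "metadata": true,
            "with_gold_labels": false
        },
        "txt_file": {
            "to_do": true,
            "metadata": true
        }
    }
}
```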
## Usecase 2 : **Segmentation Evaluation**
- `"usecase_description":` [string] This field is not a functional one. You can describe your project or keep the default text. ```"Config file for usecase_2 : Take a EDU gold segmented text au format tok as input, use a loaded model to make predictions. Output scores of model predictions against gold"```
- `"input":` { These fields are mandatory for every usecase.
- `"name":` [string] The name of your input dataset, without the extension. This is also the name of the directory where you put your input dataset. e.g. ```"my.cool.dataset"```
- `"exte":` [string] The extension of your input dataset that reflects its format.
- OPTIONS : [".conllu", ".conll", ".txt", ".tok", ".ttok", ".ss"]
- `"language":` [string] Language ID of your dataset following the ISO 639-1 Code. e.g. ```"en"```
- `"existing_metadata":` [boolean] Set to true if your input text contains metadata. Each line of metadata will start with `#`.
- `"steps":` {
- `"main":` [string] **"test"**
- `"pre-processing":` {
- `"to_do":` [boolean] Set to true if you need at least one pre-processing step to be done.
- `"syntactic_tool":` **"stanza"** For now, [stanza](https://stanfordnlp.github.io/stanza/) is the only tool available.
- `"sentence_split":` [boolean]
- `"tokenization":` [boolean]
- `"syntactic_parsing":` [boolean]
- `"create_metadata":` { Create labels following line and/or sentence splits.
- `"to_do":` [boolean] Set to true if you want at least one metadata creation.
- `"line":` [string] Assign a label, e.g. ```paragraph```. A numeral count will be added automatically.
- `"sent":` [string] Assign a label, e.g. ```paragraph```. A numeral count will be added automatically.
- `"discourse_segmenter":` {
- `"model":` [string] Here the name or the path to the existing model you want to use. e.g. `"tony"`, `"/moredata/andiamo/discut/Results_split.tok/results_eng.rst.gum_bert/model.tar.gz"`
- `"train_data_path":` **null**
- `"validation_data_path":` **null**
- `"post-processing":` { The AllenNLP toolkit outputs a JSON file.
- `"json_to_tab":` [boolean] **true**
- `"tab_to_bracket":` [boolean] Set to true if you also want an output of the raw text with brackets as EDU delimiters. If so, `"json_to_tab"` has to be set to true too.
- `"evaluation":` [boolean] **true**
- `"gold_test_data_path":` [string] The path to your gold dataset to make predictions on, and to evaluate against.
- `"output":` { The AllenNLP toolkit outputs a JSON file. You can choose to add other output files.
- `"conll_file":` {
- `"to_do":` [boolean] Set to true if you want to output a file tokenized with the predictions.
- `"metadata":` [boolean] Set to true if you want all metadata (from input or pre-processing) to appear.
- `"with_gold_labels":` [boolean] Set to true if you want to keep a column with gold labels.
- `"txt_file":` {
- `"to_do":` [boolean] Set to true if you want to output a file as plain text with EDU in between brackets.
- `"metadata":` [boolean] Set to true if you want all metadata (from input or pre-processing) to appear.
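Usecase 2 uses the same overall skeleton as usecase 1; only the step settings change. A minimal sketch of the fields that differ (the nesting and the example path are assumptions, not values from the toolkit):

```json
{
    "steps": {
        "main": "test",
        "post-processing": {
            "json_to_tab": true,
            "tab_to_bracket": false
        },
        "evaluation": true,
        "gold_test_data_path": "../data/my.cool.dataset/my.cool.dataset_test.conllu"
    }
}
```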
## Usecase 3 : **Custom Model Creation**
- `"usecase_description":` [string] This field is not a functional one. You can describe your project or keep the default text. ```"Config file for usecase_3 : Take a EDU gold segmented set of train/dev/test of texts au format conll as input, train a model, output scores."```
- `"input":` { These fields are mandatory for every usecase.
- `"name":` [string] The name of your input dataset, without the extension. This is also the name of the directory where you put your input dataset. e.g. ```"my.cool.dataset"```
- `"exte":` [string] The extension of your input dataset that reflects its format.
- OPTIONS : [".conllu", ".conll", ".txt", ".tok", ".ttok", ".ss"]
- `"language":` [string] Language ID of your dataset following the ISO 639-1 Code. e.g. ```"en"```
- `"existing_metadata":` [boolean] Set to true if your input text contains metadata. Each line of metadata will start with `#`.
- `"steps":` {
- `"main":` [string] **"train"**
- `"pre-processing":` {
- `"to_do":` [boolean] Set to true if you need at least one pre-processing step to be done.
- `"syntactic_tool":` **"stanza"** For now, [stanza](https://stanfordnlp.github.io/stanza/) is the only tool available.
- `"sentence_split":` [boolean]
- `"tokenization":` [boolean]
- `"syntactic_parsing":` [boolean]
- `"create_metadata":` { Create labels following line and/or sentence splits.
- `"to_do":` [boolean] Set to true if you want at least one metadata creation.
- `"line":` [string] Assign a label, e.g. ```paragraph```. A numeral count will be added automatically.
- `"sent":` [string] Assign a label, e.g. ```paragraph```. A numeral count will be added automatically.
- `"discourse_segmenter":` {
- `"model":` **null**
- `"train_data_path":` [string] The path to your training dataset. e.g. `"../data/eng.rst.rstdt/eng.rst.rstdt_train.conllu"` *conflict with training_config ??*
- `"validation_data_path":` [string] The path to your development dataset. e.g. `"../data/eng.rst.rstdt/eng.rst.rstdt_dev.conllu"` *idem*
- `"post-processing":` { The AllenNLP toolkit outputs a JSON file.
- `"json_to_tab":` [boolean] **true**
- `"tab_to_bracket":` [boolean] Set to true if you also want an output of the raw text with brackets as EDU delimiters. If so, `"json_to_tab"` has to be set to true too.
- `"evaluation":` [boolean] Set to true if you want to evaluate your new model against a test set (defined below).
- `"gold_test_data_path":` [string] The path to your gold test dataset to make predictions on, and to evaluate against.
- `"output":` { The AllenNLP toolkit outputs a JSON file. You can choose to add other output files.
- `"conll_file":` {
- `"to_do":` [boolean] Set to true if you want to output a file tokenized with the predictions.
- `"metadata":` [boolean] Set to true if you want all metadata (from input or pre-processing) to appear.
- `"with_gold_labels":` [boolean] Set to true if you want to keep a column with gold labels.
- `"txt_file":` {
- `"to_do":` [boolean] Set to true if you want to output a file as plain text with EDU in between brackets.
- `"metadata":` [boolean] Set to true if you want all metadata (from input or pre-processing) to appear.
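For training a model from scratch, the distinctive part of the config is the `training` block. A sketch of the differing fields for usecase 3, using the example paths above (the test-set path and the nesting of `evaluation` are assumptions):

```json
{
    "steps": {
        "main": "train",
        "discourse_segmenter": {
            "model": null,
            "training": {
                "toolkit": "allennlp",
                "pre_trained_lm": "bert",
                "config_file": "../model/config_training.jsonnet",
                "train_data_path": "../data/eng.rst.rstdt/eng.rst.rstdt_train.conllu",
                "validation_data_path": "../data/eng.rst.rstdt/eng.rst.rstdt_dev.conllu"
            }
        },
        "evaluation": true,
        "gold_test_data_path": "../data/eng.rst.rstdt/eng.rst.rstdt_test.conllu"
    }
}
```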
## Usecase 4 : **Custom Model Fine-tuning**
- `"usecase_description":` [string] This field is not a functional one. You can describe your project or keep the default text. ```"Config file for usecase_3 : Take a EDU gold segmented set of train/dev/test of texts au format conll as input, train a model, output scores."```
- `"input":` { These fields are mandatory for every usecase.
- `"name":` [string] The name of your input test dataset, without the extension. This is also the name of the directory where you put your input dataset. e.g. ```"my.cool.dataset"```, `"eng.rst.rstdt_dev"`
- `"exte":` [string] The extension of your input dataset that reflects its format.
- OPTIONS : [".conllu", ".conll", ".txt", ".tok", ".ttok", ".ss"]
- `"language":` [string] Language ID of your dataset following the ISO 639-1 Code. e.g. ```"en"```
- `"existing_metadata":` [boolean] Set to true if your input text contains metadata. Each line of metadata will start with `#`.
- `"steps":` {
- `"main":` [string] : **"fine_tune"**
- `"pre-processing":` {
- `"to_do":` [boolean] Set to true if you need at least one pre-processing step to be done.
- `"syntactic_tool":` **"stanza"** For now, [stanza](https://stanfordnlp.github.io/stanza/) is the only tool available.
- `"sentence_split":` [boolean]
- `"tokenization":` [boolean]
- `"syntactic_parsing":` [boolean]
- `"create_metadata":` { Create labels following line and/or sentence splits.
- `"to_do":` [boolean] Set to true if you want at least one metadata creation.
- `"line":` [string] Assign a label, e.g. ```paragraph```. A numeral count will be added automatically.
- `"sent":` [string] Assign a label, e.g. ```paragraph```. A numeral count will be added automatically.
- `"discourse_segmenter":` {
- `"model":` [string] Path of the model to be fine-tuned.
- `"training":` {
- `"toolkit":` [string] The toolkit used to build your model (to be added: "jiant").
- OPTIONS : ["allennlp"]
- `"pre_trained_lm":` **bert** (to be added : roberta..)
- `"config_file":` [string] The path to the config file for training. e.g. `"../model/config_training.jsonnet"`. This file needs to be completed accordingly.
- `"train_data_path":` [string] The path to your training dataset. e.g. `"../data/eng.sdrt.stac/eng.sdrt.stac_train.conllu"` *conflict with training_config ??*
- `"validation_data_path":` [string] The path to your development dataset. e.g. `"../data/eng.sdrt.stac/eng.sdrt.stac_dev.conllu"` *idem*
- `"evaluation":` [boolean] Set to true if you want to evaluate your new model against a test set (defined below).
- `"gold_test_data_path":` [string] The path to your gold test dataset to make predictions on, and to evaluate against. e.g. `"eng.rst.rstdt_dev"`
- `"output":` { The AllenNLP toolkit outputs a JSON file. You can choose to add other output files.
- `"conll_file":` {
- `"to_do":` [boolean] Set to true if you want to output a file tokenized with the predictions.
- `"metadata":` [boolean] Set to true if you want all metadata (from input or pre-processing) to appear.
- `"with_gold_labels":` [boolean] Set to true if you want to keep a column with gold labels.
- `"txt_file":` {
- `"to_do":` [boolean] Set to true if you want to output a file as plain text with EDU in between brackets.
- `"metadata":` [boolean] Set to true if you want all metadata (from input or pre-processing) to appear.
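Fine-tuning (usecase 4) differs from usecase 3 mainly in that `"main"` is `"fine_tune"` and `"model"` points to an existing model. A sketch of the differing fields, reusing the example values above (the nesting is an assumption):

```json
{
    "steps": {
        "main": "fine_tune",
        "discourse_segmenter": {
            "model": "/moredata/andiamo/discut/Results_split.tok/results_eng.rst.gum_bert/model.tar.gz",
            "training": {
                "toolkit": "allennlp",
                "pre_trained_lm": "bert",
                "config_file": "../model/config_training.jsonnet",
                "train_data_path": "../data/eng.sdrt.stac/eng.sdrt.stac_train.conllu",
                "validation_data_path": "../data/eng.sdrt.stac/eng.sdrt.stac_dev.conllu"
            }
        },
        "evaluation": true,
        "gold_test_data_path": "eng.rst.rstdt_dev"
    }
}
```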