- Keep commas as in the templates to avoid errors in the JSON format.
## Usecase 1 : **Discourse Segmentation**
- `"usecase_description":` [string] This field is not a fonctional one. You can describe your project or keep the default text. e.g. ```"Config file for usecase_1 : from a text, get the same text but with EDU bracket using ToNy segmenter."```
- `"input":` { These fields are mandatory for every usecase.
- `"name":` [string] The name of your input dataset, without the extension. This is also the name of the directory where you put your input dataset. e.g. ```"my.cool.dataset"```
- `"exte":` [string] The extension of your input dataset, which reflects its format.
- OPTIONS : [".conllu", ".tok", ".ttok", ".ss"]
- `"NER_format_initialisation":` [boolean] Set to true if you are working with ToNy. *Set to true anyway ??*
- `"to_do":` [boolean] Set to true if you need at least one pre-processing step to be done.
- `"syntactic_tool":` **"stanza"** For now, [stanza](https://stanfordnlp.github.io/stanza/) is the only tool available.
- `"sentence_split":` [boolean]
- `"tokenization":` [boolean]
- `"syntactic_parsing":` [boolean]
- `"create_metadata":` { Create label following line and/or sentence splits.
- `"to_do":` [boolean] Set to true if you want at last one creation of metadata.
- `"line":` [string] Assign label e.g. ```paragraph```. A umeral count will be added automatically.
- `"sent":` [string] Assign label e.g. ```paragraph```. A umeral count will be added automatically.
- `"discourse_segmenter":` {
- `"model":` [string] Here the name or the path to the existing model you want to use. e.g. `"tony"`, `"/moredata/andiamo/discut/Results_split.tok/results_eng.rst.gum_bert/model.tar.gz"`
...
...
- `"train_data_path":` **null**
- `"validation_data_path":` **null**
- `"post-processing":` { The toolkit AllenNlp output a JSON.
- `"json_to_tab":` [boolean] Set to true if you want also a conll-style output with predictions as last column.
- `"tab_to_bracket":` [boolean] Set to true if you want also an output as the raw text with brackets as EDU delimiter. If so, `"json_to_tab"` has to be set to true too.
- `"evaluation":` [boolean] : **false**
- `"evaluation":` [boolean] **false** Can not be done without a gold dataset.
- `"gold_test_data_path":` [string] **null**
- `"output":` { The toolkit AllenNlp output a JSON. You can choose to add other output files.
- `"conll_file":` {
- `"to_do":` [boolean] Set to true if you want to output a file tokenized with the predictions.
- `"metadata":` [boolean] Set to true if you want all metadata (from input or pre-processing) to appear.
- `"with_gold_labels":` [boolean] Set to true if you want to keep a column with gold labels.
- `"txt_file":` {
- `"to_do":` [boolean] Set to true if you want to output a file as plain text with EDU in between brackets.
- `"metadata":` [boolean]Set to true if you want all metadata (from input or pre-processing) to appear.
## Usecase 2 : **Segmentation Evaluation**
- `"usecase_description":` [string] This field is not a fonctional one. You can describe your project or keep the default text. ```"Config file for usecase_2 : Take a EDU gold segmented text au format tok as input, use a loaded model to make predictions. Output scores of model predictions against gold"```
- `"input":` { These fields are mandatory for every usecase.
- `"name":` [string] The name of your input dataset, without the extension. This is also the name of the directory where you put your input dataset. e.g. ```"my.cool.dataset"```
- `"exte":` [string] The extension of your input dataset, which reflects its format.
- OPTIONS : [".conllu", ".tok", ".ttok", ".ss"]
- `"NER_format_initialisation":` [boolean] Set to true if you are working with ToNy. *Set to true anyway ??*
- `"to_do":` [boolean] Set to true if you need at least one pre-processing step to be done.
- `"syntactic_tool":` **"stanza"** For now, [stanza](https://stanfordnlp.github.io/stanza/) is the only tool available.
- `"sentence_split":` [boolean]
- `"tokenization":` [boolean]
- `"syntactic_parsing":` [boolean]
- `"create_metadata":` { Create label following line and/or sentence splits.
- `"to_do":` [boolean] Set to true if you want at last one creation of metadata.
- `"line":` [string] Assign label e.g. ```paragraph```. A umeral count will be added automatically.
- `"sent":` [string] Assign label e.g. ```paragraph```. A umeral count will be added automatically.
- `"discourse_segmenter":` {
- `"model":` [string] Here the name or the path to the existing model you want to use. e.g. `"tony"`, `"/moredata/andiamo/discut/Results_split.tok/results_eng.rst.gum_bert/model.tar.gz"`
...
...
- `"train_data_path":` **null**
- `"validation_data_path":` **null**
- `"post-processing":` { The toolkit AllenNlp output a JSON.
- `"json_to_tab":` [boolean] : **true**
- `"tab_to_bracket":` [boolean] Set to true if you want also an output as the raw text with brackets as EDU delimiter. If so, `"json_to_tab"` has to be set to true too.
- `"evaluation":` [boolean] : **true**
- `"gold_test_data_path":` [string] The path to your gold dataset to make predictions to, and to evaluate against.
- `"output":` { The toolkit AllenNlp output a JSON. You can choose to add other output files.
- `"conll_file":` {
- `"to_do":` [boolean] Set to true if you want to output a file tokenized with the predictions.
- `"metadata":` [boolean] Set to true if you want all metadata (from input or pre-processing) to appear.
- `"with_gold_labels":` [boolean] Set to true if you want to keep a column with gold labels.
- `"txt_file":` {
- `"to_do":` [boolean] Set to true if you want to output a file as plain text with EDU in between brackets.
- `"metadata":` [boolean]Set to true if you want all metadata (from input or pre-processing) to appear.
## Usecase 3 : **Custom Model Creation**
- `"usecase_description":` [string] This field is not a fonctional one. You can describe your project or keep the default text. ```"Config file for usecase_3 : Take a EDU gold segmented set of train/dev/test of texts au format conll as input, train a model, output scores."```
- `"input":` { These fields are mandatory for every usecase.
- `"name":` [string] The name of your input dataset, without the extension. This is also the name of the directory where you put your input dataset. e.g. ```"my.cool.dataset"```
- `"exte":` [string] The extension of your input dataset, which reflects its format.
- OPTIONS : [".conllu", ".tok", ".ttok", ".ss"]
- `"NER_format_initialisation":` [boolean] Set to true if you are working with ToNy. *Set to true anyway ??*
- `"to_do":` [boolean] Set to true if you need at least one pre-processing step to be done.
- `"syntactic_tool":` **"stanza"** For now, [stanza](https://stanfordnlp.github.io/stanza/) is the only tool available.
- `"sentence_split":` [boolean]
- `"tokenization":` [boolean]
- `"syntactic_parsing":` [boolean]
- `"create_metadata":` { Create label following line and/or sentence splits.
- `"to_do":` [boolean] Set to true if you want at last one creation of metadata.
- `"line":` [string] Assign label e.g. ```paragraph```. A umeral count will be added automatically.
- `"sent":` [string] Assign label e.g. ```paragraph```. A umeral count will be added automatically.
- `"discourse_segmenter":` {
- `"model":` **null**
...
...
- `"train_data_path":` [string] The path to your training dataset. e.g. `"../data/eng.rst.rstdt/eng.rst.rstdt_train.conllu"` *conflict with training_config ??*
- `"validation_data_path":` [string] The path to your development dataset. e.g. `"../data/eng.rst.rstdt/eng.rst.rstdt_dev.conllu"` *idem*
- `"post-processing":` { The toolkit AllenNlp output a JSON.
- `"json_to_tab":` [boolean] : **true**
- `"tab_to_bracket":` [boolean] Set to true if you want also an output as the raw text with brackets as EDU delimiter. If so, `"json_to_tab"` has to be set to true too.
- `"evaluation":` [boolean] : **true**
- `"evaluation":` [boolean] : [boolean] Set to true if you want to evaluate your new model against a testset (defined below)
- `"gold_test_data_path":` [string] The path to your gold test dataset to make predictions on, and to evaluate against.
- `"output":` { The toolkit AllenNlp output a JSON. You can choose to add other output files.
- `"conll_file":` {
- `"to_do":` [boolean] Set to true if you want to output a file tokenized with the predictions.
- `"metadata":` [boolean] Set to true if you want all metadata (from input or pre-processing) to appear. n
- `"with_gold_labels":` [boolean] Set to true if you want to keep a column with gold labels.
- `"txt_file":` {
- `"to_do":` [boolean] Set to true if you want to output a file as plain text with EDU in between brackets.
- `"metadata":` [boolean]Set to true if you want all metadata (from input or pre-processing) to appear.
## Usecase 4 : **Custom Model Fine-tuning**
- `"usecase_description":` [string] This field is not a fonctional one. You can describe your project or keep the default text. ```"Config file for usecase_3 : Take a EDU gold segmented set of train/dev/test of texts au format conll as input, train a model, output scores."```
- `input:`{ These fields are mandatory for every usecases.
- `"name":` [string] The name of your input test dataset, without the extension. This is also the same name of the directory where you put your input dataset. e.g. ```"my.cool.dataset"```, `"eng.rst.rstdt_dev"`
- `"exte":` [string] The extension of your input dataset that reflects its format.
- `"language":` [string] Language ID of your dataset following the ISO 639-1 Code. e.g. ```"en"```
- `"existing_metadata"` [boolean] Set to true if your input text contains metadata. Each line of metadata will start with `#`.
- `"steps":` {
- `"main":` [string] : **"fine_tune"**
- `"pre-processing":` {
- `"to_do":` [boolean] Set to true if you need at last one pre-process to be done.
- `"syntactic_tool":` **"stanza"** For now, [stanza](https://stanfordnlp.github.io/stanza/) is the only tool available.
- `"sentence_split":` [boolean]
- `"tokenization":` [boolean]
- `"syntactic_parsing":` [boolean]
- `"create_metadata":` { Create label following line and/or sentence splits.
- `"to_do":` [boolean] Set to true if you want at last one creation of metadata.
- `"line":` [string] Assign label e.g. ```paragraph```. A umeral count will be added automatically.
- `"sent":` [string] Assign label e.g. ```paragraph```. A umeral count will be added automatically.
- `"discourse_segmenter":` {
- `"model":` [string] Path of the model to be fine-tuned.
- `"training":` {
- `"toolkit":` [string] The toolkit to build your model (to be added : "jiant").
- OPTIONS : ["allennlp"]
- `"pre_trained_lm":` **bert** (to be added : roberta..)
- `"config_file":` [string] The path to the config file for training. e.g. `"../model/config_training.jsonnet"`. This file need to be completed accordingly.
- `"train_data_path":` [string] The path to your training dataset. e.g. `"../data/eng.sdrt.stac/eng.sdrt.stac_train.conllu"` *conflict with training_config ??*
- `"validation_data_path":` [string] The path to your development dataset. e.g. `"../data/eng.sdrt.stac/eng.sdrt.stac_dev.conllu"` *idem*
- `"evaluation":` [boolean] : [boolean] Set to true if you want to evaluate your new model against a testset (defined below)
- `"gold_test_data_path":` [string] The path to your gold test dataset to make predictions on, and to evaluate against. e.g. `"eng.rst.rstdt_dev"`
- `"output":` { The toolkit AllenNlp output a JSON. You can choose to add other output files.
- `"conll_file":` {
- `"to_do":` [boolean] Set to true if you want to output a file tokenized with the predictions.
- `"metadata":` [boolean] Set to true if you want all metadata (from input or pre-processing) to appear.
- `"with_gold_labels":` [boolean] Set to true if you want to keep a column with gold labels.
- `"txt_file":` {
- `"to_do":` [boolean] Set to true if you want to output a file as plain text with EDU in between brackets.
- `"metadata":` [boolean]Set to true if you want all metadata (from input or pre-processing) to appear.