Commit 5330bb9c authored by laura.riviere

add new documentation

parent 732c34ef
- Keep commas as in the templates to avoid errors in the JSON format.
## Usecase 1 : **Discourse Segmentation**
- `"usecase_description":` [string] This field is not a functional one. You can describe your project or keep the default text. e.g. ```"Config file for usecase_1 : from a text, get the same text but with EDU bracket using ToNy segmenter."```
- `"input":` { These fields are mandatory for every usecase.
- `"name":` [string] The name of your input dataset, without the extension. This is also the name of the directory where you put your input dataset. e.g. ```"my.cool.dataset"```
- `"exte":` [string] The extension of your input dataset that reflects its format.
- OPTIONS : [".conllu", ".conll", ".txt", ".tok", ".ttok", ".ss"]
- `"language":` [string] Language ID of your dataset following the ISO 639-1 Code. e.g. ```"en"```
- `"existing_metadata":` [boolean] Set to true if your input text contains metadata. Each line of metadata will start with `#`.
- `"steps":` {
- `"main":` [string] **"annotation"**
- `"pre-processing":` {
- `"to_do":` [boolean] Set to true if you need at least one pre-processing step to be done.
- `"syntactic_tool":` **"stanza"** For now, [stanza](https://stanfordnlp.github.io/stanza/) is the only tool available.
- `"sentence_split":` [boolean]
- `"tokenization":` [boolean]
- `"syntactic_parsing":` [boolean]
- `"create_metadata":` { Create labels following line and/or sentence splits.
- `"to_do":` [boolean] Set to true if you want at least one metadata creation.
- `"line":` [string] Assign a label, e.g. ```paragraph```. A numeral count will be added automatically.
- `"sent":` [string] Assign a label, e.g. ```paragraph```. A numeral count will be added automatically.
- `"discourse_segmenter":` {
- `"model":` [string] Here the name or the path to the existing model you want to use. e.g. `"tony"`, `"/moredata/andiamo/discut/Results_split.tok/results_eng.rst.gum_bert/model.tar.gz"`
- `"train_data_path":` **null**
- `"validation_data_path":` **null**
- `"post-processing":` { The AllenNLP toolkit outputs a JSON file.
- `"json_to_tab":` [boolean] Set to true if you also want a conll-style output with predictions as the last column.
- `"tab_to_bracket":` [boolean] Set to true if you also want an output of the raw text with brackets as EDU delimiters. If so, `"json_to_tab"` has to be set to true too.
- `"evaluation":` [boolean] **false** Cannot be done without a gold dataset.
- `"gold_test_data_path":` [string] **null**
- `"output":` { The AllenNLP toolkit outputs a JSON file. You can choose to add other output files.
- `"conll_file":` {
- `"to_do":` [boolean] Set to true if you want to output a file tokenized with the predictions.
- `"metadata":` [boolean] Set to true if you want all metadata (from input or pre-processing) to appear.
- `"with_gold_labels":` [boolean] Set to true if you want to keep a column with gold labels.
- `"txt_file":` {
- `"to_do":` [boolean] Set to true if you want to output a file as plain text with EDU in between brackets.
- `"metadata":` [boolean] Set to true if you want all metadata (from input or pre-processing) to appear.
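Assembled from the fields above, a complete usecase 1 config might look like the sketch below. All values are illustrative, and the exact nesting of the `training` and `evaluation` fields is an assumption based on the field order above; check it against the shipped templates.

```json
{
    "usecase_description": "Config file for usecase_1 : from a text, get the same text but with EDU bracket using ToNy segmenter.",
    "input": {
        "name": "my.cool.dataset",
        "exte": ".txt",
        "language": "en",
        "existing_metadata": false
    },
    "steps": {
        "main": "annotation",
        "pre-processing": {
            "to_do": true,
            "syntactic_tool": "stanza",
            "sentence_split": true,
            "tokenization": true,
            "syntactic_parsing": false,
            "create_metadata": {
                "to_do": true,
                "line": "paragraph",
                "sent": "sentence"
            }
        },
        "discourse_segmenter": {
            "model": "tony",
            "training": {
                "train_data_path": null,
                "validation_data_path": null
            }
        },
        "post-processing": {
            "json_to_tab": true,
            "tab_to_bracket": true
        },
        "evaluation": false,
        "gold_test_data_path": null
    },
    "output": {
        "conll_file": {
            "to_do": true,
            "metadata": true,
            "with_gold_labels": false
        },
        "txt_file": {
            "to_do": true,
            "metadata": true
        }
    }
}
```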
## Usecase 2 : **Segmentation Evaluation**
- `"usecase_description":` [string] This field is not a functional one. You can describe your project or keep the default text. ```"Config file for usecase_2 : Take a EDU gold segmented text au format tok as input, use a loaded model to make predictions. Output scores of model predictions against gold"```
- `"input":` { These fields are mandatory for every usecase.
- `"name":` [string] The name of your input dataset, without the extension. This is also the name of the directory where you put your input dataset. e.g. ```"my.cool.dataset"```
- `"exte":` [string] The extension of your input dataset that reflects its format.
- OPTIONS : [".conllu", ".conll", ".txt", ".tok", ".ttok", ".ss"]
- `"language":` [string] Language ID of your dataset following the ISO 639-1 Code. e.g. ```"en"```
- `"existing_metadata":` [boolean] Set to true if your input text contains metadata. Each line of metadata will start with `#`.
- `"steps":` {
- `"main":` [string] **"test"**
- `"pre-processing":` {
- `"to_do":` [boolean] Set to true if you need at least one pre-processing step to be done.
- `"syntactic_tool":` **"stanza"** For now, [stanza](https://stanfordnlp.github.io/stanza/) is the only tool available.
- `"sentence_split":` [boolean]
- `"tokenization":` [boolean]
- `"syntactic_parsing":` [boolean]
- `"create_metadata":` { Create labels following line and/or sentence splits.
- `"to_do":` [boolean] Set to true if you want at least one metadata creation.
- `"line":` [string] Assign a label, e.g. ```paragraph```. A numeral count will be added automatically.
- `"sent":` [string] Assign a label, e.g. ```paragraph```. A numeral count will be added automatically.
- `"discourse_segmenter":` {
- `"model":` [string] Here the name or the path to the existing model you want to use. e.g. `"tony"`, `"/moredata/andiamo/discut/Results_split.tok/results_eng.rst.gum_bert/model.tar.gz"`
- `"train_data_path":` **null**
- `"validation_data_path":` **null**
- `"post-processing":` { The AllenNLP toolkit outputs a JSON file.
- `"json_to_tab":` [boolean] **true**
- `"tab_to_bracket":` [boolean] Set to true if you also want an output of the raw text with brackets as EDU delimiters. If so, `"json_to_tab"` has to be set to true too.
- `"evaluation":` [boolean] **true**
- `"gold_test_data_path":` [string] The path to your gold dataset to make predictions on, and to evaluate against.
- `"output":` { The AllenNLP toolkit outputs a JSON file. You can choose to add other output files.
- `"conll_file":` {
- `"to_do":` [boolean] Set to true if you want to output a file tokenized with the predictions.
- `"metadata":` [boolean] Set to true if you want all metadata (from input or pre-processing) to appear.
- `"with_gold_labels":` [boolean] Set to true if you want to keep a column with gold labels.
- `"txt_file":` {
- `"to_do":` [boolean] Set to true if you want to output a file as plain text with EDU in between brackets.
- `"metadata":` [boolean] Set to true if you want all metadata (from input or pre-processing) to appear.
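Usecase 2 uses the same overall skeleton as usecase 1; only the step settings change. A minimal sketch of the fields that differ (the nesting and the example path are assumptions, not values from the toolkit):

```json
{
    "steps": {
        "main": "test",
        "post-processing": {
            "json_to_tab": true,
            "tab_to_bracket": false
        },
        "evaluation": true,
        "gold_test_data_path": "../data/my.cool.dataset/my.cool.dataset_test.conllu"
    }
}
```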
## Usecase 3 : **Custom Model Creation**
- `"usecase_description":` [string] This field is not a functional one. You can describe your project or keep the default text. ```"Config file for usecase_3 : Take a EDU gold segmented set of train/dev/test of texts au format conll as input, train a model, output scores."```
- `"input":` { These fields are mandatory for every usecase.
- `"name":` [string] The name of your input dataset, without the extension. This is also the name of the directory where you put your input dataset. e.g. ```"my.cool.dataset"```
- `"exte":` [string] The extension of your input dataset that reflects its format.
- OPTIONS : [".conllu", ".conll", ".txt", ".tok", ".ttok", ".ss"]
- `"language":` [string] Language ID of your dataset following the ISO 639-1 Code. e.g. ```"en"```
- `"existing_metadata":` [boolean] Set to true if your input text contains metadata. Each line of metadata will start with `#`.
- `"steps":` {
- `"main":` [string] **"train"**
- `"pre-processing":` {
- `"to_do":` [boolean] Set to true if you need at least one pre-processing step to be done.
- `"syntactic_tool":` **"stanza"** For now, [stanza](https://stanfordnlp.github.io/stanza/) is the only tool available.
- `"sentence_split":` [boolean]
- `"tokenization":` [boolean]
- `"syntactic_parsing":` [boolean]
- `"create_metadata":` { Create labels following line and/or sentence splits.
- `"to_do":` [boolean] Set to true if you want at least one metadata creation.
- `"line":` [string] Assign a label, e.g. ```paragraph```. A numeral count will be added automatically.
- `"sent":` [string] Assign a label, e.g. ```paragraph```. A numeral count will be added automatically.
- `"discourse_segmenter":` {
- `"model":` **null**
- `"train_data_path":` [string] The path to your training dataset. e.g. `"../data/eng.rst.rstdt/eng.rst.rstdt_train.conllu"` *conflict with training_config ??*
- `"validation_data_path":` [string] The path to your development dataset. e.g. `"../data/eng.rst.rstdt/eng.rst.rstdt_dev.conllu"` *idem*
- `"post-processing":` { The AllenNLP toolkit outputs a JSON file.
- `"json_to_tab":` [boolean] **true**
- `"tab_to_bracket":` [boolean] Set to true if you also want an output of the raw text with brackets as EDU delimiters. If so, `"json_to_tab"` has to be set to true too.
- `"evaluation":` [boolean] Set to true if you want to evaluate your new model against a test set (defined below).
- `"gold_test_data_path":` [string] The path to your gold test dataset to make predictions on, and to evaluate against.
- `"output":` { The AllenNLP toolkit outputs a JSON file. You can choose to add other output files.
- `"conll_file":` {
- `"to_do":` [boolean] Set to true if you want to output a file tokenized with the predictions.
- `"metadata":` [boolean] Set to true if you want all metadata (from input or pre-processing) to appear.
- `"with_gold_labels":` [boolean] Set to true if you want to keep a column with gold labels.
- `"txt_file":` {
- `"to_do":` [boolean] Set to true if you want to output a file as plain text with EDU in between brackets.
- `"metadata":` [boolean] Set to true if you want all metadata (from input or pre-processing) to appear.
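For training a model from scratch, the distinctive part of the config is the `training` block. A sketch of the differing fields for usecase 3, using the example paths above (the test-set path and the nesting of `evaluation` are assumptions):

```json
{
    "steps": {
        "main": "train",
        "discourse_segmenter": {
            "model": null,
            "training": {
                "toolkit": "allennlp",
                "pre_trained_lm": "bert",
                "config_file": "../model/config_training.jsonnet",
                "train_data_path": "../data/eng.rst.rstdt/eng.rst.rstdt_train.conllu",
                "validation_data_path": "../data/eng.rst.rstdt/eng.rst.rstdt_dev.conllu"
            }
        },
        "evaluation": true,
        "gold_test_data_path": "../data/eng.rst.rstdt/eng.rst.rstdt_test.conllu"
    }
}
```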
## Usecase 4 : **Custom Model Fine-tuning**
- `"usecase_description":` [string] This field is not a functional one. You can describe your project or keep the default text. ```"Config file for usecase_3 : Take a EDU gold segmented set of train/dev/test of texts au format conll as input, train a model, output scores."```
- `"input":` { These fields are mandatory for every usecase.
- `"name":` [string] The name of your input test dataset, without the extension. This is also the name of the directory where you put your input dataset. e.g. ```"my.cool.dataset"```, `"eng.rst.rstdt_dev"`
- `"exte":` [string] The extension of your input dataset that reflects its format.
- OPTIONS : [".conllu", ".conll", ".txt", ".tok", ".ttok", ".ss"]
- `"language":` [string] Language ID of your dataset following the ISO 639-1 Code. e.g. ```"en"```
- `"existing_metadata":` [boolean] Set to true if your input text contains metadata. Each line of metadata will start with `#`.
- `"steps":` {
- `"main":` [string] : **"fine_tune"**
- `"pre-processing":` {
- `"to_do":` [boolean] Set to true if you need at least one pre-processing step to be done.
- `"syntactic_tool":` **"stanza"** For now, [stanza](https://stanfordnlp.github.io/stanza/) is the only tool available.
- `"sentence_split":` [boolean]
- `"tokenization":` [boolean]
- `"syntactic_parsing":` [boolean]
- `"create_metadata":` { Create labels following line and/or sentence splits.
- `"to_do":` [boolean] Set to true if you want at least one metadata creation.
- `"line":` [string] Assign a label, e.g. ```paragraph```. A numeral count will be added automatically.
- `"sent":` [string] Assign a label, e.g. ```paragraph```. A numeral count will be added automatically.
- `"discourse_segmenter":` {
- `"model":` [string] Path of the model to be fine-tuned.
- `"training":` {
- `"toolkit":` [string] The toolkit used to build your model (to be added: "jiant").
- OPTIONS : ["allennlp"]
- `"pre_trained_lm":` **bert** (to be added : roberta..)
- `"config_file":` [string] The path to the config file for training. e.g. `"../model/config_training.jsonnet"`. This file needs to be completed accordingly.
- `"train_data_path":` [string] The path to your training dataset. e.g. `"../data/eng.sdrt.stac/eng.sdrt.stac_train.conllu"` *conflict with training_config ??*
- `"validation_data_path":` [string] The path to your development dataset. e.g. `"../data/eng.sdrt.stac/eng.sdrt.stac_dev.conllu"` *idem*
- `"evaluation":` [boolean] Set to true if you want to evaluate your new model against a test set (defined below).
- `"gold_test_data_path":` [string] The path to your gold test dataset to make predictions on, and to evaluate against. e.g. `"eng.rst.rstdt_dev"`
- `"output":` { The AllenNLP toolkit outputs a JSON file. You can choose to add other output files.
- `"conll_file":` {
- `"to_do":` [boolean] Set to true if you want to output a file tokenized with the predictions.
- `"metadata":` [boolean] Set to true if you want all metadata (from input or pre-processing) to appear.
- `"with_gold_labels":` [boolean] Set to true if you want to keep a column with gold labels.
- `"txt_file":` {
- `"to_do":` [boolean] Set to true if you want to output a file as plain text with EDU in between brackets.
- `"metadata":` [boolean] Set to true if you want all metadata (from input or pre-processing) to appear.
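Fine-tuning (usecase 4) differs from usecase 3 mainly in that `"main"` is `"fine_tune"` and `"model"` points to an existing model. A sketch of the differing fields, reusing the example values above (the nesting is an assumption):

```json
{
    "steps": {
        "main": "fine_tune",
        "discourse_segmenter": {
            "model": "/moredata/andiamo/discut/Results_split.tok/results_eng.rst.gum_bert/model.tar.gz",
            "training": {
                "toolkit": "allennlp",
                "pre_trained_lm": "bert",
                "config_file": "../model/config_training.jsonnet",
                "train_data_path": "../data/eng.sdrt.stac/eng.sdrt.stac_train.conllu",
                "validation_data_path": "../data/eng.sdrt.stac/eng.sdrt.stac_dev.conllu"
            }
        },
        "evaluation": true,
        "gold_test_data_path": "eng.rst.rstdt_dev"
    }
}
```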