diff --git a/global_config_file_guideline.md b/global_config_file_guideline.md
index fb4d0ca8ec40c23e6b98376ea7bcb2a0aefca2e6..8b1b6b082298ad8eb43c1539d79e7a7704d598fa 100644
--- a/global_config_file_guideline.md
+++ b/global_config_file_guideline.md
@@ -11,27 +11,31 @@
-- Keep comas as in the templates to avoid errors on JSON format.
+- Keep commas as in the templates to avoid errors in the JSON format.
 
-## For Usecase 1 : **Discourse Segmentation**
+## Usecase 1 : **Discourse Segmentation**
 
-- `"usecase_description":` [string] This field is not a fonctional one. You can describe your project or keep the default text. e.g. ```"Config file for usecase_1 : from a text, get the same text but with EDU bracket using ToNy segmenter."```
+- `"usecase_description":` [string] This field is not a functional one. You can describe your project or keep the default text. e.g. ```"Config file for usecase_1 : from a text, get the same text but with EDU brackets using the ToNy segmenter."```
 
-- `input:`{ These fields are mandatory for every Usecases.
+- `input:`{ These fields are mandatory for every use case.
 
   - `"name":` [string] The name of your input dataset, without the extension. This is also the same name of the directory where you put your input dataset. e.g. ```"my.cool.dataset"```
-  - `"file":` [string] The extension of your input dataset that reflects its format.
-    - OPTIONS :[".conllu", ".tok", ".ttok", ".ss"]
+  - `"exte":` [string] The extension of your input dataset that reflects its format.
+    - OPTIONS : [".conllu", ".conll", ".txt", ".tok", ".ttok", ".ss"]
   - `"language":` [string] Language ID of your dataset following the ISO 639-1 Code. e.g. ```"en"```
+  - `"existing_metadata":` [boolean] Set to true if your input text contains metadata. Each line of metadata will start with `#`.
 
 - `"steps":` {
   - `"main":` [string] : **"annotation"**
 
   - `"pre-processing":` {
-    - `"tokenization":` [false, true] *available for FR*
-    - `"sentence_split":` [false, true] *available for FR*
-    - `"sentence_split_splitor":` [string] This is the toolkit you want for sentence spliting.
-      - OPTIONS : ["stanza"]
-    - `"syntactic_parsing":` [boolean] : **false** *Not yet available*
-    - `"NER_format_initialisation":` [boolean] Set to true if your are working with ToNy. *Set to true anyway ??*
+    - `"to_do":` [boolean] Set to true if you need at least one pre-processing step to be done.
+    - `"syntactic_tool":` **"stanza"** For now, [stanza](https://stanfordnlp.github.io/stanza/) is the only tool available.
+    - `"sentence_split":` [boolean]
+    - `"tokenization":` [boolean]
+    - `"syntactic_parsing":` [boolean]
+    - `"create_metadata":` { Create labels following line and/or sentence splits.
+      - `"to_do":` [boolean] Set to true if you want at least one metadata label to be created.
+      - `"line":` [string] Assign a label, e.g. ```paragraph```; a numeral count will be added automatically.
+      - `"sent":` [string] Assign a label, e.g. ```sentence```; a numeral count will be added automatically.
 
   - `"discourse_segmenter":` {
-    - `"model":` [string] Here the name or the path to the existing model you want to use. e.g. `"tony"`, `"/moredata/andiamo/discut/Results_split.tok/results_eng.rst.gum_bert/model.tar.gz"`
+    - `"model":` [string] The name of, or the path to, the existing model you want to use. e.g. `"tony"`, `"/moredata/andiamo/discut/Results_split.tok/results_eng.rst.gum_bert/model.tar.gz"`
@@ -42,35 +46,46 @@
       - `"train_data_path":` **null**
       - `"validation_data_path":` **null**
 
-  - `"post-processing":` { The toolkit AllenNlp output a JSON.
-    - `"json_to_tab":` [boolean] Set to true if you want also a conll-style output with predictions as last column.
-    - `"tab_to_bracket":` [boolean] Set to true if you want also an output as the raw text with brackets as EDU delimiter. If so, `"json_to_tab"` has to be set to true too.
-
-  - `"evaluation":` [boolean] : **false**
+  - `"evaluation":` [boolean] **false** Cannot be done without a gold dataset.
     - `"gold_test_data_path":` [string] **null**
 
+- `"output":` { The AllenNLP toolkit outputs a JSON file. You can choose to add other output files.
+
+  - `"conll_file":` {
+    - `"to_do":` [boolean] Set to true if you want to output a tokenized, CoNLL-style file with the predictions as the last column.
+    - `"metadata":` [boolean] Set to true if you want all metadata (from input or pre-processing) to appear.
+    - `"with_gold_labels":` [boolean] Set to true if you want to keep a column with gold labels.
+  - `"txt_file":` {
+    - `"to_do":` [boolean] Set to true if you want to output a file as plain text with EDUs in between brackets.
+    - `"metadata":` [boolean] Set to true if you want all metadata (from input or pre-processing) to appear.
+
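+To make the structure concrete, here is a minimal sketch of a usecase 1 config. All values are illustrative (the dataset name, extension and model are placeholders), and the exact nesting should be checked against the shipped template:
+
+```json
+{
+  "usecase_description": "Config file for usecase_1 : segment my.cool.dataset into EDUs.",
+  "input": {
+    "name": "my.cool.dataset",
+    "exte": ".txt",
+    "language": "en",
+    "existing_metadata": false
+  },
+  "steps": {
+    "main": "annotation",
+    "pre-processing": {
+      "to_do": true,
+      "syntactic_tool": "stanza",
+      "sentence_split": true,
+      "tokenization": true,
+      "syntactic_parsing": false,
+      "create_metadata": {
+        "to_do": false,
+        "line": "",
+        "sent": ""
+      }
+    },
+    "discourse_segmenter": {
+      "model": "tony",
+      "training": {
+        "toolkit": null,
+        "pre_trained_lm": null,
+        "config_file": null,
+        "train_data_path": null,
+        "validation_data_path": null
+      }
+    },
+    "evaluation": false,
+    "gold_test_data_path": null
+  },
+  "output": {
+    "conll_file": {
+      "to_do": true,
+      "metadata": true,
+      "with_gold_labels": false
+    },
+    "txt_file": {
+      "to_do": true,
+      "metadata": false
+    }
+  }
+}
+```
+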
 
-## For Usecase 2 : **Segmentation Evaluation**
+## Usecase 2 : **Segmentation Evaluation**
 
-- `"usecase_description":` [string] This field is not a fonctional one. You can describe your project or keep the default text. ```"Config file for usecase_2 : Take a EDU gold segmented text au format tok as input, use a loaded model to make predictions. Output scores of model predictions against gold"```
+- `"usecase_description":` [string] This field is not a functional one. You can describe your project or keep the default text. ```"Config file for usecase_2 : Take an EDU gold segmented text in tok format as input, use a loaded model to make predictions. Output the scores of the model predictions against the gold."```
 
-- `input:`{ These fields are mandatory for every Usecases.
+- `input:`{ These fields are mandatory for every use case.
 
   - `"name":` [string] The name of your input dataset, without the extension. This is also the same name of the directory where you put your input dataset. e.g. ```"my.cool.dataset"```
-  - `"file":` [string] The extension of your input dataset that reflects its format.
-    - OPTIONS :[".conllu", ".tok", ".ttok", ".ss"]
+  - `"exte":` [string] The extension of your input dataset that reflects its format.
+    - OPTIONS : [".conllu", ".conll", ".txt", ".tok", ".ttok", ".ss"]
   - `"language":` [string] Language ID of your dataset following the ISO 639-1 Code. e.g. ```"en"```
+  - `"existing_metadata":` [boolean] Set to true if your input text contains metadata. Each line of metadata will start with `#`.
 
 - `"steps":` {
   - `"main":` [string] : **"test"**
 
   - `"pre-processing":` {
-    - `"tokenization":` [false, true] *available for FR*
-    - `"sentence_split":` [false, true] *available for FR*
-    - `"sentence_split_splitor":` [string] This is the toolkit you want for sentence spliting.
-      - OPTIONS : ["stanza"]
-    - `"syntactic_parsing":` [boolean] : **false** *Not yet available*
-    - `"NER_format_initialisation":` [boolean] Set to true if your are working with ToNy. *Set to true anyway ??*
+    - `"to_do":` [boolean] Set to true if you need at least one pre-processing step to be done.
+    - `"syntactic_tool":` **"stanza"** For now, [stanza](https://stanfordnlp.github.io/stanza/) is the only tool available.
+    - `"sentence_split":` [boolean]
+    - `"tokenization":` [boolean]
+    - `"syntactic_parsing":` [boolean]
+    - `"create_metadata":` { Create labels following line and/or sentence splits.
+      - `"to_do":` [boolean] Set to true if you want at least one metadata label to be created.
+      - `"line":` [string] Assign a label, e.g. ```paragraph```; a numeral count will be added automatically.
+      - `"sent":` [string] Assign a label, e.g. ```sentence```; a numeral count will be added automatically.
 
   - `"discourse_segmenter":` {
-    - `"model":` [string] Here the name or the path to the existing model you want to use. e.g. `"tony"`, `"/moredata/andiamo/discut/Results_split.tok/results_eng.rst.gum_bert/model.tar.gz"`
+    - `"model":` [string] The name of, or the path to, the existing model you want to use. e.g. `"tony"`, `"/moredata/andiamo/discut/Results_split.tok/results_eng.rst.gum_bert/model.tar.gz"`
@@ -81,35 +96,46 @@
       - `"train_data_path":` **null**
       - `"validation_data_path":` **null**
 
-  - `"post-processing":` { The toolkit AllenNlp output a JSON.
-    - `"json_to_tab":` [boolean] : **true**
-    - `"tab_to_bracket":` [boolean] Set to true if you want also an output as the raw text with brackets as EDU delimiter. If so, `"json_to_tab"` has to be set to true too.
-
   - `"evaluation":` [boolean] : **true**
-    - `"gold_test_data_path":` [string] The path to your gold dataset to make predictions to, and to evaluate against.
+    - `"gold_test_data_path":` [string] The path to your gold dataset to make predictions on, and to evaluate against.
 
+- `"output":` { The AllenNLP toolkit outputs a JSON file. You can choose to add other output files.
+
+  - `"conll_file":` {
+    - `"to_do":` [boolean] Set to true if you want to output a tokenized, CoNLL-style file with the predictions as the last column.
+    - `"metadata":` [boolean] Set to true if you want all metadata (from input or pre-processing) to appear.
+    - `"with_gold_labels":` [boolean] Set to true if you want to keep a column with gold labels.
+  - `"txt_file":` {
+    - `"to_do":` [boolean] Set to true if you want to output a file as plain text with EDUs in between brackets.
+    - `"metadata":` [boolean] Set to true if you want all metadata (from input or pre-processing) to appear.
+
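+Relative to usecase 1, only a few fields change; the `"input"`, `"pre-processing"` and `"discourse_segmenter"` blocks keep the same shape. A hedged sketch of the differing fragment (the gold path below is an illustrative guess following the dataset naming above):
+
+```json
+{
+  "steps": {
+    "main": "test",
+    "evaluation": true,
+    "gold_test_data_path": "../data/my.cool.dataset/my.cool.dataset_test.conllu"
+  },
+  "output": {
+    "conll_file": {
+      "to_do": true,
+      "metadata": true,
+      "with_gold_labels": true
+    },
+    "txt_file": {
+      "to_do": false,
+      "metadata": false
+    }
+  }
+}
+```
+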
 
-## For Usecase 3 : **Custom Model Creation**
+## Usecase 3 : **Custom Model Creation**
 
-- `"usecase_description":` [string] This field is not a fonctional one. You can describe your project or keep the default text. ```"Config file for usecase_3 : Take a EDU gold segmented set of train/dev/test of texts au format conll as input, train a model, output scores."```
+- `"usecase_description":` [string] This field is not a functional one. You can describe your project or keep the default text. ```"Config file for usecase_3 : Take an EDU gold segmented set of train/dev/test texts in conll format as input, train a model, output scores."```
 
-- `input:`{ These fields are mandatory for every Usecases.
+- `input:`{ These fields are mandatory for every use case.
 
   - `"name":` [string] The name of your input dataset, without the extension. This is also the same name of the directory where you put your input dataset. e.g. ```"my.cool.dataset"```
-  - `"file":` [string] The extension of your input dataset that reflects its format.
-    - OPTIONS :[".conllu", ".tok", ".ttok", ".ss"]
+  - `"exte":` [string] The extension of your input dataset that reflects its format.
+    - OPTIONS : [".conllu", ".conll", ".txt", ".tok", ".ttok", ".ss"]
   - `"language":` [string] Language ID of your dataset following the ISO 639-1 Code. e.g. ```"en"```
+  - `"existing_metadata":` [boolean] Set to true if your input text contains metadata. Each line of metadata will start with `#`.
 
 - `"steps":` {
   - `"main":` [string] : **"train"**
 
   - `"pre-processing":` {
-    - `"tokenization":` [false, true] *available for FR*
-    - `"sentence_split":` [false, true] *available for FR*
-    - `"sentence_split_splitor":` [string] This is the toolkit you want for sentence spliting.
-      - OPTIONS : ["stanza"]
-    - `"syntactic_parsing":` [boolean] : **false** *Not yet available*
-    - `"NER_format_initialisation":` [boolean] Set to true if your are working with ToNy. *Set to true anyway ??*
+    - `"to_do":` [boolean] Set to true if you need at least one pre-processing step to be done.
+    - `"syntactic_tool":` **"stanza"** For now, [stanza](https://stanfordnlp.github.io/stanza/) is the only tool available.
+    - `"sentence_split":` [boolean]
+    - `"tokenization":` [boolean]
+    - `"syntactic_parsing":` [boolean]
+    - `"create_metadata":` { Create labels following line and/or sentence splits.
+      - `"to_do":` [boolean] Set to true if you want at least one metadata label to be created.
+      - `"line":` [string] Assign a label, e.g. ```paragraph```; a numeral count will be added automatically.
+      - `"sent":` [string] Assign a label, e.g. ```sentence```; a numeral count will be added automatically.
 
   - `"discourse_segmenter":` {
     - `"model":` **null**
@@ -121,9 +147,66 @@
       - `"train_data_path":` [string] The path to your training dataset. e.g. `"../data/eng.rst.rstdt/eng.rst.rstdt_train.conllu"` *conflict with training_config ??*
       - `"validation_data_path":` [string] The path to your development dataset. e.g. `"../data/eng.rst.rstdt/eng.rst.rstdt_dev.conllu"` *idem*
 
-  - `"post-processing":` { The toolkit AllenNlp output a JSON.
-    - `"json_to_tab":` [boolean] : **true**
-    - `"tab_to_bracket":` [boolean] Set to true if you want also an output as the raw text with brackets as EDU delimiter. If so, `"json_to_tab"` has to be set to true too.
-
-  - `"evaluation":` [boolean] : **true**
+  - `"evaluation":` [boolean] Set to true if you want to evaluate your new model against a test set (defined below).
     - `"gold_test_data_path":` [string] The path to your gold test dataset to make predictions on, and to evaluate against.
 
+- `"output":` { The AllenNLP toolkit outputs a JSON file. You can choose to add other output files.
+
+  - `"conll_file":` {
+    - `"to_do":` [boolean] Set to true if you want to output a tokenized, CoNLL-style file with the predictions as the last column.
+    - `"metadata":` [boolean] Set to true if you want all metadata (from input or pre-processing) to appear.
+    - `"with_gold_labels":` [boolean] Set to true if you want to keep a column with gold labels.
+  - `"txt_file":` {
+    - `"to_do":` [boolean] Set to true if you want to output a file as plain text with EDUs in between brackets.
+    - `"metadata":` [boolean] Set to true if you want all metadata (from input or pre-processing) to appear.
+
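+A hedged sketch of the training-specific fragment of a usecase 3 config, reusing the example paths above (the test-set path is an illustrative guess following the same naming pattern; the other blocks are filled as in usecase 1):
+
+```json
+{
+  "steps": {
+    "main": "train",
+    "discourse_segmenter": {
+      "model": null,
+      "training": {
+        "toolkit": "allennlp",
+        "pre_trained_lm": "bert",
+        "config_file": "../model/config_training.jsonnet",
+        "train_data_path": "../data/eng.rst.rstdt/eng.rst.rstdt_train.conllu",
+        "validation_data_path": "../data/eng.rst.rstdt/eng.rst.rstdt_dev.conllu"
+      }
+    },
+    "evaluation": true,
+    "gold_test_data_path": "../data/eng.rst.rstdt/eng.rst.rstdt_test.conllu"
+  }
+}
+```
+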
`"../data/eng.rst.rstdt/eng.rst.rstdt_dev.conllu"` *idem* - - `"post-processing":` { The toolkit AllenNlp output a JSON. - - `"json_to_tab":` [boolean] : **true** - - `"tab_to_bracket":` [boolean] Set to true if you want also an output as the raw text with brackets as EDU delimiter. If so, `"json_to_tab"` has to be set to true too. - - - `"evaluation":` [boolean] : **true** + - `"evaluation":` [boolean] : [boolean] Set to true if you want to evaluate your new model against a testset (defined below) - `"gold_test_data_path":` [string] The path to your gold test dataset to make predictions on, and to evaluate against. + +- `"output":` { The toolkit AllenNlp output a JSON. You can choose to add other output files. + + - `"conll_file":` { + - `"to_do":` [boolean] Set to true if you want to output a file tokenized with the predictions. + - `"metadata":` [boolean] Set to true if you want all metadata (from input or pre-processing) to appear. n + - `"with_gold_labels":` [boolean] Set to true if you want to keep a column with gold labels. + - `"txt_file":` { + - `"to_do":` [boolean] Set to true if you want to output a file as plain text with EDU in between brackets. + - `"metadata":` [boolean]Set to true if you want all metadata (from input or pre-processing) to appear. + + +## Usecase 4 : **Custom Model Fine-tuning** + +- `"usecase_description":` [string] This field is not a fonctional one. You can describe your project or keep the default text. ```"Config file for usecase_3 : Take a EDU gold segmented set of train/dev/test of texts au format conll as input, train a model, output scores."``` + +- `input:`{ These fields are mandatory for every usecases. + + - `"name":` [string] The name of your input test dataset, without the extension. This is also the same name of the directory where you put your input dataset. e.g. ```"my.cool.dataset"```, `"eng.rst.rstdt_dev"` + - `"exte":` [string] The extension of your input dataset that reflects its format. + - OPTIONS :[".conllu", ".conll", ".txt" .tok", ".ttok", ".ss"] + - `"language":` [string] Language ID of your dataset following the ISO 639-1 Code. e.g. ```"en"``` + - `"existing_metadata"` [boolean] Set to true if your input text contains metadata. Each line of metadata will start with `#`. + +- `"steps":` { + - `"main":` [string] : **"fine_tune"** + + - `"pre-processing":` { + - `"to_do":` [boolean] Set to true if you need at last one pre-process to be done. + - `"syntactic_tool":` **"stanza"** For now, [stanza](https://stanfordnlp.github.io/stanza/) is the only tool available. + - `"sentence_split":` [boolean] + - `"tokenization":` [boolean] + - `"syntactic_parsing":` [boolean] + - `"create_metadata":` { Create label following line and/or sentence splits. + - `"to_do":` [boolean] Set to true if you want at last one creation of metadata. + - `"line":` [string] Assign label e.g. ```paragraph```. A umeral count will be added automatically. + - `"sent":` [string] Assign label e.g. ```paragraph```. A umeral count will be added automatically. + + - `"discourse_segmenter":` { + - `"model":` [string] Path of the model to be fine-tuned. + - `"training":` { + - `"toolkit":` [string] The toolkit to build your model (to be added : "jiant"). + - OPTIONS : ["allennlp"] + - `"pre_trained_lm":` **bert** (to be added : roberta..) + - `"config_file":` [string] The path to the config file for training. e.g. `"../model/config_training.jsonnet"`. This file need to be completed accordingly. + - `"train_data_path":` [string] The path to your training dataset. e.g. 
`"../data/eng.sdrt.stac/eng.sdrt.stac_train.conllu"` *conflict with training_config ??* + - `"validation_data_path":` [string] The path to your development dataset. e.g. `"../data/eng.sdrt.stac/eng.sdrt.stac_dev.conllu"` *idem* + + - `"evaluation":` [boolean] : [boolean] Set to true if you want to evaluate your new model against a testset (defined below) + - `"gold_test_data_path":` [string] The path to your gold test dataset to make predictions on, and to evaluate against. e.g. `"eng.rst.rstdt_dev"` + +- `"output":` { The toolkit AllenNlp output a JSON. You can choose to add other output files. + + - `"conll_file":` { + - `"to_do":` [boolean] Set to true if you want to output a file tokenized with the predictions. + - `"metadata":` [boolean] Set to true if you want all metadata (from input or pre-processing) to appear. + - `"with_gold_labels":` [boolean] Set to true if you want to keep a column with gold labels. + - `"txt_file":` { + - `"to_do":` [boolean] Set to true if you want to output a file as plain text with EDU in between brackets. + - `"metadata":` [boolean]Set to true if you want all metadata (from input or pre-processing) to appear. +