[Paper] CODI 2024 -- Feature-augmented model for multilingual discourse relation classification
This repository hosts the code for the paper "Feature-augmented model for multilingual discourse relation classification" by Eleni Metheniti, Chloé Braud, and Philippe Muller, to be presented at CODI 2024.
Datasets
The datasets come from the DISRPT 2021 Shared Task 3: Discourse Relation Classification across Formalisms. The repository for the data can be found on GitHub. (Note that some datasets are only available to users who already hold a copy of the underlying non-open-source corpora, such as PDTB 3.0 (Prasad et al., 2019). Please refer to the README file of each dataset in the Shared Task repository.)
After cloning the repo and converting the underscored files, either copy the data folder to this repo's main folder, or point to the data folder with the --data_path argument.
The full list of datasets with statistics: here.
Prerequisites
- torch (tested on 1.12 with CUDA)
- transformers
- scikit-learn
Install requirements with pip install -r requirements.txt.
Run
Run the classifier with features
python classifier_features_pytorch.py \
--langs_to_use [LIST OF DATASETS IN ONE STR SEPARATED BY ;] \
--mappings_file [NAME TO SAVE THE MAPPINGS] \
--normalize_direction ['disco'/'discret'/'no']
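The command above takes all datasets as a single string separated by ;. Because ; also separates commands in most shells, that string has to be quoted. The following is a minimal sketch of assembling the call from Python; the dataset names and the mappings file name are illustrative placeholders, not values taken from this repository.

```python
import shlex

# Values listed for --normalize_direction in the command above.
ALLOWED_DIRECTIONS = {"disco", "discret", "no"}


def build_command(datasets, mappings_file, normalize_direction="no"):
    """Assemble the classifier call as a list of arguments."""
    if normalize_direction not in ALLOWED_DIRECTIONS:
        raise ValueError(f"unexpected direction: {normalize_direction!r}")
    return [
        "python", "classifier_features_pytorch.py",
        # All datasets go into ONE string, separated by ';'.
        "--langs_to_use", ";".join(datasets),
        "--mappings_file", mappings_file,
        "--normalize_direction", normalize_direction,
    ]


# Hypothetical dataset names and file name, for illustration only.
cmd = build_command(["eng.rst.gum", "fra.sdrt.annodis"], "mappings.tsv", "disco")

# shlex.join quotes the ';'-separated string so it is safe to paste into a shell.
print(shlex.join(cmd))
```

Passing the list to subprocess.run(cmd) avoids the shell entirely, so no quoting is needed in that case.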
Additional arguments and defaults:
--tranformer_model "bert-base-multilingual-cased" \
--num_epochs 10 \
--batch_size 8 \
--gradient_accumulation_steps 16 \
--use_cuda "yes"
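With the defaults above, each optimizer update would see 8 x 16 = 128 examples, assuming the script follows the usual gradient accumulation pattern (sum gradients over several batches, then update once). The sketch below only does the arithmetic; the function names are hypothetical and the rounding behaviour (a final partial accumulation still triggers an update) is an assumption, not something confirmed by the code in this repository.

```python
import math


def effective_batch_size(batch_size, gradient_accumulation_steps):
    """Number of examples contributing to one optimizer update."""
    return batch_size * gradient_accumulation_steps


def optimizer_steps_per_epoch(num_examples, batch_size, gradient_accumulation_steps):
    """Optimizer updates per epoch, assuming the last partial group is flushed."""
    batches = math.ceil(num_examples / batch_size)
    return math.ceil(batches / gradient_accumulation_steps)


# Defaults from the argument list above.
print(effective_batch_size(8, 16))

# Hypothetical training set of 10,000 examples.
print(optimizer_steps_per_epoch(10_000, 8, 16))
```

If GPU memory forces a smaller --batch_size, raising --gradient_accumulation_steps by the same factor keeps this product constant.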
References
Prasad, Rashmi, Webber, Bonnie, Lee, Alan, & Joshi, Aravind. (2019). Penn Discourse Treebank Version 3.0 [Data set]. Linguistic Data Consortium. https://doi.org/10.35111/QEBF-GK47