[Paper] CODI 2024 -- Feature-augmented model for multilingual discourse relation classification

This repository hosts the code for the paper "Feature-augmented model for multilingual discourse relation classification" by Eleni Metheniti, Chloé Braud, and Philippe Muller, to be presented at CODI 2024.

Datasets

The datasets come from the DISRPT 2021 Shared Task 3: Discourse Relation Classification across Formalisms. The data repository can be found on GitHub. (Note that some datasets are only available to users who already own a copy of the underlying non-open-source corpus, such as PDTB 3.0 (Prasad et al., 2019). Please refer to the README file of each dataset in the Shared Task repository.)

After cloning the data repository and converting the underscored files, either copy its data folder into this repository's main folder, or point to the data folder with the --data_path argument.

The full list of datasets with statistics: here.

Prerequisites

  • torch (tested on 1.12 with CUDA)
  • transformers
  • scikit-learn

Install requirements with pip install -r requirements.txt.

Run

Run the classifier with features

python classifier_features_pytorch.py \
	--langs_to_use [LIST OF DATASETS IN ONE STR SEPARATED BY ;] \
	--mappings_file [NAME TO SAVE THE MAPPINGS] \
	--normalize_direction ['disco'/'discret'/'no']
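As an illustration of the argument format, the following is a minimal sketch of how a ;-separated dataset string could be parsed with argparse. It is not taken from classifier_features_pytorch.py: the helper parse_langs, the file name mappings.tsv, the choice to mark all three arguments as required, and the two dataset names (written in the DISRPT naming style) are assumptions made for the example.

```python
import argparse
from typing import List


def parse_langs(value: str) -> List[str]:
    """Split a ';'-separated string of dataset names, dropping empty entries."""
    return [name.strip() for name in value.split(";") if name.strip()]


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser mirroring the three arguments shown above;
    # the real script may define them differently.
    parser = argparse.ArgumentParser()
    parser.add_argument("--langs_to_use", type=parse_langs, required=True)
    parser.add_argument("--mappings_file", required=True)
    parser.add_argument(
        "--normalize_direction",
        choices=["disco", "discret", "no"],
        required=True,
    )
    return parser


# Example invocation with illustrative values (assumptions, not defaults).
args = build_parser().parse_args([
    "--langs_to_use", "eng.rst.gum;fra.sdrt.annodis",
    "--mappings_file", "mappings.tsv",
    "--normalize_direction", "no",
])
print(args.langs_to_use)
```

On the command line, quote the dataset string so that the shell does not interpret ; as a command separator.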

Additional arguments and defaults:

	--tranformer_model "bert-base-multilingual-cased" \
	--num_epochs 10 \
	--batch_size 8 \
	--gradient_accumulation_steps 16 \
	--use_cuda "yes"
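The batch size and gradient accumulation defaults interact: if the script accumulates gradients in the usual way (an assumption, not verified against the code), each optimizer update sees batch_size × gradient_accumulation_steps examples. The sketch below works through that arithmetic; the function names and the 10,000-example dataset size are hypothetical, and it assumes a final partial accumulation group still triggers an update.

```python
import math


def effective_batch_size(batch_size: int, gradient_accumulation_steps: int) -> int:
    """Number of examples contributing to one optimizer update."""
    return batch_size * gradient_accumulation_steps


def optimizer_steps_per_epoch(
    num_examples: int, batch_size: int, gradient_accumulation_steps: int
) -> int:
    """Optimizer updates per epoch, counting a final partial group as one update."""
    num_batches = math.ceil(num_examples / batch_size)
    return math.ceil(num_batches / gradient_accumulation_steps)


# With the defaults listed above: 8 * 16 examples per update.
print(effective_batch_size(8, 16))

# Hypothetical dataset of 10,000 training examples.
print(optimizer_steps_per_epoch(10_000, 8, 16))
```

Under these assumptions the defaults give 128 examples per update, so lowering --batch_size to fit in GPU memory while raising --gradient_accumulation_steps by the same factor keeps the effective batch size unchanged.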

References

Prasad, Rashmi, Webber, Bonnie, Lee, Alan, & Joshi, Aravind. (2019). Penn Discourse Treebank Version 3.0 [Data set]. Linguistic Data Consortium. https://doi.org/10.35111/QEBF-GK47