diff --git a/README.md b/README.md
index 8b8e9d434e22eec3256d85a761fcafa37ce322b3..b2d946820314bc602e173eb1565e17b3d1e62ef2 100644
--- a/README.md
+++ b/README.md
@@ -1,11 +1,9 @@
 # Dense Retrieval for Low Resource Languages - SIGIR 2025
-
-## Overview
 This repository supports the experiments presented in our SIGIR 2025 paper. We explore the effectiveness of dense retrieval methods (ColBERTv2 and SPLADE) on Amharic, a morphologically rich and low-resource language.
 
-## Datasets
+## 1. Datasets
 We used three datasets:
 
 - **[Amharic DR Training Dataset](https://www.irit.fr/AmharicResources/2025-amharic-dense-retrieval-training-dataset/)**: 152 queries along with their relevant documents, curated for training dense retrievers.
@@ -13,15 +11,15 @@ We used three datasets:
 - **[AfriCLIRMatrix](https://aclanthology.org/2022.emnlp-main.597/)**: A cross-lingual dataset with queries in English, in our case translated to Amharic using NLLB-200.
 
-## Hardware
+## 2. Hardware
 The hardware used for the BM25 computations was 4 AMD EPYC Rome 7402 cores running at 2.8 GHz.
 The hardware used for the SPLADE and ColBERT experiments was 4 Intel Xeon 2640 CPU cores running at 2.4 GHz and a GTX1080 GPU on which the models were fine-tuned and evaluated.
 
-## Detailed results and run time
+## 3. Detailed results and run time
 Detailed results and run time are available as [supplementary material](Supplementary%20Material.pdf).
 
-## Citation
+## 4. Citation
 If you use this work, please cite:
 
 > Tilahun Yeshambel, Moncef Garouani, Serge Molina, and Josiane Mothe. 2025.
@@ -29,11 +27,11 @@ If you use this work, please cite:
 > In Proceedings of SIGIR 2025. ACM. [PDF](XXX)
 
-## Getting started
+## 5. Getting started
 
-### Dataset preparation
+### 5.1 Dataset preparation
 
-#### Train dataset
+#### 5.1.1 Train dataset
 The train dataset should be downloaded from [https://www.irit.fr/AmharicResources/2025-amharic-dense-retrieval-training-dataset/](https://www.irit.fr/AmharicResources/2025-amharic-dense-retrieval-training-dataset/) and stored in `datasets/train` using the following hierarchy:
 
 - `datasets/train/corpus.tsv`: the corpus containing the documents
@@ -43,7 +41,7 @@ The train dataset should be downloaded from [https://www.irit.fr/AmharicResource
 - `datasets/train/triples_by_id_num.json`: the list of triples, formatted as JSON Lines where each line is an array with the following elements: `[topic_id, doc_id_positive, doc_id_negative]`
 
-#### Test dataset
+#### 5.1.2 Test dataset
 The test dataset should be downloaded from [https://www.irit.fr/AmharicResources/airtc-the-amharic-adhoc-information-retrieval-test-collection/](https://www.irit.fr/AmharicResources/airtc-the-amharic-adhoc-information-retrieval-test-collection/) and stored in `datasets/test` using the following hierarchy:
 
 - `datasets/test/corpus.jsonl`: the corpus containing the documents, formatted as JSON Lines where each line has the following fields:
@@ -57,7 +55,7 @@ The test dataset should be downloaded from [https://www.irit.fr/AmharicResources
 - `datasets/test/qrels.tsv`: the list of query-document relevancy scores, formatted as tab-separated values with: `topic_id, iteration, doc_id, relevance`
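+
+Once downloaded, the formats described above can be sanity-checked directly from the shell (a minimal, optional sketch using standard Unix tools):
+
+```bash
+# First training triple: a JSON array [topic_id, doc_id_positive, doc_id_negative]
+head -n 1 datasets/train/triples_by_id_num.json
+
+# First test document (JSON Lines) and first relevance judgements
+# (tab-separated: topic_id, iteration, doc_id, relevance)
+head -n 1 datasets/test/corpus.jsonl
+head -n 3 datasets/test/qrels.tsv
+
+# Line counts of the corpus and qrels files
+wc -l datasets/train/corpus.tsv datasets/test/qrels.tsv
+```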
 
-### Environment setup
+### 5.2 Environment setup
 > Tested with Python `3.10.12`
 
 A virtual environment isolated from the user's global environment should be created using their preferred virtual environment tool.
 In our experiments, the following command was used:
@@ -74,7 +72,7 @@
 pip install -r requirements.txt
 ```
 
-### Models preparation
+### 5.3 Models preparation
 The experiments can be run with BERT models compatible with the `transformers` library. To use a model, it should be downloaded from Hugging Face to the `models/` folder using:
 
 ```bash
@@ -87,8 +85,12 @@ Example for the Amharic BERT model:
 huggingface-cli download rasyosef/bert-medium-amharic --local-dir models/rasyosef/bert-medium-amharic
 ```
 
-### Running an experiment
-SPLADE and ColBERT experiments use the same configuration file structure, except for the `config['train']['triples']` field, which must be adjusted for the tool being used.
+### 5.4 Running an experiment
+SPLADE and ColBERT experiments use the same configuration file structure, except for the `config['train']['triples']` field, which must be adjusted according to the tool being used.
+
+Example configuration files:
+- `configs/colbert/roberta-amharic-text-embedding-base.2AIRTC.training.json`
+- `configs/splade/roberta-amharic-text-embedding-base.2AIRTC.training.json`
 
 To ensure traceability of results and used parameters, the configuration file is updated after execution with the following fields:
 
@@ -100,13 +102,9 @@ To ensure traceability of results and used parameters, the configuration file is
 - `config['train']['results_path']`: path to the retrieval results
 - `config['train']['eval_path']`: path to the evaluation results
 
-> It is recommended to pass a copy of the original configuration file when running experiments.
+> It is recommended to pass a copy of the original configuration file when running experiments, as **the passed configuration will be updated and written to disk in place** with the produced search results and their TREC evaluation.
 
-Example configuration files:
-- `configs/colbert/roberta-amharic-text-embedding-base.2AIRTC.training.json`
-- `configs/splade/roberta-amharic-text-embedding-base.2AIRC.training.json`
-
-#### With ColBERTv2
+#### 5.4.1 With ColBERTv2
 Set the field `config['train']['triples']` to the JSONL triples file:
 
 ```bash
@@ -119,7 +117,8 @@ Then run:
 python -m eval_colbert "$config_path"
 ```
 
-#### With SPLADE
+
+#### 5.4.2 With SPLADE
 Set the field `config['train']['triples']` to the TSV triples file:
 
 ```bash
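Putting the steps above together, the ColBERTv2 path amounts to something like the following sketch (the model and configuration file names are the examples used above; `my-colbert-run.json` is an arbitrary name for the working copy):

```bash
# Download the Amharic BERT model into models/ (section 5.3)
huggingface-cli download rasyosef/bert-medium-amharic --local-dir models/rasyosef/bert-medium-amharic

# Work on a copy of the example configuration, since the file passed to a run
# is updated and written back in place (section 5.4)
config_path="my-colbert-run.json"
cp configs/colbert/roberta-amharic-text-embedding-base.2AIRTC.training.json "$config_path"

# In the copy, point config['train']['triples'] at the JSONL triples file (section 5.4.1),
# then launch training, retrieval, and evaluation
python -m eval_colbert "$config_path"
```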