# Dense Retrieval for Low Resource Languages - SIGIR 2025
## Overview
This repository supports the experiments presented in our SIGIR 2025 paper. We explore the effectiveness of dense retrieval methods (ColBERTv2 and SPLADE) on Amharic, a morphologically rich and low-resource language.
## 1. Datasets
We used three datasets:
- **[Amharic DR Training Dataset](https://www.irit.fr/AmharicResources/2025-amharic-dense-retrieval-training-dataset/)**: 152 queries along with their relevant documents, curated for training dense retrievers.
...
- **[AfriCLIRMatrix](https://aclanthology.org/2022.emnlp-main.597/)**: A cross-lingual dataset with queries in English, which we translated to Amharic using NLLB-200 (see the translation sketch below).
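
For reference, below is a minimal sketch of how such a translation can be reproduced with the `transformers` translation pipeline. The distilled 600M checkpoint and the example query are assumptions; the paper only states that NLLB-200 was used.

```python
# Hedged sketch: translating English queries to Amharic with NLLB-200.
# The distilled 600M checkpoint is an assumption, not necessarily the
# NLLB-200 variant used in the paper.
from transformers import pipeline

translator = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",
    src_lang="eng_Latn",  # FLORES-200 code for English
    tgt_lang="amh_Ethi",  # FLORES-200 code for Amharic
)

queries = ["history of coffee production in Ethiopia"]  # hypothetical query
for query in queries:
    result = translator(query, max_length=128)
    print(result[0]["translation_text"])
```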
## 2. Hardware
The hardware used for the BM25 computations was 4 AMD EPYC Rome 7402 cores running at 2.8 GHz.
The hardware used for the SPLADE and ColBERT experiments was 4 Intel Xeon 2640 CPU cores running at 2.4 GHz and an NVIDIA GTX 1080 GPU, on which the models were fine-tuned and evaluated.
## 3. Detailed results and run time
Detailed results and run time are available as [supplementary material](Supplementary%20Material.pdf).
...

If you use this work, please cite:

...
> In Proceedings of SIGIR 2025. ACM. [PDF](XXX)
## 5. Getting started
### 5.1 Dataset preparation
#### 5.1.a Train dataset
The train dataset should be downloaded from [https://www.irit.fr/AmharicResources/2025-amharic-dense-retrieval-training-dataset/](https://www.irit.fr/AmharicResources/2025-amharic-dense-retrieval-training-dataset/) and stored in `datasets/train` using the following hierarchy:
- `datasets/train/corpus.tsv`: the corpus containing the documents
...
- `datasets/train/triples_by_id_num.json`: the list of triples, formatted as JSON Lines, where each line is an array with the following elements (see the loading sketch below):
`[topic_id, doc_id_positive, doc_id_negative]`
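
A minimal sketch for loading these triples (the file path follows the hierarchy above):

```python
# Minimal sketch: read the training triples (JSON Lines, one
# [topic_id, doc_id_positive, doc_id_negative] array per line).
import json

with open("datasets/train/triples_by_id_num.json", encoding="utf-8") as f:
    triples = [json.loads(line) for line in f if line.strip()]

for topic_id, pos_doc_id, neg_doc_id in triples[:3]:
    print(topic_id, pos_doc_id, neg_doc_id)
```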
#### 5.1.b Test dataset
The test dataset should be downloaded from [https://www.irit.fr/AmharicResources/airtc-the-amharic-adhoc-information-retrieval-test-collection/](https://www.irit.fr/AmharicResources/airtc-the-amharic-adhoc-information-retrieval-test-collection/) and stored in `datasets/test` using the following hierarchy:
- `datasets/test/corpus.jsonl`: the corpus containing the documents, formatted as JSON Lines, where each line has the following fields:
...
- `datasets/test/qrels.tsv`: the list of query-document relevance judgments, formatted as tab-separated values with the following columns (a parsing sketch follows below):
`topic_id, iteration, doc_id, relevance`
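
This matches the standard TREC qrels layout; below is a minimal parsing sketch (the `iteration` column is conventionally unused):

```python
# Minimal sketch: parse TREC-style qrels into {topic_id: {doc_id: relevance}}.
import csv
from collections import defaultdict

qrels = defaultdict(dict)
with open("datasets/test/qrels.tsv", encoding="utf-8") as f:
    for topic_id, _iteration, doc_id, relevance in csv.reader(f, delimiter="\t"):
        qrels[topic_id][doc_id] = int(relevance)
```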
### 5.2 Environment setup
> Tested with Python `3.10.12`
A virtual environment, isolated from the global environment, should be created using your preferred virtual environment tool. In our experiments, the following commands were used:
```bash
...
pip install -r requirements.txt
```
### 5.3 Model preparation
The experiments can be run with BERT models compatible with the `transformers` library. To use a model, it should be downloaded from Hugging Face to the `models/` folder using:
```bash
...
```

Example for the Amharic BERT model:

...
SPLADE and ColBERT experiments use the same configuration file structure, except for the `config['train']['triples']` field, which must be adjusted according to the tool being used.
To ensure traceability of results and of the parameters used, the configuration file is updated after execution with the following fields (sketched after the list):
...
- `config['train']['results_path']`: path to the retrieval results
- `config['train']['eval_path']`: path to the evaluation results
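
The full configuration schema is not shown in this README; the sketch below assumes a YAML file with a hypothetical name and only the fields named here:

```python
# Hedged sketch: read the fields named in this README from a
# hypothetical YAML configuration file. The file name and the
# surrounding schema are assumptions.
import yaml

with open("configs/experiment.yaml") as f:  # hypothetical path
    config = yaml.safe_load(f)

print(config["train"]["triples"])       # training triples path (tool-specific format)
print(config["train"]["results_path"])  # written after execution: retrieval results
print(config["train"]["eval_path"])     # written after execution: evaluation results
```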
> It is recommended to pass a copy of the original configuration file when running experiments, as **the passed configuration will be updated and written to disk in place** with the produced search results and their TREC evaluation.
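
A minimal way to follow that recommendation (file names are hypothetical):

```python
# Minimal sketch: run each experiment on a copy so the original
# configuration file is never overwritten in place.
import shutil

shutil.copy("configs/experiment.yaml", "configs/experiment.run1.yaml")
```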