Dense Retrieval for Low Resource Languages - SIGIR 2025

This repository supports the experiments presented in our SIGIR 2025 paper. We explore the effectiveness of dense retrieval methods (ColBERTv2 and SPLADE) on Amharic, a morphologically rich and low-resource language.

1. Datasets

We used three datasets:

  • Amharic DR Training Dataset: 152 queries along with their relevant documents, curated for training dense retrievers.
  • 2AIRTC: A traditional IR test collection with 240 queries and 6,960 documents, along with their relevance judgments.
  • AfriCLIRMatrix: A cross-lingual dataset whose English queries we translated to Amharic using NLLB-200.

2. Hardware

The hardware used for the BM25 computations was 4 AMD EPYC Rome 7402 CPU cores running at 2.8 GHz.

The hardware used for the SPLADE and ColBERT experiments was 4 Intel Xeon 2640 CPU cores running at 2.4 GHz and a GTX 1080 GPU, on which the models were fine-tuned and evaluated.

3. Detailed results and run time

Detailed results and run time are available as supplementary material.

4. Citation

If you use this work, please cite:

Tilahun Yeshambel, Moncef Garouani, Serge Molina, and Josiane Mothe. 2025.
Dense Retrieval for Low Resource Languages – The Case of Amharic Language.
In Proceedings of SIGIR 2025. ACM. PDF

5. Getting started

5.1 Dataset preparation

5.1.a Train dataset

The train dataset should be downloaded from https://www.irit.fr/AmharicResources/2025-amharic-dense-retrieval-training-dataset/ and stored in datasets/train using the following hierarchy:

  • datasets/train/corpus.tsv: the corpus containing the documents
  • datasets/train/topics.tsv: the set of topics
  • datasets/train/triples_by_id_num.tsv: the list of triples, formatted as tab-separated values with:
    topic_id, doc_id_positive, doc_id_negative
  • datasets/train/triples_by_id_num.jsonl: the list of triples, formatted as JSON Lines where each line is an array with the following elements:
    [topic_id, doc_id_positive, doc_id_negative]
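The two triples files carry the same information in different encodings. As a quick orientation (this helper is not part of the repository), a minimal sketch that parses one line of each format:

```python
import json

def parse_tsv_triple(line: str) -> tuple:
    """Parse one tab-separated triple: topic_id, doc_id_positive, doc_id_negative."""
    topic_id, pos_id, neg_id = line.rstrip("\n").split("\t")
    return topic_id, pos_id, neg_id

def parse_jsonl_triple(line: str) -> tuple:
    """Parse one JSON Lines triple: [topic_id, doc_id_positive, doc_id_negative]."""
    topic_id, pos_id, neg_id = json.loads(line)
    return topic_id, pos_id, neg_id

# The same triple in both formats:
print(parse_tsv_triple("12\t345\t678"))      # → ('12', '345', '678')
print(parse_jsonl_triple("[12, 345, 678]"))  # → (12, 345, 678)
```

Note that the TSV parser yields string ids while the JSON parser preserves the numeric ids, which is why the two tools consume different files.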

5.1.b Test dataset

The test dataset should be downloaded from https://www.irit.fr/AmharicResources/airtc-the-amharic-adhoc-information-retrieval-test-collection/ and stored in datasets/test using the following hierarchy:

  • datasets/test/corpus.jsonl: the corpus containing the documents, formatted as JSON Lines where each line has the following fields:
    • _id: document id
    • title: empty
    • text: unicode-escaped content of the document
  • datasets/test/topics.jsonl: the set of topics, formatted as JSON Lines where each line has the following fields:
    • _id: topic id
    • text: unicode-escaped content of the topic
    • metadata: empty
  • datasets/test/qrels.tsv: the list of query-document relevancy scores, formatted as tab-separated values with:
    topic_id, iteration, doc_id, relevance
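To illustrate the test-collection layout above, here is a minimal sketch (hypothetical helpers, not part of the repository) that parses the JSON Lines files by their `_id` field and the TREC-style qrels file:

```python
import json

def load_jsonl(lines):
    """Index JSON Lines records (corpus.jsonl / topics.jsonl) by their `_id` field."""
    records = {}
    for line in lines:
        obj = json.loads(line)
        records[obj["_id"]] = obj
    return records

def load_qrels(lines):
    """Parse qrels lines: topic_id, iteration, doc_id, relevance."""
    qrels = {}
    for line in lines:
        topic_id, _iteration, doc_id, relevance = line.split()
        qrels.setdefault(topic_id, {})[doc_id] = int(relevance)
    return qrels

corpus = load_jsonl(['{"_id": "d1", "title": "", "text": "..."}'])
qrels = load_qrels(["q1 0 d1 1"])
print(qrels["q1"]["d1"])  # → 1
```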

5.2 Environment setup

Tested with Python 3.10.12

A virtual environment, isolated from the user's global environment, should be created with their preferred tool. In our experiments, the following commands were used:

virtualenv --system-site-packages .env
source .env/bin/activate

The required packages are listed in the requirements file and should be installed with the following command:

pip install -r requirements.txt

5.3 Models preparation

The experiments can be run with BERT models compatible with the transformers library. To use a model, it should be downloaded from Hugging Face to the models/ folder using:

huggingface-cli download MODEL_ID --local-dir models/MODEL_ID

Example for the Amharic BERT model:

huggingface-cli download rasyosef/bert-medium-amharic --local-dir models/rasyosef/bert-medium-amharic

5.4 Running an experiment

SPLADE and ColBERT experiments use the same configuration file structure, except for the config['train']['triples'] field, which must be adjusted according to the tool being used.

Example configuration files:

  • configs/colbert/roberta-amharic-text-embedding-base.2AIRTC.training.json
  • configs/splade/roberta-amharic-text-embedding-base.2AIRTC.training.json
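The example files above define the exact configuration schema. For orientation only, a hypothetical fragment showing the one field the text says differs between the two tools (all other fields omitted), here with the SPLADE value:

```json
{
  "train": {
    "triples": "./datasets/train/triples_by_id_num.tsv"
  }
}
```

For ColBERT, the same field would instead point to the JSON Lines triples file.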

To ensure traceability of results and parameters, the configuration file is updated after execution with the following fields:

  • Before fine-tuning:

    • config['results_path']: path to the retrieval results
    • config['eval_path']: path to the evaluation results
  • After fine-tuning:

    • config['train']['results_path']: path to the retrieval results
    • config['train']['eval_path']: path to the evaluation results

It is recommended to pass a copy of the original configuration file when running experiments, as the passed configuration is updated and written back to disk in place with the produced search results and their TREC evaluation.
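A minimal sketch of this copy-then-run workflow (the helper and paths are hypothetical; only the in-place update behavior comes from the text above):

```python
import json
import shutil
from pathlib import Path

def run_from_config_copy(original_config: str, working_config: str) -> dict:
    """Run an experiment on a copy of the configuration file.

    The experiment scripts rewrite the configuration they receive, so
    working on a copy keeps the original file reusable. Returns the
    updated configuration, which records the produced output paths.
    """
    shutil.copy(original_config, working_config)
    # e.g. subprocess.run(["python", "-m", "eval_splade", working_config])
    return json.loads(Path(working_config).read_text())
```

After the run, fields such as `results_path` and `eval_path` in the returned dictionary point to the produced outputs.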

5.4.1 With ColBERTv2

Set the field config['train']['triples'] to the JSONL triples file:

./datasets/train/triples_by_id_num.jsonl

Then run:

python -m eval_colbert "$config_path"

5.4.2 With SPLADE

Set the field config['train']['triples'] to the TSV triples file:

./datasets/train/triples_by_id_num.tsv

Then run:

python -m eval_splade "$config_path"