# Dense Retrieval for Low Resource Languages - SIGIR 2025

## Getting started

### Dataset preparation

#### Train dataset
The train dataset should be downloaded from https://www.irit.fr/AmharicResources/2025-amharic-dense-retrieval-training-dataset/ and stored in `datasets/train` using the following hierarchy:

- `datasets/train/corpus.tsv`: the corpus containing the documents
- `datasets/train/topics.tsv`: the set of topics
- `datasets/train/triples_by_id_num.tsv`: the list of triples, formatted as tab-separated values with: `topic_id, doc_id_positive, doc_id_negative`
- `datasets/train/triples_by_id_num.jsonl`: the list of triples, formatted as JSON Lines where each line is an array with the elements `[topic_id, doc_id_positive, doc_id_negative]`
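
As a quick sanity check of this layout, the triples files can be loaded with a few lines of Python. This is only a sketch based on the formats listed above; the file names and column order are taken from this section.

```python
import csv
import json

# TSV triples: one (topic_id, doc_id_positive, doc_id_negative) row per line.
with open("datasets/train/triples_by_id_num.tsv", encoding="utf-8") as f:
    tsv_triples = [tuple(row) for row in csv.reader(f, delimiter="\t")]

# JSON Lines triples: each line is an array [topic_id, doc_id_positive, doc_id_negative].
with open("datasets/train/triples_by_id_num.jsonl", encoding="utf-8") as f:
    jsonl_triples = [json.loads(line) for line in f if line.strip()]

print(f"{len(tsv_triples)} TSV triples, {len(jsonl_triples)} JSONL triples")
print("first triple:", tsv_triples[0])
```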
#### Test dataset
The test dataset should be downloaded from https://www.irit.fr/AmharicResources/airtc-the-amharic-adhoc-information-retrieval-test-collection/ and stored in `datasets/test` using the following hierarchy:

- `datasets/test/corpus.jsonl`: the corpus containing the documents, formatted as JSON Lines where each line has the following fields:
  - `_id`: document id
  - `title`: empty
  - `text`: unicode-escaped content of the document
- `datasets/test/topics.jsonl`: the set of topics, formatted as JSON Lines where each line has the following fields:
  - `_id`: topic id
  - `text`: unicode-escaped content of the topic
  - `metadata`: empty
- `datasets/test/qrels.tsv`: the list of query-document relevance scores, formatted as tab-separated values with: `topic_id, iteration, doc_id, relevance`
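
The test collection can be inspected the same way. The field names below are the ones listed above; the sketch assumes the qrels file has no header row and that relevance values are integers.

```python
import json

def read_jsonl(path):
    # Each line is one JSON object (a document or a topic).
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

corpus = {doc["_id"]: doc["text"] for doc in read_jsonl("datasets/test/corpus.jsonl")}
topics = {t["_id"]: t["text"] for t in read_jsonl("datasets/test/topics.jsonl")}

# qrels.tsv: topic_id, iteration, doc_id, relevance (tab-separated).
qrels = {}
with open("datasets/test/qrels.tsv", encoding="utf-8") as f:
    for line in f:
        topic_id, _iteration, doc_id, relevance = line.rstrip("\n").split("\t")
        qrels.setdefault(topic_id, {})[doc_id] = int(relevance)

print(len(corpus), "documents,", len(topics), "topics,", len(qrels), "judged topics")
```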
### Environment setup

Tested with Python `3.10.12`.
Create a virtual environment, isolated from your global environment, with your preferred virtual environment tool. In our experiments, the following commands were used:

```bash
virtualenv --system-site-packages .env
source .env/bin/activate
```

The required packages are listed in `requirements.txt` and should be installed with:

```bash
pip install -r requirements.txt
```
### Models preparation

The experiments can be run with BERT models compatible with the `transformers` library. To use a model, it should be downloaded from Hugging Face to the `models/` folder using:

```bash
huggingface-cli download MODEL_ID --local-dir models/MODEL_ID
```

Example for the Amharic BERT model:

```bash
huggingface-cli download rasyosef/bert-medium-amharic --local-dir models/rasyosef/bert-medium-amharic
```
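
Once downloaded, the model can be loaded from the local folder with the `transformers` library. The snippet below is only a minimal check that the local copy is usable, using the Amharic BERT example above.

```python
from transformers import AutoModel, AutoTokenizer

model_dir = "models/rasyosef/bert-medium-amharic"

# Load the tokenizer and encoder weights from the local copy rather than the Hub.
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModel.from_pretrained(model_dir)

inputs = tokenizer("ሰላም ዓለም", return_tensors="pt")  # "Hello world" in Amharic
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, sequence length, hidden size)
```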
## Running an experiment

SPLADE and ColBERT experiments use the same configuration file structure, except for the `config['train']['triples']` field, which must be adjusted for the tool being used.

To ensure traceability of the results and the parameters used, the configuration file is updated after execution with the following fields:

- Before fine-tuning:
  - `config['results_path']`: path to the retrieval results
  - `config['eval_path']`: path to the evaluation results
- After fine-tuning:
  - `config['train']['results_path']`: path to the retrieval results
  - `config['train']['eval_path']`: path to the evaluation results

Since the configuration file is modified in place, it is recommended to pass a copy of the original configuration file when running experiments.

Example configuration files:

- `configs/colbert/roberta-amharic-text-embedding-base.2AIRTC.training.json`
- `configs/splade/roberta-amharic-text-embedding-base.2AIRTC.training.json`
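
As an illustration of these traceability fields, the sketch below reloads a configuration file after an experiment and prints the paths that were added. Only the field names listed above are assumed; the rest of the configuration layout depends on the tool and is not shown.

```python
import json

config_path = "configs/splade/roberta-amharic-text-embedding-base.2AIRTC.training.json"

with open(config_path, encoding="utf-8") as f:
    config = json.load(f)

# Fields written back by the run, before fine-tuning...
print("retrieval results:", config.get("results_path"))
print("evaluation results:", config.get("eval_path"))
# ...and after fine-tuning.
print("retrieval results (fine-tuned):", config.get("train", {}).get("results_path"))
print("evaluation results (fine-tuned):", config.get("train", {}).get("eval_path"))
```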
### With ColBERTv2

Set the field `config['train']['triples']` to the JSONL triples file `./datasets/train/triples_by_id_num.jsonl`, then run:

```bash
python -m eval_colbert "$config_path"
```
### With SPLADE

Set the field `config['train']['triples']` to the TSV triples file `./datasets/train/triples_by_id_num.tsv`, then run:

```bash
python -m eval_splade "$config_path"
```
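
Putting the last two steps together, the sketch below makes a working copy of an example configuration file, points `config['train']['triples']` at the triples file for the chosen tool, and launches the corresponding module. The tool-to-triples mapping follows the two sections above; the name of the copy is arbitrary, and the rest of the configuration content is assumed to already be in place.

```python
import json
import shutil
import subprocess
import sys

tool = "colbert"  # or "splade"

# Entry point and triples file per tool, as described above.
settings = {
    "colbert": ("eval_colbert", "./datasets/train/triples_by_id_num.jsonl"),
    "splade": ("eval_splade", "./datasets/train/triples_by_id_num.tsv"),
}
module, triples = settings[tool]

# Work on a copy so the original configuration file is not modified in place.
original = f"configs/{tool}/roberta-amharic-text-embedding-base.2AIRTC.training.json"
config_path = f"configs/{tool}/roberta-amharic-text-embedding-base.2AIRTC.training.copy.json"
shutil.copyfile(original, config_path)

with open(config_path, encoding="utf-8") as f:
    config = json.load(f)
config["train"]["triples"] = triples
with open(config_path, "w", encoding="utf-8") as f:
    json.dump(config, f, ensure_ascii=False, indent=2)

# Equivalent to: python -m eval_colbert "$config_path" (or eval_splade).
subprocess.run([sys.executable, "-m", module, config_path], check=True)
```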