Dense Retrieval for Low Resource Languages - SIGIR 2025

Getting started

Dataset preparation

Train dataset

The train dataset should be downloaded from https://www.irit.fr/AmharicResources/2025-amharic-dense-retrieval-training-dataset/ and stored in datasets/train using the following hierarchy:

  • datasets/train/corpus.tsv: the corpus containing the documents
  • datasets/train/topics.tsv: the set of topics
  • datasets/train/triples_by_id_num.tsv: the list of triples, formatted as tab-separated values with:
    topic_id, doc_id_positive, doc_id_negative
  • datasets/train/triples_by_id_num.jsonl: the list of triples, formatted as JSON Lines where each line is an array with the following elements:
    [topic_id, doc_id_positive, doc_id_negative]
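
As a quick sanity check on the hierarchy above, the training triples can be iterated over with a few lines of Python (a minimal sketch, assuming the TSV file has no header row):

import csv

# Minimal sketch: iterate over the training triples described above,
# where each row is (topic_id, doc_id_positive, doc_id_negative).
with open("datasets/train/triples_by_id_num.tsv", newline="") as f:
    for topic_id, pos_doc_id, neg_doc_id in csv.reader(f, delimiter="\t"):
        print(topic_id, pos_doc_id, neg_doc_id)
        break  # only inspect the first triple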

Test dataset

The test dataset should be downloaded from https://www.irit.fr/AmharicResources/airtc-the-amharic-adhoc-information-retrieval-test-collection/ and stored in datasets/test using the following hierarchy:

  • datasets/test/corpus.jsonl: the corpus containing the documents, formatted as JSON Lines where each line has the following fields:
    • _id: document id
    • title: empty
    • text: unicode-escaped content of the document
  • datasets/test/topics.jsonl: the set of topics, formatted as JSON Lines where each line has the following fields:
    • _id: topic id
    • text: unicode-escaped content of the topic
    • metadata: empty
  • datasets/test/qrels.tsv: the list of query-document relevance judgments, formatted as tab-separated values with:
    topic_id, iteration, doc_id, relevance
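
For inspection, the test collection can be loaded along these lines (a minimal sketch based on the field layout above, assuming qrels.tsv has no header row):

import csv
import json

# Minimal sketch: load the test corpus, topics and qrels described above.
with open("datasets/test/corpus.jsonl") as f:
    corpus = {doc["_id"]: doc["text"] for doc in map(json.loads, f)}

with open("datasets/test/topics.jsonl") as f:
    topics = {topic["_id"]: topic["text"] for topic in map(json.loads, f)}

qrels = {}  # (topic_id, doc_id) -> relevance
with open("datasets/test/qrels.tsv", newline="") as f:
    for topic_id, _iteration, doc_id, relevance in csv.reader(f, delimiter="\t"):
        qrels[(topic_id, doc_id)] = int(relevance)

print(len(corpus), "documents,", len(topics), "topics,", len(qrels), "judged pairs")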

Environment setup

Tested with Python 3.10.12

A virtual environment, isolated from the global Python environment, should be created with your preferred virtual environment tool. In our experiments, the following commands were used:

virtualenv --system-site-packages .env
source .env/bin/activate

The required packages are listed in the requirements file and should be installed with the following command:

pip install -r requirements.txt

Model preparation

The experiments can be run with BERT models compatible with the transformers library. To use a model, it should be downloaded from Hugging Face to the models/ folder using:

huggingface-cli download MODEL_ID --local-dir models/MODEL_ID

Example for the Amharic BERT model:

huggingface-cli download rasyosef/bert-medium-amharic --local-dir models/rasyosef/bert-medium-amharic
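
To check that a downloaded checkpoint is usable, it can be loaded from the local folder with the transformers auto classes (a minimal sketch using the example model above):

from transformers import AutoModel, AutoTokenizer

# Minimal sketch: load the locally downloaded checkpoint and print basic info.
model_dir = "models/rasyosef/bert-medium-amharic"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModel.from_pretrained(model_dir)
print(model.config.model_type, model.config.hidden_size)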

Running an experiment

SPLADE and ColBERT experiments use the same configuration file structure, except for the config['train']['triples'] field, which must be adjusted for the tool being used.

To ensure traceability of the results and of the parameters used, the configuration file is updated after execution with the following fields:

  • Before fine-tuning:
    • config['results_path']: path to the retrieval results
    • config['eval_path']: path to the evaluation results
  • After fine-tuning:
    • config['train']['results_path']: path to the retrieval results
    • config['train']['eval_path']: path to the evaluation results

It is recommended to pass a copy of the original configuration file when running experiments.

Example configuration files:

  • configs/colbert/roberta-amharic-text-embedding-base.2AIRTC.training.json
  • configs/splade/roberta-amharic-text-embedding-base.2AIRTC.training.json
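
Putting this together, an experiment can be prepared on a copy of one of these configuration files and the traceability fields read back afterwards (a minimal sketch; the name of the copy is illustrative, and the triples path must match the tool, as described below):

import json
import shutil

# Minimal sketch: run an experiment on a copy of a configuration file so that
# the traceability fields are written to the copy, not to the original.
original_path = "configs/colbert/roberta-amharic-text-embedding-base.2AIRTC.training.json"
config_path = "configs/colbert/my-experiment.training.json"  # illustrative copy name
shutil.copy(original_path, config_path)

# Point the copy at the triples file expected by the tool (JSONL for ColBERT, TSV for SPLADE).
with open(config_path) as f:
    config = json.load(f)
config["train"]["triples"] = "./datasets/train/triples_by_id_num.jsonl"
with open(config_path, "w") as f:
    json.dump(config, f, indent=2, ensure_ascii=False)

# After running the experiment (e.g. python -m eval_colbert on the copy),
# the copy contains the paths to the produced retrieval and evaluation results:
with open(config_path) as f:
    config = json.load(f)
print(config["train"]["results_path"])
print(config["train"]["eval_path"])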

With ColBERTv2

Set the field config['train']['triples'] to the JSONL triples file:

./datasets/train/triples_by_id_num.jsonl

Then run:

python -m eval_colbert "$config_path"

With SPLADE

Set the field config['train']['triples'] to the TSV triples file:

./datasets/train/triples_by_id_num.tsv

Then run:

python -m eval_splade "$config_path"