# Dense Retrieval for Low Resource Languages - SIGIR 2025
## Overview
This repository supports the experiments presented in our SIGIR 2025 paper. We explore the effectiveness of dense retrieval methods (ColBERTv2 and SPLADE) on Amharic, a morphologically rich and low-resource language.
## 1. Datasets
We used three datasets:
- **[Amharic DR Training Dataset](https://www.irit.fr/AmharicResources/2025-amharic-dense-retrieval-training-dataset/)**: 152 queries along with their relevant documents, curated for training dense retrievers.
- **[AfriCLIRMatrix](https://aclanthology.org/2022.emnlp-main.597/)**: A cross-lingual dataset with queries in English, which we translated into Amharic using NLLB-200.
## 2. Hardware
The hardware used for the BM25 computations was 4 AMD EPYC 7402 (Rome) cores running at 2.8 GHz.

The hardware used for the SPLADE and ColBERT experiments was 4 Intel Xeon 2640 CPU cores running at 2.4 GHz and a GTX 1080 GPU, on which the models were fine-tuned and evaluated.
## 3. Detailed results and run time
Detailed results and run time are available as [supplementary material](Supplementary%20Material.pdf).
## 4. Citation
If you use this work, please cite:
> Tilahun Yeshambel, Moncef Garouani, Serge Molina, and Josiane Mothe. 2025.
> In Proceedings of SIGIR 2025. ACM. [PDF](XXX)
## 5. Getting started
### 5.1 Dataset preparation
#### 5.1.a Train dataset
The train dataset should be downloaded from [https://www.irit.fr/AmharicResources/2025-amharic-dense-retrieval-training-dataset/](https://www.irit.fr/AmharicResources/2025-amharic-dense-retrieval-training-dataset/) and stored in `datasets/train` using the following hierarchy:
- `datasets/train/corpus.tsv`: the corpus containing the documents
- `datasets/train/triples_by_id_num.json`: the list of triples, formatted as JSON Lines where each line is an array with the following elements:
  `[topic_id, doc_id_positive, doc_id_negative]`
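For instance, the triples file can be checked quickly from the shell (the IDs in the comment below are made up for illustration):

```bash
# Each JSON Lines row is an array: [topic_id, doc_id_positive, doc_id_negative],
# e.g. [3, 1842, 977] (illustrative IDs, not real data).
head -n 2 datasets/train/triples_by_id_num.json
```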
#### 5.1.b Test dataset
The test dataset should be downloaded from [https://www.irit.fr/AmharicResources/airtc-the-amharic-adhoc-information-retrieval-test-collection/](https://www.irit.fr/AmharicResources/airtc-the-amharic-adhoc-information-retrieval-test-collection/) and stored in `datasets/test` using the following hierarchy:
- `datasets/test/corpus.jsonl`: the corpus containing the documents, formatted as JSON Lines where each line has the following fields:
- `datasets/test/qrels.tsv`: the list of query-document relevancy scores, formatted as tab-separated values with:
  `topic_id, iteration, doc_id, relevance`
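The qrels follow the standard TREC layout, so they can be inspected the same way (the values in the comment below are illustrative):

```bash
# Each tab-separated row is: topic_id  iteration  doc_id  relevance,
# e.g. "5  0  DOC123  1" (illustrative values, not real data).
head -n 3 datasets/test/qrels.tsv
```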
### 5.2 Environment setup
> Tested with Python `3.10.12`
A virtual environment isolated from the user's global environment should be created using their preferred virtual environment tool. With that environment activated, the dependencies are installed with:

```bash
pip install -r requirements.txt
```
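Since the code is only known to work with this interpreter version, it is worth confirming it from within the activated environment:

```bash
python --version   # expected: Python 3.10.12, the tested version
```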
### 5.3 Model preparation
The experiments can be run with BERT models compatible with the `transformers` library. To use a model, it should be downloaded from Hugging Face to the `models/` folder using:
```bash
huggingface-cli download <model_id> --local-dir "models/<model_id>"
```

Example for the Amharic BERT model:

```bash
huggingface-cli download rasyosef/bert-medium-amharic --local-dir models/rasyosef/bert-medium-amharic
```
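As an optional sanity check, the downloaded weights can be loaded once with `transformers` (shown for the example model above):

```bash
# Loading the tokenizer and the model confirms the local download is complete and usable.
python -c "from transformers import AutoModel, AutoTokenizer; \
AutoTokenizer.from_pretrained('models/rasyosef/bert-medium-amharic'); \
AutoModel.from_pretrained('models/rasyosef/bert-medium-amharic')"
```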
### 5.4 Running an experiment
SPLADE and ColBERT experiments use the same configuration file structure, except for the `config['train']['triples']` field, which must be adjusted according to the tool being used.
Example configuration files:
- `configs/colbert/roberta-amharic-text-embedding-base.2AIRTC.training.json`
- `configs/splade/roberta-amharic-text-embedding-base.2AIRTC.training.json`
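Since the two files share the same structure, their differences (at least the `triples` field) can be inspected directly:

```bash
# Compare the two example configurations side by side.
diff configs/colbert/roberta-amharic-text-embedding-base.2AIRTC.training.json \
     configs/splade/roberta-amharic-text-embedding-base.2AIRTC.training.json
```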
To ensure traceability of the results and of the parameters used, the configuration file is updated after execution with the following fields:
- `config['train']['results_path']`: path to the retrieval results
- `config['train']['eval_path']`: path to the evaluation results
> It is recommended to pass a copy of the original configuration file when running experiments, as **the passed configuration will be updated and written to disk in place** with the produced search results and their TREC evaluation.
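For example, using one of the example files listed above (the copy name `run1.json` is arbitrary):

```bash
# Work on a throwaway copy: the run rewrites the file it is given.
cp configs/colbert/roberta-amharic-text-embedding-base.2AIRTC.training.json run1.json
config_path=run1.json
```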
#### 5.4.1 With ColBERTv2
Set the field `config['train']['triples']` to the JSONL triples file.
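One way to make that edit from the shell, continuing from the `config_path` copy above (`jq` is illustrative tooling, not something the repository prescribes; the triples path is the train file from section 5.1):

```bash
# Point config['train']['triples'] at the JSON Lines triples of the train dataset.
jq '.train.triples = "datasets/train/triples_by_id_num.json"' "$config_path" > tmp.json \
  && mv tmp.json "$config_path"
```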
Then run:

```bash
python -m eval_colbert "$config_path"
```
#### 5.4.2 With SPLADE
Set the field `config['train']['triples']` to the TSV triples file.
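Assuming the SPLADE entry point mirrors the ColBERT one (the module name below is an assumption, not confirmed by the text above), the run would be analogous:

```bash
# Assumption: SPLADE evaluation mirrors eval_colbert; adjust the module name
# to the repository's actual SPLADE entry point if it differs.
python -m eval_splade "$config_path"
```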