
DeepGrail

This repository contains a Python implementation of BertForTokenClassification using TLGbank data to develop part-of-speech taggers and supertaggers.

This code was designed to work with the DeepGrail Linker to provide a wide-coverage syntactic and semantic parser for French. The tagger is nevertheless independent: you can use it with your own tag set.

Usage

Structure

.
├── Datasets                    # TLGbank data
├── SuperTagger                 # Implementation of BertForTokenClassification
│   ├── SuperTagger.py          # Main class
│   └── Tagging_bert_model.py   # Bert model
├── predict.py                  # Example of prediction
└── train.py                    # Example of training

Installation

Python 3.9.10 (warning: do not use Python 3.10+)

Clone the project locally. In a clean Python venv, run pip install -r requirements.txt

Download the pre-trained models, or prepare data for your own training.

How to use

predict.py and train.py show simple examples of how to use the model; feel free to look at them before using the SuperTagger.

Utils

To load m2_dataset.csv, you can use SuperTagger.Utils.utils.read_csv_pgbar(...). This function returns a pandas DataFrame.
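A minimal sketch (the second argument is assumed to be a row limit, and the dataset path and 'X'/'Z' column names follow the examples below):

from SuperTagger.Utils.utils import read_csv_pgbar

# Load the first 20 rows of the dataset with a progress bar.
df = read_csv_pgbar('Datasets/m2_dataset.csv', 20)
print(df['X'][0])  # a sentence
print(df['Z'][0])  # its tag sequence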

Prediction

To run prediction on your data, you need to load a model (saved with this code).

# Module paths follow the repository structure above;
# file_path is the path to your dataset CSV (e.g. m2_dataset.csv).
from SuperTagger.SuperTagger import SuperTagger
from SuperTagger.Utils.utils import read_csv_pgbar

df = read_csv_pgbar(file_path, 20)
texts = df['X'].tolist()

tagger = SuperTagger()

tagger.load_weights("your/model/path")

pred_without_argmax, pred_convert, bert_hidden_state = tagger.predict(texts[7])

print(pred_convert)
#['let', 'dr(0,s,s)', 'let', 'dr(0,dr(0,s,s),np)', 'dr(0,np,n)', 'dr(0,n,n)', 'let', 'n', 'let', 'dl(0,n,n)', 'dr(0,dl(0,dl(0,n,n),dl(0,n,n)),dl(0,n,n))', 'dl(0,n,n)', 'let', 'dr(0,np,np)', 'np', 'dr(0,dl(0,np,np),np)', 'np', 'dr(0,dl(0,np,np),np)', 'np', 'dr(0,dl(0,np,s),dl(0,np,s))', 'dr(0,dl(0,np,s),np)', 'dl(1,s,s)', 'np', 'dr(0,dl(0,np,np),n)', 'n', 'dl(0,s,txt)']
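The three return values are not further documented here: judging by their names, pred_without_argmax holds the raw per-token scores, pred_convert the decoded tag strings, and bert_hidden_state the encoder's hidden states. A minimal sketch, assuming pred_without_argmax is an array of logits of shape [tokens, n_tags] (an assumption, not confirmed by this README):

import numpy as np

# Hypothetical: recover the tag IDs that pred_convert decodes by
# taking the argmax over the tag dimension.
tag_ids = np.argmax(pred_without_argmax, axis=-1)
print(tag_ids)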

Training

# Module paths follow the repository structure above; the import path
# of load_obj is an assumption (it is not shown in this README).
from SuperTagger.SuperTagger import SuperTagger
from SuperTagger.Utils.utils import read_csv_pgbar, load_obj

df = read_csv_pgbar(file_path, 1000)
texts = df['X'].tolist()
tags = df['Z'].tolist()

# Dict to convert IDs to tokens (the dict is saved with the model for prediction)
index_to_super = load_obj('Datasets/index_to_super')

tagger = SuperTagger()

bert_name = 'camembert-base'

tagger.create_new_model(len(index_to_super), bert_name, index_to_super)
# You can instead load an existing model to re-train it:
# tagger.load_weights("your/model/path")

tagger.train(texts, tags, checkpoint=True)

pred_without_argmax, pred_convert, bert_hidden_state = tagger.predict(texts[7])

During training, if you pass checkpoint=True, the model is automatically saved after each epoch in a folder named Training_XX-XX_XX-XX. Use tensorboard=True to write logs to the same folder (run tensorboard --logdir=logs to view them).
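For example, to save checkpoints and write TensorBoard logs in the same run (a minimal sketch reusing the train call above):

# checkpoint=True saves the model after each epoch in Training_XX-XX_XX-XX;
# tensorboard=True writes logs to the same folder.
tagger.train(texts, tags, checkpoint=True, tensorboard=True)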

bert_name can be any model available on Hugging Face.
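For instance, to build a tagger on a different French model (an illustrative, untested assumption; only camembert-base appears in this README's examples):

# 'flaubert/flaubert_base_cased' is a Hugging Face model name used here
# purely to illustrate swapping bert_name.
bert_name = 'flaubert/flaubert_base_cased'
tagger.create_new_model(len(index_to_super), bert_name, index_to_super)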

Authors

Rabault Julien, de Pourtales Caroline