Group project for the LLM & Structured Data course at Télécom Paris.
Clone the project and initialize the git submodules:

```
git submodule update --init
```

Create and activate a Python environment, then install the dependencies:

```
conda create -n knowledge-base-construction python=3.12.1
conda activate knowledge-base-construction
pip install -r dataset/requirements.txt
```

Structure of the kbc folder:
```
├── wikidata/   # Wikidata search and disambiguation utilities
├── dataset.py  # Dataset loader
└── model.py    # LLM wrappers
```
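The actual APIs of `dataset.py`, `model.py`, and `wikidata/` are not shown in this README. As a rough illustration of what the disambiguation step in `wikidata/` has to do, here is a toy sketch (all names and the candidate format are hypothetical, not the project's real interface):

```python
def disambiguate(label, candidates):
    """Toy disambiguation: given a surface label produced by the LLM and a
    list of (QID, label) Wikidata candidates, pick the candidate whose label
    matches case-insensitively; fall back to the first candidate."""
    norm = label.strip().lower()
    for qid, cand_label in candidates:
        if cand_label.strip().lower() == norm:
            return qid
    # No exact match: return the top-ranked candidate, or None if empty.
    return candidates[0][0] if candidates else None
```

The real utilities presumably query the Wikidata search API and apply richer ranking than this exact-label heuristic.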
You must first request access to the Mistral AI and Llama models on Hugging Face:
- https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct
- https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3
Then generate a Hugging Face token with read access to gated repositories.
Finally, log in with the token using huggingface-cli:

```
huggingface-cli login
```

To test a prediction file against the ground truth:

```
python dataset/evaluate.py -g dataset/data/train.jsonl -p predictions.jsonl
```

There are 5 relations in the dataset:
| ID | Constraints | Description |
|---|---|---|
| countryLandBordersCountry | list, can be empty | Which other countries share a land border with the given country |
| personHasCityOfDeath | single value, can be null | In which city the given person died |
| seriesHasNumberOfEpisodes | int | How many episodes the TV series has |
| awardWonBy | list, can be empty | Which people won the given award |
| companyTradesAtStockExchange | list, can be empty | On which stock exchanges the given company trades |
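Since some relations are list-valued and others are single-valued or numeric, scoring has to handle both shapes. The exact metric used by `dataset/evaluate.py` is not reproduced here; the following is a minimal exact-match accuracy sketch under the assumption that each row carries an `"id"` and an `"objects"` field (both field names are assumptions):

```python
def exact_match(gold, pred):
    """Order-insensitive exact match for list-valued answers;
    scalars, ints, and nulls are compared directly."""
    if isinstance(gold, list) and isinstance(pred, list):
        return sorted(gold) == sorted(pred)
    return gold == pred

def accuracy(gold_rows, pred_rows):
    """Fraction of gold rows whose predicted answer matches exactly.
    Rows are joined on the (assumed) 'id' field."""
    preds = {row["id"]: row["objects"] for row in pred_rows}
    hits = sum(exact_match(row["objects"], preds.get(row["id"]))
               for row in gold_rows)
    return hits / len(gold_rows)
```

The real evaluator may use per-relation precision/recall rather than plain accuracy; this sketch only conveys the shape of the comparison.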
The dataset is divided into three splits. Both the training and validation splits include the answers:
| Dataset | Questions |
|---|---|
| Training | 377 |
| Validation | 378 |
| Testing | 378 |
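The split files (e.g. `dataset/data/train.jsonl`) are in JSON Lines format, one JSON object per line. `dataset.py` provides the project's loader; as a self-contained sketch of reading such a file:

```python
import json

def load_jsonl(path):
    """Read a JSON Lines file: one JSON object per line, blank lines skipped."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

The actual field names inside each object (subject, relation, objects, etc.) are defined by the dataset and not listed in this README.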
Note that the training dataset has been fixed: incorrect entries have been corrected or removed. Best accuracy achieved per relation:
| Relation | Best Method Accuracy |
|---|---|
| awardWonBy | 100% |
| countryLandBordersCountry | 100% |
| companyTradesAtStockExchange | 100% |
| personHasCityOfDeath | 98.04% |
You can run these tests with the following command:

```
python test_disambiguation.py <relation>
```
