momoth12/Knowledge-Base-Construction

Knowledge Base Construction

Group project for the LLM & Structured Data course at Télécom Paris.

After cloning the project, initialize the git submodules:

git submodule update --init

Create and initialize a Python environment:

conda create -n knowledge-base-construction python=3.12.1
conda activate knowledge-base-construction
pip install -r dataset/requirements.txt

Overview

Structure of the kbc folder:

├── wikidata/   # Wikidata search and disambiguation utilities
│
├── dataset.py  # Dataset loader
│
└── model.py    # LLM wrappers

Access Models

You must first request access to the MistralAI and Llama models on Hugging Face.

Then generate a Hugging Face token with read access to gated models.

Finally, log in with the token using huggingface-cli:

huggingface-cli login

Performance

Test a prediction file against the ground truth:

python dataset/evaluate.py -g dataset/data/train.jsonl -p predictions.jsonl
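The metrics computed by dataset/evaluate.py are not described here. As a rough illustration of how list-valued answers are commonly scored against a ground truth, the sketch below computes set-based precision, recall and F1 for a single question. The function name and the convention for two empty lists are assumptions, not the behavior of the actual script.

```python
from typing import Iterable, Tuple


def precision_recall_f1(predicted: Iterable[str], expected: Iterable[str]) -> Tuple[float, float, float]:
    """Set-based precision, recall and F1 for one question (hypothetical helper)."""
    pred, gold = set(predicted), set(expected)

    # Assumed convention: predicting nothing when nothing is expected is a perfect answer.
    if not pred and not gold:
        return 1.0, 1.0, 1.0

    true_positives = len(pred & gold)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # One of the two predicted items is correct, and one expected item is missed.
    print(precision_recall_f1(["Q1", "Q2"], ["Q2", "Q3"]))
```

Treating answers as sets ignores ordering and duplicates, which suits relations whose answer is an unordered list.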

Dataset

There are 5 relations in the dataset:

| ID | Constraints | Description |
|---|---|---|
| countryLandBordersCountry | list, can be empty | Which other countries share a land border with the given country |
| personHasCityOfDeath | single value, can be null | In which city the given person died |
| seriesHasNumberOfEpisodes | int | How many episodes the TV series has |
| awardWonBy | list, can be empty | Which people won the given award |
| companyTradesAtStockExchange | list, can be empty | On which stock exchanges the given company trades |
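The constraints in this table can be checked mechanically before a prediction is written out. The following is a minimal sketch, assuming that list answers are Python lists, that a single value is either a string or None, and that episode counts are plain integers. The names CONSTRAINTS and is_valid_answer are hypothetical and not part of the kbc package.

```python
from typing import Any

# Constraint kind per relation, transcribed from the table of relations.
CONSTRAINTS = {
    "countryLandBordersCountry": "list",
    "personHasCityOfDeath": "single",
    "seriesHasNumberOfEpisodes": "int",
    "awardWonBy": "list",
    "companyTradesAtStockExchange": "list",
}


def is_valid_answer(relation: str, answer: Any) -> bool:
    """Return True if the answer has the shape the relation's constraint allows."""
    kind = CONSTRAINTS[relation]
    if kind == "list":
        # Lists may be empty, but must still be lists.
        return isinstance(answer, list)
    if kind == "single":
        # A single value may be null (None); assumed to be a string otherwise.
        return answer is None or isinstance(answer, str)
    # "int": bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(answer, int) and not isinstance(answer, bool)


if __name__ == "__main__":
    print(is_valid_answer("awardWonBy", []))
    print(is_valid_answer("seriesHasNumberOfEpisodes", "12"))
```

A check like this catches shape errors, such as a string where an integer is expected, before the prediction file is evaluated.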

The dataset is divided into three splits. The training and validation splits both include the answers:

| Dataset | Questions |
|---|---|
| Training | 377 |
| Validation | 378 |
| Testing | 378 |

Data Distribution

Dataset Balance

Answers Per Relation

Results

Disambiguation

Note that the training dataset has been fixed (incorrect entries or rows have been altered or removed).

| Relation | Best Method | Accuracy |
|---|---|---|
| awardWonBy | | 100% |
| countryLandBordersCountry | | 100% |
| companyTradesAtStockExchange | | 100% |
| personHasCityOfDeath | | 98.04% |

You can run the test for a given relation with the following command:

python test_disambiguation.py <relation>
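The accuracy figures above are percentages of correctly disambiguated entities. How test_disambiguation.py computes them is not shown here, so the sketch below is only an assumption about the general idea: compare predicted Wikidata IDs to expected ones and report the fraction that match. The function name and the sample IDs are illustrative.

```python
from typing import Sequence


def accuracy(predicted_ids: Sequence[str], expected_ids: Sequence[str]) -> float:
    """Fraction of positions where the predicted ID equals the expected ID."""
    if len(predicted_ids) != len(expected_ids):
        raise ValueError("predicted and expected must have the same length")
    if not expected_ids:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == e for p, e in zip(predicted_ids, expected_ids))
    return correct / len(expected_ids)


if __name__ == "__main__":
    # Illustrative Wikidata IDs: the first two predictions match, the third does not.
    score = accuracy(["Q90", "Q64", "Q84"], ["Q90", "Q64", "Q1"])
    print(f"{score:.2%}")
```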

About

Final group project for the class APM_5AI29_TP Language models and structured data. The report is available in the repository.
