Group project for the LLM & Structured Data course at Télécom Paris.
Clone the project and initialize the git submodules:

```
git submodule update --init
```

Create and activate a Python environment, then install the dependencies:

```
conda create -n knowledge-base-construction python=3.12.1
conda activate knowledge-base-construction
pip install -r dataset/requirements.txt
```

Structure of the kbc folder:
```
├── wikidata/   # Wikidata search and disambiguation utilities
├── dataset.py  # Dataset loader
└── model.py    # LLM wrappers
```
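The actual APIs of `dataset.py`, `model.py`, and `wikidata/` are not shown in this README. As a rough illustration of what the disambiguation step in `wikidata/` has to do, here is a toy sketch (all names and the candidate format are hypothetical, not the project's real interface):

```python
def disambiguate(label, candidates):
    """Toy disambiguation: given a surface label produced by the LLM and a
    list of (QID, label) Wikidata candidates, pick the candidate whose label
    matches case-insensitively; fall back to the first candidate."""
    norm = label.strip().lower()
    for qid, cand_label in candidates:
        if cand_label.strip().lower() == norm:
            return qid
    # No exact match: return the top-ranked candidate, or None if empty.
    return candidates[0][0] if candidates else None
```

The real utilities presumably query the Wikidata search API and apply richer ranking than this exact-label heuristic.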
You must first request access to the Mistral AI and Llama models on Hugging Face:
- https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct
- https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3
Then generate a Hugging Face token with read access to gated repositories.
Finally, log in with the token using huggingface-cli:

```
huggingface-cli login
```

To test a prediction file against the ground truth:

```
python dataset/evaluate.py -g dataset/data/train.jsonl -p predictions.jsonl
```

There are 5 relations in the dataset:
| ID | Constraints | Description |
|---|---|---|
| countryLandBordersCountry | list, can be empty | Which other countries share a land border with the given country |
| personHasCityOfDeath | single value, can be null | In which city the given person died |
| seriesHasNumberOfEpisodes | int | How many episodes the TV series has |
| awardWonBy | list, can be empty | Which people won the given award |
| companyTradesAtStockExchange | list, can be empty | On which stock exchanges the given company trades |
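Since some relations are list-valued and others are single-valued or numeric, scoring has to handle both shapes. The exact metric used by `dataset/evaluate.py` is not reproduced here; the following is a minimal exact-match accuracy sketch under the assumption that each row carries an `"id"` and an `"objects"` field (both field names are assumptions):

```python
def exact_match(gold, pred):
    """Order-insensitive exact match for list-valued answers;
    scalars, ints, and nulls are compared directly."""
    if isinstance(gold, list) and isinstance(pred, list):
        return sorted(gold) == sorted(pred)
    return gold == pred

def accuracy(gold_rows, pred_rows):
    """Fraction of gold rows whose predicted answer matches exactly.
    Rows are joined on the (assumed) 'id' field."""
    preds = {row["id"]: row["objects"] for row in pred_rows}
    hits = sum(exact_match(row["objects"], preds.get(row["id"]))
               for row in gold_rows)
    return hits / len(gold_rows)
```

The real evaluator may use per-relation precision/recall rather than plain accuracy; this sketch only conveys the shape of the comparison.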
The dataset is divided into three splits. Both the training and validation splits include the answers:
| Dataset | Questions |
|---|---|
| Training | 377 |
| Validation | 378 |
| Testing | 378 |
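The split files (e.g. `dataset/data/train.jsonl`) are in JSON Lines format, one JSON object per line. `dataset.py` provides the project's loader; as a self-contained sketch of reading such a file:

```python
import json

def load_jsonl(path):
    """Read a JSON Lines file: one JSON object per line, blank lines skipped."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

The actual field names inside each object (subject, relation, objects, etc.) are defined by the dataset and not listed in this README.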
Note that the training dataset has been fixed: incorrect entries have been corrected or removed. Best accuracy achieved per relation:
| Relation | Best Method Accuracy |
|---|---|
| awardWonBy | 100% |
| countryLandBordersCountry | 100% |
| companyTradesAtStockExchange | 100% |
| personHasCityOfDeath | 98.04% |
You can run these tests with the following command:

```
python test_disambiguation.py <relation>
```
