Skip to content

Thorsten-Trinkaus/NERC

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

37 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

GitHub License GitHub last commit

NERC of Different Granularities

Course project for Formale Semantik (University of Heidelberg). We investigate Named Entity Recognition & Classification (NERC) under increasing label granularity (from coarse-grained up to ultra-fine entity typing).

Most of our code was written and executed inside Jupyter Notebooks. The data we used and produced was too large to store it in the GitHub Repository, but they are stored in the corresponding last folder and do not need to be recreated. To run the notebooks (at least the T5 related notebook) we provide a requirements.txt. This file can be used via:

python -m pip install -r requirements.txt

Datasets

OntoNotes: The 90% Solution (Hovy et al., NAACL 2006)

Fine-grained entity recognition (FIGER) (Ling and Weld, AAAI 2012)

Ultra-Fine Entity Typing (Choi et al., ACL 2018)

Dataset Analysis: OntoNotes, FIGER, and Ultra-Fine

Overview

Dataset Structure

Challenges for T5 and NLI

Recommended Preprocessing Strategies


Dataset Structure

Note:
Since the three datasets differ significantly in structure, we first analyze each dataset individually.

For Ultra-Fine, we split the dataset into:

  • Ultra-Fine Crowdsourced (ds_fine_crowd)
  • Ultra-Fine Distantly Supervised (ds_fine_ds)

Due to their substantial differences, we effectively work with four datasets in total.


General Analysis and Overview

Dataset Task Granularity Multi-Label
OntoNotes Classical NER Coarse No
FIGER Fine-Grained Entity Typing Fine Yes
Ultra-Fine Ultra-Fine Entity Typing Very Fine Yes

Dataset Size

Dataset Size

The datasets vary significantly in size.
Ultra-Fine and FIGER are substantially larger than OntoNotes, while ds_fine_crowd is much smaller than ds_fine_ds.


Labels

Dataset Unique Labels Multi-Label
OntoNotes 4 No
FIGER ~100 Yes
Ultra-Fine 10k+ Yes

Average Labels

ds_fine_crowd has the highest number of labels per mention, closely followed by FIGER.
OntoNotes is strictly single-label, while ds_fine_ds also has relatively few labels per entity.


Mentions with no Label (%)

Both FIGER and OntoNotes contain a portion of mentions without labels.


Multi-word Entity Ratio

In FIGER, all entities are single tokens, whereas in the other datasets, a significant portion of entities consists of multiple words.


OntoNotes: The 90% Solution (Hovy et al., NAACL 2006)

OntoNotes is a benchmark dataset for classical Named Entity Recognition (NER).

ds_onto

Metric Value
Entities 35089
Unique Labels 4
Multi-word Entities 12917 (36.81%)
Avg Labels/Entity 1.00
Max Labels/Entity 1

Labels

  • PER
  • LOC
  • ORG
  • MISC

OntoNotes contains only four labels and is strictly single-label, making it the dataset with the lowest granularity.

OntoNotes Top Labels


Fine-Grained Entity Recognition (FIGER) (Ling and Weld, AAAI 2012)

FIGER extends classical NER to fine-grained entity typing. It is also the largest dataset used in this project.

ds_figer

Metric Value
Entities 4,047,079
Unique Labels 91
Multi-word Entities 0 (0.00%)
Avg Labels/Entity 4.62
Max Labels/Entity 25

Labels

Instead of broad categories like PERSON, FIGER introduces hierarchical labels such as:

  • /person/actor
  • /person/politician
  • /location/city
  • /organization/company

FIGER Top Labels

FIGER Label Distribution

Most entities have between 1 and 5 labels, although rare cases reach up to 25 labels.


Ultra-Fine Entity Typing (Choi et al., ACL 2018)

The Ultra-Fine dataset pushes entity typing further by allowing very specific semantic descriptions.


Ultra-Fine Distantly Supervised (ds_fine_ds)

Metric Value
Entities 3,152,711
Unique Labels 4261
Multi-word Entities 1,749,718 (55.50%)
Avg Labels/Entity 2.18
Max Labels/Entity 11

Ultra-Fine Crowdsourced (ds_fine_crowd)

Metric Value
Entities 5994
Unique Labels 2519
Multi-word Entities 3000 (50.05%)
Avg Labels/Entity 5.39
Max Labels/Entity 19

Differences

  • ds_fine_ds is significantly larger than ds_fine_crowd
  • ds_fine_ds contains more total labels but fewer labels per entity on average

Labels

Labels are often natural language descriptions rather than fixed ontology entries.

Examples:

  • person
  • musician
  • politician
  • father
  • skyscraper

Ultra-Fine Crowdsourced Top Labels

Ultra-Fine Distantly Supervised Top Labels

Interresting how location in fine_ds is the Top Label, but in fine_crowd it is way less comon.

Ultra-Fine Crowd Label Distribution

Ultra-Fine DS Label Distribution


Challenges for T5 and NLI

T5

T5 (Text-to-Text Transfer Transformer) is a transformer-based model developed by Google that converts all NLP tasks into a text-to-text format.


OntoNotes

  • Span dependency:
    The model must correctly identify entity spans before classification.

FIGER

  • Multi-label generation:
    Entities often have multiple labels → requires generating label sets.

  • Hierarchical labels:
    Labels require structured understanding.

  • Class imbalance:
    Frequent labels dominate training.


Ultra-Fine

  • Extremely large label space:
    Thousands of labels → difficult generalization.

  • Open vocabulary labels:
    Labels are natural language.

  • Long-tail distribution:
    Many rare labels.


NLI


OntoNotes

  • Inefficient formulation:
    Too few labels for NLI to be efficient.

FIGER

  • Ignored label dependencies:
    Hierarchy is not modeled.

Ultra-Fine

  • Label ambiguity:
    Semantic overlap between labels.

Recommended Preprocessing Strategies

General Strategies

1. Label Normalization

Convert labels into a consistent format:

  • /person/actor → actor
  • film_actor → actor

2. Lowercasing and Cleaning

  • Convert to lowercase
  • Remove special characters (/, _)

Dataset-Specific Strategies

OntoNotes

  • Extract one label per entity span

Map labels to natural language:

  • PERSON → person
  • ORG → organization

FIGER

Example:

Muddy Waters → ['/person/musician', '/person/actor', '/person/artist']

  • Split hierarchical labels:

'/person/actor' → person, actor

  • Optionally limit hierarchy depth

Ultra-Fine

Example:

They → ['expert', 'scholar', 'scientist', 'person']

  • Frequency filtering

  • Remove rare labels (<10 occurrences)

  • Top-k selection

  • Keep most relevant labels per mention


Strategies for T5

Format output as:

entity → label1, label2, label3

Use:

  • sorted labels
  • limit number of generated labels

Strategies for NLI

Template standardization

Example:

"The entity is a musician."

About

Project to investigate how pre-trained language models handle NER/NERC as label granularity increases, comparing fine-tuning, probing, and NLI-based approaches.

Topics

Resources

License

Code of conduct

Stars

Watchers

Forks

Contributors