Evaluating Lexical Proficiency in Neural Language Models


This is the public repository for our paper Evaluating Lexical Proficiency in Neural Language Models (C. Ciaccio, A. Miaschi, F. Dell'Orletta, ACL 2025).

The repository contains the resources and code we developed to run our experiments assessing the lexical proficiency of Italian neural language models. Specifically:

Datasets

The data folder contains:

  • 100-neos.csv → the neologism dataset
  • 100-nonce.csv → the nonce-word dataset
  • ONLI-NEO → data extracted from the Osservatorio Neologico della Lingua Italiana (ONLI)
  • it-dictionary-gz → data extracted from the Italian Wiktionary (Wikizionario)
  • (train-test-val)_dataset.csv → the train, test and validation splits used in our experiments
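
As an illustration, the split files can be read with Python's standard csv module. The column names of the actual CSVs are not documented here, so the loader below stays schema-agnostic and simply keys each row by whatever header the file declares; the data/ path is an assumption about the repository layout.

```python
import csv

def load_split(path):
    """Read one of the (train|test|val)_dataset.csv splits into a list of
    dicts, keyed by the header columns the file itself declares."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))

# Hypothetical usage, assuming the repository's data/ folder:
# train = load_split("data/train_dataset.csv")
# print(len(train), list(train[0].keys()))
```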

More resources related to the Wiktionary data format, the parser and the ONLI scraper can be found in our Italian Wiktionary Parser repository.

Code

The code folder contains finetuningT5.py, the script we used to fine-tune all T5 models in a text-to-text multi-task learning setup (training + evaluation).
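
In a text-to-text multi-task setup, a single T5 model is trained on several tasks at once by prepending a task-specific textual prefix to each input. The sketch below illustrates that formatting step only; the prefixes, task names, and example words are hypothetical placeholders, not the actual prompts used in finetuningT5.py.

```python
# Hypothetical task prefixes for a T5-style multi-task setup over lexical tasks
# (generation, definition, contextual usage). Illustrative only.
TASK_PREFIXES = {
    "define": "definisci: ",        # word -> definition
    "use": "usa in contesto: ",     # word -> usage example
    "generate": "genera parola: ",  # definition -> word
}

def make_example(task, source_text, target_text):
    """Build one (input, target) pair in the text-to-text format T5 expects."""
    return {"input": TASK_PREFIXES[task] + source_text, "target": target_text}

# Example pairs (illustrative words and targets):
pairs = [
    make_example("define", "petaloso", "ricco di petali"),
    make_example("use", "petaloso", "Il fiore era davvero petaloso."),
]
```

With inputs formatted this way, all tasks can share one seq2seq training loop, since the model distinguishes them purely from the prefix text.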

Human evaluation data

The annotation folder contains the human-annotated novelty and adhesion scores for the nonce-word setting, for each model (in batches of 25).

If you use any of these resources in your work, we kindly ask that you cite our paper:

@inproceedings{ciaccio-etal-2025-evaluating,
    title = "Evaluating Lexical Proficiency in Neural Language Models",
    author = "Ciaccio, Cristiano  and
      Miaschi, Alessio  and
      Dell{'}Orletta, Felice",
    editor = "Che, Wanxiang  and
      Nabende, Joyce  and
      Shutova, Ekaterina  and
      Pilehvar, Mohammad Taher",
    booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2025",
    address = "Vienna, Austria",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.acl-long.64/",
    pages = "1267--1286",
    ISBN = "979-8-89176-251-0",
    abstract = "We present a novel evaluation framework designed to assess the lexical proficiency and linguistic creativity of Transformer-based Language Models (LMs). We validate the framework by analyzing the performance of a set of LMs of different sizes, in both mono- and multilingual configuration, across tasks involving the generation, definition, and contextual usage of lexicalized words, neologisms, and nonce words. To support these evaluations, we developed a novel dataset of lexical entries for the Italian language, including curated definitions and usage examples sourced from various online platforms. The results highlight the robustness and effectiveness of our framework in evaluating multiple dimensions of LMs' linguistic understanding and offer an insight, through the assessment of their linguistic creativity, on the lexical generalization abilities of LMs."
}

