# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!
cff-version: 1.2.0
title: >-
  Aerospace Language Understanding Evaluation (ALUE): Large
  Language Benchmark with Aerospace Datasets
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Eugene
    family-names: Mangortey
    email: emangortey@mitre.org
    affiliation: The MITRE Corporation
  - given-names: Kunal
    family-names: Sarkhel
    email: ksarkhel@mitre.org
    affiliation: The MITRE Corporation
    orcid: 'https://orcid.org/0000-0003-4721-9416'
  - given-names: Satyen
    family-names: Singh
    email: ssingh@mitre.org
    affiliation: The MITRE Corporation
  - given-names: Shuo
    family-names: Chen
    email: chen@mitre.org
    affiliation: The MITRE Corporation
  - given-names: Bulent
    family-names: Ayhan
    email: bayhan@mitre.org
    affiliation: The MITRE Corporation
abstract: >-
  Large Language Models (LLMs) present revolutionary
  potential for the aviation industry, enabling stakeholders
  to derive critical intelligence and improve operational
  efficiency through automation. However, given the
  safety-critical nature of aviation, rigorous
  domain-specific evaluation of LLMs is paramount before
  their integration into workflows. General-purpose LLM
  benchmarks often do not capture the nuanced understanding
  of aerospace-specific knowledge and phraseology required
  for reliable application. This paper introduces the
  Aerospace Language Understanding Evaluation (ALUE)
  benchmark, an aviation-specific framework designed for
  scalable evaluation, assessment, and benchmarking of LLMs
  against specialized aviation datasets and language tasks.
  ALUE incorporates diverse datasets and tasks, including
  binary and multiclass classification for hazard
  identification, extractive question answering for precise
  information retrieval (e.g., tail numbers, runways),
  sentiment analysis, and multiclass token classification
  for fine-grained analysis of air traffic control
  communications. ALUE also introduces several metrics for
  evaluating the correctness of generated responses, using
  LLMs to identify and judge the claims those responses
  make. Our findings demonstrate that structured prompts and
  in-context examples significantly improve model
  performance, highlighting that general models struggle
  with aviation tasks without such guidance and often
  produce verbose or unstructured outputs. ALUE provides a
  crucial tool for guiding the development and safe
  deployment of LLMs tailored to the unique demands of the
  aviation and aerospace domains.
keywords:
  - aviation
  - aerospace
  - benchmark
  - evaluation
  - dataset
  - LLM
  - large language model
license: Apache-2.0
identifiers:
  - type: doi
    value: 10.2514/6.2025-3247
date-released: 2021-08-11