The latest version of the deployed model is available on the GradeSpeare app.
- Overview
- Datasets
- Compilation and Cleaning
- Augmentation and Balancing
- NLP - Dependency Matching and Doc Vectors
- Model Selection
- References
This is my capstone project for the Concordia Data Science Diploma program. Its goal is to create a model that can predict the level of a written text according to the Common European Framework of Reference for Languages (CEFR).
The current model is a multi-layer perceptron (MLP) classifier, which predicts the level with 75% overall accuracy. It was trained on 5,943 texts from the PELIC dataset, 194 texts from the ASAG dataset, and 862 artificially augmented texts. The predicted level is a combination of the 'level_id' variable from the PELIC dataset and the 'grade_majority_vote' variable from the ASAG dataset. 'level_id' indicates the level of the English course that the writer was taking at the time of production. 'grade_majority_vote' indicates the majority vote of three grades given by trained TOEFL examiners.
- The University of Pittsburgh English Language Institute Corpus (PELIC, Juffs 2020)
- Université Catholique de Louvain - CEFR Automated Short Answer Grading (ASAG, Tack et al. 2017)
The compiled PELIC and ASAG datasets are stored as .pkl files in the data folder. The pickle format was chosen to preserve the datatype of the 'doc_vector' column, which is converted to a string when saved as .csv.
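A minimal sketch of why pickle was preferred: round-tripping a vector-valued column through .pkl keeps the NumPy arrays intact, while .csv stringifies them. The toy frame and file names below are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame with a vector-valued column, mirroring the 'doc_vector' column.
df = pd.DataFrame({
    "answer": ["I am learning English.", "She has been studying."],
    "doc_vector": [np.zeros(3, dtype=np.float32), np.ones(3, dtype=np.float32)],
})

df.to_pickle("demo.pkl")                      # preserves the ndarray objects
restored = pd.read_pickle("demo.pkl")
print(type(restored["doc_vector"].iloc[0]))   # <class 'numpy.ndarray'>

df.to_csv("demo.csv", index=False)            # CSV stringifies the arrays
from_csv = pd.read_csv("demo.csv")
print(type(from_csv["doc_vector"].iloc[0]))   # <class 'str'>
```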
Compilation and cleaning steps are demonstrated in .ipynb files in their respective subfolders in the notebooks folder.
The PELIC dataset consists of five .csv files that were merged with Pandas. The dataset initially consisted of 47,667 rows and 47 columns. The two main variables that were used in training the model were 'answer', which is the student's written text, and 'level_id', which is the level of the course that the student was taking at the time of writing.
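The merge step can be sketched as joining the answers table to the student metadata on a shared key. The file schema below (column names, key) is hypothetical; the real PELIC release uses its own schema.

```python
import pandas as pd

# Hypothetical stand-ins for two of the five PELIC .csv files.
answers = pd.DataFrame({"answer_id": [1, 2], "user_id": [10, 11],
                        "answer": ["text one", "text two"]})
students = pd.DataFrame({"user_id": [10, 11], "level_id": [2, 4]})

# Left-join so every answer row is kept even if metadata is missing.
merged = answers.merge(students, on="user_id", how="left")
print(merged.columns.tolist())  # ['answer_id', 'user_id', 'answer', 'level_id']
```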
Null values and texts of insufficient quality were removed from the dataset in such a way as to conserve as much data as possible. First, different versions of the same texts were removed, as they were essentially duplicates. Next, the answers from different course types and question types were inspected to see which questions allowed for an open text answer (many of the questions in the dataset only allowed for a selection answer, which wouldn't provide good data). It was found that all course and question types could be conserved, and only answers that were not produced in an open text field were removed.
After that, null values had to be handled. Before dropping null values from the entire dataset, columns containing fewer than 36,304 answers were removed. These included variables such as birth year, gender, and test scores. Test scores were considered as potential 'y' variables; however, keeping them would have reduced the number of usable texts, so 'level_id' was chosen as the data label instead.
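The column-first, rows-second null handling described above can be sketched with pandas. The threshold and column names below are toy values for illustration; the real dataset used a 36,304-answer cutoff.

```python
import pandas as pd

def drop_sparse_then_nulls(df: pd.DataFrame, min_non_null: int) -> pd.DataFrame:
    """Drop columns with fewer than `min_non_null` non-null entries,
    then drop any remaining rows that still contain nulls."""
    kept = df.dropna(axis=1, thresh=min_non_null)  # column pass first
    return kept.dropna(axis=0)                     # then row pass

# Toy example: 'birth_year' is too sparse and is removed in the column pass,
# so rows missing only 'birth_year' survive the row pass.
toy = pd.DataFrame({
    "answer":     ["a", "b", None, "d"],
    "level_id":   [2, 3, 4, 5],
    "birth_year": [None, None, None, 1990.0],
})
clean = drop_sparse_then_nulls(toy, min_non_null=2)
print(list(clean.columns), len(clean))  # ['answer', 'level_id'] 3
```

Doing the column pass first is what conserves data: dropping rows before dropping sparse columns would discard every row where only a sparse column was null.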
The ASAG dataset consists of 299 .xml files that were parsed with BeautifulSoup. The questions, answers, and grades in this dataset are very clean and required no further processing.
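Extracting one record from an ASAG file might look like the following. The tag names here are hypothetical (the real files may differ), and `html.parser` is used as a stand-in so no extra parser dependency is needed; a dedicated XML parser also works.

```python
from bs4 import BeautifulSoup

# Hypothetical structure; the real ASAG files may use different tag names.
xml = """
<item>
  <question>Describe your weekend.</question>
  <answer>I visited my grandmother and we cooked together.</answer>
  <grade_majority_vote>3</grade_majority_vote>
</item>
"""

soup = BeautifulSoup(xml, "html.parser")
record = {
    "question": soup.find("question").get_text(strip=True),
    "answer": soup.find("answer").get_text(strip=True),
    "grade_majority_vote": int(soup.find("grade_majority_vote").get_text()),
}
print(record["grade_majority_vote"])  # 3
```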
Instead of setting a minimum answer length, spaCy was used to filter out answers that did not contain at least one subject and one verb. This preserved data from level 2 while eliminating one-word responses and multiple-choice answers. No maximum answer length was set.
| Dataset | Rows Before Cleaning | Rows After Cleaning |
|---|---|---|
| PELIC | 47,667 | 31,099 |
| ASAG | 299 | 268 |
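The subject-and-verb filter described above can be sketched as a predicate over a dependency parse. The version below is dependency-light for illustration: it works on any iterable of tokens exposing `.dep_` and `.pos_`, so with spaCy installed you would simply call it as `has_subject_and_verb(nlp(answer))`.

```python
from types import SimpleNamespace

SUBJECT_DEPS = {"nsubj", "nsubjpass"}  # spaCy dependency labels for subjects

def has_subject_and_verb(tokens) -> bool:
    """True if the parse contains at least one subject and one verb.
    Works on a spaCy Doc or any iterable of tokens with .dep_ and .pos_."""
    has_subj = any(t.dep_ in SUBJECT_DEPS for t in tokens)
    has_verb = any(t.pos_ in {"VERB", "AUX"} for t in tokens)
    return has_subj and has_verb

# Stand-in tokens mimicking a parsed "She runs." vs. a one-word answer "Yes."
def tok(dep, pos):
    return SimpleNamespace(dep_=dep, pos_=pos)

print(has_subject_and_verb([tok("nsubj", "PRON"), tok("ROOT", "VERB")]))  # True
print(has_subject_and_verb([tok("ROOT", "INTJ")]))                        # False
```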
The PELIC dataset was very imbalanced by level. To address this issue, the level 2 class was doubled using GPT2Tokenizer and GPT2LMHeadModel. The texts were augmented by using the model to rephrase and generate a continuation of each answer; each generated text was then truncated, keeping its second half as the augmented data sample. The generator uses top-k and nucleus sampling, which helps to retain the style of the source text. Simpler augmentation techniques, such as random shuffling, random insertion, or synonym replacement, were considered; however, these would not have preserved the grammatical structure of the answers, which is needed for pattern matching. The augmentation function is found in Augment.ipynb.
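The GPT-2 generation step itself is heavyweight, but the top-k and nucleus (top-p) filtering it relies on can be illustrated directly with NumPy. This is a sketch of the sampling technique, not the project's augmentation code; the logits and cutoffs are made up.

```python
import numpy as np

def top_k_top_p_filter(logits: np.ndarray, k: int, p: float) -> np.ndarray:
    """Return sampling probabilities after top-k, then nucleus (top-p), filtering."""
    order = np.argsort(logits)[::-1]        # token ids, most to least likely
    keep = order[:k]                        # top-k cut
    probs = np.exp(logits[keep] - logits[keep].max())
    probs /= probs.sum()
    cum = np.cumsum(probs)
    # Smallest prefix of the sorted tokens whose probability mass reaches p.
    nucleus = keep[: np.searchsorted(cum, p) + 1]
    out = np.zeros_like(logits)
    out[nucleus] = np.exp(logits[nucleus] - logits[nucleus].max())
    out[nucleus] /= out[nucleus].sum()
    return out

logits = np.array([4.0, 3.0, 0.1, -2.0, -5.0])
probs = top_k_top_p_filter(logits, k=3, p=0.9)
print(np.nonzero(probs)[0])  # only the high-probability tokens survive
```

Because low-probability tokens are masked out before sampling, generated continuations stay close in register to the prompt, which is why this approach preserves the style of the original answer better than unconstrained sampling.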
Once the answers of the level 2 class were augmented, the answers from the remaining classes (3, 4, and 5) were reduced. The reduction was not random; rather, a function was used to choose the longest answers first, and to not choose an answer from the same question twice, where possible, until the data was balanced. The balancing function is found in the notebooks in the balancing folder.
| Dataset | Level | Original Count | Count After Augmentation | Count After Balancing |
|---|---|---|---|---|
| PELIC | 4 | 12,163 | 12,163 | 1,698 |
| PELIC | 5 | 10,094 | 10,094 | 1,698 |
| PELIC | 3 | 7,993 | 7,993 | 1,698 |
| PELIC | 2 | 849 | 1,698 | 1,698 |
| ASAG | 3 | 97 | 97 | 56 |
| ASAG | 4 | 67 | 67 | 56 |
| ASAG | 2 | 54 | 56 | 56 |
| ASAG | 5 | 28 | 56 | 56 |
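The longest-first, question-diverse reduction described above can be sketched as follows. The `question_id` column name is an assumption for illustration; the real balancing functions live in the balancing notebook folder.

```python
import pandas as pd

def reduce_class(df: pd.DataFrame, target: int) -> pd.DataFrame:
    """Keep `target` rows, preferring longer answers and, where possible,
    not taking two answers from the same question ('question_id' assumed)."""
    ranked = df.assign(_len=df["answer"].str.len()).sort_values("_len", ascending=False)
    first_pass = ranked.drop_duplicates("question_id")  # longest answer per question
    chosen = first_pass.head(target)
    if len(chosen) < target:                            # top up with repeats if needed
        rest = ranked.drop(chosen.index)
        chosen = pd.concat([chosen, rest.head(target - len(chosen))])
    return chosen.drop(columns="_len")

toy = pd.DataFrame({
    "question_id": [1, 1, 2, 2, 3],
    "answer": ["short", "a much longer answer", "mid length", "tiny", "another answer"],
})
print(len(reduce_class(toy, target=3)))  # 3
```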
spaCy's doc.vector attribute was used to generate 300-dimensional document vectors, which were included in X. No lemmatization, stop-word removal, or punctuation removal was performed prior to vectorization. This decision preserves the grammatical integrity of the documents, since what is being classified is not a topic but a level of complexity.
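For models that ship with static word vectors (the 300-dimensional vectors require a model such as en_core_web_md or en_core_web_lg; the small model has none), doc.vector is the average of the token vectors. A toy 4-dimensional illustration:

```python
import numpy as np

# Toy 4-d vectors stand in for the 300-d en_core_web vectors.
token_vectors = np.array([
    [1.0, 0.0, 2.0, 0.0],   # e.g. "I"
    [0.0, 2.0, 0.0, 2.0],   # e.g. "agree"
])
doc_vector = token_vectors.mean(axis=0)  # elementwise mean over tokens
print(doc_vector.shape)  # (4,)
```

Because the average is taken over every token, punctuation and stop words shift the resulting vector, which is consistent with the decision above not to remove them.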
Patterns for 26 verb tense combinations, three gerund dependencies, and two modal verbs were defined using spaCy's DependencyMatcher. The counts of these patterns, along with the number of sentences and the average sentence length per answer, were calculated. The pattern counts were squared before being added to X, to increase their chance of being detected during model training. The average sentence length was added to X unscaled, and the number of sentences was excluded from X, since the sheer number of sentences was not expected to be a good indicator of level.
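Assembling one row of X under this scheme might look like the following sketch. The function name and feature order are assumptions; only the squaring of pattern counts and the unscaled sentence length come from the description above.

```python
import numpy as np

def build_features(doc_vector, pattern_counts, avg_sentence_len):
    """One row of X: 300-d doc vector, squared pattern counts,
    and the raw average sentence length."""
    return np.concatenate([
        np.asarray(doc_vector, dtype=float),
        np.square(np.asarray(pattern_counts, dtype=float)),  # boost pattern signal
        [float(avg_sentence_len)],                           # added unscaled
    ])

row = build_features(np.zeros(300), pattern_counts=[2, 0, 1], avg_sentence_len=12.5)
print(row.shape, row[300], row[-1])  # (304,) 4.0 12.5
```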
Patterns were created and combined using the functions in the pattern_matching notebook folder. The pattern dictionary is in a .json file in the patterns folder.
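For illustration, one entry in such a pattern file might look like the fragment below, using spaCy's DependencyMatcher syntax (a past-participle verb with a child auxiliary "have" for the present perfect active). This is a hypothetical example; the actual patterns file may differ.

```json
{
  "present_perfect_active": [
    {"RIGHT_ID": "main_verb", "RIGHT_ATTRS": {"TAG": "VBN"}},
    {"LEFT_ID": "main_verb", "REL_OP": ">", "RIGHT_ID": "aux_have",
     "RIGHT_ATTRS": {"DEP": "aux", "LEMMA": "have"}}
  ]
}
```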
The following verbal structures were searched for. A search with auxiliaries ("aux") is included where appropriate to capture negatives. Searches with modals ("modal") exclude the lemmas will and would so that these could be counted separately:
| Tense | Aspect | Voice | Aux do | Modal |
|---|---|---|---|---|
| Present | Simple | Active | ✓ | ✓ |
| Present | Simple | Passive | ✓ | |
| Present | Continuous | Active | ✓ | |
| Present | Continuous | Passive | ✓ | |
| Present | Perfect | Active | ✓ | |
| Present | Perfect | Passive | ✓ | |
| Present | Perfect-Continuous | Active | ✓ | |
| Present | Perfect-Continuous | Passive | ✓ | |
| Past | Simple | Active | ✓ | |
| Past | Simple | Passive | | |
| Past | Continuous | Active | | |
| Past | Continuous | Passive | | |
| Past | Perfect | Active | | |
| Past | Perfect | Passive | | |
| Past | Perfect-Continuous | Active | | |
| Past | Perfect-Continuous | Passive | | |
| Tag | Dependency |
|---|---|
| Gerund | Subject |
| Gerund | Complement of a Preposition |
| Gerund | Open Complement |
| Tag | Lemma |
|---|---|
| Modal | will |
| Modal | would |
| Feature | Level 2 | Level 3 | Level 4 | Level 5 |
|---|---|---|---|---|
| present simple active | 2.097 | 3.976 | 7.805 | 6.954 |
| past simple active | 0.958 | 1.534 | 2.805 | 3.263 |
| will | 0.202 | 0.385 | 0.645 | 0.581 |
| present continuous active | 0.200 | 0.216 | 0.449 | 0.398 |
| present simple active modal | 0.153 | 0.737 | 1.646 | 1.516 |
| present simple active aux | 0.110 | 0.202 | 0.404 | 0.358 |
| present perfect active | - | 0.262 | 0.356 | 0.377 |
| gerund pcomp | - | 0.214 | 0.680 | 0.728 |
| present simple passive | - | 0.110 | 0.330 | 0.495 |
| gerund xcomp | - | - | 0.259 | 0.226 |
| gerund subject | - | - | 0.253 | 0.233 |
| past continuous active | - | - | 0.204 | 0.184 |
| would | - | - | 0.191 | 0.285 |
| past perfect active | - | - | 0.148 | 0.172 |
| past simple passive | - | 0.110 | 0.137 | 0.304 |
| past simple active aux | - | - | 0.135 | 0.172 |
| present simple passive modal | - | - | - | 0.168 |
| Class | Level description | CEFR level |
|---|---|---|
| 2 | Pre-Intermediate | A2/B1 |
| 3 | Intermediate | B1 |
| 4 | Upper-Intermediate | B1+/B2 |
| 5 | Advanced | B2+/C1 |
Four different algorithms were used to classify the data, with y being 'level_id' for the PELIC dataset and 'grade_majority_vote' for the ASAG dataset. The performance metrics of each model are below, in descending order of performance.
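A minimal sketch of fitting an MLP classifier of the kind used here, on synthetic stand-in data. The feature dimensions, hidden layer size, and other hyperparameters below are assumptions for illustration, not the project's actual settings.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for X (doc vectors + pattern features) and y (levels 2-5).
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 20))
y = rng.integers(2, 6, size=400)

# Stratified split so every level appears in both halves.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300, random_state=0)
clf.fit(X_tr, y_tr)
print(clf.classes_.tolist())  # [2, 3, 4, 5]
```

From here, `sklearn.metrics.classification_report(y_te, clf.predict(X_te))` produces per-class precision, recall, and F1 tables like those below.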
| Class | Precision | Recall | F1-Score |
|---|---|---|---|
| 2 | 0.84 | 0.80 | 0.82 |
| 3 | 0.73 | 0.77 | 0.75 |
| 4 | 0.67 | 0.72 | 0.69 |
| 5 | 0.74 | 0.67 | 0.71 |
| weighted avg | 0.74 | 0.74 | 0.74 |
| Class | Precision | Recall | F1-Score |
|---|---|---|---|
| 2 | 0.66 | 0.90 | 0.76 |
| 3 | 0.92 | 0.65 | 0.76 |
| 4 | 0.73 | 0.69 | 0.71 |
| 5 | 0.71 | 0.70 | 0.71 |
| weighted avg | 0.75 | 0.73 | 0.73 |
| Class | Precision | Recall | F1-Score |
|---|---|---|---|
| 2 | 0.72 | 0.86 | 0.78 |
| 3 | 0.51 | 0.70 | 0.59 |
| 4 | 0.52 | 0.17 | 0.25 |
| 5 | 0.65 | 0.72 | 0.68 |
| weighted avg | 0.60 | 0.61 | 0.58 |
| Class | Precision | Recall | F1-Score |
|---|---|---|---|
| 2 | 0.68 | 0.80 | 0.74 |
| 3 | 0.40 | 0.67 | 0.50 |
| 4 | 0.57 | 0.17 | 0.26 |
| 5 | 0.47 | 0.40 | 0.43 |
| weighted avg | 0.53 | 0.51 | 0.48 |
- Juffs, A., Han, N-R., & Naismith, B. (2020). The University of Pittsburgh English Language Institute Corpus (PELIC) [Data set]. http://doi.org/10.5281/zenodo.3991977
- Tack, A., François, T., Roekhaut, S., & Fairon, C. (2017). Human and Automated CEFR-based Grading of Short Answers. In Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications (pp. 169-179). Association for Computational Linguistics.