This repository contains code for the paper "Annotation Imputation to Individualize Predictions: Initial Studies on Distribution Dynamics and Model Predictions" by London Lowmanstone (lowma016@umn.edu), Ruyuan Wan, Risako Owan, Jaehyung Kim, and Dongyeop Kang.
If you have any issues running this project, please create a GitHub issue, and if it's not addressed, please also email lowma016@umn.edu. We really care about making sure that our results are reproducible and that our methods are helpful for understanding the impacts on new datasets and imputation methods.
We use 6 different datasets in our paper. The .zip file containing all of our datasets can be found at https://drive.google.com/file/d/1dir3CxFoHkO1b4eviN5onbEwOiwTtgqE/view?usp=sharing and should be loaded into the /datasets folder.
This project consists of different scripts for each portion of our paper. Here is an overview of each of the scripts and what they do. Python scripts may also have associated `.bat` (for Windows) or `.sh` (for Linux) scripts in their folder. These executable scripts can be used to run the Python script the way we ran it, or to better understand the script's parameters so that you can adapt it to your own datasets and imputation methods.
The requirements for the code can be handled by using the Docker image located at https://registry.hub.docker.com/r/londonl/nlperspectives to ensure functionality. You can also find the full requirements at `requirements/requirements.txt` (the Docker image is merely a light wrapper around these requirements). The container should be launched with a command such as `docker run -it --gpus all --volume ~/annotation-imputation:/workspace londonl/nlperspectives`, where `~/annotation-imputation` is this cloned GitHub repo.
- Figure 1
  - N/A
- Figure 2
  - N/A
- Table 1
  - Read the original papers for each dataset
  - Look at the values in the original datasets or our cleaned versions of them
- Table 2
  - Clean the datasets (see Data Cleaning Scripts below)
  - Run the `utilities/test_imputer_rmse` script
    - (No imputation needs to be done beforehand; the script should handle everything.)
- Figure 3
  - Clean the datasets (see Data Cleaning Scripts below)
  - Run the Multitask (`multitask`), NCF (`ncf_matrix_factorization`), and Kernel (`kernel_matrix_factorization`) imputers on the cleaned datasets
  - Use the resulting imputed datasets (and the original dataset) and run them through the `utilities/pca_plots` script
- Table 3
  - Clean the datasets (see Data Cleaning Scripts below)
  - Run the NCF imputer (`ncf_matrix_factorization`) on the cleaned datasets
  - Run the `utilities/variance_disagreement_graphs` script to compare the original and NCF-imputed datasets
    - Use the output JSON data to recreate the table
- Figure 4
  - Clean the datasets (see Data Cleaning Scripts below)
  - Run the NCF imputer (`ncf_matrix_factorization`) on the cleaned datasets
  - Run the `utilities/variance_disagreement_graphs` script to compare the original and NCF-imputed datasets
    - The script will output the figure
- Table 4
  - Clean the datasets (see Data Cleaning Scripts below)
  - Run the Multitask (`multitask`), NCF (`ncf_matrix_factorization`), and Kernel (`kernel_matrix_factorization`) imputers on the cleaned datasets
  - Compute the KL divergence using `disagreement_shift/label_distribution`
  - Use the KL divergence to run `disagreement_shift/compare_kl` to compare different imputers; its output will also contain the mean and standard deviation of the KL divergence across all examples
    - Use these results to recreate the table
- Table 5
  - Clean the datasets (see Data Cleaning Scripts below)
  - Run the Multitask (`multitask`) and NCF (`ncf_matrix_factorization`) imputers on the cleaned datasets
  - Rerun the Multitask imputer twice:
    1. Train it on the output of the first Multitask run
    2. Train it on the output of the NCF imputer
  - Use the F1 score from the final epoch (as recorded on Weights and Biases, or TensorBoard if you choose to use that). Compute the mean and standard deviation across folds by hand.
- Figure 5
  - Clean the datasets (see Data Cleaning Scripts below)
  - Run the NCF (`ncf_matrix_factorization`) and Kernel (`kernel_matrix_factorization`) imputers on the cleaned datasets
  - Use `disagreement_shift/label_distribution` to compute the KL divergence values between the original and imputed data. The same script will also compute the distributional/soft labels in the process.
  - Use the KL divergence and soft label outputs from the previous step, rename them correctly (as described in `disagreement_shift/website/README.md`), and then run the `website` script, which will launch the site.
  - Visit the site, which will automatically download the outputs.
- Table 6
  - See the instructions for replicating Table 5, but use the aggregate rather than individual statistics.
- Table 7
  - Clean the datasets (see Data Cleaning Scripts below)
  - Run the Multitask imputer (`multitask`) on the cleaned datasets
    - This will output a `.json` file with the final predictions
  - Use the `.json` output with the predictions together with the original dataset to run the `split_based_on_disagreement` script, which will output all the data for the table
- Figure 6 / Appendix G
  - See the steps for replicating Figure 3
- Figure 7 / Appendix H
  - See the steps for replicating Figure 4
- Table 8
  - Clean the datasets (see Data Cleaning Scripts below)
  - Run the NCF imputer (`ncf_matrix_factorization`) on the cleaned datasets
  - Run the `post_arxiv_gpt/individual/power_of_imputation.py` script
  - Analyze the outputs of the script as described in `post_arxiv_gpt/README.md` (the corresponding README).
- Figures 8-13 / Appendix I
  - See the steps for replicating Figure 5
- Table 9
  - See the steps for replicating Table 8, but run the `individual_vs_aggregate_gpt` script instead
- Table 10
  - See the steps for replicating Table 8, but run the `post_arxiv_gpt/individual/annotator_tagging.py` script instead
- Table 11
  - See the steps for replicating Table 8, but run the `post_arxiv_gpt/distributional` script instead
- Table 12
  - See the steps for replicating Table 8, but run the `post_arxiv_gpt/individual/data_ablation.py` script instead
- Table 13
  - See the steps for replicating Table 8, but run the `post_arxiv_gpt/individual/header_ablation.py` script instead
- Extra Utilities
  - Data Cleaning Scripts (already used to create the datasets in the `datasets/cleaned` and `datasets/cleaned_folds` folders, but helpful for cleaning your own datasets)
    - `utilities/check_duplicate_texts.py`
      - Checks whether there are any duplicate examples in a dataset. If this check finds duplicates, the `utilities/fix_duplicate_texts.py` script should be used.
    - `utilities/fix_duplicate_texts.py`
      - Merges duplicate examples. If there are conflicting labels, it either raises an error or chooses a label at random (which is what we did for our datasets).
    - `utilities/double_dataset.py` (used in `utilities/double_SBIC.bat`)
      - Some datasets (such as SBIC) use 0.5 increments for labels. This script doubles all the values in a dataset so that all labels are integers.
    - `utilities/not_too_sparse.py`
      - Makes sure there aren't any rows or columns with no data whatsoever
    - `utilities/compute_disagreement/compute_disagreement.py`
      - Computes disagreement in a dataset; used in `split_based_on_disagreement/main.py`
    - `utilities/prediction_fold_split.py`
      - Splits the dataset into folds used by other scripts. (If you are replicating our paper, these splits are already provided in the datasets folder.)
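The label-doubling and sparsity checks described in the Data Cleaning Scripts can be sketched as follows. This is a minimal illustration, not the repository's code: it assumes a dataset held as a list of rows (one per example) with one column per annotator and `None` marking a missing annotation, and the function names `double_labels` and `empty_rows_and_columns` are hypothetical.

```python
from typing import List, Optional, Tuple

# Assumed layout (not taken from the repository): rows = examples, columns = annotators.
Matrix = List[List[Optional[float]]]


def double_labels(matrix: Matrix) -> List[List[Optional[int]]]:
    """Double every label so that labels in 0.5 increments become integers."""
    doubled = []
    for row in matrix:
        new_row: List[Optional[int]] = []
        for value in row:
            if value is None:
                new_row.append(None)  # keep missing annotations missing
                continue
            scaled = value * 2
            if scaled != int(scaled):
                raise ValueError(f"label {value} is not a multiple of 0.5")
            new_row.append(int(scaled))
        doubled.append(new_row)
    return doubled


def empty_rows_and_columns(matrix: Matrix) -> Tuple[List[int], List[int]]:
    """Return the indices of rows and columns that contain no annotations at all."""
    empty_rows = [i for i, row in enumerate(matrix) if all(v is None for v in row)]
    n_cols = len(matrix[0]) if matrix else 0
    empty_cols = [j for j in range(n_cols) if all(row[j] is None for row in matrix)]
    return empty_rows, empty_cols


if __name__ == "__main__":
    toy = [[0.5, None, 1.0], [None, None, None], [1.5, None, 0.0]]
    print(double_labels(toy))
    print(empty_rows_and_columns(toy))
```

A dataset passes the sparsity check in this sketch when both returned lists are empty.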
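The Table 4 and Figure 5 steps compare each example's label distribution before and after imputation using KL divergence. The sketch below illustrates that idea on toy data. It is not the implementation in `disagreement_shift/label_distribution`: the function names, the additive smoothing constant, the direction of the divergence (original relative to imputed), and the use of the population standard deviation are all assumptions made for illustration.

```python
import math
import statistics
from collections import Counter
from typing import Dict, Iterable, List


def soft_label(annotations: Iterable[int], labels: List[int],
               smoothing: float = 1e-6) -> Dict[int, float]:
    """Turn one example's annotations into a smoothed distribution over `labels`.

    Annotations whose value is not in `labels` are ignored. The smoothing term
    (an assumption) keeps every probability positive so the KL divergence is finite.
    """
    counts = Counter(annotations)
    total = sum(counts[label] + smoothing for label in labels)
    return {label: (counts[label] + smoothing) / total for label in labels}


def kl_divergence(p: Dict[int, float], q: Dict[int, float]) -> float:
    """KL(p || q) in nats; both distributions must cover the same labels."""
    return sum(p[label] * math.log(p[label] / q[label]) for label in p)


if __name__ == "__main__":
    labels = [0, 1, 2]
    # Toy data: each inner list holds the annotations for one example.
    original = [[0, 1, 1], [2, 2]]
    imputed = [[0, 1, 1, 1, 2, 1], [2, 2, 2, 1]]

    per_example = [
        kl_divergence(soft_label(o, labels), soft_label(i, labels))
        for o, i in zip(original, imputed)
    ]
    print("mean KL:", statistics.mean(per_example))
    print("std KL:", statistics.pstdev(per_example))
```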
Since these scripts were run at different points in time with different file structures, you will likely need to check that the scripts point to the correct file locations before running them. Everything else can likely be left at its default value.
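Before running a script, it can help to confirm that the folders it expects actually exist. Here is a minimal sketch, assuming it is run from the repository root. The helper name `missing_paths` is hypothetical, and the two folder names are the ones mentioned in the Data Cleaning Scripts entry above; substitute whichever paths the script you are running refers to.

```python
from pathlib import Path
from typing import Iterable, List


def missing_paths(paths: Iterable[str], root: str = ".") -> List[str]:
    """Return the entries of `paths` that do not exist under `root`."""
    base = Path(root)
    return [p for p in paths if not (base / p).exists()]


if __name__ == "__main__":
    # Folder names taken from this README; adjust to the paths your script uses.
    expected = ["datasets/cleaned", "datasets/cleaned_folds"]
    for path in missing_paths(expected):
        print(f"Missing: {path}")
```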
As mentioned above, if you encounter any issues, please leave a GitHub issue and email lowma016@umn.edu if the GitHub issue is not addressed.