Which variables should we seed on? #37

@MaxGhenis

We've seen that, in general, the more seeds we use in a synthesis, the higher its fidelity, at the expense of privacy. More precisely, the relationship probably depends on how many records are uniquely identifiable from the seed variables alone.

For example, the only difference between the green and red bars here is that the green adds several more seeds:
image

Furthermore, even calculated seeds (which are dropped after the synthesis to be recalculated with Tax-Calculator) produce this relationship. The green bar above used calculated seeds.

Another data point supporting this is synthpop8, which used 9 calculated seeds ('E00100', 'E04600', 'P04470', 'E04800', 'E62100', 'E05800', 'E08800', 'E59560', 'E26190') that together uniquely identified over 80% of records. Each row in this synthesis exactly matched a training record, indicating we need to use far fewer seeds.
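The exact-match check described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the code used for synthpop8; the function name `exact_match_share` and the toy rows are hypothetical.

```python
def exact_match_share(synthetic_rows, training_rows):
    """Return the share of synthetic rows identical to some training row."""
    # Hash the training rows once so each lookup is O(1).
    training = {tuple(row) for row in training_rows}
    matches = sum(tuple(row) in training for row in synthetic_rows)
    return matches / len(synthetic_rows)


# Toy data (hypothetical values, not PUF records).
training = [(1, 50000.0), (2, 72000.0), (1, 31000.0)]
synthetic = [(1, 50000.0), (2, 70000.0), (1, 31000.0), (2, 72000.0)]

share = exact_match_share(synthetic, training)
print(share)  # 0.75: three of the four synthetic rows copy a training row
```

A share near 1.0, as in synthpop8, would suggest the seeds pin down records too tightly.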

While we shouldn't use too many seeds, we may care especially about these calculated features, which could justify seeding on them rather than on some other raw feature. Whether this approach improves the validity of calculated features like AGI is an empirical question we haven't tested, but it seems like a reasonable hypothesis.

Selecting the seeds is therefore one of the most important decisions in the synthesis process. I'd suggest three factors to consider in this decision:

  1. Prioritizing categorical features. This leaves only continuous measures to be synthesized, which simplifies the process. So for example, we'd want to prioritize MARS.
  2. Prioritizing logically "initial" features. For example, XTOT, nu18, MARS etc. are features of the household which logically precede income and deduction measures. This feeds into the question of visit sequence.
  3. Prioritizing the most important features. This could be critical calculated features like AGI, or the most important features in determining those critical calculated features.

Regarding (3): I ran a random forests model to determine the importance of each "raw" feature in predicting the 9 calculated features in synthpop8. Here are the top 5, according to the average rank in predicting those 9:

  1. E00200 (salaries and wages): most important for predicting E26190 (non-passive income) and E59560 (earned income for EIC).
  2. E18400 (SALT): most important for E05800 (income tax before credit), E08800 (income tax after credits), and P04470 (total deductions).
  3. S006 (weight): most important for E04800 (taxable income), E05800 (taxbc), and E08800 (taxac).
  4. E02000 (Schedule E): most important for E26190 (non-passive income).
  5. P23250 (Long-term gains less losses): most important for E00100 (AGI), E04800 (taxable income), and E62100 (alternative minimum taxable income).

image
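For reference, the "average rank" step can be sketched as follows. This is an illustration only: it assumes the per-target importances have already been extracted from the fitted random forests, and the function name `average_ranks`, the feature names, and the importance values are all hypothetical placeholders rather than the results above.

```python
def average_ranks(importances):
    """Rank features within each target (1 = most important), then average.

    importances maps target -> {feature: importance score}.
    Returns (feature, mean rank) pairs, best first.
    """
    features = sorted(next(iter(importances.values())))
    totals = {feature: 0 for feature in features}
    for scores in importances.values():
        # Higher importance gets a lower (better) rank.
        ordered = sorted(features, key=lambda f: -scores[f])
        for rank, feature in enumerate(ordered, start=1):
            totals[feature] += rank
    n_targets = len(importances)
    means = [(feature, totals[feature] / n_targets) for feature in features]
    return sorted(means, key=lambda pair: pair[1])


# Hypothetical importances for two targets and three raw features.
toy = {
    "target_1": {"feat_a": 0.6, "feat_b": 0.3, "feat_c": 0.1},
    "target_2": {"feat_a": 0.2, "feat_b": 0.5, "feat_c": 0.3},
}
ranking = average_ranks(toy)
print(ranking)  # [('feat_b', 1.5), ('feat_a', 2.0), ('feat_c', 2.5)]
```

Averaging ranks rather than raw importances keeps one target with unusually concentrated importances from dominating the ordering.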

Together these 5 features uniquely identify 61% of PUF records, so we'd probably still want a subset, especially if we add something like MARS and XTOT. Still, I suspect these will be valuable, and they avoid the extra complexity of seeding on calculated features. It also makes for a simpler story to SOI: that we're only using 65 features.

import pandas as pd

FEATURES = ['E00200', 'E18400', 'S006', 'E02000', 'P23250']
# Share of records whose combination of these features appears exactly once.
(~pd.read_csv('~/puf2011.csv', usecols=FEATURES).duplicated(keep=False)).mean()
# 0.6131326698821662
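To pick a subset, one option would be to compute the same uniqueness share for each candidate combination of seeds. The sketch below does this in plain Python on toy rows; `unique_share` and `unique_share_by_subset` are hypothetical helper names, and the row values are invented for illustration, not taken from the PUF.

```python
from collections import Counter
from itertools import combinations


def unique_share(rows, columns):
    """Share of rows whose values on `columns` appear exactly once."""
    keys = [tuple(row[c] for c in columns) for row in rows]
    counts = Counter(keys)
    return sum(counts[key] == 1 for key in keys) / len(keys)


def unique_share_by_subset(rows, candidates, size):
    """Uniqueness share for every `size`-element subset of `candidates`."""
    return {
        subset: unique_share(rows, subset)
        for subset in combinations(candidates, size)
    }


# Toy rows (hypothetical values, not PUF records).
rows = [
    {"MARS": 1, "XTOT": 1, "E00200": 50000},
    {"MARS": 1, "XTOT": 1, "E00200": 50000},
    {"MARS": 2, "XTOT": 4, "E00200": 50000},
    {"MARS": 2, "XTOT": 3, "E00200": 90000},
]
shares = unique_share_by_subset(rows, ["MARS", "XTOT", "E00200"], size=2)
for subset, share in shares.items():
    print(subset, share)
```

On real data the number of subsets grows quickly, so this would probably be limited to a short list of candidate seeds.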
