Which variables should we seed on?

We've seen that, in general, the more seeds used in the synthesis, the higher its fidelity, at the expense of privacy. More precisely, the relationship probably depends on how many records are uniquely identifiable from the seeds alone.

For example, the only difference between the green and red bars here is that the green adds several more seeds:
![image](https://user-images.githubusercontent.com/6076111/53920659-b1f1f280-4022-11e9-9100-35b86c4d11f5.png)

Furthermore, even calculated seeds (which are dropped after the synthesis to be recalculated with Tax-Calculator) produce this relationship. The green bar above used calculated seeds.

Another data point supporting this is `synthpop8`, which used 9 calculated seeds (`'E00100', 'E04600', 'P04470', 'E04800', 'E62100', 'E05800', 'E08800', 'E59560', 'E26190'`) that together uniquely identified over 80% of records. Each row in this synthesis exactly matched a training record, indicating we need to use far fewer seeds.
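One way to check for this kind of exact replication is to left-join the synthesis against the deduplicated training records and count the rows that find a match. This is a minimal sketch, not the code used for `synthpop8`: the function name `share_exact_matches` and the toy data frames are hypothetical, and it assumes both frames share the same column names.

```python
import pandas as pd


def share_exact_matches(synth, train, cols=None):
    """Share of synthetic rows that exactly match some training row on `cols`."""
    cols = list(synth.columns) if cols is None else list(cols)
    # Deduplicate training rows so the merge cannot multiply synthetic rows.
    train_unique = train[cols].drop_duplicates()
    merged = synth[cols].merge(train_unique, on=cols, how='left', indicator=True)
    return (merged['_merge'] == 'both').mean()


# Toy data (hypothetical values, not PUF records).
train = pd.DataFrame({'E00200': [100, 200, 300], 'MARS': [1, 2, 1]})
synth = pd.DataFrame({'E00200': [100, 200, 250, 300], 'MARS': [1, 2, 1, 2]})
match_share = share_exact_matches(synth, train)  # two of four rows match
```

A share near 1, as with `synthpop8`, would suggest the seeds pin down the training records too tightly.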

While we shouldn't use too many seeds, we may care especially about these calculated features, which could justify seeding on them rather than on some other raw feature. Whether this approach improves the validity of calculated features like AGI is an empirical question we haven't tested, but it seems like a reasonable hypothesis.

Selecting the seeds is therefore one of the most important decisions in the synthesis process. I'd suggest a few factors to consider in this decision:
1. **Prioritizing categorical features.** Seeding on categorical features leaves only continuous measures to synthesize, which simplifies the synthesis process. So for example, we'd want to prioritize MARS.
2. **Prioritizing logically "initial" features.** For example, XTOT, nu18, MARS etc. are features of the household which logically precede income and deduction measures. This feeds into the question of visit sequence.
3. **Prioritizing the most important features.** This could be critical calculated features like AGI, or the most important features in determining those critical calculated features.

Regarding (3): I ran a random forest model to determine the importance of each "raw" feature in predicting the 9 calculated features in `synthpop8`. Here are the top 5, according to their average rank in predicting those 9:
1. `E00200` (salaries and wages): most important for predicting `E26190` (non-passive income) and `E59560` (earned income for EIC).
2. `E18400` (SALT): most important for `E05800` (income tax before credit), `E08800` (income tax after credits), and `P04470` (total deductions).
3. `S006` (weight): most important for `E04800` (taxable income), `E05800` (taxbc), and `E08800` (taxac).
4. `E02000` (Schedule E): most important for `E26190` (non-passive income).
5. `P23250` (long-term gains less losses): most important for `E00100` (AGI), `E04800` (taxable income), and `E62100` (alternative minimum taxable income).

![image](https://user-images.githubusercontent.com/6076111/53921882-06976c80-4027-11e9-839a-6a01f9753ed2.png)
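For reference, an average-rank calculation of this kind could look roughly like the sketch below. It is not the code behind the ranking above: it assumes scikit-learn's `RandomForestRegressor` with impurity-based importances, and the function name `average_importance_rank`, the hyperparameters, and the toy columns are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def average_importance_rank(df, raw_cols, target_cols, n_estimators=50, random_state=0):
    """Fit one forest per target and average each raw feature's importance rank."""
    ranks = {}
    for target in target_cols:
        model = RandomForestRegressor(n_estimators=n_estimators, random_state=random_state)
        model.fit(df[raw_cols], df[target])
        importance = pd.Series(model.feature_importances_, index=raw_cols)
        # Rank 1 is the most important feature for this target.
        ranks[target] = importance.rank(ascending=False)
    # Lower average rank means more important across targets.
    return pd.DataFrame(ranks).mean(axis=1).sort_values()


# Toy data (hypothetical): both targets are driven mainly by column 'a'.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(300, 3)), columns=['a', 'b', 'c'])
toy['t1'] = 5 * toy['a'] + 0.1 * toy['b']
toy['t2'] = 4 * toy['a'] - 0.1 * toy['c']
avg_rank = average_importance_rank(toy, ['a', 'b', 'c'], ['t1', 't2'])
```

On the PUF, `raw_cols` would be the raw features and `target_cols` the 9 calculated features listed earlier.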

Together these 5 features uniquely identify 61% of PUF records, so we'd probably still want a subset, especially if we add something like `MARS` and `XTOT`. Still, I suspect these will be valuable, and they avoid the extra complexity of seeding on calculated features (it also makes for a simpler story to SOI that we're only using 65 features).
```
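import pandas as pd

FEATURES = ['E00200', 'E18400', 'S006', 'E02000', 'P23250']
# Share of PUF records uniquely identified by their combination of FEATURES values.
(~pd.read_csv('~/puf2011.csv', usecols=FEATURES).duplicated(keep=False)).mean()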
# 0.6131326698821662
```
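To help pick a subset, the same uniqueness check can be repeated over every combination of candidate seeds. The sketch below is only illustrative: the helper names `unique_share` and `unique_share_by_subset` are hypothetical, and the toy frame stands in for the PUF, which would be loaded with `pd.read_csv` as above.

```python
from itertools import combinations

import pandas as pd


def unique_share(df, cols):
    """Share of rows whose combination of values in `cols` appears exactly once."""
    return (~df.duplicated(subset=list(cols), keep=False)).mean()


def unique_share_by_subset(df, candidates, max_size=None):
    """Tabulate unique_share for every non-empty subset of `candidates`."""
    max_size = len(candidates) if max_size is None else max_size
    rows = []
    for k in range(1, max_size + 1):
        for subset in combinations(candidates, k):
            rows.append({'seeds': subset, 'n_seeds': k,
                         'unique_share': unique_share(df, subset)})
    # Least identifying subsets first.
    return pd.DataFrame(rows).sort_values('unique_share').reset_index(drop=True)


# Toy data (hypothetical values, not PUF records).
toy = pd.DataFrame({'MARS': [1, 1, 2, 2],
                    'XTOT': [1, 2, 1, 1],
                    'E00200': [0, 0, 0, 500]})
table = unique_share_by_subset(toy, ['MARS', 'XTOT', 'E00200'])
```

With 5 candidates this is only 31 subsets, but the count doubles with each added candidate, so `max_size` is there to cap the search.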
