We've seen that, in general, the more seeds used in the synthesis, the higher-fidelity the result, at the expense of privacy. More precisely, the relationship probably has to do with how uniquely identifiable records are when limited to the seeds.
For example, the only difference between the green and red bars here is that the green adds several more seeds:

Furthermore, even calculated seeds (which are dropped after the synthesis to be recalculated with Tax-Calculator) produce this relationship. The green bar above used calculated seeds.
Another data point supporting this is synthpop8, which used 9 calculated seeds ('E00100', 'E04600', 'P04470', 'E04800', 'E62100', 'E05800', 'E08800', 'E59560', 'E26190') that together uniquely identified over 80% of records. Each row in this synthesis exactly matched a training record, indicating we need to use far fewer seeds.
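The unique-identifiability share for a candidate seed set can be checked directly. Here's a rough sketch (not the original script): the helper below computes the fraction of rows whose seed values appear exactly once, demonstrated on a toy frame standing in for the PUF (on the real file this would be run on something like `pd.read_csv('~/puf2011.csv', usecols=CALC_SEEDS)`).

```python
import pandas as pd

# The 9 calculated seeds used in synthpop8.
CALC_SEEDS = ['E00100', 'E04600', 'P04470', 'E04800', 'E62100',
              'E05800', 'E08800', 'E59560', 'E26190']

def share_unique(df: pd.DataFrame, cols: list) -> float:
    """Fraction of rows whose combination of values on `cols` appears exactly once."""
    return (~df.duplicated(subset=cols, keep=False)).mean()

# Toy illustration: the first two rows collide, the last two are unique.
toy = pd.DataFrame({'E00100': [10, 10, 20, 30],
                    'E04800': [5, 5, 8, 9]})
print(share_unique(toy, ['E00100', 'E04800']))  # 0.5
```

For the real synthpop8 seeds, this share came out above 0.8, which is what made the exact-match problem so severe.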
While we shouldn't use too many, we may also place special weight on these calculated features, which could justify seeding on them rather than on some other raw feature. Whether this approach improves the validity of calculated features like AGI is an empirical question we haven't tested, but it seems like a reasonable hypothesis.
Selecting the seeds is therefore one of the most important decisions in the synthesis process. I'd suggest three factors to consider in this decision:
1. Prioritizing categorical features. Seeding on the categorical features leaves only continuous measures to synthesize, simplifying the process. So for example, we'd want to prioritize MARS.
2. Prioritizing logically "initial" features. For example, XTOT, nu18, MARS, etc. are features of the household which logically precede income and deduction measures. This feeds into the question of visit sequence.
3. Prioritizing the most important features. This could mean critical calculated features like AGI, or the raw features most important in determining those critical calculated features.
Regarding (3): I ran random forest models to determine the importance of each "raw" feature in predicting each of the 9 calculated features in synthpop8. Here are the top 5 raw features, according to their average importance rank across those 9:
- E00200 (salaries and wages): most important for predicting E26190 (non-passive income) and E59560 (earned income for EIC).
- E18400 (SALT): most important for E05800 (income tax before credits), E08800 (income tax after credits), and P04470 (total deductions).
- S006 (weight): most important for E04800 (taxable income), E05800 (taxbc), and E08800 (taxac).
- E02000 (Schedule E income): most important for E26190 (non-passive income).
- P23250 (long-term gains less losses): most important for E00100 (AGI), E04800 (taxable income), and E62100 (alternative minimum taxable income).
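The ranking procedure above can be sketched roughly as follows. This is a toy illustration, not the original script: the synthetic columns and the two made-up targets stand in for the PUF's raw features and the 9 calculated seeds.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Toy stand-ins for raw PUF features.
rng = np.random.default_rng(0)
n = 200
raw = pd.DataFrame(rng.normal(size=(n, 3)),
                   columns=['E00200', 'E18400', 'P23250'])
# Toy "calculated" targets standing in for the 9 synthpop8 seeds.
calc = pd.DataFrame({
    'E00100': raw['E00200'] + raw['P23250'],
    'E04800': raw['E18400'] * 2 + rng.normal(size=n) * 0.1,
})

# Fit one random forest per calculated feature, rank raw features by
# importance within each model (rank 1 = most important), then average
# the ranks across targets.
ranks = []
for target in calc.columns:
    rf = RandomForestRegressor(n_estimators=50, random_state=0)
    rf.fit(raw, calc[target])
    ranks.append(pd.Series(rf.feature_importances_, index=raw.columns)
                   .rank(ascending=False))
avg_rank = pd.concat(ranks, axis=1).mean(axis=1).sort_values()
print(avg_rank)
```

On the real data, the same loop would run over the 9 calculated seeds with all raw PUF features as predictors, and the 5 features listed above are those with the best average rank.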

Together these 5 features uniquely identify 61% of PUF records, so we'd probably still want only a subset, especially if we add something like MARS and XTOT. But I suspect these will be valuable, and they avoid the extra complexity of seeding on calculated features (they also make a simpler story to SOI: we're only using 65 features).
import pandas as pd

FEATURES = ['E00200', 'E18400', 'S006', 'E02000', 'P23250']
(~pd.read_csv('~/puf2011.csv', usecols=FEATURES).duplicated(keep=False)).mean()
# 0.6131326698821662