I had been wondering why the early results with synthpop on a file subset seemed to do so well in comparison to the early results with random forests from synthimpute.
I have since done a few runs on the full file with synthpop.
The latest such run that I've looked at is synthpop2.csv in the Google Drive. I also created synthpop3.csv, which I think might be slightly better, but I have not looked at it yet.
Here's what I did:
- I used MARS, XTOT, S006, E00100 (AGI), and E09600 (AMT) as X variables, meaning they are included among the predictors for every synthesized variable.
- I dropped E00100 and E09600 from the synthesized file, of course, because they will be calculated.
- MARS, XTOT, and S006 will be exactly the same in the synthesized file as in the PUF. (I know that @feenberg has expressed some concern about XTOT; I suggest we see what our disclosure measures tell us about this, and I can also run variants that synthesize XTOT.)
- I used CART for all variables, with two exceptions:
  -- I synthesized pensions and dividends using the ratio approach discussed in #17 (Ensure e00600 >= e00650 and e01500 >= e01700).
  -- I made a minor mistake with E07600 (credit for prior year minimum tax) and accidentally synthesized it by random sampling.
- I created a visit sequence ordered by the weighted sums of the variables' absolute values, in descending order (largest variable first).
- I modified that sequence for one really problematic variable, E02000 (Schedule E net income or loss), by moving it to the front; I don't know why, but that improved it dramatically.
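The magnitude-based visit sequence and the sequential CART synthesis that synthpop performs can be sketched roughly as follows. This is an illustrative Python analogue using scikit-learn on toy data (synthpop itself is an R package, and all column names and distributions here are stand-ins), not the actual implementation:

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Toy stand-in for the PUF: a weight (s006) plus two dollar variables.
n = 2000
puf = pd.DataFrame({
    "s006": rng.uniform(50, 500, n),
    "e00200": rng.gamma(2.0, 20000.0, n),   # wages (toy)
    "e00300": rng.gamma(1.5, 1000.0, n),    # taxable interest (toy)
})

# Visit sequence: order the money variables by the weighted sum of their
# absolute values, largest first.
money_vars = ["e00200", "e00300"]
totals = {v: (puf["s006"] * puf[v].abs()).sum() for v in money_vars}
visit = sorted(money_vars, key=totals.get, reverse=True)

# Sequential CART synthesis: for each variable in the visit sequence, fit a
# tree on the variables available so far, then give each synthetic record a
# donor value drawn from the observed records that fall in the same leaf.
syn = pd.DataFrame({"s006": puf["s006"].to_numpy()})  # X variable kept as-is
for var in visit:
    predictors = list(syn.columns)
    tree = DecisionTreeRegressor(min_samples_leaf=5, random_state=0)
    tree.fit(puf[predictors], puf[var])
    obs_leaf = tree.apply(puf[predictors])
    syn_leaf = tree.apply(syn[predictors])
    donors = {leaf: puf.loc[obs_leaf == leaf, var].to_numpy()
              for leaf in np.unique(obs_leaf)}
    syn[var] = [rng.choice(donors[leaf]) for leaf in syn_leaf]
```

Because each synthesized value is a donor value sampled from an observed leaf, marginal shapes tend to be preserved; putting a hard variable like E02000 first just means it is synthesized by sampling before any other synthesized variable can condition on it.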
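The ratio approach used for pensions and dividends can be illustrated like this (toy data; plain resampling stands in for the CART models that would actually be used): synthesize the total, synthesize the component-to-total ratio, and multiply, so the component can never exceed the total.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Toy "observed" data: total ordinary dividends and the qualified portion.
e00600 = rng.gamma(2.0, 500.0, size=n)        # total dividends (toy)
e00650 = e00600 * rng.beta(5.0, 2.0, size=n)  # qualified dividends <= total

# Synthesize the total, then synthesize the ratio (a value in [0, 1]) and
# multiply.  Resampling stands in here for the actual models.
ratio = np.divide(e00650, e00600, out=np.zeros(n), where=e00600 > 0)
syn_e00600 = rng.choice(e00600, size=n)
syn_ratio = rng.choice(ratio, size=n)
syn_e00650 = syn_e00600 * syn_ratio

# e00600 >= e00650 holds in the synthetic file by construction.
assert (syn_e00650 <= syn_e00600).all()
```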
The results, maybe not surprisingly, look pretty good. I put summary results in eval_taxcalc_2018-12-17.html, in the Google Drive folder synpuf_analyses that I set up for results that can be public.
I ran the results through Tax-Calculator. The graph below compares taxbc (tax before credits) for synpuf6 (the latest file made with synthimpute and random forests) and the first two versions made with the R package synthpop (synthpop1 and synthpop2):
Along the way, I compared some subset results to the same subsets of synpuf6; often there were enormous differences, usually favoring the CART approach. It makes me think I should add comparisons by marital status to the evaluation program I have.
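A marital-status comparison of that kind might look something like this sketch (column names follow Tax-Calculator conventions, but the data here are fabricated toy results, not actual output):

```python
import numpy as np
import pandas as pd

# Toy stand-ins for Tax-Calculator output: filing status (MARS), weight
# (s006), and tax before credits (taxbc).
def fake_results(seed):
    r = np.random.default_rng(seed)
    n = 1000
    return pd.DataFrame({
        "MARS": r.choice([1, 2, 4], size=n),
        "s006": r.uniform(50, 500, size=n),
        "taxbc": r.gamma(2.0, 5000.0, size=n),
    })

puf_res, syn_res = fake_results(1), fake_results(2)

# Weighted total taxbc by marital status, PUF vs. synthetic, with % difference.
def by_mars(df):
    return (df["s006"] * df["taxbc"]).groupby(df["MARS"]).sum()

compare = pd.DataFrame({"puf": by_mars(puf_res), "syn": by_mars(syn_res)})
compare["pct_diff"] = 100 * (compare["syn"] / compare["puf"] - 1)
print(compare.round(1))
```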
In general, it seems like the synthpop/CART approach produces results closer to the PUF than the random forests approach; I would like to understand why, and to see how the results do on disclosure measures.
Running the R package synthpop