/// DETERMINE EXTENT OF MISSING DATA ///
* Note: to decide which covariates of interest to include, and whether to treat each as categorical or continuous, review the literature and consult subject-matter experts
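// Determine which outcomes have missing data
* Note: mdesc is a user-written command (not part of official Stata) and must be installed before running, e.g. ssc install mdesc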
mdesc angina hospmi mi_fchd anychd stroke cvd hyperten death
sum angina hospmi mi_fchd anychd stroke cvd hyperten death
tab1 angina hospmi mi_fchd anychd stroke cvd hyperten death
* Results: no missing values
* Results: anychd has a sufficient number of cases and controls and represents a serious condition
* Results: choose coronary heart disease (anychd) as the outcome
// Determine which covariates have missing data
mdesc sex age agecat sysbp diabp bpmeds cursmoke cigpday totchol bmi bmicat obese_normal glucose diabetes heartrte prevap prevchd prevmi prevstrk prevhyp
sum sex age agecat sysbp diabp bpmeds cursmoke cigpday totchol bmi bmicat obese_normal glucose diabetes heartrte prevap prevchd prevmi prevstrk prevhyp
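* Note: sysbpcat is generated in the CREATE VARIABLES section below; run that section before this command
tabulate sysbpcat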
* Results: obese_normal has a large share of missing values (42.43%), and glucose also has missing values (8.26%)
* Results: choose categorical systolic blood pressure as main effect to evaluate, adjusting for sex, categorical age, continuous BMI, categorical serum total cholesterol, and continuous heart rate
/// CREATE VARIABLES ///
* Note: Stata treats missing values as larger than any number, so each derived variable is reset to missing in its last step
// Recode sex categories with 0 and 1, instead of 1 and 2
gen sex2 = 0
replace sex2 = 1 if sex == 2
replace sex2 = . if sex == .
// Create variable with age categories
gen age50 = 0
replace age50 = 1 if agecat == 3
replace age50 = 1 if agecat == 4
replace age50 = . if agecat == .
// Create variable with serum total cholesterol categories
gen totcholcat = 0
replace totcholcat = 1 if totchol > 199
replace totcholcat = 2 if totchol > 239
replace totcholcat = . if totchol == .
// Create variable with heart rate categories
gen heartrtecat = 0
replace heartrtecat = 1 if heartrte > 75
replace heartrtecat = . if heartrte == .
// Create variable with systolic blood pressure categories
gen sysbpcat = 0
replace sysbpcat = 1 if sysbp >= 120
replace sysbpcat = 2 if sysbp >= 140
replace sysbpcat = 3 if sysbp >= 160
replace sysbpcat = . if sysbp == .
// Create variable with systolic blood pressure categories as indicator variables
gen sysbpcat01 = 0 if sysbpcat == 0
replace sysbpcat01 = 1 if sysbpcat == 1
gen sysbpcat02 = 0 if sysbpcat == 0
replace sysbpcat02 = 1 if sysbpcat == 2
gen sysbpcat03 = 0 if sysbpcat == 0
replace sysbpcat03 = 1 if sysbpcat == 3
// Create variable to tag rows with missing data for the variables of interest
// This step comes last because it depends on the derived variables created above
mdesc anychd sysbpcat sex2 age50 totcholcat bmi heartrte
generate keep = 1
replace keep = 0 if missing(sysbpcat, sex2, age50, totcholcat, bmi, heartrte)
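// Sketch of an alternative way to build the same categories (an assumption, not the original author's approach)
* egen's cut() function uses left-inclusive intervals [120, 140), [140, 160), ..., which mirrors the >= cutpoints above
* sysbpcat_alt is a hypothetical name; the bounds 0 and 1000 are arbitrary limits assumed to bracket every observed sysbp value
egen sysbpcat_alt = cut(sysbp), at(0,120,140,160,1000) icodes
* Cross-tabulate against the hand-built variable to check that the two codings agree, including missing values
tabulate sysbpcat sysbpcat_alt, missing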
/// DESCRIBE COVARIATES BY EXPOSURE (SYSTOLIC BLOOD PRESSURE) ///
// Describe variables by systolic blood pressure
by sysbpcat, sort : tabulate sex2
by sysbpcat, sort : tabulate age50
by sysbpcat, sort : tabulate totcholcat
by sysbpcat, sort : sum bmi, detail
by sysbpcat, sort : sum heartrte, detail
tabulate sex2 sysbpcat01, cchi2 chi2
tabulate sex2 sysbpcat02, cchi2 chi2
tabulate sex2 sysbpcat03, cchi2 chi2
tabulate age50 sysbpcat01, cchi2 chi2
tabulate age50 sysbpcat02, cchi2 chi2
tabulate age50 sysbpcat03, cchi2 chi2
tabulate totcholcat sysbpcat01, cchi2 chi2
tabulate totcholcat sysbpcat02, cchi2 chi2
tabulate totcholcat sysbpcat03, cchi2 chi2
ttest heartrte, by(sysbpcat01)
ttest heartrte, by(sysbpcat02)
ttest heartrte, by(sysbpcat03)
ttest bmi, by(sysbpcat01)
ttest bmi, by(sysbpcat02)
ttest bmi, by(sysbpcat03)
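// Sketch of an overall comparison across all four blood pressure categories (an assumption, shown as a complement to the pairwise tests above)
* A single chi-squared test on the full sysbpcat variable avoids running three separate tests against the reference category
tabulate sex2 sysbpcat, chi2 column
* One-way ANOVA compares mean BMI and mean heart rate across the categories in one test each
oneway bmi sysbpcat, tabulate
oneway heartrte sysbpcat, tabulate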
/// MODEL BUILDING- Method 1: Forward selection model building ///
// Add 1 adjusting covariate at a time to the model of systolic blood pressure in predicting coronary heart disease
// P_entry = 0.1
logistic anychd i.sysbpcat sex2 if keep == 1
logistic anychd i.sysbpcat age50 if keep == 1
logistic anychd i.sysbpcat bmi if keep == 1
logistic anychd i.sysbpcat i.totcholcat if keep == 1
logistic anychd i.sysbpcat heartrte if keep == 1
* Results: heart rate covariate is not significant at p = 0.10, so propose not including it in the model
* Results: as all of the other covariates are highly significant, propose adding sex covariate to the model
// Add 1 adjusting covariate at a time to the model of systolic blood pressure and sex in predicting coronary heart disease
logistic anychd i.sysbpcat sex2 age50 if keep == 1
logistic anychd i.sysbpcat sex2 i.totcholcat if keep == 1
logistic anychd i.sysbpcat sex2 bmi if keep == 1
* Results: all of the other covariates are highly significant, so propose adding age covariate to the model
// Add 1 adjusting covariate at a time to the model of systolic blood pressure, sex, and age in predicting coronary heart disease
logistic anychd i.sysbpcat sex2 age50 i.totcholcat if keep == 1
logistic anychd i.sysbpcat sex2 age50 bmi if keep == 1
* Results: both of the other covariates are significant, so propose adding BMI covariate to the model
// Add 1 adjusting covariate at a time to the model of systolic blood pressure, sex, age, and BMI in predicting coronary heart disease
logistic anychd i.sysbpcat sex2 age50 bmi i.totcholcat if keep == 1
* Results: all of the covariates are significant at the p = 0.10 level, so all potential final main effects are now in the model
// Assess whether we need to adjust for these final covariates because they are confounders
logistic anychd i.sysbpcat if keep == 1
logistic anychd i.sysbpcat sex2 if keep == 1
logistic anychd i.sysbpcat age50 if keep == 1
logistic anychd i.sysbpcat sex2 age50 if keep == 1
logistic anychd i.sysbpcat sex2 age50 bmi if keep == 1
logistic anychd i.sysbpcat sex2 age50 i.totcholcat if keep == 1
logistic anychd i.sysbpcat sex2 age50 i.totcholcat bmi if keep == 1
* Results: ORs for systolic blood pressure change by more than 10% when adjusting for age and sex in the model
* Results: ORs for systolic blood pressure change by more than 10% when adjusting for BMI in the model
* Results: ORs for systolic blood pressure change by less than 10% when adjusting for cholesterol in the model
* Results: Include covariates of systolic blood pressure, adjusting for the confounders of age, sex, and BMI, in the model
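// Sketch of the 10% change-in-estimate check computed directly (an assumption about how the comparisons above could be automated)
* b_crude and b_adj are hypothetical scalar names; the highest category (3.sysbpcat) is used only as an illustration
quietly logit anychd i.sysbpcat if keep == 1
scalar b_crude = _b[3.sysbpcat]
quietly logit anychd i.sysbpcat sex2 age50 if keep == 1
scalar b_adj = _b[3.sysbpcat]
* Percent change in the odds ratio after adjustment; an absolute change above 10 suggests confounding under this rule of thumb
display "Percent change in OR: " 100 * (exp(b_adj) - exp(b_crude)) / exp(b_crude)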
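// Assess inclusion of non-linear covariate terms
* Note: gam is a user-written command (not part of official Stata) and must be installed before running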
gam anychd sysbpcat sex2 age50 totcholcat bmi heartrte if keep == 1, family(binomial) link(logit) df(heartrte: 4)
gam anychd sysbpcat sex2 age50 totcholcat bmi heartrte if keep == 1, family(binomial) link(logit) df(bmi: 4)
gam anychd sysbpcat sex2 age50 totcholcat bmi heartrte if keep == 1, family(binomial) link(logit) df(totcholcat: 2)
gam anychd sysbpcat sex2 age50 totcholcat bmi heartrte if keep == 1, family(binomial) link(logit) df(sysbpcat: 3)
* Results: sysbp does not depart from linearity, bmi does not depart from linearity
// Among these covariates, determine which might be highly correlated (possible multicollinearity)
corr sex2 age50
corr sex2 sysbpcat
corr sex2 bmi
corr age50 sysbpcat
corr age50 bmi
corr sysbpcat bmi
corr totcholcat bmi
* Results: age50 and sysbpcat have r = 0.3819; sysbpcat and bmi have r = 0.313
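// Sketch: the same pairwise correlations in one command (an assumption, equivalent in intent to the separate corr calls above)
* pwcorr prints the full correlation matrix using all available observations for each pair; sig adds the p-value under each coefficient
pwcorr sex2 age50 sysbpcat bmi totcholcat, sig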
// Add in relevant interaction terms believed to be of interest to analyze evidence of effect modification
gen intx_sysbpcat_bmi = sysbpcat*bmi
gen intx_sex2_age50 = sex2*age50
gen intx_sex2_bmi = sex2*bmi
gen intx_age50_bmi = age50*bmi
logistic anychd i.sysbpcat sex2 age50 bmi intx_sysbpcat_bmi if keep == 1
logistic anychd i.sysbpcat sex2 age50 bmi intx_sex2_age50 if keep == 1
logistic anychd i.sysbpcat sex2 age50 bmi intx_sex2_bmi if keep == 1
logistic anychd i.sysbpcat sex2 age50 bmi intx_age50_bmi if keep == 1
* Results: interactions of interest are not significant
* Results: no evidence of effect modification between the chosen covariates
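// Sketch of the blood pressure by BMI interaction with factor-variable notation (an assumption, not the original author's specification)
* i.sysbpcat##c.bmi fits the main effects plus a separate BMI slope for each blood pressure category
* The product sysbpcat*bmi above instead assumes the interaction changes linearly across the category codes
logistic anychd i.sysbpcat##c.bmi sex2 age50 if keep == 1
* Joint Wald test of the three interaction terms
testparm i.sysbpcat#c.bmi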
/// MODEL BUILDING- Method 2: Backward elimination model building ///
// P_remove = 0.11
// Include all main effects of interest in model
logistic anychd i.sysbpcat sex2 age50 i.totcholcat bmi heartrte if keep == 1
estat gof, group(10)
lroc, nograph
estat ic
* Results: adequate calibration using Hosmer-Lemeshow goodness-of-fit test (p = 0.4238)
* Results: good discrimination, area under ROC curve = 0.7028
* Results: AIC = 1224.88, BIC = 1275.207
* Results: heartrte not significant (p = .168), so propose to remove from model
// Removed heart rate variable from the model
logistic anychd i.sysbpcat sex2 age50 i.totcholcat bmi if keep == 1
estat gof, group(10)
lroc, nograph
estat ic
* Results: adequate calibration using Hosmer-Lemeshow goodness-of-fit test (p = 0.4058)
* Results: good discrimination, area under ROC curve = 0.7015
* Results: AIC = 1224.801, BIC = 1270.095
* Results: every covariate is significant except systolic blood pressure, the predictor of interest
* Results: propose assessing confounding to determine necessity of either BMI or cholesterol variables
* Note: from analysis above, cholesterol variable does not have significant evidence of being a confounder; thus propose to remove from the model
// Removed cholesterol variable from the model
logistic anychd i.sysbpcat sex2 age50 bmi if keep == 1
estat gof, group(10)
lroc
estat ic
// Represent model in terms of its coefficients rather than the odds ratios
logit anychd i.sysbpcat sex2 age50 bmi if keep == 1
* Results: every adjusting covariate currently in the model is significant and has evidence of being a confounder
* Results: adequate calibration (p = 0.2340)
* Results: good discrimination, area under ROC curve = 0.6916
* Results: AIC = 1235.586, BIC = 1270.814
* Results: These are the final main effects in the model
* Note: from analysis above, no evidence of effect modification or non-linear terms; thus, this is the final model
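// Sketch of two follow-up tests on the final model (an assumption, shown as one way to supplement the comparisons above)
* with_chol and final_model are hypothetical names for the stored estimates
* Likelihood-ratio test of the cholesterol terms; both models use keep == 1, so they are fit on the same observations
quietly logit anychd i.sysbpcat sex2 age50 i.totcholcat bmi if keep == 1
estimates store with_chol
quietly logit anychd i.sysbpcat sex2 age50 bmi if keep == 1
estimates store final_model
lrtest with_chol final_model
* Joint Wald test of all systolic blood pressure categories, the predictor of interest
testparm i.sysbpcat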