# Endogeneity {#sec-endogeneity}
In applied research, it's often tempting to treat regression coefficients as if they represent **causal relationships**. A positive coefficient on advertising spend, for example, might be interpreted as evidence that increasing ad budgets will increase sales. But such interpretations rely on a critical assumption: that the independent variables we include in a model are **exogenous**.
This chapter explores the central threat to this assumption: [endogeneity](#sec-endogeneity).
Endogeneity refers to any situation where an explanatory variable is correlated with the error term in a regression model. When this happens, **our coefficient estimates are biased and inconsistent**, and any causal claims are invalid.
------------------------------------------------------------------------
To understand where endogeneity comes from, let's begin with the familiar linear regression model:
$$
\mathbf{Y = X \beta + \epsilon}
$$
Where:
- $\mathbf{Y}$ is an $n \times 1$ vector of observed outcomes,
- $\mathbf{X}$ is an $n \times k$ matrix of explanatory variables (including a column of ones for the intercept, if present),
- $\beta$ is a $k \times 1$ vector of unknown parameters,
- $\epsilon$ is an $n \times 1$ vector of unobserved error terms.
The [Ordinary Least Squares] estimator is:
$$
\begin{aligned}
\hat{\beta}_{OLS} &= (\mathbf{X}'\mathbf{X})^{-1}(\mathbf{X}'\mathbf{Y}) \\
&= (\mathbf{X}'\mathbf{X})^{-1}(\mathbf{X}'(\mathbf{X\beta + \epsilon})) \\
&= (\mathbf{X}'\mathbf{X})^{-1}(\mathbf{X}'\mathbf{X})\beta + (\mathbf{X}'\mathbf{X})^{-1}(\mathbf{X}'\epsilon) \\
&= \beta + (\mathbf{X}'\mathbf{X})^{-1}(\mathbf{X}'\epsilon)
\end{aligned}
$$
This derivation makes it clear: OLS is **unbiased** only if the second term vanishes in expectation. That is:
$$
E[(\mathbf{X}'\mathbf{\epsilon})] = 0 \quad \text{or equivalently,} \quad Cov(\mathbf{X}, \epsilon) = 0
$$
To produce valid estimates, OLS requires two conditions:
1. **Zero Conditional Mean**:\
$$
E[\epsilon \mid \mathbf{X}] = 0
$$ This implies that once we condition on the regressors, there is no systematic error left.
2. **No Correlation Between Regressors and Errors**:\
$$
Cov(\mathbf{X}, \epsilon) = 0
$$ This weaker condition is implied by the zero conditional mean assumption and is the minimal requirement for consistency. If it fails, we have an endogeneity problem.
The unconditional mean-zero condition $E[\epsilon] = 0$ is easily satisfied by including an intercept. The lack of correlation between $\mathbf{X}$ and $\epsilon$, however, is much harder to guarantee, especially in observational data.
Endogeneity violates one of the core assumptions of regression, and it has serious consequences:
- **Coefficient bias**: Estimates systematically differ from the true parameter values.
- **Inconsistency**: The bias does not vanish as the sample size increases.
- **Incorrect inference**: Hypothesis tests and confidence intervals become unreliable.
- **Misleading decisions**: In business and policy settings, this can lead to costly errors.
------------------------------------------------------------------------
There are several common sources of endogeneity [@hill2021endogeneity]. However, most problems fall into two broad categories, treated in turn below: [Endogenous Treatment](#sec-endogenous-treatment) and [Endogenous Sample Selection](#sec-endogenous-sample-selection).
### Endogenous Treatment
The treatment variable is correlated with the error term. The four canonical mechanisms below all reduce to a violation of $Cov(\mathbf{X}, \epsilon) = 0$, but they differ in the direction of the bias and the kind of remedy that helps.
- [**Omitted Variable Bias (OVB)**](#sec-omitted-variable-bias). This occurs when a relevant variable is left out of the model and is correlated with both the explanatory variable(s) and the outcome. OVB is a problem when the omitted variable is correlated with an included regressor *and* also affects the dependent variable; if either condition fails, there is no bias.
Example (Economics): We want to estimate the effect of school on earnings, but typical unobservables (e.g., motivation, ability/talent, or self-selection) pose a threat to our identification strategy.
Example (Marketing): Suppose we regress sales on advertising spend, but omit product quality. If higher-quality products get more advertising and also generate more sales, the ad spend coefficient picks up some of the effect of quality, resulting in an upward bias.
Example (Finance): Regressing firm performance on executive compensation might omit executive ability. If more able executives both command higher compensation and deliver better results, OVB leads to biased inferences.
- [**Simultaneity (Feedback Effects)**](#sec-simultaneity). Simultaneity arises when the dependent variable and an explanatory variable are determined **jointly**, in equilibrium.
Example (Economics): Price and quantity demanded are determined together in supply-and-demand models. A regression of quantity on price without modeling supply will yield a biased estimate of demand sensitivity.
- [**Reverse Causality**](#sec-reverse-causality). A special case of simultaneity where the causation runs opposite to what the model assumes.
Example (Health Policy): A naive model might regress health outcomes on insurance coverage. But it is plausible that people in poor health are more likely to purchase insurance, causing reverse causality.
Over longer time intervals (e.g., yearly business data), reverse causality can look just like simultaneity in terms of its effect on regression estimates.
- [**Measurement Errors**](#sec-measurement-error). Even if a relevant variable is included, **imprecise measurement** introduces bias. Classical measurement error in $X$ leads to **attenuation bias** (estimated coefficients are biased toward zero) and occurs frequently in survey data, behavioral measures, and administrative records.
Example (Digital Marketing): Click-through rates or exposure to ads may be tracked with browser cookies or device IDs, but such identifiers are imperfect. The resulting measurement error biases the estimated effect of advertising downward.
### Endogenous Sample Selection
Sample selection becomes a source of endogeneity when inclusion in the sample is related to the outcome variable.
Example (Labor Economics): Estimating the effect of education on wages using only employed individuals excludes those not currently working. If employment is correlated with unobserved traits (e.g., motivation), the wage equation is biased.
------------------------------------------------------------------------
Table \@ref(tab:endog-types) summarizes the common types of endogeneity.
+-----------------------------------------------------------------+-----------------------------------------------------------+---------------------------------------------+
| Type | Mechanism | Example Context |
+=================================================================+===========================================================+=============================================+
| [Omitted Variable Bias](#sec-omitted-variable-bias) | Omitted variable affects both $X$ and $Y$ | Managerial talent in finance, brand quality |
+-----------------------------------------------------------------+-----------------------------------------------------------+---------------------------------------------+
| [Simultaneity](#sec-simultaneity) | $X$ and $Y$ determined jointly | Price $\leftrightarrow$ Demand |
+-----------------------------------------------------------------+-----------------------------------------------------------+---------------------------------------------+
| [Reverse Causality](#sec-reverse-causality) | $Y$ causes $X$ (opposite direction from model assumption) | Health $\to$ Insurance |
| | | |
| | | Revenue $\to$ Ad Spend |
+-----------------------------------------------------------------+-----------------------------------------------------------+---------------------------------------------+
| [Measurement Error](#sec-measurement-error) | $X$ is observed with error | Digital metrics, survey measures |
+-----------------------------------------------------------------+-----------------------------------------------------------+---------------------------------------------+
| [Endogenous Sample Selection](#sec-endogenous-sample-selection) | Sample selection is correlated with outcome | Labor force participation, customer panels |
+-----------------------------------------------------------------+-----------------------------------------------------------+---------------------------------------------+
Table: (\#tab:endog-types) Common sources of endogeneity, the mechanism that drives each, and a representative example context.
------------------------------------------------------------------------
Endogeneity is not always fatal---if we can identify it and adjust for it, we can still make credible inferences.
1. [Control Variables](#sec-controls)
If you suspect an omitted variable but have data on it, you can include it as a control. This is called a "[selection on observables](#sec-selection-on-observables)" approach.
However, this strategy is often insufficient because:
- Many important factors are **unobserved** (e.g., motivation, ability, expectations).
- Measured variables may contain **measurement error**, creating new biases.
2. Toolbox for Endogeneity
To address more complex cases, including those involving unobservables, we introduce more advanced methods (see the [Causal Inference Toolbox](#sec-causal-inference)).
------------------------------------------------------------------------
## Endogenous Treatment {#sec-endogenous-treatment}
Endogenous treatment occurs when the variable of interest (the "treatment") is not randomly assigned and is correlated with unobserved determinants of the outcome. As discussed earlier, this can arise from omitted variables, simultaneity, or reverse causality. But even if the true variable is theoretically exogenous, [measurement error](#sec-measurement-error) can make it endogenous in practice.
This section focuses on how [measurement errors](#sec-measurement-error), especially in explanatory variables, introduce bias---typically **attenuation bias**---and why they are a central concern in applied research.
------------------------------------------------------------------------
### Measurement Errors {#sec-measurement-error}
Measurement error refers to the difference between the **true value** of a variable and its **observed (measured) value**. Almost every empirical dataset contains some discrepancy between what the researcher records and what actually occurred, and the size of that gap controls how much of an estimate reflects the underlying causal relationship versus the recording process itself. In observational settings, the problem is rarely a single rogue data point: it is a systematic feature of how variables are collected, transcribed, and reported.
The most common pathways through which mismeasurement enters a dataset can be grouped into a few broad sources:
- **Coding errors**: Manual or software-induced data entry mistakes.
- **Reporting errors**: Self-report bias, recall issues, or strategic misreporting.
These mechanical sources matter because they shape the *statistical structure* of the noise, and that structure determines whether the resulting bias is benign or destructive.
#### Two Broad Types of Measurement Error
For analytical purposes, it is helpful to separate measurement error into two regimes that have very different consequences for OLS. The first behaves like ordinary noise that washes out in large samples; the second introduces structure that survives even in the limit and can flip the sign of an estimate.
1. **Random (Stochastic) Error** --- [*Classical Measurement Error*](#sec-classical-measurement-error)
- Noise is unpredictable and averages out in expectation.
- Error is **uncorrelated** with the true variable and the regression error.
- Common in survey data, tracking errors.
2. **Systematic (Non-classical) Error** --- [*Non-Random Bias*](#sec-non-classical-measurement-error)
- Measurement error exhibits consistent patterns across observations.
- Often arises from:
- **Instrument error**: e.g., faulty sensors, uncalibrated scales.
- **Method error**: poor sampling, survey design flaws.
- **Human error**: judgment errors, social desirability bias.
The distinction between the two regimes is not just bookkeeping. It dictates which remedy is appropriate and how much we can hope to recover. Random error preserves the direction of the estimated effect, just dampening its magnitude, whereas systematic error can pull estimates in either direction depending on the correlation pattern.
**Key insight**:
- *Random error* adds **noise**, pushing estimates toward zero.
- *Systematic error* introduces **bias**, pushing estimates either upward or downward.
The remainder of this section formalizes each regime, beginning with the [classical case](#sec-classical-measurement-error) where standard derivations yield clean attenuation results, before turning to the [non-classical case](#sec-non-classical-measurement-error) where remedies typically require an [instrumental variable](#sec-instrumental-variables) or a validation sample.
------------------------------------------------------------------------
#### Classical Measurement Error {#sec-classical-measurement-error}
##### Right-Hand Side Variable {#sec-right-hand-side-variable}
Let's examine the most common and analytically tractable case: **classical measurement error** in an explanatory variable.
Suppose the true model is:
$$
Y_i = \beta_0 + \beta_1 X_i + u_i
$$
But we do not observe $X_i$ directly. Instead, we observe:
$$
\tilde{X}_i = X_i + e_i
$$
where $e_i$ is the **measurement error**, assumed classical:
- $E[e_i] = 0$
- $Cov(X_i, e_i) = 0$
- $Cov(e_i, u_i) = 0$
Now, substitute $\tilde{X}_i$ into the regression:
$$
\begin{aligned}
Y_i &= \beta_0 + \beta_1 ( \tilde{X}_i - e_i ) + u_i \\
&= \beta_0 + \beta_1 \tilde{X}_i + (u_i - \beta_1 e_i) \\
&= \beta_0 + \beta_1 \tilde{X}_i + v_i
\end{aligned}
$$
where $v_i = u_i - \beta_1 e_i$ is a **composite error** term.
Since $\tilde{X}_i$ contains $e_i$, and $v_i$ contains $e_i$, we now have:
$$
Cov(\tilde{X}_i, v_i) \neq 0
$$
This correlation violates the exogeneity assumption and introduces [endogeneity](#sec-endogeneity).
------------------------------------------------------------------------
We can derive the asymptotic bias:
$$
\begin{aligned}
E[\tilde{X}_i v_i] &= E[(X_i + e_i)(u_i - \beta_1 e_i)] \\
&= -\beta_1 Var(e_i) \\
&\neq 0
\end{aligned}
$$
This implies:
- If $\beta_1 > 0$, then $\hat{\beta}_1$ is biased **downward**.
- If $\beta_1 < 0$, then $\hat{\beta}_1$ is biased **upward**.
This is called **attenuation bias**: the estimated effect is biased toward zero.
As the **variance of the error** $Var(e_i)$ increases or $\frac{Var(e_i)}{Var(\tilde{X}_i)} \to 1$, this bias becomes more severe.
------------------------------------------------------------------------
**Attenuation Factor**
The OLS estimator based on the noisy regressor is
$$
\hat{\beta}_{OLS} = \frac{ \text{cov}(\tilde{X}, Y)}{\text{var}(\tilde{X})} = \frac{\text{cov}(X + e, \beta X + u)}{\text{var}(X + e)}.
$$
Using the assumptions of classical measurement error, it follows that:
$$
plim\ \hat{\beta}_{OLS} = \beta \cdot \frac{\sigma_X^2}{\sigma_X^2 + \sigma_e^2} = \beta \cdot \lambda,
$$
where:
- $\sigma_X^2$ is the variance of the true regressor $X$,
- $\sigma_e^2$ is the variance of the measurement error $e$, and
- $\lambda = \frac{\sigma_X^2}{\sigma_X^2 + \sigma_e^2}$ is called the **reliability ratio**, **signal-to-total variance ratio**, or **attenuation factor**.
Since $\lambda \in (0, 1]$, the bias always attenuates the estimate toward zero. The degree of attenuation bias is:
$$
plim\ \hat{\beta}_{OLS} - \beta = - (1 - \lambda)\beta,
$$
which implies:
- If $\lambda = 1$, then $plim\ \hat{\beta}_{OLS} = \beta$ --- no bias (no measurement error).
- If $\lambda < 1$, then $|plim\ \hat{\beta}_{OLS}| < |\beta|$ --- attenuation toward zero.
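To see the attenuation factor at work, below is a minimal simulation sketch. The parameter values (`beta`, `sigma_x`, `sigma_e`) are illustrative choices; with $\sigma_X^2 = \sigma_e^2$, the reliability ratio is $\lambda = 0.5$, so OLS on the noisy regressor should recover roughly half of the true slope.

```{r}
# Simulation sketch: attenuation bias under classical measurement error.
# All parameter values are illustrative.
set.seed(2024)
n       <- 1e5
beta    <- 2
sigma_x <- 1                            # sd of the true regressor X
sigma_e <- 1                            # sd of the measurement error e

x       <- rnorm(n, sd = sigma_x)       # true (unobserved) regressor
x_tilde <- x + rnorm(n, sd = sigma_e)   # observed, mismeasured regressor
y       <- beta * x + rnorm(n)          # outcome generated by the true model

# OLS on the mismeasured regressor: attenuated toward zero
coef(lm(y ~ x_tilde))["x_tilde"]

# Theoretical plim: beta * lambda
beta * sigma_x^2 / (sigma_x^2 + sigma_e^2)
```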
------------------------------------------------------------------------
**Important Notes on Measurement Error**
- **Data transformations can magnify measurement error.**
Suppose the true model is nonlinear:
$$
y = \beta x + \gamma x^2 + \epsilon,
$$
and $x$ is measured with classical error. Then, the attenuation factor for $\hat{\gamma}$ is **approximately the square** of the attenuation factor for $\hat{\beta}$:
$$
\lambda_{\hat{\gamma}} \approx \lambda_{\hat{\beta}}^2.
$$
This shows how nonlinear transformations (e.g., squares, logs) can exacerbate measurement error problems; a simulation sketch after this list illustrates the point.
- **Including covariates can increase attenuation bias.**
Adding covariates that are correlated with the mismeasured variable can **worsen** bias in the coefficient of interest, especially if the measurement error is not accounted for in those covariates.
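To check the squared-attenuation claim from the first note, the following sketch simulates a quadratic model with classical error in $x$; the parameter values are illustrative, and the $\lambda_{\hat{\gamma}} \approx \lambda_{\hat{\beta}}^2$ approximation is cleanest when $x$ and the error are normal. With $\lambda = 0.5$, the linear coefficient should shrink to about half its true value and the quadratic coefficient to about a quarter.

```{r}
# Simulation sketch: attenuation is roughly squared for the quadratic term.
# Illustrative values; x and e are standard normal, so lambda = 0.5.
set.seed(2024)
n <- 1e5
x <- rnorm(n)                      # true regressor
x_tilde <- x + rnorm(n)            # classical measurement error
y <- 1 * x + 1 * x^2 + rnorm(n)    # true beta = gamma = 1

# Coefficients on x_tilde and I(x_tilde^2) shrink to ~0.5 and ~0.25
coef(lm(y ~ x_tilde + I(x_tilde^2)))
```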
------------------------------------------------------------------------
**Remedies for Measurement Error**
To address attenuation bias caused by classical measurement error, consider the following strategies:
1. **Use validation data or survey information** to estimate $\sigma_X^2$, $\sigma_e^2$, or $\lambda$ and apply correction methods (e.g., SIMEX, regression calibration).
2. [Instrumental Variables Approach](#sec-instrumental-variables) (see the sketch after this list)\
Use an instrument $Z$ that:
- Is correlated with the true variable $X$,
- Is uncorrelated with the regression error $\epsilon$, and
- Is uncorrelated with the measurement error $e$.
3. **Reconsider the analysis**\
If no good instruments or validation data exist, and the attenuation bias is too severe, it may be prudent to reconsider the analysis or research question.
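To illustrate the instrumental variables remedy, here is a minimal sketch under stated assumptions: two independent, classically mismeasured reports of the same true $X$ are available, and the second report serves as an instrument for the first. All values are illustrative.

```{r}
# Sketch: IV correction for attenuation using a second noisy measure of X.
# Assumes two reports of X with independent classical errors (illustrative).
set.seed(2024)
n    <- 1e5
beta <- 2
x  <- rnorm(n)            # true (unobserved) regressor
x1 <- x + rnorm(n)        # primary noisy measure, used as the regressor
x2 <- x + rnorm(n)        # second noisy measure, used as the instrument
y  <- beta * x + rnorm(n)

coef(lm(y ~ x1))["x1"]    # OLS: attenuated (~1 here, since lambda = 0.5)
cov(y, x2) / cov(x1, x2)  # simple IV (Wald) estimator: ~2, bias removed
```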
------------------------------------------------------------------------
##### Left-Hand Side Variable {#sec-left-hand-side-variable}
Measurement error in the **dependent variable** (i.e., the response or outcome) is fundamentally different from measurement error in explanatory variables. Its consequences are often **less problematic** for consistent estimation of regression coefficients (e.g., the zero conditional mean assumption is not violated), but **not necessarily for statistical inference** (e.g., higher standard errors) or model fit.
------------------------------------------------------------------------
Suppose we are interested in the standard linear regression model:
$$
Y_i = \beta X_i + u_i,
$$
but we do not observe $Y_i$ directly. Instead, we observe:
$$
\tilde{Y}_i = Y_i + v_i,
$$
where:
- $v_i$ is measurement error in the dependent variable,
- $E[v_i] = 0$ (mean-zero),
- $v_i$ is uncorrelated with $X_i$ and $u_i$,
- $v_i$ is **homoskedastic** and independent across observations.
> **Be extra careful here!**
>
> These are classical‐error assumptions:
>
> 1. **Mean zero:** $\mathbb{E}[v\mid X]=0$.
> 2. **Exogeneity:** $v$ is uncorrelated with each regressor **and** with the structural disturbance $u$ (i.e., $\operatorname{Cov}(X,v)=\operatorname{Cov}(u,v)=0$).
> 3. **Homoskedasticity / finite moments** so that the law of large numbers applies.
------------------------------------------------------------------------
The regression we actually estimate is:
$$
\tilde{Y}_i = \beta X_i + u_i + v_i.
$$
We can define a composite error term:
$$
\tilde{u}_i = u_i + v_i,
$$
so that the model becomes:
$$
\tilde{Y}_i = \beta X_i + \tilde{u}_i.
$$
Under the classical-error assumptions, the extra noise simply enlarges the composite error term $\tilde{u}_i$, leaving
$$
\hat\beta^{\text{OLS}} = \beta + (X'X)^{-1}X'(u+v) \xrightarrow{p} \beta,
$$
so the estimator remains **consistent** and only its variance rises.
------------------------------------------------------------------------
**Key Insights**
- **Unbiasedness and Consistency of** $\hat{\beta}$:
As long as $E[\tilde{u}_i \mid X_i] = 0$, which holds under the classical assumptions (i.e., $E[u_i \mid X_i] = 0$ and $E[v_i \mid X_i] = 0$), the OLS estimator of $\beta$ remains **unbiased** and **consistent**.
This is because measurement error in the [left-hand side](#sec-left-hand-side-variable) does **not** induce endogeneity. The zero conditional mean assumption is preserved.
- **Interpretation (Identification of the Causal Effect)**:
Econometricians and causal researchers often focus on **consistent estimation** of causal effects under strict exogeneity. Since $v_i$ just adds noise to the outcome and doesn't systematically relate to $X_i$, the slope estimate $\hat{\beta}$ remains a valid estimate of the causal effect $\beta$.
- **Statistical Implications (Inference and Precision)**:
Although $\hat{\beta}$ is consistent, the variance of the error term increases due to the added noise $v_i$. Specifically:
$$
\text{Var}(\tilde{u}_i) = \text{Var}(u_i) + \text{Var}(v_i) = \sigma_u^2 + \sigma_v^2.
$$
This leads to:
- **Higher residual variance** $\Rightarrow$ lower $R^2$
- **Higher standard errors** for coefficient estimates
- **Wider confidence intervals**, reducing the precision of inference
Thus, even though the point estimate is valid, **inference becomes weaker**: hypothesis tests are less powerful, and conclusions less precise.
------------------------------------------------------------------------
**Practical Illustration**
- Suppose $X$ is a marketing investment and $Y$ is sales revenue.
- If sales are measured with noise (e.g., misrecorded sales data, rounding, reporting delays), the coefficient on marketing is still consistently estimated.
- However, uncertainty around the estimate grows: wider confidence intervals might make it harder to detect statistically significant effects, especially in small samples.
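A short simulation sketch makes this concrete; the slope, noise level, and sample size below are illustrative. Adding mean-zero noise to $Y$ leaves the point estimate essentially unchanged but inflates its standard error.

```{r}
# Simulation sketch: classical noise in Y keeps beta-hat consistent
# but inflates its standard error. All values are illustrative.
set.seed(2024)
n <- 1000
x <- rnorm(n)
y <- 2 * x + rnorm(n)              # true outcome, beta = 2
y_tilde <- y + rnorm(n, sd = 3)    # noisily measured outcome

fit_clean <- lm(y ~ x)
fit_noisy <- lm(y_tilde ~ x)

# Similar point estimates; the standard error roughly triples
rbind(clean = summary(fit_clean)$coefficients["x", 1:2],
      noisy = summary(fit_noisy)$coefficients["x", 1:2])
```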
------------------------------------------------------------------------
**Summary Table: Measurement Error Consequences**
Table \@ref(tab:endog-measurement-error-consequences) compares the consequences of measurement error in the regressor versus in the outcome.
| Location of Measurement Error | Bias in $\hat{\beta}$ | Consistency | Affects Inference? | Typical Concern |
|-------------------------------|-----------------------|-------------|--------------------|---------------------------|
| Regressor ($X$) | Yes (attenuation) | No | Yes | Econometric & statistical |
| Outcome ($Y$) | No | Yes | Yes | Mainly statistical |
Table: (\#tab:endog-measurement-error-consequences) Consequences of classical measurement error in the regressor versus the outcome for OLS bias, consistency, and inference.
------------------------------------------------------------------------
#### Non-Classical Measurement Error {#sec-non-classical-measurement-error}
In the classical measurement error model, we assume that the measurement error $\epsilon$ is **independent** of the true variable $X$ and of the regression disturbance $u$. However, in many realistic data scenarios, this assumption does not hold. [Non-classical measurement error](#sec-non-classical-measurement-error) refers to cases where:
- $\epsilon$ is **correlated** with $X$,
- or possibly even **correlated** with $u$.
Violating the classical assumptions introduces additional and potentially complex biases in OLS estimation.
------------------------------------------------------------------------
Recall that in the [classical measurement error model](#sec-classical-measurement-error), we observe:
$$
\tilde{X} = X + \epsilon,
$$
where:
- $\epsilon$ is independent of $X$ and $u$,
- $E[\epsilon] = 0$.
The true model is:
$$
Y = \beta X + u.
$$
Then, OLS based on the mismeasured regressor gives:
$$
\hat{\beta}_{OLS} = \frac{\text{cov}(\tilde{X}, Y)}{\text{var}(\tilde{X})} = \frac{\text{cov}(X + \epsilon, \beta X + u)}{\text{var}(X + \epsilon)}.
$$
With classical assumptions, this simplifies to:
$$
plim\ \hat{\beta}_{OLS} = \beta \cdot \frac{\sigma_X^2}{\sigma_X^2 + \sigma_\epsilon^2} = \beta \cdot \lambda,
$$
where $\lambda$ is the **reliability ratio**, which attenuates $\hat{\beta}$ toward zero.
------------------------------------------------------------------------
Let us now relax the independence assumption and allow for correlation between $X$ and $\epsilon$. In particular, suppose:
- $\text{cov}(X, \epsilon) = \sigma_{X\epsilon} \ne 0$.
Then the probability limit of the OLS estimator becomes:
$$
\begin{aligned}
plim\ \hat{\beta}
&= \frac{\text{cov}(X + \epsilon, \beta X + u)}{\text{var}(X + \epsilon)} \\
&= \frac{\beta (\sigma_X^2 + \sigma_{X\epsilon})}{\sigma_X^2 + \sigma_\epsilon^2 + 2 \sigma_{X\epsilon}}.
\end{aligned}
$$
We can rewrite this as:
$$
\begin{aligned}
plim\ \hat{\beta}
&= \beta \left(1 - \frac{\sigma_\epsilon^2 + \sigma_{X\epsilon}}{\sigma_X^2 + \sigma_\epsilon^2 + 2 \sigma_{X\epsilon}} \right) \\
&= \beta (1 - b_{\epsilon \tilde{X}}),
\end{aligned}
$$
where $b_{\epsilon \tilde{X}}$ is the **regression coefficient of** $\epsilon$ on $\tilde{X}$, or more precisely:
$$
b_{\epsilon \tilde{X}} = \frac{\text{cov}(\epsilon, \tilde{X})}{\text{var}(\tilde{X})}.
$$
This makes clear that the bias in $\hat{\beta}$ depends on how strongly the measurement error is correlated with the observed regressor $\tilde{X}$. This general formulation nests the [classical case](#sec-classical-measurement-error) as a special case:
- In classical error: $\sigma_{X\epsilon} = 0 \Rightarrow b_{\epsilon \tilde{X}} = \frac{\sigma^2_\epsilon}{\sigma^2_X + \sigma^2_\epsilon} = 1 - \lambda$.
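A small numeric sketch of the probability-limit formula above (all variance values are illustrative) shows how the sign and size of $\sigma_{X\epsilon}$ move the plim around:

```{r}
# Sketch: plim of OLS under correlated (non-classical) measurement error.
# Implements plim beta_hat = beta * (s_x2 + s_xe) / (s_x2 + s_e2 + 2 * s_xe).
plim_beta <- function(beta, s_x2, s_e2, s_xe) {
  beta * (s_x2 + s_xe) / (s_x2 + s_e2 + 2 * s_xe)
}

# Illustrative values: Var(X) = 1, Var(eps) = 0.5, with negative
# (mean-reverting), zero (classical), and positive covariance
plim_beta(beta = 1, s_x2 = 1, s_e2 = 0.5, s_xe = c(-0.4, 0, 0.4))
```

Here less than half of the variance in $\tilde{X}$ is due to measurement error, so a positive $\sigma_{X\epsilon}$ deepens the attenuation while the negative, mean-reverting case softens it.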
------------------------------------------------------------------------
**Implications of Non-Classical Measurement Error**
- When $\sigma_{X\epsilon} \ne 0$, the **attenuation bias can increase or decrease** depending on the balance of variances.
- In particular, differentiating $b_{\epsilon \tilde{X}}$ with respect to $\sigma_{X\epsilon}$ shows the direction is governed by the sign of $\sigma_X^2 - \sigma_\epsilon^2$:
    - If less than **half of the variance in** $\tilde{X}$ is due to measurement error ($\sigma_\epsilon^2 < \sigma_X^2$), increasing $\sigma_{X\epsilon}$ increases attenuation.
    - If more than half is due to measurement error ($\sigma_\epsilon^2 > \sigma_X^2$), increasing $\sigma_{X\epsilon}$ actually **reduces** attenuation.
- An important special case is **mean-reverting measurement error** ($\sigma_{X\epsilon} < 0$): the measurement error pulls observed values toward the mean, distorting estimates [@bound1989measurement; @bound2001measurement].
------------------------------------------------------------------------
##### A General Framework for Non-Classical Measurement Error
@bound2001measurement offer a unified matrix framework that accommodates measurement error in both the independent and dependent variables.
Let the true model be:
$$
\mathbf{Y = X \beta + \epsilon},
$$
but we observe $\tilde{X} = X + U$ and $\tilde{Y} = Y + v$, where:
- $U$ is a matrix of measurement error in $X$,
- $v$ is a vector of measurement error in $Y$.
Then, the OLS estimator based on the observed data is:
$$
\hat{\beta} = (\tilde{X}' \tilde{X})^{-1} \tilde{X}' \tilde{Y}.
$$
Substituting the observed quantities:
$$
\begin{aligned}
\tilde{Y} &= Y + v = X \beta + \epsilon + v, \\
&= \tilde{X} \beta - U \beta + v + \epsilon.
\end{aligned}
$$
Hence,
$$
\hat{\beta} = (\tilde{X}' \tilde{X})^{-1} \tilde{X}' (\tilde{X} \beta - U \beta + v + \epsilon),
$$
which simplifies to:
$$
\hat{\beta} = \beta + (\tilde{X}' \tilde{X})^{-1} \tilde{X}' (-U \beta + v + \epsilon).
$$
Taking the probability limit (the structural error $\epsilon$ drops out because it is assumed uncorrelated with $\tilde{X}$):
$$
plim\ \hat{\beta} = \beta + plim\ [(\tilde{X}' \tilde{X})^{-1} \tilde{X}' (-U \beta + v)].
$$
Now define:
$$
W = [U \quad v],
$$
and we can express the bias compactly as:
$$
plim\ \hat{\beta} = \beta + plim\ [(\tilde{X}' \tilde{X})^{-1} \tilde{X}' W
\begin{bmatrix}
- \beta \\
1
\end{bmatrix}
].
$$
This formulation highlights a powerful insight:
> Bias in $\hat{\beta}$ arises from the linear projection of the measurement errors onto the observed $\tilde{X}$.
This expression **does not assert** that $v$ *necessarily* biases $\hat\beta$; it simply makes explicit that bias arises whenever the *linear projection* of $(-U\beta + v)$ onto $\tilde X$ is non-zero. Three cases illustrate the point (Table \@ref(tab:endog-measurement-error-cases)).
+--------------------------------------------------------------------+-----------------------------------+----------------------------------------------------+
| Case | Key correlation | Consequence for $\hat\beta$ |
+--------------------------------------------------------------------+-----------------------------------+----------------------------------------------------+
| [**Classical Y‑error only**](#sec-left-hand-side-variable) | projection term vanishes | **Consistent**; larger standard errors |
| | | |
| $U\equiv0, \operatorname{Cov}(\tilde X,v)=0$ | | |
+--------------------------------------------------------------------+-----------------------------------+----------------------------------------------------+
| **Correlated Y‑error** | projection picks up $v$ | **Biased** (attenuation or sign reversal possible) |
| | | |
| $U\equiv0, \operatorname{Cov}(\tilde X,v)\neq0$ | | |
+--------------------------------------------------------------------+-----------------------------------+----------------------------------------------------+
| **Both X‑ and Y‑error, independent** | $U\beta$ projects onto $\tilde X$ | **Biased** because of $U$, **not** $v$ |
| | | |
| $\operatorname{Cov}(X,U)\neq0, \operatorname{Cov}(\tilde X,v)=0$ | | |
+--------------------------------------------------------------------+-----------------------------------+----------------------------------------------------+
Table: (\#tab:endog-measurement-error-cases) Three measurement-error cases and the consequence each has for OLS bias and consistency.
Hence, the familiar ["harmless $Y$-noise" result](#sec-left-hand-side-variable) is the special case in the first row.
------------------------------------------------------------------------
**Practical implications**
1. **Check assumptions explicitly.** If the dataset was generated by self‑reports, simultaneous proxies, or modelled outcomes, it is rarely safe to assume $\operatorname{Cov}(X,v)=0$.
2. **Correlated errors in** $Y$ can creep in through:
- **Common data‑generating mechanisms** (e.g., same survey module records both earnings ($Y$) and hours worked ($X$)).
- **Prediction‑generated variables** where $v$ inherits correlation with the features used to build $\tilde Y$.
3. **Joint mis‑measurement** ($U$ and $v$ correlated) is common in administrative or sensor data; here, even "classical" $v$ with respect to $X$ can correlate with $\tilde X=X+U$.
> **Measurement error in** $Y$ is benign *only* under strong exogeneity and independence conditions. The Bound--Brown--Mathiowetz matrix form [@bound2001measurement] simply shows that once those conditions fail---or once $X$ itself is mis‑measured---the same projection logic that produces attenuation bias for $X$ can also transmit bias from $v$ to $\hat\beta$.
So the rule of thumb you learned is true in its narrow, classical setting, but @bound2001measurement remind us that empirical work often strays outside that safe harbor.
------------------------------------------------------------------------
**Consequences and Correction**
- Non-classical error can lead to **over- or underestimation**, unlike the always-attenuating classical case.
- The direction and magnitude of bias depend on the correlation structure of $X$, $\epsilon$, and $v$.
- This poses serious problems in many survey and administrative data settings where systematic misreporting occurs.
------------------------------------------------------------------------
**Practical Solutions**
1. [Instrumental Variables](#sec-instrumental-variables)\
Use an instrument $Z$ that is correlated with the true variable $X$, but uncorrelated with both measurement error and the regression disturbance. IV can help eliminate both [classical](#sec-classical-measurement-error) and [non-classical](#sec-non-classical-measurement-error) error-induced biases.
2. **Validation Studies**\
Use a subset of the data with accurate measures to estimate the structure of measurement error and correct estimates via techniques such as regression calibration, multiple imputation, or SIMEX.
3. **Modeling the Error Process**\
Explicitly model the measurement error process, especially in longitudinal or panel data (e.g., via state-space models or Bayesian approaches).
4. **Binary/Dummy Variable Case**\
Non-classical error in binary regressors (e.g., misclassification) also leads to bias, but IV methods still apply. For example, if education level is misreported in survey data, a valid instrument (e.g., policy-based variation) can correct for misclassification bias.
------------------------------------------------------------------------
**Summary**
Table \@ref(tab:endog-classical-vs-nonclassical-error) contrasts classical and non-classical measurement error.
| Feature | Classical Error | Non-Classical Error |
|------------------------------|--------------------|---------------------------------|
| $\text{Cov}(X, \epsilon)$ | 0 | $\ne 0$ |
| Bias in $\hat{\beta}$ | Always attenuation | Can attenuate or inflate |
| Consistency of OLS | No | No |
| Effect of Variance Structure | Predictable | Depends on $\sigma_{X\epsilon}$ |
| Fixable with IV | Yes | Yes |
Table: (\#tab:endog-classical-vs-nonclassical-error) Classical versus non-classical measurement error, compared on bias direction, consistency, and remediation.
> In short, **non-classical measurement error breaks the comforting regularity of attenuation bias**. It can produce arbitrary biases depending on the nature and structure of the error. [Instrumental variables](#sec-instrumental-variables) and validation studies are often the only reliable tools for addressing this complex problem.
------------------------------------------------------------------------
#### Solution to Measurement Errors in Correlation Estimation
##### Bayesian Correction for Correlation Coefficient
We begin by expressing the Bayesian posterior for a correlation coefficient $\rho$:
$$
\begin{aligned}
P(\rho \mid \text{data}) &= \frac{P(\text{data} \mid \rho) P(\rho)}{P(\text{data})} \\
\text{Posterior Probability} &\propto \text{Likelihood} \times \text{Prior Probability}
\end{aligned}
$$
Where:
- $\rho$ is the true population correlation coefficient
- $P(\text{data} \mid \rho)$ is the likelihood function
- $P(\rho)$ is the prior density of $\rho$
- $P(\text{data})$ is the marginal likelihood (a normalizing constant)
With sample correlation coefficient $r$:
$$
r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}
$$
According to @schisterman2003estimation, p. 3, the posterior density of $\rho$ can be approximated as:
$$
P(\rho \mid x, y) \propto P(\rho) \cdot \frac{(1 - \rho^2)^{(n - 1)/2}}{(1 - \rho r)^{n - 3/2}}
$$
This approximation leads to a posterior that can be modeled via the Fisher transformation:
- Let $\rho = \tanh(\xi)$, where $\xi \sim N(z, 1/n)$
- $r = \tanh(z)$ is the Fisher-transformed correlation
Using conjugate normal approximations, we derive the posterior for the transformed correlation $\xi$ as:
- **Posterior Variance:**
$$
\sigma^2_{\text{posterior}} = \frac{1}{n_{\text{prior}} + n_{\text{likelihood}}}
$$
- **Posterior Mean:**
$$
\mu_{\text{posterior}} = \sigma^2_{\text{posterior}} \left(n_{\text{prior}} \cdot \tanh^{-1}(r_{\text{prior}}) + n_{\text{likelihood}} \cdot \tanh^{-1}(r_{\text{likelihood}})\right)
$$
To simplify the mathematics, we may assume a prior of the form:
$$
P(\rho) \propto (1 - \rho^2)^c
$$
where $c$ controls the strength of the prior. If no prior information is available, we can set $c = 0$ so that $P(\rho) \propto 1$.
------------------------------------------------------------------------
**Example: Combining Estimates from Two Studies**
Let:
- Current study: $r_{\text{likelihood}} = 0.5$, $n_{\text{likelihood}} = 200$
- Prior study: $r_{\text{prior}} = 0.2765$, $n_{\text{prior}} = 50205$
**Step 1: Posterior Variance**
$$
\sigma^2_{\text{posterior}} = \frac{1}{50205 + 200} = 0.0000198393
$$
**Step 2: Posterior Mean**
Apply Fisher transformation:
- $\tanh^{-1}(0.2765) \approx 0.2841$
- $\tanh^{-1}(0.5) = 0.5493$
Then:
$$
\begin{aligned}
\mu_{\text{posterior}} &= 0.0000198393 \times (50205 \times 0.2841 + 200 \times 0.5493) \\
&= 0.0000198393 \times (14260.7 + 109.86) \\
&= 0.0000198393 \times 14370.56 = 0.2850
\end{aligned}
$$
Thus, the posterior distribution of $\xi = \tanh^{-1}(\rho)$ is:
$$
\xi \sim N(0.2850, 0.0000198393)
$$
Transforming back:
- Posterior mean correlation: $\rho = \tanh(0.2850) = 0.2776$
- 95% CI for $\xi$: $0.2850 \pm 1.96 \cdot \sqrt{0.0000198393} = (0.2762, 0.2937)$
- Transforming endpoints: $\tanh(0.2762) = 0.2694$, $\tanh(0.2937) = 0.2855$
The Bayesian posterior distribution for the correlation coefficient is:
- Mean: $\hat{\rho}_{\text{posterior}} = 0.2776$
- 95% CI: $(0.2694,\ 0.2855)$
------------------------------------------------------------------------
This Bayesian adjustment is especially useful when:
1. There is high sampling variation due to small sample sizes
2. Measurement error attenuates the observed correlation
3. Combining evidence from multiple studies (meta-analytic context)
By leveraging prior information and applying the Fisher transformation, researchers can obtain a more stable and accurate estimate of the true underlying correlation.
```{r}
# Define inputs
n_new <- 200
r_new <- 0.5
alpha <- 0.05
# Bayesian update function for correlation coefficient
update_correlation <- function(n_new, r_new, alpha) {
# Prior (meta-analysis study)
n_meta <- 50205
r_meta <- 0.2765
# Step 1: Posterior variance (in Fisher-z space)
var_xi <- 1 / (n_new + n_meta)
# Step 2: Posterior mean (in Fisher-z space)
mu_xi <- var_xi * (n_meta * atanh(r_meta) + n_new * atanh(r_new))
# Step 3: Confidence interval in Fisher-z space
z_crit <- qnorm(1 - alpha / 2)
upper_xi <- mu_xi + z_crit * sqrt(var_xi)
lower_xi <- mu_xi - z_crit * sqrt(var_xi)
# Step 4: Transform back to correlation scale
mean_rho <- tanh(mu_xi)
upper_rho <- tanh(upper_xi)
lower_rho <- tanh(lower_xi)
# Return all values as a list
list(
mu_xi = mu_xi,
var_xi = var_xi,
upper_xi = upper_xi,
lower_xi = lower_xi,
mean_rho = mean_rho,
upper_rho = upper_rho,
lower_rho = lower_rho
)
}
# Run update
updated <-
update_correlation(n_new = n_new,
r_new = r_new,
alpha = alpha)
# Display updated posterior mean and confidence interval
cat("Posterior mean of rho:", round(updated$mean_rho, 4), "\n")
cat(
"95% CI for rho: (",
round(updated$lower_rho, 4),
",",
round(updated$upper_rho, 4),
")\n"
)
# For comparison: Classical (frequentist) confidence interval around r_new
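# (rough large-sample approximation: ignores the (1 - r^2) factor in the SE of r)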
se_r <- sqrt(1 / n_new)
z_r <- qnorm(1 - alpha / 2) * se_r
ci_lo <- r_new - z_r
ci_hi <- r_new + z_r
cat("Frequentist 95% CI for r:",
round(ci_lo, 4),
"to",
round(ci_hi, 4),
"\n")
```
------------------------------------------------------------------------
### Simultaneity {#sec-simultaneity}
Simultaneity arises when at least one of the explanatory variables in a regression model is **jointly determined** with the dependent variable, violating a critical assumption for causal inference: **temporal precedence**.
#### Why Simultaneity Matters
- In classical regression, we assume that regressors are determined **exogenously**---they are not influenced by the dependent variable.
- Simultaneity introduces [endogeneity](#sec-endogeneity), where regressors are correlated with the error term, rendering OLS **estimators biased and inconsistent**.
- This has major implications in fields like economics, marketing, finance, and social sciences, where feedback mechanisms or equilibrium processes are common.
#### Real-World Examples
- **Demand and supply**: Price and quantity are determined together in market equilibrium.
- **Sales and advertising**: Advertising influences sales, but firms also adjust advertising based on current or anticipated sales.
- **Productivity and investment**: Higher productivity may attract investment, but investment can improve productivity.
------------------------------------------------------------------------
#### Simultaneous Equation System
We begin with a basic two-equation structural model:
$$
\begin{aligned}
Y_i &= \beta_0 + \beta_1 X_i + u_i \quad \text{(Structural equation for } Y) \\
X_i &= \alpha_0 + \alpha_1 Y_i + v_i \quad \text{(Structural equation for } X)
\end{aligned}
$$
Here:
- $Y_i$ and $X_i$ are [endogenous variables](#sec-endogeneity) --- both determined within the system.
- $u_i$ and $v_i$ are structural error terms, assumed to be mutually uncorrelated and uncorrelated with any exogenous variables.
The equations form a **simultaneous system** because each endogenous variable appears on the right-hand side of the other's equation.
------------------------------------------------------------------------
To uncover the statistical properties of these equations, we solve for $Y_i$ and $X_i$ as functions of the error terms only:
$$
\begin{aligned}
Y_i &= \frac{\beta_0 + \beta_1 \alpha_0}{1 - \alpha_1 \beta_1} + \frac{\beta_1 v_i + u_i}{1 - \alpha_1 \beta_1} \\
X_i &= \frac{\alpha_0 + \alpha_1 \beta_0}{1 - \alpha_1 \beta_1} + \frac{v_i + \alpha_1 u_i}{1 - \alpha_1 \beta_1}
\end{aligned}
$$
These are the **reduced-form equations**, expressing the endogenous variables as functions of exogenous factors and disturbances.
------------------------------------------------------------------------
#### Simultaneity Bias in OLS
If we naïvely estimate the first equation using OLS, assuming $X_i$ is exogenous, we get:
$$
\text{Bias: } \quad Cov(X_i, u_i) = Cov\left(\frac{v_i + \alpha_1 u_i}{1 - \alpha_1 \beta_1}, u_i\right) = \frac{\alpha_1}{1 - \alpha_1 \beta_1} \cdot Var(u_i)
$$
This violates the exogeneity requirement underlying the [Gauss-Markov Theorem] that regressors be uncorrelated with the error term. The OLS estimator for $\beta_1$ is therefore **biased and inconsistent**.
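The bias is easy to verify by simulation. The sketch below (illustrative parameter values) generates data from the reduced form of the two-equation system and compares naive OLS with the probability limit implied by the covariance above:

```{r}
# Sketch: simultaneity bias in the two-equation system. Values are illustrative.
set.seed(2024)
n  <- 1e5
b0 <- 1; b1 <- 0.5     # structural equation for Y
a0 <- 2; a1 <- -0.4    # structural equation for X
u  <- rnorm(n); v <- rnorm(n)

# Generate data from the reduced-form equations
den <- 1 - a1 * b1
Y <- (b0 + b1 * a0) / den + (b1 * v + u) / den
X <- (a0 + a1 * b0) / den + (v + a1 * u) / den

coef(lm(Y ~ X))["X"]               # naive OLS: far from b1 = 0.5
b1 + (a1 / den) * var(u) / var(X)  # plim implied by Cov(X, u) / Var(X)
```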
------------------------------------------------------------------------
To allow for identification and estimation, we introduce **exogenous variables**:
$$
\begin{cases}
Y_i = \beta_0 + \beta_1 X_i + \beta_2 T_i + u_i \\
X_i = \alpha_0 + \alpha_1 Y_i + \alpha_2 Z_i + v_i
\end{cases}
$$
Where:
- $X_i$, $Y_i$ --- [endogenous](#sec-endogeneity) variables
- $T_i$, $Z_i$ --- **exogenous** variables, not influenced by any variable in the system
------------------------------------------------------------------------
Solving this system algebraically yields the reduced form model:
$$
\begin{aligned}
Y_i &= \frac{\beta_0 + \beta_1 \alpha_0}{1 - \alpha_1 \beta_1} + \frac{\beta_1 \alpha_2}{1 - \alpha_1 \beta_1} Z_i + \frac{\beta_2}{1 - \alpha_1 \beta_1} T_i + \tilde{u}_i = B_0 + B_1 Z_i + B_2 T_i + \tilde{u}_i \\
X_i &= \frac{\alpha_0 + \alpha_1 \beta_0}{1 - \alpha_1 \beta_1} + \frac{\alpha_2}{1 - \alpha_1 \beta_1} Z_i + \frac{\alpha_1 \beta_2}{1 - \alpha_1 \beta_1} T_i + \tilde{v}_i = A_0 + A_1 Z_i + A_2 T_i + \tilde{v}_i
\end{aligned}
$$
The reduced form expresses **endogenous variables as functions of exogenous instruments**, which we can estimate using OLS.
------------------------------------------------------------------------
Using reduced-form estimates $(A_1, A_2, B_1, B_2)$, we can identify (recover) the structural coefficients:
$$
\begin{aligned}
\beta_1 &= \frac{B_1}{A_1} \\
\beta_2 &= B_2 \left(1 - \frac{B_1 A_2}{A_1 B_2}\right) \\
\alpha_1 &= \frac{A_2}{B_2} \\
\alpha_2 &= A_1 \left(1 - \frac{B_1 A_2}{A_1 B_2} \right)
\end{aligned}
$$
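A simulation sketch of this indirect-least-squares recovery, using illustrative parameter values for the structural system with exogenous $T$ and $Z$:

```{r}
# Sketch: recover structural coefficients from reduced-form OLS estimates.
# Values are illustrative; T_ avoids masking R's built-in T (TRUE).
set.seed(2024)
n  <- 1e5
b0 <- 1; b1 <- 0.5;  b2 <- 1.5   # structural equation for Y
a0 <- 2; a1 <- -0.4; a2 <- 0.8   # structural equation for X
T_ <- rnorm(n); Z <- rnorm(n)    # exogenous variables
u  <- rnorm(n); v  <- rnorm(n)

# Generate data from the reduced-form equations
den <- 1 - a1 * b1
Y <- (b0 + b1 * a0 + b1 * a2 * Z + b2 * T_ + b1 * v + u) / den
X <- (a0 + a1 * b0 + a2 * Z + a1 * b2 * T_ + v + a1 * u) / den

B <- coef(lm(Y ~ Z + T_))   # reduced-form estimates B0, B1, B2
A <- coef(lm(X ~ Z + T_))   # reduced-form estimates A0, A1, A2

ratio <- B["Z"] * A["T_"] / (A["Z"] * B["T_"])  # estimates alpha1 * beta1
c(beta1  = unname(B["Z"] / A["Z"]),
  alpha1 = unname(A["T_"] / B["T_"]),
  beta2  = unname(B["T_"] * (1 - ratio)),
  alpha2 = unname(A["Z"]  * (1 - ratio)))
```

The recovered values should be close to $(\beta_1, \alpha_1, \beta_2, \alpha_2) = (0.5, -0.4, 1.5, 0.8)$.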
------------------------------------------------------------------------
#### Identification Conditions
Estimation of structural parameters is only possible if the model is **identified**.
**Order Condition (Necessary but Not Sufficient)**
A structural equation can be **identified** only if:
$$
K - k \ge m - 1
$$
Where:
- $M$ = total number of endogenous variables in the system
- $m$ = number of endogenous variables in the given equation
- $K$ = number of total exogenous variables in the system
- $k$ = number of exogenous variables appearing in the given equation
- **Just-identified**: $K - k = m - 1$ (exact identification)
- **Over-identified**: $K - k > m - 1$ (more instruments than necessary)
- **Under-identified**: $K - k < m - 1$ (cannot be estimated)
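A tiny helper sketch for classifying an equation by the order condition; the counts in the usage line correspond to the two-equation system above, whose exogenous variables are $T$ and $Z$:

```{r}
# Sketch: classify a structural equation by the order condition.
# K = exogenous variables in the system, k = exogenous variables in the
# equation, m = endogenous variables appearing in the equation.
order_condition <- function(K, k, m) {
  excess <- (K - k) - (m - 1)
  if (excess > 0) {
    "over-identified"
  } else if (excess == 0) {
    "just-identified"
  } else {
    "under-identified"
  }
}

# Each equation above: K = 2 (T and Z), k = 1, m = 2 (Y and X)
order_condition(K = 2, k = 1, m = 2)
```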