An Independent Implementation of "Parental environment and student achievement: Does a Matthew effect exist?"
Repository: https://github.com/chirindaopensource/parental_environment_and_student_achievement
Owner: 2025 Craig Chirinda (Open Source Projects)
This repository contains an independent, professional-grade Python implementation of the research methodology from the 2025 paper entitled "Parental environment and student achievement: Does a Matthew effect exist?" by:
- Gaëlle Aymeric
- Emmanuelle Lavaine
- Brice Magdalou
The project provides a complete, end-to-end computational framework for replicating the paper's findings. It delivers a modular, auditable, and extensible pipeline that executes the entire research workflow: from rigorous data validation and feature engineering to the core econometric estimations (OLS and IV with Fixed Effects) and the final generation of all tables, figures, and reports.
- Introduction
- Theoretical Background
- Features
- Methodology Implemented
- Core Components (Notebook Structure)
- Key Callable:
execute_full_study - Prerequisites
- Installation
- Input Data Structure
- Usage
- Output Structure
- Project Structure
- Customization
- Contributing
- Recommended Extensions
- License
- Citation
- Acknowledgments
This project provides a Python implementation of the methodologies presented in the 2025 paper "Parental environment and student achievement: Does a Matthew effect exist?". The core of this repository is the iPython Notebook parental_environment_and_student_achievement_draft.ipynb, which contains a comprehensive suite of functions to replicate the paper's findings, from initial data validation to the final generation of all analytical tables and figures.
The paper investigates the causal impact of parental environment on student achievement, distinguishing between a static "persistent effect" and a dynamic "Matthew Effect" (the changing influence of parental background over time). This codebase operationalizes the paper's framework, allowing users to:
- Rigorously validate and manage the entire experimental configuration via a single
config.yamlfile. - Process raw student survey data and external instrument data, applying a sequence of cleaning, harmonization, and feature engineering steps.
- Estimate all 33 econometric models from the paper, including pooled OLS, interaction OLS, and their 2SLS instrumental variable counterparts, all with high-dimensional (school and year) fixed effects and clustered standard errors.
- Run a full suite of diagnostic and robustness checks to validate the credibility of the findings.
- Automatically generate all key figures and a final narrative report summarizing the results and policy implications.
The implemented methods are grounded in modern microeconometrics and causal inference.
1. High-Dimensional Fixed Effects (HDFE) OLS:
The baseline models for the persistent effect (Equation 1) and the Matthew Effect (Equation 2) are estimated via OLS with fixed effects for both school (
2. Two-Stage Least Squares (2SLS) with an Instrumental Variable (IV):
To address the endogeneity of parental education (PHE), the study employs a 2SLS strategy. The instrument is the gender gap in tertiary education in the parent's country of origin in 1960 (GG).
- First Stage (Relevance): The endogenous variable is regressed on the instrument and all exogenous controls. $$ \text{PHE}_i = \pi_0 + \pi_1 \text{GG}i + \text{Controls} + a_t + b_s + \nu{its} \quad (3) $$
- Second Stage (Causal Effect): The outcome is regressed on the predicted value of the endogenous variable from the first stage. $$ Y_{its} = \alpha_0 + \alpha_1 \widehat{\text{PHE}}i + \text{Controls} + a_t + b_s + \epsilon{its} \quad (4) $$ This methodology is extended to the interaction models to estimate the causal Matthew Effect.
3. Cluster-Robust Standard Errors:
All models are estimated with standard errors clustered at the school_id level. This accounts for the fact that residuals for students within the same school may be correlated, providing more conservative and reliable statistical inference.
The provided iPython Notebook (parental_environment_and_student_achievement_draft.ipynb) implements the full research pipeline, including:
- Modular, Multi-Task Architecture: The entire pipeline is broken down into 27 distinct, modular tasks, each with its own orchestrator function, ensuring clarity and testability.
- Configuration-Driven Design: All study parameters are managed in an external
config.yamlfile, allowing for easy customization and replication. - Rigorous Data Validation: A multi-stage validation process checks the schema, content integrity, and statistical properties of all input data before any analysis begins.
- Systematic Feature Engineering: A transparent, step-by-step process for constructing the main endogenous variable (
PHE), the instrumental variable (GG), and the full set of dummy-coded control variables. - Advanced Econometric Engine: A generic estimation function that correctly handles high-dimensional fixed effects and clustered standard errors for both OLS and 2SLS models using the
linearmodelslibrary. - Comprehensive Estimation Suite: Programmatically specifies and estimates all 33 models from the paper.
- Automated Diagnostics and Reporting: Concludes by automatically generating all key figures, a full suite of robustness checks, and a final narrative summary of the findings.
The core analytical steps directly implement the methodology from the paper:
- Data Validation & Cleaning (Tasks 1-5): Ingests and validates the
config.yamland raw data, cleanses the student survey data by applying inclusion criteria, and harmonizes all categorical codings. - Feature Engineering (Tasks 6-8): Constructs the
PHEvariable and its dummies, creates the full set of control variables, and constructs theGGinstrumental variable. - Sample & Model Definition (Tasks 9-11): Programmatically defines the 33 distinct analysis samples using listwise deletion and creates the formal specifications for all OLS and IV models.
- Estimation (Tasks 12-20): Executes all OLS and IV regressions (pooled, by-grade, and interaction), extracting and formatting the results.
- Visualization & Diagnostics (Tasks 21-24): Generates all CDF and coefficient plots, and performs final diagnostic checks on sample composition and data normalization.
- Robustness & Reporting (Tasks 25-27): Provides top-level orchestrators to run the full pipeline, execute a suite of robustness checks, and synthesize the final narrative report.
The parental_environment_and_student_achievement_draft.ipynb notebook is structured as a logical pipeline with modular orchestrator functions for each of the major tasks. All functions are self-contained, fully documented with type hints and docstrings, and designed for professional-grade execution.
The project is designed around a single, top-level user-facing interface function:
execute_full_study: This master orchestrator function, located in the final section of the notebook, runs the entire automated research pipeline from end-to-end. A single call to this function reproduces the entire computational portion of the project, from data validation to the final report generation.
- Python 3.9+
- Core dependencies:
pandas,numpy,matplotlib,linearmodels,pyyaml,faker.
-
Clone the repository:
git clone https://github.com/chirindaopensource/parental_environment_and_student_achievement.git cd parental_environment_and_student_achievement -
Create and activate a virtual environment (recommended):
python -m venv venv source venv/bin/activate # On Windows, use `venv\Scripts\activate`
-
Install Python dependencies:
pip install pandas numpy matplotlib linearmodels pyyaml Faker
The pipeline requires two pandas.DataFrame objects with specific schemas, which are rigorously validated by the pipeline. A synthetic data generator is included in the notebook for demonstration purposes.
student_parent_survey_raw: The primary dataset containing student and parent information.barro_lee_raw: The external dataset with historical education data for constructing the instrument.
All other parameters are controlled by the config.yaml file.
The parental_environment_and_student_achievement_draft.ipynb notebook provides a complete, step-by-step guide. The primary workflow is to execute the final cell of the notebook, which demonstrates how to use the top-level execute_full_study orchestrator:
# Final cell of the notebook
# This block serves as the main entry point for the entire project.
if __name__ == '__main__':
# The notebook includes a synthetic data generation stage.
# We assume `student_parent_survey_raw` and `barro_lee_raw` are in memory.
# Load the configuration from the YAML file.
with open('config.yaml', 'r') as f:
config = yaml.safe_load(f)
# Define the top-level directory for all outputs.
RESULTS_DIRECTORY = "study_replication_output"
# Execute the entire replication study.
full_results = execute_full_study(
student_parent_survey_raw=student_parent_survey_raw,
barro_lee_raw=barro_lee_raw,
config=config,
output_dir=RESULTS_DIRECTORY
)The pipeline creates a base_output_dir and returns a comprehensive dictionary containing all analytical artifacts, structured as follows:
main_analysis: Contains all primary results.results: Nested dictionary of all regression tables (DataFrames).diagnostics: Diagnostic tables (e.g., sample flow).figures: List of paths to all generated plots (PNG files).
robustness_checks: Contains all sensitivity analysis reports.final_summary_report: A string containing the final narrative summary.
parental_environment_and_student_achievement/
│
├── parental_environment_and_student_achievement_draft.ipynb # Main implementation notebook
├── config.yaml # Master configuration file
├── requirements.txt # Python package dependencies
│
├── study_replication_output/ # Example output directory
│ └── figures/
│ ├── cdf_math_grade_3.png
│ └── ...
│
├── LICENSE # MIT Project License File
└── README.md # This file
The pipeline is highly customizable via the config.yaml file. Users can easily modify all study parameters, including date ranges, variable definitions, and model specifications, without altering the core Python code.
Contributions are welcome. Please fork the repository, create a feature branch, and submit a pull request with a clear description of your changes. Adherence to PEP 8, type hinting, and comprehensive docstrings is required.
Future extensions could include:
- Alternative Instruments: Integrating and testing alternative instrumental variables to check the robustness of the causal claims.
- Heterogeneity Analysis: Exploring how the Matthew Effect varies across different subgroups of the population (e.g., by gender, immigrant status, or school type).
- Machine Learning Models: Using non-parametric or machine learning methods (e.g., causal forests) to explore potential non-linearities in the treatment effects.
- Structural Modeling: Building and estimating a structural model of skill formation to provide deeper theoretical insights into the observed dynamic patterns.
This project is licensed under the MIT License. See the LICENSE file for details.
If you use this code or the methodology in your research, please cite the original paper:
@article{aymeric2025parental,
title={Parental environment and student achievement: Does a Matthew effect exist?},
author={Aymeric, Ga{\"e}lle and Lavaine, Emmanuelle and Magdalou, Brice},
journal={arXiv preprint arXiv:2510.18481},
year={2025}
}For the implementation itself, you may cite this repository:
Chirinda, C. (2025). A Professional-Grade Implementation of "Parental environment and student achievement: Does a Matthew effect exist?".
GitHub repository: https://github.com/chirindaopensource/parental_environment_and_student_achievement
- Credit to Gaëlle Aymeric, Emmanuelle Lavaine, and Brice Magdalou for the foundational research that forms the entire basis for this computational replication.
- This project is built upon the exceptional tools provided by the open-source community. Sincere thanks to the developers of the scientific Python ecosystem, including Pandas, NumPy, Matplotlib, and the
linearmodelslibrary.
--
This README was generated based on the structure and content of the parental_environment_and_student_achievement_draft.ipynb notebook and follows best practices for research software documentation.