Generate Synthetic CSV Data for "volunteer_applications" and "user_skills" tables from Database Schema #118
Description
Table 1: volunteer_applications
References: user_id → users
Table 2: user_skills
References: user_id → users, cat_id → help_categories
Objective:
Create CSV files containing synthetic (mock) data for the above tables, following the provided schema structure. This is useful for testing, development, and demonstrations without using real/sensitive data.
Key Details
Input:
Database schema structure containing all table names, their respective column names, and data types.
Input file: https://github.com/saayam-for-all/data/tree/main/database/Saayam_Table.column.names_data.xlsx
The same information is available in programmatically extractable form at https://github.com/saayam-for-all/data/tree/main/database/mock-data-generation/db_info.json
Lookup/reference table file path: https://github.com/saayam-for-all/data/tree/main/database/lookup_tables
Output:
One CSV file per table with realistic synthetic data
Data adheres to column data types and constraints (string lengths, date formats, relationships)
Typically ~100 records per table (configurable)
Output file path: https://github.com/saayam-for-all/data/tree/main/database/mock_db/file_name
Data Quality Requirements:
String/Text fields: Plausible names, emails, addresses, etc.
Numeric fields: Reasonable ranges and distributions
Date/Time fields: Valid and relevant dates
Foreign keys: Respect relationships between tables (valid ID references)
Relationships between columns are maintained: e.g., if there are state and city columns, the city values must be consistent with the corresponding state values
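The state/city dependency above can be sketched with a simple lookup that always picks the state first, then a city belonging to that state. The state-to-city table here is illustrative only; a real script would use a fuller dataset or a library such as Faker:

```python
import random

# Illustrative lookup; not taken from the schema or lookup_tables folder.
STATE_CITIES = {
    "CA": ["Los Angeles", "San Diego", "San Jose"],
    "TX": ["Houston", "Austin", "Dallas"],
    "NY": ["New York", "Buffalo", "Rochester"],
}

def fake_location(rng: random.Random) -> dict:
    """Pick a state first, then a city that belongs to that state,
    so the two columns stay mutually consistent."""
    state = rng.choice(list(STATE_CITIES))
    return {"state": state, "city": rng.choice(STATE_CITIES[state])}
```

Seeding the `random.Random` instance makes the generated data reproducible across runs, which simplifies review.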
Implementation Steps:
Analyze Schema
Extract all table names, field names, and data types from the xlsx sheet
Identify constraints (primary keys, foreign keys, unique constraints)
Select Data Generation Tool
Explore fake-data generation libraries (e.g., Faker), Hugging Face datasets, or LLM-based generation
Develop Generation Scripts
Write code to generate CSVs matching your schema
Ensure correct field naming, ordering, and data types
Enforce referential integrity for foreign keys
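The steps above can be sketched as a minimal generation script for the two target tables. The column names (`application_id`, `status`, etc.) are assumptions standing in for the real ones from db_info.json, and `users` rows are stubbed out here only so the foreign keys have something to reference:

```python
import csv
import random

def generate_users(n: int, rng: random.Random) -> list[dict]:
    """Parent table: produces the user_id values child tables reference."""
    return [{"user_id": i + 1, "name": f"User {i + 1}"} for i in range(n)]

def generate_volunteer_applications(users: list[dict], n: int,
                                    rng: random.Random) -> list[dict]:
    """Child rows: every user_id is drawn from existing users,
    so referential integrity holds by construction."""
    return [
        {
            "application_id": i + 1,
            "user_id": rng.choice(users)["user_id"],
            "status": rng.choice(["pending", "approved", "rejected"]),
        }
        for i in range(n)
    ]

def write_csv(path: str, rows: list[dict]) -> None:
    """Write rows to CSV, taking the header from the first row's keys."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)

if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed for reproducible mock data
    users = generate_users(100, rng)
    apps = generate_volunteer_applications(users, 100, rng)
    write_csv("users.csv", users)
    write_csv("volunteer_applications.csv", apps)
```

Generating parents before children (users before volunteer_applications and user_skills) is what makes foreign-key enforcement trivial: child rows only ever sample from IDs that already exist.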
Output:
The scripts should go in the data repository under database/mock-data-generation
Update README.md documenting how to run the scripts and what each file represents
Store the generated CSV files in database/mock_db (e.g., users.csv, orders.csv)
Quality Review & Commit
Validate CSV structure and completeness
Commit all scripts and generated files to repository
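The validation step can be sketched as a small foreign-key check run over the generated rows before committing. Column names here are assumed; adjust them to the real schema:

```python
import csv

def load_csv(path: str) -> list[dict]:
    """Read a generated CSV back as a list of row dicts."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def check_foreign_keys(parent_rows: list[dict], child_rows: list[dict],
                       parent_key: str, child_key: str) -> list[dict]:
    """Return every child row whose foreign key has no matching parent.
    An empty result means referential integrity holds."""
    valid = {row[parent_key] for row in parent_rows}
    return [row for row in child_rows if row[child_key] not in valid]
```

For example, `check_foreign_keys(load_csv("users.csv"), load_csv("volunteer_applications.csv"), "user_id", "user_id")` should return an empty list; the same check applies to user_skills against both users and help_categories.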
Acceptance Criteria:
✅ CSV files exist for all tables in the schema
✅ Each CSV contains ~100 rows of realistic synthetic data, and generation can scale to at least 40,000 rows later
✅ Field types, formats, and relationships are respected
✅ Documentation (README) with reproduction instructions included
✅ Scripts are properly documented and reusable