
Generate Synthetic CSV Data for "volunteer_applications" and "user_skills" tables from Database Schema #118

@prachi080588

Description

Table 1: volunteer_applications
References: user_id → users

Table 2: user_skills
References: user_id → users, cat_id → help_categories

Objective:
Create CSV files containing synthetic (mock) data for the above tables, following the provided schema structure. This is useful for testing, development, and demonstrations without using real/sensitive data.

Key Details
Input:
Database schema listing all table names with their respective column names and data types.

Input file : https://github.com/saayam-for-all/data/tree/main/database/Saayam_Table.column.names_data.xlsx

The same schema is available in programmatically extractable form at https://github.com/saayam-for-all/data/tree/main/database/mock-data-generation/db_info.json

Lookup/reference tables: https://github.com/saayam-for-all/data/tree/main/database/lookup_tables

Output:
One CSV file per table with realistic synthetic data
Adheres to data types and constraints (string lengths, date formats, relationships)
Typically ~100 records per table (configurable)
Output file path: https://github.com/saayam-for-all/data/tree/main/database/mock_db/file_name

Data Quality Requirements:
String/Text fields: Plausible names, emails, addresses, etc.
Numeric fields: Reasonable ranges and distributions
Date/Time fields: Valid and relevant dates
Foreign keys: Respect relationships between tables (valid ID references)
Dependencies between columns are maintained, e.g., if there are state and city columns, each city value must be valid for its state
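The state/city dependency above can be sketched by picking the parent value first and drawing the dependent value from a mapping. The mapping below is purely illustrative; in practice it should be loaded from the files under database/lookup_tables.

```python
import random

# Illustrative state -> city mapping; the real values should come from
# the lookup tables in database/lookup_tables, not be hardcoded.
CITIES_BY_STATE = {
    "CA": ["Los Angeles", "San Diego", "San Jose"],
    "TX": ["Houston", "Austin", "Dallas"],
    "NY": ["New York", "Buffalo", "Rochester"],
}

def random_location():
    """Pick a state first, then a city that actually belongs to that state."""
    state = random.choice(list(CITIES_BY_STATE))
    city = random.choice(CITIES_BY_STATE[state])
    return state, city

state, city = random_location()
```

Generating the dependent column from the parent column (rather than independently) is what keeps the pair consistent.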

Implementation Steps:
Analyze Schema
Extract all table names, field names, and data types provided in the XLSX sheet
Identify constraints (primary keys, foreign keys, unique constraints)
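A minimal sketch of the schema-analysis step, assuming db_info.json maps each table name to a list of column entries with name/type keys and optional key flags. The actual structure of the file may differ, so the key names here are assumptions to adjust against the real JSON.

```python
import json

# Hypothetical shape for db_info.json; inlined here so the sketch is
# self-contained. Replace with json.load(open("db_info.json")) in practice.
sample = json.loads("""
{
  "volunteer_applications": {
    "columns": [
      {"name": "application_id", "type": "int", "primary_key": true},
      {"name": "user_id", "type": "int", "foreign_key": "users.user_id"},
      {"name": "applied_on", "type": "date"}
    ]
  }
}
""")

for table, info in sample.items():
    # Column names in declared order, plus any foreign-key constraints.
    cols = [c["name"] for c in info["columns"]]
    fks = {c["name"]: c["foreign_key"]
           for c in info["columns"] if "foreign_key" in c}
    print(table, cols, fks)
```

Collecting the foreign keys up front makes the later generation step able to enforce referential integrity table by table.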

Select Data Generation Tool
Evaluate fake-data generation libraries (e.g., Faker), Hugging Face datasets/models, or LLM-based generation

Develop Generation Scripts
Write code to generate CSVs matching your schema
Ensure correct field naming, ordering, and data types
Enforce referential integrity for foreign keys
Output:
The scripts should go in database/mock-data-generation in the data repo
Update README.md to document how to run the scripts and what each file represents
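The referential-integrity requirement can be sketched as: generate the parent table first, then draw child foreign keys only from the parent IDs that actually exist. Column names below are assumptions for illustration, not the confirmed schema.

```python
import csv
import random

random.seed(0)  # reproducible mock data

# Parent table: users. Only these IDs may appear as foreign keys.
user_ids = list(range(1, 101))
with open("users.csv", "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["user_id"])  # assumed primary-key column name
    w.writerows([uid] for uid in user_ids)

# Child table: volunteer_applications. Every user_id is sampled from
# the parent's IDs, so the FK relationship holds by construction.
with open("volunteer_applications.csv", "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["application_id", "user_id"])  # assumed column names
    for app_id in range(1, 101):
        w.writerow([app_id, random.choice(user_ids)])
```

Generating tables in dependency order (parents before children) is the simplest way to guarantee every FK value is valid; the row count of 100 is the configurable default from the issue.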

Store CSVs in database/mock_db (e.g., volunteer_applications.csv, user_skills.csv):
The CSV files created with the mock data should be stored in the database/mock_db folder

Quality Review & Commit
Validate CSV structure and completeness
Commit all scripts and generated files to repository
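The validation step above can be automated with a small check that every foreign-key value in a child CSV exists in its parent CSV. The function and column names are a sketch; wire the real table/column pairs in from the schema.

```python
import csv

def validate_fk(child_csv, fk_col, parent_csv, pk_col):
    """Return (row_count, offending_rows) for a child CSV whose fk_col
    must reference an existing pk_col value in the parent CSV."""
    with open(parent_csv, newline="") as f:
        valid = {row[pk_col] for row in csv.DictReader(f)}
    with open(child_csv, newline="") as f:
        rows = list(csv.DictReader(f))
    bad = [r for r in rows if r[fk_col] not in valid]
    return len(rows), bad
```

Running a check like this per FK pair before committing catches broken references early, and the row count doubles as a completeness check against the ~100-row target.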

Acceptance Criteria:
✅ CSV files exist for all tables in the schema
✅ Each CSV contains ~100 rows of realistic synthetic data, with the generation scripts scalable to at least 40,000 rows later
✅ Field types, formats, and relationships are respected
✅ Documentation (README) with reproduction instructions included
✅ Scripts are properly documented and reusable
