Generate Synthetic CSV Data for "volunteer_applications" and "user_skills" tables from Database Schema #118
Description
Table 1: volunteer_applications
References: user_id → users
Table 2: user_skills
References: user_id → users, cat_id → help_categories
Objective:
Create CSV files containing synthetic (mock) data for the above tables, following the provided schema structure. This is useful for testing, development, and demonstrations without using real/sensitive data.
Key Details
Input:
Database schema structure containing all table names, their respective column names, and data types.
Input file: https://github.com/saayam-for-all/data/tree/main/database/Saayam_Table.column.names_data.xlsx
The same information is available in programmatically extractable form at https://github.com/saayam-for-all/data/tree/main/database/mock-data-generation/db_info.json
Lookup/reference table file path: https://github.com/saayam-for-all/data/tree/main/database/lookup_tables
Output:
One CSV file per table with realistic synthetic data
Data adheres to column data types and constraints (string lengths, date formats, relationships)
Typically ~100 records per table (configurable)
Output file path: https://github.com/saayam-for-all/data/tree/main/database/mock_db/file_name
Data Quality Requirements:
String/Text fields: Plausible names, emails, addresses, etc.
Numeric fields: Reasonable ranges and distributions
Date/Time fields: Valid and relevant dates
Foreign keys: Respect relationships between tables (valid ID references)
Relationships between columns are maintained: e.g., if there are state and city columns, the city values must be consistent with the corresponding state values
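The state/city dependency above can be sketched with a simple lookup that always picks the state first, then a city belonging to that state. The state-to-city table here is illustrative only; a real script would use a fuller dataset or a library such as Faker:

```python
import random

# Illustrative lookup; not taken from the schema or lookup_tables folder.
STATE_CITIES = {
    "CA": ["Los Angeles", "San Diego", "San Jose"],
    "TX": ["Houston", "Austin", "Dallas"],
    "NY": ["New York", "Buffalo", "Rochester"],
}

def fake_location(rng: random.Random) -> dict:
    """Pick a state first, then a city that belongs to that state,
    so the two columns stay mutually consistent."""
    state = rng.choice(list(STATE_CITIES))
    return {"state": state, "city": rng.choice(STATE_CITIES[state])}
```

Seeding the `random.Random` instance makes the generated data reproducible across runs, which simplifies review.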
Implementation Steps:
Analyze Schema
Extract all table names, field names, and data types from the xlsx sheet
Identify constraints (primary keys, foreign keys, unique constraints)
Select Data Generation Tool
Explore fake-data generation libraries (e.g., Faker), Hugging Face datasets, or LLM-based generation
Develop Generation Scripts
Write code to generate CSVs matching your schema
Ensure correct field naming, ordering, and data types
Enforce referential integrity for foreign keys
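The steps above can be sketched as a minimal generation script for the two target tables. The column names (`application_id`, `status`, etc.) are assumptions standing in for the real ones from db_info.json, and `users` rows are stubbed out here only so the foreign keys have something to reference:

```python
import csv
import random

def generate_users(n: int, rng: random.Random) -> list[dict]:
    """Parent table: produces the user_id values child tables reference."""
    return [{"user_id": i + 1, "name": f"User {i + 1}"} for i in range(n)]

def generate_volunteer_applications(users: list[dict], n: int,
                                    rng: random.Random) -> list[dict]:
    """Child rows: every user_id is drawn from existing users,
    so referential integrity holds by construction."""
    return [
        {
            "application_id": i + 1,
            "user_id": rng.choice(users)["user_id"],
            "status": rng.choice(["pending", "approved", "rejected"]),
        }
        for i in range(n)
    ]

def write_csv(path: str, rows: list[dict]) -> None:
    """Write rows to CSV, taking the header from the first row's keys."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)

if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed for reproducible mock data
    users = generate_users(100, rng)
    apps = generate_volunteer_applications(users, 100, rng)
    write_csv("users.csv", users)
    write_csv("volunteer_applications.csv", apps)
```

Generating parents before children (users before volunteer_applications and user_skills) is what makes foreign-key enforcement trivial: child rows only ever sample from IDs that already exist.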
Output:
The scripts should go in the data repository under database/mock-data-generation
Update README.md documenting how to run the scripts and what each file represents
Store the generated CSV files in database/mock_db (e.g., users.csv, orders.csv)
Quality Review & Commit
Validate CSV structure and completeness
Commit all scripts and generated files to repository
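The validation step can be sketched as a small foreign-key check run over the generated rows before committing. Column names here are assumed; adjust them to the real schema:

```python
import csv

def load_csv(path: str) -> list[dict]:
    """Read a generated CSV back as a list of row dicts."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def check_foreign_keys(parent_rows: list[dict], child_rows: list[dict],
                       parent_key: str, child_key: str) -> list[dict]:
    """Return every child row whose foreign key has no matching parent.
    An empty result means referential integrity holds."""
    valid = {row[parent_key] for row in parent_rows}
    return [row for row in child_rows if row[child_key] not in valid]
```

For example, `check_foreign_keys(load_csv("users.csv"), load_csv("volunteer_applications.csv"), "user_id", "user_id")` should return an empty list; the same check applies to user_skills against both users and help_categories.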
Acceptance Criteria:
✅ CSV files exist for all tables in the schema
✅ Each CSV contains ~100 rows of realistic synthetic data, and generation can scale to at least 40,000 rows later
✅ Field types, formats, and relationships are respected
✅ Documentation (README) with reproduction instructions included
✅ Scripts are properly documented and reusable