
Generate Synthetic CSV Data for "fraud_requests" and "notifications" tables from Database Schema #121

@prachi080588

Description


Table 1: fraud_requests
References: user_id → users

Table 2: notifications
References: user_id → users, type_id → notification_types, channel_id → notification_channels

Objective:
Create CSV files containing synthetic (mock) data for the above tables, following the provided schema structure. This is useful for testing, development, and demonstrations without using real/sensitive data.

Key Details
Input:
Database schema structure containing all table names, their respective column names, and data types.

Input file : https://github.com/saayam-for-all/data/tree/main/database/Saayam_Table.column.names_data.xlsx

The same schema is available in programmatically extractable form at https://github.com/saayam-for-all/data/tree/main/database/mock-data-generation/db_info.json

Lookup table/Reference table file path : https://github.com/saayam-for-all/data/tree/main/database/lookup_tables

Output:
One CSV file per table with realistic synthetic data
Adheres to data types and constraints (string lengths, date formats, relationships)
Typically ~100 records per table (configurable)
Output File path : https://github.com/saayam-for-all/data/tree/main/database/mock_db/file_name

Data Quality Requirements:
String/Text fields: Plausible names, emails, addresses, etc.
Numeric fields: Reasonable ranges and distributions
Date/Time fields: Valid and relevant dates
Foreign keys: Respect relationships between tables (valid ID references)
Relationships between columns are maintained: e.g., if there are state and city columns, the city values must be consistent with the corresponding state values
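The last two requirements (valid foreign keys and correlated columns) can be sketched with the standard library alone; the state/city lookup and the user ID pool below are illustrative stand-ins, not the project's real data:

```python
import csv
import random

random.seed(42)  # fixed seed so the output is reproducible

# Hypothetical lookup: city values depend on the state value.
STATE_CITIES = {
    "CA": ["Los Angeles", "San Diego", "San Jose"],
    "TX": ["Houston", "Austin", "Dallas"],
    "NY": ["New York", "Buffalo", "Rochester"],
}

# Stand-in for IDs read from an already-generated users.csv,
# so every user_id we emit is a valid foreign-key reference.
user_ids = list(range(1, 101))

rows = []
for i in range(1, 11):
    state = random.choice(list(STATE_CITIES))
    rows.append({
        "id": i,
        "user_id": random.choice(user_ids),          # valid FK reference
        "state": state,
        "city": random.choice(STATE_CITIES[state]),  # city consistent with state
    })

with open("sample.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "user_id", "state", "city"])
    writer.writeheader()
    writer.writerows(rows)
```

Whatever generation tool is ultimately chosen, it needs the same two ingredients: a pool of existing parent-table IDs for foreign keys, and a lookup that ties dependent columns together.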

Implementation Steps:
Analyze Schema:
Extract all table names, field names, and data types provided in the xlsx sheet
Identify constraints (primary keys, foreign keys, unique constraints)
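Since db_info.json is the programmatically extractable schema, this step can be a small JSON-parsing pass. The exact layout of db_info.json is an assumption here, so the sketch uses an inline stand-in with the same idea (tables mapping to columns, types, and foreign keys):

```python
import json

# Inline stand-in for db_info.json; the real file's structure may differ.
db_info_json = """
{
  "fraud_requests": {
    "columns": {"id": "int", "user_id": "int", "created_at": "timestamp"},
    "foreign_keys": {"user_id": "users.id"}
  },
  "notifications": {
    "columns": {"id": "int", "user_id": "int", "type_id": "int", "channel_id": "int"},
    "foreign_keys": {"user_id": "users.id",
                     "type_id": "notification_types.id",
                     "channel_id": "notification_channels.id"}
  }
}
"""

schema = json.loads(db_info_json)
for table, info in schema.items():
    print(table, "->", list(info["columns"]), "FKs:", info["foreign_keys"])
```

Driving generation from the parsed schema (rather than hard-coding column lists) keeps the scripts reusable when tables are added or changed.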

Select Data Generation Tool:
Explore fake-data-generation libraries (e.g., Faker), Hugging Face datasets/models, or LLM-based generation

Develop Generation Scripts:
Write code to generate CSVs matching your schema
Ensure correct field naming, ordering, and data types
Enforce referential integrity for foreign keys
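The three sub-steps above (CSV output, correct field naming/ordering/types, referential integrity) can be combined into one reusable generator. This is a minimal stdlib sketch; the column names, FK pools, and placeholder value rules are assumptions for illustration:

```python
import csv
import random
from datetime import datetime, timedelta

def generate_table(path, columns, n_rows, fk_pools, seed=0):
    """Write n_rows of synthetic data to path, drawing FK columns from fk_pools."""
    rng = random.Random(seed)  # per-table seed keeps runs reproducible
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=columns)  # preserves column order
        writer.writeheader()
        for i in range(1, n_rows + 1):
            row = {"id": i}
            for col in columns:
                if col == "id":
                    continue
                if col in fk_pools:                      # referential integrity
                    row[col] = rng.choice(fk_pools[col])
                elif col.endswith("_at"):                # plausible dates in 2024
                    row[col] = (datetime(2024, 1, 1)
                                + timedelta(days=rng.randrange(365))).isoformat()
                else:
                    row[col] = f"{col}_{i}"              # placeholder text value
            writer.writerow(row)

# Hypothetical usage: ~100 rows by default, configurable for scale-up later.
generate_table(
    "fraud_requests.csv",
    columns=["id", "user_id", "status", "created_at"],
    n_rows=100,
    fk_pools={"user_id": list(range(1, 101))},  # IDs drawn from the users table
)
```

Making `n_rows` a parameter is what lets the same script later satisfy the 40,000-row scalability criterion without code changes.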
Output:
The scripts should go in database/mock-data-generation
Update README.md documenting how to run the scripts and what each file represents

Store the generated CSV files in database/mock_db (e.g., users.csv, orders.csv)

Quality Review & Commit
Validate CSV structure and completeness
Commit all scripts and generated files to repository

Acceptance Criteria:
✅ CSV files exist for all tables in the schema
✅ Each CSV contains at most 100 rows of realistic synthetic data, with scripts scalable to at least 40,000 rows later
✅ Field types, formats, and relationships are respected
✅ Documentation (README) with reproduction instructions included
✅ Scripts are properly documented and reusable
