Generate Synthetic CSV Data for "fraud_requests" and "notifications" tables from Database Schema #121
Description
Table 1: fraud_requests
References: user_id → users
Table 2: notifications
References: user_id → users, type_id → notification_types, channel_id → notification_channels
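The foreign-key relationships above can be captured in a small mapping that generation scripts consult when filling child columns. The dict layout below is only an illustrative convention, not a format the repo prescribes:

```python
# Hypothetical mapping of child-table foreign keys to their parent tables.
# Table and column names come from the issue; the dict structure itself is
# an assumption chosen for illustration.
SCHEMA_REFS = {
    "fraud_requests": {"user_id": "users"},
    "notifications": {
        "user_id": "users",
        "type_id": "notification_types",
        "channel_id": "notification_channels",
    },
}

for table, refs in SCHEMA_REFS.items():
    for column, parent in refs.items():
        print(f"{table}.{column} -> {parent}")
```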
Objective:
Create CSV files containing synthetic (mock) data for the above tables, following the provided schema structure. This is useful for testing, development, and demonstrations without using real/sensitive data.
Key Details
Input:
Database schema structure listing all table names, their respective column names, and data types.
Input file: https://github.com/saayam-for-all/data/tree/main/database/Saayam_Table.column.names_data.xlsx
The same schema is available in a programmatically extractable form at https://github.com/saayam-for-all/data/tree/main/database/mock-data-generation/db_info.json
Lookup table/Reference table file path : https://github.com/saayam-for-all/data/tree/main/database/lookup_tables
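Reading the schema from db_info.json could look like the sketch below. The JSON layout shown (a table name mapped to a list of column/type pairs) is an assumption; check the actual structure of db_info.json in the repo and adapt the keys accordingly:

```python
import json

# Assumed layout: table name -> list of {"column": ..., "type": ...}.
# This is a stand-in sample; load the real db_info.json from the repo
# and adjust the key names if its structure differs.
sample = """
{
  "fraud_requests": [
    {"column": "id", "type": "integer"},
    {"column": "user_id", "type": "integer"},
    {"column": "created_at", "type": "timestamp"}
  ]
}
"""

schema = json.loads(sample)
columns = [c["column"] for c in schema["fraud_requests"]]
print(columns)
```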
Output:
One CSV file per table with realistic synthetic data
Adheres to data types and constraints (string lengths, date formats, relationships)
Typically ~100 records per table (configurable)
Output File path : https://github.com/saayam-for-all/data/tree/main/database/mock_db/file_name
Data Quality Requirements:
String/Text fields: Plausible names, emails, addresses, etc.
Numeric fields: Reasonable ranges and distributions
Date/Time fields: Valid and relevant dates
Foreign keys: Respect relationships between tables (valid ID references)
Relationships between columns are maintained: e.g., if there are state and city columns, the city values must be consistent with the state values
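The state/city dependency above can be kept consistent by picking the parent value first and deriving the dependent one from it. The state/city lists below are illustrative samples only, not project data:

```python
import random

# Illustrative lookup: each state maps to cities that actually belong to it.
# Picking the state first, then sampling from its city list, guarantees
# the two columns never contradict each other.
CITIES_BY_STATE = {
    "CA": ["Los Angeles", "San Francisco", "San Diego"],
    "TX": ["Houston", "Austin", "Dallas"],
    "NY": ["New York", "Buffalo", "Albany"],
}

def fake_location(rng):
    state = rng.choice(list(CITIES_BY_STATE))
    city = rng.choice(CITIES_BY_STATE[state])
    return state, city

rng = random.Random(42)  # seeded for reproducible mock data
rows = [fake_location(rng) for _ in range(5)]
print(rows)
```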
Implementation Steps:
Analyze Schema:
Extract all table names, field names, and data types provided in the xls sheet
Identify constraints (primary keys, foreign keys, unique constraints)
Select Data Generation Tool:
Explore different fake-data libraries (e.g., Faker), Hugging Face datasets, or LLM-based generation
Develop Generation Scripts:
Write code to generate CSVs matching your schema
Ensure correct field naming, ordering, and data types
Enforce referential integrity for foreign keys
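One way to enforce referential integrity is to generate parent tables first and sample each child foreign key only from the ids actually emitted. The file names and column sets below are simplified placeholders, not the full project schema:

```python
import csv
import random

rng = random.Random(0)  # seeded so reruns produce the same files

# Generate the parent table first and keep its ids in memory.
user_ids = list(range(1, 101))
with open("users.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["user_id", "name"])
    for uid in user_ids:
        writer.writerow([uid, f"user_{uid}"])

# Child table: every user_id is drawn from the generated parent ids,
# so the foreign-key relationship holds by construction.
with open("fraud_requests.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["request_id", "user_id", "status"])
    for rid in range(1, 101):
        writer.writerow([rid, rng.choice(user_ids), rng.choice(["open", "closed"])])
```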
Output:
The scripts should go in data/tree/main/database/mock-data-generation
Update README.md to document how to run the scripts and what each file represents
Store the generated CSV files in the database/mock_db folder (e.g., users.csv, orders.csv)
Quality Review & Commit
Validate CSV structure and completeness
Commit all scripts and generated files to repository
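A minimal structural check before committing could look like the sketch below: confirm the header contains the required columns and that no required cell is empty. The `validate_csv` helper and its `required` parameter are hypothetical names introduced here for illustration:

```python
import csv
import io

# Hypothetical pre-commit check: verify the header and that every
# required cell is non-empty. `required` is whatever subset of columns
# must be populated for that table.
def validate_csv(text, required):
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    missing = [c for c in required if c not in header]
    if missing:
        return False
    return all(all(row[c] for c in required) for row in reader)

sample = "user_id,email\n1,a@example.com\n2,b@example.com\n"
ok = validate_csv(sample, ["user_id", "email"])
print(ok)
```

In a real run, the same function would be applied to each file in database/mock_db before the commit step.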
Acceptance Criteria:
✅ CSV files exist for all tables in the schema
✅ Each CSV contains up to 100 rows of realistic synthetic data, generated by scripts that can later scale to at least 40,000 rows
✅ Field types, formats, and relationships are respected
✅ Documentation (README) with reproduction instructions included
✅ Scripts are properly documented and reusable