Skip to content

Enhance synthetic data generation and fix schema inconsistencies#131

Open
Prasadkurapati wants to merge 6 commits intomainfrom
prasad-script-improvement
Open

Enhance synthetic data generation and fix schema inconsistencies#131
Prasadkurapati wants to merge 6 commits intomainfrom
prasad-script-improvement

Conversation

@Prasadkurapati
Copy link
Copy Markdown

Summary

Fixed all data consistency issues raised in code review. The script now generates realistic, scalable mock data with proper foreign key relationships.

Changes Made

1. Country-State Consistency

  • Before: Random integers 1-5 for states, no relationship to countries
  • After: String state codes (US-NY, CA-ON, UK-SCT) mapped to specific countries
  • Result: Afghanistan can no longer be assigned NY state code

2. Scalability

  • Before: Script edited existing CSV files (hard to regenerate)
  • After: Fresh generation each run, change NUM_ROWS = 10000 and execute
  • Result: Generated 10,000 rows successfully in one run

3. Realistic Request Content

  • Before: fake.sentence() produced random gibberish
  • After: Contextual templates by category (ELDERLY_CARE, CHILDCARE, HOME_REPAIR, etc.)
  • Result: "Need assistance with grocery shopping" instead of "blue monkey dances"

4. Foreign Key Integrity

  • Before: Random integers, needed fix_foreign_keys.py workaround
  • After: UUIDs selected from generated pools, guaranteed valid references
  • Result: No orphaned records, no secondary fix script needed

Testing

  • Generated 10,000 rows per table
  • Verified state codes match countries (US-TX only in US, CA-ON only in Canada)
  • Verified realistic request subjects/descriptions
  • All foreign keys reference valid existing IDs

Files Changed

  • database/mock-data-generation/generate_mock_data.py (complete rewrite)

Removed

  • database/mock-data-generation/fix_foreign_keys.py (no longer needed)

…o 10k rows

- Fix state codes: strings (US-NY, CA-ON) instead of integers
- Add proper country-state mapping (no more Afghanistan with NY)
- Replace random words with realistic request templates
- Generate fresh data instead of editing existing CSVs
- Valid foreign keys using UUID pools
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Updates the repository’s synthetic/mock data story by rewriting the Python generator and committing new CSV seed data intended to be more realistic and consistent with foreign keys and geography.

Changes:

  • Rewrites generate_mock_data.py to generate larger datasets with “realistic” request text and UUID-based IDs.
  • Adds/updates committed database/mock_db/*.csv seed files.
  • Adds a fix_foreign_keys.py utility and updates database/README.MD run instructions.

Reviewed changes

Copilot reviewed 7 out of 8 changed files in this pull request and generated 26 comments.

Show a summary per file
File Description
database/mock-data-generation/generate_mock_data.py New end-to-end generator producing users/requests/volunteers/assignments/comments.
database/mock-data-generation/fix_foreign_keys.py Utility script intended to rewrite FK columns post-generation.
database/mock_db/users.csv New/updated committed mock users dataset.
database/mock_db/request.csv New/updated committed mock requests dataset.
database/mock_db/request_comments.csv New/updated committed mock request comments dataset.
database/mock_db/volunteer_details.csv New/updated committed mock volunteer details dataset.
database/README.MD Adds basic instructions for running the generator.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +11 to +13
# CONFIG
NUM_ROWS = 10000
OUTPUT_DIR = "../mock_db"
Copy link

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other generators in database/mock-data-generation/ use shared helpers like utils.set_seed()/format_ts()/write_csv() (e.g., volunteer_applications.py:5-13) to keep output deterministic and consistently formatted. This script doesn’t seed random/Faker and bypasses those utilities, so outputs and formatting will vary run-to-run; consider reintroducing seeding (and/or reusing utils) if reproducibility is required.

Copilot uses AI. Check for mistakes.
Comment on lines +228 to +233
REQUEST_TYPES = {
1: "One-time",
2: "Recurring",
3: "Emergency"
}

Copy link

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

REQUEST_TYPES is hard-coded as 1-3 with labels that may not match the repo’s lookup table (database/lookup_tables/request_type.csv). To keep generated data loadable, consider reading valid req_type_id values from the lookup CSV instead of defining a separate enum here.

Suggested change
REQUEST_TYPES = {
1: "One-time",
2: "Recurring",
3: "Emergency"
}
def load_request_types():
"""
Load request types from the lookup CSV to keep mock data aligned with the database.
Falls back to the previous hard-coded mapping if the CSV is missing or malformed.
"""
csv_path = os.path.join(os.path.dirname(__file__), "..", "lookup_tables", "request_type.csv")
try:
df = pd.read_csv(csv_path)
# Expect at least two columns: id and label. Use the first two columns generically.
if df.empty or df.shape[1] < 2:
raise ValueError("request_type.csv does not have the expected structure")
id_col = df.columns[0]
label_col = df.columns[1]
return dict(zip(df[id_col].astype(int), df[label_col].astype(str)))
except Exception:
# Fallback to the original hard-coded mapping to avoid breaking existing behavior.
return {
1: "One-time",
2: "Recurring",
3: "Emergency",
}
REQUEST_TYPES = load_request_types()

Copilot uses AI. Check for mistakes.
Comment on lines +8 to +14
2. Run script:
python generate_mock_data.py

Output:

- users.csv
- request.csv
Copy link

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The run command python generate_mock_data.py is ambiguous because the script lives under database/mock-data-generation/. Please clarify the correct invocation (e.g., run from repo root with the proper path) and optionally note that the output directory is database/mock_db/.

Suggested change
2. Run script:
python generate_mock_data.py
Output:
- users.csv
- request.csv
2. Run script from the repo root:
python database/mock-data-generation/generate_mock_data.py
Output (written to database/mock_db/):
- database/mock_db/users.csv
- database/mock_db/request.csv

Copilot uses AI. Check for mistakes.
Comment on lines +2 to +3
1,17,Level alone lot respond what answer.,Continue garden decade strong soon. See new affect interview.,1,2,2
2,81,Personal fill poor from be.,Politics grow continue sister. True place no performance group call military.,1,3,3
Copy link

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR description mentions contextual request subjects/descriptions by category, but the committed req_title/req_description values here still look like generic Faker text (and req_cat_id is numeric). If these CSVs are intended as the new generated output, please regenerate them using the new templates/categories so the sample data matches the PR’s stated goal.

Copilot uses AI. Check for mistakes.
Comment on lines +16 to +17
comments_df['req_id'] = comments_df['req_id'].apply(lambda x: random.choice(request_df['req_id']))
comments_df['commenter_id'] = comments_df['commenter_id'].apply(lambda x: random.choice(users_df['user_id']))
Copy link

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This script updates comments_df['req_id'] and comments_df['commenter_id'], but the committed database/mock_db/request_comments.csv header is comment_id,request_id,user_id,comment (no req_id/commenter_id). This will also raise KeyError unless the CSV schema is updated to match.

Suggested change
comments_df['req_id'] = comments_df['req_id'].apply(lambda x: random.choice(request_df['req_id']))
comments_df['commenter_id'] = comments_df['commenter_id'].apply(lambda x: random.choice(users_df['user_id']))
comments_df['request_id'] = comments_df['request_id'].apply(lambda x: random.choice(request_df['req_id']))
comments_df['user_id'] = comments_df['user_id'].apply(lambda x: random.choice(users_df['user_id']))

Copilot uses AI. Check for mistakes.
# Consistent country-state relationship
country_id = random.choice(list(COUNTRIES.keys()))
state_id = get_state_for_country(country_id)
state_info = STATES.get(state_id, {})
Copy link

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

state_info = STATES.get(state_id, {}) is assigned but never used. Either use it (e.g., to derive state_code/state_name columns) or remove the variable to reduce clutter.

Suggested change
state_info = STATES.get(state_id, {})

Copilot uses AI. Check for mistakes.
@@ -0,0 +1,101 @@
comment_id,request_id,user_id,comment
Copy link

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This request_comments.csv uses columns request_id,user_id,comment and numeric IDs, but the generator writes req_id,commenter_id,comment_desc,... with UUID IDs. As-is, the committed CSVs can’t be produced by (or consumed consistently with) the new generator; please make the column names/ID types consistent across the generator and these seed files.

Suggested change
comment_id,request_id,user_id,comment
comment_id,req_id,commenter_id,comment_desc

Copilot uses AI. Check for mistakes.
Comment on lines +1 to +27
request_id,req_user_id,req_title,req_description,req_cat_id,req_priority_id,req_status_id
1,17,Level alone lot respond what answer.,Continue garden decade strong soon. See new affect interview.,1,2,2
2,81,Personal fill poor from be.,Politics grow continue sister. True place no performance group call military.,1,3,3
3,98,Him data market.,Produce join contain as. Civil meet off movement power. Various nice choose determine concern most.,5,1,2
4,8,Especially stuff call along term.,Reduce onto source community. Against act mention song.,4,2,2
5,80,Relate impact card news.,I town increase rise arrive. Tax relationship again make thus.,3,3,1
6,83,Specific as leave something.,Throw need old later center hand. Fill administration decade build.,2,3,2
7,98,Blue community then simple.,Stop simple shake organization summer throw. Reflect program open final under loss.,3,3,1
8,84,Everything be similar least century.,Develop sometimes collection term true left than. Best rise recognize quite stay develop.,1,3,1
9,93,Rise hope baby.,Word big whole phone impact teacher. According open city sit within likely.,1,3,1
10,4,School person note thing system candidate word.,Yeah upon life generation international about. Media film season week budget off seat.,1,3,2
11,82,Couple growth vote.,Fine weight peace end just. Player well third door red media.,5,1,2
12,91,Be read only.,Letter care push assume simple. Fly evening herself stage if.,1,2,2
13,80,North result worry affect police a.,Without grow upon exist picture reality. Through democratic well growth. Value part near ready all.,3,3,2
14,42,Factor last perhaps hot.,If quickly agree edge. Until some manage none year administration many.,2,1,2
15,41,First dog suddenly business.,Then finish specific probably. Partner have its month.,1,3,2
16,82,Television true respond ever.,Kitchen word mind compare avoid performance by. Rest reason kid.,1,2,2
17,32,Prove arrive with.,Word campaign person from leg growth. After executive each me top.,5,1,3
18,83,Somebody sea idea friend sea serve.,"Pay view whole go. Not a fill everyone.
Buy great despite price month base miss.",5,2,3
19,77,Major fall although television some.,Share forward realize design chance accept these. Room me add debate many.,2,1,2
20,35,Kind stock then know weight.,Above age would cultural economic your rise production.,1,2,2
21,86,Service without professor talk deep.,Fight here real various. Child heart simple network different.,5,3,3
22,20,Sing manage purpose eye.,"Approach wait ball fire. Board note season test step.
Play beautiful realize figure old.",3,3,1
23,45,Day finish hour memory.,Bring simply fine go first red activity.,3,1,3
24,50,Dog ask go board.,Mention manager than fall. Provide this adult term. Rich site few have minute pass.,3,1,3
Copy link

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

request.csv here uses request_id,req_title,req_description,... with numeric IDs and category/priority/status as integers, which doesn’t match the generator output (req_id UUIDs plus different column names like req_subj/req_desc, and req_cat_id as strings such as ELDERLY_CARE). Please align the committed seed CSV schema with the generator (or vice versa) so foreign keys and imports remain consistent.

Suggested change
request_id,req_user_id,req_title,req_description,req_cat_id,req_priority_id,req_status_id
1,17,Level alone lot respond what answer.,Continue garden decade strong soon. See new affect interview.,1,2,2
2,81,Personal fill poor from be.,Politics grow continue sister. True place no performance group call military.,1,3,3
3,98,Him data market.,Produce join contain as. Civil meet off movement power. Various nice choose determine concern most.,5,1,2
4,8,Especially stuff call along term.,Reduce onto source community. Against act mention song.,4,2,2
5,80,Relate impact card news.,I town increase rise arrive. Tax relationship again make thus.,3,3,1
6,83,Specific as leave something.,Throw need old later center hand. Fill administration decade build.,2,3,2
7,98,Blue community then simple.,Stop simple shake organization summer throw. Reflect program open final under loss.,3,3,1
8,84,Everything be similar least century.,Develop sometimes collection term true left than. Best rise recognize quite stay develop.,1,3,1
9,93,Rise hope baby.,Word big whole phone impact teacher. According open city sit within likely.,1,3,1
10,4,School person note thing system candidate word.,Yeah upon life generation international about. Media film season week budget off seat.,1,3,2
11,82,Couple growth vote.,Fine weight peace end just. Player well third door red media.,5,1,2
12,91,Be read only.,Letter care push assume simple. Fly evening herself stage if.,1,2,2
13,80,North result worry affect police a.,Without grow upon exist picture reality. Through democratic well growth. Value part near ready all.,3,3,2
14,42,Factor last perhaps hot.,If quickly agree edge. Until some manage none year administration many.,2,1,2
15,41,First dog suddenly business.,Then finish specific probably. Partner have its month.,1,3,2
16,82,Television true respond ever.,Kitchen word mind compare avoid performance by. Rest reason kid.,1,2,2
17,32,Prove arrive with.,Word campaign person from leg growth. After executive each me top.,5,1,3
18,83,Somebody sea idea friend sea serve.,"Pay view whole go. Not a fill everyone.
Buy great despite price month base miss.",5,2,3
19,77,Major fall although television some.,Share forward realize design chance accept these. Room me add debate many.,2,1,2
20,35,Kind stock then know weight.,Above age would cultural economic your rise production.,1,2,2
21,86,Service without professor talk deep.,Fight here real various. Child heart simple network different.,5,3,3
22,20,Sing manage purpose eye.,"Approach wait ball fire. Board note season test step.
Play beautiful realize figure old.",3,3,1
23,45,Day finish hour memory.,Bring simply fine go first red activity.,3,1,3
24,50,Dog ask go board.,Mention manager than fall. Provide this adult term. Rich site few have minute pass.,3,1,3
req_id,req_user_id,req_subj,req_desc,req_cat_id,req_priority_id,req_status_id
00000000-0000-0000-0000-000000000001,17,Level alone lot respond what answer.,Continue garden decade strong soon. See new affect interview.,ELDERLY_CARE,2,2
00000000-0000-0000-0000-000000000002,81,Personal fill poor from be.,Politics grow continue sister. True place no performance group call military.,ELDERLY_CARE,3,3
00000000-0000-0000-0000-000000000003,98,Him data market.,Produce join contain as. Civil meet off movement power. Various nice choose determine concern most.,OTHER,1,2
00000000-0000-0000-0000-000000000004,8,Especially stuff call along term.,Reduce onto source community. Against act mention song.,PET_CARE,2,2
00000000-0000-0000-0000-000000000005,80,Relate impact card news.,I town increase rise arrive. Tax relationship again make thus.,HOUSEKEEPING,3,1
00000000-0000-0000-0000-000000000006,83,Specific as leave something.,Throw need old later center hand. Fill administration decade build.,CHILDCARE,3,2
00000000-0000-0000-0000-000000000007,98,Blue community then simple.,Stop simple shake organization summer throw. Reflect program open final under loss.,HOUSEKEEPING,3,1
00000000-0000-0000-0000-000000000008,84,Everything be similar least century.,Develop sometimes collection term true left than. Best rise recognize quite stay develop.,ELDERLY_CARE,3,1
00000000-0000-0000-0000-000000000009,93,Rise hope baby.,Word big whole phone impact teacher. According open city sit within likely.,ELDERLY_CARE,3,1
00000000-0000-0000-0000-000000000010,4,School person note thing system candidate word.,Yeah upon life generation international about. Media film season week budget off seat.,ELDERLY_CARE,3,2
00000000-0000-0000-0000-000000000011,82,Couple growth vote.,Fine weight peace end just. Player well third door red media.,OTHER,1,2
00000000-0000-0000-0000-000000000012,91,Be read only.,Letter care push assume simple. Fly evening herself stage if.,ELDERLY_CARE,2,2
00000000-0000-0000-0000-000000000013,80,North result worry affect police a.,Without grow upon exist picture reality. Through democratic well growth. Value part near ready all.,HOUSEKEEPING,3,2
00000000-0000-0000-0000-000000000014,42,Factor last perhaps hot.,If quickly agree edge. Until some manage none year administration many.,CHILDCARE,1,2
00000000-0000-0000-0000-000000000015,41,First dog suddenly business.,Then finish specific probably. Partner have its month.,ELDERLY_CARE,3,2
00000000-0000-0000-0000-000000000016,82,Television true respond ever.,Kitchen word mind compare avoid performance by. Rest reason kid.,ELDERLY_CARE,2,2
00000000-0000-0000-0000-000000000017,32,Prove arrive with.,Word campaign person from leg growth. After executive each me top.,OTHER,1,3
00000000-0000-0000-0000-000000000018,83,Somebody sea idea friend sea serve.,"Pay view whole go. Not a fill everyone.
Buy great despite price month base miss.",OTHER,2,3
00000000-0000-0000-0000-000000000019,77,Major fall although television some.,Share forward realize design chance accept these. Room me add debate many.,CHILDCARE,1,2
00000000-0000-0000-0000-000000000020,35,Kind stock then know weight.,Above age would cultural economic your rise production.,ELDERLY_CARE,2,2
00000000-0000-0000-0000-000000000021,86,Service without professor talk deep.,Fight here real various. Child heart simple network different.,OTHER,3,3
00000000-0000-0000-0000-000000000022,20,Sing manage purpose eye.,"Approach wait ball fire. Board note season test step.
Play beautiful realize figure old.",HOUSEKEEPING,3,1
00000000-0000-0000-0000-000000000023,45,Day finish hour memory.,Bring simply fine go first red activity.,HOUSEKEEPING,1,3
00000000-0000-0000-0000-000000000024,50,Dog ask go board.,Mention manager than fall. Provide this adult term. Rich site few have minute pass.,HOUSEKEEPING,1,3

Copilot uses AI. Check for mistakes.
OUTPUT_DIR = "output_csv_files"
# CONFIG
NUM_ROWS = 10000
OUTPUT_DIR = "../mock_db"
Copy link

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OUTPUT_DIR = "../mock_db" is relative to the current working directory, not the script location, so running the script from a different directory can write files to an unexpected path. Consider resolving the output directory relative to __file__ (e.g., using pathlib.Path(__file__).parent) or accept an explicit CLI arg/env var for the output path.

Suggested change
OUTPUT_DIR = "../mock_db"
OUTPUT_DIR = os.path.join(os.path.dirname(__file__), "..", "mock_db")

Copilot uses AI. Check for mistakes.
Comment on lines +19 to +20
18,83,Somebody sea idea friend sea serve.,"Pay view whole go. Not a fill everyone.
Buy great despite price month base miss.",5,2,3
Copy link

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Row 18’s req_description contains an embedded newline inside a quoted CSV field. While valid CSV, it tends to break simplistic line-based import tooling. If these files are intended for easy loading (e.g., via shell/ETL scripts), consider ensuring generated descriptions don’t include newlines (or confirm all loaders handle RFC4180 multiline fields).

Suggested change
18,83,Somebody sea idea friend sea serve.,"Pay view whole go. Not a fill everyone.
Buy great despite price month base miss.",5,2,3
18,83,Somebody sea idea friend sea serve.,"Pay view whole go. Not a fill everyone. Buy great despite price month base miss.",5,2,3

Copilot uses AI. Check for mistakes.
@Prasadkurapati
Copy link
Copy Markdown
Author

Hi @saquibb8 , this PR is ready for your review. I accidentally triggered Copilot AI review, but please ignore those comments - they are automated suggestions. I've addressed your original feedback about country-state consistency and realistic request content. Thanks!"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants