Enhance synthetic data generation and fix schema inconsistencies #131
base: main
Changes from all commits
6cc20e6
7f18820
ac4ae42
96dd103
ecbd4af
af17360
| Diff line number | Diff line change |
|---|---|
| | @@ -1 +1,14 @@ |
| 1 | This script generates synthetic data for users and request tables. |
| 2 | |
| 3 | How to run: |
| 4 | |
| 5 | 1. Install dependencies: |
| 6 | pip install faker pandas |
| 7 | |
| 8 | 2. Run script: |
| 9 | python generate_mock_data.py |
| 10 | |
| 11 | Output: |
| 12 | |
| 13 | - users.csv |
| 14 | - request.csv |

Comment on lines +8 to +14

Suggested change:

| | Diff line change |
|---|---|
| - | 2. Run script: |
| - | python generate_mock_data.py |
| - | Output: |
| - | - users.csv |
| - | - request.csv |
| + | 2. Run script from the repo root: |
| + | python database/mock-data-generation/generate_mock_data.py |
| + | Output (written to database/mock_db/): |
| + | - database/mock_db/users.csv |
| + | - database/mock_db/request.csv |
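
The suggested change ties the command to the repo root. Another option is to resolve the output directory relative to the script file, so the CSVs land in the same place whatever the working directory is. A minimal sketch, assuming the layout implied by the paths in this PR (`database/mock-data-generation/` sitting next to `database/mock_db/`); `resolve_output_dir` and `csv_path` are hypothetical helper names, not part of the PR:

```python
from pathlib import Path


def resolve_output_dir(script_file: str, dirname: str = "mock_db") -> Path:
    """Return the output directory that sits next to the script's own folder.

    Assumes the script lives in database/mock-data-generation/ and the CSVs
    belong in database/mock_db/ (layout inferred from this PR, not confirmed).
    """
    script_dir = Path(script_file).resolve().parent
    return script_dir.parent / dirname


def csv_path(script_file: str, filename: str) -> Path:
    """Build the full path of one output CSV without touching the filesystem."""
    return resolve_output_dir(script_file) / filename
```

Inside the generator this could be used as `users_df.to_csv(csv_path(__file__, "users.csv"), index=False)`, which would let the README give a single command that works from any directory.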
| Diff line number | Diff line change |
|---|---|
| | @@ -0,0 +1,29 @@ |
| 1 | import pandas as pd |
| 2 | import random |
| 3 | |
| 4 | # Load generated CSVs |

Comment on lines +1 to +4

| Diff line number | Diff line change |
|---|---|
| 5 | users_df = pd.read_csv("../mock_db/users.csv") |
| 6 | request_df = pd.read_csv("../mock_db/request.csv") |
| 7 | comments_df = pd.read_csv("../mock_db/request_comments.csv") |

Comment on lines +5 to +7

| Diff line number | Diff line change |
|---|---|
| 8 | volunteers_df = pd.read_csv("../mock_db/volunteer_details.csv") |
| 9 | assigned_df = pd.read_csv("../mock_db/volunteers_assigned.csv") |
| 10 | |
| 11 | # Fix request table |
| 12 | request_df['req_user_id'] = request_df['req_user_id'].apply(lambda x: random.choice(users_df['user_id'])) |
| 13 | request_df.to_csv("../mock_db/request.csv", index=False) |
| 14 | |
| 15 | # Fix comments table |
| 16 | comments_df['req_id'] = comments_df['req_id'].apply(lambda x: random.choice(request_df['req_id'])) |
| 17 | comments_df['commenter_id'] = comments_df['commenter_id'].apply(lambda x: random.choice(users_df['user_id'])) |

Comment on lines +16 to +17

Suggested change:

| | Diff line change |
|---|---|
| - | comments_df['req_id'] = comments_df['req_id'].apply(lambda x: random.choice(request_df['req_id'])) |
| - | comments_df['commenter_id'] = comments_df['commenter_id'].apply(lambda x: random.choice(users_df['user_id'])) |
| + | comments_df['request_id'] = comments_df['request_id'].apply(lambda x: random.choice(request_df['req_id'])) |
| + | comments_df['user_id'] = comments_df['user_id'].apply(lambda x: random.choice(users_df['user_id'])) |
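
Both versions of these lines reassign foreign keys one row at a time by calling `random.choice` on a pandas Series. That works after `read_csv` because the Series has a default 0..n-1 index, but it is neither reproducible nor checked. A minimal sketch of a vectorized alternative with a referential-integrity check, assuming the column names from the suggested change; `reassign_foreign_key` and `orphaned_rows` are hypothetical helpers, the toy frames stand in for the real CSVs, and the seed is arbitrary:

```python
import numpy as np
import pandas as pd


def reassign_foreign_key(child, fk_col, parent, pk_col, seed=None):
    """Return a copy of `child` whose `fk_col` holds values drawn from parent[pk_col]."""
    rng = np.random.default_rng(seed)
    result = child.copy()
    # Sample from the raw values so the parent's index labels do not matter.
    result[fk_col] = rng.choice(parent[pk_col].to_numpy(), size=len(child))
    return result


def orphaned_rows(child, fk_col, parent, pk_col):
    """Return rows of `child` whose foreign key has no match in the parent table."""
    return child[~child[fk_col].isin(parent[pk_col])]


# Toy data standing in for users.csv and request_comments.csv.
users_df = pd.DataFrame({"user_id": [101, 102, 103]}, index=[7, 8, 9])
comments_df = pd.DataFrame({"comment_id": [1, 2, 3, 4], "user_id": [0, 0, 0, 0]})

comments_df = reassign_foreign_key(comments_df, "user_id", users_df, "user_id", seed=42)

# Every comment should now point at an existing user.
assert orphaned_rows(comments_df, "user_id", users_df, "user_id").empty
```

Passing a fixed seed makes reruns produce the same assignments, and running `orphaned_rows` before writing each CSV would catch a mismatched column name such as `req_id` versus `request_id` early.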
This README says the script generates data only for the users/request tables, but `generate_mock_data.py` writes additional outputs (volunteer_details, volunteers_assigned, request_comments). Please update the description/output list so it reflects what the script actually generates.