Sandboxed Verification Environment #39
Description
Role: Act as a Senior Data Engineer and QA Automation Specialist specializing in data privacy and ETL (Extract, Transform, Load) pipelines.
Context: I am evaluating the OpenDataMask library (https://github.com/MaximumTrainer/OpenDataMask) for a production environment. I need a robust verification suite to prove the tool's effectiveness in masking sensitive data while maintaining referential integrity.
Task: Create a complete, containerized (Docker-based) verification test. The solution must include:
Infrastructure Setup:
Provide a docker-compose.yml file spinning up two separate databases (e.g., PostgreSQL): SOURCE_DB and TARGET_DB.
Initialize SOURCE_DB with a users table containing at least 50 records. Fields should include: id (PK, UUID), full_name, email, phone_number, date_of_birth, and salary.
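A minimal sketch of what that compose file could look like. Service names, image tags, ports, and credentials here are placeholder values, not OpenDataMask requirements:

```yaml
# Two isolated Postgres instances; credentials and host ports are placeholders.
version: "3.8"
services:
  source_db:
    image: postgres:16
    environment:
      POSTGRES_DB: source_db
      POSTGRES_USER: admin
      POSTGRES_PASSWORD: admin
    ports:
      - "5432:5432"
    volumes:
      # Seed script creating the 50-row users table runs on first start.
      - ./init/source_init.sql:/docker-entrypoint-initdb.d/01_init.sql
  target_db:
    image: postgres:16
    environment:
      POSTGRES_DB: target_db
      POSTGRES_USER: admin
      POSTGRES_PASSWORD: admin
    ports:
      - "5433:5432"
```

The mounted `source_init.sql` could use PostgreSQL's `generate_series(1, 50)` together with `gen_random_uuid()` to produce the 50 seed records.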
OpenDataMask Configuration:
Generate the necessary configuration files (JSON or YAML) for OpenDataMask to connect to SOURCE_DB and pipe data to TARGET_DB.
Define Translation Rules that ensure data is human-readable but fake. (e.g., Replace full_name with a random realistic name, scramble email while keeping the domain, and shift date_of_birth by a random number of days).
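To make the intent of these rules concrete, here is a stdlib-only Python sketch of the three transforms. These are illustrative stand-ins, not OpenDataMask's API; the function names and the small name pool are hypothetical:

```python
import random
import string
from datetime import date, timedelta

# Illustrative name pool; a real run would use a larger substitution list.
FIRST = ["Alice", "Brian", "Carla", "Deepak", "Elena"]
LAST = ["Nguyen", "Okafor", "Petrov", "Quinn", "Silva"]

def mask_full_name(rng: random.Random) -> str:
    """Replace the name with a random but realistic one."""
    return f"{rng.choice(FIRST)} {rng.choice(LAST)}"

def mask_email(email: str, rng: random.Random) -> str:
    """Scramble the local part but keep the domain intact."""
    local, _, domain = email.partition("@")
    scrambled = "".join(rng.choices(string.ascii_lowercase, k=max(len(local), 5)))
    return f"{scrambled}@{domain}"

def mask_dob(dob: date, rng: random.Random) -> date:
    """Shift the date of birth by up to +/-180 days."""
    return dob + timedelta(days=rng.randint(-180, 180))

rng = random.Random(42)
print(mask_email("jane.doe@example.com", rng))  # domain is preserved
```

Note that the email rule keeps the domain, so aggregate queries that group by domain still behave realistically in the target database.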
Execution Logic:
Provide a shell script or Python wrapper that triggers the OpenDataMask execution to perform the extract-mask-load process.
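A thin Python wrapper for that trigger step might look like the following. The `opendatamask` command and its `--config` flag are placeholders, since the actual entry point must come from the OpenDataMask documentation:

```python
import shutil
import subprocess
import sys

# Hypothetical invocation; substitute the real OpenDataMask entry point
# and arguments from the project's own docs.
MASK_CMD = ["opendatamask", "--config", "mask_config.yml"]

def run_masking(cmd=MASK_CMD) -> int:
    """Run the extract-mask-load step, failing fast if the tool is absent."""
    if shutil.which(cmd[0]) is None:
        print(f"error: {cmd[0]} not found on PATH", file=sys.stderr)
        return 127
    result = subprocess.run(cmd, capture_output=True, text=True)
    print(result.stdout)
    if result.returncode != 0:
        print(result.stderr, file=sys.stderr)
    return result.returncode
```

Returning the tool's exit code lets a CI pipeline fail the whole job when the masking run itself fails, before any verification is attempted.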
Automated Verification Script:
Create a Python validation script that connects to both databases post-execution and performs the following checks:
Record Integrity: Confirm the count of records in SOURCE_DB matches TARGET_DB.
Key Persistence: Verify that for every id in the source, the exact same id exists in the target.
Masking Effectiveness: Compare specific fields (name, email) for each id. The test passes only if source.id == target.id AND source.email != target.email.
Human Readability Check: Log a sample of 5 records to the console to visually confirm the masked data looks realistic (e.g., no random strings like "asdfghjkl").
Reporting: The script should output a "Verification Report" summarizing the pass/fail status of the Integrity and Masking checks.
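The check logic itself is independent of the database driver. Assuming rows have already been fetched (e.g. via psycopg2) into dicts keyed by id, the four checks and the report could be sketched as:

```python
# Driver-agnostic sketch: `source` and `target` map id -> row dict; the
# field names follow the users schema defined in the prompt above.

def verify(source: dict, target: dict, sample: int = 5) -> dict:
    report = {
        "record_count_match": len(source) == len(target),   # Record Integrity
        "key_persistence": set(source) == set(target),      # Key Persistence
    }
    # Masking Effectiveness: same id, but name and email must differ.
    report["masking_effective"] = all(
        uid in target
        and source[uid]["email"] != target[uid]["email"]
        and source[uid]["full_name"] != target[uid]["full_name"]
        for uid in source
    )
    # Human Readability Check: print a few masked rows for eyeballing.
    for uid in list(target)[:sample]:
        print(uid, target[uid]["full_name"], target[uid]["email"])
    report["passed"] = all(report.values())
    return report

src = {"u1": {"full_name": "Jane Doe", "email": "jane@example.com"}}
tgt = {"u1": {"full_name": "Carla Silva", "email": "qwzkp@example.com"}}
print(verify(src, tgt))
```

The readability check stays a manual, visual step by design: a human can spot "asdfghjkl"-style noise that a simple equality check cannot.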
Understanding the Verification Flow
When you run a data masking test, it follows a specific architectural flow to ensure that sensitive information never leaves the source in its original form but remains functionally useful for developers or testers.
Key Components to Look for in the Result:
When you use the prompt above, ensure the generated solution addresses these critical technical points of OpenDataMask:
Primary Key Retention: The tool must not "hash" or "mask" the ID itself if you intend to use the target database for debugging applications that rely on specific IDs.
Deterministic vs. Random Masking: If you run the test twice, does the same Source Name become the same Target Name? Depending on your use case, you may want to ask the AI to configure "Deterministic" rules.
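The difference is easy to see in miniature: a deterministic rule derives the substitute from a hash of the source value rather than from a fresh random draw, so reruns are stable. This sketch (hypothetical function and name pool, not OpenDataMask's implementation) shows the idea:

```python
import hashlib

# Illustrative substitution pool; a real rule would use a much larger one.
NAMES = ["Alice Nguyen", "Brian Okafor", "Carla Silva", "Deepak Rao"]

def deterministic_name(source_value: str, salt: str = "mask-v1") -> str:
    """Same input (and salt) always yields the same fake name."""
    digest = hashlib.sha256((salt + source_value).encode()).digest()
    return NAMES[digest[0] % len(NAMES)]

print(deterministic_name("Jane Doe") == deterministic_name("Jane Doe"))  # True
```

The salt matters: changing it remaps every value, which prevents anyone from rebuilding the mapping by hashing known names against an unsalted scheme.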
Referential Integrity: If you have an orders table linked to users, the verification script should ideally check that the foreign keys in the target DB still point to the correct (now masked) users.
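That foreign-key check reduces to a set difference once the key columns are fetched. The `orders.user_id` / `users.id` naming below follows the example in the text:

```python
# FK check sketch: every orders.user_id in the target must still reference
# an existing (masked) user row.

def check_referential_integrity(user_ids: set, order_user_ids: set) -> set:
    """Return the set of dangling foreign keys (empty means the check passes)."""
    return order_user_ids - user_ids

dangling = check_referential_integrity({"u1", "u2"}, {"u1", "u3"})
print(dangling)  # {'u3'} is a broken reference
```

An empty result is the pass condition; any surviving ids identify exactly which orders were orphaned by the masking run.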
Why this test is necessary
Data masking isn't just about changing values; it’s about Data Utility. If a developer receives a database where all names are "XXXXX", they cannot test search features or UI sorting. By requiring "Human Readable" data in the prompt, you ensure that the OpenDataMask configuration utilizes its "Faker" or "Substitution" libraries effectively, providing a realistic testing environment without the privacy risk.