feat(agent): Database Deduplication Agent#993

Open
8471919 wants to merge 9 commits into main from feat/database_duplication
Conversation

@8471919
@8471919 8471919 commented Jan 29, 2026

This pull request introduces a semantic deduplication step to the Prisma database orchestration pipeline. It adds a new agent system prompt for deduplication, implements logic to detect and group semantically duplicate tables across components, and integrates this process into the orchestration flow. Additionally, it improves developer visibility with logging and ensures that downstream schema generation and review steps operate on deduplicated components.

Major changes include:

Semantic Deduplication Agent & Prompt

  • Added a comprehensive system prompt (DATABASE_DEDUPLICATION.md) detailing the agent's responsibilities, semantic duplicate criteria, workflow, output format, and examples for identifying duplicate tables across components.

Deduplication Orchestration Logic

  • Implemented transformPrismaDeduplicationHistory, which:
    • Normalizes table names (removing prefixes, singularizing, sorting tokens) to identify strong duplicate candidates.
    • Formats and presents naming similarity hints to the agent.
    • Guides the agent to group semantically duplicate tables and output the required structure.
  • Integrated the deduplication step into the Prisma orchestration pipeline:
    • Added the deduplication orchestrator import and invocation after component review, before schema generation.
    • Ensured that only deduplicated components are passed to schema and schema review steps.
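
The normalization step described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `ASSUMED_PREFIXES`, `singularize`, `normalizeTableName`, and `findSimilarityHints` are hypothetical names, the prefix list is made up, and the singularizer is deliberately naive. The actual logic in `transformPrismaDeduplicationHistory` may differ.

```typescript
// Hypothetical sketch: names and rules are assumptions, not the PR's code.
const ASSUMED_PREFIXES: string[] = ["shopping_", "app_"]; // made-up prefixes

// Naive singularizer: enough for an illustration, not for real English.
function singularize(token: string): string {
  if (token.endsWith("ies") && token.length > 3) return token.slice(0, -3) + "y";
  if (token.endsWith("ses") || token.endsWith("xes")) return token.slice(0, -2);
  if (token.endsWith("s") && !token.endsWith("ss")) return token.slice(0, -1);
  return token;
}

// Remove a known prefix, singularize each token, then sort the tokens.
function normalizeTableName(name: string): string {
  const prefix = ASSUMED_PREFIXES.find((p) => name.startsWith(p));
  const stripped = prefix !== undefined ? name.slice(prefix.length) : name;
  return stripped
    .split("_")
    .filter((token) => token.length > 0)
    .map(singularize)
    .sort()
    .join("_");
}

// Bucket table names by normalized form; buckets with 2+ names become hints.
function findSimilarityHints(names: string[]): string[][] {
  const buckets = new Map<string, string[]>();
  for (const name of names) {
    const key = normalizeTableName(name);
    const bucket = buckets.get(key) ?? [];
    bucket.push(name);
    buckets.set(key, bucket);
  }
  return Array.from(buckets.values()).filter((b) => b.length > 1);
}
```

With these assumptions, `shopping_sale_reviews` and `review_sales` both normalize to `review_sale`, so they would be presented to the agent as a strong duplicate candidate pair.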

Developer Experience

  • Added detailed logging throughout the orchestration process to track the state of components before and after deduplication, aiding debugging and transparency.

Agent Simulation

  • Updated the agent simulation's sleep map to include the new databaseDeduplication event type.

@8471919 8471919 requested a review from samchon January 29, 2026 08:49
@8471919 8471919 self-assigned this Jan 29, 2026
@8471919 8471919 added this to WrtnLabs Jan 29, 2026
@8471919 8471919 added the enhancement New feature or request label Jan 29, 2026
@samchon samchon requested a review from Copilot January 29, 2026 10:05
Copilot AI left a comment

Pull request overview

This pull request introduces a semantic deduplication step to the Prisma database orchestration pipeline to identify and eliminate duplicate tables across components that serve the same purpose but may have different names.

Changes:

  • Adds a new deduplication agent with comprehensive system prompt (DATABASE_DEDUPLICATION.md) defining semantic duplicate criteria, workflow, and output format
  • Implements deduplication orchestration logic with naming similarity hints, Union-Find cluster merging, and deterministic resolution (keeping tables from smallest components)
  • Integrates deduplication into the Prisma orchestration pipeline between component review and schema generation phases
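
For readers unfamiliar with the technique, Union-Find cluster merging can be sketched as below. This is a minimal illustration, not the code in `AutoBeDatabaseDeduplicationProgrammer`: `TableRef` and `mergeGroups` are hypothetical names, and the resolution step (choosing which table survives) is left out.

```typescript
// Hypothetical shape for a table reference; not the PR's actual type.
interface TableRef {
  namespace: string;
  name: string;
}

// Merge overlapping duplicate groups into disjoint clusters with Union-Find.
function mergeGroups(groups: TableRef[][]): TableRef[][] {
  const tables: TableRef[] = [];
  const indexOf = new Map<string, number>();
  const getOrCreateIndex = (t: TableRef): number => {
    const key = `${t.namespace}::${t.name}`;
    let index = indexOf.get(key);
    if (index === undefined) {
      index = tables.length;
      tables.push(t);
      indexOf.set(key, index);
    }
    return index;
  };
  for (const group of groups) for (const t of group) getOrCreateIndex(t);

  const parent: number[] = tables.map((_, i) => i);
  const rank: number[] = tables.map(() => 0);

  const find = (x: number): number => {
    let root = x;
    while (parent[root]! !== root) root = parent[root]!;
    // Path compression: point every visited node directly at the root.
    while (parent[x]! !== root) {
      const next = parent[x]!;
      parent[x] = root;
      x = next;
    }
    return root;
  };

  // Union by rank keeps the trees shallow.
  const union = (a: number, b: number): void => {
    const ra = find(a);
    const rb = find(b);
    if (ra === rb) return;
    if (rank[ra]! < rank[rb]!) parent[ra] = rb;
    else if (rank[ra]! > rank[rb]!) parent[rb] = ra;
    else {
      parent[rb] = ra;
      rank[ra] = rank[ra]! + 1;
    }
  };

  for (const group of groups) {
    for (let i = 1; i < group.length; i++) {
      union(getOrCreateIndex(group[0]!), getOrCreateIndex(group[i]!));
    }
  }

  // Collect tables by root; only clusters with 2+ tables are duplicates.
  const clusters = new Map<number, TableRef[]>();
  tables.forEach((t, i) => {
    const root = find(i);
    const cluster = clusters.get(root) ?? [];
    cluster.push(t);
    clusters.set(root, cluster);
  });
  return Array.from(clusters.values()).filter((c) => c.length > 1);
}
```

The point of the merge is transitivity: if one agent reports `{users, members}` and another reports `{members, accounts}`, the two groups share a table and collapse into a single cluster of three.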

Reviewed changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| packages/agent/prompts/DATABASE_DEDUPLICATION.md | Comprehensive system prompt defining agent responsibilities, duplicate criteria, execution flow, and examples |
| packages/agent/src/orchestrate/prisma/orchestratePrismaDeduplication.ts | Main orchestration logic for running deduplication agents per component |
| packages/agent/src/orchestrate/prisma/programmers/AutoBeDatabaseDeduplicationProgrammer.ts | Validation and resolution logic including Union-Find cluster merging |
| packages/agent/src/orchestrate/prisma/histories/transformPrismaDeduplicationHistory.ts | Prompt construction with naming similarity hints based on normalized table names |
| packages/agent/src/orchestrate/prisma/structures/IAutoBeDatabaseDeduplicationApplication.ts | TypeScript interface defining agent application structure |
| packages/agent/src/orchestrate/prisma/orchestratePrisma.ts | Integration of deduplication step into main orchestration pipeline with debug logging |
| packages/interface/src/events/AutoBeDatabaseDeduplicationEvent.ts | Event definition for deduplication progress tracking |
| packages/interface/src/histories/contents/AutoBeDatabaseDeduplicationGroup.ts | Data structure for representing duplicate table groups |
| packages/ui/src/components/events/AutoBeProgressEventMovie.tsx | UI support for displaying deduplication progress |
| packages/ui/src/components/events/AutoBeEventMovie.tsx | Event type registration for UI rendering |
| packages/ui/src/structure/AutoBeListener.ts | Event listener integration |
| test/src/archive/utils/ArchiveLogger.ts | Logging support for deduplication events |
| packages/agent/src/AutoBeMockAgent.ts | Mock agent sleep time configuration |
| test/src/agent/internal/validate_interface_complement.ts | Added failures parameter (unrelated fix) |

Comment on lines +196 to +197
const [namespace, name] = key.split("::");
cluster.push({ namespace: namespace!, name: name! });

Copilot AI Jan 29, 2026

Potential bug: Using split with "::" separator and non-null assertion operators without validation. If a table key doesn't contain "::" or contains multiple instances of it, this could result in incorrect namespace/name extraction. The split could return an array with unexpected length, and the non-null assertions (namespace!, name!) could mask undefined values. Consider adding validation or using a more robust key format.
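
One way to sidestep the parsing concern entirely is to never split the key: keep a lookup from the composite key back to the original table reference. The sketch below is an assumption-laden illustration, not the PR's code; `TableRef` and `TableKeyRegistry` are hypothetical names, and `JSON.stringify` of a two-element array is used only because it yields an unambiguous key.

```typescript
// Hypothetical shape for a table reference; not the PR's actual type.
interface TableRef {
  namespace: string;
  name: string;
}

// Maps composite keys back to their source objects so keys never need parsing.
class TableKeyRegistry {
  private readonly byKey = new Map<string, TableRef>();

  // Build an unambiguous key and remember which table it came from.
  register(table: TableRef): string {
    const key = JSON.stringify([table.namespace, table.name]);
    this.byKey.set(key, table);
    return key;
  }

  // Fail loudly on unknown keys instead of masking undefined with `!`.
  resolve(key: string): TableRef {
    const table = this.byKey.get(key);
    if (table === undefined) throw new Error(`Unknown table key: ${key}`);
    return table;
  }
}
```

With this shape, a namespace or name that happens to contain `::` cannot be confused with another pair, and a malformed key raises an error rather than producing a silently wrong `namespace`/`name`.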

2. **Check the Naming Similarity Hints first** — tables with the same normalized name are strong duplicate candidates
3. For each target table, compare its name AND description against every table in other components
4. If two tables serve the same purpose → group them as duplicates
5. Call \`process({ request: { type: "complete", review: "...", duplicateGroups: [...] } })\`

Copilot AI Jan 29, 2026

Discrepancy in user message: The prompt instructs the agent to call process({ request: { type: "complete", review: "...", duplicateGroups: [...] } }), but the IComplete interface does not have a review field. The correct fields are analysis and rationale. This will cause the agent to fail validation when following the prompt instructions. The prompt should be updated to match the actual interface definition.

Suggested change
5. Call \`process({ request: { type: "complete", review: "...", duplicateGroups: [...] } })\`
5. Call \`process({ request: { type: "complete", analysis: "...", rationale: "...", duplicateGroups: [...] } })\`

` - process: progress`,
` - progress: (${event.completed} of ${event.total})`,
` - namespace: ${event.namespace}`,
` - duplicated tables: ${event.duplicateGroups.map((g) => g.tables.map((t) => t.name).join(", ")).join(", ")}`,

Copilot AI Jan 29, 2026

The logging format could be misleading. This line flattens the tables of all duplicate groups into a single comma-separated list, so it is unclear which tables belong to which group. Consider showing the group structure explicitly, for example by wrapping each group in brackets and separating groups with semicolons: duplicateGroups.map((g) => `[${g.tables.map((t) => t.name).join(", ")}]`).join("; ").

Suggested change
` - duplicated tables: ${event.duplicateGroups.map((g) => g.tables.map((t) => t.name).join(", ")).join(", ")}`,
` - duplicated tables: ${event.duplicateGroups
.map((g) => `[${g.tables.map((t) => t.name).join(", ")}]`)
.join("; ")}`,

Comment on lines 81 to 128
console.log(`----------- PRISMA AUTHORIZATION -----------`);
console.log(JSON.stringify(authorizations, null, 2));

const reviewedAuthorizations: AutoBeDatabaseComponent[] =
await orchestratePrismaAuthorizationReview(ctx, {
instruction: props.instruction,
components: authorizations,
});
console.log(`----------- PRISMA AUTHORIZATION REVIEW -----------`);
console.log(JSON.stringify(reviewedAuthorizations, null, 2));

// COMPONENT
const components: AutoBeDatabaseComponent[] =
await orchestratePrismaComponent(ctx, {
instruction: props.instruction,
groups: reviewedGroups,
});
console.log(`----------- PRISMA COMPONENT -----------`);
console.log(JSON.stringify(components, null, 2));

const reviewedComponents: AutoBeDatabaseComponent[] =
await orchestratePrismaComponentReview(ctx, {
instruction: props.instruction,
components,
});
const reviewedAllComponents: AutoBeDatabaseComponent[] = [
...reviewedAuthorizations,
...reviewedComponents,
];
console.log(`----------- PRISMA COMPONENT REVIEW -----------`);
console.log(JSON.stringify(reviewedComponents, null, 2));

const reviewedAllComponents: AutoBeDatabaseComponent[] =
AutoBeDatabaseComponentProgrammer.removeDuplicatedTable([
...reviewedAuthorizations,
...reviewedComponents,
]);

// DEDUPLICATION (semantic)
const deduplicatedComponents: AutoBeDatabaseComponent[] =
await orchestratePrismaDeduplication(ctx, {
instruction: props.instruction,
components: reviewedAllComponents,
});
console.log(`----------- PRISMA DEDUPLICATION -----------`);
console.log(JSON.stringify(deduplicatedComponents, null, 2));
console.log(
`before Tables: ${reviewedAllComponents.flatMap((c) => c.tables).length}`,
);
console.log(
`after Tables: ${deduplicatedComponents.flatMap((c) => c.tables).length}`,
);

Copilot AI Jan 29, 2026

Debug logging statements should be removed before merging to production. These console.log statements with large JSON payloads can impact performance and clutter logs in production environments. Consider using a proper logging framework with configurable log levels, or remove these statements entirely if they were only needed for development.
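
A level-gated logger along the lines suggested here might look like the sketch below. It is illustrative only: `LogLevel`, `Logger`, and `createLogger` are hypothetical names, not an existing API in this repository. The payload is passed as a function so that the large `JSON.stringify` call only runs when the level is enabled.

```typescript
type LogLevel = "debug" | "info" | "warn" | "error";

const LEVEL_ORDER: Record<LogLevel, number> = { debug: 0, info: 1, warn: 2, error: 3 };

interface Logger {
  debug(label: string, payload: () => unknown): void;
  info(label: string, payload: () => unknown): void;
}

// Hypothetical factory: messages below `minLevel` are dropped before serialization.
function createLogger(
  minLevel: LogLevel,
  sink: (line: string) => void = (line) => console.log(line),
): Logger {
  const emit = (level: LogLevel, label: string, payload: () => unknown): void => {
    if (LEVEL_ORDER[level] < LEVEL_ORDER[minLevel]) return;
    sink(`----------- ${label} -----------`);
    sink(JSON.stringify(payload(), null, 2));
  };
  return {
    debug: (label, payload) => emit("debug", label, payload),
    info: (label, payload) => emit("info", label, payload),
  };
}
```

A call such as `logger.debug("PRISMA DEDUPLICATION", () => deduplicatedComponents)` would then cost almost nothing in production when the minimum level is `info` or higher.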

…aDeduplicationHistory.ts

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@8471919 8471919 force-pushed the feat/database_duplication branch from 19bdb6b to 405c500 Compare February 3, 2026 02:18
@8471919 8471919 force-pushed the feat/database_duplication branch from 405c500 to d09f23c Compare February 3, 2026 08:08
Copilot AI left a comment

Pull request overview

Copilot reviewed 21 out of 21 changed files in this pull request and generated 3 comments.


Comment on lines +206 to +207
const [namespace, name] = key.split("::");
cluster.push({ namespace: namespace!, name: name! });

Copilot AI Feb 3, 2026

The split operation on line 206 assumes that the key will contain exactly one "::" separator. However, if a table name or namespace contains "::" in it, this could lead to incorrect parsing. While table names and namespaces are validated elsewhere with snake_case patterns that shouldn't allow "::", it would be safer to use a more robust splitting approach (e.g., splitting with a limit of 2, or using a different separator that's guaranteed not to appear in the data).

Suggested change
const [namespace, name] = key.split("::");
cluster.push({ namespace: namespace!, name: name! });
const separator = "::";
const separatorIndex = key.indexOf(separator);
let namespace: string;
let name: string;
if (separatorIndex === -1) {
// Fallback for malformed keys without the expected separator.
namespace = "";
name = key;
} else {
namespace = key.slice(0, separatorIndex);
name = key.slice(separatorIndex + separator.length);
}
cluster.push({ namespace, name });

{ name: "sale_question_answers", description: "Seller answers to customer questions" },
{
name: "sale_reviews",
description: "[INPUT] Customer reviews and ratings for purchased sales. Stores review content (rating, title, body, images), customer reference, and verified_purchase flag. Created after customer receives order. Used in product page display and seller rating calculation. Does NOT store review responses - see sale_review_replies for seller responses."

Copilot AI Feb 3, 2026

The description references a table sale_review_replies that doesn't exist in the examples. This appears to be an inconsistency - reviews typically don't have seller responses (unlike questions which have answers). Consider either:

  1. Removing the reference to sale_review_replies if reviews don't support responses, or
  2. Adding sale_review_replies to the examples if they do support responses

Based on the Q&A pattern having separate questions/answers tables, if reviews support responses, they should follow the same pattern.

Suggested change
description: "[INPUT] Customer reviews and ratings for purchased sales. Stores review content (rating, title, body, images), customer reference, and verified_purchase flag. Created after customer receives order. Used in product page display and seller rating calculation. Does NOT store review responses - see sale_review_replies for seller responses."
description: "[INPUT] Customer reviews and ratings for purchased sales. Stores review content (rating, title, body, images), customer reference, and verified_purchase flag. Created after customer receives order. Used in product page display and seller rating calculation. Does NOT store seller responses; if seller responses are required, model them in a separate table following the Q&A pattern."

reason: "Requirement 3.5 specifies customer reviews on sales, but no review table exists",
table: "sale_reviews",
description: "Customer reviews and ratings for sales with helpful votes"
description: "[INPUT] Customer reviews and ratings for purchased sales. Stores review content (rating, title, body, images), customer reference, verified_purchase flag, timestamps. Created after customer receives order. Used in product page display and seller rating calculation. Does NOT store review responses - see sale_review_replies for seller responses."

Copilot AI Feb 3, 2026

The description references a table sale_review_replies that doesn't exist in the examples. This appears to be an inconsistency - reviews typically don't have seller responses (unlike questions which have answers). Consider either:

  1. Removing the reference to sale_review_replies if reviews don't support responses, or
  2. Adding sale_review_replies to the examples if they do support responses

Based on the Q&A pattern having separate questions/answers tables, if reviews support responses, they should follow the same pattern.

Suggested change
description: "[INPUT] Customer reviews and ratings for purchased sales. Stores review content (rating, title, body, images), customer reference, verified_purchase flag, timestamps. Created after customer receives order. Used in product page display and seller rating calculation. Does NOT store review responses - see sale_review_replies for seller responses."
description: "[INPUT] Customer reviews and ratings for purchased sales. Stores review content (rating, title, body, images), customer reference, verified_purchase flag, timestamps. Created after customer receives order. Used in product page display and seller rating calculation. Does NOT store seller responses; this table only contains customer-authored feedback.",

}
}

const parent: number[] = tableKeys.map((_, i) => i);
Member

Suggested change
const parent: number[] = tableKeys.map((_, i) => i);
const parent: number[] = Array.from(tableKeys.keys());

https://developer.mozilla.org/ko/docs/Web/JavaScript/Reference/Global_Objects/Array/keys
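
Note that `Array.prototype.keys()` returns an iterator rather than an array, so its result has to be materialized (for example with `Array.from`) before it can be assigned to a `number[]`. A small check, using a made-up `tableKeysExample` array:

```typescript
// Made-up sample data for illustration only.
const tableKeysExample: string[] = ["a::x", "b::y", "c::z"];

// Original form in the PR: map each element to its index.
const viaMap: number[] = tableKeysExample.map((_, i) => i);

// keys() alone yields an iterator, which is not an array.
const keysIterator = tableKeysExample.keys();
const iteratorIsArray: boolean = Array.isArray(keysIterator); // false

// Materializing the iterator gives the same index list as the map() form.
const viaKeys: number[] = Array.from(tableKeysExample.keys());
```

Both `viaMap` and `viaKeys` end up as `[0, 1, 2]`, so the choice between them is stylistic once the iterator is converted.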

}

const parent: number[] = tableKeys.map((_, i) => i);
const rank: number[] = tableKeys.map(() => 0);
Member

Suggested change
const rank: number[] = tableKeys.map(() => 0);
const rank: number[] = new Array(tableKeys.length).fill(0);

https://developer.mozilla.org/ko/docs/Web/JavaScript/Reference/Global_Objects/Array/Array

https://developer.mozilla.org/ko/docs/Web/JavaScript/Reference/Global_Objects/Array/fill

Comment on lines +143 to +147
for (const group of groups) {
for (const table of group.tables) {
getOrCreateIndex(table.namespace, table.name);
}
}
Member

@sunrabbit123 sunrabbit123 Feb 3, 2026

Suggested change
for (const group of groups) {
for (const table of group.tables) {
getOrCreateIndex(table.namespace, table.name);
}
}
groups.flatMap(g => g.tables).forEach(t => getOrCreateIndex(t.namespace, t.name));


// Restore original order and filter empty components
const result: AutoBeDatabaseComponent[] = processed
.sort((a, b) => a.second - b.second)
Member

Nit: this doesn't really matter because the data set is small, but filtering first and then performing the next step (the sort) takes fewer operations.
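
The reordering suggested here could look roughly like this. The shapes are assumptions made for illustration: `ComponentLike`, `Indexed`, and `restoreOrder` are hypothetical, and only the `second` field (the original position) is taken from the quoted code.

```typescript
// Hypothetical shapes; only `second` (original index) appears in the quoted code.
interface ComponentLike {
  namespace: string;
  tables: string[];
}
interface Indexed {
  first: ComponentLike;
  second: number;
}

// Drop empty components first, then sort the smaller remainder by original index.
function restoreOrder(processed: Indexed[]): ComponentLike[] {
  return processed
    .filter((p) => p.first.tables.length > 0)
    .sort((a, b) => a.second - b.second)
    .map((p) => p.first);
}
```

A side effect of this order is that `filter` returns a new array, so the subsequent in-place `sort` no longer mutates `processed`.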

@8471919 8471919 force-pushed the feat/database_duplication branch from b7a4a05 to fab4642 Compare February 3, 2026 23:47
Labels

enhancement New feature or request

3 participants