Copilot AI commented Dec 30, 2025

When querying data from multiple channels (e.g., AQC + TYC), duplicate records accumulate in merged output files without deduplication.

Changes

Deduplication logic (common/utils/utils.go)

  • DeduplicateMapList(): Deduplicates records using type-specific unique keys
    • Enterprise data (invest, branch, holds, supplier, partner): pid
    • ICP records: domain + icp
    • Apps: name + bundle_id
    • WeChat/Weibo/Jobs/Copyright: appropriate unique fields
    • Filters records with all-empty key fields
  • DeduplicateData(): Batch processes all data types with summary logging
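The key-based approach described above can be sketched roughly as follows. This is not the code from the PR: the helper name `dedupByKeys`, the `uniqueKeys` table, the type names `icp` and `app`, the NUL separator, and the choice to drop (rather than keep) records whose key fields are all empty are assumptions made for illustration. Only the key fields themselves (`pid`, `domain` + `icp`, `name` + `bundle_id`) come from the description.

```go
package main

import (
	"fmt"
	"strings"
)

// uniqueKeys maps a data type to the fields that identify a record.
// The field lists follow the PR description; the type names "icp" and
// "app" are assumed for this sketch.
var uniqueKeys = map[string][]string{
	"enterprise_info": {"pid"},
	"icp":             {"domain", "icp"},
	"app":             {"name", "bundle_id"},
}

// dedupByKeys (hypothetical name) keeps the first record seen for each
// composite key and preserves input order.
func dedupByKeys(dataType string, records []map[string]string) []map[string]string {
	fields, ok := uniqueKeys[dataType]
	if !ok {
		return records // assumption: unknown types pass through unchanged
	}
	seen := make(map[string]struct{}, len(records))
	out := make([]map[string]string, 0, len(records))
	for _, rec := range records {
		parts := make([]string, len(fields))
		allEmpty := true
		for i, f := range fields {
			parts[i] = strings.TrimSpace(rec[f])
			if parts[i] != "" {
				allEmpty = false
			}
		}
		if allEmpty {
			continue // assumption: records with no usable key are dropped
		}
		// Join with a separator that is unlikely to occur in field values.
		key := strings.Join(parts, "\x00")
		if _, dup := seen[key]; dup {
			continue
		}
		seen[key] = struct{}{}
		out = append(out, rec)
	}
	return out
}

func main() {
	icp := []map[string]string{
		{"domain": "example.com", "icp": "ICP-1"},
		{"domain": "example.com", "icp": "ICP-1"}, // same composite key
		{"domain": "example.com", "icp": "ICP-2"}, // same domain, different icp
		{"domain": "", "icp": ""},                 // all key fields empty
	}
	fmt.Println(len(dedupByKeys("icp", icp)))
}
```

Joining the key fields into one string keeps the lookup a single map access per record; a struct key would work equally well when the number of fields is fixed.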

Integration (runner/runner.go)

  • Applied in OutFileByEnInfo() before file export
  • Applied in OutDataByEnInfo() for API responses
  • Single deduplication pass at output time (not incremental) for O(n) performance
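To illustrate why a single pass at output time stays linear, here is a minimal sketch of a batch step that runs once over already-merged data and reports before/after counts. The names `dedupAll`, `summary`, and the `keyField` table are hypothetical, and the sketch uses one key field per type and omits the empty-key filtering mentioned earlier; it is not the `DeduplicateData()` implementation from the PR.

```go
package main

import "fmt"

// summary holds record counts for one data type (hypothetical type).
type summary struct {
	Before, After int
}

// dedupAll (hypothetical name) visits every record of every data type
// exactly once, using a set of seen keys, so the cost is linear in the
// total number of records. It rewrites data in place.
func dedupAll(data map[string][]map[string]string, keyField map[string]string) map[string]summary {
	stats := make(map[string]summary, len(data))
	for typ, rows := range data {
		field, ok := keyField[typ]
		if !ok {
			continue // assumption: types without a known key are skipped
		}
		seen := make(map[string]bool, len(rows))
		kept := make([]map[string]string, 0, len(rows))
		for _, r := range rows {
			k := r[field]
			if seen[k] {
				continue
			}
			seen[k] = true
			kept = append(kept, r)
		}
		data[typ] = kept
		stats[typ] = summary{Before: len(rows), After: len(kept)}
	}
	return stats
}

func main() {
	// Results from two channels, already merged before export.
	merged := map[string][]map[string]string{
		"enterprise_info": {
			{"name": "小米科技", "pid": "123456"}, // from AQC
			{"name": "小米科技", "pid": "123456"}, // from TYC
			{"name": "华为技术", "pid": "789012"},
		},
	}
	stats := dedupAll(merged, map[string]string{"enterprise_info": "pid"})
	for typ, s := range stats {
		fmt.Printf("%s: %d -> %d\n", typ, s.Before, s.After)
	}
}
```

Running the pass once before export avoids rebuilding the seen-set each time a channel returns results, at the cost of holding duplicates in memory until output.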

Example

// Before: duplicate records from multiple sources
enterpriseData := []map[string]string{
    {"name": "小米科技", "pid": "123456"},  // from AQC
    {"name": "小米科技", "pid": "123456"},  // from TYC (duplicate)
    {"name": "华为技术", "pid": "789012"},
}

// After deduplication
result := DeduplicateMapList("enterprise_info", enterpriseData)
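// Returns 2 items, logs: "数据去重完成: 原始 3 条,去重后 2 条,移除重复 1 条"
// (in English: "Deduplication complete: 3 original records, 2 after deduplication, 1 duplicate removed")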

Automatic behavior change - no configuration required. Records with identical unique keys are now removed during merge operations.

Original prompt

This section details the original issue you should resolve

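<issue_title>Please add data deduplication</issue_title>
<issue_description>After files are merged, some records are duplicated, for example when data queried from multiple channels is combined. Please add deduplication at merge time.</issue_description>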

Comments on the Issue (you are @copilot in this section)



Copilot AI and others added 4 commits December 30, 2025 02:25
Co-authored-by: keac <16091665+keac@users.noreply.github.com>
Co-authored-by: keac <16091665+keac@users.noreply.github.com>
…formance

Co-authored-by: keac <16091665+keac@users.noreply.github.com>
Co-authored-by: keac <16091665+keac@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Add data deduplication feature for merged files" to "Add deduplication for merged data from multiple sources" Dec 30, 2025
Copilot AI requested a review from keac December 30, 2025 02:31

Development

Successfully merging this pull request may close these issues.

Please add data deduplication

2 participants