
Commit 70ffe00

docs: align naming to influencer list and include minimal publish components
1 parent 3d58e2e commit 70ffe00

6 files changed

Lines changed: 412 additions & 8 deletions


.gitignore

Lines changed: 5 additions & 1 deletion
```diff
@@ -21,8 +21,11 @@ data/**
 # Docs / audit / QA (not tracked)
 docs/**
 
-# Lists and seeds (not tracked)
+# Lists and seeds (track only core rules)
 lists/**
+!lists/rules/
+!lists/rules/brand_heuristics.yml
+!lists/rules/risk_terms.yml
 
 # Processed/working batches (not tracked)
 processed_batches/**
@@ -32,6 +35,7 @@ data/test/**
 data/samples/**
 data/latest/**
 data/prefetched.sample.jsonl
+!data/prefetched.sample.jsonl
 
 # Scripts/tools: track only minimal publish set
 scripts/**
```
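The re-include rules above can be sanity-checked with `git check-ignore` in a throwaway repo; a minimal sketch, assuming only the paths named in the diff (the scratch repo itself is temporary):

```shell
# Sanity-check the negation rules in a scratch repo (paths mirror the diff)
set -eu
tmp=$(mktemp -d)
cd "$tmp"
git init -q
cat > .gitignore <<'EOF'
lists/**
!lists/rules/
!lists/rules/brand_heuristics.yml
!lists/rules/risk_terms.yml
EOF
mkdir -p lists/rules
touch lists/seeds.txt lists/rules/brand_heuristics.yml

git check-ignore -q lists/seeds.txt && echo "seeds: ignored"
git check-ignore -q lists/rules/brand_heuristics.yml || echo "rules: tracked"
```

The intermediate `!lists/rules/` line matters: `lists/**` excludes the `lists/rules` directory itself, and Git cannot re-include a file whose parent directory remains excluded, so the directory must be re-included before the two file rules can take effect.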

README.md

Lines changed: 10 additions & 4 deletions
```diff
@@ -1,6 +1,6 @@
 # influx
 
-High-signal X/Twitter creator index — functional demo built on the open-source multi-agent framework [CCCC](https://github.com/ChesterRa/cccc). Ready-to-use whitelist of influential individual accounts (non-brand/non-official), curated for downstream ingestion.
+High-signal X/Twitter influencer list — functional demo built on the open-source multi-agent framework [CCCC](https://github.com/ChesterRa/cccc). Ready-to-use whitelist of influential individual accounts (non-brand/non-official), curated for downstream ingestion. This open-source bundle is a **download-and-use** minimal set; the full production flow requires CCCC + RUBE MCP (Twitter tools) for data fetching.
 
 [![License: Apache-2.0](https://img.shields.io/badge/license-Apache--2.0-blue.svg)](LICENSE) [![Schema](https://img.shields.io/badge/schema-v1.0.0-green.svg)](schema/bigv.schema.json)
 
@@ -13,6 +13,12 @@ High-signal X/Twitter creator index — functional demo built on the open-source
 - Gzipped: [`data/release/influx-latest.jsonl.gz`](data/release/influx-latest.jsonl.gz)
 - Manifest: [`data/release/manifest.json`](data/release/manifest.json)
 - Delivered as data only; you don’t need to run the pipeline to use it.
+- Components included here (minimal open-source set):
+  - Data: `data/release/` (latest JSONL + manifest)
+  - Guard: `scripts/pipeline_guard.sh`
+  - Schema: `schema/bigv.schema.json`
+  - Rules: `lists/rules/brand_heuristics.yml`, `lists/rules/risk_terms.yml`
+  - Sample prefetched JSONL: `data/prefetched.sample.jsonl` (for local filter demo)
 
 ## Why it’s useful
 - **Content ingestion/ranking:** high signal-to-noise whitelist reduces crawl and processing cost.
@@ -52,9 +58,9 @@ print(len(ai_authors))
 - Full schema: [`schema/bigv.schema.json`](schema/bigv.schema.json)
 - Key fields: `id` (author_id), `handle`, `name`, `verified`, `followers_count`, `lang_primary`, `topic_tags`, `metrics_30d*`, `meta.sources` (with evidence/fetched_at), `provenance_hash`.
 
-## How it’s produced (context only; not required to use)
-- Two-step, no local MCP dependency: fetch Twitter users in an MCP-capable environment → save prefetched JSONL → run `influx-harvest x-lists|bulk --prefetched-users <file>` here → enforce `scripts/pipeline_guard.sh` (dedup handle/id, evidence required, placeholder/“000” rejection, strict schema) → publish to `data/release/`.
-- Only prefetched JSONL inputs are accepted; manual edits to `latest` are forbidden.
+## How it’s produced (context; requires external deps)
+- Full flow requires: **CCCC + RUBE MCP (Twitter tools)** to fetch users → prefetched JSONL → run `influx-harvest x-lists|bulk --prefetched-users <file>` → enforce `scripts/pipeline_guard.sh` (dedup handle/id, evidence required, placeholder/“000” rejection, strict schema) → publish to `data/release/`.
+- This repo ships the minimal set (data + guard + schema + rules + sample prefetched); MCP fetching is not included here.
 
 ## License
 - Apache-2.0 (covers code and released data).
```
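The manifest fields the README names (`count`, `sha256`) support a consumer-side integrity check before ingestion; a hedged sketch, with two hypothetical records standing in for `data/release/influx-latest.jsonl`:

```python
import hashlib
import json

# Two hypothetical records standing in for data/release/influx-latest.jsonl;
# only the manifest fields named in the diff (count, sha256) are exercised.
records = [
    {"id": "1", "handle": "pmarca", "topic_tags": ["vc", "ai"]},
    {"id": "2", "handle": "someauthor", "topic_tags": ["ai"]},
]
payload = "".join(json.dumps(r) + "\n" for r in records).encode()
manifest = {"count": len(records), "sha256": hashlib.sha256(payload).hexdigest()}

# Consumer-side integrity check: line count and digest must match the manifest
lines = [ln for ln in payload.decode().splitlines() if ln]
assert manifest["count"] == len(lines)
assert manifest["sha256"] == hashlib.sha256(payload).hexdigest()

# Same kind of topic filter the README quick start runs (ai_authors)
ai_authors = [json.loads(ln) for ln in lines
              if "ai" in json.loads(ln).get("topic_tags", [])]
print(len(ai_authors))  # 2
```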

README.zh.md

Lines changed: 4 additions & 3 deletions
```diff
@@ -1,12 +1,13 @@
 # influx
 
-Functional demo project built on the open-source multi-agent framework [CCCC](https://github.com/ChesterRa/cccc), providing a high-value whitelist of popular X.com influencers (highly active, non-brand/non-official); ready to use on download.
+Functional demo project built on the open-source multi-agent framework [CCCC](https://github.com/ChesterRa/cccc), providing a high-value influencer list of popular X.com accounts (emphasizing individuals, non-brand/non-official); ready to use on download. This repo contains only the "download-and-use" minimal set; the full production flow requires CCCC + RUBE MCP (Twitter tools) to fetch data.
 
 [![License: Apache-2.0](https://img.shields.io/badge/license-Apache--2.0-blue.svg)](LICENSE) [![Schema](https://img.shields.io/badge/schema-v1.0.0-green.svg)](schema/bigv.schema.json)
 
 ## What is this
 - A strictly filtered list of popular influencers (individual accounts, non-brand/non-official), target size 5k–10k.
 - Current release: `data/release/influx-latest.jsonl` (302 entries) plus `manifest.json` (with count/sha256/schema_version/timestamp/score_version).
+- Minimal open-source set included: release data, `scripts/pipeline_guard.sh`, `schema/bigv.schema.json`, rules (`lists/rules/brand_heuristics.yml`, `lists/rules/risk_terms.yml`), and the sample `data/prefetched.sample.jsonl` (for the local filter demo).
 - Intended for consumers to use the data directly; no need to run the production pipeline.
 
 ## Why it's valuable
@@ -53,8 +54,8 @@ print(len(ai_authors))
 - Key fields: `id` (author_id), `handle`, `name`, `verified`, `followers_count`, `lang_primary`, `topic_tags`, `metrics_30d*`, `meta.sources` (with evidence/fetched_at), `provenance_hash`.
 
 ## How it's produced (for reference; consumers don't need to run it)
-- Two-stage, no local MCP dependency: batch-fetch Twitter user JSONL (prefetched) in an MCP-capable environment → filter in this repo with `influx-harvest x-lists|bulk --prefetched-users <file>` → run `scripts/pipeline_guard.sh` (dedup handle/id, required evidence, placeholder/"000" rejection, strict schema) → publish to `data/release/`.
-- Only prefetched JSONL is accepted; manual edits to `latest` are forbidden.
+- Full flow requires: CCCC + RUBE MCP (Twitter tools) to fetch users → produce prefetched JSONL → filter with `influx-harvest x-lists|bulk --prefetched-users <file>` → run `scripts/pipeline_guard.sh` (dedup handle/id, required evidence, placeholder/"000" rejection, strict schema) → publish to `data/release/`.
+- This repo ships only the data, guard, schema, rules, and sample; the MCP fetching part is not included.
 
 ## License
 - Apache-2.0 (same for code and data).
```
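The guard checks listed in both READMEs (dedup handle/id, evidence required, placeholder/"000" rejection) can be sketched in Python; `scripts/pipeline_guard.sh` itself is not part of this commit, so the `guard` helper and record layout here are illustrative assumptions:

```python
import json

def guard(lines):
    """Illustrative guard pass: dedup, placeholder rejection, evidence check."""
    seen_ids, seen_handles, kept, errors = set(), set(), [], []
    for n, line in enumerate(lines, 1):
        rec = json.loads(line)
        if rec["id"] in seen_ids or rec["handle"].lower() in seen_handles:
            errors.append(f"line {n}: duplicate handle/id")
            continue
        if rec["id"].strip("0") == "":  # placeholder/"000" rejection
            errors.append(f"line {n}: placeholder id")
            continue
        if not rec.get("meta", {}).get("sources"):  # evidence required
            errors.append(f"line {n}: missing evidence")
            continue
        seen_ids.add(rec["id"])
        seen_handles.add(rec["handle"].lower())
        kept.append(rec)
    return kept, errors

rows = [
    '{"id":"42","handle":"pmarca","meta":{"sources":[{"evidence":"x-list"}]}}',
    '{"id":"42","handle":"dup","meta":{"sources":[{"evidence":"x-list"}]}}',
    '{"id":"000","handle":"ghost","meta":{"sources":[{"evidence":"x-list"}]}}',
]
kept, errors = guard(rows)
print(len(kept), len(errors))  # 1 2
```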

data/prefetched.sample.jsonl

Lines changed: 2 additions & 0 deletions
```jsonl
{"id":"1","username":"pmarca","name":"Marc Andreessen","description":"Co-founder a16z","verified_type":"blue","public_metrics":{"followers_count":1600000},"url":"https://a16z.com"}
{"id":"2","username":"fakebrand","name":"Acme Inc.","description":"Official account","verified_type":"org","public_metrics":{"followers_count":500000},"url":"https://acme.com"}
```
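The two sample rows above parse as ordinary JSONL user objects; a minimal read, assuming only the fields present in the sample (`id`, `username`, `verified_type`, `public_metrics`):

```python
import json

# The two sample rows from data/prefetched.sample.jsonl, inlined for a
# self-contained demo of a prefetched-users input.
sample = [
    '{"id":"1","username":"pmarca","name":"Marc Andreessen","verified_type":"blue","public_metrics":{"followers_count":1600000}}',
    '{"id":"2","username":"fakebrand","name":"Acme Inc.","verified_type":"org","public_metrics":{"followers_count":500000}}',
]
users = [json.loads(line) for line in sample]
# Drop org-verified accounts, mirroring the brand/official filter's intent
individuals = [u for u in users if u["verified_type"] != "org"]
print([u["username"] for u in individuals])  # ['pmarca']
```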

lists/rules/brand_heuristics.yml

Lines changed: 243 additions & 0 deletions
```yaml
# Brand and Official Account Heuristics
# Used to filter out brand, media, official, and organizational accounts
# from the influencer pool. Rules are evaluated as OR (any match triggers flag).

version: "2.0.0"
updated: "2025-11-14"

# Keywords in name or username (case-insensitive, word boundary match)
name_keywords:
  official_indicators:
    - "official"
    - "team" # NOTE: May FP on gaming/esports teams ("Team Liquid player"). Recommend context check (gaming domain indicators).
    - "support"
    - "help"
    - "press"
    - "pr"
    - "media"
    - "news"
    - "newsroom"

  corporate_indicators:
    - "corp"
    - "inc" # NOTE: "Inc" as personal nickname (e.g., "John Doe Inc") may cause FP; use context/verified status to override
    - "ltd"
    - "llc"
    - "gmbh"
    - "company"
    - "enterprises"

  brand_commerce:
    - "store"
    - "shop" # NOTE: May FP on job descriptions ("at Shopify"). Recommend word-boundary match in implementation.
    - "shopping"
    - "deals"
    - "sales"
    - "promo"
    - "coupon"

  # Major tech companies and platforms (corporate accounts)
  tech_corporations:
    - "amazon web services"
    - "aws"
    - "microsoft azure"
    - "azure"
    - "google cloud"
    - "google ai"
    - "google deepmind"
    - "tensor flow"
    - "kaggle"
    - "docker"
    - "figma"
    - "mongodb"
    - "stripe"
    - "next.js"
    - "nextjs"
    - "vercel"
    - "netlify"
    - "cloudflare"
    - "tailwind css"
    - "tailwindcss"
    - "react"
    - "reactjs"
    - "vue.js"
    - "vuejs"
    - "angular"
    - "nvidia"
    - "github"
    - "gitlab"
    - "openai"
    - "anthropic"
    - "hugging face"
    - "huggingface"
    - "pytorch"
    - "tensorflow"
    - "def con"
    - "defcon"
    - "supabase"
    - "linear"
    - "remix"
    - "digitalocean"
    - "heroku"

  tech_frameworks:
    - "platform"
    - "framework"
    - "ecosystem"
    - "sdk"
    - "api"
    - "developer tools"
    - "open source"
    - "library"
    - "runtime"
    - "database"
    - "devtools"

  conference_events:
    - "conference"
    - "summit"
    - "keynote"
    - "event"
    - "festival"
    - "convention"
    - "meetup"

  # Chinese official account patterns (newly added Chinese official-account detection)
  chinese_official:
    - "官号"
    - "官方"
    - "公司"
    - "企业"
    - "组织"
    - "机构"
    - "平台"
    - "官方账号"
    - "官方推特"
    - "客服"

# Keywords in bio/description (case-insensitive, substring match)
bio_keywords:
  organizational:
    - "official account"
    - "official twitter"
    - "official page"
    - "managed by"
    - "run by our team"
    - "corporate account"
    - "company news"
    - "press releases"
    - "media inquiries"
    - "for support"
    - "customer service"

  media_publishers:
    - "news outlet"
    - "news organization"
    - "media company"
    - "news network"
    - "publishing"
    - "journalist at"
    - "reporter for"
    - "editor at"

  aggregators_bots:
    - "automated"
    - "bot account"
    - "news aggregator"
    - "auto-tweet"
    - "rss feed"

# Domain patterns in profile URL or bio links (regex)
domain_patterns:
  - pattern: ".*\\.(gov|edu|org)$"
    reason: "institutional_domain"
    exceptions: ["github.org", "huggingface.co"]

  - pattern: ".*(shop|store|deals|buy|cart|checkout).*"
    reason: "ecommerce_domain"

  - pattern: ".*(news|press|media|journal|times|post|tribune).*"
    reason: "media_domain"

# Verification status rules
verification_rules:
  # X Blue (verified=blue) is personal; keep unless other heuristics match
  # Organization verification (verified=org) is always flagged
  flag_org_verification: true

  # Legacy verified (verified=legacy) before 2023: mixed (keep unless other heuristics match)
  # Gold verified (not in current schema): future-proof placeholder

# Follower/following ratio heuristics (optional, low weight)
ratio_heuristics:
  # If followers/following > 100 AND bio matches corporate, likely brand
  high_ratio_threshold: 100
  # If following = 0, likely automated/official (but not sufficient alone)
  zero_following_flag: false

# Known organization handles (AUTO-FILTER - these are always removed)
# Format: handle (without @) - any match = immediate removal
org_handles_blacklist:
  - "github"
  - "nvidia"
  - "awscloud"
  - "azure"
  - "reactjs"
  - "code"
  - "huggingface"
  - "googlecloud"
  - "docker"
  - "figma"
  - "mongodb"
  - "vuejs"
  - "nextjs"
  - "golang"
  - "netlify"
  - "cloudflare"
  - "supabase"
  - "linear"
  - "remix"
  - "gitlab"
  - "openai"
  - "anthropic"
  - "pytorch"
  - "tensorflow"
  - "vercel"
  - "digitalocean"
  - "heroku"
  - "stripe"
  - "tailwindcss"
  - "defcon"

# Exceptions (handles that match heuristics but are known individuals)
# Format: handle (without @)
exceptions:
  - "example_personal_account" # Placeholder
  # Crypto/Web3 individual influencers (high follower accounts incorrectly flagged as org)
  - "100trillionUSD"
  - "CredibleCrypto"
  - "CryptoGodJohn"
  - "TheCryptoDog"
  - "DocumentingBTC"
  - "PlanB"
  - "Loomdart"
  - "WhalePanda"
  - "Winklevoss"
  # Gaming/Creator individual accounts
  - "Ninja"
  - "DrDisrespect"
  - "MarkRuffalo"
  - "JackMa"
  - "JeffBezos"
  # Add known false positives here as they're discovered

# Scoring adjustments (for fine-tuning, not hard filters)
# These adjust the is_org / is_official confidence score
confidence_weights:
  name_keyword_match: 0.6
  bio_keyword_match: 0.4
  domain_match: 0.8
  org_verification: 1.0

# Threshold for flagging (sum of weights)
flag_threshold: 0.7 # ≥0.7 → is_org or is_official = true
```
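A minimal sketch of how `confidence_weights` and `flag_threshold` above might combine, applied to the two records from `data/prefetched.sample.jsonl`; the rule subsets are inlined rather than parsed from the YAML, and the scoring function is illustrative, not the repo's actual `influx-harvest` implementation:

```python
import re

# Inlined subsets of the rules above (illustrative). Weights and threshold
# copy confidence_weights / flag_threshold from this file.
NAME_KEYWORDS = {"official", "inc", "team", "store"}
BIO_KEYWORDS = {"official account", "customer service"}
WEIGHTS = {"name_keyword_match": 0.6, "bio_keyword_match": 0.4, "org_verification": 1.0}
FLAG_THRESHOLD = 0.7

def org_score(user):
    """Sum the weights of matching heuristics (OR semantics, weighted)."""
    score = 0.0
    # Word-boundary name match, as the YAML comments recommend
    name_tokens = set(re.findall(r"[a-z]+", user.get("name", "").lower()))
    if name_tokens & NAME_KEYWORDS:
        score += WEIGHTS["name_keyword_match"]
    bio = user.get("description", "").lower()
    if any(k in bio for k in BIO_KEYWORDS):  # substring match for bios
        score += WEIGHTS["bio_keyword_match"]
    if user.get("verified_type") == "org":   # org verification always flagged
        score += WEIGHTS["org_verification"]
    return score

# The two records from data/prefetched.sample.jsonl
pmarca = {"name": "Marc Andreessen", "description": "Co-founder a16z", "verified_type": "blue"}
acme = {"name": "Acme Inc.", "description": "Official account", "verified_type": "org"}
print(org_score(pmarca) >= FLAG_THRESHOLD, org_score(acme) >= FLAG_THRESHOLD)  # False True
```

Note how `acme` trips three heuristics at once ("inc" in the name, "official account" in the bio, org verification), so it clears the 0.7 threshold by a wide margin, while `pmarca` matches none.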
