
Commit 70ffe00

docs: align naming to influencer list and include minimal publish components
1 parent 3d58e2e commit 70ffe00

6 files changed

Lines changed: 412 additions & 8 deletions


.gitignore

Lines changed: 5 additions & 1 deletion
```diff
@@ -21,8 +21,11 @@ data/**
 # Docs / audit / QA (not tracked)
 docs/**
 
-# Lists and seeds (not tracked)
+# Lists and seeds (track only core rules)
 lists/**
+!lists/rules/
+!lists/rules/brand_heuristics.yml
+!lists/rules/risk_terms.yml
 
 # Processed/working batches (not tracked)
 processed_batches/**
@@ -32,6 +35,7 @@ data/test/**
 data/samples/**
 data/latest/**
 data/prefetched.sample.jsonl
+!data/prefetched.sample.jsonl
 
 # Scripts/tools: track only minimal publish set
 scripts/**
```
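The re-include rules above can be sanity-checked with `git check-ignore` in a throwaway repo; a minimal sketch, assuming only the paths named in the diff (the scratch repo itself is temporary):

```shell
# Sanity-check the negation rules in a scratch repo (paths mirror the diff)
set -eu
tmp=$(mktemp -d)
cd "$tmp"
git init -q
cat > .gitignore <<'EOF'
lists/**
!lists/rules/
!lists/rules/brand_heuristics.yml
!lists/rules/risk_terms.yml
EOF
mkdir -p lists/rules
touch lists/seeds.txt lists/rules/brand_heuristics.yml

git check-ignore -q lists/seeds.txt && echo "seeds: ignored"
git check-ignore -q lists/rules/brand_heuristics.yml || echo "rules: tracked"
```

The intermediate `!lists/rules/` line matters: `lists/**` excludes the `lists/rules` directory itself, and Git cannot re-include a file whose parent directory remains excluded, so the directory must be re-included before the two file rules can take effect.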

README.md

Lines changed: 10 additions & 4 deletions
```diff
@@ -1,6 +1,6 @@
 # influx
 
-High-signal X/Twitter creator index — functional demo built on the open-source multi-agent framework [CCCC](https://github.com/ChesterRa/cccc). Ready-to-use whitelist of influential individual accounts (non-brand/non-official), curated for downstream ingestion.
+High-signal X/Twitter influencer list — functional demo built on the open-source multi-agent framework [CCCC](https://github.com/ChesterRa/cccc). Ready-to-use whitelist of influential individual accounts (non-brand/non-official), curated for downstream ingestion. This open-source bundle is a **download-and-use** minimal set; the full production flow requires CCCC + RUBE MCP (Twitter tools) for data fetching.
 
 [![License: Apache-2.0](https://img.shields.io/badge/license-Apache--2.0-blue.svg)](LICENSE) [![Schema](https://img.shields.io/badge/schema-v1.0.0-green.svg)](schema/bigv.schema.json)
 
@@ -13,6 +13,12 @@ High-signal X/Twitter creator index — functional demo built on the open-source
 - Gzipped: [`data/release/influx-latest.jsonl.gz`](data/release/influx-latest.jsonl.gz)
 - Manifest: [`data/release/manifest.json`](data/release/manifest.json)
 - Delivered as data only; you don’t need to run the pipeline to use it.
+- Components included here (minimal open-source set):
+  - Data: `data/release/` (latest JSONL + manifest)
+  - Guard: `scripts/pipeline_guard.sh`
+  - Schema: `schema/bigv.schema.json`
+  - Rules: `lists/rules/brand_heuristics.yml`, `lists/rules/risk_terms.yml`
+  - Sample prefetched JSONL: `data/prefetched.sample.jsonl` (for local filter demo)
 
 ## Why it’s useful
 - **Content ingestion/ranking:** high signal-to-noise whitelist reduces crawl and processing cost.
@@ -52,9 +58,9 @@ print(len(ai_authors))
 - Full schema: [`schema/bigv.schema.json`](schema/bigv.schema.json)
 - Key fields: `id` (author_id), `handle`, `name`, `verified`, `followers_count`, `lang_primary`, `topic_tags`, `metrics_30d*`, `meta.sources` (with evidence/fetched_at), `provenance_hash`.
 
-## How it’s produced (context only; not required to use)
-- Two-step, no local MCP dependency: fetch Twitter users in an MCP-capable environment → save prefetched JSONL → run `influx-harvest x-lists|bulk --prefetched-users <file>` here → enforce `scripts/pipeline_guard.sh` (dedup handle/id, evidence required, placeholder/“000” rejection, strict schema) → publish to `data/release/`.
-- Only prefetched JSONL inputs are accepted; manual edits to `latest` are forbidden.
+## How it’s produced (context; requires external deps)
+- Full flow requires: **CCCC + RUBE MCP (Twitter tools)** to fetch users → prefetched JSONL → run `influx-harvest x-lists|bulk --prefetched-users <file>` → enforce `scripts/pipeline_guard.sh` (dedup handle/id, evidence required, placeholder/“000” rejection, strict schema) → publish to `data/release/`.
+- This repo ships the minimal set (data + guard + schema + rules + sample prefetched); MCP fetching is not included here.
 
 ## License
 - Apache-2.0 (covers code and released data).
```
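The manifest fields the README names (`count`, `sha256`) support a consumer-side integrity check before ingestion; a hedged sketch, with two hypothetical records standing in for `data/release/influx-latest.jsonl`:

```python
import hashlib
import json

# Two hypothetical records standing in for data/release/influx-latest.jsonl;
# only the manifest fields named in the diff (count, sha256) are exercised.
records = [
    {"id": "1", "handle": "pmarca", "topic_tags": ["vc", "ai"]},
    {"id": "2", "handle": "someauthor", "topic_tags": ["ai"]},
]
payload = "".join(json.dumps(r) + "\n" for r in records).encode()
manifest = {"count": len(records), "sha256": hashlib.sha256(payload).hexdigest()}

# Consumer-side integrity check: line count and digest must match the manifest
lines = [ln for ln in payload.decode().splitlines() if ln]
assert manifest["count"] == len(lines)
assert manifest["sha256"] == hashlib.sha256(payload).hexdigest()

# Same kind of topic filter the README quick start runs (ai_authors)
ai_authors = [json.loads(ln) for ln in lines
              if "ai" in json.loads(ln).get("topic_tags", [])]
print(len(ai_authors))  # 2
```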

README.zh.md

Lines changed: 4 additions & 3 deletions
```diff
@@ -1,12 +1,13 @@
 # influx
 
-Functional demo project built on the open-source multi-agent framework [CCCC](https://github.com/ChesterRa/cccc), providing a high-value whitelist of popular X.com influencers (highly active, non-brand/non-official); ready to use on download.
+Functional demo project built on the open-source multi-agent framework [CCCC](https://github.com/ChesterRa/cccc), providing a high-value influencer list of popular X.com accounts (emphasizing individuals, non-brand/non-official); ready to use on download. This repo contains only the "download-and-use" minimal set; the full production flow requires CCCC + RUBE MCP (Twitter tools) to fetch data.
 
 [![License: Apache-2.0](https://img.shields.io/badge/license-Apache--2.0-blue.svg)](LICENSE) [![Schema](https://img.shields.io/badge/schema-v1.0.0-green.svg)](schema/bigv.schema.json)
 
 ## What is this
 - A strictly filtered list of popular influencers (individual accounts, non-brand/non-official), target size 5k–10k.
 - Current release: `data/release/influx-latest.jsonl` (302 entries) plus `manifest.json` (with count/sha256/schema_version/timestamp/score_version).
+- Minimal open-source set included: release data, `scripts/pipeline_guard.sh`, `schema/bigv.schema.json`, rules (`lists/rules/brand_heuristics.yml`, `lists/rules/risk_terms.yml`), and the sample `data/prefetched.sample.jsonl` (for the local filter demo).
 - Intended for consumers to use the data directly; no need to run the production pipeline.
 
 ## Why it's valuable
@@ -53,8 +54,8 @@ print(len(ai_authors))
 - Key fields: `id` (author_id), `handle`, `name`, `verified`, `followers_count`, `lang_primary`, `topic_tags`, `metrics_30d*`, `meta.sources` (with evidence/fetched_at), `provenance_hash`.
 
 ## How it's produced (for reference; consumers don't need to run it)
-- Two-stage, no local MCP dependency: batch-fetch Twitter user JSONL (prefetched) in an MCP-capable environment → filter in this repo with `influx-harvest x-lists|bulk --prefetched-users <file>` → run `scripts/pipeline_guard.sh` (dedup handle/id, required evidence, placeholder/"000" rejection, strict schema) → publish to `data/release/`.
-- Only prefetched JSONL is accepted; manual edits to `latest` are forbidden.
+- Full flow requires: CCCC + RUBE MCP (Twitter tools) to fetch users → produce prefetched JSONL → filter with `influx-harvest x-lists|bulk --prefetched-users <file>` → run `scripts/pipeline_guard.sh` (dedup handle/id, required evidence, placeholder/"000" rejection, strict schema) → publish to `data/release/`.
+- This repo ships only the data, guard, schema, rules, and sample; the MCP fetching part is not included.
 
 ## License
 - Apache-2.0 (same for code and data).
```
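The guard checks listed in both READMEs (dedup handle/id, evidence required, placeholder/"000" rejection) can be sketched in Python; `scripts/pipeline_guard.sh` itself is not part of this commit, so the `guard` helper and record layout here are illustrative assumptions:

```python
import json

def guard(lines):
    """Illustrative guard pass: dedup, placeholder rejection, evidence check."""
    seen_ids, seen_handles, kept, errors = set(), set(), [], []
    for n, line in enumerate(lines, 1):
        rec = json.loads(line)
        if rec["id"] in seen_ids or rec["handle"].lower() in seen_handles:
            errors.append(f"line {n}: duplicate handle/id")
            continue
        if rec["id"].strip("0") == "":  # placeholder/"000" rejection
            errors.append(f"line {n}: placeholder id")
            continue
        if not rec.get("meta", {}).get("sources"):  # evidence required
            errors.append(f"line {n}: missing evidence")
            continue
        seen_ids.add(rec["id"])
        seen_handles.add(rec["handle"].lower())
        kept.append(rec)
    return kept, errors

rows = [
    '{"id":"42","handle":"pmarca","meta":{"sources":[{"evidence":"x-list"}]}}',
    '{"id":"42","handle":"dup","meta":{"sources":[{"evidence":"x-list"}]}}',
    '{"id":"000","handle":"ghost","meta":{"sources":[{"evidence":"x-list"}]}}',
]
kept, errors = guard(rows)
print(len(kept), len(errors))  # 1 2
```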

data/prefetched.sample.jsonl

Lines changed: 2 additions & 0 deletions
```jsonl
{"id":"1","username":"pmarca","name":"Marc Andreessen","description":"Co-founder a16z","verified_type":"blue","public_metrics":{"followers_count":1600000},"url":"https://a16z.com"}
{"id":"2","username":"fakebrand","name":"Acme Inc.","description":"Official account","verified_type":"org","public_metrics":{"followers_count":500000},"url":"https://acme.com"}
```
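The two sample rows above parse as ordinary JSONL user objects; a minimal read, assuming only the fields present in the sample (`id`, `username`, `verified_type`, `public_metrics`):

```python
import json

# The two sample rows from data/prefetched.sample.jsonl, inlined for a
# self-contained demo of a prefetched-users input.
sample = [
    '{"id":"1","username":"pmarca","name":"Marc Andreessen","verified_type":"blue","public_metrics":{"followers_count":1600000}}',
    '{"id":"2","username":"fakebrand","name":"Acme Inc.","verified_type":"org","public_metrics":{"followers_count":500000}}',
]
users = [json.loads(line) for line in sample]
# Drop org-verified accounts, mirroring the brand/official filter's intent
individuals = [u for u in users if u["verified_type"] != "org"]
print([u["username"] for u in individuals])  # ['pmarca']
```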

lists/rules/brand_heuristics.yml

Lines changed: 243 additions & 0 deletions
```yaml
# Brand and Official Account Heuristics
# Used to filter out brand, media, official, and organizational accounts
# from the influencer pool. Rules are evaluated as OR (any match triggers flag).

version: "2.0.0"
updated: "2025-11-14"

# Keywords in name or username (case-insensitive, word boundary match)
name_keywords:
  official_indicators:
    - "official"
    - "team" # NOTE: May FP on gaming/esports teams ("Team Liquid player"). Recommend context check (gaming domain indicators).
    - "support"
    - "help"
    - "press"
    - "pr"
    - "media"
    - "news"
    - "newsroom"

  corporate_indicators:
    - "corp"
    - "inc" # NOTE: "Inc" as personal nickname (e.g., "John Doe Inc") may cause FP; use context/verified status to override
    - "ltd"
    - "llc"
    - "gmbh"
    - "company"
    - "enterprises"

  brand_commerce:
    - "store"
    - "shop" # NOTE: May FP on job descriptions ("at Shopify"). Recommend word-boundary match in implementation.
    - "shopping"
    - "deals"
    - "sales"
    - "promo"
    - "coupon"

  # Major tech companies and platforms (corporate accounts)
  tech_corporations:
    - "amazon web services"
    - "aws"
    - "microsoft azure"
    - "azure"
    - "google cloud"
    - "google ai"
    - "google deepmind"
    - "tensor flow"
    - "kaggle"
    - "docker"
    - "figma"
    - "mongodb"
    - "stripe"
    - "next.js"
    - "nextjs"
    - "vercel"
    - "netlify"
    - "cloudflare"
    - "tailwind css"
    - "tailwindcss"
    - "react"
    - "reactjs"
    - "vue.js"
    - "vuejs"
    - "angular"
    - "nvidia"
    - "github"
    - "gitlab"
    - "openai"
    - "anthropic"
    - "hugging face"
    - "huggingface"
    - "pytorch"
    - "tensorflow"
    - "def con"
    - "defcon"
    - "supabase"
    - "linear"
    - "remix"
    - "digitalocean"
    - "heroku"

  tech_frameworks:
    - "platform"
    - "framework"
    - "ecosystem"
    - "sdk"
    - "api"
    - "developer tools"
    - "open source"
    - "library"
    - "runtime"
    - "database"
    - "devtools"

  conference_events:
    - "conference"
    - "summit"
    - "keynote"
    - "event"
    - "festival"
    - "convention"
    - "meetup"

  # Chinese official account patterns (newly added Chinese official-account detection)
  chinese_official:
    - "官号"
    - "官方"
    - "公司"
    - "企业"
    - "组织"
    - "机构"
    - "平台"
    - "官方账号"
    - "官方推特"
    - "客服"

# Keywords in bio/description (case-insensitive, substring match)
bio_keywords:
  organizational:
    - "official account"
    - "official twitter"
    - "official page"
    - "managed by"
    - "run by our team"
    - "corporate account"
    - "company news"
    - "press releases"
    - "media inquiries"
    - "for support"
    - "customer service"

  media_publishers:
    - "news outlet"
    - "news organization"
    - "media company"
    - "news network"
    - "publishing"
    - "journalist at"
    - "reporter for"
    - "editor at"

  aggregators_bots:
    - "automated"
    - "bot account"
    - "news aggregator"
    - "auto-tweet"
    - "rss feed"

# Domain patterns in profile URL or bio links (regex)
domain_patterns:
  - pattern: ".*\\.(gov|edu|org)$"
    reason: "institutional_domain"
    exceptions: ["github.org", "huggingface.co"]

  - pattern: ".*(shop|store|deals|buy|cart|checkout).*"
    reason: "ecommerce_domain"

  - pattern: ".*(news|press|media|journal|times|post|tribune).*"
    reason: "media_domain"

# Verification status rules
verification_rules:
  # X Blue (verified=blue) is personal; keep unless other heuristics match
  # Organization verification (verified=org) is always flagged
  flag_org_verification: true

  # Legacy verified (verified=legacy) before 2023: mixed (keep unless other heuristics match)
  # Gold verified (not in current schema): future-proof placeholder

# Follower/following ratio heuristics (optional, low weight)
ratio_heuristics:
  # If followers/following > 100 AND bio matches corporate, likely brand
  high_ratio_threshold: 100
  # If following = 0, likely automated/official (but not sufficient alone)
  zero_following_flag: false

# Known organization handles (AUTO-FILTER - these are always removed)
# Format: handle (without @) - any match = immediate removal
org_handles_blacklist:
  - "github"
  - "nvidia"
  - "awscloud"
  - "azure"
  - "reactjs"
  - "code"
  - "huggingface"
  - "googlecloud"
  - "docker"
  - "figma"
  - "mongodb"
  - "vuejs"
  - "nextjs"
  - "golang"
  - "netlify"
  - "cloudflare"
  - "supabase"
  - "linear"
  - "remix"
  - "gitlab"
  - "openai"
  - "anthropic"
  - "pytorch"
  - "tensorflow"
  - "vercel"
  - "digitalocean"
  - "heroku"
  - "stripe"
  - "tailwindcss"
  - "defcon"

# Exceptions (handles that match heuristics but are known individuals)
# Format: handle (without @)
exceptions:
  - "example_personal_account" # Placeholder
  # Crypto/Web3 individual influencers (high follower accounts incorrectly flagged as org)
  - "100trillionUSD"
  - "CredibleCrypto"
  - "CryptoGodJohn"
  - "TheCryptoDog"
  - "DocumentingBTC"
  - "PlanB"
  - "Loomdart"
  - "WhalePanda"
  - "Winklevoss"
  # Gaming/Creator individual accounts
  - "Ninja"
  - "DrDisrespect"
  - "MarkRuffalo"
  - "JackMa"
  - "JeffBezos"
  # Add known false positives here as they're discovered

# Scoring adjustments (for fine-tuning, not hard filters)
# These adjust the is_org / is_official confidence score
confidence_weights:
  name_keyword_match: 0.6
  bio_keyword_match: 0.4
  domain_match: 0.8
  org_verification: 1.0

# Threshold for flagging (sum of weights)
flag_threshold: 0.7 # ≥0.7 → is_org or is_official = true
```
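A minimal sketch of how `confidence_weights` and `flag_threshold` above might combine, applied to the two records from `data/prefetched.sample.jsonl`; the rule subsets are inlined rather than parsed from the YAML, and the scoring function is illustrative, not the repo's actual `influx-harvest` implementation:

```python
import re

# Inlined subsets of the rules above (illustrative). Weights and threshold
# copy confidence_weights / flag_threshold from this file.
NAME_KEYWORDS = {"official", "inc", "team", "store"}
BIO_KEYWORDS = {"official account", "customer service"}
WEIGHTS = {"name_keyword_match": 0.6, "bio_keyword_match": 0.4, "org_verification": 1.0}
FLAG_THRESHOLD = 0.7

def org_score(user):
    """Sum the weights of matching heuristics (OR semantics, weighted)."""
    score = 0.0
    # Word-boundary name match, as the YAML comments recommend
    name_tokens = set(re.findall(r"[a-z]+", user.get("name", "").lower()))
    if name_tokens & NAME_KEYWORDS:
        score += WEIGHTS["name_keyword_match"]
    bio = user.get("description", "").lower()
    if any(k in bio for k in BIO_KEYWORDS):  # substring match for bios
        score += WEIGHTS["bio_keyword_match"]
    if user.get("verified_type") == "org":   # org verification always flagged
        score += WEIGHTS["org_verification"]
    return score

# The two records from data/prefetched.sample.jsonl
pmarca = {"name": "Marc Andreessen", "description": "Co-founder a16z", "verified_type": "blue"}
acme = {"name": "Acme Inc.", "description": "Official account", "verified_type": "org"}
print(org_score(pmarca) >= FLAG_THRESHOLD, org_score(acme) >= FLAG_THRESHOLD)  # False True
```

Note how `acme` trips three heuristics at once ("inc" in the name, "official account" in the bio, org verification), so it clears the 0.7 threshold by a wide margin, while `pmarca` matches none.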
