From ebaade843134df25396e4d88ef61cb687b007a93 Mon Sep 17 00:00:00 2001 From: Krzysztof Modras Date: Thu, 7 May 2026 12:56:11 +0200 Subject: [PATCH 1/3] Add contributor docs and research skill for new tracker domains Documents the flow for adding a tracker domain (file formats, slug conventions, sourcing, duplicate avoidance) and ships a research skill that gathers sourced facts about a domain into a proposal the contributor reviews before any file is written. --- .../skills/research-tracker-domain/SKILL.md | 156 ++++++++++++++++++ .github/ISSUE_TEMPLATE/categorize_tracker.yml | 2 +- AGENTS.md | 27 +++ README.md | 2 + docs/categories.md | 8 + docs/contributing.md | 78 +++++++++ 6 files changed, 272 insertions(+), 1 deletion(-) create mode 100644 .claude/skills/research-tracker-domain/SKILL.md create mode 100644 AGENTS.md create mode 100644 docs/contributing.md diff --git a/.claude/skills/research-tracker-domain/SKILL.md b/.claude/skills/research-tracker-domain/SKILL.md new file mode 100644 index 000000000..eecef244c --- /dev/null +++ b/.claude/skills/research-tracker-domain/SKILL.md @@ -0,0 +1,156 @@ +--- +name: research-tracker-domain +description: Research a domain for a TrackerDB contributor — checks the database for existing coverage, investigates the operator, and produces a sourced research report plus concrete proposed file changes (snippets only). Does not write files in db/. Use when a developer wants help gathering facts about a domain before deciding whether or how to add it. +argument-hint: "" +allowed-tools: Bash, Read, WebSearch, WebFetch, Grep, Glob +--- + +# Research a tracker domain + +Helper for TrackerDB developers. Gathers facts about a domain and turns them into a concrete proposal: which file to create or edit and what content to put in it. 
**You do not write files in `db/`** and **you do not decide classification on the contributor's behalf** — the proposal is reviewed and applied by the contributor (or by you in a follow-up turn after they approve). The database structure, field rules, and slug conventions live in [`docs/contributing.md`](../../../docs/contributing.md) and the category list in [`docs/categories.md`](../../../docs/categories.md); read both once so your report uses the same vocabulary. + +The user supplies one argument: a domain (or URL — strip to the hostname). + +## Research checklist + +Run independent steps in parallel. + +### 1. Existing coverage + +- **Exact match:** `grep -rln "^$" db/patterns/`. If hit, the domain is already in the database — report which pattern and stop with no proposed change. +- **Parent coverage:** an existing pattern's `||^` filter covers all subdomains. `grep -rn "" db/patterns/`. Flag `||^` filters as full coverage; path-scoped or sibling-subdomain hits don't conflict but are useful evidence for organization reuse. +- **Sibling subdomain:** if another subdomain of the same apex is already in a pattern, the proposal is almost always to extend that file rather than create a new one. + +### 2. Operator + +- `dig CNAME +short` and `dig A +short`. CNAMEs into another vendor's infrastructure (`*.adobedtm.com`, `*.tealiumiq.com`, `*.salesforce.com`) suggest ownership; generic CDN CNAMEs (CloudFront, Akamai, Fastly, Cloudflare) only mean hosting. +- `whois `. Usually redacted on `.com`/`.net`/`.org`, but ccTLDs (`.de`, `.fr`, `.kr`, `.jp`, `.uk`, ...) often expose the legal entity directly. +- `WebFetch https:///` for company name, what the service does, and links to privacy policy and contact. Tracker hostnames (`static.example.io`) often don't render — fetch the brand domain instead. +- Privacy policy: `WebSearch "" privacy policy` then `WebFetch` for legal entity name, jurisdiction, contact email. 
+- Business intelligence for HQ and parent / acquirer: Crunchbase, CB Insights, LinkedIn About pages. Often paywalled; the search-result snippet alone is sometimes enough. Don't retry blocked fetches. +- Cross-reference with the company's About page and Wikipedia. + +**Don't cite Ghostery's own properties** (`whotracks.me`, `ghostery.com/whotracksme/`) — they are generated from this database, so citing them is circular. + +### 3. Existing organization + +If the operator might already be in the database: + +1. By slug: `ls db/organizations/.eno`. +2. By name: `grep -il "^name: $" db/organizations/*.eno`. +3. By parent (acquired companies often roll up): `grep -il "^name: " db/organizations/*.eno`. Acquired by Google → `google`; by Meta → `facebook`; by Adobe → `adobe`. +4. By sibling pattern: read a related pattern's `organization:` field. + +When a lookup matches, open the organization file and confirm the legal entity is the same — a matching brand name alone is not enough. + +### 4. Decide the change shape + +Pick exactly one of these scenarios; it dictates the Proposed changes section: + +- **A — already covered.** Exact match or `||^` parent filter. No file changes. Report and stop. +- **B — extend an existing pattern.** A sibling subdomain pattern exists and the new domain belongs to the same product. Propose adding the hostname to that file's `--- domains` block (in alphabetical order, matching the file's existing convention). +- **C — new pattern, existing organization.** The operator already has an organization file but no pattern fits. Propose a new `db/patterns/.eno` file pointing at the existing `organization:` slug. +- **D — new pattern + new organization.** Operator is not in the database. Propose both a new `db/organizations/.eno` and a new `db/patterns/.eno`. + +### 5. Slug + +Apply the rules in [`docs/contributing.md` § Slug naming](../../../docs/contributing.md#slug-naming) and check for collisions: `ls db/organizations/.eno db/patterns/.eno`. 
+ +## Output format + +Produce a single markdown report. Be concise — one or two lines per bullet. Leave a value blank when you have no source for it; do not guess. + +``` +# Research: + +## Existing coverage +- one of: + - not present in db/ + - already covered by `||^` in db/patterns/.eno (no new entry needed) — scenario A + - sibling subdomain in db/patterns/.eno — scenario B + - operator already in db/organizations/.eno but no matching pattern — scenario C + - operator not in db/ — scenario D + +## Operator +- Brand / product name: ... +- Legal entity: ... +- Country (legal domicile, ISO 3166-1 alpha-2): ... +- Parent or acquirer: ... (if any) +- Privacy policy URL: ... +- Privacy contact: ... + +## Suggested category +One category from docs/categories.md, with a one-line justification. Mention the runner-up if it's close. Final pick is the contributor's call. + +## Existing organization to consider reusing +- `` (db/organizations/.eno) — same legal entity, confirmed by +- or: none found; a new organization file is included in the proposal below. + +## Proposed changes + +For scenario A, write: "No changes — already covered." + +For scenarios B / C / D, list each file under its own subheading. Use ✏️ for an edit, ➕ for a new file. Show the **full proposed final contents** of the file in a fenced block (or, for an edit, the smallest contiguous excerpt that includes the change with a few lines of surrounding context). Do not show diffs without context. 
+ +### ✏️ db/patterns/.eno +```eno + +``` + +### ➕ db/patterns/.eno +```eno +name: +category: +website_url: +organization: + +--- domains + +--- domains + +--- notes +Sources: +- name, website_url: +- category: +- organization: reused existing slug, confirmed by +--- notes +``` + +### ➕ db/organizations/.eno (scenario D only) +```eno +name: +website_url: +privacy_policy_url: +privacy_contact: +country: +description: +``` + +After the proposal, add a one-line apply hint: + +> Apply: review above, then ask me to write these files. I will not write them on my own. + +## Sources +- : +- (one line per non-trivial claim) + +## Open questions +- (anything you could not resolve — blank fields in the proposal, ambiguous category, apex-vs-subdomain scope, etc.) +``` + +### Rules for the proposed snippets + +- **Leave fields blank rather than guess.** A blank `privacy_contact:` is fine; an invented one is not. Mirror every blank in Open questions. +- **Order entries inside `--- domains` alphabetically** unless the existing file uses a different convention; match the file you are editing. +- **Include a `--- notes` block** in any new pattern or organization file, citing the source per non-trivial field. This is required by `docs/contributing.md`. +- **Do not add `--- filters`** unless the research clearly justifies it (shared hostname, third-party-only behaviour). Domains-only is the default. +- **Do not invent categories or tags.** Use the lists in `docs/categories.md` (categories) and `docs/contributing.md` (tags). +- **Do not propose unrelated edits** to existing files — only the one line / block needed for this domain. + +## Don't + +- **Don't write files in `db/`.** Suggesting content in the report is fine; running `Write` or `Edit` on anything under `db/` is not. Wait for the contributor to approve. +- **Don't fill values you cannot source.** If you cannot find a source for a fact in this session, leave it blank in the proposal and list it under Open questions. 
+- **Don't decide classification for the contributor.** Suggest a category with a justification and call out a close runner-up; let the contributor make the call. +- **Don't open a PR, commit, or push.** +- **Don't run `node lint.js`.** That is the contributor's verification step after they apply changes. diff --git a/.github/ISSUE_TEMPLATE/categorize_tracker.yml b/.github/ISSUE_TEMPLATE/categorize_tracker.yml index 78bee9cc8..b1f46b4f0 100644 --- a/.github/ISSUE_TEMPLATE/categorize_tracker.yml +++ b/.github/ISSUE_TEMPLATE/categorize_tracker.yml @@ -60,7 +60,7 @@ body: id: company-description attributes: label: Describe the company - description: Tell us what you know about the company. Ensure descriptions are informative and impartial. Please fact check the information if using generative AI. Be prepared to cite sources, if necessary. + description: Tell us what you know about the company. Ensure descriptions are informative and impartial. Be prepared to cite sources for any non-obvious claim. validations: required: false - type: dropdown diff --git a/AGENTS.md b/AGENTS.md new file mode 100644 index 000000000..033a5d9f6 --- /dev/null +++ b/AGENTS.md @@ -0,0 +1,27 @@ +# AGENTS.md + +Orientation for anyone — human or agent — contributing to this repository. See [README.md](README.md) for what this project is and where the data is used. + +## Tooling + +The `research-tracker-domain` skill in `.claude/skills/` is an example of the tooling we welcome: it gathers sourced facts about a domain into a research report so the contributor can decide whether and how to add it. The contributor still reviews the findings and writes the files. + +## Editing `db/` + +Read [`docs/contributing.md`](docs/contributing.md) first — file formats, slug conventions, category guidance, sourcing, and duplicate avoidance. Auto-format with `node lint.js` after editing. 
+ +The rule worth restating up front: **leave a field blank rather than fill it with a value you cannot back up.** + +## Before opening a pull request + +```sh +npm ci +npm run lint # ESLint over src/, test/, scripts/ +npm run lint-patterns # node lint.js --check — verifies db/ formatting +npm test +npm run build +``` + +## Code + +The SDK in `src/` and tooling in `scripts/` are TypeScript / JavaScript on Node 20+. Follow the existing ESLint + Prettier configuration; `npm run lint-fix` applies fixes. diff --git a/README.md b/README.md index 55480fe9a..181ee57ed 100644 --- a/README.md +++ b/README.md @@ -96,6 +96,8 @@ Output: We encourage contributions from developers of all levels. If you come across any errors, such as typos, inaccuracies, or outdated information, please don't hesitate to open an issue, or, even better, send us a pull request. Your feedback is highly valued! +See [docs/contributing.md](docs/contributing.md) for a walkthrough of adding a new tracker — file formats, category guidance, sourcing expectations, and how to verify your changes. + If you are new to the project or want an easy starting point, check out our [Good First Issues](https://github.com/ghostery/trackerdb/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22). These are beginner-friendly tasks to help you get acquainted with our project. If you are unsure about an issue or have questions, feel free to ask in the issue comments. ### Data Partners diff --git a/docs/categories.md b/docs/categories.md index b31ca742a..638f7fc3d 100644 --- a/docs/categories.md +++ b/docs/categories.md @@ -1,5 +1,13 @@ # Category overview +When two categories could plausibly apply, pick the more specific. A few edge cases: + +- **AI personalization / recommendation**: `advertising` if monetized through sponsored placements, otherwise `customer_interaction`. +- **Identity / CDP**: `advertising` when pitched at advertisers, otherwise `utilities`. 
+- **A vendor that does both ads and analytics**: pick the dominant function *for this domain*. +- **Categorize by behaviour at the domain, not by parent company.** `play.google.com` is `hosting`, not `advertising`. +- **Don't pick `misc` to dodge a hard call.** Re-read the company's own description first. + ## Advertising Advertising services that utilize data collection, behavioral analysis, and user retargeting. diff --git a/docs/contributing.md b/docs/contributing.md new file mode 100644 index 000000000..af98d86d1 --- /dev/null +++ b/docs/contributing.md @@ -0,0 +1,78 @@ +# Contributing to TrackerDB + +This data ships in the Ghostery extension and powers [WhoTracks.me](https://whotracks.me/), so accuracy beats coverage. A missing entry leaves a request unclassified; a wrong entry mislabels it for millions of users. **Prefer leaving a field blank to filling it with a value you cannot back up.** + +The common contribution is adding a tracker domain. Skim a few existing files in `db/organizations/` and `db/patterns/` first — they are the canonical examples. + +## Files + +A pattern (`db/patterns/.eno`) points at exactly one organization (`db/organizations/.eno`) via its `organization:` field. **One organization commonly has many patterns** — `google` is referenced by 60+. Most contributions only add a pattern. + +The files use [eno](https://eno-lang.org/). `node lint.js` preserves only `key: value` lines and `--- section` blocks; anything else is stripped. + +A few rules are not obvious from the examples: + +- **`country` is the legal domicile**, not where engineers sit or servers are hosted. ISO 3166-1 alpha-2, taken from the privacy policy or imprint. +- **`description` is factual.** No marketing language, no editorial judgement (`invasive`, `shady`). +- **`category` must come from `db/categories/`** — read the company's *own* description, then pick from [`docs/categories.md`](categories.md), which also covers edge cases. Don't invent new ones. 
+- **`tags` are limited to** `site-statistics`, `cross-site`, `passive-statistics`, `anti-fraud`. Omit the line if none apply. +- **The `--- domains` block already blocks all listed hostnames.** Add `--- filters` only when the hostname is shared between tracking and non-tracking traffic, or when behaviour is third-party-only (`||example.com^$3p`). Domains-only is the safe default. + +## Sourcing + +Cite where each non-trivial value came from in a `--- notes` block at the bottom of the file: + +``` +--- notes +Sources: +- name, description: https://www.bytedance.com/en/ + https://en.wikipedia.org/wiki/ByteDance +- country (KY): privacy policy footer "ByteDance Ltd., Cayman Islands" +--- notes +``` + +If a field is blank because no source was found, say so — it makes the absence auditable. + +**Don't cite Ghostery's own properties** (`whotracks.me`, `ghostery.com/whotracksme/`). They are generated from this database, so citing them is circular and reinforces existing mistakes. + +## Slug naming + +- Lowercase, snake_case. Replace `.` and `-` with `_`. +- Drop the TLD when the brand reads better without it: `Dable` → `dable`; `eclick.vn` → `eclick`. +- For single-product companies, pattern slug equals organization slug. +- When one organization has many patterns, name the pattern after the product (`google_analytics`, `google_tag_manager`). +- Check for collisions: `ls db/organizations/.eno db/patterns/.eno`. + +## Researching the operator + +Before writing, find out who actually operates the domain: + +- The company's **privacy policy** and imprint — legal entity, jurisdiction, contact email. +- **DNS.** `dig CNAME +short`. A CNAME into another vendor's infrastructure (`*.adobedtm.com`, `*.salesforce.com`, `*.tealiumiq.com`) is strong evidence of ownership; generic CDN CNAMEs (CloudFront, Akamai, Fastly, Cloudflare) only mean hosting. 
+- **WHOIS.** Usually redacted on `.com`/`.net`/`.org`, but ccTLDs (`.de`, `.fr`, `.kr`, `.jp`, `.uk`) often expose the legal entity directly. +- Crunchbase, CB Insights, LinkedIn About — for HQ and parent / acquirer. Often paywalled; the search snippet alone is sometimes enough. + +## Reusing an existing organization + +If the operator's company is already in the database, **don't create a second organization** — point your new pattern at the existing slug. Look for a match by: + +1. Slug: `ls db/organizations/.eno`. +2. Name: `grep -il "^name: $" db/organizations/*.eno`. +3. Parent. Acquired companies often roll up — Google → `google`; Meta → `facebook`; Adobe → `adobe`. +4. Sibling pattern: read its `organization:` field. + +Confirm the legal entity is the same — a matching brand name alone is not enough. + +## Avoiding duplicates + +- Exact match: `grep -rln "^$" db/patterns/`. +- Parent coverage: `||example.com^` matches all subdomains. `grep -rn "" db/patterns/` and inspect — `||^` filters cover your domain; path-scoped or sibling-subdomain ones do not. + +If the domain is already covered, extend the existing pattern (or open an issue) rather than creating a new one. + +## Verifying + +```sh +node lint.js +``` + +Auto-formats files in `db/` and flags common mistakes — expensive regex filters, missing section terminators, untokenisable filters. Run with `--check` to confirm a clean tree. 
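The parent-coverage rule in "Avoiding duplicates" can be sketched in a few lines of JavaScript. This is a simplified illustration, not the matching logic the adblocker engine actually uses: it assumes that only a bare `||host^` filter counts as full coverage, and that anything path-scoped or carrying `$` options (such as `$3p`) does not. The helper names `coveringHost` and `isCovered` are hypothetical and do not exist in the repository.

```javascript
// Hypothetical helpers, not part of TrackerDB: a simplified reading of the
// "parent coverage" rule. Only bare `||host^` filters are recognised.
function coveringHost(filter) {
  // Accepts `||example.com^` exactly: no path, no `$` options after the `^`.
  const match = /^\|\|([a-z0-9.-]+)\^$/i.exec(filter.trim());
  return match ? match[1].toLowerCase() : null;
}

function isCovered(hostname, filter) {
  const apex = coveringHost(filter);
  // Path-scoped or option-scoped filters are treated as "not full coverage".
  if (apex === null) return false;
  const host = hostname.toLowerCase();
  // Covered when the hostname is the filter host itself or one of its subdomains.
  return host === apex || host.endsWith('.' + apex);
}

console.log(isCovered('cdn.example.com', '||example.com^'));        // true
console.log(isCovered('example.com', '||example.com^'));            // true
console.log(isCovered('cdn.example.com', '||example.com/pixel^'));  // false (path-scoped)
console.log(isCovered('cdn.example.com', '||static.example.com^')); // false (sibling subdomain)
console.log(isCovered('notexample.com', '||example.com^'));         // false (different apex)
```

The leading dot in the suffix check is what keeps `notexample.com` from being mistaken for a subdomain of `example.com`. A `false` result only means the filter does not fully cover the hostname; as the section notes, a sibling-subdomain hit is still worth inspecting by hand.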
From 610d718d3e6504e11c7a6c977a747d56540ca7bd Mon Sep 17 00:00:00 2001 From: Krzysztof Modras Date: Thu, 7 May 2026 13:03:52 +0200 Subject: [PATCH 2/3] Trim contributor docs to non-obvious rules --- .../skills/research-tracker-domain/SKILL.md | 147 +++++------------- AGENTS.md | 20 +-- docs/categories.md | 8 - docs/contributing.md | 81 +++------- 4 files changed, 71 insertions(+), 185 deletions(-) diff --git a/.claude/skills/research-tracker-domain/SKILL.md b/.claude/skills/research-tracker-domain/SKILL.md index eecef244c..c39fc5a55 100644 --- a/.claude/skills/research-tracker-domain/SKILL.md +++ b/.claude/skills/research-tracker-domain/SKILL.md @@ -7,150 +7,87 @@ allowed-tools: Bash, Read, WebSearch, WebFetch, Grep, Glob # Research a tracker domain -Helper for TrackerDB developers. Gathers facts about a domain and turns them into a concrete proposal: which file to create or edit and what content to put in it. **You do not write files in `db/`** and **you do not decide classification on the contributor's behalf** — the proposal is reviewed and applied by the contributor (or by you in a follow-up turn after they approve). The database structure, field rules, and slug conventions live in [`docs/contributing.md`](../../../docs/contributing.md) and the category list in [`docs/categories.md`](../../../docs/categories.md); read both once so your report uses the same vocabulary. +Helper for TrackerDB contributors. Gathers sourced facts about a domain and proposes concrete file content. **Read [`docs/contributing.md`](../../../docs/contributing.md) once** for field rules and slug conventions; this skill does not repeat them. + +**You do not write files in `db/`.** Propose content in the report; the contributor applies it (or asks you to in a follow-up). The user supplies one argument: a domain (or URL — strip to the hostname). -## Research checklist +## Investigate Run independent steps in parallel. -### 1. 
Existing coverage - -- **Exact match:** `grep -rln "^$" db/patterns/`. If hit, the domain is already in the database — report which pattern and stop with no proposed change. -- **Parent coverage:** an existing pattern's `||^` filter covers all subdomains. `grep -rn "" db/patterns/`. Flag `||^` filters as full coverage; path-scoped or sibling-subdomain hits don't conflict but are useful evidence for organization reuse. -- **Sibling subdomain:** if another subdomain of the same apex is already in a pattern, the proposal is almost always to extend that file rather than create a new one. - -### 2. Operator - -- `dig CNAME +short` and `dig A +short`. CNAMEs into another vendor's infrastructure (`*.adobedtm.com`, `*.tealiumiq.com`, `*.salesforce.com`) suggest ownership; generic CDN CNAMEs (CloudFront, Akamai, Fastly, Cloudflare) only mean hosting. -- `whois `. Usually redacted on `.com`/`.net`/`.org`, but ccTLDs (`.de`, `.fr`, `.kr`, `.jp`, `.uk`, ...) often expose the legal entity directly. -- `WebFetch https:///` for company name, what the service does, and links to privacy policy and contact. Tracker hostnames (`static.example.io`) often don't render — fetch the brand domain instead. -- Privacy policy: `WebSearch "" privacy policy` then `WebFetch` for legal entity name, jurisdiction, contact email. -- Business intelligence for HQ and parent / acquirer: Crunchbase, CB Insights, LinkedIn About pages. Often paywalled; the search-result snippet alone is sometimes enough. Don't retry blocked fetches. -- Cross-reference with the company's About page and Wikipedia. - -**Don't cite Ghostery's own properties** (`whotracks.me`, `ghostery.com/whotracksme/`) — they are generated from this database, so citing them is circular. - -### 3. Existing organization - -If the operator might already be in the database: - -1. By slug: `ls db/organizations/.eno`. -2. By name: `grep -il "^name: $" db/organizations/*.eno`. -3. 
By parent (acquired companies often roll up): `grep -il "^name: " db/organizations/*.eno`. Acquired by Google → `google`; by Meta → `facebook`; by Adobe → `adobe`. -4. By sibling pattern: read a related pattern's `organization:` field. - -When a lookup matches, open the organization file and confirm the legal entity is the same — a matching brand name alone is not enough. +**Existing coverage.** Stop early if already covered. +- Exact: `grep -rln "^$" db/patterns/`. +- Parent / sibling: `grep -rn "" db/patterns/`. A `||^` filter covers all subdomains; a sibling-subdomain hit means you should probably extend that file rather than create a new one. -### 4. Decide the change shape +**Operator.** +- `dig CNAME +short` — CNAMEs into vendor infrastructure (`*.adobedtm.com`, `*.tealiumiq.com`, `*.salesforce.com`) suggest ownership; generic CDNs (CloudFront, Akamai, Fastly, Cloudflare) only mean hosting. +- `whois ` — usually redacted on gTLDs but ccTLDs (`.de`, `.fr`, `.kr`, `.jp`, `.uk`) often expose the legal entity. +- `WebFetch` the brand site (`https:///`) and the privacy policy for legal entity, jurisdiction, contact email. Tracker hostnames often don't render — fetch the brand domain. +- Crunchbase / LinkedIn for parent or acquirer. Often paywalled; the search snippet is sometimes enough. Don't retry blocked fetches. -Pick exactly one of these scenarios; it dictates the Proposed changes section: +**Existing organization.** If the operator might already be in the database: +- `ls db/organizations/.eno` +- `grep -il "^name: $" db/organizations/*.eno` +- Check the parent (acquired companies roll up: Google → `google`, Meta → `facebook`, Adobe → `adobe`). +- Read a sibling pattern's `organization:` field. -- **A — already covered.** Exact match or `||^` parent filter. No file changes. Report and stop. -- **B — extend an existing pattern.** A sibling subdomain pattern exists and the new domain belongs to the same product. 
Propose adding the hostname to that file's `--- domains` block (in alphabetical order, matching the file's existing convention). -- **C — new pattern, existing organization.** The operator already has an organization file but no pattern fits. Propose a new `db/patterns/.eno` file pointing at the existing `organization:` slug. -- **D — new pattern + new organization.** Operator is not in the database. Propose both a new `db/organizations/.eno` and a new `db/patterns/.eno`. +Confirm by legal entity, not brand name alone. -### 5. Slug +## Decide the change shape -Apply the rules in [`docs/contributing.md` § Slug naming](../../../docs/contributing.md#slug-naming) and check for collisions: `ls db/organizations/.eno db/patterns/.eno`. +Pick exactly one: +- **A — already covered.** No file changes. Report and stop. +- **B — extend an existing pattern.** Add the hostname to its `--- domains` block. +- **C — new pattern, existing organization.** New `db/patterns/.eno` pointing at the existing org slug. +- **D — new pattern + new organization.** -## Output format +## Report -Produce a single markdown report. Be concise — one or two lines per bullet. Leave a value blank when you have no source for it; do not guess. +Single markdown report. One or two lines per bullet. Leave values blank when unsourced — never guess. Mirror every blank in **Open questions**. ``` # Research: ## Existing coverage -- one of: - - not present in db/ - - already covered by `||^` in db/patterns/.eno (no new entry needed) — scenario A - - sibling subdomain in db/patterns/.eno — scenario B - - operator already in db/organizations/.eno but no matching pattern — scenario C - - operator not in db/ — scenario D + ## Operator -- Brand / product name: ... +- Brand / product: ... - Legal entity: ... - Country (legal domicile, ISO 3166-1 alpha-2): ... -- Parent or acquirer: ... (if any) +- Parent or acquirer: ... - Privacy policy URL: ... - Privacy contact: ... 
## Suggested category -One category from docs/categories.md, with a one-line justification. Mention the runner-up if it's close. Final pick is the contributor's call. +One key from db/categories/, with a one-line justification. Mention a close runner-up. Final pick is the contributor's call. -## Existing organization to consider reusing +## Existing organization to reuse - `` (db/organizations/.eno) — same legal entity, confirmed by -- or: none found; a new organization file is included in the proposal below. +- or: none; a new organization is included below. ## Proposed changes -For scenario A, write: "No changes — already covered." +For scenario A: "No changes — already covered." -For scenarios B / C / D, list each file under its own subheading. Use ✏️ for an edit, ➕ for a new file. Show the **full proposed final contents** of the file in a fenced block (or, for an edit, the smallest contiguous excerpt that includes the change with a few lines of surrounding context). Do not show diffs without context. - -### ✏️ db/patterns/.eno -```eno - -``` - -### ➕ db/patterns/.eno -```eno -name: -category: -website_url: -organization: - ---- domains - ---- domains - ---- notes -Sources: -- name, website_url: -- category: -- organization: reused existing slug, confirmed by ---- notes -``` - -### ➕ db/organizations/.eno (scenario D only) -```eno -name: -website_url: -privacy_policy_url: -privacy_contact: -country: -description: -``` - -After the proposal, add a one-line apply hint: +For B/C/D, list each file under its own subheading (✏️ edit, ➕ new). Show full final contents in a fenced ```eno block (or, for an edit, the smallest excerpt with surrounding context — never a contextless diff). Pattern files need a `--- notes` block citing each non-trivial field. Add `--- filters` only when the research justifies it. > Apply: review above, then ask me to write these files. I will not write them on my own. 
## Sources -- : -- (one line per non-trivial claim) +- : ## Open questions -- (anything you could not resolve — blank fields in the proposal, ambiguous category, apex-vs-subdomain scope, etc.) +- ``` -### Rules for the proposed snippets - -- **Leave fields blank rather than guess.** A blank `privacy_contact:` is fine; an invented one is not. Mirror every blank in Open questions. -- **Order entries inside `--- domains` alphabetically** unless the existing file uses a different convention; match the file you are editing. -- **Include a `--- notes` block** in any new pattern or organization file, citing the source per non-trivial field. This is required by `docs/contributing.md`. -- **Do not add `--- filters`** unless the research clearly justifies it (shared hostname, third-party-only behaviour). Domains-only is the default. -- **Do not invent categories or tags.** Use the lists in `docs/categories.md` (categories) and `docs/contributing.md` (tags). -- **Do not propose unrelated edits** to existing files — only the one line / block needed for this domain. - ## Don't -- **Don't write files in `db/`.** Suggesting content in the report is fine; running `Write` or `Edit` on anything under `db/` is not. Wait for the contributor to approve. -- **Don't fill values you cannot source.** If you cannot find a source for a fact in this session, leave it blank in the proposal and list it under Open questions. -- **Don't decide classification for the contributor.** Suggest a category with a justification and call out a close runner-up; let the contributor make the call. -- **Don't open a PR, commit, or push.** -- **Don't run `node lint.js`.** That is the contributor's verification step after they apply changes. +- Write or edit anything under `db/`. +- Fill values you cannot source. Blanks are auditable; invented values are not. +- Decide classification on the contributor's behalf. 
+- Cite Ghostery's own properties (`whotracks.me`, `ghostery.com/whotracksme/`) — generated from this database, so it's circular. +- Open a PR, commit, or push. +- Run `node lint.js` — that's the contributor's verification step after they apply changes. diff --git a/AGENTS.md b/AGENTS.md index 033a5d9f6..e11f4bbe4 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -1,27 +1,17 @@ # AGENTS.md -Orientation for anyone — human or agent — contributing to this repository. See [README.md](README.md) for what this project is and where the data is used. +Read [`docs/contributing.md`](docs/contributing.md) before editing `db/`. -## Tooling - -The `research-tracker-domain` skill in `.claude/skills/` is an example of the tooling we welcome: it gathers sourced facts about a domain into a research report so the contributor can decide whether and how to add it. The contributor still reviews the findings and writes the files. - -## Editing `db/` - -Read [`docs/contributing.md`](docs/contributing.md) first — file formats, slug conventions, category guidance, sourcing, and duplicate avoidance. Auto-format with `node lint.js` after editing. - -The rule worth restating up front: **leave a field blank rather than fill it with a value you cannot back up.** +The `research-tracker-domain` skill in `.claude/skills/` gathers sourced facts about a domain into a research report. It does not write files in `db/` — the contributor reviews and applies. ## Before opening a pull request ```sh npm ci -npm run lint # ESLint over src/, test/, scripts/ -npm run lint-patterns # node lint.js --check — verifies db/ formatting +npm run lint +npm run lint-patterns npm test npm run build ``` -## Code - -The SDK in `src/` and tooling in `scripts/` are TypeScript / JavaScript on Node 20+. Follow the existing ESLint + Prettier configuration; `npm run lint-fix` applies fixes. +The SDK in `src/` and tooling in `scripts/` are TypeScript / JavaScript on Node 20+. `npm run lint-fix` applies ESLint + Prettier fixes. 
diff --git a/docs/categories.md b/docs/categories.md index 638f7fc3d..b31ca742a 100644 --- a/docs/categories.md +++ b/docs/categories.md @@ -1,13 +1,5 @@ # Category overview -When two categories could plausibly apply, pick the more specific. A few edge cases: - -- **AI personalization / recommendation**: `advertising` if monetized through sponsored placements, otherwise `customer_interaction`. -- **Identity / CDP**: `advertising` when pitched at advertisers, otherwise `utilities`. -- **A vendor that does both ads and analytics**: pick the dominant function *for this domain*. -- **Categorize by behaviour at the domain, not by parent company.** `play.google.com` is `hosting`, not `advertising`. -- **Don't pick `misc` to dodge a hard call.** Re-read the company's own description first. - ## Advertising Advertising services that utilize data collection, behavioral analysis, and user retargeting. diff --git a/docs/contributing.md b/docs/contributing.md index af98d86d1..6bdba7cca 100644 --- a/docs/contributing.md +++ b/docs/contributing.md @@ -1,73 +1,40 @@ # Contributing to TrackerDB -This data ships in the Ghostery extension and powers [WhoTracks.me](https://whotracks.me/), so accuracy beats coverage. A missing entry leaves a request unclassified; a wrong entry mislabels it for millions of users. **Prefer leaving a field blank to filling it with a value you cannot back up.** +This data ships in the Ghostery extension and powers [WhoTracks.me](https://whotracks.me/), so accuracy beats coverage. **Prefer leaving a field blank to filling it with a value you cannot back up.** -The common contribution is adding a tracker domain. Skim a few existing files in `db/organizations/` and `db/patterns/` first — they are the canonical examples. +The common contribution is a new tracker domain. Open a few files in `db/organizations/` and `db/patterns/` first — they are the canonical examples, and most rules are obvious from them. 
The non-obvious bits: -## Files +- A pattern's `organization:` field points at one org; one org commonly has many patterns (`google` is referenced by 60+). +- `country` is the **legal domicile** (ISO 3166-1 alpha-2), not where engineers sit. Take it from the privacy policy or imprint. +- `description` is factual — no marketing language, no editorial judgement. +- `category` keys come from `db/categories/`; don't invent new ones. +- `tags` accepts only `site-statistics`, `cross-site`, `passive-statistics`, `anti-fraud`. Omit otherwise. +- The `--- domains` block already blocks every listed hostname. Add `--- filters` only when a hostname is shared between tracking and non-tracking traffic, or when behaviour is third-party-only (`||example.com^$3p`). +- Cite each non-trivial value in a `--- notes` block — except Ghostery's own properties (`whotracks.me`, `ghostery.com/whotracksme/`), which are generated from this database and would be circular. -A pattern (`db/patterns/.eno`) points at exactly one organization (`db/organizations/.eno`) via its `organization:` field. **One organization commonly has many patterns** — `google` is referenced by 60+. Most contributions only add a pattern. +## Slugs -The files use [eno](https://eno-lang.org/). `node lint.js` preserves only `key: value` lines and `--- section` blocks; anything else is stripped. +Lowercase, snake_case (replace `.` and `-` with `_`). Drop the TLD when the brand reads better: `Dable` → `dable`. For multi-product organizations, name the pattern after the product (`google_analytics`, `google_tag_manager`). Check collisions: `ls db/organizations/.eno db/patterns/.eno`. -A few rules are not obvious from the examples: +## Before adding an organization -- **`country` is the legal domicile**, not where engineers sit or servers are hosted. ISO 3166-1 alpha-2, taken from the privacy policy or imprint. -- **`description` is factual.** No marketing language, no editorial judgement (`invasive`, `shady`). 
-- **`category` must come from `db/categories/`** — read the company's *own* description, then pick from [`docs/categories.md`](categories.md), which also covers edge cases. Don't invent new ones. -- **`tags` are limited to** `site-statistics`, `cross-site`, `passive-statistics`, `anti-fraud`. Omit the line if none apply. -- **The `--- domains` block already blocks all listed hostnames.** Add `--- filters` only when the hostname is shared between tracking and non-tracking traffic, or when behaviour is third-party-only (`||example.com^$3p`). Domains-only is the safe default. +If the operator is already in the database, point your new pattern at the existing slug rather than creating a duplicate. Acquired companies often roll up — Google → `google`, Meta → `facebook`, Adobe → `adobe`. Confirm by legal entity, not just brand name: -## Sourcing - -Cite where each non-trivial value came from in a `--- notes` block at the bottom of the file: - -``` ---- notes -Sources: -- name, description: https://www.bytedance.com/en/ + https://en.wikipedia.org/wiki/ByteDance -- country (KY): privacy policy footer "ByteDance Ltd., Cayman Islands" ---- notes +```sh +ls db/organizations/.eno +grep -il "^name: $" db/organizations/*.eno ``` -If a field is blank because no source was found, say so — it makes the absence auditable. - -**Don't cite Ghostery's own properties** (`whotracks.me`, `ghostery.com/whotracksme/`). They are generated from this database, so citing them is circular and reinforces existing mistakes. - -## Slug naming - -- Lowercase, snake_case. Replace `.` and `-` with `_`. -- Drop the TLD when the brand reads better without it: `Dable` → `dable`; `eclick.vn` → `eclick`. -- For single-product companies, pattern slug equals organization slug. -- When one organization has many patterns, name the pattern after the product (`google_analytics`, `google_tag_manager`). -- Check for collisions: `ls db/organizations/.eno db/patterns/.eno`. 
- -## Researching the operator +## Before adding a pattern -Before writing, find out who actually operates the domain: +Make sure the domain isn't already covered: -- The company's **privacy policy** and imprint — legal entity, jurisdiction, contact email. -- **DNS.** `dig CNAME +short`. A CNAME into another vendor's infrastructure (`*.adobedtm.com`, `*.salesforce.com`, `*.tealiumiq.com`) is strong evidence of ownership; generic CDN CNAMEs (CloudFront, Akamai, Fastly, Cloudflare) only mean hosting. -- **WHOIS.** Usually redacted on `.com`/`.net`/`.org`, but ccTLDs (`.de`, `.fr`, `.kr`, `.jp`, `.uk`) often expose the legal entity directly. -- Crunchbase, CB Insights, LinkedIn About — for HQ and parent / acquirer. Often paywalled; the search snippet alone is sometimes enough. - -## Reusing an existing organization - -If the operator's company is already in the database, **don't create a second organization** — point your new pattern at the existing slug. Look for a match by: - -1. Slug: `ls db/organizations/.eno`. -2. Name: `grep -il "^name: $" db/organizations/*.eno`. -3. Parent. Acquired companies often roll up — Google → `google`; Meta → `facebook`; Adobe → `adobe`. -4. Sibling pattern: read its `organization:` field. - -Confirm the legal entity is the same — a matching brand name alone is not enough. - -## Avoiding duplicates - -- Exact match: `grep -rln "^$" db/patterns/`. -- Parent coverage: `||example.com^` matches all subdomains. `grep -rn "" db/patterns/` and inspect — `||^` filters cover your domain; path-scoped or sibling-subdomain ones do not. +```sh +grep -rln "^$" db/patterns/ # exact match +grep -rn "" db/patterns/ # parent / sibling coverage +``` -If the domain is already covered, extend the existing pattern (or open an issue) rather than creating a new one. +A `||^` filter in an existing pattern already covers all subdomains. 
## Verifying @@ -75,4 +42,4 @@ If the domain is already covered, extend the existing pattern (or open an issue) node lint.js ``` -Auto-formats files in `db/` and flags common mistakes — expensive regex filters, missing section terminators, untokenisable filters. Run with `--check` to confirm a clean tree. +Auto-formats files in `db/` and flags common mistakes. Pass `--check` to verify without writing. From ce2878d21303cfd5c06b87f7f0095a9c0a8030f4 Mon Sep 17 00:00:00 2001 From: Krzysztof Modras Date: Thu, 7 May 2026 13:09:19 +0200 Subject: [PATCH 3/3] Make research-tracker-domain output shape adaptive --- .../skills/research-tracker-domain/SKILL.md | 101 +++++++----------- 1 file changed, 39 insertions(+), 62 deletions(-) diff --git a/.claude/skills/research-tracker-domain/SKILL.md b/.claude/skills/research-tracker-domain/SKILL.md index c39fc5a55..1287519a2 100644 --- a/.claude/skills/research-tracker-domain/SKILL.md +++ b/.claude/skills/research-tracker-domain/SKILL.md @@ -7,87 +7,64 @@ allowed-tools: Bash, Read, WebSearch, WebFetch, Grep, Glob # Research a tracker domain -Helper for TrackerDB contributors. Gathers sourced facts about a domain and proposes concrete file content. **Read [`docs/contributing.md`](../../../docs/contributing.md) once** for field rules and slug conventions; this skill does not repeat them. - -**You do not write files in `db/`.** Propose content in the report; the contributor applies it (or asks you to in a follow-up). +Help a TrackerDB contributor decide whether and how to add a domain. Read [`docs/contributing.md`](../../../docs/contributing.md) once for field rules and slug conventions — this skill does not repeat them. The user supplies one argument: a domain (or URL — strip to the hostname). -## Investigate - -Run independent steps in parallel. - -**Existing coverage.** Stop early if already covered. -- Exact: `grep -rln "^$" db/patterns/`. -- Parent / sibling: `grep -rn "" db/patterns/`. 
A `||^` filter covers all subdomains; a sibling-subdomain hit means you should probably extend that file rather than create a new one. - -**Operator.** -- `dig CNAME +short` — CNAMEs into vendor infrastructure (`*.adobedtm.com`, `*.tealiumiq.com`, `*.salesforce.com`) suggest ownership; generic CDNs (CloudFront, Akamai, Fastly, Cloudflare) only mean hosting. -- `whois ` — usually redacted on gTLDs but ccTLDs (`.de`, `.fr`, `.kr`, `.jp`, `.uk`) often expose the legal entity. -- `WebFetch` the brand site (`https:///`) and the privacy policy for legal entity, jurisdiction, contact email. Tracker hostnames often don't render — fetch the brand domain. -- Crunchbase / LinkedIn for parent or acquirer. Often paywalled; the search snippet is sometimes enough. Don't retry blocked fetches. +## What to gather -**Existing organization.** If the operator might already be in the database: -- `ls db/organizations/.eno` -- `grep -il "^name: $" db/organizations/*.eno` -- Check the parent (acquired companies roll up: Google → `google`, Meta → `facebook`, Adobe → `adobe`). -- Read a sibling pattern's `organization:` field. +The point is to surface enough sourced fact for the contributor to make the call. The right shape of the answer depends on what you find — a domain already covered by a parent filter ends the investigation in one line; a brand-new operator with three competing legal entities deserves a longer write-up. Use judgement. -Confirm by legal entity, not brand name alone. +### Existing coverage in `db/` -## Decide the change shape +- Exact match: `grep -rln "^$" db/patterns/`. +- Parent or sibling: `grep -rn "" db/patterns/`. A `||^` filter in some pattern's filter block already covers every subdomain — that's full coverage. A sibling subdomain in the same apex usually means the right move is extending that file rather than starting a new one. Path-scoped or unrelated hits don't conflict but are useful evidence of operator reuse. 
-Pick exactly one: -- **A — already covered.** No file changes. Report and stop. -- **B — extend an existing pattern.** Add the hostname to its `--- domains` block. -- **C — new pattern, existing organization.** New `db/patterns/.eno` pointing at the existing org slug. -- **D — new pattern + new organization.** +### Operator -## Report +Who actually runs this hostname? A few signal sources, ordered roughly by reliability: -Single markdown report. One or two lines per bullet. Leave values blank when unsourced — never guess. Mirror every blank in **Open questions**. +- **DNS.** `dig CNAME +short`, then `dig A +short`. A CNAME into another vendor's product infrastructure (`*.adobedtm.com`, `*.tealiumiq.com`, `*.salesforce.com`, `*.segment.io`) is strong evidence that vendor operates the endpoint. Generic CDN CNAMEs (CloudFront, Akamai, Fastly, Cloudflare, Google Cloud) only tell you who hosts the bytes, not who runs the service. +- **The brand site.** `WebFetch https:///` for the company name, what the product does, and links to the privacy policy and imprint. Tracker hostnames (`static.example.io`, `metrics.example.io`) often don't render or redirect to the brand domain — fetch the brand domain directly. +- **Privacy policy and imprint.** Legal entity, jurisdiction, privacy contact. The imprint of a `.de` site is often the most reliable single source for legal domicile. +- **WHOIS.** `whois `. Usually redacted on `.com` / `.net` / `.org`, but country-code TLDs (`.de`, `.fr`, `.kr`, `.jp`, `.uk`, ...) often expose the legal entity directly. +- **Business intel.** Crunchbase, CB Insights, LinkedIn About pages, Wikipedia. Useful for parent / acquirer relationships and HQ confirmation. Often paywalled — the search-result snippet alone is sometimes enough; don't retry blocked fetches. -``` -# Research: +Cross-reference at least two sources before stating a legal entity or country. 
Acquired companies often look independent on the surface but roll up under a parent — Google, Meta, Adobe, Salesforce, Oracle. The parent is what should appear in `country` and `description`, not the acquired brand's old jurisdiction. -## Existing coverage - +### Existing organization in `db/` -## Operator -- Brand / product: ... -- Legal entity: ... -- Country (legal domicile, ISO 3166-1 alpha-2): ... -- Parent or acquirer: ... -- Privacy policy URL: ... -- Privacy contact: ... +If the operator might already be in the database, the new pattern should reuse the existing slug rather than creating a duplicate org: -## Suggested category -One key from db/categories/, with a one-line justification. Mention a close runner-up. Final pick is the contributor's call. +- `ls db/organizations/.eno` +- `grep -il "^name: $" db/organizations/*.eno` +- Check the parent. Acquired companies roll up: most Google acquisitions live under `google`, Meta under `facebook`, Adobe under `adobe`. +- Read the `organization:` field of any related pattern you found earlier. -## Existing organization to reuse -- `` (db/organizations/.eno) — same legal entity, confirmed by -- or: none; a new organization is included below. +A matching brand name alone is not enough — open the candidate `.eno` and confirm the legal entity is the same. -## Proposed changes +### Category -For scenario A: "No changes — already covered." +The category is the contributor's call, but a sourced suggestion is useful. Read the company's own description of the product (their site, not a third-party summary), then map it to the keys in `db/categories/`. If two categories could reasonably apply, say so and give the trade-off — the contributor will pick. -For B/C/D, list each file under its own subheading (✏️ edit, ➕ new). Show full final contents in a fenced ```eno block (or, for an edit, the smallest excerpt with surrounding context — never a contextless diff). 
Pattern files need a `--- notes` block citing each non-trivial field. Add `--- filters` only when the research justifies it. +## What to produce -> Apply: review above, then ask me to write these files. I will not write them on my own. +A markdown report the contributor can read top-to-bottom and act on. There's no fixed template — let the findings drive the shape. As a baseline, it should make these things obvious: -## Sources -- : +- whether the domain is already covered (and if so, where) — if yes, stop there +- who the operator is, with sources for the legal entity and country +- whether an existing organization in `db/` should be reused, or a new one is needed +- a category suggestion with a one-line justification +- proposed file content as fenced `eno` blocks — full final contents for new files, smallest excerpt with surrounding context for edits (never a contextless diff) +- explicit list of what you couldn't source, so the contributor knows what's still open -## Open questions -- -``` +End with a line telling the contributor you won't write the files on your own — they review and approve. -## Don't +## Constraints -- Write or edit anything under `db/`. -- Fill values you cannot source. Blanks are auditable; invented values are not. -- Decide classification on the contributor's behalf. -- Cite Ghostery's own properties (`whotracks.me`, `ghostery.com/whotracksme/`) — generated from this database, so it's circular. -- Open a PR, commit, or push. -- Run `node lint.js` — that's the contributor's verification step after they apply changes. +- **Don't write or edit anything under `db/`.** Propose content; the contributor applies it (or asks you to in a follow-up). +- **Don't guess values you couldn't source.** Blanks are auditable; invented values get baked into a database that ships to millions of users. Call out every blank. +- **Don't decide classification for the contributor.** Suggest, justify, hand it over. 
+- **Don't cite Ghostery's own properties** (`whotracks.me`, `ghostery.com/whotracksme/`) — they are generated from this database, so citing them is circular. +- **Don't open a PR, commit, or push.** +- **Don't run `node lint.js`.** That's the contributor's verification step after they apply changes.
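The existing-coverage checks the skill describes (exact match first, then a parent `||apex^` filter, then sibling evidence) could be sketched as a small shell helper. This is only an illustration under stated assumptions: the function name `check_coverage`, the `PATTERNS_DIR` variable, and the naive "last two labels" apex derivation are hypothetical and not part of the repository. Multi-label public suffixes such as `.co.uk` would need a proper public-suffix lookup.

```sh
#!/bin/sh
# Hypothetical helper, not part of TrackerDB: reports how a hostname is
# already covered by files under PATTERNS_DIR (default: db/patterns).
check_coverage() {
  domain=$1
  dir=${PATTERNS_DIR:-db/patterns}

  # Assumption: the apex is the last two labels (wrong for e.g. .co.uk).
  apex=$(printf '%s\n' "$domain" |
    awk -F. '{ if (NF >= 2) print $(NF-1) "." $NF; else print $0 }')

  # Exact match: the hostname is a whole line (-x), compared as a fixed string (-F).
  exact=$(grep -rlxF -- "$domain" "$dir" 2>/dev/null)
  if [ -n "$exact" ]; then
    printf 'exact: %s\n' "$exact"
    return 0
  fi

  # Parent coverage: a ||apex^ filter matches the apex and every subdomain.
  parent=$(grep -rlF -- "||${apex}^" "$dir" 2>/dev/null)
  if [ -n "$parent" ]; then
    printf 'parent: %s\n' "$parent"
    return 0
  fi

  # Sibling evidence: any other mention of the apex, worth a manual look.
  sibling=$(grep -rlF -- "$apex" "$dir" 2>/dev/null)
  if [ -n "$sibling" ]; then
    printf 'sibling: %s\n' "$sibling"
    return 0
  fi

  # Nothing found: report it and signal with a non-zero status.
  printf 'none\n'
  return 1
}
```

Run from the repository root, `check_coverage metrics.example.io` would print the coverage kind followed by the matching file or files, or `none` when nothing matches. A `sibling` result is only a hint: path-scoped filters and unrelated mentions land there too, so the matching file still has to be read by hand.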