diff --git a/.claude/skills/research-tracker-domain/SKILL.md b/.claude/skills/research-tracker-domain/SKILL.md
new file mode 100644
index 000000000..1287519a2
--- /dev/null
+++ b/.claude/skills/research-tracker-domain/SKILL.md
@@ -0,0 +1,70 @@
+---
+name: research-tracker-domain
+description: Research a domain for a TrackerDB contributor — checks the database for existing coverage, investigates the operator, and produces a sourced research report plus concrete proposed file changes (snippets only). Does not write files in db/. Use when a developer wants help gathering facts about a domain before deciding whether or how to add it.
+argument-hint: ""
+allowed-tools: Bash, Read, WebSearch, WebFetch, Grep, Glob
+---
+
+# Research a tracker domain
+
+Help a TrackerDB contributor decide whether and how to add a domain. Read [`docs/contributing.md`](../../../docs/contributing.md) once for field rules and slug conventions — this skill does not repeat them.
+
+The user supplies one argument: a domain (or URL — strip to the hostname).
+
+## What to gather
+
+The point is to surface enough sourced fact for the contributor to make the call. The right shape of the answer depends on what you find — a domain already covered by a parent filter ends the investigation in one line; a brand-new operator with three competing legal entities deserves a longer write-up. Use judgement.
+
+### Existing coverage in `db/`
+
+- Exact match: `grep -rln "^$" db/patterns/`.
+- Parent or sibling: `grep -rn "" db/patterns/`. A `||^` filter in some pattern's filter block already covers every subdomain — that's full coverage. A sibling subdomain in the same apex usually means the right move is extending that file rather than starting a new one. Path-scoped or unrelated hits don't conflict but are useful evidence of operator reuse.
+
+### Operator
+
+Who actually runs this hostname? A few signal sources, ordered roughly by reliability:
+
+- **DNS.** `dig CNAME +short`, then `dig A +short`. A CNAME into another vendor's product infrastructure (`*.adobedtm.com`, `*.tealiumiq.com`, `*.salesforce.com`, `*.segment.io`) is strong evidence that vendor operates the endpoint. Generic CDN CNAMEs (CloudFront, Akamai, Fastly, Cloudflare, Google Cloud) only tell you who hosts the bytes, not who runs the service.
+- **The brand site.** `WebFetch https:///` for the company name, what the product does, and links to the privacy policy and imprint. Tracker hostnames (`static.example.io`, `metrics.example.io`) often don't render or redirect to the brand domain — fetch the brand domain directly.
+- **Privacy policy and imprint.** Legal entity, jurisdiction, privacy contact. The imprint of a `.de` site is often the most reliable single source for legal domicile.
+- **WHOIS.** `whois `. Usually redacted on `.com` / `.net` / `.org`, but country-code TLDs (`.de`, `.fr`, `.kr`, `.jp`, `.uk`, ...) often expose the legal entity directly.
+- **Business intel.** Crunchbase, CB Insights, LinkedIn About pages, Wikipedia. Useful for parent / acquirer relationships and HQ confirmation. Often paywalled — the search-result snippet alone is sometimes enough; don't retry blocked fetches.
+
+Cross-reference at least two sources before stating a legal entity or country. Acquired companies often look independent on the surface but roll up under a parent — Google, Meta, Adobe, Salesforce, Oracle. The parent is what should appear in `country` and `description`, not the acquired brand's old jurisdiction.
+
+### Existing organization in `db/`
+
+If the operator might already be in the database, the new pattern should reuse the existing slug rather than creating a duplicate org:
+
+- `ls db/organizations/.eno`
+- `grep -il "^name: $" db/organizations/*.eno`
+- Check the parent. Acquired companies roll up: most Google acquisitions live under `google`, Meta under `facebook`, Adobe under `adobe`.
+- Read the `organization:` field of any related pattern you found earlier.
+
+A matching brand name alone is not enough — open the candidate `.eno` and confirm the legal entity is the same.
+
+### Category
+
+The category is the contributor's call, but a sourced suggestion is useful. Read the company's own description of the product (their site, not a third-party summary), then map it to the keys in `db/categories/`. If two categories could reasonably apply, say so and give the trade-off — the contributor will pick.
+
+## What to produce
+
+A markdown report the contributor can read top-to-bottom and act on. There's no fixed template — let the findings drive the shape. As a baseline, it should make these things obvious:
+
+- whether the domain is already covered (and if so, where) — if yes, stop there
+- who the operator is, with sources for the legal entity and country
+- whether an existing organization in `db/` should be reused, or a new one is needed
+- a category suggestion with a one-line justification
+- proposed file content as fenced `eno` blocks — full final contents for new files, smallest excerpt with surrounding context for edits (never a contextless diff)
+- explicit list of what you couldn't source, so the contributor knows what's still open
+
+End with a line telling the contributor you won't write the files on your own — they review and approve.
+
+## Constraints
+
+- **Don't write or edit anything under `db/`.** Propose content; the contributor applies it (or asks you to in a follow-up).
+- **Don't guess values you couldn't source.** Blanks are auditable; invented values get baked into a database that ships to millions of users. Call out every blank.
+- **Don't decide classification for the contributor.** Suggest, justify, hand it over.
+- **Don't cite Ghostery's own properties** (`whotracks.me`, `ghostery.com/whotracksme/`) — they are generated from this database, so citing them is circular.
+- **Don't open a PR, commit, or push.**
+- **Don't run `node lint.js`.** That's the contributor's verification step after they apply changes.
diff --git a/.github/ISSUE_TEMPLATE/categorize_tracker.yml b/.github/ISSUE_TEMPLATE/categorize_tracker.yml
index 78bee9cc8..b1f46b4f0 100644
--- a/.github/ISSUE_TEMPLATE/categorize_tracker.yml
+++ b/.github/ISSUE_TEMPLATE/categorize_tracker.yml
@@ -60,7 +60,7 @@ body:
     id: company-description
     attributes:
       label: Describe the company
-      description: Tell us what you know about the company. Ensure descriptions are informative and impartial. Please fact check the information if using generative AI. Be prepared to cite sources, if necessary.
+      description: Tell us what you know about the company. Ensure descriptions are informative and impartial. Be prepared to cite sources for any non-obvious claim.
     validations:
       required: false
   - type: dropdown
diff --git a/AGENTS.md b/AGENTS.md
new file mode 100644
index 000000000..e11f4bbe4
--- /dev/null
+++ b/AGENTS.md
@@ -0,0 +1,17 @@
+# AGENTS.md
+
+Read [`docs/contributing.md`](docs/contributing.md) before editing `db/`.
+
+The `research-tracker-domain` skill in `.claude/skills/` gathers sourced facts about a domain into a research report. It does not write files in `db/` — the contributor reviews and applies.
+
+## Before opening a pull request
+
+```sh
+npm ci
+npm run lint
+npm run lint-patterns
+npm test
+npm run build
+```
+
+The SDK in `src/` and tooling in `scripts/` are TypeScript / JavaScript on Node 20+. `npm run lint-fix` applies ESLint + Prettier fixes.
diff --git a/README.md b/README.md
index 55480fe9a..181ee57ed 100644
--- a/README.md
+++ b/README.md
@@ -96,6 +96,8 @@ Output:
 
 We encourage contributions from developers of all levels. If you come across any errors, such as typos, inaccuracies, or outdated information, please don't hesitate to open an issue, or, even better, send us a pull request. Your feedback is highly valued!
 
+See [docs/contributing.md](docs/contributing.md) for a walkthrough of adding a new tracker — file formats, category guidance, sourcing expectations, and how to verify your changes.
+
 If you are new to the project or want an easy starting point, check out our [Good First Issues](https://github.com/ghostery/trackerdb/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22). These are beginner-friendly tasks to help you get acquainted with our project. If you are unsure about an issue or have questions, feel free to ask in the issue comments.
 
 ### Data Partners
diff --git a/docs/contributing.md b/docs/contributing.md
new file mode 100644
index 000000000..6bdba7cca
--- /dev/null
+++ b/docs/contributing.md
@@ -0,0 +1,45 @@
+# Contributing to TrackerDB
+
+This data ships in the Ghostery extension and powers [WhoTracks.me](https://whotracks.me/), so accuracy beats coverage. **Prefer leaving a field blank to filling it with a value you cannot back up.**
+
+The common contribution is a new tracker domain. Open a few files in `db/organizations/` and `db/patterns/` first — they are the canonical examples, and most rules are obvious from them. The non-obvious bits:
+
+- A pattern's `organization:` field points at one org; one org commonly has many patterns (`google` is referenced by 60+).
+- `country` is the **legal domicile** (ISO 3166-1 alpha-2), not where engineers sit. Take it from the privacy policy or imprint.
+- `description` is factual — no marketing language, no editorial judgement.
+- `category` keys come from `db/categories/`; don't invent new ones.
+- `tags` accepts only `site-statistics`, `cross-site`, `passive-statistics`, `anti-fraud`. Omit otherwise.
+- The `--- domains` block already blocks every listed hostname. Add `--- filters` only when a hostname is shared between tracking and non-tracking traffic, or when behaviour is third-party-only (`||example.com^$3p`).
+- Cite each non-trivial value in a `--- notes` block — except Ghostery's own properties (`whotracks.me`, `ghostery.com/whotracksme/`), which are generated from this database and would be circular.
+
+## Slugs
+
+Lowercase, snake_case (replace `.` and `-` with `_`). Drop the TLD when the brand reads better: `Dable` → `dable`. For multi-product organizations, name the pattern after the product (`google_analytics`, `google_tag_manager`). Check collisions: `ls db/organizations/.eno db/patterns/.eno`.
+
+## Before adding an organization
+
+If the operator is already in the database, point your new pattern at the existing slug rather than creating a duplicate. Acquired companies often roll up — Google → `google`, Meta → `facebook`, Adobe → `adobe`. Confirm by legal entity, not just brand name:
+
+```sh
+ls db/organizations/.eno
+grep -il "^name: $" db/organizations/*.eno
+```
+
+## Before adding a pattern
+
+Make sure the domain isn't already covered:
+
+```sh
+grep -rln "^$" db/patterns/  # exact match
+grep -rn "" db/patterns/  # parent / sibling coverage
+```
+
+A `||^` filter in an existing pattern already covers all subdomains.
+
+## Verifying
+
+```sh
+node lint.js
+```
+
+Auto-formats files in `db/` and flags common mistakes. Pass `--check` to verify without writing.
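
Reviewer sketch, not part of the patch: the mechanical half of the slug rule in the `docs/contributing.md` hunk (lowercase, then `.` and `-` become `_`) might look like the snippet below. The helper name `toSlug` is hypothetical and does not exist in TrackerDB. Whether to drop the TLD is left out on purpose, since the patch describes that as a judgement call.

```typescript
// Hypothetical helper, not part of TrackerDB: applies only the mechanical
// part of the slug rule (lowercase, then replace `.` and `-` with `_`).
// Dropping the TLD "when the brand reads better" is a human decision and
// is intentionally not automated here.
function toSlug(name: string): string {
  return name.trim().toLowerCase().replace(/[.-]/g, "_");
}

console.log(toSlug("Dable")); // prints "dable"
console.log(toSlug("cdn.example-metrics.io")); // prints "cdn_example_metrics_io"
```

A contributor would still check the result for collisions against `db/organizations/` and `db/patterns/` as the patch describes.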