tg-bot-detector

Heuristic + ML bot detection and purge toolkit for Telegram broadcast channel subscribers.

Analyzes your channel's subscriber list using profile-based scoring heuristics, temporal clustering, and a LightGBM classifier to identify likely bot accounts. Built on Telethon (MTProto) for full subscriber enumeration and the Telegram Bot API for high-speed purging.

Proven on @leviathan_news: flagged ~60,000 of 86,000 subscribers as fake. The channel's subscriber count has since dropped from 86K to 43K through a combination of natural bot exodus and active purging.

Quick Start

# Install
pip install -e .

# Install with ML dependencies
pip install -e ".[ml]"

# Set credentials (get from https://my.telegram.org)
export TG_PURGE_API_ID="your_api_id"
export TG_PURGE_API_HASH="your_api_hash"

# First run will prompt for phone number + verification code
tg-purge analyze --channel @your_channel --strategy minimal

Installation

Requires Python 3.9+.

git clone https://github.com/leviathan-news/tg-bot-detector.git
cd tg-bot-detector
pip install -e .

# For TOML config support on Python <3.11:
pip install -e ".[toml]"

# For ML pipeline (scikit-learn, LightGBM, XGBoost):
pip install -e ".[ml]"

Configuration

Set credentials via environment variables:

export TG_PURGE_API_ID="your_api_id"
export TG_PURGE_API_HASH="your_api_hash"

Or use a TOML config file (see examples/config.example.toml):

cp examples/config.example.toml config.toml
# Edit config.toml with your values
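
As a rough illustration of how the two configuration sources could be combined, the sketch below reads the environment variables first and falls back to a TOML file. The `[telegram]` table, its `api_id` / `api_hash` keys, the precedence order, and the `load_credentials` helper are all assumptions made for this example; see examples/config.example.toml for the real layout.

```python
import os
import sys

# tomllib ships with Python 3.11+; older versions need the "tomli" backport
# (assumed here to be what the [toml] extra provides).
if sys.version_info >= (3, 11):
    import tomllib
else:
    try:
        import tomli as tomllib
    except ImportError:
        tomllib = None


def load_credentials(config_path="config.toml", env=None):
    """Return (api_id, api_hash), preferring environment variables."""
    env = os.environ if env is None else env
    api_id = env.get("TG_PURGE_API_ID")
    api_hash = env.get("TG_PURGE_API_HASH")

    have_both = bool(api_id and api_hash)
    if not have_both and tomllib is not None and os.path.exists(config_path):
        with open(config_path, "rb") as fh:
            data = tomllib.load(fh)
        # Hypothetical layout: a [telegram] table with api_id / api_hash keys.
        section = data.get("telegram", {})
        api_id = api_id or section.get("api_id")
        api_hash = api_hash or section.get("api_hash")

    if not (api_id and api_hash):
        raise RuntimeError("Telegram API credentials are not configured")
    return int(api_id), str(api_hash)
```

Passing `env` explicitly keeps the helper easy to test without touching the real process environment.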

CLI Reference

analyze

Multi-round subscriber analysis. Three phases: self-identified bots, recently active users, and search-based sampling.

tg-purge analyze --channel @foo
tg-purge analyze --channel @foo --strategy minimal  # faster, fewer queries

candidates

Generate scored candidate lists for offline review. Supports heuristic, ML, or hybrid scoring modes.

# Heuristic only (default)
tg-purge candidates --channel @foo --threshold 4 --output candidates.csv

# Hybrid: heuristic + ML (requires trained model in models/)
tg-purge candidates --channel @foo --threshold 0 --scoring hybrid --output hybrid.csv

# With safelist protection
tg-purge candidates --channel @foo --threshold 4 --safelist registry/safelist.csv

Ctrl+C during enumeration saves partial results automatically. Also handles SIGHUP/SIGTERM for tmux compatibility.
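
The save-on-interrupt behavior can be pictured with a small signal-handling sketch. This is not the tool's actual code: `StopFlag` and `enumerate_with_partial_save` are hypothetical names, and the approach shown (set a flag in the handler, check it between batches, save once on exit) is just one common way to get this effect.

```python
import signal


class StopFlag:
    """Records that a shutdown signal arrived so the main loop can exit cleanly."""

    def __init__(self):
        self.requested = False

    def handle(self, signum, frame):
        # Keep the handler tiny: only set a flag, do the real work in the loop.
        self.requested = True

    def install(self):
        # SIGHUP does not exist on Windows, so look each signal up by name.
        for name in ("SIGINT", "SIGTERM", "SIGHUP"):
            sig = getattr(signal, name, None)
            if sig is not None:
                signal.signal(sig, self.handle)


def enumerate_with_partial_save(batches, save, stop):
    """Collect batches until exhausted or a stop is requested, then save once."""
    collected = []
    for batch in batches:
        collected.extend(batch)
        if stop.requested:
            break
    save(collected)
    return collected
```

Handling SIGHUP alongside SIGINT and SIGTERM matters for tmux or SSH sessions, where closing the terminal sends SIGHUP rather than Ctrl+C.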

join-dates

Join date clustering and spike detection. Identifies bulk-subscription events.

tg-purge join-dates --channel @foo
tg-purge join-dates --channel @foo --top-days 50

spike

Deep-dive into a specific time window. Compares spike subscribers against a control group.

tg-purge spike --channel @foo --start "2025-11-09T06:00Z" --end "2025-11-09T07:00Z"

validate

Score known-good users to measure the false positive rate at each threshold.

tg-purge validate --channel @foo --known-users my_users.csv
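
To make the metric concrete, here is a minimal sketch of a per-threshold false positive rate. It assumes a user is flagged when their score is greater than or equal to the threshold; the function name and that comparison rule are illustrative assumptions, not taken from the tool.

```python
def false_positive_rates(known_good_scores, thresholds):
    """Fraction of known-good users that would be flagged at each threshold."""
    total = len(known_good_scores)
    if total == 0:
        return {t: 0.0 for t in thresholds}
    return {
        # Assumption: a score at or above the threshold counts as flagged.
        t: sum(1 for s in known_good_scores if s >= t) / total
        for t in thresholds
    }
```

Because every input user is known to be human, any flagged user is by definition a false positive, so the rate falls as the threshold rises.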

label

Bootstrap weak labels from heuristic scores or view label stats.

tg-purge label --bootstrap --channel @foo
tg-purge label --stats
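
Weak labeling from heuristic scores usually means labeling only the confident extremes and leaving the middle unlabeled. The sketch below shows that idea; the cut-offs `bot_at=6` and `human_at=0` and the function name are hypothetical, chosen only for illustration.

```python
def bootstrap_label(score, bot_at=6, human_at=0):
    """Return "bot", "human", or None (unlabeled) for a heuristic score."""
    if score >= bot_at:
        return "bot"
    if score <= human_at:
        return "human"
    # Ambiguous scores stay unlabeled rather than adding noise to training data.
    return None
```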

ml

ML model management: train, inspect, or export features.

tg-purge ml train --channel @foo
tg-purge ml info
tg-purge ml export-features --channel @foo

registry

Local bot registry management. Generate, add to, or query a local registry of flagged user IDs.

tg-purge registry generate --channel @foo --threshold 4
tg-purge registry add --ids-file flagged.txt
tg-purge registry check --user-id 123456789

Detection Layers

Layer 1: Heuristic Scoring

Each subscriber is scored based on Telegram profile attributes. Higher = more likely bot.

Signal	Weight	Description
Deleted account	+5	Account has been deleted
Scam/fake flag	+5	Telegram-applied flag
Bot API account	+3	Self-identified bot
Restricted	+2	Telegram-restricted account
No activity status	+2	Never been seen online
Offline >1 year	+2	Abandoned account
Airdrop farmer	+2	2+ airdrop project tokens in display name
Spike join	+2	Joined during a detected bulk-subscription window
No photo/username	+1 each	Sparse profile
Photo DC 1	+1	Profile photo on bot-heavy datacenter
Short/digit/mixed name	+1 each	Generated-looking name
Premium subscriber	-2	Legitimacy signal (but gameable — bot farms buy Premium)
Emoji status	-1	Premium feature usage
Photo DC 5	-1	Human-dominant datacenter
Video avatar	-1	0% prevalence on bots
Custom color	-1	0.1% on bots vs 14.1% on humans
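
The additive scoring can be sketched as a weighted sum over boolean profile flags. The weights below are copied from the table above; the flag names and the `heuristic_score` helper are hypothetical, and the name-based, airdrop, and photo-datacenter signals are left out for brevity.

```python
# Weights from the table above; the dictionary keys are hypothetical flag names.
WEIGHTS = {
    "deleted": 5,
    "scam_or_fake": 5,
    "is_bot": 3,
    "restricted": 2,
    "no_status": 2,
    "offline_over_1y": 2,
    "spike_join": 2,
    "no_photo": 1,
    "no_username": 1,
    "premium": -2,
    "emoji_status": -1,
    "video_avatar": -1,
    "custom_color": -1,
}


def heuristic_score(profile):
    """Sum the weights of every flag that is truthy in the profile dict."""
    return sum(weight for flag, weight in WEIGHTS.items() if profile.get(flag))
```

For example, a profile with no activity status, no photo, and no username, but with Premium, scores 2 + 1 + 1 - 2 = 2.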

Layer 2: Temporal Clustering

Sliding-window algorithm detects bulk-subscription spikes from join-date history. Users who joined during a detected spike get +2 to their heuristic score.
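
A minimal sliding-window sketch of this idea follows. The window length, the `min_joins` cut-off, and both function names are assumptions for illustration; the tool's actual parameters and merging rules may differ.

```python
from datetime import datetime, timedelta


def detect_spikes(join_times, window=timedelta(hours=1), min_joins=100):
    """Return merged (start, end) spans holding at least min_joins joins per window."""
    times = sorted(join_times)
    spikes = []
    left = 0
    for right, t in enumerate(times):
        # Advance the left edge until the span fits inside one window.
        while t - times[left] > window:
            left += 1
        if right - left + 1 >= min_joins:
            start, end = times[left], t
            if spikes and start <= spikes[-1][1]:
                # Overlaps the previous spike, so extend it instead of adding a new one.
                spikes[-1] = (spikes[-1][0], end)
            else:
                spikes.append((start, end))
    return spikes


def is_spike_join(joined_at: datetime, spikes) -> bool:
    """True if a join time falls inside any detected spike span."""
    return any(start <= joined_at <= end for start, end in spikes)
```

A user for whom `is_spike_join` returns true would receive the +2 spike-join weight described above.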

Layer 3: Machine Learning

LightGBM classifier trained on 35K+ labeled accounts. Extracts 51 features per user including profile attributes, name analysis, photo metadata, and join timing. The ML model catches bots that evade individual heuristic signals.

Training uses ground-truth labels from:

Departed bot accounts (captured via exit monitoring during mass exodus events)
Human-reviewed profiles (1,132 manual reviews)
High-confidence heuristic labels (bootstrap)

Status features are neutralized during training to prevent data leakage from exit monitoring (bots show "online" when the farm activates them to leave).
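
One plausible way to neutralize a group of features is to overwrite them with a constant before training, so the model cannot learn from them. The sketch below does that on plain dictionaries; the feature names are hypothetical (this README does not list the 51 features) and constant-fill is an assumption about what "neutralized" means here.

```python
# Hypothetical status-related feature names, used only for illustration.
STATUS_FEATURES = ("status_online", "status_recently", "days_since_last_seen")


def neutralize_status(rows, status_features=STATUS_FEATURES, fill=0):
    """Return copies of the feature dicts with status columns set to a constant."""
    return [
        {key: (fill if key in status_features else value) for key, value in row.items()}
        for row in rows
    ]
```

With every row carrying the same value in those columns, a tree-based model has nothing to split on there, so the "online while leaving" artifact cannot leak into the classifier.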

Purge Scripts

Bot API (recommended)

python scripts/purge_bot_api.py \
    --input output/candidates.csv \
    --chat-id -100XXXXXXXXXX \
    --bot-token "YOUR_BOT_TOKEN"

Requires a bot that is an admin in the channel with ban permissions. Much faster than the user API: roughly 20 bans/s, versus about 300 bans before the user API hits a flood wait.
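
For orientation, a single ban through the Bot API is a call to the `banChatMember` method with a `chat_id` and a `user_id`. The sketch below only builds the HTTP request with the standard library and does not send it; `build_ban_request` is a hypothetical helper and is not necessarily how scripts/purge_bot_api.py is implemented.

```python
import json
import urllib.request

API_ROOT = "https://api.telegram.org"


def build_ban_request(bot_token, chat_id, user_id):
    """Build (but do not send) a banChatMember request for the Telegram Bot API."""
    url = f"{API_ROOT}/bot{bot_token}/banChatMember"
    payload = json.dumps({"chat_id": chat_id, "user_id": user_id}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending the request would be a matter of passing it to `urllib.request.urlopen`, with a short sleep between calls to stay near the ban rate quoted above.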

User MTProto API (fallback)

python scripts/purge_batch.py \
    --input output/candidates.csv \
    --channel @your_channel \
    --chunk-size 200 --chunk-cooldown 300

Chunked banning with configurable cooldowns to avoid FloodWait. Both scripts support:

Resume: JSONL progress files for interrupted runs
Graceful shutdown: Ctrl+C / SIGHUP / SIGTERM save progress
Safelist: Exclude protected user IDs
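
The combination of chunking, cooldowns, resume, and safelisting can be sketched as below. The JSONL record shape (`user_id`, `ok`), the function names, and the injected `ban` and `sleep` callables are assumptions for illustration; the real scripts may store different fields.

```python
import json
import os
import time


def load_done(progress_path):
    """Read user IDs already processed in an earlier (possibly interrupted) run."""
    done = set()
    if os.path.exists(progress_path):
        with open(progress_path, encoding="utf-8") as fh:
            for line in fh:
                line = line.strip()
                if line:
                    done.add(json.loads(line)["user_id"])
    return done


def purge_in_chunks(user_ids, ban, progress_path, safelist=frozenset(),
                    chunk_size=200, chunk_cooldown=300, sleep=time.sleep):
    """Ban pending users chunk by chunk, logging each attempt as one JSONL line."""
    done = load_done(progress_path)
    pending = [u for u in user_ids if u not in done and u not in safelist]
    with open(progress_path, "a", encoding="utf-8") as fh:
        for start in range(0, len(pending), chunk_size):
            if start:
                sleep(chunk_cooldown)  # pause between chunks, not before the first
            for user_id in pending[start:start + chunk_size]:
                ok = ban(user_id)
                fh.write(json.dumps({"user_id": user_id, "ok": bool(ok)}) + "\n")
                fh.flush()  # keep the log current so an interrupt loses nothing
    return len(pending)
```

Appending one line per attempt means a rerun after an interruption simply skips every ID already present in the progress file.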

Limitations

200-result cap: Each Telegram query returns at most 200 results
Sampling bias: Search-based enumeration misses accounts with no name or unusual names
Privacy false positives: Users who hide online status score +2 automatically
Score-0 bots: Sophisticated bots with full profiles are invisible to profile-based detection
Premium is gameable: Bot farms buy Telegram Premium to earn the -2 legitimacy bonus and evade detection
Channel-specific: Developed for broadcast channels, not groups
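
The first two limitations interact: with a per-query cap, coverage depends on how many distinct search queries are issued. The sketch below unions results across queries and refines any query that hits the cap. It assumes search behaves roughly like a prefix match and takes a hypothetical `fetch(query)` callable returning `(user_id, user)` pairs; it does not call Telethon and is not the tool's actual enumeration code.

```python
import string

RESULT_CAP = 200  # per-query cap described in the list above


def enumerate_by_search(fetch, alphabet=string.ascii_lowercase, max_depth=2):
    """Union results of prefix searches, refining any query that hits the cap."""
    seen = {}
    queue = list(alphabet)
    while queue:
        query = queue.pop()
        results = fetch(query)
        for user_id, user in results:
            seen.setdefault(user_id, user)
        # A capped result set was probably truncated, so try longer prefixes.
        if len(results) >= RESULT_CAP and len(query) < max_depth:
            queue.extend(query + ch for ch in alphabet)
    return seen
```

Accounts whose names contain no character from the chosen alphabet are never matched by any query, which is the sampling bias noted above.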

Testing

python -m pytest tests/ -v

453 unit tests covering scoring logic, CLI parsing, config loading, enumeration, clustering, ML features, labeling, statistics, cross-channel detection, and collectors. No Telegram API calls in tests.

License

MIT
