
# 🐦 Apify Twitter/X Scraper

A powerful, production-ready Twitter/X data extraction actor built for the Apify platform. Extract tweets, profiles, engagement metrics, media, and more, with intelligent scrolling, date filtering, and login support.



## 🚀 Overview

Apify Twitter/X Scraper is a specialized web scraping actor that runs on the Apify cloud platform. It uses Playwright for reliable browser automation, navigating Twitter/X pages, auto-scrolling through infinite timelines, and extracting structured data from tweets, profiles, and search results.

**Why this scraper?** Unlike API-based solutions constrained by Twitter's expensive API tiers, this scraper works directly with the Twitter web interface, giving you access to public data without API key restrictions.


## ✨ Key Features

| Feature | Description |
| --- | --- |
| 🐦 **Twitter-Specific Extraction** | Purpose-built selectors for tweets, profiles, and search results |
| 📊 **Rich Data Output** | Tweets, usernames, display names, timestamps, engagement metrics (likes, retweets, replies, views), and media URLs |
| 📅 **Date Range Filtering** | Extract tweets from specific time periods with smart `startDate` / `endDate` filtering |
| 🔄 **Infinite Scroll Support** | Automatically scrolls through Twitter's infinite timeline with configurable scroll count and delay |
| 🖼️ **Media Tab Navigation** | Auto-navigates to profile Media tabs for chronologically sorted media tweets |
| 🔐 **Login Support** | Optional Twitter login for accessing restricted content and avoiding rate limits |
| 🌐 **Proxy Support** | Built-in Apify Proxy integration to distribute requests and bypass rate limiting |
| 🧠 **Memory Optimized** | Intelligent DOM cleanup during long scroll sessions to prevent memory leaks |
| ⚡ **High Performance** | Configurable concurrency, aggressive scroll strategies, and smart early-exit logic |
| 🛡️ **Error Resilient** | Graceful error handling with fallback selectors and retry mechanisms |

## 📦 Data You Can Extract

### Tweet Data

```json
{
  "url": "https://twitter.com/elonmusk",
  "scrapedAt": "2026-02-16T10:30:00.000Z",
  "text": "The future of AI is incredibly exciting...",
  "username": "elonmusk",
  "displayName": "Elon Musk",
  "timestamp": "2026-02-15T18:45:00.000Z",
  "tweetUrl": "https://twitter.com/elonmusk/status/1234567890",
  "replies": 4200,
  "retweets": 15000,
  "likes": 120000,
  "views": 5000000,
  "media": [
    {
      "type": "image",
      "url": "https://pbs.twimg.com/media/example.jpg"
    }
  ],
  "pageType": "profile"
}
```

### Profile Data

```json
{
  "url": "https://twitter.com/elonmusk",
  "scrapedAt": "2026-02-16T10:30:00.000Z",
  "type": "profile",
  "username": "elonmusk",
  "bio": "Mars & Cars, Chips & Dips",
  "stats": "Following 800 · Followers 175M"
}
```

## 🎯 Supported URL Types

| URL Type | Example | Description |
| --- | --- | --- |
| Profile | `https://twitter.com/username` | Scrapes tweets from a user's timeline |
| Media Tab | `https://twitter.com/username/media` | Auto-navigated; scrapes media tweets sorted by date |
| Single Tweet | `https://twitter.com/user/status/123` | Extracts data from a specific tweet |
| Search | `https://twitter.com/search?q=AI` | Scrapes search result tweets |
| X.com | `https://x.com/username` | Full support for the new X.com domain |

βš™οΈ Input Configuration

Required Parameters

Parameter Type Description
startUrls array Array of Twitter/X URLs to scrape

Optional Parameters

Parameter Type Default Description
maxRequestsPerCrawl integer 50 Maximum pages to crawl
maxConcurrency integer 50 Concurrent browser pages
maxTweets integer 0 (unlimited) Maximum tweets to extract
scrollCount integer 10 Number of scroll iterations
scrollDelay integer 2000 Delay between scrolls (ms)
scrapeMediaTab boolean true Auto-navigate to Media tab
startDate string β€” Filter: tweets from this date (YYYY-MM-DD)
endDate string β€” Filter: tweets until this date (YYYY-MM-DD)
twitterUsername string β€” Twitter login username/email
twitterPassword string β€” Twitter login password
waitForTimeout integer 2000 Wait time for content to load (ms)
proxyConfiguration object β€” Apify Proxy settings

## 📋 Usage Examples

### Basic Profile Scrape

```json
{
  "startUrls": [{ "url": "https://twitter.com/elonmusk" }],
  "maxRequestsPerCrawl": 20,
  "maxConcurrency": 1,
  "waitForTimeout": 3000
}
```

### Multi-Profile Scrape

```json
{
  "startUrls": [
    { "url": "https://twitter.com/elonmusk" },
    { "url": "https://twitter.com/OpenAI" },
    { "url": "https://x.com/Google" }
  ],
  "maxRequestsPerCrawl": 100,
  "scrollCount": 20
}
```

### Date-Filtered Extraction

```json
{
  "startUrls": [{ "url": "https://twitter.com/elonmusk" }],
  "startDate": "2025-01-01",
  "endDate": "2025-12-31",
  "maxRequestsPerCrawl": 200
}
```

### Search Results

```json
{
  "startUrls": [
    { "url": "https://twitter.com/search?q=artificial%20intelligence" }
  ],
  "maxRequestsPerCrawl": 50,
  "maxConcurrency": 1
}
```

πŸ—οΈ Architecture

flowchart TB
    subgraph APIFY["☁️ Apify Platform"]
        subgraph ACTOR["🐦 Twitter Scraper Actor"]
            IP["πŸ“₯ Input Parser"] --> PC["🎭 Playwright Crawler"] --> ED["πŸ“€ Extract Data"]
            IP --> DF["πŸ“… Date Filter Config"]
            PC --> AS["πŸ”„ Auto Scroll Engine"]
            ED --> DA["πŸ—‚οΈ Date Filter Apply"]
            AS --> DS["πŸ’Ύ Apify Dataset\n(JSON Output)"]
            DA --> DS
        end
        PP["🌐 Proxy Pool"]
        DC["🐳 Docker Container"]
        CB["🌍 Chromium Browser"]
    end

    PP & DC & CB -.-> ACTOR
Loading

## 🔧 How It Works

1. **Input Parsing** – reads Twitter URLs and configuration from the Apify input
2. **Optional Login** – authenticates with Twitter if credentials are provided (handles the multi-step login flow)
3. **Page Navigation** – opens each URL in a Playwright-controlled Chromium browser
4. **Media Tab Detection** – automatically navigates to the Media tab for profile URLs
5. **Infinite Scroll** – scrolls through Twitter's timeline with configurable depth and smart stopping logic
6. **Data Extraction** – parses tweet elements using Twitter's `data-testid` selectors with multiple fallbacks
7. **Date Filtering** – applies start/end date filters to extracted tweets
8. **Deduplication** – tracks tweet URLs to ensure no duplicate entries
9. **Memory Management** – periodically removes old DOM elements during long scroll sessions
10. **Output** – saves structured JSON data to the Apify Dataset
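As a concrete illustration of the extraction step, Twitter renders engagement counts as abbreviated strings such as `1.2K` or `5M`, while the output dataset stores plain integers. The sketch below shows the kind of conversion involved; `parseCount` is a hypothetical helper, not a function taken from the actor's source:

```javascript
// Hypothetical helper: convert Twitter's abbreviated engagement counts
// ("1.2K", "5M", "15,000") into plain integers for the output dataset.
function parseCount(text) {
  if (!text) return 0;
  const cleaned = text.trim().replace(/,/g, '');
  const match = cleaned.match(/^([\d.]+)([KkMm])?$/);
  if (!match) return 0;
  const value = parseFloat(match[1]);
  const suffix = (match[2] || '').toUpperCase();
  // Scale by the suffix: K = thousands, M = millions, none = as-is.
  const multiplier = suffix === 'K' ? 1_000 : suffix === 'M' ? 1_000_000 : 1;
  return Math.round(value * multiplier);
}

console.log(parseCount('1.2K'));   // 1200
console.log(parseCount('15,000')); // 15000
console.log(parseCount('5M'));     // 5000000
```

This is why fields like `likes` and `views` in the sample output appear as numbers rather than strings.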

## 📅 Smart Date Filtering

The scraper features an intelligent date-filtering system:

- **Unlimited scrolling** when date filters are active – scrolls until all tweets in range are found
- **Smart exit detection** – stops once it has scrolled past the target date range
- **Tolerance for gaps** – continues scrolling even if temporary gaps appear in the timeline
- **UTC-based comparison** – consistent timezone handling across all date operations

```text
Timeline: ←── Older ─────────────── Newer ──→
                    │                    │
              startDate              endDate
                    │◄── Extracted ──►│
```
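The UTC-based comparison can be sketched as a pure predicate. The function name `isInDateRange` and the inclusive end-of-day handling for `endDate` are assumptions for illustration, not details confirmed by the actor's source:

```javascript
// Sketch of a UTC-based date-range filter: a tweet passes when its
// ISO timestamp falls within [startDate, endDate]. Both bounds are
// optional; endDate is treated as inclusive of the whole day (UTC).
function isInDateRange(tweetTimestamp, startDate, endDate) {
  const ts = Date.parse(tweetTimestamp); // ISO 8601 strings parse as UTC
  if (Number.isNaN(ts)) return false;
  if (startDate && ts < Date.parse(`${startDate}T00:00:00.000Z`)) return false;
  if (endDate && ts > Date.parse(`${endDate}T23:59:59.999Z`)) return false;
  return true;
}

console.log(isInDateRange('2025-06-15T18:45:00.000Z', '2025-01-01', '2025-12-31')); // true
console.log(isInDateRange('2024-12-31T23:00:00.000Z', '2025-01-01', null));         // false
```

Comparing epoch milliseconds derived from UTC strings avoids the off-by-one-day bugs that local-timezone `Date` parsing can introduce.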

## 🚀 Deployment

### Deploy to Apify

```bash
# Install the Apify CLI
npm install -g apify-cli

# Log in to Apify
apify login

# Push the actor to Apify
apify push
```

### Run with Docker

```bash
docker build -t twitter-scraper .
docker run -e APIFY_INPUT_JSON='{"startUrls":[{"url":"https://twitter.com/elonmusk"}]}' twitter-scraper
```

### Local Development

```bash
# Install dependencies
npm install
npx playwright install chromium

# Run with test input
APIFY_INPUT_JSON="$(cat test_input.json)" node main.js
```

## ⚠️ Important Notes

### Rate Limiting

- Use `maxConcurrency: 1` and `waitForTimeout: 2000`–`3000` for safe scraping
- Enable Apify Proxy to distribute requests across IPs
- Start with small `maxRequestsPerCrawl` values and scale gradually

### Authentication

- Works with public content without login
- Login support is available for accessing restricted content
- Store credentials securely using Apify Secrets

### Domain Support

- Full support for both twitter.com and x.com domains
- URLs are automatically normalized regardless of domain
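Normalizing both domains to a single canonical host keeps deduplication consistent when the same profile is supplied under different domains. A minimal sketch using Node's built-in `URL`; the function name and the choice of `x.com` as the canonical host are illustrative assumptions, not the actor's actual implementation:

```javascript
// Sketch: collapse twitter.com / www variants onto one canonical host
// so the same profile URL always deduplicates to the same key.
function normalizeTwitterUrl(rawUrl) {
  const url = new URL(rawUrl);
  const host = url.hostname.toLowerCase();
  if (host === 'twitter.com' || host === 'www.twitter.com' || host === 'www.x.com') {
    url.hostname = 'x.com';
  }
  return url.toString();
}

console.log(normalizeTwitterUrl('https://twitter.com/OpenAI')); // https://x.com/OpenAI
console.log(normalizeTwitterUrl('https://x.com/Google'));       // https://x.com/Google
```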

πŸ› οΈ Tech Stack

  • Runtime: Node.js (ES Modules)
  • Browser Automation: Playwright
  • Crawler Framework: Crawlee
  • Platform: Apify
  • Container: Docker (apify/actor-node-playwright-chrome)

## 📄 License

ISC

---

Built with ❤️ for the data extraction community.
⭐ Star this repo if you find it useful!
