Pipeline for generating AI character files and training datasets by scraping public figures' online presence across Twitter and blogs.
⚠️ IMPORTANT: Create a new Twitter account for this tool. DO NOT use your main account as it may trigger Twitter's automation detection and result in account restrictions.
- Install dependencies:

      npm install

- Copy .env.example into a .env file:

      # (Required) Twitter Authentication
      TWITTER_USERNAME= # your twitter username
      TWITTER_PASSWORD= # your twitter password
      TWITTER_EMAIL= # your twitter email

      # RapidAPI Configuration
      RAPIDAPI_URL=
      RAPIDAPI_KEY=

      # Google Generative AI API Key. Required for summarizing tweets.
      GOOGLE_GENERATIVE_AI_API_KEY=

      # (Optional) Blog Configuration
      BLOG_URLS_FILE= # path to file containing blog URLs

      # (Optional) Scraping Configuration
      MAX_TWEETS= # max tweets to scrape
      MAX_RETRIES= # max retries for scraping
      RETRY_DELAY= # delay between retries
      MIN_DELAY= # minimum delay between requests
      MAX_DELAY= # maximum delay between requests
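
As an illustration of how the settings above might be consumed, here is a minimal sketch that checks the required Twitter variables and parses the optional scraping settings. The function names, the `ScrapeConfig` shape, and the fallback values are all assumptions made for this example; they are not taken from the project's code, and the README does not state the real defaults.

```typescript
// Hypothetical shape for the optional scraping settings (names are illustrative).
interface ScrapeConfig {
  maxTweets: number;
  maxRetries: number;
  retryDelay: number;
  minDelay: number;
  maxDelay: number;
}

type Env = Record<string, string | undefined>;

// Parse an integer variable; fall back when it is unset, blank, or not a number.
function intFromEnv(env: Env, name: string, fallback: number): number {
  const raw = env[name];
  if (raw === undefined || raw.trim() === "") return fallback;
  const value = Number.parseInt(raw, 10);
  return Number.isNaN(value) ? fallback : value;
}

// Return the names of required Twitter variables that are unset or blank.
function missingRequired(env: Env): string[] {
  const required = ["TWITTER_USERNAME", "TWITTER_PASSWORD", "TWITTER_EMAIL"];
  return required.filter((name) => (env[name] ?? "").trim() === "");
}

// Build the scraping config. The fallback numbers are placeholders, not project defaults.
function loadScrapeConfig(env: Env): ScrapeConfig {
  return {
    maxTweets: intFromEnv(env, "MAX_TWEETS", 1000),
    maxRetries: intFromEnv(env, "MAX_RETRIES", 3),
    retryDelay: intFromEnv(env, "RETRY_DELAY", 5000),
    minDelay: intFromEnv(env, "MIN_DELAY", 1000),
    maxDelay: intFromEnv(env, "MAX_DELAY", 3000),
  };
}
```

In a Node script, both functions could be called with `process.env` once the `.env` file has been loaded.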
Add Rapid API to get more data.
Get the full text of a tweet:
const twitterCrawlAPI = new TwitterCrawlAPI();
twitterCrawlAPI.getFullTextTweet();

Use puppeteer to get the full text of tweets posted before Sep 29, 2022:
twitterCrawlAPI.fallbackGetFullTextTweet();

Get message examples:
const messageExamplesCrawler = new MessageExamplesCrawler();
messageExamplesCrawler.addExample();

// Extract knowledge with longer tweets
const knowledgeGenerator = new KnowledgeGenerator();
await knowledgeGenerator.addKnowledge(uniqueTweets);
characterData.knowledge = knowledgeGenerator.getKnowledge();

npm run start

Add Express server
- GET /api/characters/:username - get character data by username
- POST /api/characters - scrape tweets and blogs by username
{
"username": "pmarca", // twitter username
"is_crawl": true // scrape tweets
}

npm run twitter -- username
Example: npm run twitter -- pmarca
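
The MAX_RETRIES, RETRY_DELAY, MIN_DELAY and MAX_DELAY settings in .env suggest that tweet collection retries failed requests and spaces requests out with a randomized pause. The sketch below shows one way such behavior could be written; it is not the project's actual implementation. `withRetries` and `randomDelay` are hypothetical names, and the delays are assumed to be in milliseconds, which the README does not specify.

```typescript
// Pick a delay between min and max; rng is injectable so the result can be tested.
function randomDelay(
  min: number,
  max: number,
  rng: () => number = Math.random,
): number {
  return min + rng() * (max - min);
}

// Run task, retrying up to maxRetries times after the first failure.
// sleep is injectable; by default it waits retryDelay milliseconds (assumed unit).
async function withRetries<T>(
  task: () => Promise<T>,
  maxRetries: number,
  retryDelay: number,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await task();
    } catch (err) {
      lastError = err;
      // Only pause if another attempt is still allowed.
      if (attempt < maxRetries) await sleep(retryDelay);
    }
  }
  throw lastError;
}
```

A scraper loop could wrap each request in `withRetries` and wait `randomDelay(minDelay, maxDelay)` between requests, which makes the traffic look less uniform.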
npm run blog

npm run character -- username
Example: npm run character -- pmarca

npm run finetune

npm run finetune:test

Run this after the Twitter Collection step:
npm run generate-virtuals -- username date
Example: npm run generate-virtuals -- pmarca 2024-11-29
Example without date: npm run generate-virtuals -- pmarca
The generated character file will be at characters/[username].json. Edit the clients and modelProvider fields to match your needs.
The generated tweet dataset file will be in pipeline/[username]/[date]/raw/tweets.json.
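
The two server endpoints listed earlier (GET /api/characters/:username and POST /api/characters) can also be called from a script. The sketch below only builds the request descriptions. The base URL is a placeholder, since the README does not state the server's host or port, and sending the POST body as JSON with a Content-Type header is an assumption about how the server parses requests. The helper names are hypothetical.

```typescript
// Body of POST /api/characters, using the fields shown in the README.
interface ScrapeRequestBody {
  username: string; // twitter username
  is_crawl: boolean; // scrape tweets
}

// Plain description of an HTTP request; it can be handed to any HTTP client.
interface PreparedRequest {
  url: string;
  method: "GET" | "POST";
  headers: Record<string, string>;
  body?: string;
}

// GET /api/characters/:username - get character data by username.
function getCharacterRequest(baseUrl: string, username: string): PreparedRequest {
  return {
    url: `${baseUrl}/api/characters/${encodeURIComponent(username)}`,
    method: "GET",
    headers: {},
  };
}

// POST /api/characters - scrape tweets and blogs by username.
function scrapeCharacterRequest(
  baseUrl: string,
  body: ScrapeRequestBody,
): PreparedRequest {
  return {
    url: `${baseUrl}/api/characters`,
    method: "POST",
    headers: { "Content-Type": "application/json" }, // assumed JSON parsing on the server
    body: JSON.stringify(body),
  };
}
```

With the server running, a request built this way could be sent with, for example, `const { url, ...init } = scrapeCharacterRequest("http://localhost:3000", { username: "pmarca", is_crawl: true }); await fetch(url, init);`, where the port is only a guess.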