A comprehensive Python script that scrapes data from Fandom wikis and converts it to the Llama 3.2 conversational dataset format, suitable for fine-tuning large language models.
- Multi-threaded scraping for fast data collection
- Multiple conversation formats (Q&A, explain, summary, detailed)
- Multi-language support (English, Dutch, etc.)
- Robust error handling with fallbacks
- Detailed logging and progress tracking
- JSONL output format ready for LLM training
- Llama 3.2-optimized conversational structure
- Clone or download the files:

```bash
# Download the main script and requirements
wget https://github.com/terastudio-org/ScrapeFandom/fandom_scraper_to_llama.py
wget https://github.com/terastudio-org/ScrapeFandom/requirements.txt
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

```bash
# Scrape Star Wars wiki with default settings
python fandom_scraper_to_llama.py --wiki starwars

# Scrape a specific number of pages
python fandom_scraper_to_llama.py --wiki pokemon --max-pages 500
```
```bash
# Custom settings
python fandom_scraper_to_llama.py \
    --wiki harrypotter \
    --domain harrypotter \
    --max-pages 1000 \
    --delay 2.0 \
    --workers 5 \
    --output-dir ./harry_potter_data

# Multi-language scraping
python fandom_scraper_to_llama.py \
    --wiki starwars \
    --languages en nl de \
    --max-pages 2000 \
    --include-images \
    --workers 10
```

| Option | Description | Default |
|---|---|---|
| `--wiki` | Required. Wiki name (e.g., `starwars`, `pokemon`) | - |
| `--domain` | Fandom domain | same as wiki name |
| `--max-pages` | Maximum number of pages to scrape | All available |
| `--delay` | Delay between requests (seconds) | 1.0 |
| `--workers` | Number of concurrent workers | 3 |
| `--output-dir` | Output directory for dataset | `fandom_data` |
| `--languages` | Languages to process | `en` |
| `--include-images` | Include image URLs in content | False |
| `--respect-robots` | Respect robots.txt | True |
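As a rough illustration of how the table above maps onto a command-line interface, here is a hedged `argparse` sketch; the function name `build_parser` and the exact help strings are assumptions, but the flags and defaults mirror the documented ones:

```python
import argparse

def build_parser():
    # Hypothetical reconstruction of the CLI described in the options table;
    # flag names and defaults follow the table, everything else is illustrative.
    p = argparse.ArgumentParser(
        description="Scrape a Fandom wiki into a Llama 3.2 dataset")
    p.add_argument("--wiki", required=True, help="Wiki name, e.g. 'starwars'")
    p.add_argument("--domain", default=None,
                   help="Fandom domain (defaults to wiki name)")
    p.add_argument("--max-pages", type=int, default=None,
                   help="Maximum pages to scrape (default: all available)")
    p.add_argument("--delay", type=float, default=1.0,
                   help="Delay between requests in seconds")
    p.add_argument("--workers", type=int, default=3,
                   help="Number of concurrent workers")
    p.add_argument("--output-dir", default="fandom_data",
                   help="Output directory for the dataset")
    p.add_argument("--languages", nargs="+", default=["en"],
                   help="Languages to process")
    p.add_argument("--include-images", action="store_true",
                   help="Include image URLs in content")
    p.add_argument("--respect-robots", action="store_true", default=True,
                   help="Respect robots.txt (on by default)")
    return p

args = build_parser().parse_args(["--wiki", "pokemon", "--max-pages", "500"])
```

Unspecified flags fall back to the defaults in the table, so the example parse above yields `delay=1.0`, `workers=3`, and `languages=["en"]`.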
The script generates a JSONL (JSON Lines) file where each line contains a conversation:

```json
{
  "messages": [
    {
      "role": "user",
      "content": "Can you tell me about Darth Vader?"
    },
    {
      "role": "assistant",
      "content": "Darth Vader is a central character in the Star Wars franchise..."
    }
  ]
}
```

The script creates 4 types of conversations from each page:
- Simple Q&A: Direct question-answer pairs
- Explained: With system prompt for educational context
- Summary: Brief overviews of topics
- Detailed: Comprehensive explanations
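To make the four types concrete, here is a hedged sketch of how a single page could be expanded into them; the function name, prompt wordings, and the naive first-200-characters summary are illustrative assumptions, not the script's actual logic:

```python
def page_to_conversations(title, text):
    """Illustrative sketch: turn one wiki page into the four
    conversation types listed above (prompts are assumptions)."""
    summary = text[:200]  # naive stand-in for a real summary
    return [
        {"type": "simple_qa", "messages": [
            {"role": "user", "content": f"Can you tell me about {title}?"},
            {"role": "assistant", "content": text}]},
        {"type": "explained", "messages": [
            {"role": "system", "content": "You are a helpful teacher."},
            {"role": "user", "content": f"Explain {title} to me."},
            {"role": "assistant", "content": text}]},
        {"type": "summary", "messages": [
            {"role": "user", "content": f"Give me a brief overview of {title}."},
            {"role": "assistant", "content": summary}]},
        {"type": "detailed", "messages": [
            {"role": "user", "content": f"Describe {title} in detail."},
            {"role": "assistant", "content": text}]},
    ]

convs = page_to_conversations("Darth Vader",
                              "Darth Vader is a central character...")
```

Generating several framings of the same content is what drives the roughly even `conversations_by_type` split shown in the dataset statistics below.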
```
fandom_data/
├── fandom_llama_dataset.jsonl   # Main training dataset
├── dataset_summary.json         # Dataset statistics
└── fandom_scraper.log           # Scraping log
```
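Before training, it is worth sanity-checking the main dataset file. This is a minimal sketch (the helper name `load_conversations` is an assumption) that reads the JSONL file and verifies each line has a well-formed `messages` list; the demo writes a tiny synthetic file rather than a real scrape:

```python
import json
import tempfile

def load_conversations(path):
    # Minimal validation sketch: each JSONL line must be a JSON object
    # with a "messages" list of {role, content} entries.
    conversations = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            assert isinstance(record["messages"], list), "missing messages list"
            for msg in record["messages"]:
                assert msg["role"] in {"system", "user", "assistant"}
            conversations.append(record)
    return conversations

# Demo on a tiny synthetic file (stands in for fandom_llama_dataset.jsonl)
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as f:
    f.write(json.dumps({"messages": [
        {"role": "user", "content": "Can you tell me about Darth Vader?"},
        {"role": "assistant", "content": "Darth Vader is..."}]}) + "\n")
    path = f.name

convs = load_conversations(path)
```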
Example `dataset_summary.json`:

```json
{
    "total_conversations": 1247,
    "wiki_name": "starwars",
    "domain": "starwars",
    "languages": ["en"],
    "output_file": "fandom_data/fandom_llama_dataset.jsonl",
    "creation_time": "2025-12-05 12:55:21",
    "conversations_by_type": {
        "simple_qa": 312,
        "explained": 311,
        "summary": 312,
        "detailed": 312
    }
}
```

```bash
# Scrape Minecraft wiki
python fandom_scraper_to_llama.py \
    --wiki minecraft \
    --max-pages 800 \
    --workers 5 \
    --delay 1.5

# Scrape in multiple languages
python fandom_scraper_to_llama.py \
    --wiki starwars \
    --languages en nl de fr es it pt \
    --max-pages 500 \
    --workers 8

# Create a large dataset with detailed content
python fandom_scraper_to_llama.py \
    --wiki memory-alpha \
    --domain star-trek \
    --max-pages 2000 \
    --include-images \
    --workers 10 \
    --delay 0.5
```

Once you have your dataset, you can fine-tune Llama 3.2 using frameworks like:
```python
from unsloth import FastLanguageModel
import json

# Load your dataset
dataset = []
with open('fandom_data/fandom_llama_dataset.jsonl', 'r') as f:
    for line in f:
        dataset.append(json.loads(line))

# Fine-tune with Unsloth
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="meta-llama/Llama-3.2-3B-Instruct",
    max_seq_length=2048,
    dtype=None,          # auto-detect dtype for the GPU
    load_in_4bit=True,   # 4-bit quantization to reduce memory use
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
)
# Training code would go here...
```

```python
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer

# Load dataset
dataset = load_dataset('json', data_files='fandom_data/fandom_llama_dataset.jsonl')

# Initialize model and tokenizer
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

# Set padding token
tokenizer.pad_token = tokenizer.eos_token

# Note: the raw "messages" records still need to be rendered with the chat
# template and tokenized (e.g. via tokenizer.apply_chat_template) before
# the Trainer can consume them.

# Training arguments
training_args = TrainingArguments(
    output_dir="./llama-fandom-finetuned",
    num_train_epochs=1,
    per_device_train_batch_size=1,
    save_steps=500,
    save_total_limit=2,
    prediction_loss_only=True,
    logging_steps=10,
)

# Initialize trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset['train'],
    tokenizer=tokenizer,
)

# Start training
trainer.train()
```

The script works with any Fandom wiki. Here are popular examples:
| Wiki Name | Domain | Example Command |
|---|---|---|
| Star Wars | starwars | --wiki starwars |
| Harry Potter | harrypotter | --wiki harrypotter |
| Marvel | marvel | --wiki marvel |
| DC Comics | dc | --wiki dc |
| Pokémon | pokemon | --wiki pokemon |
| Minecraft | minecraft | --wiki minecraft |
| Star Trek | star-trek | --wiki memory-alpha --domain star-trek |
| Game of Thrones | gameofthrones | --wiki gameofthrones |
- Built-in delays between requests to respect server load
- Concurrent workers limited by default to avoid overwhelming servers
- Respectful user agent identifies the scraper purpose
- Configurable delays let you tune how aggressively to scrape
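The built-in delay mentioned above can be pictured as a small throttle that guarantees a minimum gap between consecutive requests. This is an illustrative sketch, not the script's actual implementation:

```python
import time

class Throttle:
    """Illustrative sketch of per-request throttling: ensures at least
    `delay` seconds pass between consecutive calls to wait()."""
    def __init__(self, delay=1.0):
        self.delay = delay
        self._last = None

    def wait(self):
        if self._last is not None:
            remaining = self.delay - (time.monotonic() - self._last)
            if remaining > 0:
                time.sleep(remaining)  # pad out to the minimum gap
        self._last = time.monotonic()

throttle = Throttle(delay=0.05)
start = time.monotonic()
for _ in range(3):
    throttle.wait()  # first call returns immediately, later calls pace out
elapsed = time.monotonic() - start
```

With multiple workers, each worker would typically hold its own throttle (or share one behind a lock), which is why raising `--workers` without also raising `--delay` increases load on the server.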
- Import Errors

```bash
pip install -r requirements.txt
```

- Network Timeouts

```bash
# Increase delay between requests
python fandom_scraper_to_llama.py --wiki starwars --delay 3.0
```

- Memory Issues

```bash
# Reduce max pages and workers
python fandom_scraper_to_llama.py --wiki starwars --max-pages 100 --workers 1
```

- No Pages Found

```bash
# Try a different wiki name or check the domain
python fandom_scraper_to_llama.py --wiki starwars --domain starwars
```

Check `fandom_scraper.log` for detailed information about the scraping process.
- Batch Processing: Use `--workers 5-10` for faster scraping
- Memory Management: Limit `--max-pages` for large wikis
- Network: Use `--delay 0.5` for fast connections, `--delay 2.0` for slow ones
- Disk Space: Each conversation is ~500-1000 characters
- Content Filtering: Removes pages with insufficient content
- Multiple Formats: Creates diverse conversation types for better training
- Context Preservation: Maintains original content structure
- No HTML Artifacts: Clean text extraction with proper formatting
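The filtering and cleanup steps above can be sketched as a short function; the name `clean_page`, the regex-based tag stripping, and the 200-character threshold are illustrative assumptions rather than the script's actual rules:

```python
import re

def clean_page(html_text, min_chars=200):
    """Hedged sketch of the quality filters listed above: strip leftover
    HTML tags, normalize whitespace, and drop pages that are too short.
    Returns cleaned text, or None if the page has insufficient content."""
    text = re.sub(r"<[^>]+>", " ", html_text)   # remove HTML tags
    text = re.sub(r"\s+", " ", text).strip()    # collapse whitespace
    return text if len(text) >= min_chars else None

kept = clean_page("<div>" + "Darth Vader is a Sith Lord. " * 20 + "</div>")
dropped = clean_page("<p>Stub page.</p>")
```

A production scraper would more likely use an HTML parser (e.g. BeautifulSoup) than a regex, but the shape of the filter is the same: clean first, then reject pages below a content threshold.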
This tool is for educational and research purposes. Please respect the terms of service of Fandom wikis and use responsibly.
Feel free to submit issues and enhancement requests!