terastudio-org/ScrapeFandom

Wiki Fandom to Llama 3.2 Conversational Dataset Scraper

A Python script that scrapes Fandom wikis and converts the content into the Llama 3.2 conversational dataset format, ready for fine-tuning large language models.

Features

  • πŸš€ Multi-threaded scraping for fast data collection
  • πŸ“ Multiple conversation formats (Q&A, explain, summary, detailed)
  • 🌐 Multi-language support (English, Dutch, etc.)
  • πŸ”§ Robust error handling with fallbacks
  • πŸ“Š Detailed logging and progress tracking
  • πŸ’Ύ JSONL output format ready for LLM training
  • πŸ€– Llama 3.2 optimized conversational structure

Installation

  1. Clone the repository:
git clone https://github.com/terastudio-org/ScrapeFandom.git
cd ScrapeFandom
  2. Install dependencies:
pip install -r requirements.txt

Quick Start

Basic Usage

# Scrape Star Wars wiki with default settings
python fandom_scraper_to_llama.py --wiki starwars

# Scrape a specific number of pages
python fandom_scraper_to_llama.py --wiki pokemon --max-pages 500

# Custom settings
python fandom_scraper_to_llama.py \
    --wiki harrypotter \
    --domain harrypotter \
    --max-pages 1000 \
    --delay 2.0 \
    --workers 5 \
    --output-dir ./harry_potter_data

Advanced Usage

# Multi-language scraping
python fandom_scraper_to_llama.py \
    --wiki starwars \
    --languages en nl de \
    --max-pages 2000 \
    --include-images \
    --workers 10

Command Line Options

| Option | Description | Default |
| --- | --- | --- |
| --wiki | Required. Wiki name (e.g. starwars, pokemon) | - |
| --domain | Fandom domain | same as wiki name |
| --max-pages | Maximum number of pages to scrape | all available |
| --delay | Delay between requests (seconds) | 1.0 |
| --workers | Number of concurrent workers | 3 |
| --output-dir | Output directory for the dataset | fandom_data |
| --languages | Languages to process | en |
| --include-images | Include image URLs in content | False |
| --respect-robots | Respect robots.txt | True |

Output Format

Dataset Structure

The script generates a JSONL (JSON Lines) file where each line contains a conversation:

{
  "messages": [
    {
      "role": "user",
      "content": "Can you tell me about Darth Vader?"
    },
    {
      "role": "assistant", 
      "content": "Darth Vader is a central character in the Star Wars franchise..."
    }
  ]
}
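Each line of the file must parse as standalone JSON with a `messages` list. A quick way to sanity-check the output before training (a sketch; `load_conversations` is a hypothetical helper, and the path matches the default output location):

```python
import json

def load_conversations(path):
    """Parse a JSONL file and validate the Llama-style message schema."""
    conversations = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            assert "messages" in record, "each line must contain a 'messages' list"
            for msg in record["messages"]:
                # Only the three standard chat roles should appear
                assert msg["role"] in {"system", "user", "assistant"}
            conversations.append(record)
    return conversations
```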

Conversation Types

The script creates 4 types of conversations from each page:

  1. Simple Q&A: Direct question-answer pairs
  2. Explained: With system prompt for educational context
  3. Summary: Brief overviews of topics
  4. Detailed: Comprehensive explanations
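Conceptually, each scraped page fans out into one record per type. A minimal sketch of that mapping (the prompt wording is illustrative, not the script's actual templates, and the `type` tag is added here only for clarity; the JSONL records themselves contain just `messages`):

```python
def build_conversations(title, summary, full_text):
    """Build the four conversation variants from one scraped page (sketch)."""
    return [
        {"type": "simple_qa", "messages": [
            {"role": "user", "content": f"Can you tell me about {title}?"},
            {"role": "assistant", "content": summary}]},
        {"type": "explained", "messages": [
            # System prompt supplies the educational framing
            {"role": "system", "content": "You are a helpful teacher who explains topics clearly."},
            {"role": "user", "content": f"Explain {title} to me."},
            {"role": "assistant", "content": full_text}]},
        {"type": "summary", "messages": [
            {"role": "user", "content": f"Give me a brief summary of {title}."},
            {"role": "assistant", "content": summary}]},
        {"type": "detailed", "messages": [
            {"role": "user", "content": f"Describe {title} in detail."},
            {"role": "assistant", "content": full_text}]},
    ]
```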

File Outputs

fandom_data/
β”œβ”€β”€ fandom_llama_dataset.jsonl    # Main training dataset
β”œβ”€β”€ dataset_summary.json          # Dataset statistics
└── fandom_scraper.log            # Scraping log

Summary File Example

{
  "total_conversations": 1247,
  "wiki_name": "starwars", 
  "domain": "starwars",
  "languages": ["en"],
  "output_file": "fandom_data/fandom_llama_dataset.jsonl",
  "creation_time": "2025-12-05 12:55:21",
  "conversations_by_type": {
    "simple_qa": 312,
    "explained": 311, 
    "summary": 312,
    "detailed": 312
  }
}
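A useful consistency check on this file: the per-type counts should sum to `total_conversations` (312 + 311 + 312 + 312 = 1247 above). A sketch, with `check_summary` a hypothetical helper:

```python
import json

def check_summary(path):
    """Verify that per-type counts in dataset_summary.json add up to the total."""
    with open(path, encoding="utf-8") as f:
        summary = json.load(f)
    counted = sum(summary["conversations_by_type"].values())
    assert counted == summary["total_conversations"], (
        f"type counts sum to {counted}, expected {summary['total_conversations']}")
    return summary
```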

Usage Examples

Example 1: Gaming Wiki

# Scrape Minecraft wiki
python fandom_scraper_to_llama.py \
    --wiki minecraft \
    --max-pages 800 \
    --workers 5 \
    --delay 1.5

Example 2: Multi-language

# Scrape in multiple languages
python fandom_scraper_to_llama.py \
    --wiki starwars \
    --languages en nl de fr es it pt \
    --max-pages 500 \
    --workers 8

Example 3: Large Dataset

# Create a large dataset with detailed content
python fandom_scraper_to_llama.py \
    --wiki memory-alpha \
    --domain star-trek \
    --max-pages 2000 \
    --include-images \
    --workers 10 \
    --delay 0.5

Training Llama 3.2

Once you have your dataset, you can fine-tune Llama 3.2 using frameworks like:

Using Unsloth (Recommended)

from unsloth import FastLanguageModel
import json

# Load your dataset
dataset = []
with open('fandom_data/fandom_llama_dataset.jsonl', 'r') as f:
    for line in f:
        dataset.append(json.loads(line))

# Fine-tune with Unsloth
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="meta-llama/Llama-3.2-3B-Instruct",
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
)

# Training code would go here...

Using Hugging Face Transformers

from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Load dataset
dataset = load_dataset('json', data_files='fandom_data/fandom_llama_dataset.jsonl')

# Initialize model and tokenizer
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

# Set padding token (Llama models ship without one)
tokenizer.pad_token = tokenizer.eos_token

# Render each conversation with the model's chat template and tokenize;
# Trainer cannot consume the raw "messages" column directly.
def tokenize(example):
    text = tokenizer.apply_chat_template(example["messages"], tokenize=False)
    return tokenizer(text, truncation=True, max_length=2048)

tokenized = dataset["train"].map(tokenize, remove_columns=dataset["train"].column_names)

# Training arguments
training_args = TrainingArguments(
    output_dir="./llama-fandom-finetuned",
    num_train_epochs=1,
    per_device_train_batch_size=1,
    save_steps=500,
    save_total_limit=2,
    logging_steps=10,
)

# Causal-LM collator pads batches and builds labels from the input ids
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

# Initialize trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized,
    data_collator=data_collator,
)

# Start training
trainer.train()

Supported Wikis

The script works with any Fandom wiki. Here are popular examples:

| Wiki Name | Domain | Example Command |
| --- | --- | --- |
| Star Wars | starwars | --wiki starwars |
| Harry Potter | harrypotter | --wiki harrypotter |
| Marvel | marvel | --wiki marvel |
| DC Comics | dc | --wiki dc |
| PokΓ©mon | pokemon | --wiki pokemon |
| Minecraft | minecraft | --wiki minecraft |
| Star Trek | star-trek | --wiki memory-alpha --domain star-trek |
| Game of Thrones | gameofthrones | --wiki gameofthrones |

Rate Limiting & Ethics

  • Built-in delays between requests to respect server load
  • Concurrent workers limited by default to avoid overwhelming servers
  • Respectful user agent identifies the scraper purpose
  • Configurable delays let you tune how aggressively the scraper runs
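The built-in delay behaves like a simple rate limiter that enforces a minimum gap between consecutive requests, mirroring the --delay option. A minimal sketch of the idea, not the script's actual implementation:

```python
import time

class Throttle:
    """Enforce a minimum delay (in seconds) between consecutive calls."""

    def __init__(self, delay):
        self.delay = delay
        self._last = 0.0

    def wait(self):
        # Sleep only for whatever portion of the delay has not yet elapsed
        elapsed = time.monotonic() - self._last
        if elapsed < self.delay:
            time.sleep(self.delay - elapsed)
        self._last = time.monotonic()
```

Each worker would call `wait()` immediately before issuing a request, so bursts of back-to-back fetches are spaced out automatically.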

Troubleshooting

Common Issues

  1. Import errors:
pip install -r requirements.txt
  2. Network timeouts:
# Increase delay between requests
python fandom_scraper_to_llama.py --wiki starwars --delay 3.0
  3. Memory issues:
# Reduce max pages and workers
python fandom_scraper_to_llama.py --wiki starwars --max-pages 100 --workers 1
  4. No pages found:
# Try a different wiki name or check the domain
python fandom_scraper_to_llama.py --wiki starwars --domain starwars

Logging

Check fandom_scraper.log for detailed information about the scraping process.

Performance Tips

  1. Batch Processing: Use --workers 5-10 for faster scraping
  2. Memory Management: Limit --max-pages for large wikis
  3. Network: Use --delay 0.5 for fast connections, --delay 2.0 for slow ones
  4. Disk Space: Each conversation is ~500-1000 characters
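Tip 4 translates into a back-of-envelope disk estimate: at roughly 750 characters per conversation and four conversations per page, 1,000 pages come to about 3 MB. A sketch (the helper name and default values are illustrative):

```python
def estimate_dataset_bytes(pages, avg_chars=750, types_per_page=4):
    """Rough disk-space estimate: pages x conversations-per-page x avg size.

    avg_chars is the midpoint of the ~500-1000 character range per
    conversation; each page yields four conversation types.
    """
    return pages * types_per_page * avg_chars
```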

Dataset Quality

  • Content Filtering: Removes pages with insufficient content
  • Multiple Formats: Creates diverse conversation types for better training
  • Context Preservation: Maintains original content structure
  • No HTML Artifacts: Clean text extraction with proper formatting
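The content filter can be as simple as a length threshold plus a redirect check. A sketch of the idea (the 200-character threshold and the `is_trainable` name are assumptions, not the script's actual values):

```python
MIN_CONTENT_CHARS = 200  # hypothetical threshold, not the script's real value

def is_trainable(text):
    """Reject pages that are too short or are bare redirect stubs."""
    stripped = text.strip()
    if len(stripped) < MIN_CONTENT_CHARS:
        return False
    # MediaWiki redirect pages start with "#REDIRECT" and carry no content
    if stripped.lower().startswith("#redirect"):
        return False
    return True
```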

License

This tool is for educational and research purposes. Please respect the terms of service of Fandom wikis and use responsibly.

Contributing

Feel free to submit issues and enhancement requests!
