terastudio-org/ScrapeFandom

Wiki Fandom to Llama 3.2 Conversational Dataset Scraper

A Python script that scrapes Fandom wikis and converts the content into the Llama 3.2 conversational dataset format, ready for fine-tuning large language models.

Features

  • πŸš€ Multi-threaded scraping for fast data collection
  • πŸ“ Multiple conversation formats (Q&A, explain, summary, detailed)
  • 🌐 Multi-language support (English, Dutch, etc.)
  • πŸ”§ Robust error handling with fallbacks
  • πŸ“Š Detailed logging and progress tracking
  • πŸ’Ύ JSONL output format ready for LLM training
  • πŸ€– Llama 3.2 optimized conversational structure

Installation

  1. Clone the repository:
git clone https://github.com/terastudio-org/ScrapeFandom.git
cd ScrapeFandom
  2. Install dependencies:
pip install -r requirements.txt

Quick Start

Basic Usage

# Scrape Star Wars wiki with default settings
python fandom_scraper_to_llama.py --wiki starwars

# Scrape a specific number of pages
python fandom_scraper_to_llama.py --wiki pokemon --max-pages 500

# Custom settings
python fandom_scraper_to_llama.py \
    --wiki harrypotter \
    --domain harrypotter \
    --max-pages 1000 \
    --delay 2.0 \
    --workers 5 \
    --output-dir ./harry_potter_data

Advanced Usage

# Multi-language scraping
python fandom_scraper_to_llama.py \
    --wiki starwars \
    --languages en nl de \
    --max-pages 2000 \
    --include-images \
    --workers 10

Command Line Options

| Option | Description | Default |
| --- | --- | --- |
| --wiki | Required. Wiki name (e.g. starwars, pokemon) | - |
| --domain | Fandom domain | same as wiki name |
| --max-pages | Maximum number of pages to scrape | all available |
| --delay | Delay between requests (seconds) | 1.0 |
| --workers | Number of concurrent workers | 3 |
| --output-dir | Output directory for the dataset | fandom_data |
| --languages | Languages to process | en |
| --include-images | Include image URLs in content | False |
| --respect-robots | Respect robots.txt | True |

Output Format

Dataset Structure

The script generates a JSONL (JSON Lines) file where each line contains a conversation:

{
  "messages": [
    {
      "role": "user",
      "content": "Can you tell me about Darth Vader?"
    },
    {
      "role": "assistant", 
      "content": "Darth Vader is a central character in the Star Wars franchise..."
    }
  ]
}
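Each line of the file must parse as standalone JSON with a `messages` list. A quick way to sanity-check the output before training (a sketch; `load_conversations` is a hypothetical helper, and the path matches the default output location):

```python
import json

def load_conversations(path):
    """Parse a JSONL file and validate the Llama-style message schema."""
    conversations = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            assert "messages" in record, "each line must contain a 'messages' list"
            for msg in record["messages"]:
                # Only the three standard chat roles should appear
                assert msg["role"] in {"system", "user", "assistant"}
            conversations.append(record)
    return conversations
```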

Conversation Types

The script creates 4 types of conversations from each page:

  1. Simple Q&A: Direct question-answer pairs
  2. Explained: With system prompt for educational context
  3. Summary: Brief overviews of topics
  4. Detailed: Comprehensive explanations
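Conceptually, each scraped page fans out into one record per type. A minimal sketch of that mapping (the prompt wording is illustrative, not the script's actual templates, and the `type` tag is added here only for clarity; the JSONL records themselves contain just `messages`):

```python
def build_conversations(title, summary, full_text):
    """Build the four conversation variants from one scraped page (sketch)."""
    return [
        {"type": "simple_qa", "messages": [
            {"role": "user", "content": f"Can you tell me about {title}?"},
            {"role": "assistant", "content": summary}]},
        {"type": "explained", "messages": [
            # System prompt supplies the educational framing
            {"role": "system", "content": "You are a helpful teacher who explains topics clearly."},
            {"role": "user", "content": f"Explain {title} to me."},
            {"role": "assistant", "content": full_text}]},
        {"type": "summary", "messages": [
            {"role": "user", "content": f"Give me a brief summary of {title}."},
            {"role": "assistant", "content": summary}]},
        {"type": "detailed", "messages": [
            {"role": "user", "content": f"Describe {title} in detail."},
            {"role": "assistant", "content": full_text}]},
    ]
```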

File Outputs

fandom_data/
β”œβ”€β”€ fandom_llama_dataset.jsonl    # Main training dataset
β”œβ”€β”€ dataset_summary.json          # Dataset statistics
└── fandom_scraper.log            # Scraping log

Summary File Example

{
  "total_conversations": 1247,
  "wiki_name": "starwars", 
  "domain": "starwars",
  "languages": ["en"],
  "output_file": "fandom_data/fandom_llama_dataset.jsonl",
  "creation_time": "2025-12-05 12:55:21",
  "conversations_by_type": {
    "simple_qa": 312,
    "explained": 311, 
    "summary": 312,
    "detailed": 312
  }
}
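A useful consistency check on this file: the per-type counts should sum to `total_conversations` (312 + 311 + 312 + 312 = 1247 above). A sketch, with `check_summary` a hypothetical helper:

```python
import json

def check_summary(path):
    """Verify that per-type counts in dataset_summary.json add up to the total."""
    with open(path, encoding="utf-8") as f:
        summary = json.load(f)
    counted = sum(summary["conversations_by_type"].values())
    assert counted == summary["total_conversations"], (
        f"type counts sum to {counted}, expected {summary['total_conversations']}")
    return summary
```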

Usage Examples

Example 1: Gaming Wiki

# Scrape Minecraft wiki
python fandom_scraper_to_llama.py \
    --wiki minecraft \
    --max-pages 800 \
    --workers 5 \
    --delay 1.5

Example 2: Multi-language

# Scrape in multiple languages
python fandom_scraper_to_llama.py \
    --wiki starwars \
    --languages en nl de fr es it pt \
    --max-pages 500 \
    --workers 8

Example 3: Large Dataset

# Create a large dataset with detailed content
python fandom_scraper_to_llama.py \
    --wiki memory-alpha \
    --domain star-trek \
    --max-pages 2000 \
    --include-images \
    --workers 10 \
    --delay 0.5

Training Llama 3.2

Once you have your dataset, you can fine-tune Llama 3.2 using frameworks like:

Using Unsloth (Recommended)

from unsloth import FastLanguageModel
import json

# Load your dataset
dataset = []
with open('fandom_data/fandom_llama_dataset.jsonl', 'r') as f:
    for line in f:
        dataset.append(json.loads(line))

# Fine-tune with Unsloth
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="meta-llama/Llama-3.2-3B-Instruct",
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
)

# Training code would go here...

Using Hugging Face Transformers

from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Load dataset
dataset = load_dataset('json', data_files='fandom_data/fandom_llama_dataset.jsonl')

# Initialize model and tokenizer
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

# Set padding token (Llama models ship without one)
tokenizer.pad_token = tokenizer.eos_token

# Render each conversation with the model's chat template and tokenize;
# Trainer cannot consume the raw "messages" column directly.
def tokenize(example):
    text = tokenizer.apply_chat_template(example["messages"], tokenize=False)
    return tokenizer(text, truncation=True, max_length=2048)

tokenized = dataset["train"].map(tokenize, remove_columns=dataset["train"].column_names)

# Training arguments
training_args = TrainingArguments(
    output_dir="./llama-fandom-finetuned",
    num_train_epochs=1,
    per_device_train_batch_size=1,
    save_steps=500,
    save_total_limit=2,
    logging_steps=10,
)

# Causal-LM collator pads batches and builds labels from the input ids
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

# Initialize trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized,
    data_collator=data_collator,
)

# Start training
trainer.train()

Supported Wikis

The script works with any Fandom wiki. Here are popular examples:

| Wiki Name | Domain | Example Command |
| --- | --- | --- |
| Star Wars | starwars | --wiki starwars |
| Harry Potter | harrypotter | --wiki harrypotter |
| Marvel | marvel | --wiki marvel |
| DC Comics | dc | --wiki dc |
| PokΓ©mon | pokemon | --wiki pokemon |
| Minecraft | minecraft | --wiki minecraft |
| Star Trek | star-trek | --wiki memory-alpha --domain star-trek |
| Game of Thrones | gameofthrones | --wiki gameofthrones |

Rate Limiting & Ethics

  • Built-in delays between requests to respect server load
  • Concurrent workers limited by default to avoid overwhelming servers
  • Respectful user agent identifies the scraper purpose
  • Configurable delays let you tune how aggressively the scraper runs
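The built-in delay behaves like a simple rate limiter that enforces a minimum gap between consecutive requests, mirroring the --delay option. A minimal sketch of the idea, not the script's actual implementation:

```python
import time

class Throttle:
    """Enforce a minimum delay (in seconds) between consecutive calls."""

    def __init__(self, delay):
        self.delay = delay
        self._last = 0.0

    def wait(self):
        # Sleep only for whatever portion of the delay has not yet elapsed
        elapsed = time.monotonic() - self._last
        if elapsed < self.delay:
            time.sleep(self.delay - elapsed)
        self._last = time.monotonic()
```

Each worker would call `wait()` immediately before issuing a request, so bursts of back-to-back fetches are spaced out automatically.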

Troubleshooting

Common Issues

  1. Import errors:
pip install -r requirements.txt
  2. Network timeouts:
# Increase delay between requests
python fandom_scraper_to_llama.py --wiki starwars --delay 3.0
  3. Memory issues:
# Reduce max pages and workers
python fandom_scraper_to_llama.py --wiki starwars --max-pages 100 --workers 1
  4. No pages found:
# Try a different wiki name or check the domain
python fandom_scraper_to_llama.py --wiki starwars --domain starwars

Logging

Check fandom_scraper.log for detailed information about the scraping process.

Performance Tips

  1. Batch Processing: Use --workers 5-10 for faster scraping
  2. Memory Management: Limit --max-pages for large wikis
  3. Network: Use --delay 0.5 for fast connections, --delay 2.0 for slow ones
  4. Disk Space: Each conversation is ~500-1000 characters
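Tip 4 translates into a back-of-envelope disk estimate: at roughly 750 characters per conversation and four conversations per page, 1,000 pages come to about 3 MB. A sketch (the helper name and default values are illustrative):

```python
def estimate_dataset_bytes(pages, avg_chars=750, types_per_page=4):
    """Rough disk-space estimate: pages x conversations-per-page x avg size.

    avg_chars is the midpoint of the ~500-1000 character range per
    conversation; each page yields four conversation types.
    """
    return pages * types_per_page * avg_chars
```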

Dataset Quality

  • Content Filtering: Removes pages with insufficient content
  • Multiple Formats: Creates diverse conversation types for better training
  • Context Preservation: Maintains original content structure
  • No HTML Artifacts: Clean text extraction with proper formatting
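The content filter can be as simple as a length threshold plus a redirect check. A sketch of the idea (the 200-character threshold and the `is_trainable` name are assumptions, not the script's actual values):

```python
MIN_CONTENT_CHARS = 200  # hypothetical threshold, not the script's real value

def is_trainable(text):
    """Reject pages that are too short or are bare redirect stubs."""
    stripped = text.strip()
    if len(stripped) < MIN_CONTENT_CHARS:
        return False
    # MediaWiki redirect pages start with "#REDIRECT" and carry no content
    if stripped.lower().startswith("#redirect"):
        return False
    return True
```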

License

This tool is for educational and research purposes. Please respect the terms of service of Fandom wikis and use responsibly.

Contributing

Feel free to submit issues and enhancement requests!
