NetPlag is a distributed plagiarism detection system designed to process and analyze large volumes of academic documents using big data technologies. The system employs TF-IDF vectorization and cosine similarity algorithms to identify potential plagiarism cases in both batch and real-time streaming modes. Built on Apache Spark and Hadoop HDFS, it provides scalable storage, distributed processing, and real-time analytics through Elasticsearch integration and an interactive web dashboard.
- Apache Spark 3.5.7: Distributed data processing engine for batch and streaming operations
- Hadoop HDFS: Distributed file system for storing documents, models, and results
- Elasticsearch 8.11.0: Search engine for indexing and querying plagiarism results
- Flask 2.3+: Web framework powering the interactive dashboard
- Python 3.11: Primary programming language
- Docker: Containerization platform for HDFS and Elasticsearch services
- NameNode & DataNode: HDFS cluster for distributed storage (Docker containers)
- Dashboard Service: Flask-based web application (containerized)
- Elasticsearch Service: Search and analytics engine (containerized)
- Spark Processing: Runs on Windows host or Docker for distributed computing
Purpose: Establish HDFS directory structure and prepare environment
Key Scripts:
- 0_migrate_to_hdfs.py: Creates HDFS directory structure with proper permissions
- migrate_fast.ps1: Fast bulk file migration using Docker volume mounts and tar archives
Directory Structure Created:
/netplag/
├── data/
│   ├── corpus_initial/        # Reference corpus
│   ├── stream_input/          # Incoming documents for streaming
│   └── stream_source/         # Source files for simulation
└── storage/
    ├── idf_model/             # TF-IDF model
    ├── reference_vectors/     # Vectorized reference corpus
    ├── streaming_vectors/     # Vectorized streaming documents
    ├── plagiarism_results/    # Detection results
    └── reports/               # Analysis reports
Technical Features:
- Docker exec-based directory creation to avoid Windows permission issues
- Batch processing (500 files per batch) using tar archives for efficient migration
- Automatic HDFS connection verification before operations
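The tar-based batching described above can be sketched in a few lines of standard-library Python. This is only an illustration of the idea, not the contents of migrate_fast.ps1 (which is a PowerShell script): the function names `chunked` and `build_tar_batches` and the archive naming are hypothetical, and the step that copies each archive into the container and extracts it into HDFS is left out.

```python
import tarfile
from pathlib import Path

BATCH_SIZE = 500  # files per archive, as stated above


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_tar_batches(src_dir, out_dir, batch_size=BATCH_SIZE):
    """Pack the .txt files of src_dir into numbered tar archives in out_dir.

    Returns the list of archive paths (hypothetical naming: batch_0000.tar, ...).
    """
    files = sorted(Path(src_dir).glob("*.txt"))
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    archives = []
    for i, batch in enumerate(chunked(files, batch_size)):
        archive = out / f"batch_{i:04d}.tar"
        with tarfile.open(str(archive), "w") as tar:
            for f in batch:
                # Store files flat inside the archive, without local directories
                tar.add(str(f), arcname=f.name)
        archives.append(archive)
    return archives
```

Moving one archive per 500 files replaces hundreds of per-file transfers with a single copy, which is where the speed-up comes from.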
Script: 1_batch_init.py
Purpose: Build reference corpus with TF-IDF features
Process Flow:
- Document Reading: Load .txt files from HDFS corpus directory
- Text Preprocessing:
  - Convert to lowercase
  - Remove special characters (keep only letters, digits, and whitespace)
  - Filter words shorter than 3 characters
  - Truncate to 50,000 characters per document
- Feature Extraction:
  - HashingTF: Convert text to term frequency vectors (5,000 features)
  - IDF: Calculate inverse document frequency to weight terms
- Model Persistence:
  - Save IDF model to HDFS (/netplag/storage/idf_model)
  - Save reference vectors to HDFS (/netplag/storage/reference_vectors)
Technical Implementation:
- Uses Spark DataFrame API for distributed processing
- Implements custom UDF (User Defined Function) for text cleaning
- Employs Parquet format for efficient columnar storage
- Handles Java security configuration for Windows compatibility
Script: 2_streaming_app.py
Purpose: Real-time plagiarism detection for incoming documents
Process Flow:
- Stream Monitoring: Watch HDFS stream_input directory for new .txt files
- Micro-Batch Processing:
  - Trigger every 5 seconds
  - Process up to 10 files per batch
- Vectorization: Apply TF-IDF transformation using pre-trained model
- Similarity Calculation: Compare against reference corpus using cosine similarity
- Result Storage:
  - Save streaming vectors to HDFS
  - Store plagiarism results with scores
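As a rough sketch of how the monitoring and micro-batch settings above map onto Spark Structured Streaming, the function below wires a file source to a per-batch handler. It is not the code of 2_streaming_app.py: the function name `start_stream`, the handler argument, and the exact URLs are assumptions, and the vectorization and similarity work is assumed to live inside `process_batch`.

```python
# Assumed locations, built from the paths and host name used in this document
STREAM_INPUT = "hdfs://localhost:8020/netplag/data/stream_input"
CHECKPOINT_DIR = "hdfs://localhost:8020/netplag/checkpoints"


def start_stream(spark, process_batch):
    """Start a streaming query over new .txt files in the input directory.

    `spark` is an existing SparkSession; `process_batch(batch_df, batch_id)`
    is called once per micro-batch with the newly arrived documents.
    """
    source = (
        spark.readStream.format("text")
        .option("wholetext", "true")        # one row per file, not per line
        .option("pathGlobFilter", "*.txt")  # ignore other file types
        .option("maxFilesPerTrigger", 10)   # at most 10 files per micro-batch
        .load(STREAM_INPUT)
    )
    return (
        source.writeStream.foreachBatch(process_batch)
        .option("checkpointLocation", CHECKPOINT_DIR)  # HDFS-based checkpoint
        .trigger(processingTime="5 seconds")
        .start()
    )
```

Calling `start_stream(spark, handler).awaitTermination()` would keep the query running until it is stopped.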
Technical Features:
- Spark Structured Streaming with checkpoint-based fault tolerance
- HDFS-based checkpointing to avoid Windows native IO issues
- Broadcast join optimization for reference vectors
- Custom similarity computation using SparseVector operations
- Configurable plagiarism threshold (default: 0.7)
Key Algorithm - Cosine Similarity:
similarity = (v1 · v2) / (||v1|| × ||v2||)
Where v1 and v2 are TF-IDF vectors of compared documents.
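A small worked example may help make the formula concrete. The sketch below uses plain dictionaries of the form {feature index: weight} as stand-ins for Spark sparse vectors; the sample vectors are made up for illustration.

```python
import math


def cosine_similarity(v1, v2):
    """Cosine similarity of two sparse vectors given as {index: weight} dicts."""
    # Only indices present in both vectors contribute to the dot product.
    dot = sum(w * v2[i] for i, w in v1.items() if i in v2)
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # an empty document matches nothing
    return dot / (norm1 * norm2)


doc = {0: 1.0, 3: 2.0, 7: 2.0}  # ||doc|| = sqrt(1 + 4 + 4) = 3
ref = {0: 2.0, 3: 1.0, 9: 2.0}  # ||ref|| = sqrt(4 + 1 + 4) = 3
score = cosine_similarity(doc, ref)  # (1*2 + 2*1) / (3 * 3) = 4/9
print(round(score, 3), score > 0.7)  # prints: 0.444 False
```

With the default threshold of 0.7, this pair would not be flagged.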
Script: 4_plagiarism_analysis.py
Purpose: Comprehensive analysis of all streaming documents
Process Flow:
- Data Loading:
  - Load streaming vectors from HDFS
  - Load reference vectors from HDFS
- Similarity Matrix Computation:
  - Cross join: streaming docs × reference docs
  - Calculate cosine similarity for each pair
  - Filter by plagiarism threshold
- Statistical Analysis:
  - Total plagiarism cases detected
  - Maximum similarity score
  - Average similarity score
  - Per-document summaries
- Report Generation:
  - Detailed results (Parquet format)
  - Summary statistics (Parquet format)
  - JSON export for easy reading
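The statistical step can be illustrated without Spark. The sketch below aggregates (document, reference, score) triples into per-document summaries; the function name `summarize` is hypothetical, and the field names are borrowed from the analysis_results index described later in this document.

```python
from collections import defaultdict
from statistics import mean

THRESHOLD = 0.7  # default plagiarism threshold from this document


def summarize(pairs, threshold=THRESHOLD):
    """Aggregate (document, reference, score) triples into per-document stats.

    Only pairs whose score exceeds the threshold count as matches, so a
    document without any match does not appear in the result.
    """
    by_doc = defaultdict(list)
    for document, _reference, score in pairs:
        if score > threshold:
            by_doc[document].append(score)
    return {
        doc: {
            "num_matches": len(scores),
            "max_score": max(scores),
            "avg_score": mean(scores),
        }
        for doc, scores in by_doc.items()
    }


pairs = [
    ("a.txt", "r1.txt", 0.9),
    ("a.txt", "r2.txt", 0.8),
    ("a.txt", "r3.txt", 0.2),
    ("b.txt", "r1.txt", 0.75),
]
summary = summarize(pairs)
```

Here a.txt ends up with two matches and b.txt with one; in the pipeline itself the same aggregation would presumably be a Spark groupBy over the filtered similarity matrix.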
Output Files on HDFS:
- /netplag/storage/reports/detailed_results: Full similarity matrix
- /netplag/storage/reports/summary: Per-document statistics
- /netplag/storage/reports/plagiarism_cases.json: Human-readable report
Script: 6_elasticsearch_indexer.py
Purpose: Index results for fast search and visualization
Elasticsearch Indices Created:
- plagiarism_reports: Detailed detection results
  - document_filename: Source document
  - reference_filename: Matched reference
  - similarity_score: Cosine similarity (0-1)
  - is_plagiarism: Boolean flag (threshold-based)
  - timestamp: Analysis timestamp
- analysis_results: Document-level summaries
  - document_filename: Document name
  - num_matches: Count of plagiarism matches
  - max_score: Highest similarity score
  - avg_score: Average similarity score
  - analysis_timestamp: Analysis time
- documents: Document metadata (reserved for future use)
Technical Implementation:
- Bulk indexing for performance (1,000 docs per batch)
- Automatic index creation with proper mappings
- Connection retry logic for Docker environment
- Index refresh for immediate searchability
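To show what bulk indexing in batches of 1,000 might look like, the sketch below builds action dictionaries for the plagiarism_reports index and groups them into batches. It deliberately stops short of talking to Elasticsearch; the helper names `to_actions` and `batches` are hypothetical, and the field names follow the index description above.

```python
from datetime import datetime, timezone

INDEX_NAME = "plagiarism_reports"
BULK_SIZE = 1000  # documents per bulk request, as stated above
THRESHOLD = 0.7


def to_actions(rows, index=INDEX_NAME, threshold=THRESHOLD):
    """Turn (document, reference, score) rows into bulk action dicts."""
    now = datetime.now(timezone.utc).isoformat()
    for document, reference, score in rows:
        yield {
            "_index": index,
            "_source": {
                "document_filename": document,
                "reference_filename": reference,
                "similarity_score": float(score),
                "is_plagiarism": score > threshold,
                "timestamp": now,
            },
        }


def batches(actions, size=BULK_SIZE):
    """Group an iterable of actions into lists of at most `size` items."""
    batch = []
    for action in actions:
        batch.append(action)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch
```

Each batch has the shape accepted by `elasticsearch.helpers.bulk(client, batch)` in the official Python client; that helper also does its own chunking, so explicit batching like this is one option rather than a requirement.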
Script: 7_dashboard.py
Template: templates/dashboard.html
Purpose: Interactive visualization of plagiarism detection results
Features:
- Real-Time Statistics:
  - Total cases analyzed
  - Confirmed plagiarism count
  - High similarity cases (>0.8)
  - Average/Max/Min similarity scores
- Interactive Visualizations:
  - Similarity Distribution Histogram: Shows distribution of similarity scores in 0.1 intervals
  - Top Documents Chart: Bar chart of documents with most matches
  - Analysis Summary Table: Document-level statistics with sorting
- Search & Filter:
  - Search by document or reference filename
  - Filter by minimum similarity score
  - Pagination support (20 results per page)
- RESTful API Endpoints:
  - /api/stats: Overall statistics
  - /api/plagiarism_cases: Paginated case list
  - /api/similarity_distribution: Histogram data
  - /api/top_documents: Most matched documents
  - /api/analysis_summary: Summary table data
  - /api/search: Search functionality
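The histogram behind /api/similarity_distribution groups scores into 0.1-wide intervals. A minimal sketch of that bucketing is shown below; the function name and the response shape are assumptions, and the real dashboard may well delegate this to an Elasticsearch histogram aggregation instead.

```python
def similarity_histogram(scores, bins=10):
    """Count scores in [0, 1] into equal-width buckets (0.1 wide for bins=10)."""
    counts = [0] * bins
    for s in scores:
        # A score of exactly 1.0 is placed in the last bucket.
        idx = min(int(s * bins), bins - 1)
        counts[idx] += 1
    return [
        {"range": f"{i / bins:.1f}-{(i + 1) / bins:.1f}", "count": c}
        for i, c in enumerate(counts)
    ]


histogram = similarity_histogram([0.05, 0.72, 0.78, 0.95, 1.0])
```

The result is a list of ten entries that a Chart.js bar chart could consume directly, one bar per interval.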
Technical Stack:
- Backend: Flask with Elasticsearch client
- Frontend: HTML5, CSS3, JavaScript, Chart.js
- Deployment: Docker container (Python 3.11-slim)
- Auto-refresh every 30 seconds
- Responsive design for all screen sizes
Script: 8_full_streamprocess.py
Purpose: All-in-one automated pipeline combining Steps 2, 3, and 4
Workflow:
- File Detection: Monitor HDFS stream_input directory
- Streaming Processing: Apply TF-IDF and detect plagiarism (Step 2)
- Automatic Batch Analysis: Analyze all streaming vectors after each batch (Step 3)
- Automatic Indexing: Index results to Elasticsearch immediately (Step 4)
Key Advantages:
- Fully Automated: No manual intervention required
- Real-Time Results: Immediate availability in dashboard
- Production-Ready: Complete pipeline for deployment
- Efficient: Single process handles entire workflow
Configuration:
- Batch size: 10 files per trigger
- Trigger interval: 5 seconds
- Plagiarism threshold: 0.7
- Top-K results: 10
Preprocessing Pipeline:
- Text normalization (lowercase conversion)
- Special character removal (regex: [^a-z0-9\s])
- Tokenization (split by whitespace)
- Short word filtering (min length: 3 characters)
- Document truncation (max: 50,000 characters)
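The preprocessing steps listed above fit in a short function. The version below is a sketch under two assumptions that the text does not settle: special characters are replaced by a space rather than deleted, and truncation is applied after filtering, in the order the steps are listed. The name `clean_text` is hypothetical.

```python
import re

MAX_CHARS = 50_000
MIN_WORD_LEN = 3
_SPECIAL = re.compile(r"[^a-z0-9\s]")  # regex quoted in the pipeline above


def clean_text(text):
    """Lowercase, strip special characters, drop short words, and truncate."""
    text = text.lower()
    text = _SPECIAL.sub(" ", text)  # assumption: replace with a space
    words = [w for w in text.split() if len(w) >= MIN_WORD_LEN]
    return " ".join(words)[:MAX_CHARS]


print(clean_text("Hello, World! It is a TEST-case of 2024."))
# prints: hello world test case 2024
```

In the pipeline, a function like this would be wrapped in a Spark UDF and applied to the text column before HashingTF.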
TF-IDF Feature Extraction:
- Feature Space: 5,000 dimensions (HashingTF)
- IDF Min Document Frequency: 2 (reduces noise)
- Vector Format: SparseVector (memory-efficient)
Cosine Similarity Formula:
    import numpy as np

    def cosine_similarity(v1, v2):
        # Densify once; works for both SparseVector and DenseVector
        a, b = v1.toArray(), v2.toArray()
        norm1, norm2 = np.linalg.norm(a), np.linalg.norm(b)
        if norm1 == 0 or norm2 == 0:
            return 0.0  # zero vector: similarity is undefined, report 0
        return float(np.dot(a, b) / (norm1 * norm2))

Implementation Details:
- Custom UDF for Spark DataFrame operations
- Supports both SparseVector and DenseVector
- Broadcast optimization for reference corpus
- Window functions for top-K selection
Key Workarounds:
- DataNode Hostname Resolution:
  - Force dfs.client.use.datanode.hostname=true
  - Maps DataNode internal IP to localhost:9866
- Checkpoint Storage:
  - Use HDFS checkpoints instead of local filesystem
  - Avoids Windows native IO library issues
- Java Security File Issues:
  - Create temporary security file in temp directory
  - Pass custom Java options to Spark executors
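One way these workarounds could be expressed as Spark settings is sketched below. Everything here is an assumption about how the project wires them up: the function name `windows_spark_conf` is hypothetical, the contents of the temporary security file are not described in this document (the sketch leaves it empty), and the checkpoint URL is built from paths mentioned elsewhere in the text.

```python
import os
import tempfile


def windows_spark_conf(checkpoint_dir="hdfs://localhost:8020/netplag/checkpoints"):
    """Return Spark settings reflecting the workarounds listed above."""
    # Create a temporary security properties file; its real contents are
    # project-specific and not shown here, so this one stays empty.
    fd, security_file = tempfile.mkstemp(suffix=".security")
    os.close(fd)
    java_opts = f"-Djava.security.properties={security_file}"
    return {
        # Make the HDFS client contact DataNodes by hostname, not internal IP
        "spark.hadoop.dfs.client.use.datanode.hostname": "true",
        # Keep streaming checkpoints on HDFS rather than the local filesystem
        "spark.sql.streaming.checkpointLocation": checkpoint_dir,
        "spark.driver.extraJavaOptions": java_opts,
        "spark.executor.extraJavaOptions": java_opts,
    }


conf = windows_spark_conf()
```

Each key/value pair could then be passed to `SparkSession.builder.config(key, value)` before the session is created.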
Network Configuration:
    # docker-compose.yml
    services:
      namenode:
        ports:
          - "8020:8020"   # RPC
          - "9870:9870"   # Web UI
      datanode:
        ports:
          - "9864:9864"   # Web UI
          - "9866:9866"   # Data transfer

- NameNode:
  - Image: bde2020/hadoop-namenode:2.0.0-hadoop3.2.1-java8
  - Persistent volume: hadoop_namenode
  - Health check: HDFS admin report
  - Memory: 2GB limit, 1GB reserved
- DataNode:
  - Image: bde2020/hadoop-datanode:2.0.0-hadoop3.2.1-java8
  - Persistent volume: hadoop_datanode
  - Depends on NameNode health
  - Memory: 2GB limit, 1GB reserved
- Elasticsearch:
  - Image: docker.elastic.co/elasticsearch/elasticsearch:8.11.0
  - Single-node cluster, security disabled
  - Persistent volume: elasticsearch_data
  - Memory: 512MB JVM heap
- Dashboard:
  - Custom image from Dockerfile.dashboard
  - Based on Python 3.11-slim
  - Auto-restart on failure
  - Health check: /api/stats endpoint
Persistent Volumes:
- hadoop_namenode: HDFS metadata
- hadoop_datanode: HDFS data blocks
- elasticsearch_data: Elasticsearch indices
Backup Strategy:
- backup_docker.ps1: Container and volume backup
- backup_hdfs_data.ps1: HDFS data export
- restore_docker.ps1: Full restoration
Environment Detection:
    import os

    IN_DOCKER = os.path.exists('/.dockerenv')
    HDFS_BASE_URL = "hdfs://namenode:8020" if IN_DOCKER else "hdfs://localhost:8020"

Key Paths:
- Base: /netplag/
- Corpus: /netplag/data/corpus_initial/
- Stream input: /netplag/data/stream_input/
- IDF model: /netplag/storage/idf_model/
- Reference vectors: /netplag/storage/reference_vectors/
- Reports: /netplag/storage/reports/
Connection Settings:
    ES_URL = "http://elasticsearch:9200" if IN_DOCKER else "http://localhost:9200"

Index Mappings:
- Dynamic field detection disabled
- Explicit type definitions for all fields
- Optimized for aggregations and range queries
Functions:
- cosine_similarity_sparse(v1, v2): Core similarity calculation
- cosine_similarity_udf(): Spark UDF wrapper
- compute_similarity_matrix(df1, df2, ...): Batch similarity computation
- find_plagiarism_candidates(...): Complete plagiarism detection workflow
Features:
- Handles both SparseVector and DenseVector
- Window-based top-K selection
- Configurable threshold and result count
- Optimized cross-join with broadcast
Capabilities:
- Single document analysis
- Batch document analysis
- ES-ready JSON export format
- Independent of streaming pipeline
Use Cases:
- Ad-hoc document checking
- Testing and validation
- Manual analysis workflows
    # 1. Start services
    docker-compose up -d
    # 2. Create HDFS structure
    python scripts/0_migrate_to_hdfs.py
    # 3. Migrate corpus
    .\migrate_fast.ps1
    # 4. Initialize reference corpus
    python scripts/1_batch_init.py

    # Option 1: Manual pipeline
    python scripts/2_streaming_app.py          # Stream processing
    python scripts/4_plagiarism_analysis.py    # Batch analysis
    python scripts/6_elasticsearch_indexer.py  # Indexing
    # Option 2: Automated pipeline (recommended)
    python scripts/8_full_streamprocess.py     # All-in-one

    # Docker deployment
    docker-compose up -d dashboard
    # Access: http://localhost:5000
    # Local deployment
    python scripts/7_dashboard.py
    # Access: http://localhost:5000

- Batch Processing: Handles 500+ documents efficiently
- Streaming: Near-real-time processing with a 5-second micro-batch trigger
- Elasticsearch: Sub-second query response times
- Dashboard: Auto-refresh maintains responsiveness
- Broadcast Joins: Reference corpus broadcast to all workers
- Parquet Storage: Columnar format with built-in compression (the achievable ratio depends on the data)
- SparseVector: Memory-efficient representation
- Bulk Indexing: 1,000 documents per Elasticsearch batch
- Checkpointing: Fault-tolerant streaming; exactly-once results additionally require replayable sources and idempotent sinks
- Java 17: Required for Spark/Hadoop
- Python 3.11: With PySpark, NumPy, Pandas, Flask, Elasticsearch client
- Docker Desktop: For HDFS and Elasticsearch containers
- Windows 10/11 or Linux: Tested platforms
- CPU: 4+ cores for parallel processing
- RAM: 8GB minimum (16GB recommended)
- Storage: 50GB+ for corpus and results
- Network: Gigabit for Docker networking
- Hybrid Architecture: Windows host + Docker containers
- Windows Compatibility: Comprehensive workarounds for HDFS networking
- Fault Tolerance: Checkpoint-based recovery for streaming
- Real-Time Analytics: Streaming + batch analysis + instant indexing
- Full Automation: Single-command deployment with 8_full_streamprocess.py
- Visual Dashboard: Chart.js-powered interactive visualizations
- RESTful API: Programmatic access to all functionality
- Search & Filter: Advanced query capabilities via Elasticsearch
- Auto-Refresh: Real-time updates without manual intervention
┌─────────────────┐
│ Corpus Files │
│ (HDFS) │
└────────┬────────┘
│
▼
┌─────────────────────────┐
│ Step 1: Batch Init │
│ - TF-IDF Training │
│ - Reference Vectors │
└────────┬────────────────┘
│
▼
┌─────────────────────────┐
│ IDF Model + Ref Vectors│
│ (Stored in HDFS) │
└─────────────────────────┘
│
│ ┌──────────────────┐
│ │ Stream Input │
│ │ (New Documents) │
│ └────────┬─────────┘
│ │
▼ ▼
┌─────────────────────────────────┐
│ Step 2: Streaming Processing │
│ - Apply TF-IDF Model │
│ - Compute Similarity │
│ - Detect Plagiarism │
└────────┬────────────────────────┘
│
▼
┌─────────────────────────────────┐
│ Step 3: Batch Analysis │
│ - Aggregate Results │
│ - Generate Reports │
│ - Calculate Statistics │
└────────┬────────────────────────┘
│
▼
┌─────────────────────────────────┐
│ Step 4: Elasticsearch Indexing │
│ - Index Plagiarism Reports │
│ - Index Analysis Summaries │
└────────┬────────────────────────┘
│
▼
┌─────────────────────────────────┐
│ Step 5: Web Dashboard │
│ - Visualizations │
│ - Search & Filter │
│ - Real-time Statistics │
└─────────────────────────────────┘
Term Frequency (TF):
TF(t, d) = (Number of times term t appears in document d) / (Total terms in document d)
Inverse Document Frequency (IDF):
IDF(t, D) = log((Total documents in corpus D) / (Documents containing term t))
TF-IDF Weight:
TF-IDF(t, d, D) = TF(t, d) × IDF(t, D)
Implementation:
- Uses HashingTF for efficiency (5,000 hash buckets)
- IDF model trained on reference corpus
- Resulting vectors are sparse (typically <1% non-zero values)
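The three formulas above can be checked on a toy corpus. The sketch below implements them literally in plain Python, without hashing; the function names and the four sample documents are made up for illustration.

```python
import math
from collections import Counter


def tf(tokens):
    """Relative term frequency: count / total terms in the document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {t: c / total for t, c in counts.items()}


def idf(corpus):
    """IDF(t, D) = log(N / df(t)) over a list of tokenized documents."""
    n = len(corpus)
    df = Counter(t for doc in corpus for t in set(doc))
    return {t: math.log(n / d) for t, d in df.items()}


def tf_idf(tokens, idf_weights):
    """Weight each term's TF by its IDF; terms unseen in the corpus get 0."""
    return {t: f * idf_weights.get(t, 0.0) for t, f in tf(tokens).items()}


corpus = [
    ["spark", "hdfs", "spark"],
    ["spark", "flask"],
    ["hdfs", "docker"],
    ["spark", "docker"],
]
weights = idf(corpus)
vec = tf_idf(corpus[0], weights)
# "spark": TF = 2/3, IDF = log(4/3); "hdfs": TF = 1/3, IDF = log(4/2)
```

Spark MLlib differs in two details: HashingTF emits raw term counts at hashed indices, and IDF uses the smoothed form log((N + 1) / (df + 1)). Because cosine similarity ignores vector length, using raw counts instead of relative frequencies does not change the similarity scores.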
Similarity Threshold: 0.7 (configurable)
Detection Steps:
- For each new document D:
  - Vectorize using trained IDF model
  - For each reference document R:
    - Compute cosine_similarity(D, R)
    - If similarity > threshold: Flag as potential plagiarism
  - Return top-K most similar documents
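The flagging and top-K steps can be sketched for a single document once its similarity scores are known. The function name `top_matches` is hypothetical, the scores are assumed to be precomputed, and applying the threshold before the top-K cut is one reading of the steps above.

```python
import heapq

THRESHOLD = 0.7  # default threshold from this document
TOP_K = 10       # default top-K from this document


def top_matches(scores, threshold=THRESHOLD, k=TOP_K):
    """Return up to k (reference, score) pairs above the threshold, best first.

    `scores` maps reference filename -> cosine similarity for one document.
    """
    flagged = [(ref, s) for ref, s in scores.items() if s > threshold]
    return heapq.nlargest(k, flagged, key=lambda pair: pair[1])


scores = {"r1.txt": 0.95, "r2.txt": 0.71, "r3.txt": 0.7, "r4.txt": 0.3, "r5.txt": 0.88}
best_two = top_matches(scores, k=2)
```

Note that r3.txt, with a score of exactly 0.7, is not flagged because the comparison is strict. In Spark, the same selection would typically be a filter followed by a window function ranking scores per document.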
Complexity:
- Time: O(N × M × F) where N = streaming docs, M = reference docs, F = features
- Space: O(N × F + M × F) for sparse vectors
- Location: HDFS /netplag/checkpoints/
- Frequency: Every micro-batch (5 seconds)
- Recovery: Automatic on restart, continues from last checkpoint
- Detection: Health checks on NameNode and DataNode
- Recovery: Docker restart policies (automatic restart)
- Fallback: Backup and restore scripts available
- Retry Logic: 3 attempts with exponential backoff
- Batch Processing: Continues with next batch on persistent failures
- Logging: Detailed error messages to console
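Retry with exponential backoff, as described above, can be captured in a small helper. This is a generic sketch rather than the project's code: the name `retry`, the 1-second base delay, and the injectable `sleep` argument (useful for testing) are all assumptions.

```python
import time


def retry(operation, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call operation(); on failure wait base_delay * 2**n, then try again.

    Re-raises the last exception once all attempts are used up.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # persistent failure: let the caller decide what to do
            sleep(base_delay * 2 ** attempt)
```

With three attempts and a 1-second base delay, the waits are 1 s and 2 s. A caller that wants to continue with the next batch on persistent failure would catch the re-raised exception, log it, and move on.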
- Elasticsearch: Security disabled (development mode)
- HDFS: Basic file permissions, no Kerberos
- Dashboard: No authentication (intended for local use)
- Enable Elasticsearch security (TLS, authentication)
- Configure HDFS with Kerberos authentication
- Add dashboard authentication (OAuth, LDAP)
- Implement network isolation (VPC, firewalls)
- Enable HDFS encryption at rest
- Add audit logging for all operations
- Machine Learning:
  - Deep learning models (BERT, transformers) for semantic similarity
  - Paraphrase detection using neural networks
  - Automatic threshold optimization
- Scalability:
  - Multi-node Spark cluster for larger corpora
  - Distributed Elasticsearch cluster
  - Horizontal scaling with Kubernetes
- Features:
  - Document upload interface in dashboard
  - Email notifications for plagiarism detection
  - Citation analysis and exclusion
  - Multi-language support
- Performance:
  - GPU acceleration for similarity computation
  - Incremental IDF model updates
  - Caching frequently accessed reference vectors
NetPlag represents a complete big data solution for plagiarism detection, demonstrating:
- Distributed computing with Apache Spark
- Scalable storage with Hadoop HDFS
- Real-time processing with Structured Streaming
- Advanced search with Elasticsearch
- Modern web interfaces with Flask and Chart.js
The system successfully bridges Windows development environments with Linux-based big data tools through Docker containerization and comprehensive compatibility layers. The modular architecture allows both step-by-step execution for learning and automated pipelines for production deployment.
Production-Ready Features:
✓ Fault-tolerant streaming
✓ Persistent storage with HDFS
✓ Searchable results via Elasticsearch
✓ Visual analytics dashboard
✓ Comprehensive backup/restore
✓ Docker-based deployment
This system can scale from academic research projects to enterprise-grade plagiarism detection with minimal architectural changes.
- config/hdfs_config.py: HDFS paths and Spark configuration
- config/elasticsearch_config.py: ES connection and index mappings
- hadoop-config/hdfs-site.xml: HDFS cluster configuration
- docker-compose.yml: Multi-container orchestration
- requirements.txt: Python dependencies
- scripts/0_migrate_to_hdfs.py: HDFS setup
- scripts/1_batch_init.py: Reference corpus initialization
- scripts/2_streaming_app.py: Real-time processing
- scripts/3_simulateur.py: Stream simulation tool
- scripts/4_plagiarism_analysis.py: Batch analysis
- scripts/5_standalone_plagiarism_check.py: Manual checker
- scripts/6_elasticsearch_indexer.py: ES integration
- scripts/7_dashboard.py: Web dashboard
- scripts/8_full_streamprocess.py: Complete pipeline
- scripts/similarity.py: Similarity computation module
- migrate_fast.ps1: Fast HDFS migration
- backup_docker.ps1: Container backup
- backup_hdfs_data.ps1: Data backup
- restore_docker.ps1: System restoration
- run_spark_docker.ps1: Docker Spark execution
- templates/dashboard.html: Dashboard UI
- PROJECT_SUMMARY.md: Quick reference
- SETUP_TRACKING.md: Setup progress tracker
- DOCKER_SPARK_README.md: Docker deployment guide