This repository presents a comprehensive time series analysis and forecasting project based on historical monthly sunspot observations. Sunspot activity is a well-studied natural phenomenon characterized by long-term cyclic behavior, making it a classic benchmark problem for time series modeling and forecasting.
The main objective of this project is to analyze the temporal structure of sunspot cycles and evaluate the performance of different forecasting approaches under a consistent and leakage-aware experimental setup. The workflow follows a standard data science pipeline, progressing from data understanding and exploratory analysis to feature construction and predictive modeling, with an emphasis on interpretability, temporal consistency, and fair model comparison.
- Model 1 – Linear Regression, Ridge & Lasso (Regularized Linear Baselines)
- Model 2 – Support Vector Regression (SVR, RBF Kernel)
- Model 3 – Random Forest Regressor
- Model 4 – LightGBM (Gradient Boosting Decision Trees)
- Model 5 – MLP Regressor (Baseline Neural Network)
- Model 5.1 – DNN (Advanced MLP, PyTorch)
- Model 6 – K-Nearest Neighbors (KNN, PyTorch – GPU Accelerated)
- Model 7 – Stacked LSTM (PyTorch)
Sunspots are temporary phenomena on the Sun’s photosphere that appear as dark spots compared to the surrounding areas. They are regions of reduced surface temperature caused by intense magnetic field activity that inhibits convection. Sunspots typically appear in pairs with opposite magnetic polarity, and their frequency varies according to an approximately 11-year solar cycle.
The dataset used in this project consists of monthly mean total sunspot numbers, representing the average number of observed sunspots per month over a long historical period. Due to its length and well-documented cyclical structure, this dataset is widely used in time series analysis, forecasting, and signal processing research.
- Frequency: Monthly
- Time span: January 1749 – August 2017
- Target variable: Monthly mean total sunspot number
The raw dataset is stored at:
data/raw/sunspots.csv
The sunspot dataset originates from the Solar Influences Data Analysis Center (SIDC), which is part of the Royal Observatory of Belgium and specializes in solar physics research. The data is publicly available and has been distributed through platforms such as Quandl.
- Primary source: Solar Influences Data Analysis Center (SIDC)
- Institution: Royal Observatory of Belgium
- Reference: https://en.wikipedia.org/wiki/Sunspot
The authors acknowledge SIDC and related data providers for making this long-term solar activity dataset openly accessible.
Before applying any transformations or modeling techniques, the dataset is inspected to ensure its suitability for time series analysis. This initial inspection focuses on verifying:
- Temporal continuity and correct chronological ordering
- Presence of missing, duplicated, or invalid values
- Overall scale, variability, and range of the target variable
Basic summary statistics and head–tail inspections confirm that the dataset is clean, numeric, and well-structured. No imputation or interpolation is required at this stage, allowing the analysis to proceed directly to exploratory data analysis without altering the original signal.
Exploratory Data Analysis is conducted to understand the temporal dynamics, statistical properties, and anomaly behavior of the sunspot time series. The analysis focuses on visual inspection, temporal dependencies, distributional characteristics, and the identification of unusual patterns that may influence forecasting performance.
All EDA figures referenced in this section are stored under the assets/ directory.
This visualization provides a global overview of monthly sunspot activity over time. The plot shows pronounced cyclic behavior with repeating peaks and troughs that align with known solar cycles. Variations in peak magnitudes across cycles suggest non-stationary variance, motivating the use of both trend-aware and nonlinear forecasting models.
The heat map visualizes sunspot intensity across months and years simultaneously. Vertical bands of high intensity reveal the temporal extent of solar maxima, while darker regions indicate prolonged solar minima. The lack of strong horizontal patterns confirms that intra-year seasonality is weak compared to long-term cyclic behavior.
The distribution of sunspot values is right-skewed, with a high density of low-activity observations and a long tail corresponding to strong solar cycles. Box plot analysis further highlights the presence of extreme values, which can disproportionately affect error metrics during evaluation.
The autocorrelation plot shows strong positive correlation across many lags, confirming long-term temporal dependence. This persistence indicates that past observations contain substantial predictive information even at distant time steps.
The lag-1 scatter plot demonstrates a nonlinear relationship between consecutive observations. While a general positive dependency exists, the dispersion increases at higher values, suggesting that linear assumptions may be insufficient for capturing the underlying dynamics.
Seasonal pattern analysis confirms that sunspot activity does not follow a strong monthly seasonal cycle. Instead, variability is dominated by multi-year oscillations, aligning with the physical nature of solar activity cycles.
The exponential moving average smooths short-term fluctuations while preserving long-term trends. This visualization highlights the gradual rise and fall of solar cycles and provides insight into the underlying signal structure without high-frequency noise.
Trend analysis reveals extended periods of increasing and decreasing sunspot activity. These slow-moving dynamics suggest that models capable of learning long-range dependencies may perform better than short-memory approaches.
ACF and PACF plots jointly indicate that the series exhibits both long-memory behavior and complex lag dependencies. Significant correlations at large lags motivate the use of multiple lag-based and window-based features in later stages.
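For reference, the lag-k autocorrelation underlying these plots can be computed directly. The sketch below uses NumPy on a toy series with a 132-month period standing in for the solar cycle (the project's figures presumably use statsmodels' `plot_acf` / `plot_pacf`):

```python
import numpy as np

def acf(x, nlags):
    """Sample autocorrelation for lags 0..nlags."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    denom = np.dot(x, x)
    return np.array([np.dot(x[: n - k], x[k:]) / denom for k in range(nlags + 1)])

# Long-memory toy series: a slow sine (period 132 months) plus noise
rng = np.random.default_rng(0)
t = np.arange(600)
series = np.sin(2 * np.pi * t / 132) + 0.2 * rng.standard_normal(600)

rho = acf(series, nlags=24)
print(rho[1])  # strong positive lag-1 autocorrelation, as in the sunspot series
```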
As expected for a monthly aggregated astronomical dataset, no meaningful weekday-based pattern is observed. This result serves as a sanity check and confirms correct temporal handling of the data.
This section focuses on identifying unusual or extreme patterns in sunspot activity that deviate from the expected temporal structure. Rather than treating anomalies as noise to be removed, the analysis aims to understand whether these extreme observations correspond to meaningful physical phenomena, such as periods of intense solar activity.
Anomaly detection is approached from both statistical decomposition and forecasting-based residual analysis, allowing anomalies to be examined from complementary perspectives.
Classical decomposition separates the series into trend, seasonal, and residual components. The residual component retains structured variance, suggesting that meaningful information remains beyond simple additive trend and seasonality modeling.
In this approach, anomalies are detected using a forecasting-residual strategy: a GRU model produces one-step-ahead predictions, and the absolute forecast error is monitored over time.
To avoid using a fixed global threshold (which would be unfair under changing cycle amplitudes), a dynamic threshold is constructed using an EWMA-smoothed error baseline. An observation is flagged as anomalous when its one-step GRU error rises above this smooth, adaptive threshold, capturing unexpected local deviations while remaining robust to regular solar-cycle behavior.
- Anomalies detected: 43 / 3133
- Anomaly rate: 1.372%
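A minimal sketch of this dynamic-threshold idea: an EWMA baseline of the forecast error plus an EWMA deviation band, with points flagged when the error escapes the band. The smoothing factor `alpha` and multiplier `k` are illustrative values, not the project's settings:

```python
import numpy as np

def ewma_threshold_flags(errors, alpha=0.1, k=3.0):
    """Flag points whose error exceeds an EWMA baseline plus k EWMA deviations."""
    errors = np.asarray(errors, dtype=float)
    baseline = np.empty_like(errors)
    dev = np.empty_like(errors)
    baseline[0], dev[0] = errors[0], 0.0
    for t in range(1, len(errors)):
        baseline[t] = alpha * errors[t] + (1 - alpha) * baseline[t - 1]
        dev[t] = alpha * abs(errors[t] - baseline[t]) + (1 - alpha) * dev[t - 1]
    return errors > baseline + k * dev

rng = np.random.default_rng(1)
errs = np.abs(rng.normal(0.0, 1.0, 500))  # stand-in for one-step GRU errors
errs[200] = 12.0                          # inject one abrupt error spike
flags = ewma_threshold_flags(errs)
print(flags[200])  # the injected spike is flagged
```

Because the baseline adapts over time, the same absolute error can be anomalous during a quiet minimum yet unremarkable near a cycle peak, which is exactly the fairness property motivated above.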
A Transformer-based forecaster is used to detect anomalies by analyzing where the model fails to predict abrupt local changes. Compared to GRU, the Transformer tends to be more sensitive to subtle and rapid irregularities, especially around cycle peaks and transition regions, where short-term dynamics can shift quickly.
Anomalies are flagged when the forecast error becomes unusually high relative to the typical error behavior, indicating that the observed pattern is difficult to explain with the learned temporal structure.
- Anomalies detected: 84 / 3133
- Anomaly rate: 2.681%
In Colab, the raw monthly sunspot series is converted into a supervised learning table using leakage-safe transformations (only past values are used via shift()).
- Input: data/raw/sunspots.csv
- Output: data/processed/sunspots_train_engineered.csv, data/processed/sunspots_test_engineered.csv
- Calendar: Year
- Cyclical month encoding: Month_sin = sin(2π · Month / 12), Month_cos = cos(2π · Month / 12)
- Lag features: Lag_1, Lag_2, Lag_3, Lag_6, Lag_12, Lag_132 (132 months ≈ one solar cycle)
- Rolling stats (computed on shifted series): Rolling_Mean_12, Rolling_Std_12, Rolling_Mean_132
- Trend signals: Diff_1, Expanding_Mean
After dropping NaN rows introduced by lag/rolling operations, the dataset is split chronologically with an 80/20 ratio to create train and test sets.
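These transformations can be sketched in pandas. The toy series and column names below mirror the feature list, but the exact schema of the project's CSVs is an assumption:

```python
import numpy as np
import pandas as pd

# Toy monthly series standing in for data/raw/sunspots.csv
idx = pd.date_range("1749-01-01", periods=400, freq="MS")
df = pd.DataFrame(
    {"SunspotNumber": np.abs(np.sin(np.arange(400) / 21.0)) * 150}, index=idx
)

# Calendar + cyclical month encoding
df["Year"] = df.index.year
month = df.index.month
df["Month_sin"] = np.sin(2 * np.pi * month / 12)
df["Month_cos"] = np.cos(2 * np.pi * month / 12)

# Leakage-safe lags: only past values, via shift()
for lag in (1, 2, 3, 6, 12, 132):
    df[f"Lag_{lag}"] = df["SunspotNumber"].shift(lag)

# Rolling/trend stats on the shifted series so the current value never leaks in
shifted = df["SunspotNumber"].shift(1)
df["Rolling_Mean_12"] = shifted.rolling(12).mean()
df["Rolling_Std_12"] = shifted.rolling(12).std()
df["Rolling_Mean_132"] = shifted.rolling(132).mean()
df["Diff_1"] = shifted.diff(1)
df["Expanding_Mean"] = shifted.expanding().mean()

# Drop NaN rows introduced by lag/rolling windows, then split chronologically
df = df.dropna()
split = int(len(df) * 0.8)
train, test = df.iloc[:split], df.iloc[split:]
print(len(train), len(test))
```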
The dataset is split chronologically into training and test sets using an 80/20 ratio, ensuring that future observations are never used during training.
The visualization highlights the clear temporal separation between the two partitions, mirroring real-world forecasting conditions and preventing information leakage across the split boundary.
All models are trained and evaluated under a shared, consistent experimental setup to ensure fair comparison.
- Input: Feature-engineered datasets created in the previous step (sunspots_train_engineered.csv, sunspots_test_engineered.csv)
- Target: SunspotNumber
- Split: Fixed chronological 80/20 train–test split
- Preprocessing: Applied only on training data (then reused on test data where required)
- Metrics: RMSE (primary), supported by visual prediction vs. ground-truth plots
- Reproducibility: Fixed random seeds where applicable
This setup guarantees that performance differences arise from model behavior, not from data handling or evaluation inconsistencies.
This section presents the modeling approaches evaluated in the project. A diverse set of models is considered, ranging from simple linear baselines to advanced nonlinear and deep learning architectures. All models are trained on the same feature-engineered dataset and evaluated under the experimental constraints described above.
These models provide interpretable linear baselines on the engineered feature set (lag features + rolling statistics). Since many predictors are highly correlated, regularization is required to stabilize coefficient estimates.
All linear models are trained on standardized features. Scaling is fitted only on the training set and then applied to the test set.
Xᵗʳᵃⁱⁿˢ = (Xᵗʳᵃⁱⁿ − μ) / σ
Xᵗᵉˢᵗˢ = (Xᵗᵉˢᵗ − μ) / σ
This step is mandatory because Ridge and Lasso penalties are scale-sensitive.
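In scikit-learn this train-only fitting looks like the following (synthetic data in place of the engineered features):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(50.0, 20.0, size=(200, 5))
X_test = rng.normal(50.0, 20.0, size=(50, 5))

scaler = StandardScaler().fit(X_train)  # μ and σ estimated on the train split only
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)     # the same μ, σ reused on the test split
```

Fitting the scaler on the training split alone keeps the test period unseen, consistent with the leakage-aware setup described earlier.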
Ordinary Least Squares minimizes the residual sum of squares:
β̂ = arg min ‖ y − Xβ ‖²
OLS is simple and highly interpretable, but sensitive to multicollinearity among engineered features.
Ridge adds an ℓ₂ penalty to shrink coefficients smoothly:
β̂ʳⁱᵈᵍᵉ = arg min ( ‖ y − Xβ ‖² + α ‖ β ‖² )
• Reduces variance under correlated predictors
• Improves numerical stability
• Hyperparameter selected via 5-fold CV
Best Ridge parameter: α = 10.0
Lasso applies an ℓ₁ penalty, encouraging sparsity:
β̂ˡᵃˢˢᵒ = arg min ( ‖ y − Xβ ‖² + α ‖ β ‖₁ )
• Can set some coefficients exactly to zero
• Acts as implicit feature selection
Best Lasso parameter: α = 0.1
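The CV-based selection of α can be sketched as follows; the synthetic collinear features mimic correlated lags, and the α grid is illustrative rather than the project's exact search space:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic, strongly correlated regressors (like neighboring lag features)
rng = np.random.default_rng(0)
base = rng.normal(size=(300, 1))
X = base + 0.05 * rng.normal(size=(300, 6))  # 6 nearly identical columns
y = X @ np.ones(6) + 0.1 * rng.normal(size=300)

alphas = {"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}
ridge = GridSearchCV(Ridge(), alphas, cv=5).fit(X, y)
lasso = GridSearchCV(Lasso(max_iter=10_000), alphas, cv=5).fit(X, y)

print(ridge.best_params_, lasso.best_params_)
```

On such collinear inputs, Ridge spreads weight smoothly across the redundant columns, while Lasso tends to concentrate it on a subset, which is the implicit feature selection noted above.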
- Linear Regression: 22.7456
- Lasso: 22.7500
- Ridge: 22.7579
All three models produce nearly identical predictions, indicating that the engineered features already capture strong linear structure; regularization mainly affects coefficient stability rather than forecast shape.
Support Vector Regression (SVR) is used as a nonlinear baseline to capture smooth but complex relationships between engineered lag/rolling features and the target. We use the RBF kernel, which maps inputs into a higher-dimensional space and fits a maximum-margin regression function.
SVR is highly sensitive to feature scales, so the same StandardScaler fitted on the training split is reused:
Xᵗʳᵃⁱⁿˢ = scaler(Xᵗʳᵃⁱⁿ)
Xᵗᵉˢᵗˢ = scaler(Xᵗᵉˢᵗ)
SVR solves a regularized ε-insensitive regression problem:
minimize ½‖w‖² + C · Σ(ξᵢ + ξᵢ*)
subject to |yᵢ − f(xᵢ)| ≤ ε (with slack variables ξ)
Key hyperparameters:
- C: penalty for errors (higher C → tighter fit, higher variance risk)
- γ (gamma): RBF kernel width (smaller γ → smoother function)
- ε (epsilon): size of the “no-penalty” tube around the regression curve
A 3-fold GridSearchCV is performed over an expanded parameter grid:
- Kernel: rbf
- C: {50 … 3000}
- γ: {scale, 0.001 … 0.1}
- ε: {0.1 … 10}
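A scaled-down sketch of this search; the full 1080-candidate grid is abbreviated here and the toy data replaces the engineered features:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 2))
y = np.sin(X[:, 0]) * 10 + 0.5 * rng.normal(size=300)

# Illustrative subset of the grid described above
param_grid = {
    "kernel": ["rbf"],
    "C": [50, 500, 1500],
    "gamma": ["scale", 0.001, 0.1],
    "epsilon": [0.1, 1.0, 3.0],
}
search = GridSearchCV(
    SVR(), param_grid, cv=3, scoring="neg_root_mean_squared_error"
)
search.fit(X, y)
print(search.best_params_)
```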
Search cost:
- 1080 candidates × 3 folds = 3240 fits
- Training time: 1913.03s
Best parameters:
- C = 1500
- γ = 0.001
- ε = 3
- kernel = rbf
- SVR RMSE: 22.7760
- R²: 0.8934
The optimized SVR produces smooth predictions that follow the long-term cycle reasonably well, but it may still under-react to sharp local changes around extreme peaks due to the ε-insensitive loss and the smoothing effect of the RBF kernel.
Saved model:
V2-models/model_svr_optimized.pkl
Random Forest Regressor is evaluated as a strong nonlinear, tree-based ensemble model. By averaging predictions from many decision trees trained on bootstrapped samples, the model can capture complex feature interactions while remaining relatively robust to noise.
Unlike linear and kernel-based models, Random Forest does not require feature scaling, making it well-suited for the heterogeneous set of lag, rolling, and trend features used in this project.
To respect temporal ordering, hyperparameter tuning is performed using TimeSeriesSplit (3 folds) instead of random cross-validation. A narrow but expressive grid is explored:
- n_estimators: number of trees
- max_depth: tree depth (controls bias–variance trade-off)
- min_samples_leaf / min_samples_split: regularization via minimum node sizes
- max_features: fraction of features considered at each split
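A minimal sketch of the time-series-aware search; the grid is deliberately much smaller than the 432-combination search actually run, and the data is synthetic:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = X[:, 0] * 3 + np.abs(X[:, 1]) + 0.1 * rng.normal(size=300)

# Illustrative subset of the tuned hyperparameters
param_grid = {
    "n_estimators": [50],
    "max_depth": [5, 25],
    "min_samples_leaf": [1, 3],
    "max_features": [0.5],
}
tscv = TimeSeriesSplit(n_splits=3)  # folds never train on future observations
search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid, cv=tscv)
search.fit(X, y)
print(search.best_params_)
```

Unlike shuffled K-fold, each `TimeSeriesSplit` fold validates strictly after its training window, matching the chronological evaluation used everywhere else in the project.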
Search details:
- 432 parameter combinations
- 3 time-series folds → 1296 fits
- Training time: 4577.15s
Best configuration:
- n_estimators: 400
- max_depth: 25
- max_features: 0.5
- min_samples_leaf: 3
- min_samples_split: 2
- RMSE: 22.8724
- R²: 0.8925
The Random Forest model produces stable predictions that follow the general structure of the sunspot cycles. However, its piecewise-constant nature limits its ability to smoothly track rapid transitions around extreme peaks.
Feature importance scores confirm that short-term temporal dependence dominates prediction performance:
Top contributing features:
- Lag_1 (0.46)
- Lag_2 (0.20)
- Rolling_Mean_12 (0.15)
- Lag_3 (0.08)
- Lag_6 (0.02)
This ranking highlights the critical role of recent history and short-term smoothing, while longer-horizon features contribute more subtly.
The trained model is saved under:
V2-models/model_random_forest.pkl
LightGBM is evaluated as a gradient boosting–based tree ensemble, which builds trees sequentially to correct the errors of previous models. Unlike Random Forest, which averages independent trees, LightGBM explicitly focuses on hard-to-predict regions of the feature space.
As a tree-based method, LightGBM does not require feature scaling and naturally handles the heterogeneous lag, rolling, and trend features used in this project.
Hyperparameter tuning is performed using TimeSeriesSplit (3 folds) to preserve temporal ordering. A compact but expressive grid is explored:
- learning_rate: boosting step size
- num_leaves: controls tree complexity
- max_depth: depth constraint
- colsample_bytree: feature subsampling ratio
Grid search summary:
- 36 parameter combinations
- 3 time-series folds → 108 fits
- Total search time: 58.09s
During training, early stopping is applied based on validation RMSE, automatically selecting the optimal number of boosting rounds and preventing overfitting.
Best configuration:
- learning_rate: 0.05
- num_leaves: 15
- max_depth: 10
- colsample_bytree: 0.8
- Best iteration: 123
- RMSE: 23.3378
- R²: 0.8881
Although LightGBM successfully captures nonlinear feature interactions and long-range dependencies (notably the solar-cycle-scale lag), its performance remains slightly below Random Forest and SVR in this setup. This suggests that, for this dataset, boosting-based models may overemphasize local corrections without consistently improving long-horizon generalization.
Feature importance scores reveal a balanced reliance on both short-term memory and long-cycle structure:
Top contributing features:
- Lag_1
- Lag_132 (≈ one full solar cycle)
- Rolling_Mean_12
- Expanding_Mean
- Lag_3
This confirms that gradient-boosted trees leverage information from both recent observations and long-term solar-cycle dynamics.
The trained model is saved under:
V2-models/model_lightgbm.pkl
A baseline feed-forward neural network is evaluated using MLPRegressor to test whether a simple nonlinear neural model can improve upon tree-based and kernel-based approaches on the engineered feature set.
Because MLPs are sensitive to feature scales, the same StandardScaler fitted on the training split is reused:
Xᵗʳᵃⁱⁿˢ = scaler(Xᵗʳᵃⁱⁿ)
Xᵗᵉˢᵗˢ = scaler(Xᵗᵉˢᵗ)
Hyperparameters are tuned with TimeSeriesSplit (3 folds) to preserve temporal ordering:
- 24 parameter combinations
- 3 folds → 72 fits
- Training time: 369.98s
To prevent overfitting, the MLP uses built-in early stopping with:
- validation_fraction = 0.1
- n_iter_no_change = 20
- tol = 1e-4
Best configuration:
- hidden_layer_sizes: (100, 100)
- activation: relu
- solver: adam
- alpha: 0.01
- learning_rate_init: 0.005
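The configuration above can be reproduced as a sketch; the data is synthetic and `max_iter` is an illustrative cap:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = np.tanh(X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.normal(size=400)

X_s = StandardScaler().fit_transform(X)  # MLPs need scaled inputs

mlp = MLPRegressor(
    hidden_layer_sizes=(100, 100),
    activation="relu",
    solver="adam",
    alpha=0.01,
    learning_rate_init=0.005,
    early_stopping=True,       # monitors an internal validation split
    validation_fraction=0.1,
    n_iter_no_change=20,
    tol=1e-4,
    max_iter=500,
    random_state=0,
)
mlp.fit(X_s, y)
print(mlp.n_iter_)             # epochs actually run before early stopping
```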
The training loss decreases smoothly across epochs, indicating stable optimization under the selected architecture and learning rate. Early stopping prevents unnecessary iterations once improvement stalls.
- RMSE: 23.6061
- R²: 0.8855
While the MLP captures some nonlinear relationships, its performance is weaker than the best classical baselines in this setup. This suggests that a shallow fully-connected network may struggle to represent the long-horizon temporal structure of sunspot cycles when trained on tabular engineered features.
The trained model is saved under:
V2-models/model_mlp.pkl
To overcome the limitations of the baseline MLPRegressor, an advanced feed-forward neural network is implemented in PyTorch. The goal is to increase representational capacity while keeping training stable and preventing overfitting on engineered tabular features.
The network is a deep MLP with progressive dimensional reduction:
- Linear(d → 128) + BatchNorm + ReLU + Dropout(0.3)
- Linear(128 → 64) + BatchNorm + ReLU + Dropout(0.2)
- Linear(64 → 32) + ReLU
- Linear(32 → 1)
This design improves stability (BatchNorm), reduces overfitting (Dropout), and allows richer nonlinear transformations than the shallow baseline MLP.
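A sketch of this architecture and its training setup in PyTorch; the input dimension 14 is a placeholder, not the project's exact feature count:

```python
import torch
import torch.nn as nn

class AdvancedMLP(nn.Module):
    """Deep MLP matching the layer plan above (128 -> 64 -> 32 -> 1)."""

    def __init__(self, d_in: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_in, 128), nn.BatchNorm1d(128), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(128, 64), nn.BatchNorm1d(64), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

model = AdvancedMLP(d_in=14)  # placeholder feature count
opt = torch.optim.AdamW(model.parameters(), lr=0.01, weight_decay=1e-4)
sched = torch.optim.lr_scheduler.ReduceLROnPlateau(opt, factor=0.5, patience=10)

out = model(torch.randn(8, 14))  # one forward pass on a dummy batch
print(out.shape)
```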
- Scaler: same StandardScaler fitted on the train split (reused here)
- Optimizer: AdamW (lr=0.01, weight_decay=1e-4)
- Loss: MSE
- LR Scheduler: ReduceLROnPlateau (factor=0.5, patience=10)
- Early stopping: patience = 40 (training stopped at epoch 82)
Note: The run was executed on CPU (device=cpu), but the code supports CUDA when available.
- RMSE: 22.4704
- R²: 0.8962
Compared to the baseline scikit-learn MLP, the PyTorch DNN achieves a clear improvement, indicating that deeper architectures with proper regularization and scheduling can better model the nonlinear structure of sunspot cycles even in a tabular feature setting.
Saved model:
V2-models/model_pytorch_mlp.pth
This model implements a custom K-Nearest Neighbors regressor in PyTorch, designed to run efficiently on GPU. Instead of learning parameters, the model performs prediction through distance-based retrieval and weighted aggregation.
The inference pipeline follows these steps:
1. Input tensor (X_test): test samples are transferred to GPU in batch form.
2. Parallel distance computation: pairwise distances between test samples and all training samples are computed using Manhattan distance via torch.cdist(p=1), producing a full N_test × N_train distance matrix.
3. Top-K neighbor selection: the indices of the K nearest neighbors are selected using torch.topk, enabling fast nearest-neighbor retrieval on GPU.
4. Inverse distance weighting: neighbor contributions are weighted as wᵢ = 1 / (dᵢ + ε), followed by normalization to ensure numerical stability.
5. Weighted aggregation: final predictions are obtained via a weighted sum of neighbor target values.
This architecture allows brute-force KNN to scale efficiently by leveraging GPU parallelism.
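The pipeline above condenses into a short PyTorch function. It runs on CPU here; moving the tensors to CUDA gives the GPU-accelerated version:

```python
import torch

def knn_predict(X_train, y_train, X_test, k=27, eps=1e-8):
    """Brute-force KNN regression with inverse-distance weighting.

    Runs on whatever device the tensors live on (CPU here, CUDA if moved).
    """
    d = torch.cdist(X_test, X_train, p=1)        # Manhattan distance matrix
    dist, idx = torch.topk(d, k, largest=False)  # K nearest neighbors
    w = 1.0 / (dist + eps)                       # inverse-distance weights
    w = w / w.sum(dim=1, keepdim=True)           # normalize for stability
    return (w * y_train[idx]).sum(dim=1)         # weighted aggregation

torch.manual_seed(0)
X_tr, y_tr = torch.randn(500, 6), torch.randn(500)
X_te = torch.randn(20, 6)
pred = knn_predict(X_tr, y_tr, X_te, k=5)
print(pred.shape)
```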
The number of neighbors (K) is selected empirically using a two-stage search strategy:
- Coarse search: K = 15 → 59
- Fine search: refinement around the best coarse value
Optimal configuration:
- Best K: 27
- RMSE: 24.39
- R²: 0.8778
The prediction plot shows that KNN can roughly follow the overall shape of sunspot cycles but tends to oversmooth peaks and troughs, limiting its ability to model long-horizon nonlinear dynamics.
Overall, despite GPU acceleration and weighted aggregation, the instance-based nature of KNN makes it less effective than tree-based and neural approaches for this dataset.
Among all models tested in this project, the PyTorch Stacked LSTM achieved the best overall forecasting performance on the held-out test period.
A 2-layer stacked LSTM is used to model temporal dynamics, followed by a small regression head:
- Input shape: (batch, seq_len=1, features)
- LSTM: 2 layers, hidden_dim = 128, internal dropout = 0.2
- Head: Linear(128→64) → ReLU → Dropout(0.1) → Linear(64→1)
Note: In this implementation, seq_len = 1 because the LSTM consumes the engineered feature vector as a single-step sequence tensor.
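A sketch of the described architecture; the input dimension 14 is a placeholder, not the project's exact feature count:

```python
import torch
import torch.nn as nn

class StackedLSTM(nn.Module):
    """2-layer LSTM (hidden 128) with the small regression head described above."""

    def __init__(self, d_in: int):
        super().__init__()
        self.lstm = nn.LSTM(
            d_in, 128, num_layers=2, dropout=0.2, batch_first=True
        )
        self.head = nn.Sequential(
            nn.Linear(128, 64), nn.ReLU(), nn.Dropout(0.1), nn.Linear(64, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.lstm(x)         # (batch, seq_len, 128)
        return self.head(out[:, -1])  # last time step feeds the regression head

model = StackedLSTM(d_in=14)          # placeholder feature count
x = torch.randn(8, 1, 14)             # seq_len = 1, as in the project
print(model(x).shape)
```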
- Device: CUDA (GPU)
- Loss: MSE
- Optimizer: AdamW (lr=0.001, weight_decay=1e-5)
- LR Scheduler: ReduceLROnPlateau (factor=0.5, patience=10)
- Early stopping: patience = 20 (stopped at epoch 39)
The combined figure reports both the learning dynamics (train/validation loss) and the final prediction vs. ground truth on the test period. Training converges rapidly in the first epochs, then stabilizes; early stopping prevents overfitting once validation loss stops improving.
- RMSE: 22.3363
- R²: 0.8975
The LSTM tracks both the long-term solar-cycle structure and short-term fluctuations more reliably than the tabular baselines, resulting in the strongest generalization among the evaluated models.
Saved model:
V2-models/model_pytorch_lstm.pth
To provide a clear and fair evaluation, all trained models are compared using RMSE on the same held-out test set. Lower RMSE indicates better forecasting accuracy.
| Model | RMSE |
|---|---|
| PyTorch LSTM | 22.3363 |
| PyTorch MLP (DNN) | 22.4704 |
| Linear Regression | 22.7456 |
| Lasso Regression | 22.7500 |
| Ridge Regression | 22.7579 |
| SVR (Optimized) | 22.7760 |
| Random Forest | 22.8724 |
| LightGBM | 23.3378 |
| MLP (sklearn) | 23.6061 |
| PyTorch KNN | 24.3876 |
- PyTorch Stacked LSTM achieves the best overall performance, confirming the advantage of sequence-aware neural architectures for modeling long-term solar cycles.
- Advanced PyTorch MLP performs competitively, outperforming most classical models despite operating on tabular features.
- Linear models (OLS, Ridge, Lasso) show surprisingly strong baselines due to effective feature engineering.
- Tree-based ensembles (Random Forest, LightGBM) capture nonlinearities but slightly underperform sequence models.
- KNN, even with GPU acceleration, struggles to generalize long-horizon temporal dynamics.
The results demonstrate that while strong feature engineering enables classical models to perform well, deep learning models with temporal inductive bias—especially LSTM—provide the most accurate and robust forecasts for sunspot time series data.
All experiments in this repository are conducted using a fixed train/test split and consistent preprocessing pipelines to ensure fair comparison across models. Trained models, feature-engineered datasets, and visualization assets are preserved to support reproducibility and transparent evaluation.
The project is structured to allow easy extension with alternative feature sets, forecasting horizons, or sequence-based architectures.
This repository is intended as a complete and self-contained time series forecasting study, suitable for academic review, portfolio demonstration, and further experimentation.