VStats 0.1.0

A dependency-free Linear Algebra, Statistics, and Machine Learning library written from scratch in V.

Installation

v install https://github.com/rodabt/vstats

Quick Start

import vstats.stats
import vstats.utils
import vstats.linalg

// Statistics with generic types (int or f64)
mean_val := stats.mean([1, 2, 3, 4, 5])  // works with int or f64
variance := stats.variance([1.0, 2.0, 3.0])

// Advanced statistics (sample data for illustration)
group1 := [1.0, 2.0, 3.0]
group2 := [4.0, 5.0, 6.0]
group3 := [7.0, 8.0, 9.0]
f_stat, p_val := stats.anova_one_way([group1, group2, group3])
data := [2.0, 4.0, 4.0, 5.0, 7.0, 9.0]
lower, upper := stats.confidence_interval_mean(data, 0.95)

// Metrics
y_true := [1, 1, 0, 0, 1]
y_pred := [1, 0, 0, 1, 1]
cm := utils.build_confusion_matrix(y_true, y_pred)
println("F1: ${cm.f1_score():.4f}")

// Linear Algebra (also supports generics)
v1 := [1, 2, 3]
v2 := [4, 5, 6]
v_sum := linalg.add(v1, v2)  // works with int or f64
matrix_a := [[1.0, 2.0], [3.0, 4.0]]
matrix_b := [[5.0, 6.0], [7.0, 8.0]]
result := linalg.matmul(matrix_a, matrix_b)
distance := linalg.distance([0.0, 0.0], [3.0, 4.0])  // 5.0

Features Overview

Module      Purpose                            Status
linalg      Vector & Matrix operations         ✓ Complete
stats       Descriptive & Advanced Statistics  ✓ Complete
prob        Probability Distributions          ✓ Complete
optim       Numerical Optimization             ✓ Complete
utils       Utilities, Metrics, Datasets       ✓ Complete
ml          Machine Learning Algorithms        ✓ Complete
nn          Neural Networks                    ✓ Complete
hypothesis  Hypothesis Testing                 ✓ Complete
symbol      Symbolic Computation               🚧 WIP

Generic Type Support

Many functions across VStats support generic numeric types (int, f64):

  • Linear Algebra (linalg): All vector and matrix operations

    • Examples: add[T], subtract[T], dot[T], matmul[T], distance[T]
    • Vector operations return the same type: add[int] returns []int
    • Matrix operations work seamlessly with both types
  • Statistics & Optimization: Accept generic input, return f64 for precision

    • Examples: mean[T], variance[T], correlation[T], mse_loss[T]
    • Design: Generic input []T → f64 output (maintains mathematical precision)
  • Type-Specific Functions: Require []f64 due to algorithmic constraints

    • median, quantile, mode - require sorting/hashing
    • Functions depending on [][]f64 matrices for feature operations
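
A quick sketch of these three categories (sample values are illustrative):

import vstats.linalg
import vstats.stats

ints := linalg.add([1, 2], [3, 4])            // []int in, []int out
floats := linalg.add([1.0, 2.0], [3.0, 4.0])  // []f64 in, []f64 out
avg := stats.mean([1, 2, 3])                  // generic input, f64 output
med := stats.median([1.0, 3.0, 2.0])          // sorting-based: requires []f64
println("${ints} | ${floats} | ${avg} | ${med}")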

Modules implemented

The following lists the modules and functions currently implemented.

Linear Algebra (linalg)

Vectors

  • add[T](v []T, w []T) []T: Adds two vectors element-wise: v + w
  • subtract[T](v []T, w []T) []T: Subtracts two vectors element-wise: v - w
  • vector_sum[T](vector_list [][]T) []T: Sums a list of vectors, example: vector_sum([[1,2],[3,4]]) => [4, 6]
  • scalar_multiply[T](c f64, v []T) []T: Multiplies each element of vector v by the scalar c
  • vector_mean[T](vector_list [][]T) []T: Element-wise mean of a list of vectors: (1/n) * sum_j v[j]
  • dot[T](v []T, w []T) T: Dot product of v and w
  • sum_of_squares[T](v []T) T: Sum of the squared elements, example: [1,2,3] => 1^2 + 2^2 + 3^2 = 14
  • magnitude[T](v []T) T: Euclidean norm (length) of a vector, example: ||[3,4]|| = 5
  • squared_distance[T](v []T, w []T) T: Sum of squared differences: (v1-w1)^2 + (v2-w2)^2 + ...
  • distance[T](v []T, w []T) T: Euclidean distance between v and w: sqrt((v1-w1)^2 + (v2-w2)^2 + ...)
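
A minimal example exercising a few of these operations (expected values in comments):

import vstats.linalg

v := [3.0, 4.0]
w := [1.0, 2.0]
println(linalg.add(v, w))       // [4.0, 6.0]
println(linalg.dot(v, w))       // 3*1 + 4*2 = 11
println(linalg.magnitude(v))    // 5
println(linalg.distance(v, w))  // sqrt(2^2 + 2^2) ≈ 2.83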

Matrices

  • shape[T](a [][]T) (int, int): Returns the shape of a matrix (rows, columns)
  • get_row[T](a [][]T, i int) []T: Gets the i-th row of a matrix as a vector
  • get_column[T](a [][]T, j int) []T: Gets the j-th column of a matrix as a vector
  • flatten[T](m [][]T) []T: Flattens a matrix to a 1D array
  • make_matrix[T](num_rows int, num_cols int, op fn (int, int) T) [][]T: Makes a matrix using a formula given by function op(i,j)
  • identity_matrix[T](n int) [][]T: Returns an n-identity matrix
  • matmul[T](a [][]T, b [][]T) [][]T: Multiplies matrix a with b
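
A short sketch of the matrix helpers (make_matrix fills each cell from op(i, j)):

import vstats.linalg

a := [[1.0, 2.0], [3.0, 4.0]]
b := [[5.0, 6.0], [7.0, 8.0]]
rows, cols := linalg.shape(a)         // 2, 2
id := linalg.identity_matrix[f64](2)  // [[1, 0], [0, 1]]
prod := linalg.matmul(a, b)           // [[19, 22], [43, 50]]
m := linalg.make_matrix[f64](2, 3, fn (i int, j int) f64 { return f64(i + j) })
println("${rows}x${cols} ${id} ${prod} ${m}")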

Probabilities (prob)

Distributions (CDF and PDF)

  • beta_function(x f64, y f64) f64
  • normal_cdf(x f64, mu f64, sigma f64) f64
  • inverse_normal_cdf(p f64, mu f64, sigma f64, dp DistribParams) f64
  • bernoulli_pdf(x f64, p f64) f64
  • bernoulli_cdf(x f64, p f64) f64
  • binomial_pdf(k int, n int, p f64) f64
  • poisson_pdf(k int, lambda f64) f64
  • poisson_cdf(k int, lambda f64) f64
  • exponential_pdf(x f64, lambda f64) f64
  • exponential_cdf(x f64, lambda f64) f64
  • gamma_pdf(x f64, k f64, theta f64) f64
  • chi_squared_pdf(x f64, df int) f64
  • students_t_pdf(x f64, df int) f64
  • f_distribution_pdf(x f64, d1 int, d2 int) f64
  • beta_pdf(x f64, alpha f64, beta f64) f64
  • uniform_pdf(x f64, a f64, b f64) f64
  • uniform_cdf(x f64, a f64, b f64) f64
  • negative_binomial_pdf(k int, r int, p f64) f64
  • negative_binomial_cdf(k int, r int, p f64) f64
  • multinomial_pdf(x []int, p []f64) f64
  • expectation[T](x []T, p []T) T - Generic expectation calculation
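
A few of these in use (the values in comments are the standard textbook results):

import vstats.prob

println(prob.normal_cdf(1.96, 0.0, 1.0))  // ≈ 0.975 for the standard normal
println(prob.binomial_pdf(3, 10, 0.5))    // P(X = 3) for Binomial(10, 0.5) ≈ 0.117
println(prob.poisson_cdf(2, 1.5))         // P(X <= 2) for Poisson(1.5) ≈ 0.809
println(prob.uniform_pdf(0.5, 0.0, 1.0))  // 1.0 on [0, 1]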

Statistics (stats)

Descriptive Statistics

  • sum[T](x []T) f64 - Accepts generic numeric input, returns f64
  • mean[T](x []T) f64 - Accepts generic numeric input, returns f64
  • median(x []f64) f64 - Requires f64 (requires sorting)
  • quantile(x []f64, p f64) f64 - Requires f64 (requires sorting)
  • mode(x []f64) []f64 - Requires f64 (requires hashing)
  • range[T](x []T) T - Accepts generic numeric input, returns same type
  • dev_mean[T](x []T) []f64 - Accepts generic numeric input, returns []f64 of deviations from the mean
  • variance[T](x []T) f64 - Accepts generic numeric input, returns f64
  • standard_deviation[T](x []T) f64 - Accepts generic numeric input, returns f64
  • interquartile_range(x []f64) f64 - Requires f64
  • covariance[T](x []T, y []T) f64 - Accepts generic numeric input, returns f64
  • correlation[T](x []T, y []T) f64 - Accepts generic numeric input, returns f64
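
A small example (whether standard_deviation uses the sample or population definition is not specified here, so that output is hedged):

import vstats.stats

x := [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
println(stats.mean(x))                // 5.0
println(stats.median(x))              // 4.5
println(stats.standard_deviation(x))  // 2.0 (population) or ≈ 2.14 (sample)
println(stats.correlation(x, x))      // 1.0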

Advanced Statistical Tests

  • anova_one_way(groups [][]f64) (f64, f64) - One-way ANOVA test
  • confidence_interval_mean(x []f64, confidence_level f64) (f64, f64) - CI for mean
  • cohens_d(group1 []f64, group2 []f64) f64 - Effect size between two groups
  • cramers_v(contingency [][]int) f64 - Effect size for categorical association
  • skewness(x []f64) f64 - Distribution asymmetry
  • kurtosis(x []f64) f64 - Distribution tailedness (excess kurtosis)

Optimization (optim)

  • difference_quotient(f fn (f64) f64, x f64, h f64) f64
  • partial_difference_quotient(f fn([]f64) f64, v []f64, i int, h f64) f64
  • gradient(f fn([]f64) f64, v []f64, h f64) []f64
  • gradient_step[T](v []T, gradient_vector []T, step_size T) []T - Generic gradient descent step
  • sum_of_squares_gradient[T](v []T) []T - Generic sum of squares gradient
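
A gradient-descent sketch on f(v) = sum(v_i^2), assuming (as in the Grus book this library follows) that gradient_step returns v + step_size * gradient, so a negative step descends:

import vstats.optim

f := fn (v []f64) f64 {
    mut s := 0.0
    for x in v {
        s += x * x
    }
    return s
}

mut v := [3.0, -2.0]
for _ in 0 .. 100 {
    grad := optim.gradient(f, v, 0.0001)
    v = optim.gradient_step(v, grad, -0.1)
}
println(v)  // should approach [0, 0]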

Statistics (stats) - Advanced Functions

Hypothesis Testing

  • anova_one_way(groups [][]f64) (f64, f64) - One-way ANOVA test comparing group means
    • Returns: F-statistic and p-value
    • Tests if 3+ groups have significantly different means

Confidence Intervals

  • confidence_interval_mean(x []f64, confidence_level f64) (f64, f64) - CI for population mean
    • Supports: 90%, 95%, 99% confidence levels
    • Returns: (lower_bound, upper_bound)

Effect Sizes

  • cohens_d(group1 []f64, group2 []f64) f64 - Cohen's d effect size

    • Standardized difference between two group means
    • Interpretation: |d| > 0.8 = large effect
  • cramers_v(contingency [][]int) f64 - Cramér's V effect size

    • Measures association between categorical variables
    • Range: 0 (no association) to 1 (perfect association)

Distribution Shape

  • skewness(x []f64) f64 - Measure of distribution asymmetry

    • Positive = right-skewed, Negative = left-skewed
  • kurtosis(x []f64) f64 - Measure of tail heaviness

    • Returns excess kurtosis (normal distribution = 0)
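
For example, a sample with one large outlier on the right should report positive skewness:

import vstats.stats

right_skewed := [1.0, 1.0, 2.0, 2.0, 3.0, 9.0]
println(stats.skewness(right_skewed))  // positive: long right tail
println(stats.kurtosis(right_skewed))  // excess kurtosis; ~0 for normal data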

Utilities (utils)

Classification Metrics

  • build_confusion_matrix(y_true []int, y_pred []int) ConfusionMatrix - Build confusion matrix from predictions
  • (ConfusionMatrix).accuracy() f64 - (TP+TN)/Total
  • (ConfusionMatrix).precision() f64 - TP/(TP+FP)
  • (ConfusionMatrix).recall() f64 - TP/(TP+FN) [Sensitivity]
  • (ConfusionMatrix).specificity() f64 - TN/(TN+FP)
  • (ConfusionMatrix).f1_score() f64 - Harmonic mean of precision & recall
  • (ConfusionMatrix).false_positive_rate() f64 - FP/(FP+TN)
  • (ConfusionMatrix).summary() string - Formatted metrics summary

ROC & AUC

  • roc_curve(y_true []int, y_proba []f64) ROC_Curve - Generate ROC curve with AUC
    • Calculates TPR and FPR at different thresholds
    • Returns ROC_Curve with auc value (0-1 scale)
  • (ROC_Curve).auc_value() f64 - Extract Area Under Curve

Quick Metrics Calculation

  • binary_classification_metrics(y_true []int, y_pred []int) map[string]f64

    • Returns all 6 metrics in one call: accuracy, precision, recall, specificity, f1_score, fpr
  • regression_metrics(y_true []f64, y_pred []f64) map[string]f64

    • Returns: mse, rmse, mae, r2

Hyperparameter Tuning

  • generate_param_grid(param_ranges map[string][]f64) []map[string]f64
    • Generates all parameter combinations for grid search
    • Supports 1-3 parameters

Training Utilities

  • (TrainingProgress).format_log() string - Pretty-print training progress with metrics

  • early_stopping(losses []f64, patience int) bool - Early stopping based on loss plateau

    • Prevents overfitting by checking if loss hasn't improved for N epochs
  • decay_learning_rate(initial_lr f64, epoch int, decay_rate f64) f64

    • Exponential learning rate decay: lr = initial_lr * (decay_rate)^epoch

Feature Normalization

  • normalize_features(x [][]f64) ([][]f64, []f64, []f64) - Standardize features using z-score normalization

    • Returns: (normalized_data, feature_means, feature_stds)
    • Computes mean and std for each feature, then applies (x - mean) / std
    • Essential for ML algorithms sensitive to feature scaling
  • apply_normalization(x [][]f64, means []f64, stds []f64) [][]f64 - Apply pre-computed normalization

    • Uses statistics from training data on new data
    • Prevents data leakage in train/test split scenarios
    • Handles zero standard deviation gracefully

Basic Utilities

  • factorial(n int) f64 - Compute factorial
  • combinations(n int, k int) f64 - Binomial coefficient C(n,k)
  • range(n int) []int - Generate range [0, 1, ..., n-1]
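
For example:

import vstats.utils

println(utils.factorial(5))        // 120
println(utils.combinations(5, 2))  // C(5,2) = 10
println(utils.range(5))            // [0, 1, 2, 3, 4]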

Dataset Functions

  • load_iris() !Dataset - Iris dataset (150 samples, 4 features, 3 classes)
  • load_wine() !Dataset - Wine dataset (178 samples, 13 features reduced to 4, 3 classes)
  • load_breast_cancer() !Dataset - Breast cancer dataset (binary classification)
  • load_boston_housing() !RegressionDataset - Boston housing (506 samples, regression)
  • load_linear_regression() RegressionDataset - Synthetic linear data (20 samples)
  • (Dataset).summary() string - Dataset information and class distribution
  • (Dataset).train_test_split(test_size f64) (Dataset, Dataset) - Split dataset
  • (Dataset).xy() ([][]f64, []int) - Get features and targets separately
  • Similar methods for RegressionDataset with continuous targets

Neural Networks (nn)

Loss Functions

Generic Loss Functions (Accept int or f64)

  • mse_loss[T](y_true []T, y_pred []T) f64 - Mean Squared Error
  • mse_loss_gradient[T](y_true []T, y_pred []T) []f64 - MSE gradient
  • mae_loss[T](y_true []T, y_pred []T) f64 - Mean Absolute Error
  • mae_loss_gradient[T](y_true []T, y_pred []T) []f64 - MAE gradient
  • huber_loss[T](y_true []T, y_pred []T, delta f64) f64 - Robust loss function
  • hinge_loss[T](y_true []T, y_pred []T) f64 - SVM-style loss
  • squared_hinge_loss[T](y_true []T, y_pred []T) f64 - Squared hinge loss
  • cosine_similarity_loss[T](y_true []T, y_pred []T) f64 - Cosine similarity-based loss
  • triplet_loss[T](anchor []T, positive []T, negative []T, margin f64) f64 - Metric learning loss

Fixed-Type Loss Functions (Require f64)

  • binary_crossentropy_loss(y_true []f64, y_pred []f64) f64 - Binary classification loss
  • binary_crossentropy_loss_gradient(y_true []f64, y_pred []f64) []f64 - BCE gradient
  • categorical_crossentropy_loss(y_true [][]f64, y_pred [][]f64) f64 - Multi-class loss
  • sparse_categorical_crossentropy_loss(y_true []int, y_pred [][]f64) f64 - Sparse multi-class loss
  • kl_divergence_loss(y_true []f64, y_pred []f64) f64 - KL divergence
  • contrastive_loss(y_true f64, distance f64, margin f64) f64 - Siamese network loss
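
A brief sketch of the cross-entropy losses (sample predictions are illustrative; lower is better):

import vstats.nn

// Binary cross-entropy on probabilities in (0, 1)
y_true := [1.0, 0.0, 1.0, 0.0]
y_pred := [0.9, 0.1, 0.8, 0.3]
println(nn.binary_crossentropy_loss(y_true, y_pred))  // small: predictions match labels

// Sparse variant: integer labels against per-class probability rows
labels := [0, 2]
probs := [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]
println(nn.sparse_categorical_crossentropy_loss(labels, probs))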

Machine Learning (ml)

Regression (Generic Support)

All regression functions support generic numeric types with automatic conversion to f64 for precision.

Model Training & Prediction:

  • linear_regression[T](x [][]T, y []T) LinearModel[T] - Ordinary Least Squares regression
  • linear_predict[T](model LinearModel[T], x [][]T) []T - Predictions using linear model
  • logistic_regression[T](x [][]T, y []T, iterations int, learning_rate T) LogisticModel[T] - Binary classification with gradient descent
  • logistic_predict_proba[T](model LogisticModel[T], x [][]T) []T - Probability predictions
  • logistic_predict[T](model LogisticModel[T], x [][]T, threshold T) []T - Class predictions

Error Metrics (Generic input, f64 output for precision):

  • mse[T](y_true []T, y_pred []T) f64 - Mean Squared Error
  • rmse[T](y_true []T, y_pred []T) f64 - Root Mean Squared Error
  • mae[T](y_true []T, y_pred []T) f64 - Mean Absolute Error
  • r_squared[T](y_true []T, y_pred []T) f64 - R² coefficient of determination
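
A minimal end-to-end regression sketch on noiseless data (so the error metrics should land near their ideal values):

import vstats.ml

x := [[1.0], [2.0], [3.0], [4.0]]
y := [2.0, 4.0, 6.0, 8.0]  // y = 2x

model := ml.linear_regression(x, y)
pred := ml.linear_predict(model, x)
println(ml.mse(y, pred))        // ≈ 0
println(ml.r_squared(y, pred))  // ≈ 1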

Classification

  • logistic_classifier(x [][]f64, y []f64, iterations int, learning_rate f64) LogisticClassifier[f64]
  • logistic_classifier_predict(model LogisticClassifier[f64], x [][]f64, threshold f64) []int
  • logistic_classifier_predict_proba(model LogisticClassifier[f64], x [][]f64) []f64
  • naive_bayes_classifier(x [][]f64, y []int) NaiveBayesClassifier - Probabilistic classifier
  • naive_bayes_predict(model NaiveBayesClassifier, x [][]f64) []int
  • svm_classifier(x [][]f64, y []f64, learning_rate f64, iterations int, gamma f64, kernel string) SVMClassifier
  • svm_predict(model SVMClassifier, x [][]f64) []int
  • random_forest_classifier(x [][]f64, y []int, num_trees int, max_depth int) RandomForestClassifier
  • random_forest_predict(model RandomForestClassifier, x [][]f64) []int
  • accuracy(y_true []int, y_pred []int) f64 - Classification accuracy
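
A small naive Bayes sketch on two well-separated clusters (sample data is illustrative):

import vstats.ml

x := [[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]]
y := [0, 0, 1, 1]

model := ml.naive_bayes_classifier(x, y)
pred := ml.naive_bayes_predict(model, [[1.2, 1.9], [5.5, 8.5]])
println(pred)                       // expected: [0, 1]
println(ml.accuracy([0, 1], pred))  // expected: 1.0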

Clustering

  • kmeans(data [][]f64, k int, max_iterations int) KMeansModel - K-means clustering
  • kmeans_predict(model KMeansModel, data [][]f64) []int - Predict cluster assignments
  • kmeans_inertia(model KMeansModel, data [][]f64) f64 - Sum of squared distances
  • silhouette_coefficient(data [][]f64, labels []int) f64 - Cluster quality metric
  • hierarchical_clustering(data [][]f64, num_clusters int) HierarchicalClustering - Agglomerative clustering
  • dbscan(data [][]f64, eps f64, min_points int) []int - Density-based clustering
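
A k-means sketch on two obvious clusters (cluster ids may be permuted between runs):

import vstats.ml

data := [[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]]
model := ml.kmeans(data, 2, 100)
labels := ml.kmeans_predict(model, data)
println(labels)                                   // e.g. [0, 0, 1, 1]
println(ml.silhouette_coefficient(data, labels))  // closer to 1 = better separation
println(ml.kmeans_inertia(model, data))           // within-cluster sum of squared distances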

Hypothesis Testing (hypothesis)

Statistical Tests

  • t_test_one_sample(x []f64, mu f64, tp TestParams) (f64, f64) - One-sample t-test
  • t_test_two_sample(x []f64, y []f64, tp TestParams) (f64, f64) - Two-sample t-test
  • chi_squared_test(observed []f64, expected []f64) (f64, f64) - Chi-squared goodness of fit
  • correlation_test(x []f64, y []f64, tp TestParams) (f64, f64) - Pearson correlation significance
  • wilcoxon_signed_rank_test(x []f64, y []f64) (f64, f64) - Non-parametric paired test
  • mann_whitney_u_test(x []f64, y []f64) (f64, f64) - Non-parametric independent samples test
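
Two of the tests that take no TestParams argument, sketched with illustrative data:

import vstats.hypothesis

// Goodness of fit: is a die fair? (100 rolls)
observed := [18.0, 22.0, 16.0, 14.0, 12.0, 18.0]
expected := [16.67, 16.67, 16.67, 16.67, 16.67, 16.67]
chi2, p1 := hypothesis.chi_squared_test(observed, expected)
println("chi2 = ${chi2}, p = ${p1}")

// Non-parametric comparison of two independent samples
a := [1.0, 2.0, 3.0, 4.0]
b := [5.0, 6.0, 7.0, 8.0]
u, p2 := hypothesis.mann_whitney_u_test(a, b)
println("U = ${u}, p = ${p2}")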

Neural Network Layers (nn)

Layer Operations

  • dense_layer(input_size int, output_size int) DenseLayer - Fully connected layer
  • (layer DenseLayer) forward(input []f64) []f64 - Forward pass
  • (mut layer DenseLayer) backward(grad_output []f64, input []f64, learning_rate f64) []f64 - Backward pass

Activation Functions

  • relu(x f64) f64 - ReLU activation
  • relu_derivative(x f64) f64 - ReLU derivative
  • sigmoid(x f64) f64 - Sigmoid activation
  • sigmoid_derivative(x f64) f64 - Sigmoid derivative
  • tanh(x f64) f64 - Hyperbolic tangent
  • tanh_derivative(x f64) f64 - Tanh derivative
  • softmax(x []f64) []f64 - Softmax activation

Sequential Network

  • sequential(layer_sizes []int, activation_fn string) NeuralNetwork - Create sequential model
  • (net NeuralNetwork) forward(input []f64) []f64 - Forward propagation
  • (mut net NeuralNetwork) backward(grad_output []f64, input []f64, learning_rate f64) []f64 - Backpropagation
  • (mut net NeuralNetwork) train(x_train [][]f64, y_train []f64, config TrainingConfig) - Train network
  • (net NeuralNetwork) predict(x [][]f64) [][]f64 - Batch predictions
  • (net NeuralNetwork) predict_single(x []f64) []f64 - Single prediction
  • (net NeuralNetwork) evaluate(x_test [][]f64, y_test []f64) f64 - Evaluate accuracy
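
A forward-pass sketch, assuming layer_sizes lists the input, hidden, and output widths (training via TrainingConfig is omitted since its fields are not documented here):

import vstats.nn

// 2 inputs -> 4 hidden units -> 1 output, ReLU activations
net := nn.sequential([2, 4, 1], "relu")

out := net.forward([0.5, -0.2])
println(out)  // untrained output; length follows the last layer size

preds := net.predict([[0.5, -0.2], [1.0, 1.0]])
println(preds)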

Usage Examples

Generic Types Support

import vstats.stats
import vstats.nn

// Statistical functions accept both int and f64
int_mean := stats.mean([1, 2, 3, 4, 5])  // Returns f64
f64_mean := stats.mean([1.0, 2.0, 3.0])  // Also returns f64

// Neural network loss functions are also generic
y_true := [1, 2, 3]
y_pred := [1, 2, 2]
loss := nn.mse_loss(y_true, y_pred)  // Works with int arrays

// See examples/generic_types_example.v for a comprehensive demo

ANOVA Test

import vstats.stats

// Compare three treatment groups
control := [1.0, 2.0, 3.0, 2.5]
treatment_a := [4.0, 5.0, 4.5, 5.5]
treatment_b := [7.0, 8.0, 7.5, 8.5]

f_stat, p_val := stats.anova_one_way([control, treatment_a, treatment_b])
if p_val < 0.05 {
    println("Groups have significantly different means")
}

Model Evaluation

import vstats.utils

y_true := [1, 1, 0, 0, 1, 0]
y_pred := [1, 0, 0, 1, 1, 0]

// Method 1: Build confusion matrix manually
cm := utils.build_confusion_matrix(y_true, y_pred)
println("Accuracy: ${cm.accuracy():.4f}")
println("F1 Score: ${cm.f1_score():.4f}")
println(cm.summary())

// Method 2: Get all metrics at once
metrics := utils.binary_classification_metrics(y_true, y_pred)
for name, value in metrics {
    println("${name}: ${value:.4f}")
}

ROC-AUC Score

import vstats.utils

y_true := [1, 1, 0, 1, 0, 0]
y_proba := [0.9, 0.8, 0.3, 0.7, 0.2, 0.1]

roc := utils.roc_curve(y_true, y_proba)
println("AUC: ${roc.auc:.4f}")  // Closer to 1.0 is better

Training with Early Stopping and LR Decay

import vstats.utils

mut losses := []f64{}
for epoch in 0..100 {
    // Train model, compute loss (compute_loss is a placeholder for your training step)
    loss := compute_loss()
    losses << loss

    // Stop if the loss has not improved for the last 10 epochs
    if utils.early_stopping(losses, 10) {
        println("Early stopping at epoch ${epoch}")
        break
    }

    // Exponential learning rate decay
    lr := utils.decay_learning_rate(0.1, epoch, 0.95)
    optimizer.set_learning_rate(lr)  // placeholder optimizer
}

Grid Search for Hyperparameters

import vstats.utils

param_ranges := {
    'learning_rate': [0.001, 0.01, 0.1]
    'batch_size': [16.0, 32.0, 64.0]
}

grid := utils.generate_param_grid(param_ranges)
for combo in grid {
    lr := combo['learning_rate']
    batch := combo['batch_size']
    // Train model with these parameters
}

Feature Normalization for ML

import vstats.utils
import vstats.ml

// Load dataset
iris := utils.load_iris()!
train, test := iris.train_test_split(0.2)
x_train, y_train := train.xy()
x_test, y_test := test.xy()

// Normalize using training set statistics
x_train_norm, means, stds := utils.normalize_features(x_train)

// Apply the same normalization to the test set (prevents data leakage)
x_test_norm := utils.apply_normalization(x_test, means, stds)

// Train a model on the normalized data (integer labels converted to f64)
y_train_float := y_train.map(f64(it))
model := ml.logistic_regression(x_train_norm, y_train_float, 1000, 0.01)

// Predict and evaluate on the normalized test data
predictions := ml.logistic_predict(model, x_test_norm, 0.5)
metrics := utils.binary_classification_metrics(y_test, predictions.map(int(it)))

Dataset Loading

import vstats.utils

// Load the iris dataset
iris := utils.load_iris()!
println(iris.summary())

// Split into train/test
train, test := iris.train_test_split(0.2)
x_train, y_train := train.xy()
x_test, y_test := test.xy()

// Evaluate predictions from a previously trained model (training omitted)
predictions := model.predict(x_test)  // placeholder model
metrics := utils.regression_metrics(y_test.map(f64(it)), predictions)

Roadmap

  • Add more optimization algorithms
  • Dimensionality reduction (PCA, t-SNE)
  • Time series forecasting (ARIMA, exponential smoothing)
  • Convolutional and Recurrent neural network layers
  • Learning rate scheduling variants (warmup, cosine annealing)
  • Model checkpointing and serialization
  • GPU acceleration support
  • Complete symbolic computation module

Disclaimer

  • This library was written as an exercise to bring V closer to data analytics and machine learning tasks
  • Heavily inspired by Joel Grus's book "Data Science from Scratch: First Principles with Python"
  • It is not optimized for performance (the current focus is correctness and API design)
  • Documentation is an ongoing effort

Contributing

Contributions and pull requests are welcome!

References

  • Joel Grus, "Data Science from Scratch: First Principles with Python", O'Reilly Media
