pyNNMF is a Python library for computing Non-Negative Matrix Factorization (NMF) with built-in support for missing value imputation. Unlike standard NMF libraries (e.g., scikit-learn), pyNNMF is resilient to missing data (NaN values and unobserved entries) and handles them natively using optimized NumPy routines.
- Native Python & NumPy: Highly optimized vectorized linear algebra operations without compiled or external C/C++ dependencies.
- Missing Value Resiliency: Handles missing values (`NaN`) and observed/unobserved zeroes without failing or distorting optimization gradients.
- Multiple Solvers: Supports Multiplicative Updates (MU), Alternating Least Squares (ALS), and Hierarchical Alternating Least Squares (HALS).
- Multiple Cost Functions: Minimizes Frobenius Norm (Euclidean), Kullback-Leibler (KL) Divergence, and Itakura-Saito (IS) Divergence.
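
To make the idea of mask-aware factorization concrete, here is a minimal NumPy sketch of Frobenius-norm NMF with multiplicative updates that fits only the observed entries. It is an illustration of the general technique, not pyNNMF's implementation: the function name `masked_nmf_mu` and its parameters are hypothetical, and the library's own solvers may differ in initialization, stopping criteria, and update details.

```python
import numpy as np


def masked_nmf_mu(X, rank, n_iter=500, eps=1e-9, seed=0):
    """Hypothetical helper (not part of pyNNMF): Frobenius-norm NMF via
    multiplicative updates, fitted only on the non-NaN entries of X."""
    rng = np.random.default_rng(seed)
    M = ~np.isnan(X)            # True where an entry is observed
    Xz = np.where(M, X, 0.0)    # zero-fill so missing entries add nothing to the sums
    n, m = X.shape
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, m)) + eps
    for _ in range(n_iter):
        WH = M * (W @ H)        # reconstruction restricted to observed entries
        H *= (W.T @ Xz) / (W.T @ WH + eps)
        WH = M * (W @ H)
        W *= (Xz @ H.T) / (WH @ H.T + eps)
    return W, H


# Usage: build a rank-3 matrix, hide about 20% of it, then impute.
rng = np.random.default_rng(1)
X = rng.random((12, 3)) @ rng.random((3, 9))
X[rng.random(X.shape) < 0.2] = np.nan
W, H = masked_nmf_mu(X, rank=3)
X_filled = np.where(np.isnan(X), W @ H, X)  # keep observed values, fill the gaps
```

Because the mask multiplies the reconstruction before every update, missing entries never enter the numerator or denominator, so they cannot pull the factors toward zero the way naive zero-filling would.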
To install pyNNMF locally or prepare it for development:

    git clone https://github.com/mariolpantunes/pyNNMF.git
    cd pyNNMF
    pip install -e .

To get the best speed and imputation (matrix completion) accuracy out of pyNNMF, select the solver and cost function pair according to the data's noise distribution and missingness ratio:
| Noise Type | Recommended Cost Function | Recommended Solver | Rationale |
|---|---|---|---|
| Additive / Gaussian | Frobenius Norm (`cost_fb`) | `nmf_als` (or `nmf_mu`) | Minimizing the Frobenius norm is equivalent to maximizing the likelihood under Gaussian noise. ALS converges quickly. |
| Count / Poisson / Sparse | KL Divergence (`cost_kl`) | `nmf_mu(cost='kl')` (or `nmf_mu_kl`) | KL divergence corresponds to the Poisson likelihood and tends to produce sparse factors. |
| Scale-Invariant / Audio | IS Divergence (`cost_is`) | `nmf_mu(cost='is')` (or `nmf_mu_is`) | IS divergence measures relative rather than absolute errors, which protects small-magnitude values. |
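
For reference, the three objectives in the table can be written in a few lines of NumPy when evaluated only on observed entries. This is a sketch under stated assumptions: the function names below are illustrative and are not the library's `cost_fb`, `cost_kl`, or `cost_is`, whose signatures and scaling constants may differ. The factor 0.5 on the Frobenius cost and the `eps` clipping are conventions chosen here.

```python
import numpy as np


def _observed(X, X_hat):
    """Return the observed (non-NaN) entries of X and the matching estimates."""
    M = ~np.isnan(X)
    return X[M], X_hat[M]


def frobenius_cost(X, X_hat):
    """Half the squared Euclidean error over observed entries."""
    x, y = _observed(X, X_hat)
    return 0.5 * np.sum((x - y) ** 2)


def kl_cost(X, X_hat, eps=1e-12):
    """Generalized KL divergence: sum(x*log(x/y) - x + y), with 0*log(0) taken as 0."""
    x, y = _observed(X, X_hat)
    xs, ys = np.maximum(x, eps), np.maximum(y, eps)
    return np.sum(np.where(x > 0, x * np.log(xs / ys), 0.0) - x + y)


def is_cost(X, X_hat, eps=1e-12):
    """Itakura-Saito divergence: sum(x/y - log(x/y) - 1).
    Zeros are clipped to eps because IS is undefined at 0."""
    x, y = _observed(X, X_hat)
    ratio = np.maximum(x, eps) / np.maximum(y, eps)
    return np.sum(ratio - np.log(ratio) - 1.0)


# Usage: the NaN entry is ignored by all three costs.
X = np.array([[1.0, 2.0], [np.nan, 4.0]])
X_hat = np.array([[1.5, 2.0], [3.0, 3.0]])
costs = (frobenius_cost(X, X_hat), kl_cost(X, X_hat), is_cost(X, X_hat))
```

The IS cost depends only on the ratio `x / y`, so rescaling both the data and the estimate by the same constant leaves it unchanged, while the Frobenius cost grows with the square of that constant. This is the sense in which IS measures relative error.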
- Low-to-Moderate Missingness (< 30%): The HALS solver (`nmf_hals`) is recommended. It updates variables coordinate-wise and converges very quickly to the lowest training-objective minimum. (Note: HALS only supports the Frobenius norm.)
- High Missingness (> 30%) / Highly Noisy: The MU and ALS solvers (`nmf_mu`, `nmf_als`, `rwnmf`) are recommended. Their slower, diagonally scaled update trajectories act as an implicit regularizer, which prevents overfitting to the small number of observed entries.
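
To decide which regime a dataset falls into, and to check whether a solver is overfitting the observed entries, it can help to measure the missingness ratio and score reconstructions on entries held out from training. The helpers below are a hypothetical sketch in plain NumPy (none of these names come from pyNNMF); any factorization routine can be plugged in between the split and the scoring step.

```python
import numpy as np


def missing_ratio(X):
    """Fraction of entries in X that are NaN."""
    return float(np.isnan(X).mean())


def holdout_split(X, frac=0.1, seed=0):
    """Hide a random fraction of the observed entries.
    Returns a training copy (held-out entries set to NaN) and a boolean test mask."""
    rng = np.random.default_rng(seed)
    obs = np.argwhere(~np.isnan(X))           # (row, col) pairs of observed entries
    n_test = int(round(frac * len(obs)))
    pick = rng.choice(len(obs), size=n_test, replace=False)
    rows, cols = obs[pick, 0], obs[pick, 1]
    X_train = X.copy()                        # leave the caller's matrix untouched
    X_train[rows, cols] = np.nan
    test_mask = np.zeros(X.shape, dtype=bool)
    test_mask[rows, cols] = True
    return X_train, test_mask


def holdout_rmse(X, X_hat, test_mask):
    """Root-mean-square error on the held-out entries only."""
    diff = X[test_mask] - X_hat[test_mask]
    return float(np.sqrt(np.mean(diff ** 2)))


# Usage: hide 20% of a fully observed matrix and inspect the resulting ratio.
rng = np.random.default_rng(0)
X = rng.random((10, 10))
X_train, test_mask = holdout_split(X, frac=0.2)
ratio = missing_ratio(X_train)  # compare against the 30% guideline above
```

A training error that keeps falling while the held-out RMSE rises is the overfitting pattern the guidance above warns about for highly incomplete data.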
Demonstration scripts are available in the examples directory:
The imputation example demonstrates how to initialize a low-rank matrix, mask entries as missing (`NaN`), and reconstruct/impute them:

    PYTHONPATH=src python examples/imputation_example.py

Benchmark execution times and validate prediction accuracy (out-of-sample RMSE/MAE) across different noise distributions and missingness ratios using the exectimeit library:

    PYTHONPATH=src python examples/validate_solvers.py --size 100 --noise gaussian --ratio 0.15

Evaluate the impact of different initialization strategies (random, nndsvd, svd_impute) on the convergence speed and final reconstruction error:

    PYTHONPATH=src python examples/init_comparison.py

The test suite can be run using the standard Python unittest module:

    PYTHONPATH=src python -m unittest discover -s test

To run linting and static type checking:

    ruff check src/ test/ examples/
    npx pyright src/ test/ examples/

Detailed package documentation is hosted on GitHub Pages.
- Mário Antunes - mariolpantunes
This project is licensed under the MIT License - see the LICENSE file for details.