Releases · vitaut/zmij

What’s Changed

This release focuses on correctness and performance. It provides a minimal, safe API for obtaining the shortest correctly rounded decimal representation of a binary floating-point value, either in exponential format or as a decimal floating-point number.
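To make "shortest representation" concrete, here is a naive reference sketch that uses only the C standard library, not zmij. The helper name `naive_shortest` is hypothetical. It tries increasing precision until the printed text parses back to the same value, which is far slower than a Schubfach-style algorithm and can, in rare boundary cases, produce more digits than a true shortest-digits algorithm would. It assumes a finite input and a correctly rounding `snprintf`/`strtod`.

```cpp
#include <cstdio>
#include <cstdlib>
#include <string>

// Hypothetical helper, not part of zmij: finds the smallest precision (1..17
// significant digits) whose exponential-format output parses back to `value`.
// 17 significant digits are always enough for a finite binary64 value.
std::string naive_shortest(double value) {
  char buf[32];
  for (int precision = 1; precision <= 17; ++precision) {
    // "%.*e" prints one digit before the point and `precision - 1` after it.
    std::snprintf(buf, sizeof(buf), "%.*e", precision - 1, value);
    if (std::strtod(buf, nullptr) == value) break;  // round-trips: stop here
  }
  return buf;
}
```

For example, `naive_shortest(0.1)` stops at one significant digit, because the one-digit output already parses back to the same double.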

Performance & Algorithm Improvements

Optimized division, modulo, and logarithm computations
Reduced conditional branching compared to Schubfach
Simplified decimal significand selection by using a single shorter candidate, based on an idea by Cassio Neri
Simplified and optimized modified rounding computation
Applied an optimization by Yaoyuan Guo, replacing 2–3 costly 128×64 multiplications with a single multiplication in the common case
Reworked digit generation to process eight digits at a time instead of using lookup tables, reducing branching and enabling better compiler optimization for more consistent performance (#4, thanks @TobiSchluter)
Switched to BCD encoding to evaluate eight significand digits in parallel (#7, thanks @TobiSchluter and @xjb714)
Made exponent handling branch-free (#10, #16, thanks @TobiSchluter)
Switched to built-in leading-zero counting with a safe fallback for older compilers (#21, #22, #25, thanks @AlexGuteniev)
Applied a collection of improvements from Dougall Johnson (#49):
- Further optimized division, modulo, and logarithm computations
- Optimized exponent output logic
- Generated the powers-of-10 table using constexpr and 192-bit arithmetic
- Applied faster indexed loads on ARM to improve table access performance
- Introduced an optional precomputed exp_shift table to speed up decimal scaling
Peeled off the rightmost digit to enable cheaper division and remove zero checks, reducing branching and streamlining digit extraction (#72, thanks @TobiSchluter)
Replaced abs with a ternary expression to recover ~10% performance on GCC (#66, thanks @TobiSchluter)
Reduced conditional branching on ARM to improve performance (#73, thanks @xjb714)
Optimized NEON zero-check logic to speed up digit processing (#74, thanks @xjb714)
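The items above on eight-digit processing and BCD encoding can be illustrated with a well-known SWAR (SIMD within a register) trick. This is a minimal sketch of the general technique, not zmij's actual code. The names `to_bcd8` and `write8` are hypothetical, and the input is assumed to be below 100000000.

```cpp
#include <cstdint>

// Hypothetical helper, not zmij's code: converts n (0 <= n < 100000000) into
// eight BCD digits, one per byte, with the most significant digit in the top byte.
inline std::uint64_t to_bcd8(std::uint32_t n) {
  // Put the two 4-digit halves into separate 32-bit lanes: [abcd | efgh].
  std::uint64_t lanes = (std::uint64_t(n / 10000) << 32) | (n % 10000);
  // Divide both lanes by 100 at once: x / 100 == (x * 5243) >> 19 for x < 10000.
  std::uint64_t q100 = ((lanes * 5243) >> 19) & 0x0000007f0000007fULL;
  // Each 32-bit lane becomes two 16-bit lanes: [x / 100 | x % 100].
  lanes += q100 * (0x10000 - 100);
  // Divide all four lanes by 10 at once: y / 10 == (y * 103) >> 10 for y < 100.
  std::uint64_t q10 = ((lanes * 103) >> 10) & 0x000f000f000f000fULL;
  // Each 16-bit lane becomes two bytes: [y / 10 | y % 10].
  lanes += q10 * (0x100 - 10);
  return lanes;
}

// Writes exactly eight ASCII digits, zero-padded, without lookup tables.
inline void write8(char* out, std::uint32_t n) {
  std::uint64_t ascii = to_bcd8(n) + 0x3030303030303030ULL;  // add '0' to every byte
  for (int i = 0; i < 8; ++i)
    out[i] = static_cast<char>((ascii >> (56 - 8 * i)) & 0xff);
}
```

The byte-by-byte store keeps the sketch independent of endianness. A tuned implementation would more likely byte-swap where needed and store all eight bytes at once.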

SIMD & Architecture Support

Added an optimized write_significand implementation using NEON to accelerate digit extraction on supported platforms (thanks @dougallj)
Added SSE SIMD support on x86 to leverage 128-bit vector instructions for faster parallel digit processing and improved performance (#59, thanks @TobiSchluter)
Enabled NEON vectorization on ARM64 MSVC (#55, thanks @AlexGuteniev)
Disabled SIMD correctly when ZMIJ_USE_SIMD=0 (#75, thanks @TobiSchluter)

Portability & Toolchain Fixes

Fixed MSVC support on x64 and ARM64 by improving code generation, replacing unavailable intrinsics, and resolving related warnings (#8, thanks @mmozeiko)
Fixed multiple MSVC issues across 32-bit builds, table generation, warning cleanup, vector type handling, and forced inlining (#30, #31, #34, #42, #44, #48, #50, #65, #69, #71, #76, thanks @AlexGuteniev)
Fixed compilation regressions after recent changes (#38, #45, #56, #57, #77, thanks @AlexGuteniev and @TobiSchluter)
Fixed GCC compilation issues related to NEON intrinsics (#53)

API & Usability

Added to_decimal for converting binary floating-point values to decimal (#6)
Added float (binary32) support (#1, #15)
Made the API return the size of the resulting representation (#32, thanks @AlexGuteniev)
Reduced float_buffer_size from 17 to 16 (#51, #52, thanks @dtolnay)
Trimmed leading zeros in float formatting (#27, thanks @dtolnay)
Lowered minimum required standard to C++14 (#61, #62, thanks @AlexGuteniev)

Correctness & Safety

Added an assertion to countl_zero that its input is non-zero (#29, thanks @AlexGuteniev)
Completed fallback implementation for bswap64 (#23, thanks @AlexGuteniev)
Avoided subtracting unrelated pointers for small buffers (#36, thanks @AlexGuteniev)
Prevented use of hundreds in float exponent path (#43, thanks @AlexGuteniev)
Fixed handling of subnormals (#11, #17, #19)
Fixed incorrect formatting of 32-bit infinity and NaN values (#70)
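The countl_zero and bswap64 items above follow a common pattern: use a compiler builtin when one is available and fall back to portable code otherwise. The sketch below illustrates that pattern under stated assumptions and is not the library's implementation. The `_sketch` names are hypothetical, and only the GCC/Clang builtin is shown.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: counts leading zero bits of a non-zero 64-bit value.
inline int countl_zero_sketch(std::uint64_t x) {
  // __builtin_clzll has an undefined result for 0, so the precondition is asserted.
  assert(x != 0 && "countl_zero requires a non-zero input");
#if defined(__GNUC__) || defined(__clang__)
  return __builtin_clzll(x);
#else
  // Portable fallback: scan from the top bit down; terminates because x != 0.
  int n = 0;
  for (std::uint64_t bit = 1ULL << 63; (x & bit) == 0; bit >>= 1) ++n;
  return n;
#endif
}

// Hypothetical helper: portable 64-bit byte swap using only shifts and masks.
inline std::uint64_t bswap64_sketch(std::uint64_t x) {
  // Swap adjacent bytes, then adjacent 16-bit pairs, then the 32-bit halves.
  x = ((x & 0x00ff00ff00ff00ffULL) << 8) | ((x >> 8) & 0x00ff00ff00ff00ffULL);
  x = ((x & 0x0000ffff0000ffffULL) << 16) | ((x >> 16) & 0x0000ffff0000ffffULL);
  return (x << 32) | (x >> 32);
}
```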

Verification & Tooling

Added test coverage for major configurations: default (C++), no SIMD, no builtins, and C
Added verification programs
Parallelized the verification program to run across multiple threads, dividing the test space by hardware concurrency to speed up exhaustive correctness checking (#26, thanks @dtolnay)
Improved verification tooling (thanks @AlexGuteniev):
- Increased concurrency support (#33)
- Made failures return non-zero exit codes (#37)
- Added buffer overrun tests (#46, #54)
Fixed CSV writer usage (#60, thanks @TobiSchluter)
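The parallel verification approach described above, which divides the test space by hardware concurrency, could look roughly like the following. This is only a sketch, not the project's verification program. `round_trips` and `count_failures` are hypothetical names, and `snprintf` with nine significant digits stands in for the formatter under test.

```cpp
#include <algorithm>
#include <atomic>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <thread>
#include <vector>

// Stand-in check (not zmij): nine significant digits always round-trip a float,
// assuming a correctly rounding snprintf/strtof.
inline bool round_trips(float value) {
  char buf[32];
  std::snprintf(buf, sizeof(buf), "%.9g", value);
  return std::strtof(buf, nullptr) == value;
}

// Checks every float whose bit pattern lies in [first, last), with the range
// split into one contiguous chunk per hardware thread.
inline std::uint64_t count_failures(std::uint64_t first, std::uint64_t last) {
  unsigned num_threads = std::max(1u, std::thread::hardware_concurrency());
  std::uint64_t chunk = (last - first + num_threads - 1) / num_threads;
  std::atomic<std::uint64_t> failures(0);
  std::vector<std::thread> threads;
  for (unsigned i = 0; i < num_threads; ++i) {
    std::uint64_t begin = std::min(last, first + i * chunk);
    std::uint64_t end = std::min(last, begin + chunk);
    threads.emplace_back([begin, end, &failures] {
      std::uint64_t local = 0;  // per-thread counter avoids contention
      for (std::uint64_t bits = begin; bits < end; ++bits) {
        std::uint32_t pattern = static_cast<std::uint32_t>(bits);
        float value;
        std::memcpy(&value, &pattern, sizeof(value));
        if (value != value) continue;  // skip NaNs, which never compare equal
        if (!round_trips(value)) ++local;
      }
      failures += local;
    });
  }
  for (auto& t : threads) t.join();
  return failures.load();
}
```

An exhaustive float check would call `count_failures(0, 1ULL << 32)`, and a driver program can return a non-zero exit code whenever the result is non-zero so that CI notices failures.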