Releases: vitaut/zmij
1.0 "exponentially fast"
What’s Changed
This release focuses on correctness and performance, providing a minimal, safe API to obtain the shortest correctly rounded decimal representation in either exponential format or as a decimal floating-point number.
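As a rough illustration of what "shortest" means here, the sketch below uses brute force: it asks `snprintf` for more and more significant digits until `strtod` parses the text back to the same `double`. It is a slow reference for intuition only, not zmij's algorithm or API. The helper name `shortest_exp_reference` is hypothetical, and at some power-of-two boundaries this approach may emit one digit more than a dedicated shortest-representation algorithm would.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <string>

// Hypothetical reference helper (not part of zmij): tries 1..17 significant
// digits and returns the first exponential-format string that parses back to
// exactly the same double. Expects a finite value.
inline std::string shortest_exp_reference(double value) {
  char buf[32];
  for (int digits = 1; digits <= 17; ++digits) {
    // %.*e prints one digit before the point and `digits - 1` after it.
    int n = std::snprintf(buf, sizeof(buf), "%.*e", digits - 1, value);
    assert(n > 0 && n < static_cast<int>(sizeof(buf)));
    (void)n;
    // 17 significant digits always round-trip, so buf is valid on exit.
    if (std::strtod(buf, nullptr) == value) break;
  }
  return buf;
}
```

For example, `shortest_exp_reference(1.5)` rejects the one-digit candidate because it does not parse back to 1.5, and settles on two digits.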
Performance & Algorithm Improvements
- Optimized division, modulo, and logarithm computations
- Reduced conditional branching compared to Schubfach
- Simplified decimal significand selection by using a single shorter candidate, based on an idea by Cassio Neri
- Simplified and optimized modified rounding computation
- Applied an optimization by Yaoyuan Guo, replacing 2–3 costly 128×64 multiplications with a single multiplication in the common case
- Reworked digit generation to process eight digits at a time instead of using lookup tables, reducing branching and enabling better compiler optimization for more consistent performance (#4, thanks @TobiSchluter)
- Switched to BCD encoding to evaluate eight significand digits in parallel (#7, thanks @TobiSchluter and @xjb714)
- Made exponent handling branch-free (#10, #16, thanks @TobiSchluter)
- Switched to built-in leading-zero counting with a safe fallback for older compilers (#21, #22, #25, thanks @AlexGuteniev)
- Applied a collection of improvements from Dougall Johnson (#49):
  - Further optimized division, modulo, and logarithm computations
  - Optimized exponent output logic
  - Generated the powers-of-10 table using `constexpr` and 192-bit arithmetic
  - Applied faster indexed loads on ARM to improve table access performance
  - Introduced an optional precomputed `exp_shift` table to speed up decimal scaling
- Peeled off the rightmost digit to enable cheaper division and remove zero checks, reducing branching and streamlining digit extraction (#72, thanks @TobiSchluter)
- Replaced `abs` with a ternary expression to recover ~10% performance on GCC (#66, thanks @TobiSchluter)
- Reduced conditional branching on ARM to improve performance (#73, thanks @xjb714)
- Optimized NEON zero-check logic to speed up digit processing (#74, thanks @xjb714)
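The idea of producing eight digits at once in BCD form, rather than via lookup tables, can be sketched with plain 64-bit integer arithmetic. This is a generic SWAR (SIMD within a register) illustration under stated assumptions, not zmij's actual code: the names `to_bcd8` and `write8` are hypothetical, and the input must be below 10^8.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not zmij's code. Converts n < 10^8 into eight BCD
// digits, one per byte, with the most significant digit in the lowest byte.
inline std::uint64_t to_bcd8(std::uint32_t n) {
  assert(n < 100000000);
  // Two 4-digit halves in 32-bit lanes (high digits in the low lane).
  std::uint64_t x = (n / 10000) | (std::uint64_t(n % 10000) << 32);
  // Split each 4-digit lane into two 2-digit values in 16-bit lanes.
  // (v * 5243) >> 19 equals v / 100 for v < 10000.
  std::uint64_t q = ((x * 5243) >> 19) & 0x0000007f0000007fULL;
  x = q + ((x - q * 100) << 16);
  // Split each 2-digit lane into two single digits in 8-bit lanes.
  // (v * 103) >> 10 equals v / 10 for v < 100.
  q = ((x * 103) >> 10) & 0x000f000f000f000fULL;
  x = q + ((x - q * 10) << 8);
  return x;
}

// Writes exactly eight ASCII digits (with leading zeros) to out.
inline void write8(char* out, std::uint32_t n) {
  // Adding '0' (0x30) to every byte turns BCD digits into ASCII.
  std::uint64_t ascii = to_bcd8(n) + 0x3030303030303030ULL;
  // Byte-wise store keeps the sketch independent of endianness.
  for (int i = 0; i < 8; ++i) out[i] = static_cast<char>(ascii >> (8 * i));
}
```

All eight digits come out of a fixed sequence of multiplies, shifts, and masks with no data-dependent branches, which is the property the items above aim for.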
SIMD & Architecture Support
- Added an optimized `write_significand` implementation using NEON to accelerate digit extraction on supported platforms (thanks @dougallj)
- Added SSE SIMD support on x86 to leverage 128-bit vector instructions for faster parallel digit processing and improved performance (#59, thanks @TobiSchluter)
- Enabled NEON vectorization on ARM64 MSVC (#55, thanks @AlexGuteniev)
- Disabled SIMD correctly when `ZMIJ_USE_SIMD=0` (#75, thanks @TobiSchluter)
Portability & Toolchain Fixes
- Fixed MSVC support on x64 and ARM64 by improving code generation, replacing unavailable intrinsics, and resolving related warnings (#8, thanks @mmozeiko)
- Fixed multiple MSVC issues across 32-bit builds, table generation, warning cleanup, vector type handling, and forced inlining (#30, #31, #34, #42, #44, #48, #50, #65, #69, #71, #76, thanks @AlexGuteniev)
- Fixed compilation regressions after recent changes (#38, #45, #56, #57, #77, thanks @AlexGuteniev and @TobiSchluter)
- Fixed GCC compilation issues related to NEON intrinsics (#53)
API & Usability
- Added `to_decimal` for converting binary floating-point values to decimal (#6)
- Added `float` (binary32) support (#1, #15)
- Returned the size of the resulting representation (#32, thanks @AlexGuteniev)
- Reduced `float_buffer_size` from 17 to 16 (#51, #52, thanks @dtolnay)
- Trimmed leading zeros in float formatting (#27, thanks @dtolnay)
- Lowered minimum required standard to C++14 (#61, #62, thanks @AlexGuteniev)
Correctness & Safety
- Added assertion to `countl_zero` for non-zero input (#29, thanks @AlexGuteniev)
- Completed fallback implementation for `bswap64` (#23, thanks @AlexGuteniev)
- Avoided subtracting unrelated pointers for small buffers (#36, thanks @AlexGuteniev)
- Prevented use of hundreds in float exponent path (#43, thanks @AlexGuteniev)
- Fixed handling of subnormals (#11, #17, #19)
- Fixed incorrect formatting of 32-bit `infinity` and `NaN` values (#70)
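A leading-zero count built on a compiler builtin needs a non-zero precondition, because `__builtin_clzll(0)` is undefined in GCC and Clang. The sketch below shows that general pattern together with a portable fallback. It is an assumption-laden illustration, not zmij's source: the names `countl_zero_sketch` and `countl_zero_fallback` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Portable fallback: shifts left until the top bit is set.
// Terminates only for non-zero input, hence the assertion.
inline int countl_zero_fallback(std::uint64_t x) {
  assert(x != 0);
  int n = 0;
  while ((x >> 63) == 0) {
    x <<= 1;
    ++n;
  }
  return n;
}

// Uses the GCC/Clang builtin when available, otherwise the fallback.
inline int countl_zero_sketch(std::uint64_t x) {
  assert(x != 0);  // the builtin's result is undefined for zero
#if defined(__GNUC__) || defined(__clang__)
  return __builtin_clzll(x);
#else
  return countl_zero_fallback(x);
#endif
}
```

Keeping the assertion in both paths means a debug build catches a zero argument regardless of which implementation the toolchain selects.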
Verification & Tooling
- Added test coverage for major configurations: default (C++), no SIMD, no builtins, and C
- Added verification programs
- Parallelized the verification program to run across multiple threads, dividing the test space by hardware concurrency to speed up exhaustive correctness checking (#26, thanks @dtolnay)
- Improved verification tooling (thanks @AlexGuteniev):
  - Increased concurrency support (#33)
  - Made failures return non-zero exit codes (#37)
  - Added buffer overrun tests (#46, #54)
- Fixed CSV writer usage (#60, thanks @TobiSchluter)
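Dividing an exhaustive test space by hardware concurrency can be sketched as follows. This is a minimal illustration of the general approach, not the project's verification program: `count_failures` and its `check` callback are hypothetical names, and the sketch assumes the range is small enough (for example, all 2^32 `float` bit patterns) that the chunk arithmetic cannot overflow.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical driver: runs check(bits) for every value in [begin, end),
// split into one contiguous chunk per hardware thread, and returns the
// number of values for which check returned false.
template <typename Check>
std::uint64_t count_failures(std::uint64_t begin, std::uint64_t end,
                             Check check) {
  assert(begin <= end);
  // hardware_concurrency() may return 0 when the count is unknown.
  unsigned n = std::max(1u, std::thread::hardware_concurrency());
  std::uint64_t total = end - begin;
  std::atomic<std::uint64_t> failures{0};
  std::vector<std::thread> workers;
  for (unsigned i = 0; i < n; ++i) {
    std::uint64_t lo = begin + total * i / n;
    std::uint64_t hi = begin + total * (i + 1) / n;
    workers.emplace_back([=, &failures] {
      std::uint64_t local = 0;  // accumulate locally, publish once
      for (std::uint64_t v = lo; v < hi; ++v)
        if (!check(static_cast<std::uint32_t>(v))) ++local;
      failures += local;
    });
  }
  for (auto& t : workers) t.join();
  return failures.load();
}
```

A `main` built on this could return `count_failures(...) != 0 ? 1 : 0`, so that a failing run reports a non-zero exit code to CI.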
New Contributors
- @TobiSchluter – first contribution in #4
- @mmozeiko – first contribution in #8
- @dtolnay – first contribution in #13
Full Changelog: https://github.com/vitaut/zmij/commits/v1.0