Essenceia/uselessly_fast_bfloat16_multiplier

Uselessly fast bfloat16 multiplier ASIC

This repository contains a very high frequency bfloat16 multiplier ASIC macro taped out as part of the Tiny Tapeout iph0p4 private experimental shuttle, targeting IHP's experimental 130 nm CMOS sg13cmos5l node.

This bfloat16 multiplier was designed as part of a maximum frequency challenge and can operate at up to 454.545 MHz on the nominal operating corner of 1.20 V at 25°C.

layout

Max frequency challenge

This design was built as a friendly (🔥 w 🔥) competition against NikLeberg, to see which of us could take the crown for the highest possible maximum frequency floating point multiplier on the nominal corner.

Each of us is using a different floating point type for our multiplier:

| Designer | Module | Floating Point Type | Denormals | Infinity | NaN | Rounding Mode |
|---|---|---|---|---|---|---|
| NikLeberg | tt_um_float_synth | float8 | Yes | Yes | No | RTZ (round toward zero) |
| Essenceia | tt_um_essen | bfloat16 | No | No | No | RTZ (round toward zero) |

Timing optimization strategy

Interestingly, each of us chose a very different strategy for optimizing our timing.

Synthesizer driven

Nik chose a tooling-focused strategy with a strong emphasis on synthesis optimization, and more specifically on backward retiming (retime -M 4 -b). The main idea of this retiming-driven frequency optimization was to introduce extra empty cycles after the logic and let the synthesizer automatically rebalance the logic across the available cycles. The full explanation can be found in the tt_um_float documentation.

Figure: synth result. Synthesis JSON results rendered using LintyServices.linty-graphviz by NikLeberg; all credit belongs to him.

By pipelining the floating point multiplication over 8 cycles, this design reached a maximum operating frequency of 550 MHz, taking the crown for this challenge.

RTL driven

For the tt_um_essen project I chose to optimize timing through the more manual approach of RTL refinement: investing extra effort in optimizing the critical paths, and trading wider logic for shallower paths. This was made much more approachable by the fact that I had implemented the bfloat16 multiplication logic from scratch, so I already had good intuitions about which logic would end up on my critical paths once implemented.

Unlike tt_um_float_synth, tt_um_essen only has an 8-bit wide input interface, and so needs 4 cycles to shift data in for a multiplication. It also needs 2 cycles to stream out the result, given that the output data bus is also 8 bits wide. I will not be counting these cycles as part of the floating point multiplication. For full transparency, however, I would like to call the reader's attention to the fact that these cycles have less logic depth, which does help the timing of the multiplication cycles. Additionally, some part of tt_um_float_synth's first and last paths might be consumed by interfacing with the macro's IO pins.

The bfloat16 multiplication was cut into 2 cycles to improve performance. As expected, the main critical path went through the mantissa multiplication. Unfortunately, the original implementation relied on the synthesizer to infer an unsigned Booth radix-4 multiplier, so I could not place a flop inside it. To pipeline this path, I therefore needed to re-implement a custom 8-bit unsigned Booth radix-4 multiplier.

Inside this custom multiplication stage, a flop is added after the encoding stage, in the middle of the compression stage. We store the partial compression of the first two partial products and the partial compression of the last three. On the next cycle, these two partial results are compressed together to get the final result of the mantissa multiplication.
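The split described above can be illustrated with a small behavioral model. This is only a sketch under stated assumptions, not the RTL from this repository: the function names are hypothetical, and the carry-save compression is modeled with plain integer additions. It recodes an unsigned 8-bit multiplier into five radix-4 Booth digits, sums partial products 0 and 1 and partial products 2 to 4 in a first stage, then combines the two registered values in a second stage.

```python
def booth_radix4_digits(b: int) -> list[int]:
    """Recode an unsigned 8-bit multiplier into 5 radix-4 Booth digits in {-2..2}."""
    assert 0 <= b < 256
    padded = b << 1  # appends the implicit b[-1] = 0; upper bits are zero-extended
    digits = []
    for i in range(5):
        window = (padded >> (2 * i)) & 0b111  # bits b[2i+1], b[2i], b[2i-1]
        hi, mid, lo = (window >> 2) & 1, (window >> 1) & 1, window & 1
        digits.append(-2 * hi + mid + lo)
    return digits


def mul_stage1(a: int, b: int) -> tuple[int, int]:
    """First cycle (hypothetical model): encode, then partially compress.

    Returns the two values that would be held in flops: the sum of the
    first two partial products and the sum of the last three.
    """
    assert 0 <= a < 256
    digits = booth_radix4_digits(b)
    partial_products = [(d * a) << (2 * i) for i, d in enumerate(digits)]
    return sum(partial_products[:2]), sum(partial_products[2:])


def mul_stage2(low: int, high: int) -> int:
    """Second cycle (hypothetical model): final compression of the two halves."""
    return low + high


if __name__ == "__main__":
    low, high = mul_stage1(0xC0, 0xC0)  # 1.5 * 1.5 with the hidden bit restored
    print(hex(mul_stage2(low, high)))
```

An 8-bit unsigned operand needs five digits rather than four because the top digit absorbs the zero extension, which is why the partial products split into a group of two and a group of three.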

A few additional optimizations of this kind were applied throughout the multiplier, allowing this design to reach a maximum operating frequency of 454.545 MHz.

Competition results

This competition was won hands down, by a margin of nearly 100 MHz, by NikLeberg 👑

| Designer | Module | Floating Point Type | Fmul cycles | Fmax |
|---|---|---|---|---|
| NikLeberg | tt_um_float_synth | float8 | 8 | 550 MHz |
| Essenceia | tt_um_essen | bfloat16 | 2 | 454 MHz |
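As a back-of-the-envelope reading of these numbers, the clock period and the latency of one multiplication follow directly from Fmax and the cycle count. The sketch below is only arithmetic on the figures quoted in this README. The helper name is hypothetical, the IO shift-in and stream-out cycles are not counted, and the two designs use different floating point types, so the values are not a like-for-like comparison.

```python
def fmul_latency_ns(cycles: int, fmax_mhz: float) -> float:
    """Latency of one multiplication in nanoseconds: cycles times clock period."""
    period_ns = 1e3 / fmax_mhz  # 1 / MHz expressed in nanoseconds
    return cycles * period_ns


# (fmul cycles, Fmax in MHz) as reported in this README
designs = {
    "tt_um_float_synth": (8, 550.0),
    "tt_um_essen": (2, 454.545),
}

if __name__ == "__main__":
    for name, (cycles, fmax_mhz) in designs.items():
        period_ns = 1e3 / fmax_mhz
        latency_ns = fmul_latency_ns(cycles, fmax_mhz)
        print(f"{name}: period {period_ns:.3f} ns, fmul latency {latency_ns:.3f} ns")
```

The 454.545 MHz figure corresponds to a clock period of about 2.2 ns.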

IO bottleneck

Both of us are well aware that the chip's IO is unlikely to reach a stable operating regime above 75 MHz on the output path and 100 MHz on the input path. We nevertheless decided to push our maximum operating frequency as far as we could.

Floating point math

As mentioned, this design includes a custom, from-scratch implementation of bfloat16 arithmetic optimized for performance and area.

This implementation leverages the fact that there is no official standard outlining the behavior of bfloat16. It implements only the subset of floating point behavior that I judge to be necessary for our workload, in favor of higher performance at a low area budget.

These choices are:

  • round toward zero rounding only
  • no subnormal support, all subnormals will be clamped to 0
  • no $\pm \infty$ or NaN support
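The choices listed above can be sketched as a bit-level reference model. This is an illustration under stated assumptions, not the bf16_mul_fast implementation: the function names are hypothetical, flushed results keep the product's sign (an assumption), and the sketch raises an error on exponent 0xFF inputs and on overflow because this README does not say how the design handles them.

```python
import struct

BIAS = 127  # bfloat16 uses the same 8-bit exponent bias as float32


def bf16_mul_rtz(x: int, y: int) -> int:
    """Multiply two bfloat16 bit patterns (16-bit ints), rounding toward zero."""
    sign = ((x ^ y) >> 15) & 1
    ex, ey = (x >> 7) & 0xFF, (y >> 7) & 0xFF
    if ex == 0xFF or ey == 0xFF:
        raise ValueError("exponent 0xFF inputs are out of scope for this sketch")
    if ex == 0 or ey == 0:
        return sign << 15  # zero or subnormal input: treated as (signed) zero
    mx, my = (x & 0x7F) | 0x80, (y & 0x7F) | 0x80  # restore the hidden bit
    prod = mx * my  # 16-bit product, in [2**14, 2**16)
    exp = ex + ey - BIAS
    if prod & 0x8000:  # significand product in [2, 4): shift right by one more
        exp += 1
        mant = (prod >> 8) & 0x7F
    else:  # significand product in [1, 2)
        mant = (prod >> 7) & 0x7F
    # Dropping the low product bits is round toward zero in sign-magnitude form.
    if exp <= 0:
        return sign << 15  # subnormal result: clamped to zero
    if exp >= 0xFF:
        raise ValueError("overflow behavior is not specified in this README")
    return (sign << 15) | (exp << 7) | mant


def bf16_to_float(bits: int) -> float:
    """Interpret a bfloat16 bit pattern as the upper half of a float32."""
    return struct.unpack(">f", struct.pack(">I", (bits & 0xFFFF) << 16))[0]


if __name__ == "__main__":
    r = bf16_mul_rtz(0x3FC0, 0x3FC0)  # 1.5 * 1.5
    print(hex(r), bf16_to_float(r))
```

Because both significands are 8 bits wide once the hidden bit is restored, the only wide operation is an 8-bit by 8-bit unsigned multiplication, and truncating its low bits costs no extra rounding logic.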

For more information, refer to the bfloat repository. The fast multiplier bf16_mul_fast is currently only available on the fast_muul branch.

License

This project is licensed under the Apache License 2.0, see the LICENSE file for details.

Credits

Thanks to the Tiny Tapeout project, IHP, and all the community working on open source silicon tools for making this possible.
