FFmpeg HEVC IDCT port #130

frankplow · 2023-08-27T14:19:35Z

This PR ports FFmpeg's HEVC IDCT optimisations.

To-do:

Checkasm benchmark results:

inv_dct2_dct2_4x4_8_c: 226.2
inv_dct2_dct2_4x4_8_avx2: 26.2
inv_dct2_dct2_4x4_10_c: 188.2
inv_dct2_dct2_4x4_10_avx2: 24.7
inv_dct2_dct2_8x8_8_c: 704.7
inv_dct2_dct2_8x8_8_avx2: 124.7
inv_dct2_dct2_8x8_10_c: 751.2
inv_dct2_dct2_8x8_10_avx2: 124.7
inv_dct2_dct2_16x16_8_c: 4289.7
inv_dct2_dct2_16x16_8_avx2: 621.2
inv_dct2_dct2_16x16_10_c: 4335.2
inv_dct2_dct2_16x16_10_avx2: 625.2

perf.py results:

Bitstream	Before	After	Delta
RitualDance_1920x1080_60_10_420_32_LD	99.7	99.3	-0.4%
RitualDance_1920x1080_60_10_420_37_RA	88.3	87.7	-0.5%
Tango2_3840x2160_60_10_420_27_LD	23.0	23.0	0.0%

The current perf.py performance is poor as the DCT's effect on overall decoding performance is dominated by the larger sizes which have not yet been implemented. The decrease in performance is explained by the additional overhead of optimising at the 2D level, the benefits of which are not being reaped here. As the larger sizes are implemented, performance will increase dramatically, in line with the checkasm benchmark result.

lavc/x86/vvc_itx_1d: Initial AVX2 optimisations for VVC IDCT2 lavc/x86/vvc_itx_1d.asm: Add size 64 AVX2 IDCT2 optimisation lavc/x86/vvc_itx_1d: Don't use AVX2 optimisations when extended_precision_flag set lavc/x86/vvc_itx_1d: Fix assembly with YASM YASM does not supported unnamed contexts, so give all contexts names lavc/x86/vvc_itx_1d: Fix YASM build Replace `x%[y]` with `x %+ y` lavc/vvc: Rename vvc_itx_1d to vvc_itx lavc/vvc: Make ITX modular at 2D level Make the ITX DSP context inverse transform functions perform 2D transforms rather than 1D transforms. This will allow for more efficient assembly optimisations in future. lavc/vvc: Fix ITX non-zero width usage

tests/checkasm/vvc_itx: Fix dst stride

Previously the pixels were stored as `int`s. The maximum bit depth supported by VVC is 16, hence an `int16_t` is guaranteed to be sufficient. Smaller sizes allow for more pixels to be packed into an SIMD register and therefore faster assembly optimisations. Further optimisation could be achieved by using `int8_t`s if the bit depth of the current video is only 8. tests/checkasm/vvc_itx: Use `int16_t` dst arrays tests/checkasm/vvc_itx: Fix dst stride

nuomi2021 · 2023-08-29T15:08:49Z

libavcodec/vvc/vvc_ctu.h

    int bd_offset;

    int *coeffs;
+    int16_t *pixels;


why do we need pixels? can we just change coeffs to int16_t?

If sps_extended_precision_flag = 1, then the transform coefficients are bit_depth + 6 bits each, therefore a transform coefficient is at most 22 bits.

ok , the best way to handle this is using function pointer.
define coeffs to uint8_t *
#define two functions for idct, one for normal and one for extended precision.
I can help with this, hope we can do it in the following 1~2 weeks. thank you

Note that if bit_depth <= 10, sps_extended_precision_flag must be 0. For now, I think it is easiest to treat transform coeffs as capable of fitting in a int16_t for bit_depth <= 10 and not for bit_depth > 10, as we already have the mechanism to use different functions depending on bit-depth. This is what the current code is already doing. In the long term, yes I agree it will be better to change based on sps_extended_precision_flag, but as there are few > 10-bit videos which do not use the range extension I don't think it needs to be a priority.

Re. coeffs/pixels, this is the same problem I was trying to get at in #117 (comment). I'm not quite sure why you suggest uint8_t * — is this just meant as an arbitrary type which will be reinterpreted according to when an int or int16_t is needed? If that's the case I think char may be more fitting.

Aside: I nearly have the 32x32 transform working, there is just a single conformance test failing due to an overflow from a very large transform coefficient.

Re. coeffs/pixels, this is the same problem I was trying to get at in #117 (comment). I'm not quite sure why you suggest uint8_t * — is this just meant as an arbitrary type which will be reinterpreted according to when an int or int16_t is needed? If that's the case I think char may be more fitting.

sizeof(char) is not always 8, according to spec. it only can make sure sizeof(char) is >= 8. https://en.cppreference.com/w/cpp/language/types

we can define a new type Coeff like this

#if BIT_DEPTH > 10 && EXTEND_PRECISION typedef Coeff int32_t; #define TR_FUNC(f) FUNC(f##_ep) #else typedef Coeff int16_t; #define TR_FUNC(f) FUNC(f) #endif static void TR_FUNC(transform_bdpcm)(uint8_t *_coeffs, const int width, const int height, const int vertical, const int log2_transform_range) { Coeff *coeff = (Coeff*)_coeffs ... }

The value is only clipped after being summed with the predicted sample
If it's only internal overflow(in the function), maybe you can use int32_t as an internal type.

At the moment, this is manifesting in an intermittent failure for the ITX checkasm of ff_vvc_inv_dct2_dct2_32x32_10_avx2
why overflow only happens at 32x32? will 16x16 .. 2x2 has the same issue? will 8 bits has the same issue?

Rather than reinterpreting the uint8_t parameter, can we not just use Coeff in the signature? Would need some extra work in vvcdsp.h but probably beneficial in the long run.

Not possible. Coeff will defined to uint16_t and uint32_t in condition. we can't not use it as a signature.

thank you

Ah I see. dav1d uses their coeff/pixel types in their signatures, and I thought this was a template which was instantiated for each bit depth, but it looks as though it is only a build-time selection.

I am still not sure about the uint8_t * type though — if you were to do any manipulations on coeffs as uint8_ts it would cause corruption. Maybe a void * to make it more explicit that the array will need to be interpreted according to the bit depth and sps_extended_precision_flag?

The value is only clipped after being summed with the predicted sample

If it's only internal overflow(in the function), maybe you can use int32_t as an internal type.

Unfortunately this summing does not take place in the transforms but instead later in add_residual. Integrating it into the transforms would be difficult as there are many different ways the residual samples are produced before this, e.g. using core transforms, LFNST, transform skip, etc.

At the moment, this is manifesting in an intermittent failure for the ITX checkasm of ff_vvc_inv_dct2_dct2_32x32_10_avx2

why overflow only happens at 32x32? will 16x16 .. 2x2 has the same issue? will 8 bits has the same issue?

Larger transforms produce larger residual samples simply because there are more terms in the dot products making up the matrix multiplication. The 8-bit 32x32 transform is not affected as its bd_shift is larger, enough to guarantee the samples remain below INT16_MAX. I have not yet implemented the 10-bit 64x64 transform, though I am fairly certain it will suffer from the same issue. I also think the non-extended-precision 12-bit+ 32x32 and 64x64 transforms are affected. There is also the possibility that the 8-bit 64x64 transform is affected and that smaller non-extended-precision 12-bit+ transforms are affected.

This is not really an issue in any sensible 10-bit video — why would the residual ever be many times larger than the full range of a pixel? — hence not being caught by any of the conformance tests. I think there is the possibility that this could cause issues with real-world 16-bit videos if we were to start supporting them however.

why would the residual ever be many times larger than the full range of a pixel

Coeff is from the frequency domain. Compared with pixel depth is not so reasonable :)

hence not being caught by any of the conformance tests

conformance tests are just a small subset of all possible cases. A decoder can work well for many years. Until someone posts some really weird and valid clips. You will find some basic code blocks that need to be changed. So we may need to carefully calculate all coeff depth, shift..

Ah I see. dav1d uses their coeff/pixel types in their signatures, and I thought this was a template which was instantiated for each bit depth, but it looks as though it is only a build-time selection.

the dav1d uses templates at a higher level. all bytefn(fn) are template functions.

The 8-bit 32x32 transform is not affected as its bd_shift is larger, enough to guarantee the samples remain below INT16_MAX

If you check this line https://github.com/ffvvc/FFmpeg/blob/main/libavcodec/vvc/vvc_intra.c#L470
It means the scale input depth is 5 + sps->log2_transform_range, the output depth is sps->bit_depth.
So the itx's output depth should be 5 + sps->log2_transform_range, not related to bit depth.
the range of sps->log_transform_range is [15, 20], so 5 + sps->log2_transform_range is [20, 25].
we can't hold it in int16_t. we need int32_t in any case.

nuomi2021 · 2023-09-08T06:20:33Z

libavcodec/vvc/vvcdsp_template.c

-        VVC_ITX(TYPE, type, 16);                                                \
-        VVC_ITX(TYPE, type, 32);
+#define ITX \
+    ITX_COMMON_SIZES(DCT2, dct2, DCT2, dct2);                                   \


maybe you can define

#define ITX_2D_WX(TYPE_H, type_h, TYPE_V, type_v, W) \ ITX_2D_(TYPE_H, type_h, TYPE_V, type_v, W, 2); \ ITX_2D_(TYPE_H, type_h, TYPE_V, type_v, W, 4）; \ ..

then you can do

#define ITX_COMMON_SIZES(DCT2, dct2, DCT2, dct2) \ ITX_2D_WX(DCT2, dct2, DCT2, dct2, 2); ....

Like this:

diff --git a/libavcodec/vvc/vvcdsp_template.c b/libavcodec/vvc/vvcdsp_template.c index 4cba360411..cf9fd2f69e 100644 --- a/libavcodec/vvc/vvcdsp_template.c +++ b/libavcodec/vvc/vvcdsp_template.c @@ -94,31 +94,19 @@ static void FUNC(transform_bdpcm)(int *coeffs, const int width, const int height } } +#define ITX_COMMON_WIDTHS(TYPE_H, type_h, TYPE_V, type_v, height) \ + ITX_1D_V(TYPE_H, type_h, TYPE_V, type_v, 1, height) \ + ITX_1D_V(TYPE_H, type_h, TYPE_V, type_v, 4, height) \ + ITX_1D_V(TYPE_H, type_h, TYPE_V, type_v, 8, height) \ + ITX_1D_V(TYPE_H, type_h, TYPE_V, type_v, 16, height) \ + ITX_1D_V(TYPE_H, type_h, TYPE_V, type_v, 32, height) \ + #define ITX_COMMON_SIZES(TYPE_H, type_h, TYPE_V, type_v) \ - ITX_1D_V(TYPE_H, type_h, TYPE_V, type_v, 1, 4); \ - ITX_1D_V(TYPE_H, type_h, TYPE_V, type_v, 1, 8); \ - ITX_1D_V(TYPE_H, type_h, TYPE_V, type_v, 1, 16); \ - ITX_1D_V(TYPE_H, type_h, TYPE_V, type_v, 1, 32); \ - ITX_1D_H(TYPE_H, type_h, TYPE_V, type_v, 4, 1); \ - ITX_2D(TYPE_H, type_h, TYPE_V, type_v, 4, 4); \ - ITX_2D(TYPE_H, type_h, TYPE_V, type_v, 4, 8); \ - ITX_2D(TYPE_H, type_h, TYPE_V, type_v, 4, 16); \ - ITX_2D(TYPE_H, type_h, TYPE_V, type_v, 4, 32); \ - ITX_1D_H(TYPE_H, type_h, TYPE_V, type_v, 8, 1); \ - ITX_2D(TYPE_H, type_h, TYPE_V, type_v, 8, 4); \ - ITX_2D(TYPE_H, type_h, TYPE_V, type_v, 8, 8); \ - ITX_2D(TYPE_H, type_h, TYPE_V, type_v, 8, 16); \ - ITX_2D(TYPE_H, type_h, TYPE_V, type_v, 8, 32); \ - ITX_1D_H(TYPE_H, type_h, TYPE_V, type_v, 16, 1); \ - ITX_2D(TYPE_H, type_h, TYPE_V, type_v, 16, 4); \ - ITX_2D(TYPE_H, type_h, TYPE_V, type_v, 16, 8); \ - ITX_2D(TYPE_H, type_h, TYPE_V, type_v, 16, 16); \ - ITX_2D(TYPE_H, type_h, TYPE_V, type_v, 16, 32); \ - ITX_1D_H(TYPE_H, type_h, TYPE_V, type_v, 32, 1); \ - ITX_2D(TYPE_H, type_h, TYPE_V, type_v, 32, 4); \ - ITX_2D(TYPE_H, type_h, TYPE_V, type_v, 32, 8); \ - ITX_2D(TYPE_H, type_h, TYPE_V, type_v, 32, 16); \ - ITX_2D(TYPE_H, type_h, TYPE_V, type_v, 32, 32); + ITX_COMMON_WIDTHS(TYPE_h, type_h, TYPE_V, type_v, 1) \ + ITX_COMMON_WIDTHS(TYPE_h, type_h, TYPE_V, type_v, 4) \ + ITX_COMMON_WIDTHS(TYPE_h, type_h, TYPE_V, type_v, 8) \ + ITX_COMMON_WIDTHS(TYPE_h, type_h, TYPE_V, type_v, 16) \ + ITX_COMMON_WIDTHS(TYPE_h, type_h, TYPE_V, type_v, 32) #define ITX \ ITX_COMMON_SIZES(DCT2, dct2, DCT2, dct2); \

do you mean?

This is slightly less readable imo and it only saves 12 lines, but I am happy to add it if you think it's better.

Sidenote: the patch above actually fails to compile at the moment as I have not implemented the size-1 1-D transforms (i.e. for transform size 1x1). These an edge case and are not used in any of the conformance bitstreams, but I think they may actually be valid per the spec and should be implemented.
EDIT: 1x1 transforms implemented in 368e985.

libavcodec/vvc/vvcdsp.h

The VVC specification does not allow every combination of transform size and type. For example, there is no size-64 DST-7. By combining the TxSize and TxType enums, these invalid transforms can no longer be represented.

nuomi2021 · 2023-12-05T09:00:30Z

hi @frankplow ,
seems the int32_t is only needed by range extension. If range extension is not enabled, we can keep the transform coeffs as int16_t.
I will try to make some changes to this. Hope this will reduce the porting efforts.

frankplow · 2023-12-11T13:29:16Z

hi @frankplow , seems the int32_t is only needed by range extension. If range extension is not enabled, we can keep the transform coeffs as int16_t. I will try to make some changes to this. Hope this will reduce the porting efforts.

Yeah I think if we take this approach, it shouldn't be too hard to get transforms implemented for the square sizes. Unfortunately, I think it will be hard to extend the HEVC optimisations to rectangular sizes and MTS as the way it's written doesn't facilitate much code reuse/modularity. I have a branch where I've worked on a more modular optimisation, based on some of the custom ABI ideas dav1d uses, but I'm having to write this from the ground up and don't have much time alongside my Master's at the moment. I think then, the best way to get optimisations in for these most common square sizes is to, as you say, allow varying the coeff type based on whether the range extension is active and then port the HEVC transforms.

nuomi2021 · 2023-12-16T08:45:26Z

but I'm having to write this from the ground up and don't have much time alongside my Master's at the moment

No worries. I will continue your work after I have done the thread optimizations.

frankplow added 9 commits August 27, 2023 15:01

tests/checkasm: Add VVC ITX test

3b00538

tests/checkasm/vvc_itx: Fix dst stride

Remove extended_precision_flag parameter from dsp_init

82686f7

tests/checkasm/vvc_itx: Fix randomisation

0b59c90

lavc/x86/vvc_itx: Import FFmpeg HEVC asm

0090622

lavc/x86/vvc_itx: Port 4x4 IDCT from HEVC to VVC

2ed407b

tests/checkasm/vvc_itx: Fix benchmark

8c53221

lavc/x86/vvc_itx: Port 8x8 and 16x16 transforms

5de8025

This was referenced Aug 27, 2023

1D inverse type-II DCT AVX2 optimisations #114

Open

WIP: dav1d ITX AVX2 Assembly Port #117

Closed

nuomi2021 reviewed Aug 29, 2023

View reviewed changes

lavc/x86/vvc_itx: Get 32x32 DCT-II working

88f8a7d

frankplow mentioned this pull request Sep 7, 2023

transform: change coeffs type from int to int32_t #140

Closed

nuomi2021 reviewed Sep 8, 2023

View reviewed changes

libavcodec/vvc/vvcdsp.h Outdated Show resolved Hide resolved

frankplow added 4 commits September 10, 2023 17:27

Merge branch 'main' into ffmpeg-hevc-idct

cc47ceb

Move vvc_itx.asm to libavcodec/x86/vvc

1a7f7ca

lavc/vvc: Implement 1x1 transforms

368e985

lavc/vvc: Make invalid ITXs unrepresentable

140554f

The VVC specification does not allow every combination of transform size and type. For example, there is no size-64 DST-7. By combining the TxSize and TxType enums, these invalid transforms can no longer be represented.

FFmpeg HEVC IDCT port #130

Are you sure you want to change the base?

FFmpeg HEVC IDCT port #130

Uh oh!

Conversation

frankplow commented Aug 27, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

nuomi2021 Aug 29, 2023

Choose a reason for hiding this comment

Uh oh!

frankplow Aug 29, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

nuomi2021 Aug 30, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

frankplow Aug 30, 2023

Choose a reason for hiding this comment

Uh oh!

nuomi2021 Aug 31, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

nuomi2021 Sep 1, 2023

Choose a reason for hiding this comment

Uh oh!

frankplow Sep 1, 2023

Choose a reason for hiding this comment

Uh oh!

frankplow Sep 1, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

nuomi2021 Sep 1, 2023

Choose a reason for hiding this comment

Uh oh!

nuomi2021 Sep 2, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

nuomi2021 Sep 8, 2023

Choose a reason for hiding this comment

Uh oh!

frankplow Sep 10, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

nuomi2021 commented Dec 5, 2023

Uh oh!

frankplow commented Dec 11, 2023

Uh oh!

nuomi2021 commented Dec 16, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

frankplow commented Aug 27, 2023 •

edited

Loading

frankplow Aug 29, 2023 •

edited

Loading

nuomi2021 Aug 30, 2023 •

edited

Loading

nuomi2021 Aug 31, 2023 •

edited

Loading

frankplow Sep 1, 2023 •

edited

Loading

nuomi2021 Sep 2, 2023 •

edited

Loading

frankplow Sep 10, 2023 •

edited

Loading