Skip to content

Conversation

@frankplow
Copy link
Collaborator

@frankplow frankplow commented Aug 27, 2023

This PR ports FFmpeg's HEVC IDCT optimisations.

To-do:

  • Port 4x4 transform
  • Port 8x8 transform
  • Port 16x16 transform
  • Port 32x32 transform
  • Add 2x2 transform
  • Add 64x64 transform
  • Add rectangular transforms
  • Add 1D transforms
  • Change residual type to int32_t
  • Refactor template code for 1-D functions to reduce duplication

Checkasm benchmark results:

inv_dct2_dct2_4x4_8_c: 226.2
inv_dct2_dct2_4x4_8_avx2: 26.2
inv_dct2_dct2_4x4_10_c: 188.2
inv_dct2_dct2_4x4_10_avx2: 24.7
inv_dct2_dct2_8x8_8_c: 704.7
inv_dct2_dct2_8x8_8_avx2: 124.7
inv_dct2_dct2_8x8_10_c: 751.2
inv_dct2_dct2_8x8_10_avx2: 124.7
inv_dct2_dct2_16x16_8_c: 4289.7
inv_dct2_dct2_16x16_8_avx2: 621.2
inv_dct2_dct2_16x16_10_c: 4335.2
inv_dct2_dct2_16x16_10_avx2: 625.2

perf.py results:

Bitstream Before After Delta
RitualDance_1920x1080_60_10_420_32_LD 99.7 99.3 -0.4%
RitualDance_1920x1080_60_10_420_37_RA 88.3 87.7 -0.5%
Tango2_3840x2160_60_10_420_27_LD 23.0 23.0 0.0%

The current perf.py performance is poor as the DCT's effect on overall decoding performance is dominated by the larger sizes which have not yet been implemented. The decrease in performance is explained by the additional overhead of optimising at the 2D level, the benefits of which are not being reaped here. As the larger sizes are implemented, performance will increase dramatically, in line with the checkasm benchmark result.

lavc/x86/vvc_itx_1d: Initial AVX2 optimisations for VVC IDCT2

lavc/x86/vvc_itx_1d.asm: Add size 64 AVX2 IDCT2 optimisation

lavc/x86/vvc_itx_1d: Don't use AVX2 optimisations when extended_precision_flag set

lavc/x86/vvc_itx_1d: Fix assembly with YASM

YASM does not supported unnamed contexts, so give all
contexts names

lavc/x86/vvc_itx_1d: Fix YASM build

Replace `x%[y]` with `x %+ y`

lavc/vvc: Rename vvc_itx_1d to vvc_itx

lavc/vvc: Make ITX modular at 2D level

Make the ITX DSP context inverse transform functions
perform 2D transforms rather than 1D transforms. This
will allow for more efficient assembly optimisations in
future.

lavc/vvc: Fix ITX non-zero width usage
tests/checkasm/vvc_itx: Fix dst stride
Previously the pixels were stored as `int`s. The maximum bit depth
supported by VVC is 16, hence an `int16_t` is guaranteed to be
sufficient. Smaller sizes allow for more pixels to be packed into an
SIMD register and therefore faster assembly optimisations. Further
optimisation could be achieved by using `int8_t`s if the bit depth of
the current video is only 8.

tests/checkasm/vvc_itx: Use `int16_t` dst arrays

tests/checkasm/vvc_itx: Fix dst stride
int bd_offset;

int *coeffs;
int16_t *pixels;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need pixels? can we just change coeffs to int16_t?

Copy link
Collaborator Author

@frankplow frankplow Aug 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If sps_extended_precision_flag = 1, then the transform coefficients are bit_depth + 6 bits each, therefore a transform coefficient is at most 22 bits.

Copy link
Member

@nuomi2021 nuomi2021 Aug 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok , the best way to handle this is using function pointer.
define coeffs to uint8_t *
#define two functions for idct, one for normal and one for extended precision.
I can help with this, hope we can do it in the following 1~2 weeks. thank you

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that if bit_depth <= 10, sps_extended_precision_flag must be 0. For now, I think it is easiest to treat transform coeffs as capable of fitting in a int16_t for bit_depth <= 10 and not for bit_depth > 10, as we already have the mechanism to use different functions depending on bit-depth. This is what the current code is already doing. In the long term, yes I agree it will be better to change based on sps_extended_precision_flag, but as there are few > 10-bit videos which do not use the range extension I don't think it needs to be a priority.

Re. coeffs/pixels, this is the same problem I was trying to get at in #117 (comment). I'm not quite sure why you suggest uint8_t * — is this just meant as an arbitrary type which will be reinterpreted according to when an int or int16_t is needed? If that's the case I think char may be more fitting.

Aside: I nearly have the 32x32 transform working, there is just a single conformance test failing due to an overflow from a very large transform coefficient.

Copy link
Member

@nuomi2021 nuomi2021 Aug 31, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re. coeffs/pixels, this is the same problem I was trying to get at in #117 (comment). I'm not quite sure why you suggest uint8_t * — is this just meant as an arbitrary type which will be reinterpreted according to when an int or int16_t is needed? If that's the case I think char may be more fitting.

sizeof(char) is not always 8, according to spec. it only can make sure sizeof(char) is >= 8. https://en.cppreference.com/w/cpp/language/types

we can define a new type Coeff like this

#if BIT_DEPTH > 10 && EXTEND_PRECISION
 typedef Coeff int32_t;
#define TR_FUNC(f) FUNC(f##_ep)
#else 
 typedef Coeff int16_t;
#define TR_FUNC(f) FUNC(f)
#endif

static void TR_FUNC(transform_bdpcm)(uint8_t *_coeffs, const int width, const int height,
    const int vertical, const int log2_transform_range)
{
          Coeff *coeff = (Coeff*)_coeffs
...
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The value is only clipped after being summed with the predicted sample
If it's only internal overflow(in the function), maybe you can use int32_t as an internal type.

At the moment, this is manifesting in an intermittent failure for the ITX checkasm of ff_vvc_inv_dct2_dct2_32x32_10_avx2
why overflow only happens at 32x32? will 16x16 .. 2x2 has the same issue? will 8 bits has the same issue?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than reinterpreting the uint8_t parameter, can we not just use Coeff in the signature? Would need some extra work in vvcdsp.h but probably beneficial in the long run.

Not possible. Coeff will defined to uint16_t and uint32_t in condition. we can't not use it as a signature.

thank you

Ah I see. dav1d uses their coeff/pixel types in their signatures, and I thought this was a template which was instantiated for each bit depth, but it looks as though it is only a build-time selection.

I am still not sure about the uint8_t * type though — if you were to do any manipulations on coeffs as uint8_ts it would cause corruption. Maybe a void * to make it more explicit that the array will need to be interpreted according to the bit depth and sps_extended_precision_flag?

Copy link
Collaborator Author

@frankplow frankplow Sep 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The value is only clipped after being summed with the predicted sample

If it's only internal overflow(in the function), maybe you can use int32_t as an internal type.

Unfortunately this summing does not take place in the transforms but instead later in add_residual. Integrating it into the transforms would be difficult as there are many different ways the residual samples are produced before this, e.g. using core transforms, LFNST, transform skip, etc.

At the moment, this is manifesting in an intermittent failure for the ITX checkasm of ff_vvc_inv_dct2_dct2_32x32_10_avx2

why overflow only happens at 32x32? will 16x16 .. 2x2 has the same issue? will 8 bits has the same issue?

Larger transforms produce larger residual samples simply because there are more terms in the dot products making up the matrix multiplication. The 8-bit 32x32 transform is not affected as its bd_shift is larger, enough to guarantee the samples remain below INT16_MAX. I have not yet implemented the 10-bit 64x64 transform, though I am fairly certain it will suffer from the same issue. I also think the non-extended-precision 12-bit+ 32x32 and 64x64 transforms are affected. There is also the possibility that the 8-bit 64x64 transform is affected and that smaller non-extended-precision 12-bit+ transforms are affected.

This is not really an issue in any sensible 10-bit video — why would the residual ever be many times larger than the full range of a pixel? — hence not being caught by any of the conformance tests. I think there is the possibility that this could cause issues with real-world 16-bit videos if we were to start supporting them however.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why would the residual ever be many times larger than the full range of a pixel

Coeff is from the frequency domain. Compared with pixel depth is not so reasonable :)

hence not being caught by any of the conformance tests

conformance tests are just a small subset of all possible cases. A decoder can work well for many years. Until someone posts some really weird and valid clips. You will find some basic code blocks that need to be changed. So we may need to carefully calculate all coeff depth, shift..

Copy link
Member

@nuomi2021 nuomi2021 Sep 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see. dav1d uses their coeff/pixel types in their signatures, and I thought this was a template which was instantiated for each bit depth, but it looks as though it is only a build-time selection.

the dav1d uses templates at a higher level. all bytefn(fn) are template functions.

The 8-bit 32x32 transform is not affected as its bd_shift is larger, enough to guarantee the samples remain below INT16_MAX

If you check this line https://github.com/ffvvc/FFmpeg/blob/main/libavcodec/vvc/vvc_intra.c#L470
It means the scale input depth is 5 + sps->log2_transform_range, the output depth is sps->bit_depth.
So the itx's output depth should be 5 + sps->log2_transform_range, not related to bit depth.
the range of sps->log_transform_range is [15, 20], so 5 + sps->log2_transform_range is [20, 25].
we can't hold it in int16_t. we need int32_t in any case.

VVC_ITX(TYPE, type, 16); \
VVC_ITX(TYPE, type, 32);
#define ITX \
ITX_COMMON_SIZES(DCT2, dct2, DCT2, dct2); \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe you can define

#define ITX_2D_WX(TYPE_H, type_h, TYPE_V, type_v, W) \
    ITX_2D_(TYPE_H, type_h, TYPE_V, type_v, W, 2); \
    ITX_2D_(TYPE_H, type_h, TYPE_V, type_v, W, 4); \
    ..

then you can do

#define ITX_COMMON_SIZES(DCT2, dct2, DCT2, dct2) \
    ITX_2D_WX(DCT2, dct2, DCT2, dct2, 2);
    ....

Copy link
Collaborator Author

@frankplow frankplow Sep 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like this:

diff --git a/libavcodec/vvc/vvcdsp_template.c b/libavcodec/vvc/vvcdsp_template.c
index 4cba360411..cf9fd2f69e 100644
--- a/libavcodec/vvc/vvcdsp_template.c
+++ b/libavcodec/vvc/vvcdsp_template.c
@@ -94,31 +94,19 @@ static void FUNC(transform_bdpcm)(int *coeffs, const int width, const int height
     }
 }
 
+#define ITX_COMMON_WIDTHS(TYPE_H, type_h, TYPE_V, type_v, height)               \
+    ITX_1D_V(TYPE_H, type_h, TYPE_V, type_v, 1, height)                         \
+    ITX_1D_V(TYPE_H, type_h, TYPE_V, type_v, 4, height)                         \
+    ITX_1D_V(TYPE_H, type_h, TYPE_V, type_v, 8, height)                         \
+    ITX_1D_V(TYPE_H, type_h, TYPE_V, type_v, 16, height)                        \
+    ITX_1D_V(TYPE_H, type_h, TYPE_V, type_v, 32, height)                        \
+
 #define ITX_COMMON_SIZES(TYPE_H, type_h, TYPE_V, type_v)                        \
-    ITX_1D_V(TYPE_H, type_h, TYPE_V, type_v, 1, 4);                             \
-    ITX_1D_V(TYPE_H, type_h, TYPE_V, type_v, 1, 8);                             \
-    ITX_1D_V(TYPE_H, type_h, TYPE_V, type_v, 1, 16);                            \
-    ITX_1D_V(TYPE_H, type_h, TYPE_V, type_v, 1, 32);                            \
-    ITX_1D_H(TYPE_H, type_h, TYPE_V, type_v, 4, 1);                             \
-    ITX_2D(TYPE_H, type_h, TYPE_V, type_v, 4, 4);                               \
-    ITX_2D(TYPE_H, type_h, TYPE_V, type_v, 4, 8);                               \
-    ITX_2D(TYPE_H, type_h, TYPE_V, type_v, 4, 16);                              \
-    ITX_2D(TYPE_H, type_h, TYPE_V, type_v, 4, 32);                              \
-    ITX_1D_H(TYPE_H, type_h, TYPE_V, type_v, 8, 1);                             \
-    ITX_2D(TYPE_H, type_h, TYPE_V, type_v, 8, 4);                               \
-    ITX_2D(TYPE_H, type_h, TYPE_V, type_v, 8, 8);                               \
-    ITX_2D(TYPE_H, type_h, TYPE_V, type_v, 8, 16);                              \
-    ITX_2D(TYPE_H, type_h, TYPE_V, type_v, 8, 32);                              \
-    ITX_1D_H(TYPE_H, type_h, TYPE_V, type_v, 16, 1);                            \
-    ITX_2D(TYPE_H, type_h, TYPE_V, type_v, 16, 4);                              \
-    ITX_2D(TYPE_H, type_h, TYPE_V, type_v, 16, 8);                              \
-    ITX_2D(TYPE_H, type_h, TYPE_V, type_v, 16, 16);                             \
-    ITX_2D(TYPE_H, type_h, TYPE_V, type_v, 16, 32);                             \
-    ITX_1D_H(TYPE_H, type_h, TYPE_V, type_v, 32, 1);                            \
-    ITX_2D(TYPE_H, type_h, TYPE_V, type_v, 32, 4);                              \
-    ITX_2D(TYPE_H, type_h, TYPE_V, type_v, 32, 8);                              \
-    ITX_2D(TYPE_H, type_h, TYPE_V, type_v, 32, 16);                             \
-    ITX_2D(TYPE_H, type_h, TYPE_V, type_v, 32, 32);
+    ITX_COMMON_WIDTHS(TYPE_h, type_h, TYPE_V, type_v, 1)                        \
+    ITX_COMMON_WIDTHS(TYPE_h, type_h, TYPE_V, type_v, 4)                        \
+    ITX_COMMON_WIDTHS(TYPE_h, type_h, TYPE_V, type_v, 8)                        \
+    ITX_COMMON_WIDTHS(TYPE_h, type_h, TYPE_V, type_v, 16)                       \
+    ITX_COMMON_WIDTHS(TYPE_h, type_h, TYPE_V, type_v, 32)
 
 #define ITX \
     ITX_COMMON_SIZES(DCT2, dct2, DCT2, dct2);                                   \

do you mean?

This is slightly less readable imo and it only saves 12 lines, but I am happy to add it if you think it's better.

Sidenote: the patch above actually fails to compile at the moment as I have not implemented the size-1 1-D transforms (i.e. for transform size 1x1). These an edge case and are not used in any of the conformance bitstreams, but I think they may actually be valid per the spec and should be implemented.
EDIT: 1x1 transforms implemented in 368e985.

The VVC specification does not allow every combination of transform size
and type. For example, there is no size-64 DST-7. By combining the
TxSize and TxType enums, these invalid transforms can no longer be
represented.
@nuomi2021
Copy link
Member

hi @frankplow ,
seems the int32_t is only needed by range extension. If range extension is not enabled, we can keep the transform coeffs as int16_t.
I will try to make some changes to this. Hope this will reduce the porting efforts.

@frankplow
Copy link
Collaborator Author

hi @frankplow , seems the int32_t is only needed by range extension. If range extension is not enabled, we can keep the transform coeffs as int16_t. I will try to make some changes to this. Hope this will reduce the porting efforts.

Yeah I think if we take this approach, it shouldn't be too hard to get transforms implemented for the square sizes. Unfortunately, I think it will be hard to extend the HEVC optimisations to rectangular sizes and MTS as the way it's written doesn't facilitate much code reuse/modularity. I have a branch where I've worked on a more modular optimisation, based on some of the custom ABI ideas dav1d uses, but I'm having to write this from the ground up and don't have much time alongside my Master's at the moment. I think then, the best way to get optimisations in for these most common square sizes is to, as you say, allow varying the coeff type based on whether the range extension is active and then port the HEVC transforms.

@nuomi2021
Copy link
Member

but I'm having to write this from the ground up and don't have much time alongside my Master's at the moment

No worries. I will continue your work after I have done the thread optimizations.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants