Analysis of the SVE2 Implementation for the Opus Codec Library

Introduction

Previously, we successfully brought SVE2 into the Opus codec library by relying on the compiler's auto-vectorization. In this post, we analyze the result to check that the SVE2 code was generated correctly and to estimate its possible impact on the software's performance.

SVE2 Code Analysis

As we explored in the previous post, the compiler auto-vectorized many parts of the package. Let's take a look at one of them to see where SVE2 code is used.

The Opus codec uses CELT as one of its ways to encode and decode audio. In opus/celt, we can see the following list of files.

$ ls
arch.h           celt.o         entenc.o          mdct.c              quant_bands.lo
arm              cpu_support.h  fixed_c5x.h       mdct.h              quant_bands.o
bands.c          cwrs.c         fixed_c6x.h       mdct.lo             rate.c
bands.h          cwrs.h         fixed_debug.h     mdct.o              rate.h
bands.lo         cwrs.lo        fixed_generic.h   meson.build         rate.lo
bands.o          cwrs.o         float_cast.h      mfrngcod.h          rate.o
celt.c           dump_modes     kiss_fft.c        mips                stack_alloc.h
celt_decoder.c   ecintrin.h     _kiss_fft_guts.h  modes.c             static_modes_fixed_arm_ne10.h
celt_decoder.lo  entcode.c      kiss_fft.h        modes.h             static_modes_fixed.h
celt_decoder.o   entcode.h      kiss_fft.lo       modes.lo            static_modes_float_arm_ne10.h
celt_encoder.c   entcode.lo     kiss_fft.o        modes.o             static_modes_float.h
celt_encoder.lo  entcode.o      laplace.c         opus_custom_demo.c  tests
celt_encoder.o   entdec.c       laplace.h         os_support.h        vq.c
celt.h           entdec.h       laplace.lo        pitch.c             vq.h
celt.lo          entdec.lo      laplace.o         pitch.h             vq.lo
celt_lpc.c       entdec.o       mathops.c         pitch.lo            vq.o
celt_lpc.h       entenc.c       mathops.h         pitch.o             x86
celt_lpc.lo      entenc.h       mathops.lo        quant_bands.c
celt_lpc.o       entenc.lo      mathops.o         quant_bands.h

The celt_encoder.c file contains many for loops that may benefit from SVE2. The following code is one example:

// celt_encoder.c
// ...

1100       /* For non-transient CBR/CVBR frames, halve the dynalloc contribution */
1101       if ((!vbr || constrained_vbr)&&!isTransient)
1102       {
1103          for (i=start;i<end;i++)
1104             follower[i] = HALF16(follower[i]);
1105       }
1106       for (i=start;i<end;i++)
1107       {
1108          if (i<8)
1109             follower[i] *= 2;
1110          if (i>=12)
1111             follower[i] = HALF16(follower[i]);

// ...

In this code, we can see a loop that iterates from start to end. Depending on the value of i, the ith element of the follower array is doubled (i < 8), halved (i >= 12), or left unchanged. The loop involves no complex logic and processes its data in a uniform manner, so it is a good candidate for auto-vectorization by the compiler.
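
To experiment with this loop outside the Opus tree, it can be reduced to a standalone function. The following is a minimal sketch, not the library's code: the function name scale_follower is hypothetical, and HALF16 is written out as a multiplication by 0.5f, which assumes the floating-point build.

```c
/* Assumption: stand-in for the floating-point build's HALF16 macro. */
#define HALF16(x) (0.5f * (x))

/* Hypothetical standalone version of the loop at lines 1106-1111. */
void scale_follower(float *follower, int start, int end)
{
    for (int i = start; i < end; i++) {
        if (i < 8)
            follower[i] *= 2;                  /* low bands are doubled */
        if (i >= 12)
            follower[i] = HALF16(follower[i]); /* high bands are halved */
        /* bands 8..11 are left unchanged */
    }
}
```

Every iteration touches only follower[i] and the two conditions depend only on i, so no iteration depends on another. That independence is what makes a loop like this a natural fit for vectorization.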

As expected, disassembling celt_encoder.o shows multiple SVE-specific whilelo instructions.

$ objdump -d celt_encoder.o | grep whilelo
     174:       25a30fe0        whilelo p0.s, wzr, w3
     198:       25a30c00        whilelo p0.s, w0, w3
     1e8:       25b40fe0        whilelo p0.s, wzr, w20
     200:       25b40c00        whilelo p0.s, w0, w20
     418:       25bc0fe0        whilelo p0.s, wzr, w28
     430:       25bc0c00        whilelo p0.s, w0, w28
     498:       25bc0fe0        whilelo p0.s, wzr, w28
     4b0:       25bc0c20        whilelo p0.s, w1, w28
   # ...    
    57ac:       25a10c00        whilelo p0.s, w0, w1
    5844:       25a10fe0        whilelo p0.s, wzr, w1
    585c:       25a10c00        whilelo p0.s, w0, w1
    5ae0:       25a10fe0        whilelo p0.s, wzr, w1
    5b00:       25a10c00        whilelo p0.s, w0, w1
    5ea8:       25a10fe0        whilelo p0.s, wzr, w1
    5ebc:       25a10c00        whilelo p0.s, w0, w1
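
To make it clearer what these whilelo instructions do, here is a scalar emulation in plain C. It is only a sketch: LANES is a hypothetical fixed lane count (real SVE hardware chooses the vector length itself), and the function names are made up for illustration. whilelo builds a predicate in which lane k is active while i + k is (unsigned) lower than n, so the last partial vector is handled without a separate scalar tail loop.

```c
#include <stdint.h>

#define LANES 4 /* hypothetical: a 128-bit vector of 32-bit elements */

/* Emulates `whilelo p.s, i, n`: lane k is active while (i + k) < n. */
void whilelo_mask(uint32_t i, uint32_t n, int active[LANES])
{
    for (uint32_t k = 0; k < LANES; k++)
        active[k] = ((uint64_t)i + k) < n;
}

/* A predicated loop in the style the compiler generates:
   one "vector" per iteration, inactive lanes are skipped. */
void halve_all(float *data, uint32_t n)
{
    int active[LANES];
    for (uint32_t i = 0; i < n; i += LANES) {
        whilelo_mask(i, n, active);
        for (uint32_t k = 0; k < LANES; k++)
            if (active[k])
                data[i + k] *= 0.5f;
    }
}
```

In the listing above, the instructions with wzr as the first operand correspond to the initial predicate (i = 0), and the ones with a general register correspond to the predicate being recomputed after the index advances.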

However, this only shows that celt_encoder.o contains SVE2 instructions somewhere. How can we know whether the specific code we are interested in uses SVE2?

Let's look at this from a different angle: how does the compiler decide whether a piece of code is suitable for auto-vectorization? To find out, we can pass an additional option in CFLAGS when running the configure script.

$ ./configure CFLAGS="-g -O3 -fopt-info-vec-all -march=armv8-a+sve2 -fvisibility=hidden -D_FORTIFY_SOURCE=2 -W -Wall -Wextra -Wcast-align -Wnested-externs -Wshadow -Wstrict-prototypes" 

$ make -j24 |& tee make.log

The -fopt-info option adds extra diagnostics to the compiler output. With -fopt-info-vec-all, we specifically ask for all information related to vectorization. When we build the package again using make, the log tells us why the compiler did (or did not) vectorize each loop.
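
Before rebuilding the whole package, the option can be tried on a small file. The sketch below is not part of Opus; the file name example.c and both function names are hypothetical. The first loop has a trip count known on entry and contiguous accesses. The second loop stops on a value found in the data, so the compiler is expected to report that the number of iterations cannot be computed (the exact wording may vary between GCC versions).

```c
/* Hypothetical example.c. Build (use a cross compiler on non-AArch64 hosts):
     gcc -O3 -march=armv8-a+sve2 -fopt-info-vec-all -c example.c */
#include <stddef.h>

/* Countable loop with contiguous accesses: a typical vectorization candidate. */
void scale(float *dst, const float *src, float gain, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = gain * src[i];
}

/* The trip count depends on the data, so it is unknown when the loop starts. */
size_t length_until_zero(const float *src)
{
    size_t i = 0;
    while (src[i] != 0.0f)
        i++;
    return i;
}
```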

Once we run the make command as above, we have the following make.log file, which contains all the information we want.

$ ll make.log
-rw-r--r--. 1 swji1 swji1 2831714 Apr 22 14:21 make.log

Let's narrow the result down to the messages for celt/celt_encoder.c:

$ grep "celt/celt_encoder" make.log
celt/celt_encoder.c:1810:22: optimized: loop vectorized using variable length vectors
celt/celt_encoder.c:1780:16: optimized: loop vectorized using variable length vectors
celt/celt_encoder.c:1780:16: optimized: loop vectorized using variable length vectors
celt/celt_encoder.c:1778:40: missed: couldn't vectorize loop
celt/celt_encoder.c:1778:40: missed: not vectorized: number of iterations cannot be computed.
celt/celt_encoder.c:1756:17: missed: couldn't vectorize loop
celt/celt_encoder.c:1761:20: missed: not vectorized: complicated access pattern.

The output above shows which lines were vectorized and which were not. Let's check whether the loop at line 1106 that we examined earlier is vectorized as well.

$ grep "celt/celt_encoder.c:1106" make.log
celt/celt_encoder.c:1106:21: celt/pitch.h:143:14: optimized: loop vectorized using variable length vectors
celt/celt_encoder.c:1106:21: optimized: loop vectorized using variable length vectors

As we expected, the loop is vectorized by the compiler.

Now, we may wonder which code the compiler could not auto-vectorize, and why. Let's take a look at one example.

celt/celt_encoder.c:1922:39: missed: not vectorized: complicated access pattern.

// celt_encoder.c
// ...
 do {
1915       for (i=start;i<end;i++)
1916       {
1917          /* When the energy is stable, slightly bias energy quantization towards
1918             the previous error to make the gain more stable (a constant offset is
1919             better than fluctuations). */
1920          if (ABS32(SUB32(bandLogE[i+c*nbEBands], oldBandE[i+c*nbEBands])) < QCONST16(2.f, DB_SHIFT))
1921          {
1922             bandLogE[i+c*nbEBands] -= MULT16_16_Q15(energyError[i+c*nbEBands], QCONST16(0.25f, 15));
1923          }
1924       }
1925    } while (++c < C);
// ...

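Inside the loop, bandLogE is updated only when the if condition holds, and every array access is indexed by i+c*nbEBands, where c changes in the enclosing do-while loop. GCC reports "complicated access pattern" when it cannot model a loop's memory accesses as a simple contiguous or strided pattern, which is most likely what happened here. As a result, the compiler leaves this loop in scalar form.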

Performance Prediction

Unfortunately, we cannot benchmark the package at the moment because we lack hardware that supports SVE2. However, we do know that SVE2 can improve performance by optimizing loops that process large datasets such as audio and video. For this reason, we will use the number of SVE2 instructions as a rough proxy for the potential performance gain, keeping in mind that a static instruction count says nothing about how often each loop actually runs.

Before we begin, we also need to consider that the opus package contains multiple unit tests, which can inflate the total. We therefore exclude them. Let's count the total number of optimizations reported by the compiler.

$ grep -v "test" make.log | grep "optimized" -c
632

The compiler reported 632 vectorization events. Note that some loops are reported more than once, as the celt_encoder.c output above shows, so the number of distinct loops is lower. Next, let's see how many SVE-specific whilelo instructions and registers (i.e. predicate registers and scalable vector registers) appear in the opus codec shared library, libopus.

$ objdump -d libopus.so.0.8.0 | grep whilelo -c
671
$ objdump -d libopus.so.0.8.0 | grep whilelo
    2ef0:       25a40fe1        whilelo p1.s, wzr, w4
    2f28:       25a40c60        whilelo p0.s, w3, w4
    2f7c:       25a40fe1        whilelo p1.s, wzr, w4
    2fa0:       25a40c60        whilelo p0.s, w3, w4
    3314:       25b80fe0        whilelo p0.s, wzr, w24
    3344:       25b80c20        whilelo p0.s, w1, w24
# ...
   47b38:       25a50fe0        whilelo p0.s, wzr, w5
   47b3c:       25a80c23        whilelo p3.s, w1, w8
   47b4c:       25aa0c24        whilelo p4.s, w1, w10
   47b54:       25250c26        whilelo p6.b, w1, w5
   47b5c:       25a60c22        whilelo p2.s, w1, w6
   47b68:       25a50c25        whilelo p5.s, w1, w5
   47b98:       25a50c20        whilelo p0.s, w1, w5

$ objdump -d libopus.so.0.8.0 | egrep "[^[:alpha:]]z[[:digit:]]|[^[:alpha:]]p[[:digit:]]" -c
5274
$ objdump -d libopus.so.0.8.0 | egrep "[^[:alpha:]]z[[:digit:]]|[^[:alpha:]]p[[:digit:]]"
    2ef0:       25a40fe1        whilelo p1.s, wzr, w4
    2ef8:       04a34801        index   z1.s, #0, w3
    2f0c:       25814420        mov     p0.b, p1.b
    2f18:       856140a0        ld1w    {z0.s}, p0/z, [x5, z1.s, sxtw #2]
    2f1c:       e54340c0        st1w    {z0.s}, p0, [x6, x3, lsl #2]
    2f28:       25a40c60        whilelo p0.s, w3, w4
# ...
   47f48:       6594a000        scvtf   z0.s, p0/m, z0.s
   47f4c:       25886100        mov     p0.b, p8.b
   47f50:       e544e4a2        st1w    {z2.s}, p1, [x5, #4, mul vl]
   47f54:       e546e0a1        st1w    {z1.s}, p0, [x5, #6, mul vl]
   47f58:       25896520        mov     p0.b, p9.b
   47f5c:       e547e0a0        st1w    {z0.s}, p0, [x5, #7, mul vl]
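
The register pattern used above can also be checked against individual lines. The sketch below applies the same extended regular expression through the POSIX regex.h interface; the function name mentions_sve_register is hypothetical. It also shows a limitation of the pattern: a symbol name that happens to contain something like _p2 would be counted as well, so the total should be read as an upper estimate.

```c
#include <regex.h>
#include <stddef.h>

/* Returns 1 if the line matches the zN / pN register pattern,
   0 if it does not, and -1 if the pattern fails to compile. */
int mentions_sve_register(const char *line)
{
    const char *pattern =
        "[^[:alpha:]]z[[:digit:]]|[^[:alpha:]]p[[:digit:]]";
    regex_t re;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    int found = (regexec(&re, line, 0, NULL, 0) == 0);
    regfree(&re);
    return found;
}
```

For example, a line containing "index   z1.s, #0, w3" matches, while a plain "mov x0, x1" does not.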

As we can see, the auto-vectorization produced a substantial amount of SVE2-specific code. Therefore, we can expect that the opus library may benefit from it in overall performance.

Things that Can Further Improve the Performance

We already know that the compiler auto-vectorizes a large portion of the code. However, this method has limits. As we found earlier, the compiler cannot auto-vectorize some loops. This does not mean they cannot be vectorized at all: in some cases, SVE2 could be applied if the loop were written differently. For example, as this article suggested, we may use restrict qualifiers to inform the compiler that the arrays do not overlap.
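
As an illustration, the energy-bias loop at lines 1915-1924 could be restructured along these lines. This is a sketch under several assumptions, not a tested patch for Opus: the function name bias_band_energy is hypothetical, the fixed-point macros are replaced by what they would mean in a floating-point build (a threshold of 2.0f and a factor of 0.25f), and the caller must guarantee that the three arrays really do not overlap. Whether the compiler then vectorizes the loop has to be confirmed again with -fopt-info-vec-all.

```c
#include <math.h>

/* Hypothetical rewrite: restrict-qualified parameters plus one base
   pointer per channel, so the inner loop indexes with i only. */
void bias_band_energy(float *restrict bandLogE,
                      const float *restrict oldBandE,
                      const float *restrict energyError,
                      int start, int end, int nbEBands, int channels)
{
    for (int c = 0; c < channels; c++) {
        float *logE = bandLogE + c * nbEBands;
        const float *oldE = oldBandE + c * nbEBands;
        const float *err = energyError + c * nbEBands;

        for (int i = start; i < end; i++) {
            /* Bias stable bands slightly towards the previous error. */
            if (fabsf(logE[i] - oldE[i]) < 2.0f)
                logE[i] -= err[i] * 0.25f;
        }
    }
}
```

The restrict qualifier is a promise from the programmer, not something the compiler checks. If the arrays can alias in practice, this version would be incorrect.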

Original and SVE2 Implementation Comparison

Now we know that auto-vectorization successfully generated SVE2 code. However, this is meaningless if the SVE2-enabled library does not produce the same results as the original library. Let's check whether the new version of the program works as well as the original.

# original file
$ ll libopus.so.0.8.0
-rwxr-xr-x. 1 swji1 swji1 1498808 Apr 13 20:16 libopus.so.0.8.0

# SVE2 implemented file
$ ll libopus.so.0.8.0
-rwxr-xr-x. 1 swji1 swji1 1684704 Apr 22 14:21 libopus.so.0.8.0

The SVE2 version is slightly larger (about 182 KiB, or roughly 12%), which is not a significant change.

Let's run the unit tests provided by the package authors. As we know from the previous post, we have to execute them through qemu-aarch64 emulation. Unlike the previous post, we will run several unit tests to see whether the SVE2 code works correctly.

$ ./test_opus_api
Testing the libopus 1.3.1-107-gccaaffa9-dirty API deterministically
Decoder basic API tests
  ---------------------------------------------------
    opus_decoder_get_size(0)=0 ................... OK.
    opus_decoder_get_size(1)=18228 ............... OK.
    opus_decoder_get_size(2)=26996 ............... OK.
    opus_decoder_get_size(3)=0 ................... OK.
    opus_decoder_create() ........................ OK.
    opus_decoder_init() .......................... OK.
    OPUS_GET_FINAL_RANGE ......................... OK.
    OPUS_UNIMPLEMENTED ........................... OK.
    OPUS_GET_BANDWIDTH ........................... OK.
    OPUS_GET_SAMPLE_RATE ......................... OK.
    OPUS_GET_PITCH ............................... OK.
    OPUS_GET_LAST_PACKET_DURATION ................ OK.
    OPUS_SET_GAIN ................................ OK.
    OPUS_GET_GAIN ................................ OK.
    OPUS_RESET_STATE ............................. OK.
    opus_{packet,decoder}_get_nb_samples() ....... OK.
    opus_packet_get_nb_frames() .................. OK.
    opus_packet_get_bandwidth() .................. OK.
    opus_packet_get_samples_per_frame() .......... OK.
    opus_decode() ................................ OK.
    opus_decode_float() .......................... OK.
                   All decoder interface tests passed
                             (1219433 API invocations)
# ...

Repacketizer tests
  ---------------------------------------------------
    opus_repacketizer_get_size()=496 ............. OK.
    opus_repacketizer_init ....................... OK.
    opus_repacketizer_create ..................... OK.
    opus_repacketizer_get_nb_frames .............. OK.
    opus_repacketizer_cat ........................ OK.
    opus_repacketizer_out ........................ OK.
    opus_repacketizer_out_range .................. OK.
    opus_packet_pad .............................. OK.
    opus_packet_unpad ............................ OK.
    opus_multistream_packet_pad .................. OK.
    opus_multistream_packet_unpad ................ OK.
                        All repacketizer tests passed
                            (6713561 API invocations)

  malloc() failure tests
  ---------------------------------------------------
    opus_decoder_create() ................... SKIPPED.
    opus_encoder_create() ................... SKIPPED.
    opus_repacketizer_create() .............. SKIPPED.
    opus_multistream_decoder_create() ....... SKIPPED.
    opus_multistream_encoder_create() ....... SKIPPED.
(Test only supported with GLIBC and without valgrind)

All API tests passed.
The libopus API was invoked 115421979 times.

$ ./test_opus_decode
Testing libopus 1.3.1-107-gccaaffa9-dirty decoder. Random seed: 2918850151 (76BD)
  Starting 10 decoders...
    opus_decoder_create(48000,1) OK. Copy OK.
    opus_decoder_create(48000,2) OK. Copy OK.
    opus_decoder_create(24000,1) OK. Copy OK.
    opus_decoder_create(24000,2) OK. Copy OK.
    opus_decoder_create(16000,1) OK. Copy OK.
    opus_decoder_create(16000,2) OK. Copy OK.
    opus_decoder_create(12000,1) OK. Copy OK.
    opus_decoder_create(12000,2) OK. Copy OK.
    opus_decoder_create( 8000,1) OK. Copy OK.
    opus_decoder_create( 8000,2) OK. Copy OK.
  dec[all] initial frame PLC OK.
  dec[all] all 2-byte prefix for length 3 and PLC, all modes (64) OK.
  dec[  5] all 3-byte prefix for length 4, mode 28 OK.
  dec[  0] all 3-byte prefix for length 4, mode  4 OK.
  dec[all] random packets, all modes (64), every 8th size from from 7 bytes to maximum OK.
  dec[all] random packets, all mode pairs (4096), 145 bytes/frame OK.
  dec[  3] random packets, all mode pairs (4096)*10, 81 bytes/frame OK.
  dec[  0] pre-selected random packets OK.
  Decoders stopped.
  Testing opus_pcm_soft_clip... OK.

$ ./test_opus_encode
Testing libopus 1.3.1-107-gccaaffa9-dirty encoder. Random seed: 2953257216 (421F)
Running simple tests for bugs that have been fixed previously
  Encode+Decode tests.
    Mode     LP FB encode  VBR,   9119 bps OK.
    Mode     LP FB encode  VBR,  13234 bps OK.
    Mode     LP FB encode  VBR,  64668 bps OK.
    Mode Hybrid FB encode  VBR,  28306 bps OK.
    Mode Hybrid FB encode  VBR,  54852 bps OK.
    Mode Hybrid FB encode  VBR,  55130 bps OK.
    Mode Hybrid FB encode  VBR,  96362 bps OK.
    Mode   MDCT FB encode  VBR, 893620 bps OK.
    Mode   MDCT FB encode  VBR,  25608 bps OK.
    Mode   MDCT FB encode  VBR,  29011 bps OK.
    Mode   MDCT FB encode  VBR,  93628 bps OK.
    Mode   MDCT FB encode  VBR,  93328 bps OK.
    Mode   MDCT FB encode  VBR, 160982 bps OK.
# ...
    Mode     LP NB dual-mono MS encode  CBR,  21883 bps OK.
    Mode     LP NB dual-mono MS encode  CBR,  60566 bps OK.
    Mode     LP NB dual-mono MS encode  CBR,  76774 bps OK.
    Mode     LP NB dual-mono MS encode  CBR, 167879 bps OK.
    Mode   MDCT NB dual-mono MS encode  CBR,   6953 bps OK.
    Mode   MDCT NB dual-mono MS encode  CBR,  12756 bps OK.
    Mode   MDCT NB dual-mono MS encode  CBR,  60193 bps OK.
    Mode   MDCT NB dual-mono MS encode  CBR,  14915 bps OK.
    Mode   MDCT NB dual-mono MS encode  CBR,  16946 bps OK.
    Mode   MDCT NB dual-mono MS encode  CBR,  34028 bps OK.
    Mode   MDCT NB dual-mono MS encode  CBR,  86938 bps OK.
    Mode   MDCT NB dual-mono MS encode  CBR, 172977 bps OK.
    All framesize pairs switching encode, 9683 frames OK.
Running fuzz_encoder_settings with 5 encoder(s) and 40 setting change(s) each.
Tests completed successfully.

As we can see, the SVE2 build passes all of these unit tests, which indicates that it works as well as the original program.
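
A stricter check, which we did not perform in this post, would be to decode the same input with both builds and compare the output samples directly. The helper below is a minimal sketch of such a comparison; the function name max_abs_diff is hypothetical, and it assumes both decoders wrote float PCM into buffers of equal length.

```c
#include <math.h>
#include <stddef.h>

/* Returns the largest absolute per-sample difference between two buffers. */
float max_abs_diff(const float *a, const float *b, size_t n)
{
    float worst = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float d = fabsf(a[i] - b[i]);
        if (d > worst)
            worst = d;
    }
    return worst;
}
```

A result of 0.0f would mean the two builds produced identical samples for that input. A small non-zero value would call for a closer look at which loops differ.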

Conclusion

In this post, we found that the compiler successfully vectorized much of the code, and the substantial number of SVE2-specific instructions and registers suggests that a noticeable performance improvement is possible, although we could not measure it. We also checked that SVE2 does not break the program: it passes the same unit tests as the original build. These findings suggest that the authors of the opus package may benefit from vectorization when SVE2 hardware becomes widely available.