DEV Community

Seung Woo (Paul) Ji
Seung Woo (Paul) Ji

Posted on

Implementing SVE2 for Opus Codec Library Part 3: Auto-vectorization

Introduction

Previously, we tried to implement SVE2 by using the existing codes that are written in NEON instructions. Unfortunately, the result was not so fruitful. Instead, in this post, we will try to utilize the auto-vectorization method in order to add SVE2 instructions.

Before We Start

As we know, Opus package already supports NEON intrinsics. When we run the ./configure, we can see that the script automatically detects that the existing processor supports ARM NEON intrinsics optimizations.

$ ./configure
opus 1.3.1-107-gccaaffa9-dirty:  Automatic configuration OK.

    Compiler support:

      C99 var arrays: ................ yes
      C99 lrintf: .................... yes
      Use alloca: .................... no (using var arrays)

    General configuration:

      Floating point support: ........ yes
      Fast float approximations: ..... no
      Fixed point debugging: ......... no
      Inline Assembly Optimizations: . No inline ASM for your platform, please send patches
      External Assembly Optimizations:
      Intrinsics Optimizations: ...... ARM (NEON) (NEON Aarch64)
      Run-time CPU detection: ........ no
      Custom modes: .................. no
      Assertion checking: ............ no
      Hardening: ..................... yes
      Fuzzing: ....................... no
      Check ASM: ..................... no

      API documentation: ............. yes
      Extra programs: ................ yes

Enter fullscreen mode Exit fullscreen mode

However, NEON intrinsic may conflict with auto-vectorizations that are done by the compiler. For this reason, we have to look for a way to disable this feature in the first place.

Thankfully, we can easily achieve this by editing the configure.ac file. When we search for neon keyword, we can find the following codes:

AS_IF([test x"$enable_intrinsics" = x"yes"],[
   intrinsics_support=""
   AS_CASE([$host_cpu],
   [arm*|aarch64*],
   [
      cpu_arm=yes
      OPUS_CHECK_INTRINSICS(
         [ARM Neon],
         [$ARM_NEON_INTR_CFLAGS],
         [OPUS_ARM_MAY_HAVE_NEON_INTR],
         [OPUS_ARM_PRESUME_NEON_INTR],
         [[#include <arm_neon.h>
         ]],
         [[
            static float32x4_t A0, A1, SUMM;
            SUMM = vmlaq_f32(SUMM, A0, A1);
            return (int)vgetq_lane_f32(SUMM, 0);
         ]]
      )
Enter fullscreen mode Exit fullscreen mode

The code checks if the enable_intrinsics is true or not. By changing x"yes" to x"", we can disable the intrinsic configuration. When we rerun autogen.sh and configure scripts, we can confirm that the intrinsic optimizations are disabled.

$ ./autogen.sh
$ ./configure
opus 1.3.1-107-gccaaffa9-dirty:  Automatic configuration OK.

    Compiler support:

      C99 var arrays: ................ yes
      C99 lrintf: .................... yes
      Use alloca: .................... no (using var arrays)

    General configuration:

      Floating point support: ........ yes
      Fast float approximations: ..... no
      Fixed point debugging: ......... no
      Inline Assembly Optimizations: . No inline ASM for your platform, please send patches
      External Assembly Optimizations:
      Intrinsics Optimizations: ...... no
      Run-time CPU detection: ........ no
      Custom modes: .................. no
      Assertion checking: ............ no
      Hardening: ..................... yes
      Fuzzing: ....................... no
      Check ASM: ..................... no

      API documentation: ............. yes
      Extra programs: ................ yes
Enter fullscreen mode Exit fullscreen mode

Auto-vectorization Implementation

Before we begin, we need to check what what compiler flags are being used by the package. To do this, we can look at the Makefile.

# Makefile
CFLAGS = -g -O2 -fvisibility=hidden -D_FORTIFY_SOURCE=2 -W -Wall -Wextra -Wcast-align -Wnested-externs -Wshadow -Wstrict-prototypes
Enter fullscreen mode Exit fullscreen mode

In order to enable the auto-vectorization, we need to edit the CFLAGS. However, this is not-so-easy job to do because there are multiple Makefile throughout the package. That is, we have to edit (possibly) all the following files.

$ find . -name Makefile
./doc/Makefile
./doc/latex/Makefile
./Makefile
./celt/dump_modes/Makefile
Enter fullscreen mode Exit fullscreen mode

Fortunately, we can avoid this problem by overriding the configure script that is generated for us. For this, we use the existing CFLAGS and modify it as follows:

$ ./configure CFLAGS="-g -O3 -march=armv8-a+sve2 -fvisibility=hidden -D_FORTIFY_SOURCE=2 -W -Wall -Wextra -Wcast-align -Wnested-externs -Wshadow -Wstrict-prototypes"
Enter fullscreen mode Exit fullscreen mode

We change the optimization level to 03 to enable the auto-vectorization and specify the machine architecture to be ArmV8 with SVE2 extension. Once we run it, we can check that the CFLAGS are updated successfully in the Makefile.

# Makefile
CFLAGS = -g -O3 -march=armv8-a+sve2 -fvisibility=hidden -D_FORTIFY_SOURCE=2 -W -Wall -Wextra -Wcast-align -Wnested-externs -Wshadow -Wstrict-prototypes -fvisibility=hidden -W -Wall -Wextra -Wcast-align -Wnested-externs -Wshadow -Wstrict-prototypes
Enter fullscreen mode Exit fullscreen mode

Let's try to run the unit tests to see whether the auto-vectorization kicked in during the compilation.

$ make check

# ...

./test-driver: line 107: 1431013 Illegal instruction     (core dumped) "$@" > $log_file 2>&1
FAIL: celt/tests/test_unit_cwrs32
./test-driver: line 107: 1431028 Illegal instruction     (core dumped) "$@" > $log_file 2>&1
FAIL: tests/test_opus_api

# ...

============================================================================
Testsuite summary for opus 1.3.1-107-gccaaffa9-dirty
============================================================================
# TOTAL: 14
# PASS:  4
# SKIP:  0
# XFAIL: 0
# FAIL:  10
# XPASS: 0
# ERROR: 0
============================================================================
See ./test-suite.log
Please report to opus@xiph.org
============================================================================
Enter fullscreen mode Exit fullscreen mode

This is expected because we are trying to execute a binary that is coded with SVE2 instructions and the existing hardware does not yet support them. To solve this, we need to run the emulation by using qemu-aarch64 command. Let's run the command using one of the unit test that has failed - test_opus_api.

$ qemu-aarch64 test_opus_api
Error while loading test_opus_api: Exec format error
Enter fullscreen mode Exit fullscreen mode

Interestingly, the command cannot run because the test file format is invalid. Let's take a look at the content of the test file.

$ vi test_opus_api
#! /bin/sh

# tests/test_opus_api - temporary wrapper script for .libs/test_opus_api
# Generated by libtool (GNU libtool) 2.4.6
#
# The tests/test_opus_api program cannot be directly executed until all the libtool
# libraries that it depends on are installed.
#
# This wrapper script should never be moved out of the build directory.
# If it is, it will not operate correctly.

# Sed substitution that helps us do robust quoting.  It backslashifies
# metacharacters that are still active within double-quoted strings.

Enter fullscreen mode Exit fullscreen mode

The test file is actually a wrapper script and that is why the qemu-aarch64 cannot run. We can easily solve this by inserting the command when the script launches the actual program.

Let's look for a code where the script starts the program and add the qemu-aarch64 command:

# Core function for launching the target application
func_exec_program_core ()
{

      if test -n "$lt_option_debug"; then
        $ECHO "test_opus_api:tests/test_opus_api:$LINENO: newargv[0]: $progdir/$program" 1>&2
        func_lt_dump_args ${1+"$@"} 1>&2
      fi
      exec qemu-aarch64 "$progdir/$program" ${1+"$@"}

      $ECHO "$0: cannot exec $program $*" 1>&2
      exit 1
}

Enter fullscreen mode Exit fullscreen mode

When we rerun the unit test, we can see that all tests passed without any problems.

$ ./test_opus_api
Testing the libopus 1.3.1-107-gccaaffa9-dirty API deterministically

  Decoder basic API tests
  ---------------------------------------------------
    opus_decoder_get_size(0)=0 ................... OK.
    opus_decoder_get_size(1)=18228 ............... OK.
    opus_decoder_get_size(2)=26996 ............... OK.
    opus_decoder_get_size(3)=0 ................... OK.
    opus_decoder_create() ........................ OK.
    opus_decoder_init() .......................... OK.
    OPUS_GET_FINAL_RANGE ......................... OK.
    OPUS_UNIMPLEMENTED ........................... OK.
    OPUS_GET_BANDWIDTH ........................... OK.
    OPUS_GET_SAMPLE_RATE ......................... OK.
    OPUS_GET_PITCH ............................... OK.
    OPUS_GET_LAST_PACKET_DURATION ................ OK.
    OPUS_SET_GAIN ................................ OK.
    OPUS_GET_GAIN ................................ OK.
    OPUS_RESET_STATE ............................. OK.
    opus_{packet,decoder}_get_nb_samples() ....... OK.
    opus_packet_get_nb_frames() .................. OK.
    opus_packet_get_bandwidth() .................. OK.
    opus_packet_get_samples_per_frame() .......... OK.
    opus_decode() ................................ OK.
    opus_decode_float() .......................... OK.
                   All decoder interface tests passed
                             (1219433 API invocations)

  Multistream decoder basic API tests
  ---------------------------------------------------
    opus_multistream_decoder_get_size(-1,-1)=0 ... OK.
    opus_multistream_decoder_get_size(-1, 0)=0 ... OK.
    opus_multistream_decoder_get_size(-1, 1)=0 ... OK.
    opus_multistream_decoder_get_size(-1, 2)=0 ... OK.
    opus_multistream_decoder_get_size(-1, 3)=0 ... OK.
    opus_multistream_decoder_get_size( 0,-1)=0 ... OK.
    opus_multistream_decoder_get_size( 0, 0)=0 ... OK.
    opus_multistream_decoder_get_size( 0, 1)=0 ... OK.
    opus_multistream_decoder_get_size( 0, 2)=0 ... OK.
    opus_multistream_decoder_get_size( 0, 3)=0 ... OK.
    opus_multistream_decoder_get_size( 1,-1)=0 ... OK.
    opus_multistream_decoder_get_size( 1, 0)=18504 OK.
    opus_multistream_decoder_get_size( 1, 1)=27272 OK.
    opus_multistream_decoder_get_size( 1, 2)=0 ... OK.
    opus_multistream_decoder_get_size( 1, 3)=0 ... OK.
    opus_multistream_decoder_get_size( 2,-1)=0 ... OK.
    opus_multistream_decoder_get_size( 2, 0)=36736 OK.
    opus_multistream_decoder_get_size( 2, 1)=45504 OK.
    opus_multistream_decoder_get_size( 2, 2)=54272 OK.
    opus_multistream_decoder_get_size( 2, 3)=0 ... OK.
    opus_multistream_decoder_get_size( 3,-1)=0 ... OK.
    opus_multistream_decoder_get_size( 3, 0)=54968 OK.
    opus_multistream_decoder_get_size( 3, 1)=63736 OK.
    opus_multistream_decoder_get_size( 3, 2)=72504 OK.
    opus_multistream_decoder_get_size( 3, 3)=81272 OK.
    opus_multistream_decoder_create() ............ OK.
    opus_multistream_decoder_init() .............. OK.
    OPUS_GET_FINAL_RANGE ......................... OK.
    OPUS_MULTISTREAM_GET_DECODER_STATE ........... OK.
    OPUS_SET_GAIN ................................ OK.
    OPUS_GET_GAIN ................................ OK.
    OPUS_GET_BANDWIDTH ........................... OK.
    OPUS_UNIMPLEMENTED ........................... OK.
    OPUS_RESET_STATE ............................. OK.
    opus_multistream_decode() .................... OK.
    opus_multistream_decode_float() .............. OK.
       All multistream decoder interface tests passed
                             (576106 API invocations)

  Packet header parsing tests
  ---------------------------------------------------
    code 0 (65 cases) ............................ OK.
    code 1 (163456 cases) ........................ OK.
    code 2 (326528 cases) ........................ OK.
    code 3 m-truncation (64 cases) ............... OK.
    code 3 m=0,49-64 (4096 cases) ................ OK.
    code 3 m=1 CBR (81728 cases) ................. OK.
    code 3 m=1-48 CBR (103544448 cases) .......... OK.
    code 3 m=1-48 VBR (120832 cases) ............. OK.
    code 3 padding (1519448 cases) ............... OK.
    opus_packet_parse ............................ OK.
                      All packet parsing tests passed
                          (105760666 API invocations)

  Encoder basic API tests
  ---------------------------------------------------
    opus_encoder_get_size(0)=0 ................... OK.
    opus_encoder_get_size(1)=43572 ............... OK.
    opus_encoder_get_size(2)=48484 ............... OK.
    opus_encoder_get_size(3)=0 ................... OK.
    opus_encoder_create() ........................ OK.
    opus_encoder_init() .......................... OK.
    OPUS_GET_LOOKAHEAD ........................... OK.
    OPUS_GET_SAMPLE_RATE ......................... OK.
    OPUS_UNIMPLEMENTED ........................... OK.
    OPUS_SET_APPLICATION ......................... OK.
    OPUS_GET_APPLICATION ......................... OK.
    OPUS_SET_BITRATE ............................. OK.
    OPUS_GET_BITRATE ............................. OK.
    OPUS_SET_FORCE_CHANNELS ...................... OK.
    OPUS_GET_FORCE_CHANNELS ...................... OK.
    OPUS_SET_BANDWIDTH ........................... OK.
    OPUS_GET_BANDWIDTH ........................... OK.
    OPUS_SET_MAX_BANDWIDTH ....................... OK.
    OPUS_GET_MAX_BANDWIDTH ....................... OK.
    OPUS_SET_DTX ................................. OK.
    OPUS_GET_DTX ................................. OK.
    OPUS_SET_COMPLEXITY .......................... OK.
    OPUS_GET_COMPLEXITY .......................... OK.
    OPUS_SET_INBAND_FEC .......................... OK.
    OPUS_GET_INBAND_FEC .......................... OK.
    OPUS_SET_PACKET_LOSS_PERC .................... OK.
    OPUS_GET_PACKET_LOSS_PERC .................... OK.
    OPUS_SET_VBR ................................. OK.
    OPUS_GET_VBR ................................. OK.
    OPUS_SET_VBR_CONSTRAINT ...................... OK.
    OPUS_GET_VBR_CONSTRAINT ...................... OK.
    OPUS_SET_SIGNAL .............................. OK.
    OPUS_GET_SIGNAL .............................. OK.
    OPUS_SET_LSB_DEPTH ........................... OK.
    OPUS_GET_LSB_DEPTH ........................... OK.
    OPUS_SET_PREDICTION_DISABLED ................. OK.
    OPUS_GET_PREDICTION_DISABLED ................. OK.
    OPUS_SET_EXPERT_FRAME_DURATION ............... OK.
    OPUS_GET_EXPERT_FRAME_DURATION ............... OK.
    OPUS_GET_FINAL_RANGE ......................... OK.
    OPUS_RESET_STATE ............................. OK.
    opus_encode() ................................ OK.
    opus_encode_float() .......................... OK.
                   All encoder interface tests passed
                             (1152209 API invocations)

  Repacketizer tests
  ---------------------------------------------------
    opus_repacketizer_get_size()=496 ............. OK.
    opus_repacketizer_init ....................... OK.
    opus_repacketizer_create ..................... OK.
    opus_repacketizer_get_nb_frames .............. OK.
    opus_repacketizer_cat ........................ OK.
    opus_repacketizer_out ........................ OK.
    opus_repacketizer_out_range .................. OK.
    opus_packet_pad .............................. OK.
    opus_packet_unpad ............................ OK.
    opus_multistream_packet_pad .................. OK.
    opus_multistream_packet_unpad ................ OK.
                        All repacketizer tests passed
                            (6713561 API invocations)

  malloc() failure tests
  ---------------------------------------------------
    opus_decoder_create() ................... SKIPPED.
    opus_encoder_create() ................... SKIPPED.
    opus_repacketizer_create() .............. SKIPPED.
    opus_multistream_decoder_create() ....... SKIPPED.
    opus_multistream_encoder_create() ....... SKIPPED.
(Test only supported with GLIBC and without valgrind)

All API tests passed.
The libopus API was invoked 115421979 times.
Enter fullscreen mode Exit fullscreen mode

Now, we know the SVE2 implementation is successfully added to the program. Let's double-check this by looking for the presence of SVE2 specific instruction within the binary files. Using the following command, we can see the list of files with whilelo instruction.

find . -type f -executable | while read F ; do echo ======= $F ; objdump -d $F 2> /dev/null | grep whilelo ; done

#...
======= ./.libs/libopus.so.0.8.0
    2ef0:       25a40fe1        whilelo p1.s, wzr, w4
    2f28:       25a40c60        whilelo p0.s, w3, w4
    2f7c:       25a40fe1        whilelo p1.s, wzr, w4
    2fa0:       25a40c60        whilelo p0.s, w3, w4
    3314:       25b80fe0        whilelo p0.s, wzr, w24
    3344:       25b80c20        whilelo p0.s, w1, w24
    3784:       25b80fe0        whilelo p0.s, wzr, w24
    37a8:       25b80c00        whilelo p0.s, w0, w24
    39cc:       25b80fe0        whilelo p0.s, wzr, w24
    39f0:       25b80c00        whilelo p0.s, w0, w24
    3a5c:       25b80fe0        whilelo p0.s, wzr, w24
    3a80:       25b80c00        whilelo p0.s, w0, w24
    4488:       25b50fe2        whilelo p2.s, wzr, w21
    44fc:       25b50c00        whilelo p0.s, w0, w21
    4590:       25b50fe2        whilelo p2.s, wzr, w21
    45f8:       25b50c00        whilelo p0.s, w0, w21
    4780:       25a40fe2        whilelo p2.s, wzr, w4
    47e4:       25a40c00        whilelo p0.s, w0, w4
    4940:       25b80fe0        whilelo p0.s, wzr, w24
    4954:       25b80c00        whilelo p0.s, w0, w24
    4a2c:       25a40fe1        whilelo p1.s, wzr, w4
    4a50:       25a40c00        whilelo p0.s, w0, w4
    4b34:       25a40fe1        whilelo p1.s, wzr, w4
    4b68:       25a40c60        whilelo p0.s, w3, w4
    4e34:       25b30fe0        whilelo p0.s, wzr, w19
    4e54:       25b30c60        whilelo p0.s, w3, w19
    4fe4:       25b30fe0        whilelo p0.s, wzr, w19
    5000:       25b30c20        whilelo p0.s, w1, w19
    51d8:       25b30fe0        whilelo p0.s, wzr, w19
    5208:       25b30c00        whilelo p0.s, w0, w19
    54a0:       25a10fe0        whilelo p0.s, wzr, w1
    54b8:       25a10c00        whilelo p0.s, w0, w1
    55c8:       25a11fe0        whilelo p0.s, xzr, x1
    55e8:       25a11c00        whilelo p0.s, x0, x1
    5724:       25a11fe0        whilelo p0.s, xzr, x1
    5738:       25a11c00        whilelo p0.s, x0, x1
    5c2c:       25a80fe1        whilelo p1.s, wzr, w8
    5c7c:       25a80d21        whilelo p1.s, w9, w8
    5ec4:       25a50fe2        whilelo p2.s, wzr, w5
    5f30:       25a50c00        whilelo p0.s, w0, w5
    64dc:       25a10fe0        whilelo p0.s, wzr, w1
    6500:       25a10c00        whilelo p0.s, w0, w1
    6818:       25a11fe0        whilelo p0.s, xzr, x1
    6830:       25a11c00        whilelo p0.s, x0, x1
#...
Enter fullscreen mode Exit fullscreen mode

And when we count the total number of lines that use whilelo, we get a total of 2903 lines.

$ find . -type f -executable | while read F ; do echo ======= $F ; objdump -d $F 2> /dev/null | grep whilelo ; done | wc -l
2903
Enter fullscreen mode Exit fullscreen mode

Conclusion

In this post, we explored and implemented SVE2 by using auto-vectorization of the compiler. Using the existing unit tests, we were able to identify if the auto-vectorization was added successfully. We also found that it added a significant number of SVE2 specific instruction, whilelo (i.e. 2903 lines). This indicates that the Opus project may greatly benefit from SVE2 implementation.

Discussion (0)