Implementing SVE2 for Opus Codec Library Part 2: Compiler Intrinsics

Introduction

In the last post, we explored how we can compile and test the package. From now on, we will explore how we can add SVE2 implementation to it.

Finding Candidates

As we explored before, Opus contains a number of files that utilizes compiler intrinsics for SIMD implementation.

$ find | grep -i neon
./celt/arm/celt_neon_intr.c
./celt/arm/pitch_neon_intr.c
./silk/fixed/arm/warped_autocorrelation_FIX_neon_intr.c
./silk/arm/NSQ_neon.c
./silk/arm/LPC_inv_pred_gain_neon_intr.c
./silk/arm/NSQ_neon.h
./silk/arm/biquad_alt_neon_intr.c
./silk/arm/NSQ_del_dec_neon_intr.c

Among these, we need to find a file with loops. Let's take a look at celt_neon_intr.c file.

void xcorr_kernel_neon_fixed(const opus_val16 * x, const opus_val16 * y, opus_val32 sum[4], int len)
{
   int j;
   int32x4_t a = vld1q_s32(sum);
   /* Load y[0...3] */
   /* This requires len>0 to always be valid (which we assert in the C code). */
   int16x4_t y0 = vld1_s16(y);

   for (j = 0; j + 8 <= len; j += 8)
   {
      /* Load x[0...7] */
      int16x8_t xx = vld1q_s16(x);
      int16x4_t x0 = vget_low_s16(xx);
      int16x4_t x4 = vget_high_s16(xx);
      /* Load y[4...11] */
      int16x8_t yy = vld1q_s16(y);
      int16x4_t y4 = vget_low_s16(yy);
      int16x4_t y8 = vget_high_s16(yy);
      int32x4_t a0 = vmlal_lane_s16(a, y0, x0, 0);
      int32x4_t a1 = vmlal_lane_s16(a0, y4, x4, 0);

      int16x4_t y1 = vext_s16(y0, y4, 1);
      int16x4_t y5 = vext_s16(y4, y8, 1);
      int32x4_t a2 = vmlal_lane_s16(a1, y1, x0, 1);
      int32x4_t a3 = vmlal_lane_s16(a2, y5, x4, 1);

      int16x4_t y2 = vext_s16(y0, y4, 2);
      int16x4_t y6 = vext_s16(y4, y8, 2);
      int32x4_t a4 = vmlal_lane_s16(a3, y2, x0, 2);
      int32x4_t a5 = vmlal_lane_s16(a4, y6, x4, 2);

      int16x4_t y3 = vext_s16(y0, y4, 3);
      int16x4_t y7 = vext_s16(y4, y8, 3);
      int32x4_t a6 = vmlal_lane_s16(a5, y3, x0, 3);
      int32x4_t a7 = vmlal_lane_s16(a6, y7, x4, 3);

      y0 = y8;
      a = a7;
      x += 8;
      y += 8;
   }

 for (; j < len; j++)
   {
      int16x4_t x0 = vld1_dup_s16(x);  /* load next x */
      int32x4_t a0 = vmlal_s16(a, y0, x0);

      int16x4_t y4 = vld1_dup_s16(y);  /* load next y */
      y0 = vext_s16(y0, y4, 1);
      a = a0;
      x++;
      y++;
   }

   vst1q_s32(sum, a);
}

This function uses multiple intrinsic extensions inside of the for loops which meet our expectation. Before we start implementing SVE2, we need to understand the code thoroughly. Let's walk through the code one by one.

We can see the function takes in three arrays, x, y, and sum. The sum array is first loaded to the vector register with a tuple of 4 lanes that each has 32 bits in length. Since this code uses NEON to implement SIMD, it makes sense the total length of the vector register is limited to 128 bits in total.

Then, the y array is loaded to the vector with a tuple of 4 lanes in which each has 16 bits in length. These correspond to the first four elements in the y array (i.e. y[0...3]).

In the for loop, the x array is loaded into the register. The vector first contains 8 lanes of 16 bits. These, in turn, are divided into two groups, x0, and x4. Ultimately these correspond to the first eight elements in the x array (i.e. x[0...3]).

The code repeats the previous steps for y array. Since, we already assign a vector for the first four elements from the array, we start from the fifth element in the array. At the end, these correspond to the elements ranged from the eighth to the eleventh element (i.e. y[4...11]).

To better understand what we have learned, we can make the following diagram:

x*(val16)   0    1    2    3    4    5    6    7
            |      x0      |    |      x4      |   

y*(val16)   0    1    2    3    4    5    6    7    8    9    10    11
            |      y0      |    |      y4      |    |      y8       | 

sum(val32)  0            1            2            3

In the first vmlal_lane_s16, the intrinsic multiplies the first lane (0) of the x0 to each lane of y0. The result is then accumulated to the destination vector where each element is twice as long as the elements that are multiplied (i.e. 16 bit -> 32 bit). This means we do the following operations between two arrays:

x[0] * (y[0], y[1], y[2], y[3]) = (sum[0], sum[1], sum[2], sum[3])

We repeat the same operation as above but with y4 and x4.

Next, vext_s16 extract a vector from the y0 and y4 pairs. This is done by extracting the lowest vector elements from y4 and the highest vector elements from y0 starting from the element of desired index (i.e. 1). This means we get the following vector as a result:

y0    : y[0], y[1], y[2], y[3] // taking the highest vector starting from the index 1.
y4    : y[4], y[5], y[6], y[7] // filling up the result vector by taking the lowest vector
Result: y[1], y[2], y[3], y[4]

Afterwards, we do the same steps to keep multiplying and adding the rest of x and y elements.

Problem

Unfortunately, the codes that we walked though together are not easy to translate into ones with SVE2 instructions. One of the reasons is because of the lack of SVE2 counterparts of the NEON instruction that are used. This makes sense considering that the SVE2 does not restrict the length of vector registers. In order to solve this, we have to rewrite the codes in such a way that no
tuple of vector lanes are used.

Conclusion

In this post, we explored and analyzed whether the intrinsic codes the package uses are good for implementing SVE2. Unfortunately, the codes are fairly complex and requires more NEON and SVE2 knowledges that are beyond the scope that we have covered in the previous posts. In the following post, we will look for an alternative method to implement SVE2 - that is, by using auto-vectorization.