SPO600-Lab6, Project Stage 2: Analysis of an AArch64 SIMD Optimization

#spo600 #simd #numpy #aarch64

Introduction

At this stage of the project, I am going to focus on one open source project that applies SIMD optimization (see the definition below), then locate and analyze its SIMD code for AArch64 and for other architectures, including x86_64 with its SSE/SSE2 and AVX512 extensions.

Open Source Projects

All of the following projects implement SIMD optimizations:

ardour
eigen
ffmpeg
gnuradio
gromacs
lame
mjpegtools
nettle
numpy
pipewire
pulseaudio
vlc
xz

SIMD

The terms "SIMD optimization" and "SIMD operations" are used interchangeably here. Single instruction, multiple data (SIMD) is a type of parallel processing: it describes computers with multiple processing elements that perform the same operation on multiple data points simultaneously.
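To make the definition concrete, here is a minimal sketch in portable C that contrasts a plain scalar loop with a loop shaped the way SIMD code usually is. The names add_scalar, add_lanes and LANES are my own, and the width of four floats is an assumption (it matches a 128-bit register such as NEON or SSE). The inner loop only stands in for a single vector instruction; whether the compiler actually emits one depends on the compiler and its flags.

```c
#include <stddef.h>

/* Assumed vector width: four 32-bit floats, as in a 128-bit register. */
#define LANES 4

/* Scalar version: one addition per loop iteration. */
void add_scalar(const float *a, const float *b, float *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        out[i] = a[i] + b[i];
    }
}

/* SIMD-shaped version: each outer iteration handles LANES elements.
 * The inner loop stands in for one vector add instruction. */
void add_lanes(const float *a, const float *b, float *out, size_t n)
{
    size_t i = 0;
    for (; i + LANES <= n; i += LANES) {
        for (size_t lane = 0; lane < LANES; lane++) {
            out[i + lane] = a[i + lane] + b[i + lane];
        }
    }
    /* Tail: leftover elements that do not fill a whole vector. */
    for (; i < n; i++) {
        out[i] = a[i] + b[i];
    }
}
```

Both functions produce the same results; the second one just makes the "same operation on multiple data points" structure visible, including the scalar tail that real SIMD loops also need.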

NumPy

The source code of NumPy can be found here. NumPy is the fundamental package for scientific computing with Python. Python was not originally designed for numeric computation, and NumPy was created in 2005 to address this challenge. As the array size increases, NumPy can be around 30 times faster than a Python list.

Several years ago, a Stack Overflow answer pointed out that the NumPy project contains SIMD-vectorized operations. In the source code, this was written as:

"""
/* Currently contains sse2 functions that are built on amd64, x32 or non-generic builds (CFLAGS=-march=...) */
"""

However, this is only a simple example of SIMD optimization applied in this project.

Details of SIMD in NumPy

A detailed page in the NumPy documentation, CPU/SIMD Optimizations, states that:

"""
NumPy comes with a flexible working mechanism that allows it to harness the SIMD features that CPUs own, in order to provide faster and more stable performance on all popular platforms. Currently, NumPy supports the X86, IBM/Power, ARM7 and ARM8 architectures.
"""

AArch64, also known as ARM64, was first introduced with the ARMv8-A architecture, so it falls under the "ARM8" entry in the quote above. Another review paper was published here.

The SIMD optimization support for AArch64 was summarized here:

The SIMD optimization support for X86 was summarized here:

However, the Stack Overflow answer above is now outdated. An important pull request, ENH: enable multi-platform SIMD compiler optimizations, was merged last year. This PR is a good starting point for tracing the SIMD optimizations applied in NumPy: it contains both extensive discussion of how to implement SIMD in NumPy and the corresponding code changes.
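One general pattern for supporting several SIMD levels in a single binary is run-time dispatch: build more than one version of a kernel, probe the CPU once, and call through a function pointer. The sketch below illustrates that idea only and is not NumPy's implementation. Every name in it (dot_fn, dot_scalar, dot_unrolled4, cpu_has_simd, select_dot) is hypothetical, and the "optimized" kernel is plain unrolled C standing in for a real SIMD build. The CPU probe assumes Linux on AArch64 (getauxval with HWCAP_ASIMD) or GCC/Clang on x86 (__builtin_cpu_supports); on anything else it falls back to the scalar kernel.

```c
#include <stddef.h>

#if defined(__aarch64__) && defined(__linux__)
    #include <sys/auxv.h>   /* getauxval, AT_HWCAP */
    #include <asm/hwcap.h>  /* HWCAP_ASIMD */
#endif

/* Hypothetical kernel signature: dot product of two float arrays. */
typedef float (*dot_fn)(const float *, const float *, size_t);

/* Baseline kernel that runs on every CPU. */
float dot_scalar(const float *a, const float *b, size_t n)
{
    float total = 0.0f;
    for (size_t i = 0; i < n; i++) {
        total += a[i] * b[i];
    }
    return total;
}

/* Stand-in for a SIMD build of the same kernel: four independent
 * partial sums, the shape a 4-lane vector accumulator would have. */
float dot_unrolled4(const float *a, const float *b, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    float total = (s0 + s1) + (s2 + s3);
    for (; i < n; i++) {
        total += a[i] * b[i];  /* scalar tail */
    }
    return total;
}

/* Run-time probe; returns 0 when no probe is available (assumption:
 * only Linux/AArch64 and GCC or Clang on x86 are handled here). */
int cpu_has_simd(void)
{
#if defined(__aarch64__) && defined(__linux__)
    return (getauxval(AT_HWCAP) & HWCAP_ASIMD) != 0;
#elif (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
    __builtin_cpu_init();
    return __builtin_cpu_supports("sse2") != 0;
#else
    return 0;
#endif
}

/* Pick a kernel once; callers keep the returned pointer. */
dot_fn select_dot(void)
{
    return cpu_has_simd() ? dot_unrolled4 : dot_scalar;
}
```

A caller would write something like `dot_fn dot = select_dot();` once at start-up and then use `dot(a, b, n)` everywhere, so the cost of the CPU check is paid a single time.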

As a code example, the excerpt below is part of the selection mechanism for the SIMD code, including the compile-time mechanism:

#ifndef NPY_DISABLE_OPTIMIZATION
    #if defined(__powerpc64__) && !defined(__cplusplus) && defined(bool)
        /**
         * "altivec.h" header contains the definitions(bool, vector, pixel),
         * usually in c++ we undefine them after including the header.
         * It's better anyway to take them off and use built-in types(__vector, __pixel, __bool) instead,
         * since c99 supports bool variables which may lead to ambiguous errors.
         */
        // backup 'bool' before including '_cpu_dispatch.h', since it may not defiend as a compiler token.
        #define NPY__DISPATCH_DEFBOOL
        typedef bool npy__dispatch_bkbool;
    #endif
    #include "_cpu_dispatch.h"
    #ifdef NPY_HAVE_VSX
        #undef bool
        #undef vector
        #undef pixel
        #ifdef NPY__DISPATCH_DEFBOOL
            #define bool npy__dispatch_bkbool
        #endif
    #endif
#endif
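The excerpt above relies on preprocessor macros to decide what gets compiled. The same compile-time idea can be shown in a much smaller, self-contained form. This is only a sketch under my own assumptions, not NumPy code: the macros SKETCH_USE_NEON and SKETCH_USE_SSE2 and the function sum_f32 are hypothetical names, and the check uses the compiler-defined macros __aarch64__, __ARM_NEON, __SSE2__ and _M_X64 to choose between a NEON path, an SSE path, and a scalar fallback.

```c
#include <stddef.h>

/* Compile-time selection: exactly one of these paths is compiled in. */
#if defined(__aarch64__) || defined(__ARM_NEON)
    #include <arm_neon.h>
    #define SKETCH_USE_NEON 1
#elif defined(__SSE2__) || defined(_M_X64)
    #include <emmintrin.h>
    #define SKETCH_USE_SSE2 1
#endif

/* Sum an array of floats, four lanes at a time where SIMD is available. */
float sum_f32(const float *src, size_t n)
{
    size_t i = 0;
    float total = 0.0f;

#if defined(SKETCH_USE_NEON)
    float32x4_t acc = vdupq_n_f32(0.0f);
    for (; i + 4 <= n; i += 4) {
        acc = vaddq_f32(acc, vld1q_f32(src + i));   /* 4 adds at once */
    }
    float lanes[4];
    vst1q_f32(lanes, acc);
    total = (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
#elif defined(SKETCH_USE_SSE2)
    __m128 acc = _mm_setzero_ps();
    for (; i + 4 <= n; i += 4) {
        acc = _mm_add_ps(acc, _mm_loadu_ps(src + i)); /* 4 adds at once */
    }
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    total = (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
#endif

    /* Scalar tail, and the whole loop when no SIMD path was selected. */
    for (; i < n; i++) {
        total += src[i];
    }
    return total;
}
```

Because the choice is made by the preprocessor, a binary built this way contains only one of the three paths. Note that floating-point sums can differ slightly between paths for general inputs, since the additions happen in a different order.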

Summary

In general, the SIMD code in NumPy is used to accelerate numerical array operations, including linear algebra. ARMv9 is on the horizon, and it includes an improved SIMD implementation called Scalable Vector Extension version 2 (SVE2). At the time of writing, NumPy does not yet support ARMv9.

The following code:

#ifdef _MSC_VER
    #include <Intrin.h>
#endif
#include <arm_neon.h>

int main(void)
{
    float32x4_t v1 = vdupq_n_f32(1.0f), v2 = vdupq_n_f32(2.0f);
    /* MAXMIN */
    int ret  = (int)vgetq_lane_f32(vmaxnmq_f32(v1, v2), 0);
        ret += (int)vgetq_lane_f32(vminnmq_f32(v1, v2), 0);
    /* ROUNDING */
    ret += (int)vgetq_lane_f32(vrndq_f32(v1), 0);
#ifdef __aarch64__
    {
        float64x2_t vd1 = vdupq_n_f64(1.0), vd2 = vdupq_n_f64(2.0);
        /* MAXMIN */
        ret += (int)vgetq_lane_f64(vmaxnmq_f64(vd1, vd2), 0);
        ret += (int)vgetq_lane_f64(vminnmq_f64(vd1, vd2), 0);
        /* ROUNDING */
        ret += (int)vgetq_lane_f64(vrndq_f64(vd1), 0);
    }
#endif
    return ret;
}

This code targets AArch64 and might need to be updated for ARMv9.
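To see what the NEON program above computes without needing an ARM machine, here is a scalar sketch of lane 0 of each operation. The function names are hypothetical, and the mapping is my reading of the intrinsics: vmaxnmq/vminnmq pick the larger/smaller value, and vrndq rounds toward zero, which a C cast to int also does for values in range. NaN handling, which the real instructions define, is ignored here because the inputs are the constants 1.0 and 2.0.

```c
/* Scalar stand-ins for lane 0 of vmaxnmq / vminnmq (NaNs ignored). */
double lane_max(double a, double b) { return a > b ? a : b; }
double lane_min(double a, double b) { return a < b ? a : b; }

/* Stand-in for vrndq: round toward zero (valid for values in int range). */
double lane_round_to_zero(double a) { return (double)(int)a; }

/* Mirrors the control flow of the NEON program: the float32 part always
 * runs, the float64 part only when has_f64 is nonzero (the __aarch64__
 * branch in the original). */
int probe_scalar(int has_f64)
{
    int ret = 0;

    /* float32 section: max(1, 2) + min(1, 2) + round(1) */
    ret += (int)lane_max(1.0, 2.0);
    ret += (int)lane_min(1.0, 2.0);
    ret += (int)lane_round_to_zero(1.0);

    if (has_f64) {
        /* float64 section repeats the same three operations */
        ret += (int)lane_max(1.0, 2.0);
        ret += (int)lane_min(1.0, 2.0);
        ret += (int)lane_round_to_zero(1.0);
    }
    return ret;
}
```

Under these assumptions probe_scalar(0) returns 4 and probe_scalar(1) returns 8. The value itself looks incidental: the program reads like a feature probe, where what matters is whether the compiler accepts the intrinsics at all.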