Multiple levels of parallel hardware

#softwareengineering #computerscience #performance #concurrency

An excerpt from Grokking Concurrency by Kirill Bobrov

Take 25% off Grokking Concurrency by entering fccbobrov into the discount code box at checkout at manning.com.

After the multicore crisis shook the industry, computer architects stopped trying to make individual chips ever smarter and faster and began to increase the parallelism of hardware components instead. This represented a shift from relying on higher clock speeds, which required higher power consumption, to a more energy-efficient parallel approach, which resulted in multiple levels of parallelism in the hardware.

CPU

A CPU is composed of a large number of circuits, among them arithmetic logic units (ALUs) that perform basic arithmetic operations such as addition or multiplication. Because the CPU has many arithmetic units, it can break up complex mathematical operations so that subparts of the operation run on separate arithmetic units simultaneously. This is called instruction-level parallelism. Sometimes this parallelism is taken to an even deeper level: bit-level parallelism.

Developers rarely think about this level, and there is not much sense in doing so. The work of arranging instructions in the most convenient sequence for the processor is done by the compiler. Only two small groups need to care about this level: engineers trying to squeeze every last bit of performance out of the processor, and compiler developers.

Multi-processor

Another simple idea for creating parallel hardware is to install more than one chip in a computer system, that is, to replicate the processor completely. It is as if we hired a second joiner to work in the workshop. Such a system is called a multi-processor: any computer system with more than one processor, where the processors are linked together by an interconnection network.

Multi-core processor

A multi-core processor is a special kind of multi-processor. The only difference is that all the processors are on the same chip. Each core can work independently, and the OS perceives each core as a separate processor. There are slight differences between these two approaches in how quickly the processors can work together and how they access memory, but for now we’ll treat them as the same.
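The point that the OS sees each core as a separate processor can be checked from a program. The following is a minimal Python sketch; the helper name `visible_processors` is made up for illustration. Note that `os.cpu_count()` reports logical processors, which on some machines includes hardware threads as well as physical cores.

```python
import os


def visible_processors() -> int:
    """Return how many logical processors the OS reports (hypothetical helper)."""
    # os.cpu_count() returns None when the count cannot be determined,
    # so fall back to 1 in that case.
    return os.cpu_count() or 1


if __name__ == "__main__":
    print(f"The OS reports {visible_processors()} logical processors")
```

The reported number is the same whether those processors sit on one multi-core chip or on several separate chips, which is why the two approaches can be treated alike at this level.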

Another commonly used kind of parallel hardware is the vector processor. It is a specialized processor with a set of vector instructions that operate on large one-dimensional arrays of data in one processor cycle. A typical representative of this type is the GPU.

Taxonomy of parallel computers

One of the most widely used systems for classifying multiprocessor architectures is Flynn’s Taxonomy. It distinguishes four classes of computer architectures based on two independent dimensions: the instruction flow and the data flow.

The first category is Single Instruction Single Data or SISD. SISD processes one instruction and works with one block of data. So, it first processes one instruction, then the second, then the third, and so on, serially. There is no parallelization here, of course, so we will skip this category.

The second category is Multiple Instruction Single Data or MISD. In this approach, we still work with one block of data but simultaneously perform several instructions. Like the previous category, this is also not relevant for concurrent systems, and it’s here just for reference.

The third category is Single Instruction Multiple Data or SIMD. Such processing resources share a control unit across multiple cores, and this design determines their characteristics. The main one is the ability to execute one instruction simultaneously on all available processing resources, so the same instruction can be performed on a massive set of data elements at once. At the same time, the processing resources in SIMD machines are not universal. Their instruction set is very limited, so SIMD systems are usually used to solve specific problems that demand raw computing power more than flexibility and versatility.
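To make the SIMD idea concrete, here is a small conceptual sketch in Python. It only models the shape of the computation (one instruction, many data elements). Plain Python does not run the lanes in the same cycle the way SIMD hardware does. The names `simd_execute` and `brighten` are invented for this illustration.

```python
from typing import Callable, List, Sequence


def simd_execute(instruction: Callable[[int], int], lanes: Sequence[int]) -> List[int]:
    """Model one SIMD step: a single instruction applied to every data lane."""
    # Real SIMD hardware processes all lanes at once; this comprehension
    # only models "same operation, many data elements".
    return [instruction(value) for value in lanes]


def brighten(pixel: int) -> int:
    """Hypothetical instruction: raise a pixel value by 10, capped at 255."""
    return min(pixel + 10, 255)


pixels = [0, 100, 250, 255]
brightened = simd_execute(brighten, pixels)
print(brightened)
```

Every lane receives the same instruction. There is no way for one lane to run `brighten` while another runs something else, which is the limitation the paragraph above describes.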

The fourth category is Multiple Instruction Multiple Data or MIMD. Here, each processing resource has an independent control unit, so it is not limited in the types of instructions it can run and executes different instructions independently on a separate block of data. This category includes architectures with multiple cores, multiple CPUs, or even multiple machines, so different tasks can literally be executed on several different devices simultaneously.
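A rough way to picture MIMD in code is two unrelated tasks, each with its own data, submitted to independent workers. The sketch below uses Python threads purely for portability. In CPython, threads running CPU-bound code do not truly execute at the same time, so a process pool or separate machines would be a closer analogue. The function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple


def count_words(text: str) -> int:
    """First instruction stream: count the words in a string."""
    return len(text.split())


def sum_squares(numbers: List[int]) -> int:
    """Second instruction stream: sum the squares of a list of numbers."""
    return sum(n * n for n in numbers)


def run_mimd() -> Tuple[int, int]:
    """Run two different tasks on two different blocks of data."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        # Each worker gets its own instructions and its own data.
        words = pool.submit(count_words, "different tasks on different data")
        squares = pool.submit(sum_squares, [1, 2, 3, 4])
        return words.result(), squares.result()


if __name__ == "__main__":
    print(run_mimd())
```

Unlike the SIMD case, nothing forces the two workers to run the same operation. Each follows its own control flow.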

MIMD has a wider set of instructions, and individual processing resources are more versatile than in SIMD. That’s why MIMD is the most commonly used architecture in Flynn’s Taxonomy, and you’ll find it in everything from multi-core PCs to distributed clusters.

CPU vs GPU

The CPU and the GPU (Graphics Processing Unit) are very similar: both consist of a huge number of transistors and can process a vast number of instructions per second. But how exactly do these two important components of a computer system differ, and when should you use one or the other?

Standard CPUs are built using the MIMD architecture. A modern CPU is powerful because engineers have implemented a wide variety of instructions in it. And a computer system is capable of completing a task because its CPU is capable of completing that task.

The GPU is a specialized type of processor based on an architecture similar to SIMD and optimized for a very limited set of instructions. The GPU operates at a lower clock speed than the CPU but has a huge number of cores, hundreds or even thousands, that run simultaneously. That means it performs a huge number of simple instructions at incredible speed thanks to massive parallelism.

For example, the Nvidia GTX 1080 graphics card has 2560 cores running at a 1607 MHz clock speed. Thanks to these cores, the GTX 1080 can perform 2560 instructions per clock cycle. If you want to make a picture 1% brighter, the GPU will cope with this without any difficulty. By contrast, an Apple M1 CPU running at 3.2 GHz can execute only 8 instructions per clock cycle.
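As a back-of-the-envelope check on these figures, multiplying instructions per cycle by clock speed gives a theoretical peak rate. The Python sketch below uses only the numbers quoted above. It ignores everything else that affects real performance (memory bandwidth, instruction mix, and so on), and the function name is made up for illustration.

```python
def peak_instructions_per_second(per_cycle: int, clock_hz: int) -> int:
    """Theoretical peak: instructions per cycle times cycles per second."""
    return per_cycle * clock_hz


# Figures quoted in the text: 2560 instructions/cycle at 1607 MHz for the GPU,
# 8 instructions/cycle at 3.2 GHz for the CPU.
gpu_peak = peak_instructions_per_second(2560, 1_607_000_000)
cpu_peak = peak_instructions_per_second(8, 3_200_000_000)
ratio = gpu_peak / cpu_peak

print(f"GPU peak: {gpu_peak:,} instructions/s")
print(f"CPU peak: {cpu_peak:,} instructions/s")
print(f"GPU/CPU ratio: about {ratio:.1f}x")
```

Under these simplified assumptions the GPU comes out roughly 160 times ahead, even though its clock is about half as fast, because the number of cores dominates the product.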

Although individual CPU cores have a higher clock speed and a more extensive instruction set, the sheer number of GPU cores and the massive parallelism they provide more than compensate for the GPU’s lower clock speed and limited instruction set.

CPUs are better suited for complex, linear (sequential) tasks.

GPUs are best suited for repetitive and highly parallel computational tasks such as video and image processing, machine learning, financial simulation, and many other types of scientific computing. You can imagine that operations such as matrix addition and multiplication are easily performed using the GPU because most of these operations in matrix cells are independent of each other, are similar in nature, and therefore can be parallelized.
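The independence of matrix cells can be shown with a small Python sketch. Each output cell of a matrix addition depends only on the two matching input cells, so the cells can be computed in any order, and therefore in parallel, with the same result. The code below runs sequentially in a shuffled order just to make that point. It is not GPU code, and the function names are hypothetical.

```python
import random
from typing import List

Matrix = List[List[int]]


def add_cell(a: Matrix, b: Matrix, row: int, col: int) -> int:
    """Compute one output cell; it reads only the matching input cells."""
    return a[row][col] + b[row][col]


def add_matrices_any_order(a: Matrix, b: Matrix, seed: int = 0) -> Matrix:
    """Add two equally sized matrices, visiting the cells in a shuffled order."""
    rows, cols = len(a), len(a[0])
    cells = [(r, c) for r in range(rows) for c in range(cols)]
    # Shuffle to show that no cell depends on another cell being done first.
    random.Random(seed).shuffle(cells)
    result = [[0] * cols for _ in range(rows)]
    for r, c in cells:
        result[r][c] = add_cell(a, b, r, c)
    return result


a = [[1, 2], [3, 4]]
b = [[10, 20], [30, 40]]
total = add_matrices_any_order(a, b)
print(total)
```

Because the visiting order does not matter, each call to `add_cell` could in principle be handed to a different core, which is the property that makes this kind of work a good fit for a GPU.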

Hardware architectures vary widely, which can affect the portability of programs between different systems. Programs themselves can also speed up by different amounts depending on where they run. For example, many graphics programs run much faster on GPU resources, while ordinary programs with mixed logic are better run on the CPU.

That’s all for this article. Thanks for reading. Check out Grokking Concurrency for more!

Originally published at https://freecontent.manning.com on June 3, 2022.