## **19:10** Vector Multiplication as an IPC Primitive

Since time immemorial computer scientists have pondered what could be the best way for two processes to interact with each other. Is it shared memory? Is it message queues? Is it sockets? Wait no more, dear neighbor, because in this modest article I'm going to present a novel and more promising way. We will see that processes can communicate with one another by using little more than vector instructions!

## Overview of power management

Starting with the Sandy Bridge architecture, Intel's ISA included a new set of instructions called AVX, to operate on larger, 256-bit sized, registers. More recent architectures further extended this functionality with another set, AVX2.

As keeping these wide registers turned on all the time wasn't power-efficient, Skylake and later architectures kept them inactive during the normal scalar code execution. The CPU would start powering on these wider, vector data paths only when the first SIMD instruction got executed.

This process takes time, and while the vector execution units are being turned on, the vector code gets dispatched to  $\mu$ ops that make use of narrower registers and, consequently, execute at roughly half the speed. Also, after the core encounters a vector instruction, the processor will keep the registers active for a while (on the order of milliseconds) after the last SIMD instruction is scheduled to run.

As the core that runs this sort of vector code will require more power to keep the registers active, the Package Control Unit (PCU)—an on-chip microcontroller that manages frequencies and voltages of the processor—will increase that core's voltage with a mechanism that Intel calls "granting a power license."

Within the bureaucratic apparatus that is the processor, a core is granted a different power license depending on the kind of instructions it is executing. For all AVX instructions, and for some simple AVX2 instructions like loads and adds, the core gets to run on the modest LVLO\_TURBO\_LICENSE. For complex AVX2 instructions it gets the regular LVL1\_TURBO\_LICENSE, while the cores lucky enough to run AVX-512 win a premium LVL2\_TURBO\_LICENSE.

RAN 112385 100 **STEPS 100 STEP LEARN** MODE, KEYBOARD PROGRAMMING CAPABILITY. RPN logic • Rollable 4-level stack • 8digit plus 2-digit exponent LED display Scientific notation . Sine, cosine, tangent & inverse trigonometric functions Common & natural logarithms & antilogarithms • Instant automatic calculation of powers and roots . Single-key square root calculations • Single-key Pi entry • Separate storage memory • Square, square root and reciprocal calculations . Change sign & register exchange keys • Includes NiCad batteries. Mfg. by National Semiconductor 1 year warranty + 10 day money back guarantee 100 pg. Application Handbook - \$5.00; AC charger - \$4.95; Carrying case - \$2.95; Stand - \$2.00; Ship & hndl. \$3.75. TO ORDER CALL 213) 559-1044 OR SEND CHECK TO ILDAN Inc. 6020 Washington Blvd 12 Culver City, CA 90230

Also, the core's frequency gets capped by the PCU to a lower value, which is referred as the AVX2 Turbo frequency. For commercial desktop and laptops CPUs, this applies to not just the core running vector code but to all cores in the same processor.

This led me to wonder: what is happening to the wide SIMD units of the other cores during that time? Are they all powered-on all together? If so, could this be used to make our processes have a little chat without bothering the OS with expensive syscalls?

by Lorenzo Benelli

## Latency is key

With this rough idea of the inner workings of the Intel's CPU power management, I wrote a tiny snippet of code that launches two processes with the ability to communicate without any nasty interaction with the OS.

```
#include <immintrin.h>
   #include <stdio.h>
3
   #define TIME SCALE 1.0
  #define BUFS\overline{Z} 0x400
5
7
   void bsleep(uint64_t);
   void send(uint8_t);
9
  void recv(void);
11
  int main() {
     pid t pid;
13
     if ((pid = fork()) == 0) {
15
       recv():
       else if (pid != -1) {
       send('P');
17
       send('o');
19
       send('C');
       bsleep(0x40000000);
21
        kill(pid, 9);
23
     return 0;
   }
25
   void bsleep(uint64_t clk) {
27
     uint64\_t \ beg \ , \ end \ ;
     uint32_t hi0, lo0, hi1, lo1;
29
     asm volatile (
       "cpuid\n\t'
       "rdtscnt"
31
       "mov %% edx , %0\n\t"
33
       "mov %%eax , %1h t"
       : "=r" (hi0), "=r" (lo0)::
"%rax", "%rbx", "%rcx", "%rdx"
35
     );
37
     end = beg = (((uint64_t)hi0 \ll 32) | lo0);
     while (end - beg < clk) {
39
       asm volatile (
          "cpuid\n\t'
          "rdtscnt"
41
          "mov %%edx , %0n t"
          "mov %%eax,
43
                       \%1\n\t "
          "pause\n\t'
          : = r' (hi1), = r' (lo1)::
45
          "%rax", "%rbx", "%rcx", "%rdx"
47
       );
       end = (((uint64 t)hi1 \ll 32) | lo1);
49
     ł
   }
```

| SAM COUPE AND SPECTRUM                                                                                                   | MAGAZINE!         |
|--------------------------------------------------------------------------------------------------------------------------|-------------------|
|                                                                                                                          |                   |
| PROGRAMS,<br>UTILITIES, INFO<br><b>"OUTLE</b> ]                                                                          | AND HELP PAGES,   |
| IDEAS! NEWS,                                                                                                             | SERIOUS SOFTWARE  |
| REVIEWS AND HOMEGROWN SOFTWARE MONTHLY SINCE 1987!                                                                       |                   |
| SPECIAL OFFER! Latest issue £2.50 to newcomers on:-                                                                      |                   |
| +3, DISCIPLE/+D, MICRODRIVE, OPUS                                                                                        | 5, TAPE, SAM DISC |
| +3, DISCIPLE/+D, MICRODRIVE, OPUS, TAPE, SAM DISC<br>CHEZRON SOFTWARE, 605 LOUGHBOROUGH RD., BIRSTALL, LEICESTER LE4 4NJ |                   |

One parameter offered by the code is TIME\_-SCALE, which you can set at your convenience in case your plotting utility doesn't implement horizontal zooming, or if you wish to pin the processes to far away cores.

As we'd like to eventually store some measurements, BUFSZ provides a way to delay the unavoidable write() call, because the longer we can prolong our abstinence from kernel communication, the better.

For each bit to be transmitted, the sender process either executes a *very long* succession of AVX2 multiplications, or enters a busy loop, doing nothing for long enough that the PCU decides to revoke its power license, powering off the vector execution units.

Another process, the receiver, runs a *short* burst of vector instructions, then also sleeps for enough time that the PCU decides to revoke its power license. The receiver process is also keeping track of its execution speed via the rdtsc instruction, periodically dumping it to stdout.

```
void send(uint8 t c) {
  for (int i=0; i<8; i++) {
    uint8_t bit = (c >> i \& 1);
    if (bit) {
     for (uint64 t i=0; i<0x4000*SCALE; i++){
         asm volatile(
          "pushq 0x4000000 \ r \ n"
          "vbroadcastss 0(\% rsp), \% mm \sqrt{r n}"
          "vbroadcastss 0(\%\% rsp), \%\% mm1\r\n"
          "mov 10000, %%ecx\r\n"
          "loop1:\r \n"
          "vmulps \%ymm0, \%ymm1, \%ymm1\r\n"
          "dec \%ecx \setminus r \setminus n"
          "jnz loop1r\n"
          "popq \%rcx \setminus r \setminus n"
          :::
         ):
         bsleep(0x20000);
     }
      else {
    }
     bsleep(0x8db6db6d * SCALE);
     fprintf(stderr, "tick %d\n", bit);
  }
}
```

 $\mathbf{2}$ 

4

6

8

10

12

14

16

18

20

22

24

```
1
   void recv(void) {
     uint64_t beg, end, i = 0;
uint32_t hi0, lo0, hi1, lo1;
3
     static uint64 t time[BUFSZ];
 \mathbf{5}
     static char buf[0x10000], *it = buf;
 7
     while (1) {
       asm volatile (
9
          "cpuid\n\t"
          "rdtscnt"
          "mov %%edx , %0n t"
11
          "mov %%eax, %1\n\t"
          : "=r" (hi0), "=r" (lo0)::
13
            "%rax", "%rbx", "%rcx", "%rdx"
15
       );
       asm volatile(
          "pushq 0x4000000\r\n"
17
          'vbroadcastss 0(\% rsp), \% mm 0 r n"
          "vbroadcastss 0(\% rsp), \% mn1 r n"
19
          "mov 10000, \frac{10000}{r n}"
          "loop: \ r \ n"
21
          "vmulps %%ymm0, %%ymm1, %%ymm1r n"
23
          "dec \%ecx \ r \ n"
          "jnz loop\r\n"
25
          "popq %%rcx\r\n"
27
       );
       asm volatile (
          "cpuid\n\t"
29
           "rdtsc\n\t"
          31
33
            "=r" (hi1), "=r" (lo1)::
            "%rax", "%rbx", "%rcx", "%rdx"
35
       );
       37
       \operatorname{time}[i++] = \operatorname{end} - \operatorname{beg};
39
       bsleep(0x1000000);
41
       if (i == BUFSZ) {
43
          i = 0;
          for (uint64_t i = 0; i < 1024; i++) {
            it += sprintf(it, "\%lu \n", time[i]);
45
47
          printf("%s", buf);
          it = buf;
49
     }
51
   }
```

Employees must wash hands before returning to libc



If the receiver process is running during a quiescent period of the sender process, meaning that the vector registers are powered down, it will run at about half the speed for at least 150K clock cycles, which is roughly the warm-up period on Coffee Lake. Otherwise, it will dash forth at full speed. Repeating this enough times, the receiver can gather sufficient evidence to know what bit was being sent to him by his neighboring process.

On page 58 you can see the data plots taken from some Kaby, Coffee Lake, and Sky Lake systems, and a reference of the inverted ASCII signal, where the most significant bits are sent last.

## The End

What is actually happening inside the processor is not completely clear to me. Perhaps the vector units are not kept active *all* the time while executing AVX code. Since the PCU on mixed scalar/vector workloads has already lowered the frequency of all the cores, it has more room to adjust their voltages quickly, and it is consequently able to power the wide paths faster, ultimately with similar effects. Let me know if you manage to figure this out, neighbors!

Finally, a few words about why I think this is a better way for processes to communicate.

First, the processes get to avoid those pesky **syscall** instructions which make the software we write daily completely non-portable.

Second, although not as fast as other IPC implementations, this one makes communication a CPUbound problem instead of an I/O-bound one, which, as everybody knows, is a much nicer problem to have.

Third, two processes in completely separate VMs can now communicate, without the extra long and boring configuration jobs that sysadmins have to do in order to get the infrastructure to work.

This is why, neighbors, you should promptly experiment with this method, as well as try to find further novel and nifty ways to use our processors. Maybe we will one day be able to multiply two vectors with only syscall instructions!



Kaby Lake Warmup Time

Reference Message (POC)