Diego Bravo Estrada

This is a very-very-very basic introduction to programming in assembly for 64-bit Linux based in AMD/Intel cpus. We are using Debian 12, but should be usable with any modern Linux system.

We follow several sources, like:

  1. https://en.wikipedia.org/wiki/X86_instruction_listings

  2. Seyfarth, R. (2011). Introduction to 64 Bit Intel Assembly Language Programming: Getting the most out of your computer. CreateSpace.

  3. Kusswurm, D. (2014). Modern X86 Assembly Language Programming 32-bit, 64-bit. SSE, and AVX, Apress.

  4. Intel® 64 and IA-32 Architectures Software Developer’s Manual (edition of December 2023)

  5. Stackoverflow

Intro

We’ll use the popular nasm and (for further testing) yasm, which are recognized assemblers for x86 processors:

sudo apt install nasm
sudo apt install yasm

We recommend checking the installation of the binutils, build-essential and make packages.

For more information, see https://www.nasm.us/ and https://yasm.tortall.net/

All the examples should be workable with both assemblers.

On 64-bit architecture:

"Intel found itself in the position of adopting the ISA which AMD had created as an extension to Intel’s own x86 processor line. Intel’s name for this instruction set has changed several times. […​] within weeks they began referring to it as IA-32e (for IA-32 extensions) and in March 2004 unveiled the "official" name EM64T (Extended Memory 64 Technology). In late 2006 Intel began instead using the name Intel 64 for its implementation, paralleling AMD’s use of the name AMD64."

— https://en.wikipedia.org/wiki/X86-64

The Tutorial

From this point, we’ll implement several small programs in order to explain the relevant concepts.

First program

We’ll replicate the following C program:

#include <stdio.h>
#include <stdlib.h>

int main() {
	printf("%#x\n", 0x12345678);
	exit(0);
}

Which may be compile with:

diego@dataone:~/devel/ASM$ gcc -Wall -o model model.c
diego@dataone:~/devel/ASM$ ./model
12345678

We assume the reader understands this program. Now the assembly version:

; test1.asm
          default rel
          extern printf            ; extern symbols

          section .rodata
format:   db "%#x", 10, 0          ; C 0-terminated string: "%#x\n" 

          section .text
          global _start
_start:
          sub   rsp, 8             ; re-align the stack to 16 before calling another function

; call printf
          mov   esi, 0x12345678    ; "%x" takes a 32-bit unsigned int
          lea   rdi, [format]
          xor   eax, eax           ; AL=0  no FP args in XMM regs
          call  printf wrt ..plt

; call exit
          mov   rax, 60
          xor   rdi, rdi
          syscall

Build with:

nasm -f elf64 -g -l test1.lst test1.asm
ld --dynamic-linker=/lib64/ld-linux-x86-64.so.2  -o test1 test1.o -lc

or

yasm -f elf64 -g dwarf2 -l test1.lst test1.asm
ld --dynamic-linker=/lib64/ld-linux-x86-64.so.2  -o test1 test1.o -lc

Execute with:

./test1
0x12345678

Before explaining the code, let’s understand the build steps.

Assembling

nasm or yasm got our assembly listing test1.asm which in turn will generate an "object file" named test1.o. This is the assembler’s main work.

We also ask for an "hexadecimal listing" of the object (with -l test1.lst), which provides a detailed view of the corresponding machine code opcodes.

The -f elf64 option instructs the assembler to generate an object file following the ELF standard (used in Linux and some other unixes) using the "64-bit model", and -g asks for the inclusion of debugging information (for example, to be used by gdb.)

Note
Dwarf stands for "Debug With Arbitrary Record Format".

Linking

The "link editor" ld is used to create the executable. As we’ll create an executable "dynamically linked" (i.e. will load required libraries at runtime) then we need to provide the path for the "dynamic linker/loader" which (sadly) is not provided by ld so we add the --dynamic-linker option. The file /lib64/ld-linux-x86-64.so.2 ls really a symbolic link to the real library. For example, in out test system it is:

diego@dataone:~/devel/ASM$ ls -l /lib64/ld-linux-x86-64.so.2
lrwxrwxrwx 1 root root 32 Dec  7 11:38 /lib64/ld-linux-x86-64.so.2 ->
	/lib/x86_64-linux-gnu/ld-2.27.so
Note
At least in the binutils world (where Linux ld lives), it is not expected the direct usage of ld by the user; instead, the C compiler gcc should be employed, even if the source is not in the C language. See more information in https://bugzilla.redhat.com/show_bug.cgi?id=868662 .

Finally, we instruct ld to set a dynamic link for the executable with the C standard library (libc.so.nnn) with the option -lc. More libraries may be added with the syntax -lXXX corresponding to libraries named libXXX.so.nnn.

The linked libraries may be inspected with ldd:

diego@dataone:~/devel/ASM$ ldd test1
	linux-vdso.so.1 (0x00007ffc915f7000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f720d284000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f720d67c000)
Note
The linux-vdso is a shared library provided by the kernel allowing the access to some services without the rather slow "jump" to kernel mode of a system call.

Entry point

When a process is started, the "entry point" is specified by a symbol named _start. That is the reason our program has such symbol in the place where we want our code to start its execution:

_start:
    sub   rsp, 8

    ; Call printf.

The _start symbol must be marked as global, so the operating system may find it when the process is executed:

global _start

The instructions are stored in a memory "section" (a region in RAM dynamically assigned at runtime) named .text. There is another section for read-only data suitable named .rodata, where we put the arguments to the printf() call:

section .rodata
format:   db "%#x", 10, 0   ; C 0-terminated string: "%#x\n"

The db assembly directive instructs the assembler to consider the arguments as plain data which must be stored in the object file (unlike the assembly cpu instructions below.) The db (roughly "define bytes") directive accepts several comma-separated arguments including strings and numbers as in the example. Note that the 10 corresponds to an ASCII new-line (or \n), and the last 0 is the mandatory end mark required by C language strings. The format is simply a label to reference this data from the code below.

Registers

As the reader probably knows, the cpu contains a number of "registers" which may be considered some kind of very fast integer predefined "variable" in the assembly language. The mov family of instructions are used to put some value into the registers, and to extract the value contained in the registers. A syntax like mov a, b means that the value represented by b is assigned into a (we’ll see some examples below.) This order for the assignations is known as the "Intel syntax". For more information, see: http://www.ibiblio.org/gferg/ldp/GCC-Inline-Assembly-HOWTO.html#s3

Process termination: System Calls

The first thing we’ll review is how to terminate a running process. To that effect we must use the exit() system call, which is provided by the Linux kernel (and every POSIX compliant operating system.)

The system calls are identified by a numeric code which may be located in a Linux header file (albeit its exact location may vary); for example, our test system has the following contents inside /usr/include/x86_64-linux-gnu/asm/unistd_64.h:

...
#define __NR_execve 59
#define __NR_exit 60
#define __NR_wait4 61
...

Following https://blog.rchapman.org/posts/Linux_System_Call_Table_for_x86_64/ we’ll discover that the desired system call must be identified by such numeric code put into the rax cpu register.

So we conclude that the code for the exit() system call is 60, and we must put this value in rax. Also, this system call does require a single numeric parameter (which will be the "process return value" for the operating system); such value must be sent in the rdi register (xor rdi, rdi is an usual abbreviated way to call mov rdi, 0.) So in order to call exit(0);, we issue the assembly instructions:

    mov   rax, 60
    xor   rdi, rdi
    syscall

Calling library functions

A couple of citations:

An application binary interface (ABI) is an interface between two binary program modules. Often, one of these modules is a library or operating system facility, and the other is a program that is being run by a user.

An ABI defines how data structures or computational routines are accessed in machine code, which is a low-level, hardware-dependent format.

— https://en.wikipedia.org/wiki/Application_binary_interface

Also:

Calling convention is an implementation-level (low-level) scheme for how […​] functions receive parameters from their caller and how they return a result. […​] These transfers typically done via certain registers or within a stack frame on the call stack.

Calling conventions are usually considered part of the application binary interface (ABI).

— https://en.wikipedia.org/wiki/Calling_convention

We call printf which at runtime is a global symbol external to our code (is located in the C standard library.) Such external symbols must be signaled with:

extern printf

Now, for the call to printf we prepare the arguments: a formatting argument and a numeric one with the registers rdi and rsi. Also, al (via eax) is zeroed, since it is used to indicate the number of floating point arguments present in the call (here, zero.)

In general, the AMD64 ABI uses the sequence of registers rdi, rsi, rdx, rcx, r8 and r9 for integer or pointer-type parameters: https://software.intel.com/sites/default/files/article/402129/mpx-linux64-abi.pdf for a friendlier set of related slides: https://bart.disi.unige.it/zxgio/phd-course-2017/x86intro_slides.pdf

Note
The floating point arguments are expected to be stored in (rather heavy) "vector registers". The zeroed eax is an optimization advice for integer-only variadic calls.

Note that the first argument to printf is a "pointer" to the memory containing the formatting string. We could try something like

mov rdi, format

but that will introduce some problems with PIC (see below.) So we prefer the "load effective address" instruction which does (almost) the same:

lea rdi, [format]

Now, the actual call:

call  printf wrt ..plt
Note
In other assemblers this is usually written as call printf@PLT.

PLT means "Procedure Linkage Table"; from https://eli.thegreenplace.net/2011/11/03/position-independent-code-pic-in-shared-libraries "The PLT is part of the executable text section, consisting of a set of entries (one for each external function the shared library calls). Each PLT entry is a short chunk of executable code. Instead of calling the function directly, the code calls an entry in the PLT, which then takes care to call the actual function."

So this is used by "position independent code" to make calls to shared library’s functions whose real address is not known at process startup. This resolution is done in a lazy way: only if the function is actually called, then its address is resolved.

Note
The System V ABI requires the functions to preserve the registers rbx, rbp, and r12-r15.

Position Independent Code (PIC)

Refers to compiled code whose internal addresses (for example, for jumps or for loading some data) is only known at runtime. The usual case are shared libraries, which are loaded in demand, and stored in memory at effectively random addresses.

As of 2020, recent versions of Linux also provide executable programs composed by PIC; this is done for security, with the intention of making difficult some attacks which depend on previous knowledge of (non-random) memory addresses.

So we call with the help of a PLT.

Also, references to memory must be relative. So instead of

lea rdi, [format]

we should use something like

lea rdi, [rel format]

In yasm another available syntax is:

lea rdi, [format wrt rip]

We avoid this with the assembler directive:

default rel

then the rip index is added automatically to the memory references.

Note
In the following examples we’ll always use default rel.

The stack pointer

As an ABI requirement, the stack pointer address must be multiple of 16 at the start of the execution of the called function. Since the call instruction consumes 8 bytes of the stack, we start our function adding another 8 bytes:

    sub   rsp, 8

Note that the stack is filled in descending order, so when subtracting the address we are effectively filling it with the specified bytes.

Before returning from a function, the stack should be "corrected" to its original value. Here we don’t do that since the exit call will stop the process execution and we no longer care about additional (caller) code.

Note
the stack alignment of 16 bytes is not required at all for code which does not call external functions.
Note
even if no function calls are to be issued, the stack contents should be aligned to 8 bytes to avoid further unexpected failures: code which tries to push quad-words will fail if the stack is not 8-byte aligned.

Using Gcc

This is the preferred method for building the executable. Also, it automatically links with the C library.

; test2.asm
          default rel
          extern printf

          section .rodata
format:   db "%#x", 10, 0          ; C 0-terminated string: "%#x\n" 

          section .text
          global main
main:
          sub   rsp, 8             ; re-align the stack to 16 before calling another function

; Call printf.
          mov   esi, 0x12345678    ; "%x" takes a 32-bit unsigned int
          lea   rdi, [format]
          xor   eax, eax           ; AL=0  no FP args in XMM regs
          call  printf wrt ..plt

; Return from main.
          xor   eax, eax
          add   rsp, 8
          ret

          section .note.GNU-stack noalloc noexec nowrite progbits

The C compiler adds the _start initialization code and looks for the popular main global symbol used by every C program. Here we implement a main function just providing such symbol.

Assembly from Gcc

The -S option generates assembly code for a C program. For example:

gcc -S model.c

generates the following listing in my system:

	.file	"model.c"
	.text
	.section	.rodata
.LC0:
	.string	"%#x\n"
	.text
	.globl	main
	.type	main, @function
main:
.LFB6:
	.cfi_startproc
	pushq	%rbp
	.cfi_def_cfa_offset 16
	.cfi_offset 6, -16
	movq	%rsp, %rbp
	.cfi_def_cfa_register 6
	movl	$305419896, %esi
	leaq	.LC0(%rip), %rax
	movq	%rax, %rdi
	movl	$0, %eax
	call	printf@PLT
	movl	$0, %edi
	call	exit@PLT
	.cfi_endproc
.LFE6:
	.size	main, .-main
	.ident	"GCC: (Debian 12.2.0-14) 12.2.0"
	.section	.note.GNU-stack,"",@progbits

It is very similar to the previous test*.asm listings, but:

  1. The syntax is for the Gnu assembler gas, which uses the AT&T instruction syntax, where the assignments are in reversed order: mov a, b moves a into b.

  2. The register names are prefixed by a percentage sign

This is evident in the instruction

leaq	.LC0(%rip), %rdi

which is analog to the test3.asm (the .LC0 label corresponds to our format label):

lea   rdi, [rel format]

Another useful tool for analyzing already compiled code is objdump. For example, to disassemble test1.o simply issue the command objdump -d -M intel test1.o. Also works in the final executable test1.

Since we are generating ELF-format object and executable files, the readelf command may be helpful too. For example. readelf -a test1.o provides the file’s ELF-related information.

More about the PLT

To further illustrate the point, we suggest to the reader to create a small modification from our test2 into an equivalent version with "absolute addressing" (without default rel):

diego@dataone:~/devel/ASM$ diff test2.asm test3.asm
1d0
<           default rel
14c13
<           lea   rdi, [format]
---
>           lea   rdi, [rel format]

Note that we may force the generation of a (less secure) non PIC executable, so we can use absolute addressing:

diego@dataone:~/devel/ASM$ diff test2.asm test4.asm
1d0
< default rel
14c13
<     lea   rdi, [format]
---
>     mov   rdi, format

We must build with -no-pie (no position independent executable):

nasm -f elf64 -g -l test4.lst test4.asm
# also: yasm -f elf64 -g dwarf2 -l test4.lst test4.asm
gcc -no-pie -o test4 test4.o
diego@dataone:~/devel/ASM$ ./test4
0x12345678

Note that the executables with PIC and without PIC may be queried with file:

diego@dataone:~/devel/ASM$ file test3 test4
test3: ELF 64-bit LSB pie executable, x86-64, version 1 (SYSV), dynamically linked, [...] with debug_info, not stripped
test4: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, [...] with debug_info, not stripped

The addressing (RIP/PIC) here presented isusing the default "small" code model of the x64 architecture. For more information check this article: https://eli.thegreenplace.net/2012/01/03/understanding-the-x64-code-models

Returning from a function

Since we are running the main function, to terminate the process it is enough with returning from it (of course, calling exit is allowed as before.)

Following the ABI specification, the (integer) return value goes in the eax register, here zeroed with xor:

    xor   eax, eax
    add   rsp, 8
    ret

Note that the stack pointer is returned to its starting value by adding 8 bytes before returning to the caller.

Note
The section .note.GNU-stack added to the listings avoids some warnings which are not interesting at this time. For more information see for example: https://www.redhat.com/en/blog/linkers-warnings-about-executable-stacks-and-segments

Digression: 32-bit ABI

There are LOTS of code built using the 32-bit ABI. The calling conventions are distinct; for example, for system calls instead of the syscall indexed by rax with rdi+rsi+rdx…​ as parameters, the "interruption 0x80" is used indexed by eax with the ebx+ecx+…​ as parameters, as shown below:

; exit32.asm
          segment .text
          global _start
_start:
          mov eax , 1   ; 1 is the exit sys call number
          mov ebx , 5   ; the status value to return
          int 0x80      ; execute a system call

Compile with:

nasm -f elf64 -g -l exit32.lst exit32.asm
# also: yasm -f elf64 -g dwarf2 -l exit32.lst exit32.asm
ld -o exit32 exit32.o

This code usually works in 64-bits Linux since legacy support for 32 bits is very common (but it is expected to decline with the time.) Note that no link to the C library is made, since it is not called (unlike previous examples which use printf.)

Two complement

For signed integers the x86 family cpus use the "two’s complement" representation: for positive values the MSB is zero. Negative numbers have MSB=1, their the absolute value bit representation is inverted (complemented) and a unit is added.

This representation allows for adding and subtracting following the usual arithmetic rules (in base 2) if the two-complement meaning is consistently used to interpret the operations result.

Floating point

Three flavors: 32 bits, 64 bits, and 80 bits. Approximated precisions of 7, 16 and 19 digits.

Memory

The mapping from logic memory addresses (of a process) o the physical addresses is made using pages. The X84-64 CPU maps the memory in two page sizes: 4kB and 2MB. In modern CPUs also 1 GB pages are supported.

The Linux process' memory comprehends the regions: text, data, heap and stack. The logical addresses are 47-bit wide.

In the initial addresses there is the .text segment which does not require to grow. Next the initialized data segment .data. On top there is the .bss segment (block started by symbol) which is statically assigned data (initially zeroes) and does not exist at all in the executable file.

The heap contains dynamically allocated data. The last is the stack, restricted by to kernel to (usually) 16 MB.

The heap may also map shared libraries and memory mapped files.

Conditional jumps

The following example shows how to print this:

diego@dataone:~/devel/ASM$ ./triangle
*
**
***
****
*****
******
*******
********
*********

Now, the program:

; triangle.asm: adapted from https://cs.lmu.edu/~ray/notes/nasmtutorial/
          default rel
          global    main
          section   .text
main:
          lea       rdx, [output]           ; rdx holds address of next byte to write
          mov       r8, 1                   ; initial line length
          mov       r9, 0                   ; number of stars written on line so far
line:
          mov       byte [rdx], '*'         ; write single star
          inc       rdx                     ; advance pointer to next cell to write
          inc       r9                      ; "count" number so far on line
          cmp       r9, r8                  ; did we reach the number of stars for this line?
          jne       line                    ; not yet, keep writing on this line
lineDone:
          mov       byte [rdx], 10          ; write a new line char
          inc       rdx                     ; and move pointer to where next char goes
          inc       r8                      ; next line will be one char longer
          mov       r9, 0                   ; reset count of stars written on this line
          cmp       r8, maxlines            ; wait, did we already finish the last line?
          jle       line                    ; if not, begin writing this line
;         jng       line
done:
          mov       rax, 1                  ; system call for write
          mov       rdi, 1                  ; file handle 1 is stdout
          lea       rsi, [output]           ; address of string to output
          mov       rdx, dataSize           ; number of bytes
          syscall                           ; invoke operating system to do the write
          xor       rax, rax
          ret

          section .note.GNU-stack noalloc noexec nowrite progbits

          section   .bss
maxlines: equ       9
dataSize: equ       (maxlines+1)*(maxlines+2)/2-1
output:   resb      dataSize

Some macro comments:

  1. We use the equ directive to define two constants: the first one sets the label maxline to the constant 9. That’s it.

  2. The maxline: equ may be located at any place in the file, since its workings are totally internal to the assembly phase; i.e. it is not directly converted into machine code

  3. The second equ defines a constant via a constant expression; here it is used to calculate the required buffer to accumulate all the asterisks to be printed with a single call to write.

  4. Another way would have been to define a single-line buffer and make several write calls

  5. The resb directive reserves "uninitialized" data blocks; here its size is the (just calculated) constant dataSize; there blocks should be inside the .bss section

About the flow:

  1. The rdx register is used as a pointer to write into the buffer; the instruction mov byte [rdx], ? is analogous to the C version *rdx=? where ? is a character constant (asterisk) or the ASCII code for newline (10.)

  2. The cmp r9, r8 compares those registers. If not equal then jump back to line to get more asterisks. jne stands for jump-if-not-equal.

  3. After the asterisks for a line, add a newline and compare r8 (the number of asterisks in a line AND the number of lines to print) with maxlines; here we must jump back to print the next line if r8⇐maxlines so jle stands for jump-if-less-or-equal

  4. The last comparison is logically the same as !r8>maxlines, so the jle synonym instruction jng stands for jump-if-not-greater.

Externally linked function

The following file defines a function with a global symbol xdump:

; xdump.asm - dump hex bytes to stdout
; arguments:
;
; rdi: pointer to buffer
; esi: number of bytes
;
          default rel
          extern printf

          section .text
          global xdump
xdump:
          push   rbp
          mov    rbp, rsp
          sub    rsp, 32

; between calls to printf we need to save the buffer
; pointer, the offset and the total count
buffer:   equ 0
total:    equ 8
xoffset:  equ 16

; save the parameters rdi and esi in variables
          mov   [rsp+buffer], rdi
          mov   [rsp+total], rsi
; offset starts in zero
          xor   r8, r8
          mov   [rsp+xoffset], r8

; for the comparison offset->r8 total->rsi
nextbyte: cmp   r8, rsi
          je finalize

; Call printf.
          lea   rdi, [format]
          mov   rsi, [rsp+buffer]
          add   rsi, r8
          mov   al,  [rsi]
          mov   rsi, rax
          and   rsi, 0xff
          xor   eax, eax
          call  printf wrt ..plt

; increment offset
          mov   r8, [rsp+xoffset]
          inc   r8
          mov   [rsp+xoffset], r8

; get the total
          mov   rsi, [rsp+total]
          jmp   nextbyte

finalize:
          lea   rdi, [newline]
          xor   eax, eax
          call  printf wrt ..plt
          xor   eax, eax
          leave
          ret

          section .note.GNU-stack noalloc noexec nowrite progbits

          section .rodata
format:   db "%02x", 0              ; C 0-terminated string
newline:  db 10, 0                ; 0-terminated new line

This function does print the provided buffer contents one byte at a time using printf. The buffer size must be provided too.

Here we illustrate the preferred way to deal with the stack:

          push   rbp
          mov    rbp, rsp

These instructions save the rbp register in the stack (which effectively uses 8 bytes so is equivalent to the subtraction to rsp we used before, but this way is conventionally used by the debugger to navigate in the stack frames.

The next instruction "reserves" some space in the stack for local variables:

          sub    rsp, 32

We must use multiples of 16 in order to preserve the stack alignment. Here we’ll use just 24 bytes in order to store the buffer pointer, the buffer size and an offset counter. This information will go to and from some registers and need to be saved since we’ll call printf for each buffer byte.

With this arrangement, we’ll put our "variables" at positions rsp, rsp+8 and rsp+16. To easy this scheme a bit, three constants are defined:

buffer:   equ 0
total:    equ 8
offset:   equ 16

So we may reference the variables as rsp+buffer, etc. and their contents as [rsp+buffer], etc.

The buffer address is stored in rsp+buffer so it is extracted as [rsp+buffer] into rsi; this address is then increased by the current offset (we use r8 for that) and then the content of the pointed byte is stored in al (mov al, [rsi].)

This is sent to rsi for ABI requirements and the higher bits are cleared with an and instruction.

We call this with this test program:

#include <string.h>

extern int xdump(const char *ptr, int sz);

int main() {
	const char *data = "A1 *** @@@ !!!";
	xdump(data, strlen(data));
	return 0;
}

The compilation steps and test:

nasm -f elf64 -g -l xdump.lst xdump.asm
# also: yasm -f elf64 -g dwarf2 -l xdump.lst xdump.asm
gcc -c  xdump1.c
gcc -o  xdump1  xdump1.o xdump.o
diego@dataone:~/devel/ASM$ ./xdump1
4131202a2a2a2040404020212121

Addressing

The following program shows several memory/register transfer sizes:

;
; sizes.asm
;
; what happens when we move data between registers, memory and code
;
          default rel
          extern xdump, printf

          section .text
          global main

main:
          push   rbp
          mov    rbp, rsp

%macro    title 1
          lea   rdi, [%1]
          xor   eax, eax
          call  printf wrt ..plt
          call   justones                     ; fill 111111111....
%endmacro

                                              ; MEMORY <- value
                                              ; writing in memory
                                              ; in immediate mode
                                              ; must set write size
          title  tv2mb
          mov    byte [testbuf1], 0x34
          mov    byte [testbuf1+1], 0xa7
          mov    byte [testbuf2], 0x35
;         mov    [testbuf1], 0x34             error!
;         mov    byte [testbuf1], 0x3412      warning!
;         mov    word [testbuf1], 0x341264    warning!
          call   prn
          title  tv2mw
          mov    word [testbuf1], 0x53
          mov    word [testbuf2+1], 0x5556
          mov    word [testbuf2+2], 0x5758
;         mov    word [testbuf2+2], 0x575859  warning!
          call   prn
          title  tv2md
          mov    dword [testbuf1], 0xab
          mov    dword [testbuf2+1], 0xbe444545
          mov    dword [testbuf2], 0xbeaa5566
          call   prn
          title  tv2mq
          mov    qword [testbuf1], 0xbeaa
          mov    qword [testbuf2], 0x4eaa5566
;         mov    qword [testbuf2], 0xbeaa5566         warning!
          mov    qword [testbuf2+8], 0x36373839
;         mov    qword [testbuf2], 0xbcbdbebfa0a1a2a3 warning!
          call   prn

                                              ; MEMORY <- (PARTIAL)REG
                                              ; from register to memory
                                              ; must set write size
          title  tr2m
          mov    rax, 0xbcbdbebfa0a1a2a3
          mov    qword [testbuf1], rax
          mov    rax, 0x252627
          mov    qword [testbuf1+8], rax
          mov    rbx, 0x4444444444444444
          mov    qword [testbuf2], rbx
          mov    rbx, 0x123456
          mov    qword [testbuf2+8], rbx
          call   prn
          title  tr2ms
          mov    rax, 0x445566778899aabb      ; saving subregisters
;         mov    qword [testbuf1], rax        ok
;         mov    dword [testbuf1], rax        error!
;         mov    qword [testbuf1], eax        error!
          mov    dword [testbuf1], eax
;         mov    word [testbuf1], eax         error!
;         mov    qword [testbuf1+8], ax       error!
;         mov    dword [testbuf1+8], ax       error!
          mov    word [testbuf1+8], ax
;         mov    qword [testbuf2], ah         error!
;         mov    dword [testbuf2], ah         error!
;         mov    word [testbuf2], ah          error!
          mov    byte [testbuf2], ah
          mov    [testbuf2+1], ah             ; exceptions: byte assumed
          mov    [testbuf2+2], al
          call   prn

                                              ; (PARTIAL)REG <- MEMORY
                                              ; partial memory loads
          title  tm2r
          mov    rax, 0x445566778899aabb      ; fill a pattern in rax
          mov    qword [testbuf1], rax        ; store in memory
          mov    rax, 0xcccccccccccccccc      ; fill another pattern in rax
          mov    rax, [testbuf1]              ; load from memory
          mov    qword [testbuf1+8], rax      ; store in memory
          mov    rax, 0xcccccccccccccccc      ; fill another pattern in rax
          mov    eax, [testbuf1]              ; partial load from memory -> clears the rest of rax!
          mov    qword [testbuf2], rax        ; store in memory -> see zeros in high bits!
          mov    rax, 0xcccccccccccccccc      ; fill another pattern in rax
          mov    ax, [testbuf1]               ; partial load from memory
          mov    qword [testbuf2+8], rax      ; store in memory -> pattern remains in high bits!
          call   prn
                                              ; al/ah tests
                                              ; memory loads
          title  tm2rs
          mov    rax, 0x445566778899aabb      ; fill a pattern in rax
          mov    qword [testbuf1], rax        ; store in memory
          mov    rax, 0xcccccccccccccccc      ; fill another pattern in rax
          mov    al, [testbuf1]               ; partial load from memory
          mov    qword [testbuf1+8], rax      ; store in memory -> pattern remains in high bits!
          mov    rax, 0xcccccccccccccccc      ; fill another pattern in rax
          mov    ah, [testbuf1]               ; partial load from memory
          mov    qword [testbuf2], rax        ; store in memory -> pattern remains in high bits!
          call   prn

                                              ; (PARTIAL)REG <- value
                                              ; immediate mode load
          title  tv2r
          mov    rax, 0x445566778899aabb      ; fill a pattern in rax
          mov    eax, 0xdddddddd              ; fill a pattern in eax
          mov    qword [testbuf1], rax        ; store in memory -> dddddddd00000000
          mov    rax, 0x445566778899aabb      ; fill a pattern in rax
          mov    ax, 0xeeee                   ; fill a pattern in ax
          mov    qword [testbuf1+8], rax      ; store in memory -> eeee998877665544
          mov    rax, 0x445566778899aabb      ; fill a pattern in rax
          mov    ah, 0xff                     ; fill a pattern in ah
          mov    qword [testbuf2], rax        ; store in memory -> bbff998877665544
          mov    rax, 0x445566778899aabb      ; fill a pattern in rax
          mov    al, 0xff                     ; fill a pattern in al
          mov    qword [testbuf2+8], rax      ; store in memory -> ffaa998877665544
          call   prn

                                              ; (PARTIAL)REG <- (PARTIAL)REG
                                              ; register copy
          title  tr2r
          mov    rax, 0x445566778899aabb      ; fill a pattern in rax
          mov    rbx, 0xeeeeeeeeeeeeeeee      ; and in rbx
          mov    eax, ebx                     ; partial copy
          mov    qword [testbuf1], rax        ; store in memory -> dddddddd00000000
          mov    rax, 0x445566778899aabb      ; fill a pattern in rax
          mov    rbx, 0xeeeeeeeeeeeeeeee      ; and in rbx
          mov    ax, bx                       ; partial copy
          mov    qword [testbuf1+8], rax      ; store in memory -> eeee998877665544
          mov    rax, 0x445566778899aabb      ; fill a pattern in rax
          mov    rbx, 0xeeeeeeeeeeeeeeee      ; and in rbx
          mov    ah, bh                       ; partial copy
          mov    qword [testbuf2], rax        ; store in memory -> bbff998877665544
          mov    rax, 0x445566778899aabb      ; fill a pattern in rax
          mov    rbx, 0xeeeeeeeeeeeeeeee      ; and in rbx
          mov    al, bl                       ; partial copy
          mov    qword [testbuf2+8], rax      ; store in memory -> ffaa998877665544
          call   prn

          xor    rax, rax                     ; exit
          leave
          ret

justones:
                                              ; fill with 0x12
          xor    esi, esi
          lea    rdi, [testbuf1]
nxt:
          mov    byte [rdi], 0x11
          inc    rdi
          inc    esi
          cmp    esi, 32
          jne    nxt
          ret

prn:
                                              ; dump 48 bytes
          push   rbp
          mov    rbp, rsp

          lea   rdi, [format]
          xor   eax, eax
          call  printf wrt ..plt

          lea    rdi, [testbuf1]
          mov    rsi, 32
          call   xdump wrt ..plt
          leave
          ret

          section .note.GNU-stack noalloc noexec nowrite progbits

          section .bss
testbuf1: resb  16
testbuf2: resb  16

          section .rodata
format:   db "000102030405060708090A0B0C0D0E0F"
          db "000102030405060708090A0B0C0D0E0F", 10, 0
tv2mb:    db "MEMORY byte <- value", 10, 0
tv2mw:    db "MEMORY word <- value", 10, 0
tv2md:    db "MEMORY dword <- value", 10, 0
tv2mq:    db "MEMORY qword <- value", 10, 0
tr2m:     db "MEMORY <- REG", 10, 0
tr2ms:    db "MEMORY <- PARTIAL REG", 10, 0
tm2r:     db "REG <- MEMORY", 10, 0
tm2rs:    db "PARTIAL REG <- MEMORY", 10, 0
tv2r:     db "(PARTIAL) REG <- value", 10, 0
tr2r:     db "PARTIAL REG <- PARTIAL REG", 10, 0

Note the use of the %macro…​%endmacro assembler directives used to define the title macro which receives one argument (a message to be printed.)

The commented warning for the line 49 is subtle: There is no available mov mode to transfer a 64-bit value (qword) directly to memory (without the intervention of some register.) Thus the value 0xbeaa5566 is interpreted as a 64 bit one which must be converted to a signed number of 32 bits, but this requires to be in the range -0x80000000 …​ +0x7FFFFFFF, hence the warning.

The results:

diego@dataone:~/devel/ASM$ ./sizes |
	sed 's/\(.\{16\}\)\(.\{16\}\)\(.\{16\}\)\(.*\)/\1|\2|\3|\4/'
MEMORY byte <- value
0001020304050607|08090A0B0C0D0E0F|0001020304050607|08090A0B0C0D0E0F
34a7111111111111|1111111111111111|3511111111111111|1111111111111111
MEMORY word <- value
0001020304050607|08090A0B0C0D0E0F|0001020304050607|08090A0B0C0D0E0F
5300111111111111|1111111111111111|1156585711111111|1111111111111111
MEMORY dword <- value
0001020304050607|08090A0B0C0D0E0F|0001020304050607|08090A0B0C0D0E0F
ab00000011111111|1111111111111111|6655aabebe111111|1111111111111111
MEMORY qword <- value
0001020304050607|08090A0B0C0D0E0F|0001020304050607|08090A0B0C0D0E0F
aabe000000000000|1111111111111111|6655aa4e00000000|3938373600000000
MEMORY <- REG
0001020304050607|08090A0B0C0D0E0F|0001020304050607|08090A0B0C0D0E0F
a3a2a1a0bfbebdbc|2726250000000000|4444444444444444|5634120000000000
MEMORY <- PARTIAL REG
0001020304050607|08090A0B0C0D0E0F|0001020304050607|08090A0B0C0D0E0F
bbaa998811111111|bbaa111111111111|aaaabb1111111111|1111111111111111
REG <- MEMORY
0001020304050607|08090A0B0C0D0E0F|0001020304050607|08090A0B0C0D0E0F
bbaa998877665544|bbaa998877665544|bbaa998800000000|bbaacccccccccccc
PARTIAL REG <- MEMORY
0001020304050607|08090A0B0C0D0E0F|0001020304050607|08090A0B0C0D0E0F
bbaa998877665544|bbcccccccccccccc|ccbbcccccccccccc|1111111111111111
(PARTIAL) REG <- value
0001020304050607|08090A0B0C0D0E0F|0001020304050607|08090A0B0C0D0E0F
dddddddd00000000|eeee998877665544|bbff998877665544|ffaa998877665544

The build does require the xdump.o module to link:

nasm -f elf64 -g -l sizes.lst sizes.asm
# also: yasm -f elf64 -g dwarf2 -l sizes.lst sizes.asm
gcc -O0 -o sizes sizes.o xdump.o

Printing a float

The following example shows how to do this:

; test5.asm: print a float number
;
          default rel
          extern printf

          section .text
          global main
main:
          push   rbp
          mov    rbp, rsp

          lea   rdi, [format]
          mov   eax, 1
          movsd xmm0, [piaprox]
          call  printf wrt ..plt
          ret

          section .note.GNU-stack noalloc noexec nowrite progbits

          section .rodata
format:   db "%g", 10, 0          ; C 0-terminated string
piaprox:  dq 3.1416

We’re using the SSE extension xmm* registers (with 128 bytes.) There is no immediate mode for setting a float value (like mov xmm0, 1.4) but constants may be loaded from memory.

See that we’re setting the eax register to 1 in order to notify printf about one floating point argument.

See also the dq assembler instruction to signal a quad-byte floating point number.

The next example expands these ideas and approximates the PI constant with the (rather slowly convergent) Leibniz formula:

PI=4*(1-1/3+1/5-1/7...)

See the usage of the call instruction to call a local (not globally exported) subroutine. Here we show several instructions to do arithmetic operations:

; test6.asm: approximate pi via Leibniz formula
;
          default rel
          extern printf

          section .text
          global main
main:
          push   rbp
          mov    rbp, rsp

          lea   rdi, [format]
          mov   ecx, 10000
          call  calcpi
          mov   eax, 1
          call  printf wrt ..plt
          ret

calcpi:
          mov   rax, 1
          cvtsi2sd xmm1, rax      ; xmm1 contains 1 for ever
          movapd   xmm2, xmm1
          addpd    xmm2, xmm2     ; xmm2 contains 2 in the loop
          movapd   xmm0, xmm1     ; accumulator starts in 1
          movapd   xmm4, xmm1     ; divisor starts in 1
          mov      bl,   0        ; subtract
nextpi:
          addpd    xmm4, xmm2     ; divisor += 2
          movapd   xmm3, xmm1     ; xmm3<- 1
          divpd    xmm3, xmm4     ; xmm3<- 1/divisor
          not      bl
          cmp      bl, 0
          jne      l_subtract
l_add:
          addpd    xmm0, xmm3     ; xmm0 += quotient
          dec      ecx
          jnz      nextpi
          jmp      finalize
l_subtract:
          subpd    xmm0, xmm3     ; xmm0 -= quotient
          dec      ecx
          jnz      nextpi
finalize:
          addpd    xmm2, xmm2     ; xmm2 contains 4
          mulpd    xmm0, xmm2     ; 4*accumulator
          ret

          section .note.GNU-stack noalloc noexec nowrite progbits

          section .rodata
format:   db "%g", 10, 0          ; C 0-terminated string

Note that we don’t use the memory in order to initialize the required constants, but make a conversion from an integer value with the cvtsi2sd xmm1, rax instruction. Also, the alternating operations are obtained by continuous negation of the bl subregister which turn it from all-zeros to all-ones.

A note about SIMD

The xmm* registers and its related instructions are facilities collectively known as "single instruction multiple data" (or SIMD.) For example, a 128-bits wide register is able to represent (or contain) two 64-bits double precision floating point values, or four 32-bits single precision floating point values, and (most importantly) to execute arithmetic operations simultaneously with all the contained values.

Since these facilities were not included in the original 8086 cpu, they are announced as "extensions" to be created/supported by the successive releases of Intel and AMD cpus. The extensions which provided SIMD facilities were (chronologically) MMX, SSE (Streaming SIMD Extensions) and AVX (Advanced Vector Extensions.)

MMX introduced eight 64-bits wide registers (MM0…​MM7) able to hold integers of 8, 16, 32 and 64 bits. These are aliased to the x87 registers (share the same bits.) AMD added the capability to store 32-bit single precision floating point numbers with its extension named 3DNow!. Later, the SSE extensions provided the 128-bits xmm* registers used in the previous example. MMX and x87 instructions are currently obsolete.

Fibonacci

The next program asks the user to provide a positive integer number and tries to calculate the n-th number of the well known Fibonacci' series:

1, 1, 2, 3, 5, 8, 13, 21...

In order to get some text from the user, we’ll call fgets; but first note that this is not portable enough, since stdin is not always an external symbol but a macro: https://stackoverflow.com/questions/3689919/how-to-call-fgets-in-x86-assembly

; test7.asm: calculate Fibonacci series number
;
          default rel
          extern printf, fgets, atoi, stdin

          section .text
          global main
main:
          push   rbp
          mov    rbp, rsp

; show prompt
          lea   rdi, [prompt]
          xor   eax, eax
          call  printf wrt ..plt

; Get a number by calling
; char *fgets(char *s, int size, FILE *stream);
          lea   rdi, [inbuf]
          mov   rsi, 8
          mov   rdx, [stdin]
          call  fgets wrt ..plt

; Get an integer by calling
; int atoi(const char *nptr);
          lea   rdi, [inbuf]
          call  atoi wrt ..plt

          cmp   rax, 0             ; integer must be positive
          jg    nok

; else reject
          lea   rdi, [errneg]
          xor   eax, eax
          call  printf wrt ..plt
          leave
          ret
nok:
          cmp   rax, 2            ; >=2 to for calculations
          jg    g2ok
          mov   rax, 1            ; else just print 1
          call  printer
          leave
          ret

g2ok:
                                  ; r8=r9=1
          mov   r8,  1
          mov   r9,  1
          sub   rax, 2
fstep:
                                  ; swap r8/r9, r8=r8+r9
          mov   r10, r8
          mov   r8, r9
          mov   r9, r10
          add   r8, r9
          dec   rax
          jnz   fstep
          mov   rax, r8
          call printer
          leave
          ret

printer:
          push   rbp
          mov    rbp, rsp

          lea   rdi, [format]
          mov   rsi, rax          ; atoi() answer
          xor   eax, eax
          call  printf wrt ..plt
          leave
          ret

          section .note.GNU-stack noalloc noexec nowrite progbits

; C 0-terminated strings
          section .rodata
format:   db "%d", 10, 0
errneg:   db "The n-th value must be positive", 10, 0
prompt:   db "Please enter a positive integer> ", 0
          section .bss
inbuf     resb  8

The text buffer is converted to an integer using atoi() and the result (in rax) is tested for positiveness. If not positive, an error message is printed and the program terminates. Else, rax is tested against 2 since the first couple of terms are 1 and does not need further calculation.

Next, we use rax as a counter, and setup r9 and r10 as "current terms" of the series. Then a simple loop is executed until rax becomes zero (jumps to retry while not zero, jnz.)

Square Root x87

Here we’ll make use of the (now obsolete) x87 Floating Point Unit. It provides eight 80-bits-wide registers which are organized in a stack fashion; the stack elements are referenced by st0, st1, etc. Some assemblers use st as an abbreviation for st0.

The last "pushed" element corresponds to st0; that push operation automatically moves the previous value of st0 into st1 and so on.

The x87 registers store floating point numbers, and interchange those values with the memory. The memory values may be values encoded as:

  1. signed integer word (16 bits)

  2. signed integer double word (32 bits), usually C’s int type

  3. signed integer quad word (64 bits), usually C’s long type

  4. single precision floating point (32 bits), usually C’s float type

  5. double precision floating point (64 bits), usually C’s double type

  6. double extended precision floating point (80 bits)

The x87 instructions do not allow the data interchange with the standard cpu registers: the memory must be used as temporary storage.

The following example shows how to calculate and print the square root of five:

; test8: calculate and print sqrt(5)
;
          default rel
          extern printf

          section .text
          global main
main:
          push   rbp
          mov    rbp, rsp

          mov   rax, 5
          mov   [numbuf], rax
          fild  qword [numbuf]       ; store 5 in st(0)
          fsqrt
          fstp  qword [numbuf]       ; save sqrt(5)
          call  printer
          leave
          ret
printer:
          push   rbp
          mov    rbp, rsp

          lea   rdi, [format]
          movsd xmm0, [numbuf]
          mov   eax, 1
          call  printf wrt ..plt
          leave
          ret

          section .note.GNU-stack noalloc noexec nowrite progbits

; C 0-terminated strings
          section .rodata
format:   db "sqrt(5) is %g", 10, 0
          section .bss
numbuf:   resb  8

An integer five is stored into [numbuf] as a quad word integer, which in turn is used to feed the x87 FPU:

fild  qword [numbuf]

Now st0 does contain 5.0; the fsqrt just turns st0 into sqrt(st0) so this register contents is sent into the same memory buffer:

fstp  qword [numbuf]

Note that there is a sister instruction fst with the same purpose, but the fstp version additionally "pops" the stack so it is effectively emptied. Several x87 instructions have a *p version for a complementary stack pop.

The buffer content is then stored into xmm0 in order to be sent into printf() as previously shown.

Fibonacci revisited

Now we’ll make use of the FPU in order to calculate Fibonacci numbers by using the Binet’s formula:

Fn = (phi^n - (-1/phi)^2)/sqrt(5)

where phi is the golden ratio: (1+sqrt(5))/2. See https://en.wikipedia.org/wiki/Fibonacci_number for more information.

; test9.asm: Binet formula
;
          default rel
          extern printf, atoi

          section .text
          global main
main:
          push   rbp
          mov    rbp, rsp

          cmp   rdi, 2             ; arc must be 2
          je    a1
          lea   rdi, [errarg]
          xor   eax, eax
          call  printf wrt ..plt
          leave
          ret
a1:
          mov   rdi, [rsi+8]       ; call atoi(argv[1])
          call  atoi wrt ..plt
          cmp   rax, 0             ; integer must be positive
          jg    nok

; else reject
          lea   rdi, [errneg]
          xor   eax, eax
          call  printf wrt ..plt
          leave
          ret
nok:
          cmp   rax, 2            ; >=2 to for calculations
          jg    g2ok
          fld1
          fstp  qword [numbuf]
          call  printer
          leave
          ret

g2ok:
          mov   rdi, rax
          call  calc5
          call  calcgold
          call  fibo
          call  printer
          leave
          ret
fibo:
          mov   [pow_exp], rdi
          mov   rax, [gold]
          mov   [pow_base], rax
          call  power        ; pow
          fld1               ; 1 pow
          fld   st1          ; pow 1 pow
          fdivp              ; 1/pow pow
          and   rdi, 0x1     ; test if rdi is even
          jnz   fibo2
                             ; negate st0
          fldz               ; 0 1/pow pow
          fxch               ; 1/pow 0 pow
          fsubp              ; -1/pow pow
fibo2:
          faddp              ; [+-]1/pow + pow
; divide by sqrt(5)
          fld   qword [sq5]  ; sq5 ([+-]1/pow + pow)
          fdivp              ; (+)/sq5
          fstp  qword [numbuf]
          ret
          
calc5:
          mov   rax, 5
          mov   [sq5], rax
          fild  qword [sq5]
          fsqrt
          fstp  qword [sq5]
          ret

calcgold:
          fld   qword [sq5]
          fld1
          faddp
          fld1
          fld1
          faddp
          fdivp
          fstp  qword [gold]
          ret

power:
          push  rbp
          mov   rbp, rsp

          fild  qword [pow_exp]
          fld   qword [pow_base]
          fyl2x
          fld1
          fld   st1
          fprem
          f2xm1
          faddp
          fscale
          fxch  st1
          fstp  st0
          leave
          ret
printer:
          push  rbp
          mov   rbp, rsp

          lea   rdi, [format]
          movsd xmm0, [numbuf]
          mov   eax, 1
          call  printf wrt ..plt
          xor   rax, rax        ; don't care printf retval
          leave
          ret

          section .note.GNU-stack noalloc noexec nowrite progbits

; C 0-terminated strings
          section .rodata
format:   db "pow is %g", 10, 0
prompt:   db "Please enter a positive integer> ", 0
errneg:   db "The n-th value must be positive", 10, 0
errarg:   db "Must provide one command line argument", 10, 0
          section .bss
inbuf:    resb  8
pow_base: resb  8
pow_exp:  resb  8
numbuf:   resb  8
sq5:      resb  8
gold:     resb  8

The program is an extension from the two previous ones.

Note
The power routine was adapted from https://www.madwizard.org/programming/snippets?id=36

The power routine makes use of the identity:

x^y = exp(y*log(x))

but the FPU provides two-based functions, so the identity must be rewritten as:

x^y = 2^(y*log2(x)) = 2^K

The FPU provides the fyl2x instruction in order to compute K = y*log2(x)), but there is a quirk: the calculation of 2^K must be done in two stages. If K=[K]+{K} (integer and fractional parts of K), then:

x^y = 2^([K]+{K]) = 2^[K] * 2^{k} = 2^[K] * (2^{k} -1 + 1)

The f2xm1 instruction is able to calculate 2^x -1, provided -1 < x < 1, and the fscale calculates y * 2^[x]. So, in pseudocode:

K = fyl2x(x,y)
fscale(K, add(f2xm1(K), 1)

Note also the use of fld1 to load a 1 in st0. It is used in order to get the fractional part of K with the fprem instruction (remainder of division by 1.)

Finally, after the fscale execution, one unneeded value remains in st1. It is exchanged with st0 (so the unneeded value goes into st0), and that value is "popped" with fstp st0. The final st0 is used as return value.

Other point of interest is the implementation of the Binet’s formula. This is implemented as:

phi^n + ((+/-) phi^-n)

where the sign depends on the parity of the exponent. This is tested by checking the last bit of the exponent in rdi with an and 0x1 instruction. The negated sign version is obtained by subtracting from zero (the zero is loaded with fldz.)

Finally, the program expects the n-th term to be provided from command line (not via fgets.) The argc,argv arguments come in rdi,rsi, so we first check that rdi==2, then get the pointer argv[1] in the following way: argv==rsiargv+1==rsi+8*(argv+1)==argv[1]==[rsi+8].

Parenthesis: most used instructions

Here we made a trivial and gross analysis about the most used instructions in the Linux kernel (downloaded on March 4, 2021 from https://github.com/torvalds/linux); we excluded the crypto subdir since it contains code which is too specialized to be considered as a reference to study (but I wonder if all the subdirs are too specialized in some way.)

The source files were concatenated into /tmp/linux-raw.c in order to apply the gcc pre-compiler for comments removal, else many non-assembly words leak out to the output. Of course, no C compilation is performed at all!

We get usage counters for the instructions, which are grouped by 20-units-frequency. Note that several words are not instructions but macros or assembler directives. Also, note that the frequency "0" does mean the 0-19 range.

There is also assembly code embedded in many C language files, which was not extracted since the parsing was not trivial. The instructions use the AT&T syntax.

diego@dataone:~/Desktop/linux-master/arch/x86$ cat
	$(find $(ls | grep -v crypto) -name '*.S') > /tmp/linux-raw.c
diego@dataone:~/Desktop/linux-master/arch/x86$ cd /tmp
diego@dataone:/tmp$ gcc -fpreprocessed -dD -E linux-raw.c > linux-raw.out
diego@dataone:/tmp$ cat linux-raw.out | grep -v '^;' | sed 's/^[        ]*//' |
	grep '^[a-z]' | grep -v ':' | sed 's/[  \.].*//' | grep '^[a-z]*$' |
	sort | uniq -c | awk '{a=20*int($1/20); print a,$2}' | sort -nr > /tmp/linux-freq.txt
diego@dataone:/tmp$ awk
	'{dat[$1]=dat[$1] " " $2}END{for(i in dat){print i,dat[i]}}' /tmp/linux-freq.txt |
	sort -nr
1160  movl
580  movq
280  pushl
240  jmp
200  pushq
180  popl adcl
160  xorl mov
140  ret call addl
120  jz
100  pop cmpl
80  subl push movw leaq jnz andl
60  testl popq orl jne je
40  xor shrl mull movb leal jb addq
20  testb subq shrd sbbl rep rcrl popfl leave lea jnc jae divl decl cmpw cmpq cmp cld
0  xorw xorq xchgl xchg wrmsr writepost wrgsbase wbinvd text testq test sysretq sysretl
sysexit syscall swapgs subw subb sub str sti std stc source sldt sidt shrw shrq shrb
shr shlq shll shld shl sgdtl sgdt sfence setz setne setaeb sbbb sbb sarl sar rorq roll
retq retl rdmsr rcll pushw pushfq pushfl pushf pushal popw popfq popf popal pgtno percpu
pause outb orw orq orb or notq notl note nop negw negl movzwq movzwl movzbq movzbl movswl
movsq movsl movsb movabsq movabs ltr lss lretw lretq lret lodsb lmsw lldt ljmpw ljmpl
ljmp lidtl lidt lgdtl lgdt lfence js jns jnb jmpq jmpl jle jl jiffies jge jg jecxz jc jbe
ja iretq iret int init incw incq incl inc imulq imull idtentry i hlt enclu dynamic dest
decw decq dec data cpuid cmpb clts cli clc cdq calll btsq btsl bts btrl btr btl bt bsrl
andw andq andb and addw addb add adcq

For comparison, we did the same for the venerable Microsoft DOS 2.0, available from https://github.com/microsoft/MS-DOS; the DOSMAC.ASM contained several macros for the assembler which must be excluded from the output, but several assembler directives remain:

diego@dataone:~/Desktop/MS-DOS-master/v2.0/source$ grep -ai macro DOSMAC.ASM | grep '^[a-zA-Z]' |
	awk '{print $1}' | tr 'a-z' 'A-Z' | sed 's/.*/^&$/' > /tmp/macros.tmp
diego@dataone:~/Desktop/MS-DOS-master/v2.0/source$ cat *.ASM | grep -av '^;' | grep -av ':' |
        grep -a '^        [^ ]' | sed 's/^        //' | sed 's/[  ].*//' | grep '^[a-zA-Z]' |
	tr 'a-z' 'A-Z' | egrep -av -f /tmp/macros.tmp |  sed 's/^M//' | sort | uniq -c |
	awk '{a=20*int($1/20); print a,$2}' | sort -nr > /tmp/dos-raw.txt
diego@dataone:~/Desktop/MS-DOS-master/v2.0/source$ awk
	'{dat[$1]=dat[$1] " " $2}END{for(i in dat){print i,dat[i]}}' /tmp/dos-raw.txt | sort -nr
6160  MOV
1780  CALL
1660  POP
1640  PUSH CMP
1120  JZ
1080  JMP
900  JNZ
760  INT
640  DB
600  INC
520  XOR
460  OR ADD
400  DW
380  RET
340  DEC
320  ENDIF
300  IF
280  PUBLIC
260  SUB
220  LODSB
200  TEST STOSB RETURN JC
180  XCHG
160  SHR AND
140  SHL
120  REP LDS JNC JB
100  STOSW LOOP JE INCLUDE
80  LES JCXZ JAE
60  STC PAGE JNE JBE JA ELSE
40  SUBTTL STI MUL END CLD CLC
20  RETZ REPNE PUSHF POPF OUT ORG MOVSW MOVSB LODSW IRET DIV CONTEXT CLI CBW ADC
0  XLAT TRANSFERIF TITLE SYS STD SCASB SBB SAVE SAR SAHF ROR ROL RETNZ RETNC RETC
RESTORE REPZ REPNZ REPE RCR RCL NOT NEG NAME MESSAGE LOOPNZ LOOPNE LOCAL LEA LAHF
JS JPE JNS JNB JNA JLE JL JGE JG IN IFIDN IFDEF ENDM DO_EXT DAA CMPSB CMC AAM AAD

Getting the time

The next program uses the clock_gettime() POSIX function used to get the current time with a (theoretical) precision of nanoseconds. The real precision should depend on the operating system and hardware. We introduce a dummy delay of 50000 microseconds (0.05 seconds) using the usleep() POSIX function.

; timing.asm: showing a delay and time computation
;
          default rel
          extern printf, clock_gettime, usleep

%macro    prelude 0
          push   rbp
          mov    rbp, rsp
%endmacro

          section .text
          global main
main:
          prelude
          sub   rsp, 32
          mov   rdi, 0
          mov   rsi, rsp
          call  clock_gettime wrt ..plt

          mov   rdi, 50000
          call  usleep wrt ..plt

          mov   rdi, 0
          lea   rsi, [rsp+16]
          call  clock_gettime wrt ..plt

          mov   rax, 1000000000
          cvtsi2sd xmm4, rax
          mov   rax, [rsp]
          cvtsi2sd xmm0, rax
          mov   rax, [rsp+8]
          cvtsi2sd xmm1, rax
          mov   rax, [rsp+16]
          cvtsi2sd xmm2, rax
          mov   rax, [rsp+24]
          cvtsi2sd xmm3, rax
          subpd    xmm3, xmm1
          subpd    xmm2, xmm0
          divpd    xmm3, xmm4
          movapd   xmm0, xmm2    ; 
          addpd    xmm0, xmm3

          lea   rdi, [format]
          mov   eax, 1
          call  printf wrt ..plt

          leave
          ret

          section .note.GNU-stack noalloc noexec nowrite progbits

; C 0-terminated strings
          section .rodata
format:   db "time dif is %g sec", 10, 0

From the man page, the clock_gettime() retrieves the time in the structure:

           struct timespec {
               time_t   tv_sec;        /* seconds */
               long     tv_nsec;       /* nanoseconds */
           };

After compiling a dummy C program which calls such function, we discover that (in our system) both fields are long integers using 8 bytes, compactly stored as 16 consecutive bytes. So we reserve 32 bytes for storing two structures used in two calls to the function:

          sub   rsp, 32

The structures are located at addresses rsp and rsp+16, corresponding to the tv_sec fields. At rsp+8 and rsp+24 we’ll get the tv_nsec fields.

Then we make the following calculation using double precision arithmetic, using pseudo-code for the times t1 and t2:

(t2.tv_sec-t1.tv_sec) + (t2.tv_nsec-t1.tv_nsec)/1e9

The resulting time difference is shown using printf():

diego@dataone:~/devel/ASM$ ./timing
time dif is 0.0502603 sec
diego@dataone:~/devel/ASM$ ./timing
time dif is 0.0502895 sec
diego@dataone:~/devel/ASM$ ./timing
time dif is 0.0502988 sec
diego@dataone:~/devel/ASM$ ./timing
time dif is 0.0502095 sec

Note about MMX

For floating point calculations, the x87 FPU is (in practice) superseded by the SSE2 extension’s instructions. Before SSE there was the MMX extension which provided some 64-bit registers.

From a draft for the first edition of The Art of Assembly (Randall Hyde):

"Although MM0..MM7 appear as separate registers in the Intel Architecture, the Pentium processors alias these registers with the FPU’s registers (ST0..ST7). Each of the eight MMX 64-bit registers is physically equivalent to the L.O. 64-bits of each of the FPU’s registers […​]. The MMX registers overlay the FPU registers in much the same way that the 16-bit general purpose registers overlay the 32-bit general purpose registers."

The combination of MMX and the FPU is discouraged:

"Because the MMX registers overlay the FPU registers, you cannot mix FPU and MMX instructions in the same computation sequence. You can begin executing an MMX instruction sequence at any time; however, once you execute an MMX instruction you cannot execute another FPU instruction until you execute a special MMX instruction, EMMS (Exit MMX Machine State)."

There is a chance that a program is executed in a cpu which does not implements a required extension; in this case it usually crashes because an unsupported instruction; in a later section we comment about the runtime discovery of cpu extensions to better handle these cases.

So, in practice, we’ll use the SSE registers (xmm*) as in the following example:

Inverting a matrix

The following C program shows a simple (dummy) implementation of the high school procedure for matrix inversion, applying "elementary operations" into the matrix and into a helper identity matrix; when the original matrix is converted into an identity, the helper matrix is the desired inverse. Note that we avoid several checks in order to simplify the implementation:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <time.h>

void show(double *m, int n) {
	if(n > 10) {
		printf("Too big to show\n");
		return;
	}
	int r, c;
	for(r = 0 ; r < n ; r ++) {
		for(c = 0 ; c < n ; c ++) {
			printf("%10g ", m[r*n+c]);
		}
		putchar('\n');
	}
	putchar('\n');
}

int isIdentity(double *m, int n) {
	int r, c;
	for(r = 0 ; r < n ; r ++) {
		for(c = 0 ; c < n ; c ++) {
			if(r == c) {
				if(fabs(m[r*n+c]-1.0) > 1e-5) return 0;
			} else {
				if(fabs(m[r*n+c]-0.0) > 1e-5) return 0;
			}
		}
	}
	return 1;
}

double matrixTotal(double *m, int n) {
	int r, c;
	double t = 0.0;
	for(r = 0 ; r < n ; r ++) {
		for(c = 0 ; c < n ; c ++) {
			t += m[r*n+c];
		}
	}
	return t;
}

double matrixTrace(double *m, int n) {
	int r;
	double t = 0.0;
	for(r = 0 ; r < n ; r ++) {
		t += m[r*n+r];
	}
	return t;
}

void xcopy(double *a, double *b, double *prod, int n) {
	/* multiply copy by m2 */
	int c, r, x;
	for(c = 0 ; c < n ; c ++) {
		for(r = 0 ; r < n ; r ++) {
			prod[c*n+r]=0.0;
			for(x = 0 ; x < n ; x ++) {
				prod[c+n*r] += a[c*n+x]*b[x*n+r];
			}
		}
	}
}

int main(int argc, char **argv) {
	int r, c, x;
	int n = atoi(argv[1]);
	double rmax = RAND_MAX;
	struct timespec tp1, tp2;
	srand(0);
	double *m1 = calloc(n * n, sizeof(double));
	double *m2 = calloc(n * n, sizeof(double));
	double *prod = calloc(n * n, sizeof(double));
	/* build identity */
	for(r = 0 ; r < n ; r ++) {
		m2[r*n+r] = 1.0;
	}
	/* build random matrix */
	for(r = 0 ; r < n ; r ++) {
		for(c = 0 ; c < n ; c ++) {
			m1[r*n+c] = rand()/rmax;
		}
	}
	show(m1, n);
	show(m2, n);
	/* copy the random matrix in prod for final check */
	memcpy(prod, m1, n * n * sizeof(double));

	/* invert the matrix */
	clock_gettime(CLOCK_REALTIME, &tp1);
	for(c = 0 ; c < n ; c ++) {
		double ipivot = 1.0/m1[c*n+c];
		/* pivot row multiply by 1/pivot */
		for(x = 0 ; x < n ; x ++) {
			m1[c*n+x] *= ipivot;
			m2[c*n+x] *= ipivot;
		}
		/* process rows except pivot */
		for(r = 0 ; r < n ; r ++) {
			if(r != c) {
				double factor = m1[r*n+c];
				for(x = 0 ; x < n ; x ++) {
					m1[r*n+x] -= factor * m1[c*n+x];
					m2[r*n+x] -= factor * m2[c*n+x];
				}
			}
		}
	}
	clock_gettime(CLOCK_REALTIME, &tp2);
	double ds = tp2.tv_sec - tp1.tv_sec;
	double dn = tp2.tv_nsec - tp1.tv_nsec;
	double dif = ds + dn / 1e9;
	printf("time dif is %g sec\n", dif);
	show(m1, n); /* identity */
	show(m2, n); /* inverted matrix */
	/* m1 returns to the original matrix */
	memcpy(m1, prod, n * n * sizeof(double));
	/* multiply m1*m2 into prod for checking */
	xcopy(m1, m2, prod, n);
	printf("is identity = %d\n", isIdentity(prod, n));
	printf("total  = %g\n", matrixTotal(m2, n));
	printf("trace  = %g\n", matrixTrace(m2, n));
	return 0;
}

Additionally we implemented a matrix product routine used to check that the inverse multiplied by the original matrix result in an identity. A helper function checks it is indeed an identity (up to some precision.)

In order to quickly compare the results with an assembly version, we calculate the sum of all the inverse’s elements, and its trace:

diego@dataone:~/devel/ASM$ ./model-inv 500
time dif is 1.26518 sec
is identity = 1
total  = 1.94742
trace  = 3.69845
diego@dataone:~/devel/ASM$ ./model-inv 500
time dif is 1.21957 sec
is identity = 1
total  = 1.94742
trace  = 3.69845

After compiling with optimization level 3 (-O3 option), the performance improves:

diego@dataone:~/devel/ASM$ ./model-inv 500
time dif is 0.531943 sec
is identity = 1
total  = 1.94742
trace  = 3.69845
diego@dataone:~/devel/ASM$ ./model-inv 500
time dif is 0.500119 sec
is identity = 1
total  = 1.94742
trace  = 3.69845

A corresponding assembly version follows:

;
; inv.asm: inverting a matrix
;
          default rel
          extern printf, clock_gettime, rand, srand, calloc, atoi, putchar

%macro    prelude 0
          push   rbp
          mov    rbp, rsp
%endmacro

          section .text
          global main
main:
          prelude
          sub   rsp, 48

          mov   rdi, [rsi+8]       ; get matrix size from argv[1]
          call  atoi wrt ..plt
          mov   r15, rax

          mul   eax                ; n^2
          mov   edi, eax
          mov   rsi, 16            ; alignment for xmm* registers
ccal:
          call  calloc wrt ..plt
          mov   qword [m1], rax

          mov   rax, r15
          mul   eax                ; n^2
          xor   rdi, rdi
          mov   edi, eax
          mov   rsi, 16            ; alignment for xmm* registers
          call  calloc wrt ..plt
          mov   qword [m2], rax

; srand(0)
          xor   rdi, rdi
          call  srand wrt ..plt

; build random on m1
          mov   rax, [randmax]
          cvtsi2sd xmm4, rax

          xor   r12, r12
nextr:
          xor   r13, r13
nextc:
          movdqu  oword [rsp+32], xmm4
;          movdqu  dqword [rsp+32], xmm4
          call  rand wrt ..plt
          movdqu  xmm4, oword [rsp+32]
;          movdqu  xmm4, dqword [rsp+32]
          cvtsi2sd xmm0, rax
          divpd    xmm0, xmm4      ; range is 0-1

          mov   rax, r15
          imul  rax, r12
          add   rax, r13
          imul  rax, 16
;         movapd   [m1+rax], xmm0   -> relocation R_X86_64_32
          mov   r14, qword [m1]
          add   r14, rax
          movapd   [r14], xmm0
          inc   r13
          cmp   r13, r15
          jl    nextc
          inc   r12
          cmp   r12, r15
          jl    nextr

; build identity on m2
          xor   r12, r12
nextri:
          mov   rax, r15
          imul  rax, r12
          add   rax, r12
          imul  rax, 16
;         movapd   [m2+rax], xmm0   -> relocation R_X86_64_32
          mov   r14, qword [m2]
          add   r14, rax
          mov   rax, 1
          cvtsi2sd xmm0, rax
          movapd   [r14], xmm0
          inc   r12
          cmp   r12, r15
          jl    nextri
          
; printing
          mov   r14, qword [m1]
          call  show
          mov   r14, qword [m2]
          call  show

; get start time
          mov   rdi, 0
          mov   rsi, rsp
          call  clock_gettime wrt ..plt

; begin invertion
          mov   rax, 1
          cvtsi2sd xmm5, rax       ; xmm5 contains 1.0

          xor   r13, r13           ; main loop: column iteration
inextc:
          mov   rax, r15
          imul  rax, r13
          add   rax, r13
          imul  rax, 16
          mov   r14, qword [m1]
          add   r14, rax
          movapd   xmm0, [r14]
          movapd   xmm3, xmm5
          divpd    xmm3, xmm0      ; xmm3 has ipivot = 1.0/m1[c*n+c];

          xor   rbx, rbx           ; normalize pivot cell: row factor loop
inextx:
          mov   rax, r15
          imul  rax, r13
          add   rax, rbx
          imul  rax, 16

          mov   r14, qword [m1]
          add   r14, rax
          movapd   xmm0, [r14]     ; get m1[c*n+x]
          mulpd    xmm0, xmm3
          movapd   [r14], xmm0     ; m1[c*n+x] *= ipivot;

          mov   r14, qword [m2]
          add   r14, rax
          movapd   xmm0, [r14]     ; get m1[c*n+x]
          mulpd    xmm0, xmm3
          movapd   [r14], xmm0     ; m2[c*n+x] *= ipivot;

          inc   rbx
          cmp   rbx, r15
          jl    inextx
          xor   r12, r12
inextr:
          cmp   r12, r13           ; if(r != c)
          je    inextre

          mov   rax, r15
          imul  rax, r12
          add   rax, r13
          imul  rax, 16
          mov   r14, qword [m1]
          add   r14, rax
          movapd   xmm3, [r14]     ; factor = m1[r*n+c];

          xor   rbx, rbx
inextxx:
          mov   rax, r15
          imul  rax, r13
          add   rax, rbx
          imul  rax, 16
          mov   r14, qword [m1]
          add   r14, rax
          movapd   xmm2, [r14]     ; m1[c*n+x]
          mulpd    xmm2, xmm3      ; factor * m1[c*n+x]
          mov   r14, qword [m2]
          add   r14, rax
          movapd   xmm1, [r14]     ; m2[c*n+x]
          mulpd    xmm1, xmm3      ; factor * m2[c*n+x]
          mov   rax, r15
          imul  rax, r12
          add   rax, rbx
          imul  rax, 16
          mov   r14, qword [m1]
          add   r14, rax
          movapd   xmm0, [r14]     ; m1[r*n+x]
          subpd    xmm0, xmm2
          movapd   [r14], xmm0     ; m1[r*n+x] -= factor * m1[c*n+x]
          mov   r14, qword [m2]
          add   r14, rax
          movapd   xmm0, [r14]     ; m2[r*n+x]
          subpd    xmm0, xmm1
          movapd   [r14], xmm0     ; m2[r*n+x] -= factor * m2[c*n+x]

          inc   rbx
          cmp   rbx, r15
          jl    inextxx
inextre:
          inc   r12
          cmp   r12, r15
          jl    inextr

          inc   r13
          cmp   r13, r15
          jl    inextc


; get end time
          mov   rdi, 0
          lea   rsi, [rsp+16]
          call  clock_gettime wrt ..plt

; calculate time difference
          mov   rax, 1000000000
          cvtsi2sd xmm4, rax
          mov   rax, [rsp]
          cvtsi2sd xmm0, rax
          mov   rax, [rsp+8]
          cvtsi2sd xmm1, rax
          mov   rax, [rsp+16]
          cvtsi2sd xmm2, rax
          mov   rax, [rsp+24]
          cvtsi2sd xmm3, rax
          subpd    xmm3, xmm1
          subpd    xmm2, xmm0
          divpd    xmm3, xmm4
          movapd   xmm0, xmm2    ; 
          addpd    xmm0, xmm3

; show time difference
          lea   rdi, [format]
          mov   eax, 1
          call  printf wrt ..plt

; printing again
          mov   r14, qword [m1]
          call  show
          mov   r14, qword [m2]
          call  show

; get total
          mov   r14, qword [m2]
          call  matrixtotal
          
          lea   rdi, [ftotal]
          mov   eax, 1
          call  printf wrt ..plt

; get trace
          mov   r14, qword [m2]
          call  matrixtrace
          
          lea   rdi, [ftrace]
          mov   eax, 1
          call  printf wrt ..plt

          leave
          ret

; r15 has matrix size, r14 has pointer to the data, result in xmm0
matrixtotal:
          prelude
          sub   rsp, 16
          mov   qword [rsp], r14
          pxor  xmm0, xmm0
          xor   r12, r12
tonextr:
          xor   r13, r13
tonextc:
          mov   rax, r15
          imul  rax, r12
          add   rax, r13
          imul  rax, 16
          mov   r14, qword [rsp]
          add   r14, rax
          movapd   xmm1, [r14]
          addpd    xmm0, xmm1
          inc   r13
          cmp   r13, r15
          jl    tonextc
          inc   r12
          cmp   r12, r15
          jl    tonextr
          leave
          ret

; r15 has matrix size, r14 has pointer to the data, result in xmm0
matrixtrace:
          prelude
          sub   rsp, 16
          mov   qword [rsp], r14
          pxor  xmm0, xmm0
          xor   r12, r12
trnextr:
          mov   rax, r15
          imul  rax, r12
          add   rax, r12
          imul  rax, 16
          mov   r14, qword [rsp]
          add   r14, rax
          movapd   xmm1, [r14]
          addpd    xmm0, xmm1
          inc   r12
          cmp   r12, r15
          jl    trnextr
          leave
          ret

; r15 has matrix size, r14 has pointer to the data
show:
          prelude
          sub   rsp, 16
          cmp   r15, 10
          jg    toobig
          mov   qword [rsp], r14
          xor   r12, r12
snextr:
          xor   r13, r13
snextc:
          mov   rax, r15
          imul  rax, r12
          add   rax, r13
          imul  rax, 16
          mov   r14, qword [rsp]
          add   r14, rax
          movapd   xmm0, [r14]
          lea   rdi, [mformat]
          mov   eax, 1
          call  printf wrt ..plt
          inc   r13
          cmp   r13, r15
          jl    snextc
          mov   rdi, 10
          call  putchar wrt ..plt
          inc   r12
          cmp   r12, r15
          jl    snextr
          mov   rdi, 10
          call  putchar wrt ..plt

          leave
          ret
toobig:   lea   rdi, [ftoobig]
          xor   eax, eax
          call  printf wrt ..plt
          leave
          ret

          section .note.GNU-stack noalloc noexec nowrite progbits


; C 0-terminated strings
          section .rodata
format:   db "time dif is %g sec", 10, 0
ftoobig:  db "Too big to show", 10, 0
ftotal:   db "total  = %g", 10, 0
ftrace:   db "trace  = %g", 10, 0
mformat:  db "%10g ", 0
randmax:  dq 2147483647
          section .bss
n:        resb 8
m1:       resb 8
m2:       resb 8
  1. We made strong use of the "preserved" registers r12-r15 and rbx

  2. The RAND_MAX constant is implementation dependent; the value was located in /usr/include/stdlib.h (an alternative way to look it up is generating the assembly of a dummy C program which uses such value)

  3. The pxor instruction was used to set xmm0 to zero; see https://stackoverflow.com/questions/33666617/what-is-the-best-way-to-set-a-register-to-zero-in-x86-assembly-xor-mov-or-and for more information regarding the zeroing of the registers

  4. Note that we aligned all the matrix elements to 16 bytes in memory, since the xmm* instructions does require it for load/store operations; we’re wasting half of the memory

  5. The returned calloc memory block was 16-bytes aligned. This is not guaranteed, and we should take some precautions for a distinct scenery (see below)

Some results:

diego@dataone:~/devel/ASM$ ./inv 500
time dif is 1.17459 sec
total  = 1.94742
trace  = 3.69845
diego@dataone:~/devel/ASM$ ./inv 500
time dif is 1.21626 sec
total  = 1.94742
trace  = 3.69845

Using the debugger

The debugger is a powerful tool for assembly programs. For example, we may start a debugging session of the inv program:

diego@dataone:~/devel/ASM$ gdb ./inv
...
(gdb)

Usually, we instruct the debugger to run until some point in the program identified by a label (breakpoint.) For example, to mark the inextc label as breakpoint:

(gdb) break inextc
Breakpoint 1 at 0x913: file inv.asm, line 104.

Then the program is executed, providing command line arguments:

(gdb) run 4
Starting program: /home/diego/devel/ASM/inv 4
  0.840188   0.394383   0.783099    0.79844
  0.911647   0.197551   0.335223    0.76823
  0.277775    0.55397   0.477397   0.628871
  0.364784   0.513401    0.95223   0.916195

         1          0          0          0
         0          1          0          0
         0          0          1          0
         0          0          0          1

Breakpoint 1, inextc () at inv.asm:104
104	          mov   rax, r15
(gdb)

Here we may list the surrounding code:

(gdb) list
99	          mov   rax, 1
100	          cvtsi2sd xmm5, rax       ; xmm5 contains 1.0
101
102	          xor   r13, r13           ; main loop: column iteration
103	inextc:
104	          mov   rax, r15
105	          imul  rax, r13
106	          add   rax, r13
107	          imul  rax, 16
108	          mov   r14, qword [m1]
(gdb)

The xmm5 register was loaded with 1.0 from rax (line 100.) It may be shown this way:

(gdb) print $xmm5
$1 = {v4_float = {0, 1.875, 0, 0}, v2_double = {1, 0},
    v16_int8 = {0, 0, 0, 0, 0, 0, -16, 63, 0, 0, 0, 0,
    0, 0, 0, 0}, v8_int16 = {0, 0, 0, 16368, 0, 0, 0, 0},
    v4_int32 = {0, 1072693248, 0, 0}, v2_int64 =
    {4607182418800017408, 0}, uint128 = 4607182418800017408}
(gdb)

Since the xmm* registers have 128 bits, they are able to store two double precision floating point values (8 bytes each one.) This is shown as v2_double = {1, 0} in the previous output. The cvtsi2sd just filled the "low" 64 bits of the register.

Now, we execute the following instructions with the next command. Note that the step command here is equivalent; it is useful when we are interested in the execution of the called functions, since next just steps over them.

(gdb) next
105	          imul  rax, r13
(gdb) next
106	          add   rax, r13
(gdb) next
107	          imul  rax, 16
(gdb) next
108	          mov   r14, qword [m1]
(gdb) next
109	          add   r14, rax
(gdb) next
110	          movapd   xmm0, [r14]
(gdb) next
111	          movapd   xmm3, xmm5
(gdb) print $xmm0
$3 = {v4_float = {8.64733415e+15, 1.83504689, 0, 0},
    v2_double = {0.84018771715470952, 0}, v16_int8 =
    {-93, -59, -11, 89, -47, -30, -22, 63, 0, 0, 0, 0, 0, 0, 0, 0},
    v8_int16 = {-14941, 23029, -7471, 16362, 0, 0, 0, 0},
    v4_int32 = {1509279139, 1072358097, 0, 0}, v2_int64 = {
    4605742957725074851, 0}, uint128 = 4605742957725074851}
(gdb)

After the execution of line 110 (when the 111 is shown) then the xmm0 register is printed, showing the double values v2_double = {0.84018771715470952, 0}. This is the result from the movapd instruction which did read a 16 bytes block, which was previously filled with a random double and (implicitly) with a zero.

We may avoid the memory waste by the use of the full 128 bits xmm* registers, by working in "pairs" of double precision values in the memory.

The best news of this scheme is that the arithmetical operations are automatically applied to both parts of the register, so we effectively duplicate the computation speed.

Loading information in packed form

The xmm* and ymm* registers may load floating point data from memory aligned at 16 and 32 bytes (resp.)

;
; packed.asm: loading packed doubles
;
          default rel

%macro    prelude 0
          push   rbp
          mov    rbp, rsp
%endmacro

          section .text
          global main
main:
          prelude
xload:
          movapd   xmm0, [xdata1]
          vmovapd  ymm0, [ydata1]
          movapd   xmm0, [xdata2]

          leave
          ret

          section .note.GNU-stack noalloc noexec nowrite progbits

; C 0-terminated strings
          section .rodata
align 16
xdata1:   dq 1.1
          dq 2.2
          dq 3.3
          dq 4.4
align 32
ydata1:   dq 5.5
          dq 6.6
          dq 7.7
          dq 8.8
align 16
xdata2:   dq 11.1
          dq 22.2

We’ll run this with a debugger. In the following display we show only the "double precision" section:

diego@dataone:~/devel/ASM$ gdb ./packed
...
(gdb) break xload
Breakpoint 1 at 0x5e4: file packed.asm, line 16.
(gdb) run
Starting program: /home/diego/devel/ASM/packed

Breakpoint 1, xload () at packed.asm:16
16	          movapd   xmm0, [xdata1]
(gdb) print $xmm0
$1 = {v2_double = { 7.0632744564454665e-304, 0} ...}
(gdb) print $ymm0
$2 = { v4_double = {7.0632744564454665e-304, 0, 0, 0} ...}
(gdb) next
17	          vmovapd  ymm0, [ydata1]
(gdb) print $xmm0
$3 = { v2_double = {1.1000000000000001, 2.2000000000000002} ...}
(gdb) print $ymm0
$4 = { v4_double = {1.1000000000000001, 2.2000000000000002, 0, 0} ...}
(gdb) next
18	          movapd   xmm0, [xdata2]
(gdb) print $xmm0
$5 = { v2_double = {5.5, 6.5999999999999996} ...}
(gdb) print $ymm0
$6 = { v4_double = {5.5, 6.5999999999999996, 7.7000000000000002,
                    8.8000000000000007} ...}
(gdb) next
20	          leave
(gdb) print $xmm0
$7 = { v2_double = {11.1, 22.199999999999999} ...}
(gdb) print $ymm0
$8 = { v4_double = {11.1, 22.199999999999999, 7.7000000000000002,
                    8.8000000000000007} ...}
(gdb)

The last instruction shows how the movapd fully loads the xmm0 register and only touches the lower part of the ymm0 register (the higher part is not altered.)

Shuffling registers

Here we show the shufpd instruction which "shuffles" the parts of two registers into the first one, upon two controlling bits passwd as third parameter:

;
; shuffle.asm: shuffling registers
;
          default rel

%macro    prelude 0
          push   rbp
          mov    rbp, rsp
%endmacro

          section .text
          global main
main:
          prelude
xload:
          movapd   xmm0, [xdata3]
          movapd   xmm1, [xdata4]
          shufpd   xmm0, xmm1, 0x00
          movapd   xmm0, [xdata3]
          movapd   xmm1, [xdata4]
          shufpd   xmm0, xmm1, 0x01
          movapd   xmm0, [xdata3]
          movapd   xmm1, [xdata4]
          shufpd   xmm0, xmm1, 0x02
          movapd   xmm0, [xdata3]
          movapd   xmm1, [xdata4]
          shufpd   xmm0, xmm1, 0x03
          movapd   xmm0, [xdata3]
          shufpd   xmm0, xmm0, 0x00

          leave
          ret

          section .note.GNU-stack noalloc noexec nowrite progbits

; C 0-terminated strings
          section .rodata
align 16
xdata3:   dq 37.0
          dq 41.0
align 16
xdata4:   dq 43.0
          dq 53.0

Again, we use gdb to inspect the behavior:

Breakpoint 1, xload () at shuffle.asm:16
16	          movapd   xmm0, [xdata3]
(gdb) next
17	          movapd   xmm1, [xdata4]
(gdb) next
18	          shufpd   xmm0, xmm1, 0x00
(gdb) print $xmm0.v2_double
$7 = {37, 41}
(gdb) print $xmm1.v2_double
$8 = {43, 53}
(gdb) next
19	          movapd   xmm0, [xdata3]
(gdb) print $xmm0.v2_double
$9 = {37, 43}
(gdb) print $xmm1.v2_double
$10 = {43, 53}
(gdb) next
20	          movapd   xmm1, [xdata4]
(gdb)
21	          shufpd   xmm0, xmm1, 0x01
(gdb)
22	          movapd   xmm0, [xdata3]
(gdb) print $xmm0.v2_double
$11 = {41, 43}
(gdb) next
23	          movapd   xmm1, [xdata4]
(gdb)
24	          shufpd   xmm0, xmm1, 0x02
(gdb)
25	          movapd   xmm0, [xdata3]
(gdb) print $xmm0.v2_double
$12 = {37, 53}
(gdb) next
26	          movapd   xmm1, [xdata4]
(gdb)
27	          shufpd   xmm0, xmm1, 0x03
(gdb)
29	          leave
(gdb) print $xmm0.v2_double
$13 = {41, 53}
(gdb)

So, we have the initial values xmm0 = {37, 41}, xmm1 = {43, 53}; after the shuffles, we get the following results for xmm0 (xmm1 does not change), dependent on the control bits:

0x00 → bit-0 = 0 ; bit-1 = 0

The bit-0 controls the lower lower part of the destination. If zero, it gets the lower part of the first argument; else, the higher part of the first argument. Here we are in the former case, effectively doing nothing, so the lower part of xmm0 remains in 37.

The bit-1 controls the higher part in a similar way as before, but using the second argument as source. So the higher part of xmm0 gets the lower part of the second argument (43.) So we got as result: xmm0 = {37, 43}.

0x01 → bit-0 = 1 ; bit-1 = 0

Only the lower part gets the higher part of the first argument, so we get xmm0 = {41, 43}.

0x02 → bit-0 = 0 ; bit-1 = 1

The lower part remains from the first argument (37) and the higher part gets the higher part of the second argument, so we get xmm0 = {37, 53}.

0x03 → bit-0 = 1 ; bit-1 = 1

Combining the previous cases, we get xmm0 = {41, 53}.

The last shuffling is the instruction shufpd xmm0, xmm0, 0x00, which from the previous explanations results in the copy of the lower part of the register into the upper part.

Matrix Inversion revisited

To avoid unhelpful complications, we’ll require the matrix dimensions to be an even number, in order to always store the numbers in pairs; the following version improves the inner loop of the computation, which is where the performance benefits at most.

;
; inv2.asm: inverting a matrix
;
          default rel
          extern printf, clock_gettime, rand, srand, calloc, atoi, putchar

%macro    prelude 0
          push   rbp
          mov    rbp, rsp
%endmacro

          section .text
          global main
main:
          prelude
          sub   rsp, 48

          mov   rdi, [rsi+8]       ; get matrix size from argv[1]
          call  atoi wrt ..plt
          mov   r15, rax

          mov   edi, eax
          and   edi, 1
          jz    is_even

          lea   rdi, [noteven]
          mov   eax, eax
          call  printf wrt ..plt
          leave
          ret

is_even:
          mul   eax                ; n^2
          mov   edi, eax
          mov   rsi, 8             ; no wasted space
ccal:
          call  calloc wrt ..plt
          mov   qword [m1], rax

          mov   rax, r15
          mul   eax                ; n^2
          xor   rdi, rdi
          mov   edi, eax
          mov   rsi, 8             ; no wasted space
          call  calloc wrt ..plt
          mov   qword [m2], rax

; srand(0)
          xor   rdi, rdi
          call  srand wrt ..plt

; build random on m1
          mov   rax, [randmax]
          cvtsi2sd xmm4, rax

          xor   r12, r12
nextr:
          xor   r13, r13
nextc:
          movdqu  oword [rsp+32], xmm4
          call  rand wrt ..plt
          movdqu  xmm4, oword [rsp+32]
          cvtsi2sd xmm0, rax
          divpd    xmm0, xmm4      ; range is 0-1

          mov   rax, r15
          imul  rax, r12
          add   rax, r13
          imul  rax, 8
;         movapd   [m1+rax], xmm0   -> relocation R_X86_64_32
          mov   r14, qword [m1]
          add   r14, rax
          movlpd   [r14], xmm0     ; low quadword
          inc   r13
          cmp   r13, r15
          jl    nextc
          inc   r12
          cmp   r12, r15
          jl    nextr

; build identity on m2
          xor   r12, r12
nextri:
          mov   rax, r15
          imul  rax, r12
          add   rax, r12
          imul  rax, 8
;         movapd   [m2+rax], xmm0   -> relocation R_X86_64_32
          mov   r14, qword [m2]
          add   r14, rax
          mov   rax, 1
          cvtsi2sd xmm0, rax
          movlpd   [r14], xmm0     ; low quadword
          inc   r12
          cmp   r12, r15
          jl    nextri
          
; printing
          mov   r14, qword [m1]
          call  show
          mov   r14, qword [m2]
          call  show

; get start time
          mov   rdi, 0
          mov   rsi, rsp
          call  clock_gettime wrt ..plt

; begin invertion
          mov   rax, 1
          cvtsi2sd xmm5, rax       ; xmm5 contains 1.0

          xor   r13, r13           ; main loop: column iteration
inextc:
          mov   rax, r15
          imul  rax, r13
          add   rax, r13
          imul  rax, 8
          mov   r14, qword [m1]
          add   r14, rax
          movlpd   xmm0, [r14]
          movapd   xmm3, xmm5
          divpd    xmm3, xmm0      ; xmm3 has ipivot = 1.0/m1[c*n+c];

          xor   rbx, rbx           ; normalize pivot cell: row factor loop
inextx:
          mov   rax, r15
          imul  rax, r13
          add   rax, rbx
          imul  rax, 8

          mov   r14, qword [m1]
          add   r14, rax
          movlpd   xmm0, [r14]     ; get m1[c*n+x]
          mulpd    xmm0, xmm3
          movlpd   [r14], xmm0     ; m1[c*n+x] *= ipivot;

          mov   r14, qword [m2]
          add   r14, rax
          movlpd   xmm0, [r14]     ; get m1[c*n+x]
          mulpd    xmm0, xmm3
          movlpd   [r14], xmm0     ; m2[c*n+x] *= ipivot;

          inc   rbx
          cmp   rbx, r15
          jl    inextx
          xor   r12, r12
inextr:
          cmp   r12, r13           ; if(r != c)
          je    inextre

          mov   rax, r15
          imul  rax, r12
          add   rax, r13
          imul  rax, 8
          mov   r14, qword [m1]
          add   r14, rax
          movlpd   xmm3, [r14]     ; factor = m1[r*n+c];
          shufpd   xmm3, xmm3, 0x00

          xor   rbx, rbx
inextxx:
          mov   rax, r15
          imul  rax, r13
          add   rax, rbx
          imul  rax, 8
          mov   r14, qword [m1]
          add   r14, rax
          movapd   xmm2, [r14]     ; m1[c*n+x] and next
          mulpd    xmm2, xmm3      ; factor * m1[c*n+x]
          mov   r14, qword [m2]
          add   r14, rax
          movapd   xmm1, [r14]     ; m2[c*n+x] and next
          mulpd    xmm1, xmm3      ; factor * m2[c*n+x]
          mov   rax, r15
          imul  rax, r12
          add   rax, rbx
          imul  rax, 8
          mov   r14, qword [m1]
          add   r14, rax
          movapd   xmm0, [r14]     ; m1[r*n+x] and next
          subpd    xmm0, xmm2
          movapd   [r14], xmm0     ; m1[r*n+x] -= factor * m1[c*n+x]
          mov   r14, qword [m2]
          add   r14, rax
          movapd   xmm0, [r14]     ; m2[r*n+x]
          subpd    xmm0, xmm1
          movapd   [r14], xmm0     ; m2[r*n+x] -= factor * m2[c*n+x]

          inc   rbx
          inc   rbx                ; rbx+=2
          cmp   rbx, r15
          jl    inextxx
inextre:
          inc   r12
          cmp   r12, r15
          jl    inextr

          inc   r13
          cmp   r13, r15
          jl    inextc


; get end time
          mov   rdi, 0
          lea   rsi, [rsp+16]
          call  clock_gettime wrt ..plt

; calculate time difference
          mov   rax, 1000000000
          cvtsi2sd xmm4, rax
          mov   rax, [rsp]
          cvtsi2sd xmm0, rax
          mov   rax, [rsp+8]
          cvtsi2sd xmm1, rax
          mov   rax, [rsp+16]
          cvtsi2sd xmm2, rax
          mov   rax, [rsp+24]
          cvtsi2sd xmm3, rax
          subpd    xmm3, xmm1
          subpd    xmm2, xmm0
          divpd    xmm3, xmm4
          movapd   xmm0, xmm2    ; 
          addpd    xmm0, xmm3

; show time difference
          lea   rdi, [format]
          mov   eax, 1
          call  printf wrt ..plt

; printing again
          mov   r14, qword [m1]
          call  show
          mov   r14, qword [m2]
          call  show

; get total
          mov   r14, qword [m2]
          call  matrixtotal
          
          lea   rdi, [ftotal]
          mov   eax, 1
          call  printf wrt ..plt

; get trace
          mov   r14, qword [m2]
          call  matrixtrace
          
          lea   rdi, [ftrace]
          mov   eax, 1
          call  printf wrt ..plt

          leave
          ret

; r15 has matrix size, r14 has pointer to the data, result in xmm0
matrixtotal:
          prelude
          sub   rsp, 16
          mov   qword [rsp], r14
          pxor  xmm0, xmm0
          xor   r12, r12
tonextr:
          xor   r13, r13
tonextc:
          mov   rax, r15
          imul  rax, r12
          add   rax, r13
          imul  rax, 8
          mov   r14, qword [rsp]
          add   r14, rax
          movlpd   xmm1, [r14]
          addpd    xmm0, xmm1
          inc   r13
          cmp   r13, r15
          jl    tonextc
          inc   r12
          cmp   r12, r15
          jl    tonextr
          leave
          ret

; r15 has matrix size, r14 has pointer to the data, result in xmm0
matrixtrace:
          prelude
          sub   rsp, 16
          mov   qword [rsp], r14
          pxor  xmm0, xmm0
          xor   r12, r12
trnextr:
          mov   rax, r15
          imul  rax, r12
          add   rax, r12
          imul  rax, 8
          mov   r14, qword [rsp]
          add   r14, rax
          movlpd   xmm1, [r14]
          addpd    xmm0, xmm1
          inc   r12
          cmp   r12, r15
          jl    trnextr
          leave
          ret

; r15 has matrix size, r14 has pointer to the data
show:
          prelude
          sub   rsp, 16
          cmp   r15, 10
          jg    toobig
          mov   qword [rsp], r14
          xor   r12, r12
snextr:
          xor   r13, r13
snextc:
          mov   rax, r15
          imul  rax, r12
          add   rax, r13
          imul  rax, 8
          mov   r14, qword [rsp]
          add   r14, rax
          movlpd   xmm0, [r14]
          lea   rdi, [mformat]
          mov   eax, 1
          call  printf wrt ..plt
          inc   r13
          cmp   r13, r15
          jl    snextc
          mov   rdi, 10
          call  putchar wrt ..plt
          inc   r12
          cmp   r12, r15
          jl    snextr
          mov   rdi, 10
          call  putchar wrt ..plt

          leave
          ret
toobig:   lea   rdi, [ftoobig]
          xor   eax, eax
          call  printf wrt ..plt
          leave
          ret

          section .note.GNU-stack noalloc noexec nowrite progbits


; C 0-terminated strings
          section .rodata
noteven:  db "Matrix must have even dimensions", 10, 0
format:   db "time dif is %g sec", 10, 0
ftoobig:  db "Too big to show", 10, 0
ftotal:   db "total  = %g", 10, 0
ftrace:   db "trace  = %g", 10, 0
mformat:  db "%10g ", 0
randmax:  dq 2147483647
          section .bss
n:        resb 8
m1:       resb 8
m2:       resb 8
  1. When creating the random and identity matrices, we explicitly ask for the low part of the xmm* register; for example with movlpd [r14], xmm0

  2. Also, we calculate memory locations by a factor of 8 bytes (imul rax, 8)

  3. In the inner loop we use memory aligned access of 16 bytes (128 bits)

  4. The addresses are obtained with the formula (r15 * RC + rbx)*8 where RC=r12 for rows and RC=r13 for columns. Since r15 and rbx are even, the result is 16-byte aligned

  5. We use the shufpd instruction to copy the lower part of xmm3 into the higher one, since this same value is to be multiplied by other two-valued registers.

The time is improved:

diego@dataone:~/devel/ASM$ ./inv2 500
time dif is 0.61319 sec
total  = 1.94742
trace  = 3.69845
diego@dataone:~/devel/ASM$ ./inv2 500
time dif is 0.587854 sec
total  = 1.94742
trace  = 3.69845

Matrix Inversion with AVX

The following assembly implementation builds on the previous one, making use of the so-called AVX (advanced vector extensions) instructions, which allow some "vectorized" operations to be carried on. This is employed only in the most inner loop of the matrix inversion, where the 256-bit registers ymm* are used to store four double-precision values, pointing to 32-bytes aligned memory regions.

Note that we reserve 32 additional bytes in the calloc calls in order to align the memory block for the matrices.

To avoid unhelpful complications, we’ll require the matrix dimensions to be a multiple of four.

;
; inv3.asm: inverting a matrix
;
          default rel
          extern printf, clock_gettime, rand, srand, calloc, atoi, putchar

%macro    prelude 0
          push   rbp
          mov    rbp, rsp
%endmacro

          section .text
          global main
main:
          prelude
          sub   rsp, 48

          mov   rdi, [rsi+8]       ; get matrix size from argv[1]
          call  atoi wrt ..plt
          mov   r15, rax

          mov   edi, eax
          and   edi, 3
          jz    is_m4

          lea   rdi, [notm4]
          mov   eax, eax
          call  printf wrt ..plt
          leave
          ret

is_m4:
          mul   eax                ; n^2
          add   eax, 4             ; 32 extra bytes
          mov   edi, eax
          mov   rsi, 8             ; no wasted space
ccal:
          call  calloc wrt ..plt
          mov   edi, eax
          and   edi, 0x1f
          jz    a32fixed1
          mov   esi, 32
          sub   esi, edi
          add   rax, rsi
a32fixed1:
          mov   qword [m1], rax

          mov   rax, r15
          mul   eax                ; n^2
          add   eax, 4             ; 32 extra bytes
          xor   rdi, rdi
          mov   edi, eax
          mov   rsi, 8             ; no wasted space
          call  calloc wrt ..plt
          mov   edi, eax
          and   edi, 0x1f
          jz    a32fixed2
          mov   esi, 32
          sub   esi, edi
          add   rax, rsi
a32fixed2:
          mov   qword [m2], rax

; srand(0)
          xor   rdi, rdi
          call  srand wrt ..plt

; build random on m1
          mov   rax, [randmax]
          cvtsi2sd xmm4, rax

          xor   r12, r12
nextr:
          xor   r13, r13
nextc:
          movdqu  oword [rsp+32], xmm4
          call  rand wrt ..plt
          movdqu  xmm4, oword [rsp+32]
          cvtsi2sd xmm0, rax
          divpd    xmm0, xmm4      ; range is 0-1

          mov   rax, r15
          imul  rax, r12
          add   rax, r13
          imul  rax, 8
;         movapd   [m1+rax], xmm0   -> relocation R_X86_64_32
          mov   r14, qword [m1]
          add   r14, rax
          movlpd   [r14], xmm0     ; low quadword
          inc   r13
          cmp   r13, r15
          jl    nextc
          inc   r12
          cmp   r12, r15
          jl    nextr

; build identity on m2
          xor   r12, r12
nextri:
          mov   rax, r15
          imul  rax, r12
          add   rax, r12
          imul  rax, 8
;         movapd   [m2+rax], xmm0   -> relocation R_X86_64_32
          mov   r14, qword [m2]
          add   r14, rax
          mov   rax, 1
          cvtsi2sd xmm0, rax
          movlpd   [r14], xmm0     ; low quadword
          inc   r12
          cmp   r12, r15
          jl    nextri
          
; printing
          mov   r14, qword [m1]
          call  show
          mov   r14, qword [m2]
          call  show

; get start time
          mov   rdi, 0
          mov   rsi, rsp
          call  clock_gettime wrt ..plt

; begin invertion
          mov   rax, 1
          cvtsi2sd xmm5, rax       ; xmm5 contains 1.0

          xor   r13, r13           ; main loop: column iteration
inextc:
          mov   rax, r15
          imul  rax, r13
          add   rax, r13
          imul  rax, 8
          mov   r14, qword [m1]
          add   r14, rax
          movlpd   xmm0, [r14]
          movapd   xmm3, xmm5
          divpd    xmm3, xmm0      ; xmm3 has ipivot = 1.0/m1[c*n+c];

          xor   rbx, rbx           ; normalize pivot cell: row factor loop
inextx:
          mov   rax, r15
          imul  rax, r13
          add   rax, rbx
          imul  rax, 8

          mov   r14, qword [m1]
          add   r14, rax
          movlpd   xmm0, [r14]     ; get m1[c*n+x]
          mulpd    xmm0, xmm3
          movlpd   [r14], xmm0     ; m1[c*n+x] *= ipivot;

          mov   r14, qword [m2]
          add   r14, rax
          movlpd   xmm0, [r14]     ; get m1[c*n+x]
          mulpd    xmm0, xmm3
          movlpd   [r14], xmm0     ; m2[c*n+x] *= ipivot;

          inc   rbx
          cmp   rbx, r15
          jl    inextx
          xor   r12, r12
inextr:
          cmp   r12, r13           ; if(r != c)
          je    inextre

          mov   rax, r15
          imul  rax, r12
          add   rax, r13
          imul  rax, 8
          mov   r14, qword [m1]
          add   r14, rax
          movlpd   xmm3, [r14]     ; factor = m1[r*n+c];
          shufpd   xmm3, xmm3, 0x00
          vinserti128 ymm3, ymm3, xmm3, 1

          xor   rbx, rbx
inextxx:
          mov   rax, r15
          imul  rax, r13
          add   rax, rbx
;          imul  rax, 8
          shl   rax, 3
          mov   r14, qword [m1]
          add   r14, rax
          vmovapd  ymm2, [r14]     ; m1[c*n+x] and next
          vmulpd   ymm2, ymm3      ; factor * m1[c*n+x]
          mov   r14, qword [m2]
          add   r14, rax
          vmovapd  ymm1, [r14]     ; m2[c*n+x] and next
          vmulpd   ymm1, ymm3      ; factor * m2[c*n+x]
          mov   rax, r15
          imul  rax, r12
          add   rax, rbx
;          imul  rax, 8
          shl   rax, 3
          mov   r14, qword [m1]
          add   r14, rax
          vmovapd  ymm0, [r14]     ; m1[r*n+x] and next
          vsubpd   ymm0, ymm2
          vmovapd  [r14], ymm0     ; m1[r*n+x] -= factor * m1[c*n+x]
          mov   r14, qword [m2]
          add   r14, rax
          vmovapd  ymm0, [r14]     ; m2[r*n+x]
          vsubpd   ymm0, ymm1
          vmovapd  [r14], ymm0     ; m2[r*n+x] -= factor * m2[c*n+x]

          add   rbx, 4
          cmp   rbx, r15
          jl    inextxx
inextre:
          inc   r12
          cmp   r12, r15
          jl    inextr

          inc   r13
          cmp   r13, r15
          jl    inextc


; get end time
          mov   rdi, 0
          lea   rsi, [rsp+16]
          call  clock_gettime wrt ..plt

; calculate time difference
          mov   rax, 1000000000
          cvtsi2sd xmm4, rax
          mov   rax, [rsp]
          cvtsi2sd xmm0, rax
          mov   rax, [rsp+8]
          cvtsi2sd xmm1, rax
          mov   rax, [rsp+16]
          cvtsi2sd xmm2, rax
          mov   rax, [rsp+24]
          cvtsi2sd xmm3, rax
          subpd    xmm3, xmm1
          subpd    xmm2, xmm0
          divpd    xmm3, xmm4
          movapd   xmm0, xmm2    ; 
          addpd    xmm0, xmm3

; show time difference
          lea   rdi, [format]
          mov   eax, 1
          call  printf wrt ..plt

; printing again
          mov   r14, qword [m1]
          call  show
          mov   r14, qword [m2]
          call  show

; get total
          mov   r14, qword [m2]
          call  matrixtotal
          
          lea   rdi, [ftotal]
          mov   eax, 1
          call  printf wrt ..plt

; get trace
          mov   r14, qword [m2]
          call  matrixtrace
          
          lea   rdi, [ftrace]
          mov   eax, 1
          call  printf wrt ..plt

          leave
          ret

; r15 has matrix size, r14 has pointer to the data, result in xmm0
matrixtotal:
          prelude
          sub   rsp, 16
          mov   qword [rsp], r14
          pxor  xmm0, xmm0
          xor   r12, r12
tonextr:
          xor   r13, r13
tonextc:
          mov   rax, r15
          imul  rax, r12
          add   rax, r13
          imul  rax, 8
          mov   r14, qword [rsp]
          add   r14, rax
          movlpd   xmm1, [r14]
          addpd    xmm0, xmm1
          inc   r13
          cmp   r13, r15
          jl    tonextc
          inc   r12
          cmp   r12, r15
          jl    tonextr
          leave
          ret

; r15 has matrix size, r14 has pointer to the data, result in xmm0
matrixtrace:
          prelude
          sub   rsp, 16
          mov   qword [rsp], r14
          pxor  xmm0, xmm0
          xor   r12, r12
trnextr:
          mov   rax, r15
          imul  rax, r12
          add   rax, r12
          imul  rax, 8
          mov   r14, qword [rsp]
          add   r14, rax
          movlpd   xmm1, [r14]
          addpd    xmm0, xmm1
          inc   r12
          cmp   r12, r15
          jl    trnextr
          leave
          ret

; r15 has matrix size, r14 has pointer to the data
show:
          prelude
          sub   rsp, 16
          cmp   r15, 10
          jg    toobig
          mov   qword [rsp], r14
          xor   r12, r12
snextr:
          xor   r13, r13
snextc:
          mov   rax, r15
          imul  rax, r12
          add   rax, r13
          imul  rax, 8
          mov   r14, qword [rsp]
          add   r14, rax
          movlpd   xmm0, [r14]
          lea   rdi, [mformat]
          mov   eax, 1
          call  printf wrt ..plt
          inc   r13
          cmp   r13, r15
          jl    snextc
          mov   rdi, 10
          call  putchar wrt ..plt
          inc   r12
          cmp   r12, r15
          jl    snextr
          mov   rdi, 10
          call  putchar wrt ..plt

          leave
          ret
toobig:   lea   rdi, [ftoobig]
          xor   eax, eax
          call  printf wrt ..plt
          leave
          ret

          section .note.GNU-stack noalloc noexec nowrite progbits


; C 0-terminated strings
          section .rodata
notm4:    db "Matrix must have dimensions multiple of four", 10, 0
format:   db "time dif is %g sec", 10, 0
ftoobig:  db "Too big to show", 10, 0
ftotal:   db "total  = %g", 10, 0
ftrace:   db "trace  = %g", 10, 0
mformat:  db "%10g ", 0
randmax:  dq 2147483647
          section .bss
n:        resb 8
m1:       resb 8
m2:       resb 8

The times were barely improved from the previous version, maybe because of memory bandwith limitations.

diego@dataone:~/devel/ASM$ ./inv3 500
time dif is 0.523782 sec
total  = 1.94742
trace  = 3.69845
diego@dataone:~/devel/ASM$ ./inv3 500
time dif is 0.534893 sec
total  = 1.94742
trace  = 3.69845

To copy the lower part of ymm3 into its high part, we used the VINSERTI128 instruction, effectively quadrupling the factor value. From https://www.felixcloutier.com/x86/vinserti128:vinserti32x4:vinserti64x2:vinserti32x8:vinserti64x4 its behavior is:

TEMP[255:0] <-SRC1[255:0]
CASE (imm8[0]) OF
    0: TEMP[127:0]<-SRC2[127:0]
    1: TEMP[255:128]<-SRC2[127:0]
ESAC
DEST <-TEMP

In this reference this instruction is listed as AVX2 in the col "CPUID Feature Flag". That the "AVX2" extension is present in our test cpu can be seen with the /proc/cpuinfo file:

processor	: 0
vendor_id	: AuthenticAMD
cpu family	: 21
model		: 101
model name	: AMD A8-9600 RADEON R7, 10 COMPUTE CORES 4C+6G
...
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp
 lm constant_tsc rep_good acc_power nopl nonstop_tsc cpuid extd_apicid aperfmper
f pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx
f16c lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefet
ch osvw ibs xop skinit wdt lwp fma4 tce nodeid_msr tbm topoext perfctr_core perf
ctr_nb bpext ptsc mwaitx cpb hw_pstate ssbd ibpb vmmcall fsgsbase bmi1 avx2 smep
 bmi2 xsaveopt arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid
 decodeassists pausefilter pfthreshold avic vgif overflow_recov

It is interesting to note that this program crashed in another equipment which doesn’t have such extension:

$ ./inv3 500
Too big to show
Too big to show
Illegal instruction (core dumped)
$ more /proc/cpuinfo
processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 42
model name	: Intel(R) Core(TM) i3-2120 CPU @ 3.30GHz
...
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 clflush dts acpi mmx fxsr sse sse2 ht tm pbe syscall nx rdtscp lm cons
tant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmp
erf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm pcid ss
e4_1 sse4_2 popcnt tsc_deadline_timer xsave avx lahf_lm epb pti ssbd ibrs ibpb s
tibp tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm arat pln pts md_clear
Note
The core dumped message is misleading. See for example https://stackoverflow.com/questions/2065912

Running a session of gdb:

$ gdb ./inv3
GNU gdb (Debian 13.1-3) 13.1
...
(gdb) run 500
Starting program: /home/diego/git/asm-repo/inv3 500
...
Program received signal SIGILL, Illegal instruction.
inextr () at inv3.asm:176
176	          vinserti128 ymm3, ymm3, xmm3, 1

The culprit seems to be the vinserti128 instruction, which is part of the AVX2 extension.

Now how to figure out the supported cpu features (like AVX2, SSE, etc.) in runtime? in brief, the cpuid instruction is in charge (see for example https://stackoverflow.com/a/17754393/4876728); for the AVX2 case, the Intel Software Developer’s Manual (section 14.7.1: Detection of Intel® AVX2) provides sample code which we adapted for an inv4.asm version that improves over the previous inv3.asm (we only show the differences between those source files):

$ diff -c inv3.asm inv4.asm
*** inv3.asm
--- inv4.asm
***************
*** 15,20 ****
--- 15,38 ----
            prelude
            sub   rsp, 48

+           mov   eax, 1
+           cpuid
+           and   ecx, 0x018000000
+           cmp   ecx, 0x018000000
+           jne   n_avx2
+           mov   eax, 7
+           mov   ecx, 0
+           cpuid
+           and   ebx, 0x20
+           cmp   ebx, 0x20
+           je    w_avx2
+ n_avx2:
+           lea   rdi, [notavx2]
+           mov   eax, eax
+           call  printf wrt ..plt
+           leave
+           ret
+ w_avx2:
            mov   rdi, [rsi+8]       ; get matrix size from argv[1]
            call  atoi wrt ..plt
            mov   r15, rax
***************
*** 365,370 ****
--- 383,389 ----

  ; C 0-terminated strings
            section .rodata
+ notavx2:  db "This cpu does not support the AVX2 extension", 10, 0
  notm4:    db "Matrix must have dimensions multiple of four", 10, 0
  format:   db "time dif is %g sec", 10, 0
  ftoobig:  db "Too big to show", 10, 0

This way, in the cpu missing AVX2 we get:

$ ./inv4 500
This cpu does not support the AVX2 extension

Appendix: Makefile

PROGS = test1 test2 test3 test4 exit32 model write \
	triangle test5 test6 test7 test8 test9 timing \
	sizes inv model-inv inv2 inv3 inv4 packed shuffle \
	xdump1

AS = nasm
ifeq ($(AS),nasm)
ASOPTS = -f elf64 -g -l
else
ASOPTS = -f elf64 -g dwarf2 -l
endif

# for yasm, run with:
# make "AS=yasm"

all: $(PROGS) assembler.html

.SUFFIXES: .o .asm

assembler.html: assembler.adoc
	asciidoctor assembler.adoc

sizes: sizes.asm xdump.o
	$(AS) $(ASOPTS) sizes.lst sizes.asm
	gcc -O0 -o sizes sizes.o xdump.o

xdump1: xdump1.c xdump.o
	gcc -g -c xdump1.c
	gcc -o xdump1 xdump1.o xdump.o

model-inv: model-inv.c
	gcc -o model-inv model-inv.c 

model: model.c
	gcc -S model.c
	gcc -o model model.c

test1: test1.asm
	$(AS) $(ASOPTS) test1.lst test1.asm
	ld --dynamic-linker=/lib64/ld-linux-x86-64.so.2  -o test1 test1.o -lc

test4: test4.asm
	$(AS) $(ASOPTS) test4.lst test4.asm
	gcc -no-pie -o test4 test4.o

test3: test3.asm
	$(AS) $(ASOPTS) test3.lst test3.asm
	gcc -o $@ test3.o

exit32: exit32.asm
	$(AS) $(ASOPTS) exit32.lst exit32.asm
	ld -o exit32 exit32.o

.o:
	gcc -o $@ $<

.asm.o:
	$(AS) $(ASOPTS) $*.lst $<

clean:
	rm -f $(PROGS) *.o *.lst *.s assembler.html