Post #0: In Which I Set Out to Write a Terrible Compiler for a Useless Language

... and accidentally write a quine generator (keep reading)

Hi! I'm oreganoli. I write code and pseudo-intellectual scribblings. This post is going to be mostly the latter, with snippets of the former throughout, not all actually mine.

I recently got the idea to write a toy x86 compiler for Brainfuck. Brainfuck is a programming language of the type some call a "Turing tarpit" - any calculation is theoretically doable in it, but it has no practical use because of just how simple it is. A standard implementation of Brainfuck gives you as much as 30 kilobytes of memory to play with, a pointer, an if instruction, an increment/decrement instruction, and what amounts to a single-byte fputs and a getchar() function. As we can see, the possibilities are endless, and knowledge of the language and its internals is a valuable addition to any resumé.

Have you even read the dragon book?

I've heard it exists.

Why do such a damn fool thing?

First of all, it just seemed like a good idea at the time. Second, with university classes picking up soon, I felt like I needed to flex my programming muscles and warm up. Third, I was intrigued by the lower-level topics of software engineering - CPU-specific assembly code and the inner workings of programming languages, which just aren't in my curriculum. I could have maybe gone to a polytechnic uni, but they have physics there.

What exactly are you trying to do here?

The goal of this exercise is straightforward: to write a program that takes a string of valid Brainfuck source code and produces the equivalent x86-64 native Linux executable - in other words, a compiler.

But keep in mind, dear reader, that I don't actually know x86 assembly or the theory behind programming language parsers and I'm going to try to learn by looking things up as I go. Therefore, I will first walk you through implementing a much simpler and necessarily Turing-incomplete programming language - one in which only "Hello, world!" executables displaying a piece of text and exiting can be written (but for any arbitrary piece of text!).

Implementing The Hello World Display Programming Language

To figure out how to generate x86 machine code that when run displays a string and exits, we should first find and analyze an example. Luckily, there is no shortage of such examples on the internet. We're going to analyze the latter, from the book "Linux Assembly How-to", chapter 6.1. Let's dig in:

section .text ; section declaration
;we must export the entry point to the ELF linker or
global  _start ; loader. They conventionally recognize (...)

_start:
    ; write our string to stdout
    mov edx, len ; third argument: message length
    mov ecx, msg ; second argument: pointer to message to write
    mov ebx, 1 ; first argument: file handle (stdout)
    mov eax, 4 ; system call number (sys_write)
    int 0x80 ; call kernel
    ; and exit
  	mov ebx, 0  ; first syscall argument: exit code
    mov eax, 1 ; system call number (sys_exit)
    int 0x80 ; call kernel

section .data ; section declaration
msg db "Hello, world!", 0xa ;our dear string
len equ $ - msg ; length of our dear string

What exactly are we seeing here? I don't know, let's find out by looking everything up!

The first words we see are:

section .text

As Rideau, Boldyshev and Noordergraaf have already explained in the comments delimited by ;, an .asm source code file is broken into sections for the assembler to treat differently. text is the code itself and is the only required section. On the next line, to appease Linux we must write:

global _start

Now, global, like section, isn't actual assembly code. It's an assembler directive that lets other code call the function whose name comes after it. _start is our equivalent to the fn main() of higher-level programming languages - it's the function that convention dictates be run first when an executable launches. Next follows the actual function declaration:

_start:

Only a colon? That's all? Reminds you of Python, doesn't it? But where are the parentheses and the list of arguments, you might ask?

The answer is: nowhere. In assembly code, we get our arguments, or pointers to where they are in memory, directly from the CPU's registers (small pieces of memory built into a processor that can hold single numbers) or the stack (we'll get to that later, maybe). The standard for how exactly this corresponds to the concept of functions in a higher-level language like C is called a calling convention, and on the x86 architecture there is more than one. Thankfully, we're only going to be writing a compiler for one OS - modern x86-64 Linux, so we don't need to learn them. Moving on:

As per the Wikibook on x86 assembly, mov is the x86 instruction for copying a number from one place in memory to another, provided the numbers are of the same size. As we've been using NASM/Intel syntax, the first "argument", or rather operand to mov is the destination and the second is the source rather than the other way around. Let's look at how we're supposed to go about writing a string of bytes to the standard output:

mov edx, len
mov ecx, msg
mov ebx, 1
mov eax, 4
int 0x80

The first obvious question to ask is what eax, ebx, ecx and edx are. They are registers - general-purpose registers to be precise. Each of them has a designated conventional use:

The 8 GPRs are:

Accumulator register (AX). Used in arithmetic operations.
Counter register (CX). Used in shift/rotate instructions and loops.
Data register (DX). Used in arithmetic operations and I/O operations.
Base register (BX). Used as a pointer to data (located in segment register DS, >when in segmented mode).
Stack Pointer register (SP). Pointer to the top of the stack.
Stack Base Pointer register (BP). Used to point to the base of the stack.
Source Index register (SI). Used as a pointer to a source in stream operations.
Destination Index register (DI). Used as a pointer to a destination in stream >operations.
The order in which they are listed here is for a reason: it is the same order that is used in a push-to-stack operation, which will be covered later.

The "E" in front of these names indicates that we're dealing with a 32-bit register. The x86 architecture started with the Intel 8086, which was a 16-bit processor, so a 32-bit register would have been an extended one from an 8086 programmer's point of view, thus the E. But wait - 32 bits? Doesn't this remind us of something?

If you answered "a four-byte C int", you guessed correctly! We'll get back to that in a minute.

len and msg are obviously variable names of some sort, and indeed we find declarations for them in the .data section of our program:

msg db "Hello, world!",0xa
len equ $ - msg

Long story short, db means that we're dealing with a string of bytes here. Unfortunately, NASM doesn't do C-like escape sequences like \n, so we must write our customary newline byte as a raw number - 0xA - and concatenate the string and the newline with a comma. Seems reasonable enough. len uses some more NASM directive magic - equ is a sort of equivalent to a #define in C, $ is the address of the place in memory we should be in right now and we subtract the position of len to get the length of the string that starts there. So far this has been pretty similar to C arrays.

And now back to the Unix system call we're making - what's the signature of the arguments of the sys_write function?

ssize_t sys_write(unsigned int fd, const char * buf, size_t count)

Because the Linux calling convention goes right-to-left, it should be easy to see where this is going. len corresponds to count, msg to the pointer buf, and the file descriptor number, fd, is a constant 1 - which on Linux is the standard output's number as per the POSIX standard.

Okay, but what about the next two lines? int 0x80 might look familiar to us - at first glance we might think it's some sort of integer declaration. We couldn't be further from the truth, however - int stands not for "integer", but for "software interrupt"! Without getting into the nitty-gritty of it, interrupts temporarily "save our progress" by safely storing away the program counter - essentially the pointer to the current instruction - and hand control of the CPU over to somewhere else, that "somewhere else" being in our case the Linux kernel. On Unix systems, 0x80 is conventionally the only interrupt programs are supposed to make - it tells the kernel we want to make a system call, or syscall for short, ie. our program needs something from the OS that it can't manage itself. By loading 4 into eax beforehand, we tell Linux that we want to make syscall number 4, that is: we want to write some bytes to a file.

But int 0x80 is an obsolete way to make syscalls!

I know. We're not going to be doing this when we actually get around to writing the Brainfuck compiler proper, but this is what the example "hello world" programs I found had and it Works on My Machine™. If you want to read ahead already, look up the syscall instruction.

Moving on, the exit call looks pretty similar:

mov ebx, 0
mov eax, 1
int 0x80

sys_exit is call number 1 and we want to exit with the status code 0, meaning our program has executed just fine. The calling convention again dictates we put those numbers in in the reverse order. That's it - that's a whole "hello world" program. To turn it into actual machine code, we'll need NASM and ld. The former should be in your distro's repos, and the latter you should have already. We invoke:

$ nasm -f elf64 hello.asm

This will produce a Unix object file we'll have to link for it to become an actual executable, a whole topic in itself we probably won't be covering anytime soon or ever:

$ ld -s -o hello hello.o

And finally, we should see:

$ ./hello
Hello, world!

Whew!

Implementing our first compiler

Now that we understand what makes a native hello world application tick, we can start making our own. As a card-carrying member of the Rust Evangelism Strike Force, I must note that we will be doing this in Rust, as it is my favorite language the best language for all use cases ever and you should use it whenever possible, even if you can't play to its strengths, and if you disagree with me, you are simply wrong. Jokes aside, compilers should be fast and interpreted or GC-reliant languages aren't what many people first think of when asked to write systems software, but you'll eventually find out as you keep reading that the choice of language really isn't important here. Nevertheless, Rust is a good language for implementing other languages.

A cargo new hwdpc later, we have the template for a minimal Rust program. Appropriately, it's a hello world:

fn main() {
    println!("Hello, world!");
}

If we were implementing a real programming language with rules, capable of actual computation, the first thing our program would need to do would be parsing the source file(s), and I'd need to explain things like abstract syntax trees, or worse - implement them. Luckily, we're not! What we do need to do, however, is transform the input string into a NASM-compatible form.

fn sanitize_string(input: &str) -> String { unimplemented!() }
#[test]
fn test_sanitizer() {
    assert_eq!(
        sanitize_string("Hello!\n"),
        r#""Hello!",0xA"#
    );
    assert_eq!(
        sanitize_string("Hello!"),
        r#""Hello!",0xA"#
    );
    assert_eq!(
        sanitize_string("Hello...\n...world!\n"),
        r#""Hello...",0xA,"...world!",0xA"#
    );
    assert_eq!(
        sanitize_string(""),
        "\"\""
    );
}

The r# syntax lets me use quote characters freely without escape sequences as long as there's a # after the final quote char. Intuitive readers will see where I'm going with this function.

Isn't there a more efficient way to do this? Shouldn't NASM support escape sequences somehow?

I think there is and it does, but I'd have to go back and write more words. I want to be writing code right now. Behold, crappy code that solves our problem well enough:

const BYTES: [u8; 5] = [b'\n', b'\r', b'\t', b'\\', b'\0'];
const CODES: [&'static str; 5] = ["0xA", "0xD", "0x9", "0x5C", "0x0"];
// there are more escape sequences, yes, but we'll just say they're undefined behavior
fn escape_sequence(input: u8) -> Option<&'static str> {
    for pair in BYTES.iter().zip(CODES.iter()) {
        if input == *pair.0 {
            return Some(*pair.1)
        }
    }
    None
}
use std::char;
fn sanitize_string(input: &str) -> String {
    // in the case of an empty input string, we return just an empty string:
    // empty source code can probably be a sort of artistic statement or something
    if input.is_empty() {
        return "\"\"".to_string()
    }
    let mut output = String::new();
    let mut within_quote = false; // are we already within a pair of quotes?
    let in_bytes = input.as_bytes();
    let in_len = in_bytes.len();
    for (index, ch) in in_bytes.iter().enumerate() {
        match escape_sequence(*ch) {
            None => {
                // normal character
                if !within_quote {
                    output.push('"');
                    within_quote = true;
                }
                output.push(char::from_u32(*ch as u32).expect("Invalid character"))
            },
            Some(esc) => {
                if within_quote {
                    output.push_str("\",");
                    within_quote = false;
                    output.push_str(esc);
                    if index < in_len - 1 {
                        output.push(',');
                    }
                }
            }
        }
        // if we're at the last segment
        if index == in_len - 1 {
            if within_quote {
                output.push('"'); // close quote
            }
            if *ch != b'\n' {
                // add newline if missing
                output.push_str(",0xA");
            }
        }
    }
    output
}

The above code generates a NASM-friendly "string literal" we'll use later. Let us now move on to code generation.

Code generation

Remember the hello world code I copied from a tutorial? I saved it in the source folder of the Rust project as template.asm and made some alterations to it.

section .text
global  _start
_start:
mov edx,len
mov ecx,msg
mov ebx,1
mov eax,4
int 0x80
mov eax,1
int 0x80
section .data
msg db {}
len equ $-msg

The filename should be a hint as to where I made the most significant change. Now for some more tangential Rust stuff we need:

Command-line argument parsing

What do big-boy compilers like GCC have that we don't? An actual CLI. We need to take the name of our source file and optionally that of an output file. We could roll our own argument parser, but there's a crate for that. We add the following lines to Cargo.toml:

[dependencies]
argh = "0.1.4"

Here's all the code necessary to use argh:

use argh::FromArgs;
#[derive(FromArgs)]
/// HWDPC - Hello World Display Program Compiler, v0.1
struct App {
    #[argh(positional, description = "source filename")]
    source_filename: String,
    #[argh(option, short = 'o', description = "output filename")]
    output: Option<String>
}

fn main() {
    let opts: App = argh::from_env();
    println!("{}", generate_nasm("Hello, world, from our compiler"));
}

Should we now invoke hwdpc with a --help flag, we'll get a helpful message:

Usage: hwdpc <source_filename> [-o <output>]

HWDPC - Hello World Display Program Compiler, v0.1

Options:
  -o, --output      output filename
  --help            display usage information

So beautiful. So professional.

Compilation

Finally. We're almost there.

use std::fs;
use std::process::Command;
fn main() {
    let opts: App = argh::from_env();
    let source_code = fs::read_to_string(&opts.source_filename).expect("Could not open source file");
    fs::create_dir_all("build").expect("Could not create build directory");
    let asm_filename = format!("build/{}.asm", opts.source_filename.split('.').next().expect("Expected non-empty source filename"));
    let output_filename = match opts.output {
        None => format!("{}.out", opts.source_filename.split('.').next().expect("Expected non-empty source filename")),
        Some(s) => s
    };
    let nasm = generate_nasm(&source_code);
    fs::write(&asm_filename, nasm).expect("Could not write assembly file");
    let assembler = Command::new("nasm")
    .arg("-f")
    .arg("elf64")
    .arg(&asm_filename)
    .arg("-o")
    .arg("build/build.o")
    .status().expect("Could not launch NASM assembler. Are you sure it's installed?");
    if !assembler.success() {
        eprintln!("Assembler exited with non-0 status code. This is likely a compiler error.");
        std::process::exit(1);
    }
    let linker = Command::new("ld")
    .arg("-s")
    .arg("-o")
    .arg(&output_filename)
    .arg("build/build.o")
    .status().expect("Could not launch `ld`.");
    if !linker.success() {
        eprintln!("Linker exited with non-0 status code.");
        std::process::exit(1);
    }
}
fn generate_nasm(source_code: &str) -> String {
    format!(include_str!("template.asm"), sanitize_string(source_code))
    // template.asm is just a string literal for the format! macro, but in a separate file
    // the empty {} curly braces on line 12 are where our data gets inserted
    // Rust is cool like that
}

Now let's create a file named main.hwdp:

Hello
world
from
HWDP lang!

And, after building it with cargo build, invoke our compiler:

$ target/build/hwdpc main.hwdp -o main
$ ./main
Hello
world
from
HWDP lang!

Working as intended! In fact, our compiler will work even with empty files:

$ touch zero.hwdp
$ target/build/hwdpc zero.hwdp -o zero
$ ./zero
$

Isn't this...

Could it really be...

Maybe I should check the definition...

Wikipedia: A quine is a computer program which takes no input and produces a copy of its own source code as its only output. The standard terms for these programs in the computability theory and computer science literature are "self-replicating programs", "self-reproducing programs", and "self-copying programs".

Get this. Every single program in our toy "language" is a quine. Where in another language you'd have to come up with something clever, with HWDP you don't have to do anything. In fact, it can produce the shortest quines possible - quines of length 0! And I didn't even explicitly set out to do this.

HWDP makes automatic what elsewhere would require actual effort.

Wait, so your "compiler" doesn't actually generate machine code, it generates textual NASM and then just feeds it into nasm and then the output of that into ld?

Dear reader, did you really expect me to learn how assembly code and executable linkage work? That would require time, time that could be spent writing smug blogposts in which I show off my mediocre code! Besides, don't big-boy compilers like Clang or rustc also produce intermediate representations whose actual conversion into machine code they outsource to other projects?

It still isn't a real compiler.

Hey, it fits my definition of "compiler". It's a program that eats text and shits out machine code. Now I get to say I've written an x86 compiler and I get to call myself a Real Programmer™.

Quines only count if they're in a Turing-complete language.

At least allow me the small victory of a 0-byte quine.

I've tried this with a non-ASCII source file and it isn't actually a quine.

Shhhhhhhhhh. I know. I tried. Now go and write an ASCII-only one.

I don't like your writing style. It's smug, vulgar, presumptous and try-hard. Self-deprecation and sarcasm aren't funny.

I know, but I don't know anything else. It just comes naturally to me.

Your code makes Rust look bad and ugly. Shouldn't you be evangelizing for a language you don't like?

Damn, you're right!

Conclusion

We now know how to write a toy "compiler" for a "language" arguably even more useless than Brainfuck, and in the process we have learned how to make Linux system calls in x86 assembly (in the unsupported, legacy way). Next time, I'll be writing and describing a Brainfuck parser that will transform a string of text into something understandable to a compiler. I'll post the Git repo for the code written up until now later™.