Post #3: In Which I Implement CTO for Even Faster Uselessness

Recap

This is the third and final post in the Terrible Compiler for a Useless Language series, in which I walk you through implementing BFC-RS, a Brainfuck compiler for x86-64 Linux written in Rust.

As of now, our compiler is essentially feature-complete, and technically there's nothing wrong with the code it produces. Its output is a one-to-one (or one-to-many in the case of I/O and loop instructions) translation of the following Brainfuck instructions:

  • [X] move pointer backward
  • [X] move pointer forward
  • [X] decrement byte at pointer
  • [X] increment byte at pointer
  • [X] get byte
  • [X] put byte
  • [X] begin while loop
  • [X] end while loop

However, as we've noticed, the NASM code we produce is really hard to read and quite inefficient, because it is extremely repetitive. Consider the following program, which just says "Hi!":

++++++++++++++++++++
++++++++++++++++++++
++++++++++++++++++++
++++++++++
++
Increment current byte to 72 (= 'H' in ASCII)

> Move forward

++++++++++++++++++++
++++++++++++++++++++
++++++++++++++++++++
++++++++++++++++++++
++++++++++++++++++++
+++++
Increment current byte to 105 (= 'i')

> Move forward

++++++++++++++++++++
++++++++++
+++
Increment current byte to 33 (= '!')

> Move forward

++++++++++ Increment current byte to 10 (= '\n')

<<< Move backward 3 times in a row

.>.>.>. Print current byte; move forward and print current byte 3 times in a row; effectively print a string of length 4

If I was an insane savant I could maybe figure out something short and clever with loops but I'm not

Our compiler is quite naïve - it will produce 72 + 1 + 105 + 1 + 33 + 1 + 10 + 3 + (4 + 3 * (1 + 4)) = a whopping 245 individual assembly instructions, or 557 bytes of active code when assembled, as the text column in the output of the size command will tell us. Remember, every "put byte" instruction is its own syscall, and thus requires four instructions to fully set up and call - that's where the last term comes from: the first . costs 4 instructions, and each of the three >. pairs after it costs 1 + 4 more. Let's compile our file, hi.bf, to hi_naive (we'll need both later, for benchmarking):

$ target/debug/bfc-rs hi.bf -o hi_naive
$ ./hi_naive
Hi!
$ size hi_naive
   text    data     bss     dec     hex filename
    557   30000       0   30557    775d hi_naive

And how are we going to make this more efficient?

Easy! Read hi.bf again carefully. What patterns do we notice that correspond to constructs we know from higher-level programming languages?

  • + repeated n times. That's a pretty obvious *pointer += n.
  • < repeated n times. That's a pointer -= n.
  • ., followed by >. repeated n times. That's printing a string starting at pointer of length n + 1.

All we need to do is catch these repetitions at parse time and implement our own += n, -= n and sys_write(1, pointer, n + 1) instructions at compile time.

Virtual instructions

Let us then declare a submodule of parser.rs, named cto, and a tests submodule for cto.rs itself:

// In parser.rs:

/// Compile-time optimization module.
mod cto;

// In cto.rs:

#[cfg(test)]
mod tests;

use super::BrainfuckInstr;

/// Optimizes a list of Brainfuck instructions to be less repetitive.
pub fn optimize(code: &mut Vec<BrainfuckInstr>) {
    unimplemented!()
}

// In cto/tests.rs:

const HI_BF: &str = include_str!("../../../hi.bf");

use super::{
    super::Parser,
    BrainfuckInstr,
    optimize
};

#[test]
fn test_hi_bf() {
    let mut parser = Parser::new();
    let mut code = parser
    .parse(HI_BF)
    .expect("Parser shat the bed: hi.bf should be a valid Brainfuck program.");
    optimize(&mut code);
    /* We haven't yet written the "virtual" instructions
    to expect repetitive patterns in hi.bf to be converted into. */
    unimplemented!()
}

Unsurprisingly, the test will fail before it even reaches unimplemented!(), because there's one inside optimize itself. Let us then define our virtual (as opposed to those naturally occurring) instructions way back in BrainfuckInstr. We'll call them "virtual" from now on because "made up" doesn't sound serious, boring and programmerly enough. There does exist, however, a concept of fantasy hardware, mostly in homebrew console gaming, so we could call them fantasy instructions for a fantasy CPU if we really wanted to.

/// ...
EndWhile, // ← add comma here
/* The instructions below are our own virtual instructions that exist for optimization purposes.
They DO NOT occur naturally in Brainfuck source code and have no corresponding text characters. */
/// Subtract from the pointer.
PointerSub(u16),
/// Add to the pointer.
PointerAdd(u16),
/// Subtract from the current number.
DataSub(u8),
/// Add to the current number.
DataAdd(u8),
/// Print a number of bytes to the standard output **at once**.
Print(u16)

As you've cleverly noticed when we were writing the assembly generator proper, a Brainfuck program only has 30 thousand bytes of memory at its disposal, so any pointer movement fits into a u16. What is more, each memory cell is only one byte, so there's no reason arithmetic operations should get bigger than a u8, either.

Rust enums - algebraic data types

If you're new to Rust, you'll notice we're still within a plain old enum - that's because enums in Rust can be more like "tagged" unions from other languages than simple enums - they can contain any one of multiple types of data at runtime, but they're "tagged" so they "know" which type they are. Rust enums can also behave like structs, in that they can have an impl block and we can give them their own methods or implement traits (think interfaces on steroids) just like we would with a regular data structure.

A downside of this is that an enum always takes up as much space as its biggest variant needs, plus the size of the "tag" - otherwise Rust wouldn't know how much memory to allocate for an enum whose variant, and thus size, might change at runtime. For bigger types, however, we can get around this with references or Boxes, which manage memory on the heap for us and which we use where we might need malloc in C or new in C++; the pointers they compile down to are always the same predictable and (relatively, depending on what we're doing) small size.
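The classic case where you have no choice about the indirection is a recursive type; here's a minimal sketch (toy type, nothing from BFC-RS):

/// Without the Box, this enum would contain itself directly and rustc
/// would reject it as infinitely sized; the Box makes the payload a
/// pointer-sized handle to heap memory instead.
enum List {
    Cons(i32, Box<List>),
    Nil,
}

fn sum(list: &List) -> i32 {
    match list {
        // `rest` is a &Box<List> here; it deref-coerces to &List at the call.
        List::Cons(n, rest) => n + sum(rest),
        List::Nil => 0,
    }
}

fn main() {
    let list = List::Cons(1, Box::new(List::Cons(2, Box::new(List::Nil))));
    println!("{}", sum(&list)); // 3
}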

Let's consider a type like Option, which effectively replaces the null of JS values or C pointers. (Manually creating a raw null pointer is a Bad Idea unless you're writing an OS or other bare-metal code yourself, or you're dealing with external code Rust can't work its borrow-checking magic on - and Rust will yell at you for it unless you say you know what you're doing with unsafe.)

Option has an empty None variant and a Some(T) variant containing any other type T, thus it might look like this in memory when None:

0 (as many padding bytes as a T takes up on its own)

And like this when Some(T):

1 (a value of type T)

When we match on an enum like this, as you may have noticed me doing with Results and Options already, we're really matching (or switching) on that number in front. Of course, if we have just a plain old enum without this algebraic data type trickery, it will be represented as a plain old integer without padding, like simple enums in other languages.

(I haven't actually checked if that's exactly how it works, but those are the broad strokes. Feel free to correct me in the comments on Reddit or wherever else I outsource my comment section.)
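To make all of that concrete, here's a self-contained sketch (toy names, nothing from BFC-RS) of a data-carrying enum with its own impl block and a match:

use std::mem::size_of;

/// A toy data-carrying enum, structurally similar to our BrainfuckInstr.
enum Demo {
    Empty,
    Payload(u16),
}

// Enums get impl blocks just like structs do.
impl Demo {
    fn describe(&self) -> String {
        // Matching dispatches on the hidden tag, then unpacks the payload.
        match self {
            Demo::Empty => String::from("nothing"),
            Demo::Payload(n) => format!("got {}", n),
        }
    }
}

fn main() {
    // Room for the biggest payload (2 bytes) plus the tag, padded for
    // alignment: 4 bytes on a typical x86-64 build.
    println!("{} bytes", size_of::<Demo>());
    println!("{}", Demo::Payload(42).describe());
}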

But what about integer overflows? Shouldn't they happen all the time when we're dealing with just 8 bits per number?

Do we really care? Let's say it's undefined behavior and the computer might summon demons up your nose or fortuitously do just what you want it to. The people who decide on the C++ standards do this every three years, say something about backwards compatibility and get away with it - why shouldn't we?
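(For the record, the add and sub instructions we'll eventually emit simply wrap modulo 256, and Rust itself doesn't actually leave this undefined either: debug builds panic on overflow, release builds wrap. A tiny sketch of asking for the wrapping explicitly:)

fn main() {
    let x: u8 = 255;
    // `x + 1` would panic in a debug build; wrapping_add spells out
    // the mod-256 behavior in every build.
    println!("{}", x.wrapping_add(1)); // prints 0
}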

Appeasing the Rust compiler

compiler.rs will now throw a compile error and hopefully glow a menacing red color in your IDE, as the match in the instruction translation function no longer covers all cases of BrainfuckInstr. Let's make it cover them:

/*We'll be writing out the full names of our enum's variants from now on
to get proper type hints to show in our IDE, like (x: &u16).
rust-analyzer, the good Rust language server, doesn't deal well with imported enum variants yet,
unless we're talking about Options and Results - those are built-in. */
BrainfuckInstr::PointerSub(x) => {
    unimplemented!()
},
BrainfuckInstr::PointerAdd(x) => {
    unimplemented!()
},
BrainfuckInstr::DataSub(x) => {
    unimplemented!()
},
BrainfuckInstr::DataAdd(x) => {
    unimplemented!()
},
BrainfuckInstr::Print(x) => {
    unimplemented!()
}

We could have also just written _ => unimplemented!(), but this gives us an obvious visual reminder of what we need to do. We can now write our optimization tests and have them all fail:

const HI_BF: &str = include_str!("../../../hi.bf");

use super::{
    super::Parser,
    BrainfuckInstr,
    optimize
};
use BrainfuckInstr::*;

#[test]
/// Tests the optimization of a representative Brainfuck program, "hi.bf".
fn test_hi_bf() {
    let mut parser = Parser::new();
    let mut code = parser
    .parse(HI_BF)
    .expect("Parser shat the bed: hi.bf should be a valid Brainfuck program.");
    optimize(&mut code);
    let expected_code: Vec<BrainfuckInstr> = vec![
        DataAdd(72),
        PointerInc,
        DataAdd(105),
        PointerInc,
        DataAdd(33),
        PointerInc,
        DataAdd(10),
        PointerSub(3),
        Print(4)
    ];
    assert_eq!(&code, &expected_code)
}

#[test]
/// Test for pointer arithmetic optimization.
fn test_ptr_arithmetic() {
    let mut code: Vec<BrainfuckInstr> = vec![
        PointerInc, PointerInc, PointerInc,
        PointerDec,
        PointerInc,
        PointerDec, PointerDec, PointerDec, PointerDec
    ];
    optimize(&mut code);
    let expected_code: Vec<BrainfuckInstr> = vec![
        PointerAdd(3),
        PointerDec,
        PointerInc,
        PointerSub(4)
    ];
    assert_eq!(&code, &expected_code)
}
#[test]
/// Test for data arithmetic optimization.
fn test_data_arithmetic() {
    let mut code: Vec<BrainfuckInstr> = vec![
        DataDec, DataDec,
        DataInc,
        PointerDec,
        Print(5),
        DataDec,
        DataInc, DataInc, DataInc
    ];
    optimize(&mut code);
    let expected_code: Vec<BrainfuckInstr> = vec![
        DataSub(2),
        DataInc,
        PointerDec,
        Print(5),
        DataDec,
        DataAdd(3)
    ];
    assert_eq!(&code, &expected_code)
}

#[test]
/// Test for printing optimization.
fn test_print() {
    let mut code: Vec<BrainfuckInstr> = vec![
        PutByte, PointerInc, PutByte,
        PointerDec,
        PutByte, PointerInc, PutByte, PointerInc, PutByte,
        PointerDec,
        PutByte
    ];
    optimize(&mut code);
    let expected_code: Vec<BrainfuckInstr> = vec![
        Print(2),
        PointerDec,
        Print(3),
        PointerDec,
        PutByte
    ];
    assert_eq!(&code, &expected_code)
}

Optimization passes - data and pointer arithmetic

Let's start with implementing the pointer arithmetic passes in cto.rs:

#[cfg(test)]
mod tests;

use super::BrainfuckInstr;

/// Optimizes a list of Brainfuck instructions to be less repetitive.
pub fn optimize(code: &mut Vec<BrainfuckInstr>) {
    optimize_arithmetic(code);
}
/// Arithmetic optimization pass.
fn optimize_arithmetic(code: &mut Vec<BrainfuckInstr>) {
    use BrainfuckInstr::*;
    // new, optimized output:
    let mut opt = Vec::new(); /* Yes, we're cheating to keep the same function signature.
    We're going to just replace `code` with the new vector.
    We could optimize things in place, but I tried, believe me, I did. */
    // How many times in a row we've seen the instruction last seen.
    let mut repeats: u16 = 0;
    // Instruction last seen.
    let mut last_op = match code.get(0) {
        Some(op) => op.clone(),
        None => return // no instructions to optimize
    };
    for op in code.iter() {
        if *op == last_op {
            repeats += 1;
        } else {
            // The run just broke: flush it to the output.
            push_run(&mut opt, &last_op, repeats);
            repeats = 1;
        }
        last_op = op.clone();
    }
    /* Flush the run that's still open when the instructions run out.
    Forgetting this step drops the tail end of every program. */
    push_run(&mut opt, &last_op, repeats);
    *code = opt;
}
/// Push `repeats` copies of `op` onto `opt`, squashing arithmetic runs
/// into single virtual instructions.
fn push_run(opt: &mut Vec<BrainfuckInstr>, op: &BrainfuckInstr, repeats: u16) {
    use BrainfuckInstr::*;
    match op {
        DataDec | DataInc | PointerDec | PointerInc if repeats > 1 => {
            opt.push(squash_arithmetic(op, repeats));
        },
        _ => for _ in 0..repeats {
            opt.push(op.clone());
        }
    }
}
/// "Compress" standard Brainfuck arithmetic operations repeated `x` times into our own virtual ones.
fn squash_arithmetic(op: &BrainfuckInstr, x: u16) -> BrainfuckInstr {
    use BrainfuckInstr::*;
    use std::cmp::min;
    let max_u16 = std::u16::MAX;
    match op {
        PointerDec => PointerSub(min(max_u16, x)),
        PointerInc => PointerAdd(min(max_u16, x)),
        DataDec => DataSub(min(255, x) as u8),
        DataInc => DataAdd(min(255, x) as u8),
        _ => {
            panic!("Tried to convert the non-arithmetic instruction {:?} into a virtual arithmetic instruction!", op)
        }
    }
}

Now, the only tests that won't pass will be the ones for print optimization and the one for hi.bf.

Weren't you warning me against the dangers of frequent allocation a few posts back? Why are we creating a whole new Vec for the arithmetic pass, pushing tons of stuff to it and then substituting it for the original?

I'd say "Hypocrite" is my middle name, but I'd be lying, so I'll say that if I had gone to Confirmation as a teenager, I'd have picked some Latinized form of the Greek "Hypokrites" for my third name. Knowing the history of early Roman Christianity (tl;dr: it had lots of saints), there probably was some obscure saint with a name like that.

I looked it up. There wasn't. You'd have had to pick something less indicative of what kind of person you are.

Seriously, though, trying to modify the instruction list in-place (Vec::drain was involved, among other things) resulted in a horrifying mess, so I decided against it.

You could be doing proper functional programming, you know. Like fn optimize(code: &[BrainfuckInstr]) -> Vec<BrainfuckInstr>.

Bite me. I love mutable state.

Are you telling me we're going to do the same thing with the print optimization?

I am. Let's move on to the second optimization pass, the one where we detect strings of instructions that in the big picture look like printing a string longer than one byte and squash each into one virtual "print" instruction.

In the arithmetic optimization pass, all we had to do was look for groups of instructions that were exactly the same. The moment the pattern broke, we pushed what we had found up until that moment - either a single, unprocessed raw Brainfuck instruction or one of our virtual aggregate instructions. Here, however, we're looking for a very specific and heterogeneous pattern:

. (PutByte) ← This occurs exactly once
>. (PointerInc, PutByte) ← If we're printing a string of length n > 1, this occurs exactly n - 1 times.

Ooh! Are we going to use some sophisticated string-search algorithm? Regular expressions, perhaps, this looks like a job for those! Weren't they in some weird way Turing-complete, too?

Close, but no cigar. We're going to use a slightly more generalized concept to understand our problem. The human mind can recall with great clarity painful memories, and the one semester of orthodox, classical CS education I got at a university I don't recommend was a long string of those. One particularly painful memory of mine involves finite state automata, and our problem looks like something we can model as a finite state automaton. Go click the link I just gave you and look at the pretty diagrams, then come back.

Ready?

Look at my awesome diagram, which I drew and described for you by hand. Those canned Wikipedia diagrams made of computer-generated SVG can't possibly compete with it. It shows all the states our optimization pass loop can be in:

Sort-of-FSM diagram

One, I can't read your handwriting, and even if I could, I wouldn't like it. Two, this looks more like a regular program flow diagram.

Your comments about my handwriting wound me deeply and painfully. I'll pretend I didn't hear them. As for whether or not it's a real finite state automaton, I think it kind of is one, if you squint. It's definitely inspired by the idea, I once had to draw something similar to model a regular expression. In any case, since you can't read it (or in case the image goes down, or you can't read it for other reasons), let me describe in words my approach to the problem:

First, we assume we're not on any level of Print(u16). We set a variable that tracks this, let's call it print_lvl, to 0. A print_lvl of 1 means we're just putting out one byte and should keep the PutByte instruction as is; one of n > 1 means we should generate a Print(n) instruction instead. Every time we stumble upon a PutByte, we raise our print_lvl by 1 - and every time it's not followed by a PointerInc, we check what our level is.

If it's 1, we produce a single PutByte, followed by whatever interrupted us. Ditto for a situation where we get a PutByte, a PointerInc and then something else. If our level is higher than 1, we generate the appropriate Print. In either case, we reset our print_lvl to 0 after the pattern gets interrupted. If we don't get a PutByte at all, we just copy what we do get and keep trying until we reach the end of the program.

All that and some implementation details.

That's pretty clever.

See, that's what I've missed hearing. Finally, some well-deserved praise. Without further ado, let's implement the printing optimization pass and add it to optimize:

fn optimize_printing(code: &mut Vec<BrainfuckInstr>) {
    use BrainfuckInstr::{PointerInc, PutByte, Print};
    // Instruction last seen, so we can tell `.>` apart from a lone `>`.
    let mut last_op = match code.get(0) {
        Some(op) => op.clone(),
        None => return // no instructions to optimize
    };
    // How many bytes the print pattern we're currently inside would output.
    let mut print_lvl = 0u16;
    let mut opt = Vec::new();
    for op in code.iter() {
        match op {
            PutByte => {
                if print_lvl == 0 {
                    // First `.` of a potential pattern.
                    print_lvl += 1;
                } else {
                    match last_op {
                        // `.>.`: the pattern continues, one byte longer.
                        PointerInc => print_lvl += 1,
                        // `..`: the old pattern ends at one byte; a new one starts.
                        _ => {
                            opt.push(PutByte);
                            print_lvl = 1
                        }
                    }
                }
            },
            PointerInc => {
                match last_op {
                    // `.>`: possibly mid-pattern; hold off until the next op.
                    PutByte => (),
                    // A `>` that can't be part of a print pattern passes through.
                    _ => {
                        opt.push(PointerInc);
                        print_lvl = 0;
                    }
                }
            }
            // Anything else interrupts the pattern: flush what we've gathered.
            other => {
                match print_lvl {
                    0 => opt.push(other.clone()),
                    1 => {
                        opt.push(PutByte);
                        opt.push(other.clone());
                        print_lvl = 0;
                    },
                    n => {
                        opt.push(Print(n));
                        opt.push(other.clone());
                        print_lvl = 0;
                    }
                }
            }
        }
        last_op = op.clone();
    }
    // Flush the pattern that's still open when the program ends.
    match print_lvl {
        0 => (),
        1 => opt.push(PutByte),
        n => opt.push(Print(n))
    }
    *code = opt;
}

That looks eldritch.

And here I was so proud of myself. You're right, though, I should figure out a way to at least comment that code clearly.

While testing the code, it turned out that one of our tests had a minor typo/brainfart that made it always fail. Another thing we discovered was what seemed to be a strange interaction between optimize_arithmetic and optimize_printing, but upon investigation it turned out to have been completely optimize_arithmetic's fault - I'll spare you the gory details, but it boiled down to me having forgotten to push any remaining instructions once the loop was done, which is exactly what the final push_run after the loop above is there for.

All that's left now is to implement the x86_64 assembly for our compiler to generate from our fancy new virtual Brainfuck instructions. This is going to be incredibly easy in comparison.

Virtual instruction translation

Here are our translations for the virtual instructions:

BrainfuckInstr::PointerSub(x) => {
    let asm = format!("sub rsi,{x}\n", x = x);
    output.push_str(&asm);
    return
},
BrainfuckInstr::PointerAdd(x) => {
    let asm = format!("add rsi,{x}\n", x = x);
    output.push_str(&asm);
    return
},
BrainfuckInstr::DataSub(x) => {
    let asm = format!("sub byte [rsi],{x}\n", x = x);
    output.push_str(&asm);
    return
},
BrainfuckInstr::DataAdd(x) => {
    let asm = format!("add byte [rsi],{x}\n", x = x);
    output.push_str(&asm);
    return
},
BrainfuckInstr::Print(x) => {
    // Copypasted from the PutByte arm, except rdx is now set dynamically
    let asm = format!("mov rdx,{x}\nmov rdi,1\nmov rax,1\nsyscall\n", x = x);
    output.push_str(&asm);
    return
}

The two new instructions I've introduced now, add and sub, should be pretty easy to figure out just by looking at them. This time, knowing you'd laugh at me if I got it wrong again, dear reader, I remembered to specify our data's size when we directly access memory with [rsi], ha!

See? I knew some ridicule would do you good.

Anyway, Print is the exact same syscall as PutByte, except rdx, corresponding to its count argument in C, is set to the x in Print(x), not a hardcoded 1.

Bolting CTO onto our CLI

And now for something even more obvious and straightforward, let's add a command-line switch to optionally disable CTO. Serious compilers for big boys have it, so BFC-RS should too. I won't bore you with the details of adding fields to our App struct, making things public and hooking things up to main.rs, but the final result should look like this:

$ cargo build
   Compiling bfc-rs v1.1.0 (/home/jan/Kod/bfc-rs)
    Finished dev [unoptimized + debuginfo] target(s) in 0.92s
$ target/debug/bfc-rs --help
Usage: bfc-rs <source_filename> [-o <output-filename>] [--no-cto] [--dump-nasm] [--no-cleanup]

BFC-RS v1.1.0

Options:
  -o, --output-filename
                    output filename (must be provided unless --dump-nasm is
                    explicitly passed)
  --no-cto          disable compile-time optimizations
  --dump-nasm       instead of compiling, print raw NASM output to stdout for
                    debugging
  --no-cleanup      do not clean up build directory after successful build
  --help            display usage information
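The help output above is the style the argh crate generates, so - assuming that's what App derives (a sketch under that assumption; the real struct has more fields) - the new switch boils down to one annotated field:

use argh::FromArgs;

/// BFC-RS v1.1.0
#[derive(FromArgs)]
struct App {
    /// output filename (must be provided unless --dump-nasm is
    /// explicitly passed)
    #[argh(option, short = 'o')]
    output_filename: Option<String>,
    /// disable compile-time optimizations
    #[argh(switch)]
    no_cto: bool,
    // ...the positional source filename and the other switches go here...
}

fn main() {
    let app: App = argh::from_env();
    // In the real main.rs, this gates the optimizer instead:
    // if !app.no_cto { optimize(&mut code); }
    println!("CTO enabled: {}", !app.no_cto);
}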

Is that a new version number?

That it is. We've implemented a major new feature in a backwards-compatible way, so the Semantic-Versioningist Manifesto says we should bump the middle part of our version number.

Benchmarking hi.bf

This is it, this is what we've been working towards all this time: making the binary compiled from hi.bf smaller and faster.

Let's first take a look at the NASM generated from hi.bf with CTO on:

$ bfc-rs --dump-nasm hi.bf
section .data
memory times 30000 db 0
section .text
global _start
_start:
mov rsi,memory
add byte [rsi],72
inc rsi
add byte [rsi],105
inc rsi
add byte [rsi],33
inc rsi
add byte [rsi],10
sub rsi,3
mov rdx,4
mov rdi,1
mov rax,1
syscall
mov rdi,0
mov rax,60
syscall
$

Nice! This is something we can actually read and understand! Let's compile hi.bf with and without CTO and compare the size in bytes.

$ target/debug/bfc-rs hi.bf -o hi_naïve --no-cto
$ target/debug/bfc-rs hi.bf -o hi_optimized
$ size hi_naïve
   text    data     bss     dec     hex filename
    557   30000       0   30557    775d hi_naïve
$ size hi_optimized
   text    data     bss     dec     hex filename
     64   30000       0   30064    7570 hi_optimized

Would you look at that! CTO has saved us 493 bytes' worth of active machine code instructions! That's an 88.5% reduction in size, by the way.

Now let's look at how fast these two programs go. We're going to use hyperfine. It's written in Rust and published on crates.io, so we can install it with cargo (cargo install hyperfine), but the major Linux distros all have it in their package managers; the choice is yours. Let's see, then:

$ hyperfine --min-runs 2000 ./hi_naïve ./hi_optimized
Benchmark #1: ./hi_naïve
  Time (mean ± σ):       0.2 ms ±   0.3 ms    [User: 0.3 ms, System: 0.2 ms]
  Range (min … max):     0.0 ms …   2.3 ms    2000 runs
 
  Warning: Command took less than 5 ms to complete. Results might be inaccurate.
 
Benchmark #2: ./hi_optimized
  Time (mean ± σ):       0.3 ms ±   0.2 ms    [User: 0.3 ms, System: 0.3 ms]
  Range (min … max):     0.0 ms …   2.1 ms    2000 runs
 
  Warning: Command took less than 5 ms to complete. Results might be inaccurate.
 
Summary
  './hi_naïve' ran
    1.25 ± 1.73 times faster than './hi_optimized'

What the fuck?

I'm asking myself that question too. Possible explanations:

  • the program is just too small for any difference in runtime to be measurable with the precision available
  • my processor, a Ryzen 3 1200 AF, is somehow less efficient at executing the add and sub instructions than it is at executing stupidly long series of inc and dec instructions (not that I'm a fanboy, but I highly doubt it)
  • other programs running on my system are taking up the CPU at times inopportune for hi_optimized or affecting the time system calls take (which should, however, obviously disadvantage hi_naïve, as it makes a few more of them)

Actually, nevermind.

I put hyperfine on for a hundred thousand runs of each program before I took a break, and when I came back, the tables had turned:

$ hyperfine --min-runs 100000 ./hi_naïve ./hi_optimized
Benchmark #1: ./hi_naïve
  Time (mean ± σ):       0.1 ms ±   0.1 ms    [User: 0.2 ms, System: 0.3 ms]
  Range (min … max):     0.0 ms …   3.5 ms    100000 runs
 
  Warning: Command took less than 5 ms to complete. Results might be inaccurate.
  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
 
Benchmark #2: ./hi_optimized
  Time (mean ± σ):       0.1 ms ±   0.1 ms    [User: 0.2 ms, System: 0.3 ms]
  Range (min … max):     0.0 ms …   3.6 ms    100000 runs
 
  Warning: Command took less than 5 ms to complete. Results might be inaccurate.
  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
 
Summary
  './hi_optimized' ran
    1.01 ± 1.20 times faster than './hi_naïve'

Law of large numbers and all that, I guess. We really have gotten to Even Faster Uselessness.

Publication

Now that we know BFC-RS 1.1.0 works as intended, we can set it up for publication on crates.io, push it to GitHub, and write a self-congratulatory blogpost about the whole process. Maybe even post it to Reddit so people look at it. I'm doing these things right now.

Conclusion

Over the course of the Terrible Compiler for a Useless Language series, I have used you, my dear reader, as a rubber ducky for the following four things:

  • Implementing a toy quine-generating language all the way back in post 0 in order to understand x86 assembly well enough to write BFC-RS itself
  • Implementing an abstraction for Brainfuck instructions, BrainfuckInstr, and a parser that transforms a Brainfuck source file into a list of those, in post 1
  • Making our compiler generate functional x86 assembly that runs on a modern Linux system in post 2
  • Improving our compiler's output through compile-time optimizations for time and memory savings, achieving the titular Even Faster Uselessness of this post

However, BFC-RS is of course not perfect.

What more can be done with this project?

I have a few ideas.

More targets

Currently, BFC-RS only supports Linux running on newish, 64-bit x86 processors. People not running Linux, people running Linux on old x86 CPUs and people running not-Linux on not-x86 CPUs are all left out by BFC-RS. Seeing as there is a nice, thick, insulating layer of abstraction between Brainfuck and Linux-targeted NASM in this project - the BrainfuckInstr enum type - handling more compiler backends should prove to be quite easy. Perhaps a Compiler trait could be specified with a compile function and implemented by different concrete types like Elf64NasmCompiler (the one we already have) or Win32Compiler.
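One possible shape for that seam, as a sketch (the trait, the Win32Compiler idea and the module path are all hypothetical; only BrainfuckInstr exists today):

// The import path is an assumption about the module layout.
use crate::parser::BrainfuckInstr;

trait Compiler {
    /// Translate parsed (and optionally CTO'd) instructions into
    /// target-specific output - NASM source, in the existing backend's case.
    fn compile(&self, code: &[BrainfuckInstr]) -> String;
}

struct Elf64NasmCompiler;

impl Compiler for Elf64NasmCompiler {
    fn compile(&self, code: &[BrainfuckInstr]) -> String {
        // The NASM generation we already have would move behind this method.
        todo!()
    }
}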

More compile-time optimizations

The most obvious ones we missed that wouldn't be much more difficult to implement than the ones we have would be:

  • reading a string of size n > 1
  • writing a string backwards

More difficult to implement would be dead code elimination, which would require at least statically predicting the value of the number at the data pointer before a while loop is opened, or memory-usage optimization, which I know other Brainfuck interpreters and compilers have done in the past.
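The first of those would be Print's mirror image: squash a , followed by n - 1 repetitions of >, into a hypothetical Read(n) variant, whose translation arm would differ from Print's only in the syscall number and file descriptor (sys_read is rax = 0 and stdin is fd 0 on x86-64 Linux):

BrainfuckInstr::Read(x) => {
    // Hypothetical: same shape as the Print arm, but with the sys_read
    // calling convention - rax = 0 (sys_read), rdi = 0 (stdin),
    // rsi already holds the data pointer, rdx is the byte count.
    let asm = format!("mov rdx,{x}\nmov rdi,0\nmov rax,0\nsyscall\n", x = x);
    output.push_str(&asm);
    return
}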

More comments, especially in the Print Optimization Pass of the Shadow of Death

Some code in this project is not as easy to understand as it should be, at least not without meticulously handmade illustrations that still don't show the whole picture. Perhaps one way around this would be to use better, cleaner algorithms and designs.

Compiling the whole thing to WebAssembly and putting it in a web UI

This seems like a neat way of making the project more accessible and off-showable. Everything has a browser UI these days. Maybe BFC-RS could even compile Brainfuck to WebAssembly, then run it in the same browser tab and show the output.

Are you actually going to do any of these things?

Nope!

It's Three Magi Day today, the last day of the winter holidays, and uni starts tomorrow. I'm not going to have the time for this stuff, and I will probably have lost interest by the time I do have the time to develop BFC-RS more.

But I'm not signing BFC-RS' death warrant here. If you're interested in implementing some of these features, just do it yourself and send me a pull request. I could even make you a maintainer if you want to do it so badly.

The repo is here. Have fun!