Post #1: In Which I Write a Hopefully Passable Parser for a Useless Language

Welcome back. This is the second post in my Terrible Compiler for a Useless Language series, in which I write about my attempt to implement a native Brainfuck compiler for x64 Linux. In this post, I'll be covering a basic parser for Brainfuck. Previous posts in the series:

Is this "italic heading" thing where you set up a strawman to cut down so you can feel good about winning an imaginary argument going to be a regular convention from now on?

Yes, how could you tell?

I got the acronym "joke" in the last post and I didn't like it. It's in poor taste/personally offends me.

I don't like that joke either. Some kind of police force is necessary for society to function and we shouldn't say mean things about cops who aren't doing anything wrong (notice the lack of a comma before "who" here before you jump to conclusions). Worse yet, the acronym is badly spelled. But I just couldn't resist.

Get to the point.

But of course. Let's examine how Brainfuck works. What does Wikipedia have to say about it?

Notable for its extreme minimalism, the language consists of only eight simple commands and an instruction pointer.

What are the eight commands, then?

< and > respectively move the pointer backward and forward by one memory cell. As we've already mentioned at the very beginning, Brainfuck's memory cells are one byte in size and there's thirty thousand of them. These are our variables. Those of you, dear readers, coming from dynamically-typed languages like Python and JS, should be happy to hear that Brainfuck has no types.
- and + decrement and increment the number in the memory cell at our pointer.
, and . respectively get and put out one byte of data from/to the standard input/output.
[ and ] are where things get interesting and Turing-complete - the former means "if the value at the pointer is 0, skip over to the instruction after the matching ]". The latter means "if the value isn't zero, jump back to the instruction after the matching [". If this looks like a while loop to you, that's because it is one. Obviously, all square brackets must be paired up.

All other characters in a Brainfuck source file are considered comments and can be safely ignored. To work, then! Let's declare a new Cargo project with cargo new bfc-rs. Because we're trying to be slightly more professional this time, we won't stuff all the code in main.rs, we'll also create a lib.rs to hold the program's logic.

Defining the instructions

Our first step should be to define a data type for Brainfuck commands:

/// Type representing a standard Brainfuck instruction.
#[derive(Debug, PartialEq)] // this is for tests
enum BrainfuckInstr {
    /// Move the data pointer back one cell.
    PointerDec,
    /// Move the data pointer forward one cell.
    PointerInc,
    /// Decrement the value of the current memory cell.
    DataDec,
    /// Increment the value of the current memory cell.
    DataInc,
    /// Get a byte from the standard input and store it in the current memory cell.
    GetByte,
    /// Write the current memory cell's value to the standard output.
    PutByte,
    /// Begin a while loop conditional on the current value not being zero.
    WhileNonzero,
    /// Close the while loop.
    EndWhile
}

But why do we need this intermediate representation, exactly? Can't we just write assembly code at the same time as we're reading the source code?

That's a secret. You'll find out two blogposts from now.

Anyway, that was easy. Now let's write a Brainfuck program that writes a newline to the standard output, because typing out a full Hello World would be a stupidly large amount of work:

++++++++++ Add 10 to the current byte to make a newline or 0x0A byte
. Print it

I've saved it as newline.bf in our project's directory for later use. Now let's define a Parser struct, a parse function and write a test for it before we implement anything, because I think "Test-Driven Development" looks good on a resumé. From lib.rs:

#[cfg(test)] // only declare a "tests" module when the Rust compiler is in test mode
mod tests;

/// The syntax errors possible.
#[derive(Debug)]
enum SyntaxError {
    /// A closing square bracket was found at the contained line:index position, but there was no opening square bracket before it.
    PrematureEndWhile(usize, usize),
    /// The last while loop opened, at the contained line:index position, has no closing bracket.
    UnclosedWhile(usize, usize)
}
/// Struct responsible for parsing Brainfuck.
struct Parser; // empty because it has no internal state yet
impl Parser {
    fn parse(&mut self, code: &str) -> Result<Vec<BrainfuckInstr>, SyntaxError> {
        unimplemented!()
    }
}

The contents of our tests.rs file will be as follows:

const NEWLINE_PRINTING_PROGRAM: &str = include_str!("../newline.bf"); // same thing as we did with the NASM template while writing HWDPC
use super::*; // import all symbols from our parent module, the library root
use BrainfuckInstr::*;
#[test]
fn test_newline_program() {
    let mut parser = Parser;
    let output = parser
    .parse(NEWLINE_PRINTING_PROGRAM)
    .expect("The newline program contains no syntax errors, something is wrong with the parser.");
    let expected_output: Vec<BrainfuckInstr> = vec![
        DataInc, DataInc, DataInc, DataInc, DataInc,
        DataInc, DataInc, DataInc, DataInc, DataInc,
        PutByte
    ];
    assert_eq!(&output, &expected_output)
}

Writing the parser logic

If we invoke cargo test right now, the test will unsurprisingly fail, because the unimplemented!() macro in parse makes the compiler think a value of the right type gets returned, but at runtime panics (=crashes) the thread it's running on, thus failing any test we put it in. Let's get to work actually implementing parse, then.

fn parse(&mut self, code: &str) -> Result<Vec<BrainfuckInstr>, SyntaxError> {
    use BrainfuckInstr::*;
    let mut output = Vec::new();
    for (line_number, line) in code.lines().enumerate() {
        for (ch_number, ch) in line.chars().enumerate() {
            output.push(match ch {
                '<' => PointerDec,
                '>' => PointerInc,
                '-' => DataDec,
                '+' => DataInc,
                ',' => GetByte,
                '.' => PutByte,
                '[' => WhileNonzero,
                ']' => EndWhile,
                _ => { continue; /* skip this iteration if the character is something else */}
            });
        }
    }
    // Don't worry about the line and character numbers yet.
    Ok(output)
}

Now, our tests pass, the whole one. This is too good, we need more tests.

Test programs meant to fail

Let's write two faulty programs, premature.bf and unmatched.bf, and the corresponding tests.

++++.,
]+. The "premature" program is invalid because there's what amounts to a closing brace without an opening one
+++--.+[+
++[] This should crash the compiler because we never close the outer while loop.
Hey, this means we can use proper punctuation here, cause this code won't run.

The tests:

#[test]
fn test_premature_program() {
    let mut parser = Parser;
    let output = parser
    .parse(PREMATURE_PROGRAM)
    .expect_err("The code in premature.bf shouldn't produce a valid instruction list.");
    assert_eq!(output, SyntaxError::PrematureEndWhile(2, 1)) // we expect an error on line 2, character 1
}
#[test]
fn test_unmatched_program() {
    let mut parser = Parser;
    let output = parser
    .parse(UNMATCHED_PROGRAM)
    .expect_err("The code in unmatched.bf shouldn't produce a valid instruction list.");
    assert_eq!(output, SyntaxError::UnclosedWhile(1, 8)) // we expect an error or line 1, character 8
}

These will of course fail:

running 3 tests
test tests::test_newline_program ... ok
test tests::test_unmatched_program ... FAILED
test tests::test_premature_program ... FAILED

failures:

---- tests::test_unmatched_program stdout ----
thread 'tests::test_unmatched_program' panicked at 'The code in unmatched.bf shouldn't produce a valid instruction list.: [DataInc, DataInc, DataInc, DataDec, DataDec, PutByte, DataInc, WhileNonzero, DataInc, DataInc, DataInc, WhileNonzero, EndWhile, PutByte, GetByte, PutByte]', src/tests.rs:32:6
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

---- tests::test_premature_program stdout ----
thread 'tests::test_premature_program' panicked at 'The code in premature.bf shouldn't produce a valid instruction list.: [DataInc, DataInc, DataInc, DataInc, PutByte, GetByte, EndWhile, DataInc, PutByte]', src/tests.rs:24:6


failures:
    tests::test_premature_program
    tests::test_unmatched_program

test result: FAILED. 1 passed; 2 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

error: test failed, to rerun pass '--lib'

We haven't actually implemented any bracket-matching behavior, so the parser just keeps producing "valid" lists of instructions wrapped in the Ok variant of Rust's Result. Because we're expecting an error, this fails our test. What we want is error data that contains the location of the mistake in the source code.

How could we do this? Well, you see, early on in uni I had this one guy teach one of my courses. He was probably a better researcher than he was a teacher, as he was openly rude to his students and didn't really care to explain anything. His default answer to any question was "it's all in The Book", and his favorite joke was letting someone muddle their way through an exercise, nod along and encourage them, then, when they finished, proclaiming smugly:

Of course, this is all completely wrong. Go read the relevant chapter at home, class is over.

Looking back on it, he was kind of an asshole. Or maybe he was going through hard times, but that would mean I don't get to be self-righteous and talk smack about him.

Does this sob story emotionally-colored digression lead anywhere?

Oh yes, it does. The only thing he actually managed to explain that I remember until now was how one might implement an arithmetic expression parser capable of dealing with the ordinary "infix" mathematical notation most numerate people actually use in their daily life, in which binary operators are written between their operands, which produces ambiguities that need parentheses to be resolved. Reverse Polish notation, in which the expression 3 - 4 would be written 3 4 -, does not have this problem and a parser for it is much less interesting to implement. Anyway, making sure pairs of parentheses match up is the same thing as what we're trying to do - making sure pairs of square brackets match up. How do we do this?

Bracket-counting logic

Easy. We count the brackets of each type and make sure the counts balance out. In fact, we don't even need to keep two separate counts.

/// Struct responsible for parsing Brainfuck.
struct Parser {
    /// The 1-indexed position (line number, position within line) of the earliest `[` without a matching `]`.
    earliest_unclosed: (usize, usize),
    /// The number that keeps track of whether or not the brackets are balanced so far.
    /// Opening a while loop with `[` bumps this number up by 1. Closing one with `]` drops it by 1.
    bracket_balance: isize /* In theory, there could be as many closing brackets as opening ones, and they could make up the whole program.
    Therefore, it makes sense to use `isize`, which can fit in half as big a number as `usize`, but either positive or negative.
    Remember that we want `bracket_balance` to end up at 0.*/
}
impl Parser {
    // Now that our struct isn't empty, we should make a constructor for it.
    // Structs can be declared directly ad hoc, but only if the current module can see all of the struct's fields.
    // Rust has no language-level concept of a constructor; we just put a struct declaration into an ordinary function. Convention dictates we name it "new".
    fn new() -> Self {
        // "Self" is just a way to tell the Rust compiler "put this type's actual name here".
        // We could write "fn new() -> Parser", but this is neater and would require less searching-and-replacing if we wanted to manually rename `Parser`.
        Self {
            earliest_unclosed: (0, 0),
            bracket_balance: 0
        }
    }
    fn parse(&mut self, code: &str) -> Result<Vec<BrainfuckInstr>, SyntaxError> {
        use BrainfuckInstr::*;
        self.earliest_unclosed = (0, 0);
        self.bracket_balance = 0; // Zero out the parser's state, just in case we end up reusing one for multiple files.
        let mut output = Vec::new();
        for (line_number, line) in code.lines().enumerate() {
            for (ch_number, ch) in line.chars().enumerate() {
                output.push(match ch {
                    '<' => PointerDec,
                    '>' => PointerInc,
                    '-' => DataDec,
                    '+' => DataInc,
                    ',' => GetByte,
                    '.' => PutByte,
                    '[' => {
                        // If we're not within (hopefully) a pair of braces already:
                        if self.bracket_balance == 0 {
                            self.earliest_unclosed = (line_number + 1, ch_number + 1);
                        }
                        self.bracket_balance += 1;
                        WhileNonzero // in Rust, code blocks are expressions that evaluate to the last expression within them.
                        // We can do all sorts of stuff within an "if" or "match" block.
                        // This is why we rarely have to write "return" in functions.
                    },
                    ']' => {
                        self.bracket_balance -= 1;
                        if self.bracket_balance < 0 {
                            // The moment we have one more ] than there have been [s, it no longer makes sense to parse the rest of the program.
                            // This is one of those situations where "return" is useful:
                            return Err(SyntaxError::PrematureEndWhile(line_number + 1, ch_number + 1))
                        }
                        EndWhile
                    },
                    _ => {
                        continue // Skip this iteration if the character is something else.
                    }
                });
            }
        }
        if self.bracket_balance == 0 {
            Ok(output)
        } else {
            // We've already returned the appropriate error if there was an extra ], so this is the only remaining possibility.
            Err(SyntaxError::UnclosedWhile(self.earliest_unclosed.0, self.earliest_unclosed.1))
        }
    }
}

Our tests now pass:

warning: 13 warnings emitted

    Finished test [unoptimized + debuginfo] target(s) in 0.48s
     Running target/debug/deps/bfc_rs-28017cff71c3a4a8

running 3 tests
test tests::test_newline_program ... ok
test tests::test_premature_program ... ok
test tests::test_unmatched_program ... ok

test result: ok. 3 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

Those warnings? Ignore them. The compiler is just blind to our glorious vision of the future and thinks we don't need any code yet, because none of it gets used by main.rs.

Conclusion

We have written a parser for Brainfuck and proven it to work as intended with automated unit tests built into Rust (tests which we had written before we implemented the actual functionality, meaning we can now proudly insert "test-driven development" into the "skills" section of our CVs). On our way there, we have been helped by Rust's algebraic data types, which allow us among other things to treat an "expected output" type and an error-data type as variants of a higher-order generic type.

The code we have written so far is available here.

Wait, you said there would be abstract syntax trees! Where's my tree?

You remembered that from the first post?

I was only stating my fears. Luckily for me and unluckily for the satisfaction of your curiosity, Brainfuck doesn't have binary arithmetic operators, function calls with lists of parameters, custom data types with named fields, named variables, or anything else that a practical, usable language would be expected to have. All these things can be represented as some sort of tree or other graph, but there's just no reason for me to do any of it here, because at the end of the day all a Brainfuck program is is a flat list of instructions, not unlike the assembly code we'll be translating it into. This makes my job much easier.

What comes next

In the next post, we'll try to learn enough x86 assembly instructions and NASM directives to convert our list of Brainfuck instructions into assembly and finally, after two long blogposts, write a working Brainfuck compiler.

Didn't you imply there's going to be one more post after that?

As I've said, that's a secret. However you might already suspect it will be about something that starts with a C and ends with an O. That's all I'm saying.