The Digital Cat - compilers

Call conventions, ownership, and mutability in Rust

2025-01-09T18:00:00+01:00

My roots as a developer are in C and Assembly, and I fondly remember the time I spent learning those languages using Borland TASM or Turbo C under MS-DOS. Those languages, and the technology that surrounded them, forced me to understand the low-level architecture of computers, and I still consider those details extremely interesting.

Rust, just like C and C++, is a low-level language that exposes details of the underlying architecture, and using it proficiently requires a certain understanding of concepts like memory management that are mostly ignored by high-level languages like Python or JavaScript.

This post has been written for beginners who are used to dynamically typed languages. I will explore the intricacies of passing arguments to Rust functions, trying to clarify exactly what happens behind the scenes and why the language provides syntactic devices like & or mut.

For the sake of clarity, I will split the discussion into three separate topics: call conventions, ownership, and mutability. However, it is important to keep in mind that those concepts are not independent, and that the separation is purely a strategy to make them more digestible.

Computer architecture¶

Throughout the article, I will mention and show what happens at CPU level when code is written in a certain way.

You do not need previous knowledge of computer architecture to follow those sections, as I will explain what happens step by step, but I highly recommend eventually becoming familiar with concepts like the stack and CPU calling conventions. You will find some useful links in the Resources section at the end of the article.

Assembly and machine code

Formally, machine code is the sequence of binary values that are given to the CPU: everything in machine code is just a binary number. Assembly is a slightly higher-level language that uses mnemonics and register names to make it simpler for humans to read and write machine code.

For all practical purposes, in this article we can assume that Assembly and machine code are the same thing and use the two words interchangeably.

Call conventions¶

When we define a function, we list a set of parameters, that is values that have to be given to the function when we call it. As Rust is a statically typed language, functions need to declare the type of each parameter, and arguments need to match that.

Arguments and parameters

Strictly speaking, arguments and parameters are different. However, this explanation from the Rust book looks like a good informal approach.

We can define functions to have parameters, which are special variables that are part of a function’s signature. When a function has parameters, we can call it providing concrete values for those parameters. Technically, the concrete values are called arguments, but in casual conversation, people tend to use the words parameter and argument interchangeably for either the variables in a function’s definition or the concrete values passed in when you call a function.

In Rust, there are many built-in types like integers, floats, and booleans, and an unlimited number of user-defined types, but when we define function parameters there are two broad categories in which they can fall: values and references.

This is the first and most important concept to learn. We can call a function passing values in only two different ways, regardless of the data type. Depending on the language and the speaker, such techniques are named in different ways, and in this article I will use call by value and call by reference. We call these two different strategies call conventions.

Values and references

Let's review this important distinction between values and references before we jump into the details of function calling.

Consider this situation: I want to give a friend a card for their birthday. I can hand them the card itself, passing the physical object.

Let's consider a different scenario: a friend wants to borrow my car to go shopping. It would be crazy for me to try and fetch my car to hand it over physically. I will probably just tell my friend where the car is parked.

These two examples are exactly what happens when a computer manages data. Ultimately, data is just bytes stored somewhere in memory (typically in RAM, but the same is true for caches or long-term storage like hard drives). When I want to send data to another component in the system I can either give the data itself (value) or tell the component where to find the value (reference).

Memory location

Technically, a memory location is called an address. Ultimately, it's just a number that identifies a specific location in memory that hosts data. A pointer or a reference is a variable that contains a memory address. The address is usually an unsigned integer whose size depends on the computer architecture. For an Intel x86-64 machine it's a 64-bit number (8 bytes).

Strictly speaking, while C pointers are literally just an integer, some Rust references (for example references to slices or trait objects) carry additional data next to the address, but for the sake of simplicity in this post we will ignore the difference. If you want to dig into this you can read the documentation.

This means that, strictly speaking, arguments can be passed only by value. References are values (addresses) that are used to find other values.
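To make the idea that a reference is just a value more concrete, here is a minimal sketch that asks the compiler for the size of a couple of reference types. The helper names `thin_reference_size` and `slice_reference_size` are mine and purely illustrative. A reference to a simple type like u16 occupies one pointer-sized integer (usize), while a reference to a slice also carries the length of the slice, so it is twice as big.

```rust
use std::mem::size_of;

// Size in bytes of a reference to a sized type: just an address.
// (Hypothetical helper, used only for this illustration.)
fn thin_reference_size() -> usize {
    size_of::<&u16>()
}

// Size in bytes of a reference to a slice: an address plus a length.
// (Hypothetical helper, used only for this illustration.)
fn slice_reference_size() -> usize {
    size_of::<&[u16]>()
}

fn main() {
    // usize is the pointer-sized unsigned integer of the target architecture.
    println!("usize:  {} bytes", size_of::<usize>());
    println!("&u16:   {} bytes", thin_reference_size());
    println!("&[u16]: {} bytes", slice_reference_size());
}
```

On a 64-bit machine the first two sizes should be 8 bytes, in line with what I said above about x86-64 addresses.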

Why two different ways?

Why do computers provide two ways to pass arguments to functions?

Let's consider a real world example. I won a huge sum of money at the national lottery, and decided to get a safe deposit box at the bank and put the money there. To make sure I don't forget the code to open the safe I write it on a piece of paper. One day I need to get some money from the safe, but I cannot go myself, so I ask a friend to go, open the safe, and get the money for me. To do this, I have to give this person the piece of paper with the code, and they will give it back once they are done. As this happens multiple times, I decide that it might be simpler to copy the code on another piece of paper and to give that to them.

Everything works well, until one day the bank asks us to change the code every time we open the safe, as a security measure. Now, the one who opens the safe has to come up with a new code and update their own piece of paper. This means that the copy owned by the other person is instantly invalidated. Assuming we don't have any other way to communicate the change, we need to review our strategy. In this case there is only one way to deal with the situation. We need a single copy of the piece of paper and each one of us knows where it is hidden. This way, the one who changes the code can go and update the only copy of the note in existence.

At CPU level, when we call a function (the friend) we might need to pass data (the code to open the safe). As long as data is not changed we can safely copy it into the function's memory space (copy the code on another piece of paper), but as soon as the value changes we need a different method that allows the function to change the original value (the only copy of the note). We can pass the location of the data (the place where the note is hidden) so that the function can use it and change it.

In the first two parts of the article we will deal with read-only data, so for a while it will look like there is no need to use references. In the third part, we will introduce mutability that will finally justify them.

Pass by value in Rust

The following code shows how we can pass function arguments by value in Rust

struct Item {
    pub value: u16,
}

fn process(item: Item) -> u16 {
    item.value // (3)
}

fn main() {
    let i = Item { value: 42 }; // (1)

    process(i); // (2)
}

As you can see, the code is extremely simple. The function main creates a variable of type Item (1) and calls the function process (2). This in turn extracts the field value (3) and returns it (the result is ignored by main).

The function prototype accepts an argument of type Item. This means that the function wants to receive the physical data (the birthday card).

In this case we say that the caller passes the variable i by value.

Behind the scenes

Traditionally, we assume that the compiler will byte copy the value into the function stack, and this is still a valid mental model. However, remember that the compiler's task is to optimise code, so it might decide to do something completely different at machine language level.

We can see what happens in this case using the amazing Compiler Explorer. Please note that to use it you need to make the function main public, as the compiler expects a library, not an executable. See the section "Disassemble Rust" at the end of the post if you want to do it locally.

Using rustc 1.83.0, the code above becomes

example::process:
        mov     ax, di # (6)
        ret # (7)

example::main:
        push    rax # (1)
        mov     edi, 42 # (3)
        call    example::process # (5)
        pop     rax # (2)
        ret

For those unfamiliar with (Intel) Assembly, let me clarify what happens here:

First of all, there is no concept of struct in machine code. Here, the compiler figures out there is only one field so the struct i is just the 16 bit value 42.
In example::main the code pushes (1) and pops (2) the 64-bit register rax. By convention rax hosts a function's return value, but here the pair of instructions mostly serves to keep the stack pointer aligned to 16 bytes when process is called, as the low-level calling convention requires.
The value of the variable i is stored in the register edi (3). Here, the compiler decided not to use the stack, as the value of the variable is ultimately just a single u16 that can be hosted by a register.
The function example::process is called (5). The code of the function in Rust is very simple, and it needs only to extract the field value. In Assembly, there is no field, and the return value of the function is basically just its input. The function stores the output in the register ax (6), which is the lower 16-bit part of rax.
The function returns (7).

Leaving aside the complexity of the low-level architecture and conventions, the main concept we need to retain is: passing parameters by value copies the value of the arguments.

Again, keep in mind that the compiler writes machine code with optimisation in mind, so what we do in Rust is not always reflected in what happens at CPU level. Later, we will see an example of this.

Pass by reference in Rust

The following code is a slight modification of the previous example and shows how we can pass function arguments by reference in Rust

struct Item {
    pub value: u16,
}

fn process(item: &Item) -> u16 { // (1)
    item.value
}

pub fn main() {
    let i = Item { value: 42 };

    process(&i); // (2)
}

As you can see, the only difference is that the function process accepts &Item (1), that is a reference to Item. The function call is modified accordingly, passing &i (2), which is the reference to i.

Here, we are doing what we did when we lent a car. Instead of handing the car itself we tell the recipient where to find it.
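If you want to see this with your own eyes, the following sketch prints the address of i as seen by main and the address received by the called function. The function `address_seen_by_callee` is a hypothetical helper that I introduce just for this purpose, and the cast to usize is there only to make the address easy to print and compare.

```rust
struct Item {
    pub value: u16,
}

// Receives a reference and reports the address it points to.
// (Hypothetical helper, used only for this illustration.)
fn address_seen_by_callee(item: &Item) -> usize {
    item as *const Item as usize
}

fn main() {
    let i = Item { value: 42 };

    // The address of i from the point of view of main.
    let in_main = &i as *const Item as usize;

    // The address that the function receives.
    let in_callee = address_seen_by_callee(&i);

    println!("main sees   {:#x}", in_main);
    println!("callee sees {:#x}", in_callee);
    println!("value: {}", i.value);
}
```

The two addresses are the same: the function was told where the car is parked, it didn't receive a second car.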

Automatic referencing and dereferencing

The code above might surprise those who are used to coding in C/C++, as the function process receives &Item but then reads one of the fields as if the variable was an Item.

Formally, we should first dereference the pointer, which in Rust can be done with *item and then access the field. Indeed, the following code works

fn process(item: &Item) -> u16 {
    (*item).value
}

In C/C++ the syntax (*item).value can be written item->value, but Rust doesn't have the arrow operator.

Rust has a feature called automatic referencing and dereferencing that can automatically add &, &mut, or * to match the required data type. In this case, it automatically transforms item into *item.

It is extremely important to remember that Rust silently adjusts calls. The feature is useful, as it simplifies the syntax of the language, but if we want to understand what happens behind the scenes we need to be aware of this behaviour.

You can read more about automatic referencing and dereferencing in chapter 5.3 of the Rust book.
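As a small sketch of the feature, the following code accesses the same field with and without the explicit dereference, and calls a method that takes &self both with and without the explicit reference. The method `doubled` and the functions `implicit`, `explicit`, and `nested` are names I made up for the example.

```rust
struct Item {
    pub value: u16,
}

impl Item {
    // A method that takes &self: when we call it on an Item,
    // Rust adds the & for us (automatic referencing).
    fn doubled(&self) -> u16 {
        self.value * 2
    }
}

// Field access through a reference: Rust dereferences for us.
fn implicit(item: &Item) -> u16 {
    item.value
}

// The same access with the dereference spelled out.
fn explicit(item: &Item) -> u16 {
    (*item).value
}

// A reference to a reference is dereferenced as many times as needed.
fn nested(item: &&Item) -> u16 {
    item.value
}

fn main() {
    let i = Item { value: 21 };

    println!("{}", implicit(&i));
    println!("{}", explicit(&i));
    println!("{}", nested(&&i));

    // i.doubled() is a shorthand for (&i).doubled()
    println!("{}", i.doubled());
    println!("{}", (&i).doubled());
}
```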

Behind the scenes

This time, the mental model is that we are handing the function the location of our data, without copying it to new memory locations. Let's see what happens at CPU level.

Again, using the Compiler Explorer and rustc 1.83.0, the code above becomes

example::process:
        mov     ax, word ptr [rdi] # (6)
        ret # (7)

example::main:
        push    rax # (1)
        mov     word ptr [rsp + 6], 42 # (3)
        lea     rdi, [rsp + 6] # (4)
        call    example::process # (5)
        pop     rax # (2)
        ret

The machine code is clearly different from the previous version. Let's have a closer look:

As happened before, the code pushes (1) and pops (2) the 64-bit register rax to keep the stack aligned (calling convention). This time, the 8 bytes reserved by the push also provide the space that hosts the variable i.
The value 42 is stored on the stack (3). This instruction looks complicated, but it basically moves the value 42 to a stack address computed from the stack register rsp. Let's decompose it
- In Intel Assembly mov [rsp], 42 would move the value 42 to the address stored in the register rsp. The value is not moved into the register. Rather, the value in the register is used as an address.
- The CPU is very picky when it comes to the size of data that we want to move, so the code clarifies that we want to treat the address as the space that hosts 2 bytes (a word that corresponds to 16 bits) with mov word ptr [rsp], 42.
- Last, the offset. The calling convention wants the stack pointer rsp to be aligned to a 16-byte boundary when a function is called. The function main already pushed rax, which moved the stack pointer by 8 bytes and reserved an 8-byte slot on the stack. We are going to store only 2 bytes (a word), and the compiler placed them in the last 2 bytes of that slot, 6 bytes after its beginning. This is the reason why the code above uses [rsp + 6] instead of just [rsp].
The address rsp + 6 is loaded into the register rdi (4). The instruction lea (Load Effective Address) calculates the result of [rsp + 6] as an address and loads it into rdi. This is different from what mov would do, which is to copy the value at that address.
As before, the function process is called (5).
The code of the function is different as well because now rdi doesn't contain a value but the address of the value. Thus, the function uses word ptr [rdi] (6) to move the 16-bit value at that address into ax (which is still the conventional register for the return value) and returns (7).

Once again, the Assembly code is complicated because of conventions and low-level architectural details, but the main concept is: passing parameters by address gives the function the location of the data and not the data itself.

The compiler's role

I mentioned several times that the compiler's task is to optimise code, which means that the machine code might or might not correspond to what we wrote in Rust (or any other high level language). A simple example is the following

struct LargeItem {
    value: [u16; 1024 * 1024], // (2)
}

fn process(item: LargeItem) -> u16 {
    item.value[0]
}

pub fn main() {
    let li = LargeItem {
        value: [0; 1024 * 1024],
    };

    process(li); // (1)
}

Here, we are passing the variable li by value (1), but the type is huge. It's a struct of 2 MiB (2), that is 2 bytes * 2²⁰, which cannot be stored in a register. This means that, even though in Rust we pass the value, the compiler is forced to call the function passing the address. The relevant part of the machine code this time is

example::process:
        mov     ax, word ptr [rdi] # (2)
        ret

example::main:
        # Calls to memset and memcpy
        # to set up the large struct
        ...
        lea     rdi, [rsp + 8] # (1)
        call    example::process
        ...
        ret

where I omitted the rest of the code that deals with the memory initialisation of the large struct.

As you can see, the compiler ignores our "directive" to pass by value and uses the address of the struct (1) as it did with the Rust code that uses references. The function extracts the first value of the array item.value, so the machine code reads the value at the address (2).

The bottom line is: both passing by value and passing by reference can result in the same machine code.

This might sound surprising, but if you think about it, that's exactly the reason why we use a compiler and why it's such a fascinating and important piece of software.
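If you want to double check the size of the large struct used in this example, you can ask the compiler directly. This is a minimal sketch where `large_item_size` is a name I made up. Note that the code never creates a LargeItem, it only asks for the size of the type.

```rust
use std::mem::size_of;

// The field is never read in this sketch, so we silence the warning.
#[allow(dead_code)]
struct LargeItem {
    value: [u16; 1024 * 1024],
}

// Size of the struct in bytes. No LargeItem is created here,
// so nothing this big ever lands on the stack.
// (Hypothetical helper, used only for this illustration.)
fn large_item_size() -> usize {
    size_of::<LargeItem>()
}

fn main() {
    let bytes = large_item_size();

    println!("LargeItem: {} bytes ({} MiB)", bytes, bytes / (1024 * 1024));
    println!("A 64-bit register: {} bytes", size_of::<u64>());
}
```

The struct contains 2²⁰ elements of 2 bytes each, so 2 MiB in total, against the 8 bytes of a general purpose register.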

As I mentioned before, the reason why we should pass arguments by reference instead of by value will be clear once we introduce mutability. To get there, however, we first need to discuss another feature of the language: ownership.

Ownership¶

This is the second aspect of function calls that I want to discuss, together with call conventions and mutability. Once again, such concepts are interconnected and are separated here just for the sake of clarity.

To discuss ownership, let's once again have a look at real world examples.

In the first scenario, I have a car (once again!) that I don't use any more, so I sell it. From today, I cannot drive that car any more, as it belongs to someone else.

The second situation is: I own a holiday home in a beautiful place, and a friend asked to use it for a couple of weeks. The ownership of the house doesn't change, but for a while it will be used by another person.

The third case is similar to a previous example. I wrote down the address of a shop on a piece of paper. A friend expresses interest in the same shop, so I copy the note and give it to them. Now, we both have the same information in two different physical locations.

The difference between the cases is clear. Once the car is sold it's not part of my possession any more. I can't drive it, but at the same time I don't have to pay insurance or to deal with it when it's time to demolish it. In the second case, the house is still mine, and I am responsible for council tax and repairs, but for a while it will be given to another. In the third case, each one of us owns their own copy of the information and is responsible for the data.

These three cases can be connected with the way function arguments are treated in Rust. When we pass variables to a function we need to consider the ownership of those variables (actually, the ownership of the data stored in those variables).

In Rust, each argument can be passed in one of the three possible ways:

Copy: we make a copy of the value and end up with two owners (like the address of the shop).
Move: we transfer ownership (like selling the car).
Borrow: we lend the value to a function but we want to have it back (like we did with the house).

It is important to understand that ownership is an additional check introduced by the Rust compiler, and that there is no such thing at CPU level.

Copy: two values and two owners

Let's consider the following code

fn process(value: u16) -> u16 {
    value * 2
}

fn main() {
    let v: u16 = 42; // (1)

    process(v); // (2)

    println!("Value {}", v); // (3)
}

Here, we initialise a value v (1) and we pass it by value to the function process (2).

The value of v is copied into the variable value when the function is called. This means that when we call process there are two independent areas of memory:

The area labelled v which is owned by main.
The area labelled value which is owned by process.

Both areas of memory (variables) contain the same value initially, but they are independent so they can change without affecting each other. We will see an example when we discuss mutability later.

There are two important facts to consider here:

We can use v (3) after we called process. As we copied the value into the function, our variable is still accessible.
The variable has been copied automatically by Rust. This happens because v is a simple type and implements the trait Copy out of the box.

The Copy trait

In Rust, the Copy trait is associated with bitwise copy, that is a trivial memory copy between memory areas. Going back to real world examples we can think of a photocopy of a document, where the two copies are exactly identical at the end of the process.

This is not always what we want and this is where the trait Clone comes into play. I won't discuss it further in this article, so make sure you read what the Rust book says about Memory and Allocation.

Some simple types in Rust implement the trait Copy out of the box: integers, floats, booleans, and char are among those. If you are unsure, you can always check the documentation. For example, u32 implements Copy as stated here.

It's important to remember that a struct doesn't implement Copy automatically. This was the case originally, but it was changed around 2014, and you can read a long and detailed explanation in RFC #19.
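Just to give an idea of how Clone looks in practice, here is a minimal sketch with a struct that cannot implement Copy because it contains a String. The struct `Label` and the function `consume` are hypothetical names that I use only for this example. The copy has to be requested explicitly with clone().

```rust
// A struct that owns heap data. String does not implement Copy,
// so this struct cannot implement it either, but it can derive Clone.
// (Hypothetical type, used only for this illustration.)
#[derive(Clone)]
struct Label {
    pub text: String,
}

// Takes ownership of its argument.
fn consume(label: Label) -> usize {
    label.text.len()
}

fn main() {
    let l = Label {
        text: String::from("forty-two"),
    };

    // The clone is moved into the function, the original stays in main.
    let length = consume(l.clone());

    println!("{} has {} characters", l.text, length);
}
```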

Move: transfer ownership

What happens when a variable is passed by value and its type doesn't implement the Copy trait? In Rust, the variable is moved to the function and the ownership is transferred (selling the car).

Let's have a look at the following code

struct Item {
    pub value: u16,
}

fn process(item: Item) -> u16 {
    item.value
}

fn main() {
    let i = Item { value: 42 };

    process(i);

    // This fails: value moved to process()
    println!("Item value {}", i.value);
}

The compiler won't accept it and will return the following error

    |
    |     let i = Item { value: 42 };
    |         - move occurs because `i` has type `Item`,
    |           which does not implement the `Copy` trait
    |
    |     process(i);
    |             - value moved here
...
    |     println!("Item value {}", i.value);
    |                               ^^^^^^^ value borrowed
    |                                       here after move
    |

After what we said in the previous sections, I believe the error messages are very clear.

When we called process passing i by value we gave up ownership of the variable, so it is not acceptable to use it after that instruction. We are basically trying to drive the car after we sold it.

Implementing Copy for structs

As you see, the Copy trait is explicitly mentioned. If we implement it for Item we should be able to go back to the previous case, with two copies of the value (i in main and item in process). When we define a struct that contains only types that implement Copy we can ask Rust to implement the trait for us using #[derive(Copy, Clone)].

#[derive(Copy, Clone)]
struct Item {
    pub value: u16,
}

fn process(item: Item) -> u16 {
    item.value
}

fn main() {
    let i = Item { value: 42 };

    process(i);

    // This succeeds: value copied to process()
    println!("Item value {}", i.value);
}

This code compiles as now Item can be copied when passed as an argument to process. The only field of Item is a u16, a type that implements the trait Copy, so it is sufficient to derive the trait to implement it for the structure.

To prove that ownership is a protection mechanism that exists only in Rust, it is interesting to compare the Assembly code for the two cases of a struct that doesn't implement Copy and for one that does. The code is exactly the same for both cases (we saw it in the first example of the article).

Borrow: lend the value to a function

So far we saw two cases that happen when we pass arguments by value. We learned however that we can also pass arguments by reference, so let's have a look at what happens in that case.

struct Item {
    pub value: u16,
}

fn process(item: &Item) -> u16 {
    item.value
}

fn main() {
    let i = Item { value: 42 };

    process(&i);

    println!("Item value {}", i.value);
}

Here, we pass to the function a reference to the value, just like we did with the house in the example. We are basically telling the function that it is allowed to access the value for a while, but that we retain ownership. The code compiles without errors.
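As a small sketch of what retaining ownership means, the following code lends the same value several times. The function `sum_twice` is a name I made up for the example. Each call borrows i only for the duration of the call, and read-only references to the same value can coexist.

```rust
struct Item {
    pub value: u16,
}

fn process(item: &Item) -> u16 {
    item.value
}

// Uses the same item through two read-only references at once.
// (Hypothetical helper, used only for this illustration.)
fn sum_twice(item: &Item) -> u16 {
    let first = item;
    let second = item;

    process(first) + process(second)
}

fn main() {
    let i = Item { value: 21 };

    // Each call borrows i and gives it back when it returns.
    let a = process(&i);
    let b = process(&i);

    println!("{} {} {}", a, b, sum_twice(&i));

    // i is still owned by main.
    println!("Item value {}", i.value);
}
```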

This seems to be exactly what happened when we passed by value a type that implements Copy, so what is the difference? As we said before, when data is read-only, passing arguments by value and by reference might produce identical results, and using references might look like an unnecessary complication.

Sooner or later, however, we will need to change the values of our variables. Time to discuss mutability.

Mutability¶

In Rust, variables are immutable unless they are declared as mutable. However, as was the case for ownership, it is important to remember that at CPU level everything is mutable.

Mutability, like ownership, is a feature that the compiler introduces to help us to write code that is more correct. Forcing us to declare a variable as mutable gives us the chance to ask ourselves if we need that and to avoid bugs. Mutability and ownership work together to ensure that we write safer code.

While the role of mut in front of variables is usually simple to grasp, mutable references might prove more complicated. For this reason, to discuss mutability we will consider separately the case of arguments passed by value and arguments passed by reference.

Mutability and arguments passed by value

Looking back at two of the real world examples in the previous section, we can quickly understand what happens if we introduce mutability with parameters passed by value.

When I sell the car (pass by value, move) or give away the address of the shop (pass by value, copy), the new owner is free to do whatever they want with the object they receive. They can decide to paint the car, or to burn the note, and those actions do not affect me at all.

The same happens with Rust variables. If the variable is moved (because it doesn't implement Copy), the ownership is transferred to the function, which is free to do whatever it wants.

struct Item {
    pub value: u16,
}

fn process(mut item: Item) {
    item.value += 1;
}

fn main() {
    let i = Item { value: 42 };

    // Variable moved
    process(i);
}

If the variable is copied (because it implements Copy) the function receives a copy of the value, and once again it's free to do whatever it wants with it.

#[derive(Copy, Clone)]
struct Item {
    pub value: u16,
}

fn process(mut item: Item) {
    item.value += 1;
}

fn main() {
    let i = Item { value: 42 };

    // Variable copied
    process(i);

    // i.value is 42
    println!("Value {}", i.value);
}

Here, we can access i after the call because it implements Copy, but the function changed the value of the copied value and not that of the original variable.

The two examples show what we said at the beginning of the article: passing arguments by value doesn't allow us to change the original variables. In Rust, when a variable is passed by value mutability can be ignored.

Declaring an argument as mutable

There is an important point to clarify. Let's have a look at the following code

fn process(mut value: u16) {
    value += 1;
}

fn main() {
    let i: u16 = 42;

    process(i);
}

As you see, the variable i is passed by value. This means that the function process is free to do whatever it wants with the argument value. However, i is not mutable, so how can the function increment value?

The syntax mut value in the function signature means: create a local mutable variable that will host a value coming from the outside.

This is paramount to understand: mut value doesn't mean that the argument passed in main has to be mutable. It just means that the argument will be hosted by a mutable local variable.

As a matter of fact, the following code compiles

fn process(mut value: u16) {
    value += 1;
}

fn main() {
    let mut i: u16 = 42;

    process(i);
}

But the compiler gives us a warning

warning: variable does not need to be mutable
    |
    |     let mut i: u16 = 42;
    |         ----^
    |         |
    |         help: remove this `mut`
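Another way to look at mut in a function signature is to compare it with an explicit local variable. In the following sketch (the names `increment_a` and `increment_b` are mine) the two functions behave in the same way: the first declares the parameter as mutable, the second receives an immutable parameter and rebinds it to a mutable local variable. In both cases the variable of the caller is never touched.

```rust
// The parameter is declared mutable in the signature.
fn increment_a(mut value: u16) -> u16 {
    value += 1;
    value
}

// Same behaviour: an immutable parameter rebound to a mutable local variable.
fn increment_b(value: u16) -> u16 {
    let mut value = value;
    value += 1;
    value
}

fn main() {
    // i is not mutable, and it doesn't need to be.
    let i: u16 = 42;

    println!("{}", increment_a(i));
    println!("{}", increment_b(i));

    // The variable of the caller still contains the original value.
    println!("{}", i);
}
```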

Mutable arguments by reference

When a friend borrows my house, I expect them to vacate it after a while, and its management remains my responsibility. While they live there, I might not like it if they paint the walls or change the furniture. I need to be clear about whether I am giving them an object that they can alter or not.

This is why in Rust we can pass arguments to functions by reference in two different ways. We can pass a normal reference (&) or a mutable reference (&mut).

Mutable references

The name "mutable reference" can be misleading, as &mut is a reference that allows us to mutate the referenced variable. The reference itself, like any other variable, is immutable unless stated otherwise.

We already saw how to pass a normal reference

fn process(value: &u16) -> u16 {
    *value
}

fn main() {
    let i: u16 = 42;

    process(&i);

    // i is 42
    println!("Item value {}", i);
}

Please note that everything is coherent here. The variable i is not mutable (because we don't need it), and is passed by (immutable) reference to the function. This means that the function receives a reference to a value and Rust knows that it is not allowed to change the referenced value.

If we want to change the value inside the function we need to alter the code in many places

fn process(value: &mut u16) { // (1)
    *value += 1;
}

fn main() {
    let mut i: u16 = 42; // (3)

    process(&mut i); // (2)

    // i is 43
    println!("Item value {}", i);
}

The function process needs to receive &mut u16 (1) because we want to change the referenced value. This means that main has to call the function passing &mut i (2) and not &i, as the types need to be coherent. We cannot create a mutable reference to an immutable value, though, so i has to be mutable as well (3).
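The difference between a mutable reference and a mutable variable that contains a reference can be sketched as follows. The functions `bump` and `repoint` are hypothetical names that I use only for this example. In the first case we can change the referenced value but not the reference, in the second case we can make the reference point somewhere else but we cannot change the referenced values.

```rust
// target is an immutable variable that hosts a mutable reference:
// we can change the referenced value through it.
fn bump(target: &mut u16) {
    *target += 1;
}

// Returns the values seen through current before and after re-pointing it.
fn repoint(a: u16, b: u16) -> (u16, u16) {
    // current is a mutable variable that hosts an immutable reference:
    // we can make it point somewhere else, but we cannot write through it.
    let mut current = &a;
    let before = *current;

    current = &b;
    let after = *current;

    (before, after)
}

fn main() {
    let mut i: u16 = 42;

    bump(&mut i);
    println!("i is {}", i);

    let (before, after) = repoint(1, 2);
    println!("before {} after {}", before, after);
}
```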

The functional way¶

There is a last option that is a standard strategy in functional languages, where often variables cannot be declared as mutable at all.

struct Item {
    pub value: u16,
}

fn process(mut item: Item) -> Item { // (2)
    item.value += 1;
    item // (3)
}

fn main() {
    let mut i = Item { value: 42 };

    // Variable moved
    // Reassigned to retain ownership
    i = process(i); // (1)

    println!("Value {}", i.value)
}

Here, we pass the variable i by value (1), which will copy or move the variable (depending on Copy). The function process declares the parameter as mutable (2) and changes its value. Then, it returns the same type it accepted (3) and the caller reassigns the old variable (1), which at this point has to be mutable.
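A variation of the same strategy, sketched here as a possible alternative, uses shadowing instead of reassignment. We declare a new variable with the same name using let, so the variable in main doesn't need to be mutable at all.

```rust
struct Item {
    pub value: u16,
}

fn process(mut item: Item) -> Item {
    item.value += 1;
    item
}

fn main() {
    // No mut here
    let i = Item { value: 42 };

    // The old i is moved into process,
    // and a new immutable i takes its place.
    let i = process(i);

    println!("Value {}", i.value);
}
```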

A quick recap¶

As you can see, it is very hard to separate call conventions, ownership, and mutability. They are all different aspects of function arguments in Rust, and they work together to ensure the code is safe. Before wrapping up, it might be useful to summarise how we decide the correct strategy.

Do you want to modify the original value?

YES: Variable has to be mutable. Pass by mutable reference OR return and reassign.
NO: Do you need to retain ownership?
- YES: Pass by reference OR implement Copy and pass by value.
- NO: Pass by value OR pass by reference.
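The options above can be sketched in a single piece of code, one function for each strategy. The function names are mine and just describe what each function does. The struct implements Copy only to allow the first function to leave the original variable accessible.

```rust
#[derive(Copy, Clone)]
struct Item {
    pub value: u16,
}

// Read only, pass by value: the caller keeps i because Item implements Copy.
fn read_by_value(item: Item) -> u16 {
    item.value
}

// Read only, pass by reference: the caller retains ownership.
fn read_by_reference(item: &Item) -> u16 {
    item.value
}

// Modify the original value through a mutable reference.
fn modify_in_place(item: &mut Item) {
    item.value += 1;
}

// Modify a local value and return it, so that the caller can reassign.
fn modify_and_return(mut item: Item) -> Item {
    item.value += 1;
    item
}

fn main() {
    let mut i = Item { value: 40 };

    println!("{}", read_by_value(i));
    println!("{}", read_by_reference(&i));

    modify_in_place(&mut i);
    i = modify_and_return(i);

    println!("{}", i.value);
}
```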

Disassemble Rust¶

If you want to disassemble the Rust code on your machine (for example to explore a different architecture) you can follow these steps.

Install cargo-binutils.
Create a project with Cargo: cargo new proto
Open Cargo.toml and turn off debug for the profile dev

Cargo.toml

[package]
name = "proto"
version = "0.1.0"
edition = "2021"

[dependencies]

[profile.dev]
debug=0

Use llvm-objdump options and awk to get the output you want. For example

$ cargo objdump -- \
   --disassemble \
   --x86-asm-syntax=intel \
   --demangle \
   --no-show-raw-insn \
   --no-print-imm-hex \
   | awk -v RS="" '/^[[:xdigit:]]+ <proto::/'

The awk script is useful to isolate the functions in your code only, skipping the boilerplate that the compiler has to put into an executable. Make sure you use the correct name of your project in the pattern if you use something other than proto.

Final words¶

This was quite a ride! I hope it was useful. Writing it definitely helped me to clarify in my head the available options and the reasons behind them. Happy coding!

Resources¶

I highly recommend watching this video by James Sherman that explains in detail how function code is called at CPU level.
Make sure you are familiar with the following concepts:
- The stack
- The calling convention of a specific CPU, such as the x86 calling convention.

A game of tokens: write an interpreter in Python with TDD - Part 5

2020-08-09T18:00:00+01:00

Introduction¶

This is part 5 of A game of tokens, a series of posts where I build an interpreter in Python following a pure TDD methodology and engaging you in a sort of a game: I give you the tests and you have to write the code that passes them. After part 4 I had a long hiatus because I focused on other projects, but now I resurrected this series and I'm moving on.

First of all I reviewed the first 4 posts, merging the posts that contained the solutions. While this is definitely better for me, I think it might be better for the reader as well, as it should be easier to follow along this way. Remember however that you learn if you do, not if you read!

Secondly, I was wondering in which direction to go, and I decided to shamelessly follow the steps of Ruslan Spivak, who first inspired this set of posts and who set off to build a Pascal interpreter; you can find the impressive series of posts Ruslan wrote on his website. Thank you Ruslan for the great posts!

So, let's go Pascal!

Tools update¶

I introduced black into my development toolset, so I used it to reformat the code

black smallcalc/*.py tests/*.py

And added a configuration file .flake8 for Flake8 to prevent the two tools from clashing

[flake8]
# Recommend matching the black line length (default 88),
# rather than using the flake8 default of 79:
max-line-length = 100
ignore = E231 E741

Level 17 - Reserved keywords and new assignment¶

Since Pascal has reserved keywords, I need tokens that have the keyword itself as value (something similar to Erlang's atoms). For this reason I changed test_empty_token_has_length_zero into

def test_empty_token_has_the_length_of_the_type_itself():
    t = token.Token("sometype")

    assert len(t) == len("sometype")
    assert bool(t) is True

and modified the code in the class Token to pass it

    def __len__(self):
        return len(self.value) if self.value else len(self.type)
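
As a rough illustration of what this fallback buys us, here is a stripped-down stand-in for the class Token. The real class in smallcalc has more methods; this version is an assumption made only to show the behaviour of `__len__`.

```python
class Token:
    """Hypothetical, minimal stand-in for smallcalc's Token."""

    def __init__(self, _type, value=None):
        self.type = _type
        self.value = value

    def __len__(self):
        # A keyword token has no value, so its length is that of its type.
        return len(self.value) if self.value else len(self.type)


keyword = Token("BEGIN")
name = Token("NAME", "counter")

print(len(keyword))   # length of the type "BEGIN"
print(len(name))      # length of the value "counter"
print(bool(keyword))  # truthiness follows __len__, so a keyword is truthy
```

Since Python falls back on `__len__` to decide the truthiness of an object without `__bool__`, a token without a value no longer evaluates to False.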

The keywords I will introduce in this post are BEGIN and END, so I need a test that shows they are supported

def test_get_tokens_understands_begin_and_end():
    l = clex.CalcLexer()

    l.load("BEGIN END")

    assert l.get_tokens() == [
        token.Token(clex.BEGIN),
        token.Token(clex.END),
        token.Token(clex.EOL),
        token.Token(clex.EOF),
    ]

The block BEGIN ... END is a generic compound block in Pascal (more on this later), and a Pascal program is made of that plus a final dot. Since the dot is already used for floats I need a test that shows it is correctly lexed.

def test_get_tokens_understands_final_dot():
    l = clex.CalcLexer()

    l.load("BEGIN END.")

    assert l.get_tokens() == [
        token.Token(clex.BEGIN),
        token.Token(clex.END),
        token.Token(clex.DOT),
        token.Token(clex.EOL),
        token.Token(clex.EOF),
    ]

Last, Pascal assignments are slightly different from what we already implemented, as they use the symbol := instead of just =. We face a choice here, as we have to decide where to put the logic of our programming language: shall the lexer identify : and = separately, and let the parser deal with the two tokens in sequence, or shall we make the lexer emit an ASSIGNMENT token directly? I went for the first one, so that the lexer can be kept simple (no lookahead in it), but you are obviously free to try something different. For me the test that checks the assignment is

def test_get_tokens_understands_assignment_and_semicolon():
    l = clex.CalcLexer()

    l.load("a := 5;")

    assert l.get_tokens() == [
        token.Token(clex.NAME, "a"),
        token.Token(clex.LITERAL, ":"),
        token.Token(clex.LITERAL, "="),
        token.Token(clex.INTEGER, "5"),
        token.Token(clex.LITERAL, ";"),
        token.Token(clex.EOL),
        token.Token(clex.EOF),
    ]

You may have noticed I also decided to check for the semicolon in this test. Even here, we might discuss if it's meaningful to test two different things together, and generally speaking I'm in favour of a high granularity in tests, which however means that I try to avoid testing unrelated and complicated features together. In Pascal, the semicolon is used to separate statements, so it is likely to be found at the end of something like an assignment. For this reason, and considering that it's a small feature, I put it in a context inside this test, and will extract it if more complex requirements arise in the future.

The parser has to be changed to support the new assignment, and to do that we first need to change the tests. The symbol = has to be replaced with := in the following tests: test_parse_assignment, test_parse_assignment_with_expression, test_parse_assignment_expression_with_variables, and test_parse_line_supports_assigment.

Solution¶

Supporting reserved keywords is just a matter of defining specific token types for them

BEGIN = "BEGIN"
END = "END"

RESERVED_KEYWORDS = [BEGIN, END]

and changing the method _process_name in order to detect them

def _process_name(self):
    regexp = re.compile(r"[a-zA-Z_]+")

    match = regexp.match(self._text_storage.tail)

    if not match:
        return None

    token_string = match.group()

    if token_string in RESERVED_KEYWORDS:
        tok = token.Token(token_string)
    else:
        tok = token.Token(NAME, token_string)

    return self._set_current_token_and_skip(tok)

I decided to put the logic in this method because after all reserved keywords are exactly names with a specific meaning. I might have created a dedicated method _process_keyword but it would basically have been a copy of _process_name so this solution makes sense to me.

To support the final dot I added a token for it

DOT = "DOT"

and a processing method

    def _process_dot(self):
        regexp = re.compile(r"\.$")

        match = regexp.match(self._text_storage.tail)

        if match:
            return self._set_current_token_and_skip(token.Token(DOT))

which is then introduced with a high priority in get_token

    def get_token(self):
        eof = self._process_eof()
        if eof:
            return eof

        eol = self._process_eol()
        if eol:
            return eol

        dot = self._process_dot()
        if dot:
            return dot

        self._process_whitespace()

        name = self._process_name()
        if name:
            return name

        number = self._process_number()
        if number:
            return number

        literal = self._process_literal()
        if literal:
            return literal
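
The anchor $ in the regular expression of _process_dot is worth a second look: the pattern accepts a dot only when nothing follows it. A quick experiment, independent of the lexer, shows the effect. The helper is_final_dot is a name I made up for this example and is not part of smallcalc.

```python
import re

# Same pattern used by _process_dot: a dot with nothing after it.
FINAL_DOT = re.compile(r"\.$")


def is_final_dot(tail):
    """Return True if the remaining text is just the closing dot (hypothetical helper)."""
    return FINAL_DOT.match(tail) is not None


print(is_final_dot("."))      # True: the dot closes the program
print(is_final_dot(".5"))     # False: something follows the dot
print(is_final_dot(". END"))  # False
```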

To pass the parser tests I just need to change the implementation of parse_assignment

    def parse_assignment(self):
        variable = self._parse_variable()
        self.lexer.discard(token.Token(clex.LITERAL, ":"))
        self.lexer.discard(token.Token(clex.LITERAL, "="))
        value = self.parse_expression()

Level 18 - Statements and compound statements¶

In Pascal a compound statement is a list of statements enclosed between BEGIN and END, so the final grammar we want to have in this post is

compound_statement : BEGIN statement_list END

statement_list : statement | statement SEMI statement_list

statement : compound_statement | assignment_statement | empty

assignment_statement : variable ASSIGN expr

As you can see this is a recursive definition, as the statement_list contains one or more statements, and each of them can be a compound_statement. The following is indeed a valid Pascal program

BEGIN
    BEGIN
        BEGIN
            writeln('Valid!')
        END
    END
END.
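
To get a feeling for the recursion before touching the real parser, here is a toy recogniser for the grammar above. It is only a sketch under heavy assumptions: it works on a list of strings that has already been split, any string that is not BEGIN, END or ; stands for a whole assignment, and the function names have nothing to do with CalcParser.

```python
def parse_compound(tokens, pos=0):
    """compound_statement : BEGIN statement_list END"""
    if pos >= len(tokens) or tokens[pos] != "BEGIN":
        raise ValueError(f"expected BEGIN at position {pos}")
    statements, pos = parse_statement_list(tokens, pos + 1)
    if pos >= len(tokens) or tokens[pos] != "END":
        raise ValueError(f"expected END at position {pos}")
    return statements, pos + 1


def parse_statement_list(tokens, pos):
    """statement_list : statement | statement SEMI statement_list"""
    statements = []
    while True:
        node, pos = parse_statement(tokens, pos)
        if node is not None:
            statements.append(node)
        if pos < len(tokens) and tokens[pos] == ";":
            pos += 1
            continue
        return statements, pos


def parse_statement(tokens, pos):
    """statement : compound_statement | assignment_statement | empty"""
    if pos < len(tokens) and tokens[pos] == "BEGIN":
        # This call is what makes the definition recursive.
        return parse_compound(tokens, pos)
    if pos < len(tokens) and tokens[pos] not in ("END", ";"):
        return tokens[pos], pos + 1  # an opaque "assignment"
    return None, pos  # the empty statement


tokens = ["BEGIN", "a", ";", "BEGIN", "b", "END", ";", "c", "END"]
tree, _ = parse_compound(tokens)
print(tree)  # ['a', ['b'], 'c']
```

Each nested BEGIN ... END becomes a nested list, which mirrors the way compound statements contain other compound statements.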

Recursive algorithms are not simple, and it takes some time to tackle them properly. Let's try to implement one small feature at a time. The first test is that parse_statement should be able to parse assignments

def test_parse_statement_assignment():
    p = cpar.CalcParser()
    p.lexer.load("x := 5")

    node = p.parse_statement()

    assert node.asdict() == {
        "type": "assignment",
        "variable": "x",
        "value": {"type": "integer", "value": 5},
    }

In future, statements will be more than just assignments, so this test is the first of many others that we will eventually have for parse_statement. The second test we need is that a compound statement can contain an empty list of statements.

def test_parse_empty_compound_statement():
    p = cpar.CalcParser()
    p.lexer.load("BEGIN END")

    node = p.parse_compound_statement()

    assert node.asdict() == {"type": "compound_statement", "statements": []}

After this is done, I want to test that the compound statement can contain a single statement

def test_parse_compound_statement_one_statement():
    p = cpar.CalcParser()
    p.lexer.load("BEGIN x:= 5 END")

    node = p.parse_compound_statement()

    assert node.asdict() == {
        "type": "compound_statement",
        "statements": [
            {
                "type": "assignment",
                "variable": "x",
                "value": {"type": "integer", "value": 5},
            }
        ],
    }

and multiple statements separated by semicolon

def test_parse_compound_statement_multiple_statements():
    p = cpar.CalcParser()
    p.lexer.load("BEGIN x:= 5; y:=6; z:=7 END")

    node = p.parse_compound_statement()

    assert node.asdict() == {
        "type": "compound_statement",
        "statements": [
            {
                "type": "assignment",
                "variable": "x",
                "value": {"type": "integer", "value": 5},
            },
            {
                "type": "assignment",
                "variable": "y",
                "value": {"type": "integer", "value": 6},
            },
            {
                "type": "assignment",
                "variable": "z",
                "value": {"type": "integer", "value": 7},
            },
        ],
    }

Solution¶

To pass the first test it is sufficient to add a method parse_statement that calls parse_assignment

    def parse_statement(self):
        with self.lexer:
            return self.parse_assignment()

The second test requires a bit more code. I need to define a method parse_compound_statement and this has to return a specific new type of node. A compound statement is a list of statements that have to be executed in order, so it's time to define a class CompoundStatementNode

class CompoundStatementNode(Node):

    node_type = "compound_statement"

    def __init__(self, statements=None):
        self.statements = statements if statements else []

    def asdict(self):
        return {
            "type": self.node_type,
            "statements": [statement.asdict() for statement in self.statements],
        }

and at this point parse_compound_statement is trivial, at least for now

    def parse_compound_statement(self):
        self.lexer.discard(token.Token(clex.BEGIN))
        self.lexer.discard(token.Token(clex.END))

        return CompoundStatementNode()

With the third test we have to add the processing of a single statement. As this is optional, it's a good use case for our lexer as a context manager

    def parse_compound_statement(self):
        nodes = []

        self.lexer.discard(token.Token(clex.BEGIN))

        with self.lexer:
            statement_node = self.parse_statement()
            if statement_node:
                nodes.append(statement_node)

        self.lexer.discard(token.Token(clex.END))

        return CompoundStatementNode(nodes)
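
If you don't remember how the lexer behaves as a context manager, the idea can be sketched as follows. This is not the code of CalcLexer, which may differ in the details: Cursor and this TokenError are stand-ins I invented to show the mechanism, which is to save the position on entry and to rewind it if the block fails with a token error.

```python
class TokenError(ValueError):
    """Raised when the next token is not the expected one (stand-in)."""


class Cursor:
    """Hypothetical token cursor that rolls back when a `with` block fails."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.position = 0
        self._saved = []

    def discard(self, expected):
        if self.position >= len(self.tokens) or self.tokens[self.position] != expected:
            raise TokenError(f"expected {expected!r}")
        self.position += 1

    def __enter__(self):
        # Remember where we were before the optional part started.
        self._saved.append(self.position)
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        saved = self._saved.pop()
        if exc_type is not None and issubclass(exc_type, TokenError):
            # The optional part did not match: rewind and swallow the error.
            self.position = saved
            return True
        return False


cursor = Cursor(["BEGIN", "END"])
cursor.discard("BEGIN")
with cursor:
    cursor.discard("x")  # fails, so the cursor goes back to "END"
cursor.discard("END")
print(cursor.position)  # 2
```

With this behaviour an optional element costs nothing when it is missing: the block is skipped and parsing continues from the saved position.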

And finally, for the fourth test, I have to process optional further statements separated by semicolons. For this, I make use of the method peek_token to look ahead and see if there is another statement to process

    def parse_compound_statement(self):
        nodes = []

        self.lexer.discard(token.Token(clex.BEGIN))

        with self.lexer:
            statement_node = self.parse_statement()
            if statement_node:
                nodes.append(statement_node)

            while self.lexer.peek_token() == token.Token(clex.LITERAL, ";"):
                self.lexer.discard(token.Token(clex.LITERAL, ";"))

                statement_node = self.parse_statement()

                if statement_node:
                    nodes.append(statement_node)

        self.lexer.discard(token.Token(clex.END))

        return CompoundStatementNode(nodes)

Level 19 - Recursive compound statements¶

To verify that compound statements are actually recursive, we can add this test

def test_parse_compound_statement_multiple_statements_with_compund_statement():
    p = cpar.CalcParser()
    p.lexer.load("BEGIN x:= 5; BEGIN y := 6 END ; z:=7 END")

    node = p.parse_compound_statement()

    assert node.asdict() == {
        "type": "compound_statement",
        "statements": [
            {
                "type": "assignment",
                "variable": "x",
                "value": {"type": "integer", "value": 5},
            },
            {
                "type": "compound_statement",
                "statements": [
                    {
                        "type": "assignment",
                        "variable": "y",
                        "value": {"type": "integer", "value": 6},
                    }
                ],
            },
            {
                "type": "assignment",
                "variable": "z",
                "value": {"type": "integer", "value": 7},
            },
        ],
    }

where the second statement is a compound statement itself. After this is done we can test the visitor (tests/test_calc_visitor.py) and see if we can process single statements

def test_visitor_compound_statement_one_statement():
    ast = {
        "type": "compound_statement",
        "statements": [
            {
                "type": "assignment",
                "variable": "x",
                "value": {"type": "integer", "value": 5},
            }
        ],
    }

    v = cvis.CalcVisitor()
    assert v.visit(ast) is None
    assert v.isvariable("x") is True
    assert v.valueof("x") == 5
    assert v.typeof("x") == "integer"

Multiple statements

def test_visitor_compound_statement_multiple_statements():
    ast = {
        "type": "compound_statement",
        "statements": [
            {
                "type": "assignment",
                "variable": "x",
                "value": {"type": "integer", "value": 5},
            },
            {
                "type": "assignment",
                "variable": "y",
                "value": {"type": "integer", "value": 6},
            },
            {
                "type": "assignment",
                "variable": "z",
                "value": {"type": "integer", "value": 7},
            },
        ],
    }

    v = cvis.CalcVisitor()
    assert v.visit(ast) is None

    assert v.isvariable("x") is True
    assert v.valueof("x") == 5
    assert v.typeof("x") == "integer"

    assert v.isvariable("y") is True
    assert v.valueof("y") == 6
    assert v.typeof("y") == "integer"

    assert v.isvariable("z") is True
    assert v.valueof("z") == 7
    assert v.typeof("z") == "integer"

and recursive compound statements

def test_visitor_compound_statement_multiple_statements_with_compund_statement():
    ast = {
        "type": "compound_statement",
        "statements": [
            {
                "type": "assignment",
                "variable": "x",
                "value": {"type": "integer", "value": 5},
            },
            {
                "type": "compound_statement",
                "statements": [
                    {
                        "type": "assignment",
                        "variable": "y",
                        "value": {"type": "integer", "value": 6},
                    }
                ],
            },
            {
                "type": "assignment",
                "variable": "z",
                "value": {"type": "integer", "value": 7},
            },
        ],
    }

    v = cvis.CalcVisitor()
    assert v.visit(ast) is None

    assert v.isvariable("x") is True
    assert v.valueof("x") == 5
    assert v.typeof("x") == "integer"

    assert v.isvariable("y") is True
    assert v.valueof("y") == 6
    assert v.typeof("y") == "integer"

    assert v.isvariable("z") is True
    assert v.valueof("z") == 7
    assert v.typeof("z") == "integer"

Solution¶

Before I added the first test I quickly refactored the code to follow the grammar a bit more closely, introducing parse_statement_list and calling it from parse_compound_statement. This is just a matter of isolating the part of the code that deals with the list of statements in its own method

    def parse_statement_list(self):
        nodes = []

        statement_node = self.parse_statement()
        if statement_node:
            nodes.append(statement_node)

        while self.lexer.peek_token() == token.Token(clex.LITERAL, ";"):
            self.lexer.discard(token.Token(clex.LITERAL, ";"))

            statement_node = self.parse_statement()

            if statement_node:
                nodes.append(statement_node)

        return nodes

    def parse_compound_statement(self):
        nodes = []

        self.lexer.discard(token.Token(clex.BEGIN))

        with self.lexer:
            nodes = self.parse_statement_list()

        self.lexer.discard(token.Token(clex.END))

        return CompoundStatementNode(nodes)

After this I introduce the new test, and to pass it I need to change parse_statement so that it parses either an assignment or a compound statement

    def parse_statement(self):
        with self.lexer:
            return self.parse_assignment()

        return self.parse_compound_statement()

Before I move to the visitor, I want to discuss a choice that I have here. The current version of the method parse_statement_list

    def parse_statement_list(self):
        nodes = []

        statement_node = self.parse_statement()
        if statement_node:
            nodes.append(statement_node)

        while self.lexer.peek_token() == token.Token(clex.LITERAL, ";"):
            self.lexer.discard(token.Token(clex.LITERAL, ";"))

            statement_node = self.parse_statement()

            if statement_node:
                nodes.append(statement_node)

        return nodes

might be easily written in a recursive way, to better match the grammar, becoming

    def parse_statement_list(self):
        nodes = []

        statement_node = self.parse_statement()
        if statement_node:
            nodes.append(statement_node)

        with self.lexer:
            self.lexer.discard(token.Token(clex.LITERAL, ";"))
            nodes.extend(self.parse_statement_list())

        return nodes

If you replace the code you can see that all the tests pass, so the solution is technically correct. While recursive algorithms are elegant and compact, however, in this case I will stick to the first version. Using a recursive approach introduces a limit to the number of calls, and while in this little project we probably won't have this issue, I think it is worth mentioning it. Both solutions are correct, though, so feel free to choose the recursive path if you happen to like it more.
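
The limit I am talking about is the maximum depth of the Python call stack, which you can read with sys.getrecursionlimit(). The following sketch has nothing to do with the parser itself; the two functions are made up to compare a recursive and an iterative way of collecting a long list of statements.

```python
import sys


def collect_recursive(statements, index=0):
    """Gather statements with one nested call per item."""
    if index >= len(statements):
        return []
    return [statements[index]] + collect_recursive(statements, index + 1)


def collect_iterative(statements):
    """Gather statements with a plain loop."""
    nodes = []
    for statement in statements:
        nodes.append(statement)
    return nodes


# One more statement than the number of nested calls Python allows.
many = ["x := 1"] * (sys.getrecursionlimit() + 1)

print(len(collect_iterative(many)) == len(many))

try:
    collect_recursive(many)
except RecursionError:
    print("the recursive version ran out of stack")
```

A program with that many statements in a single block is unlikely, but the loop does not care about the length of the list, while the recursive version does.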

The tests for the visitor can be passed with a minimal change, as the visitor itself just needs to be aware of compound_statement nodes and to know how to process them. So, I added a new condition to the method visit

        if node["type"] == "compound_statement":
            for statement in node["statements"]:
                self.visit(statement)

which passes all three new tests added for the visitor.
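
To see why such a small change is enough even for nested blocks, here is a self-contained sketch of a visitor that knows only three types of nodes. MiniVisitor and its attribute variables are assumptions made for this example; the real CalcVisitor handles many more nodes and exposes methods like valueof and typeof.

```python
class MiniVisitor:
    """Hypothetical visitor that only knows three node types."""

    def __init__(self):
        self.variables = {}

    def visit(self, node):
        if node["type"] == "integer":
            return node["value"], "integer"

        if node["type"] == "assignment":
            self.variables[node["variable"]] = self.visit(node["value"])
            return None

        if node["type"] == "compound_statement":
            # Nested compound statements are handled by this same branch.
            for statement in node["statements"]:
                self.visit(statement)
            return None

        raise ValueError(f"unknown node type {node['type']}")


ast = {
    "type": "compound_statement",
    "statements": [
        {"type": "assignment", "variable": "x", "value": {"type": "integer", "value": 5}},
        {
            "type": "compound_statement",
            "statements": [
                {"type": "assignment", "variable": "y", "value": {"type": "integer", "value": 6}},
            ],
        },
    ],
}

v = MiniVisitor()
v.visit(ast)
print(v.variables)  # {'x': (5, 'integer'), 'y': (6, 'integer')}
```

The recursion lives in the call to visit inside the loop, so the depth of the nesting does not require any additional code.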

Level 20 - Pascal programs and case insensitive names¶

A Pascal program ends with a dot, so we should introduce a new endpoint parse_program and test that it works. The first test verifies that we can parse an empty program

def test_parse_empty_program():
    p = cpar.CalcParser()
    p.lexer.load("BEGIN END.")

    node = p.parse_program()

    assert node.asdict() == {"type": "compound_statement", "statements": []}

and the second tests that the final dot can't be missing

import pytest

from smallcalc.calc_lexer import TokenError


def test_parse_program_requires_the_final_dot():
    p = cpar.CalcParser()
    p.lexer.load("BEGIN END")

    with pytest.raises(TokenError):
        p.parse_program()

Notice that I imported pytest and the TokenError exception to build a negative test (i.e. to test something that fails). The last test verifies a non-empty program can be parsed

def test_parse_program_with_nested_statements():
    p = cpar.CalcParser()
    p.lexer.load("BEGIN x:= 5; BEGIN y := 6 END ; z:=7 END.")

    node = p.parse_program()

    assert node.asdict() == {
        "type": "compound_statement",
        "statements": [
            {
                "type": "assignment",
                "variable": "x",
                "value": {"type": "integer", "value": 5},
            },
            {
                "type": "compound_statement",
                "statements": [
                    {
                        "type": "assignment",
                        "variable": "y",
                        "value": {"type": "integer", "value": 6},
                    }
                ],
            },
            {
                "type": "assignment",
                "variable": "z",
                "value": {"type": "integer", "value": 7},
            },
        ],
    }

When all these tests pass we are almost done for this post, and we just need to make the parser treat names in a case insensitive way. In Pascal, both variables and keywords are case-insensitive, so BEGIN and begin are the same keyword (or BeGiN, though I think this might be a misinterpretation of the concept of "snake case" =) ), and the same is valid for variables: you can define MYVAR and use myvar.

To test this behaviour I changed the test test_get_tokens_understands_uppercase_letters into test_get_tokens_is_case_insensitive

def test_get_tokens_is_case_insensitive():
    l = clex.CalcLexer()

    l.load("SomeVar")

    assert l.get_tokens() == [
        token.Token(clex.NAME, "somevar"),
        token.Token(clex.EOL),
        token.Token(clex.EOF),
    ]

and added the test for the two keywords we defined so far

def test_get_tokens_understands_begin_and_end_case_insensitive():
    l = clex.CalcLexer()

    l.load("begin end")

    assert l.get_tokens() == [
        token.Token(clex.BEGIN),
        token.Token(clex.END),
        token.Token(clex.EOL),
        token.Token(clex.EOF),
    ]

Solution¶

To parse a program we need to introduce the aptly named endpoint parse_program, which just parses a compound statement (the program) and the final dot.

    def parse_program(self):
        compound_statement = self.parse_compound_statement()
        self.lexer.discard(token.Token(clex.DOT))

        return compound_statement

As for the case insensitive names, it's just a matter of changing the method _process_name

    def _process_name(self):
        regexp = re.compile(r"[a-zA-Z_]+")

        match = regexp.match(self._text_storage.tail)

        if not match:
            return None

        token_string = match.group()

        if token_string.upper() in RESERVED_KEYWORDS:
            tok = token.Token(token_string.upper())
        else:
            tok = token.Token(NAME, token_string.lower())

        return self._set_current_token_and_skip(tok)

Note that I decided to store keywords internally with uppercase names and variables with lowercase ones. This is really just a matter of personal taste at this point of the project (and probably will always be), so feel free to follow the structure you like the most.
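
The normalisation can be tried in isolation with a tiny function that returns the type and the value a token would get. The function normalise_name is hypothetical and exists only in this example, and the plain strings stand in for the constants defined in the lexer.

```python
RESERVED_KEYWORDS = ["BEGIN", "END"]


def normalise_name(text):
    """Return (type, value) for a name, ignoring case (hypothetical helper)."""
    if text.upper() in RESERVED_KEYWORDS:
        # Keywords are stored in uppercase and carry no value.
        return text.upper(), None
    # Everything else is a variable name, stored in lowercase.
    return "NAME", text.lower()


print(normalise_name("BeGiN"))  # ('BEGIN', None)
print(normalise_name("MyVar"))  # ('NAME', 'myvar')
```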

Final words¶

That was something! I was honestly impressed by how easily I could introduce changes in the language and add new features, a testament to how powerful the TDD methodology is as a tool to have in your belt. Thanks again to Ruslan Spivak for his work and his inspiring posts!

The code I developed in this post is available on the GitHub repository tagged with part5 (link).

Feedback¶

Feel free to reach me on Twitter if you have questions. The GitHub issues page is the best place to submit corrections.

A game of tokens: solution - Part 4

2018-06-02T13:30:00+00:00

This post originally contained my solution to the challenge posted here. I moved those solutions inside the post itself, under the "Solution" subsections.

A game of tokens: write an interpreter in Python with TDD - Part 4

2018-06-02T13:00:00+00:00

Introduction¶

In the first three parts of this series of posts we developed together a calculator using a pure TDD methodology. In the previous post we added support for variables.

In this new post we will first add the exponentiation operation. The operator will be challenging because it has a high priority, so we will need to spice up the peek functions to look at multiple tokens.

Then I will show you how I performed a refactoring of the code introducing a new version of the lexer that greatly simplifies the code of the parser.

Level 15 - Exponentiation¶

That is power. - Conan the Barbarian (1982)

The exponentiation operation is simple, and Python uses the double star operator to represent it

>>> 2**3
8

The main problem that we will face implementing it is the priority of such an operation. Traditionally, this operator takes precedence over the basic arithmetic operations (sum, difference, multiplication, division). So if I write

>>> 1 + 2 ** 3
9

Python correctly computes 1 + (2 ** 3) and not (1 + 2) ** 3. As we did with multiplication and division, then, we will need to create a specific step to parse this operation.

In small calc, the exponentiation will be associated with the symbol ^, so 2^3 will mean 2 to the power of 3 (2**3 in Python).

Lexer¶

The lexer has a simple task, that of recognising the symbol '^' as a LITERAL token. The test goes into tests/test_calc_lexer.py

def test_get_tokens_understands_exponentiation():
    l = clex.CalcLexer()

    l.load('2 ^ 3')

    assert l.get_tokens() == [
        token.Token(clex.INTEGER, '2'),
        token.Token(clex.LITERAL, '^'),
        token.Token(clex.INTEGER, '3'),
        token.Token(clex.EOL),
        token.Token(clex.EOF)
    ]

Does your code already pass the test? If yes, why?

Parser¶

It's time to test the proper parsing of the exponentiation operation. Add this test to tests/test_calc_parser.py

def test_parse_exponentiation():
    p = cpar.CalcParser()
    p.lexer.load("2 ^ 3")

    node = p.parse_exponentiation()

    assert node.asdict() == {
        'type': 'exponentiation',
        'left': {
            'type': 'integer',
            'value': 2
        },
        'right': {
            'type': 'integer',
            'value': 3
        },
        'operator': {
            'type': 'literal',
            'value': '^'
        }
    }

As you can see this test checks directly the method parse_exponentiation, so you just need to properly implement that, at this stage.

To allow the use of the exponentiation operator '^' in the calculator, however, we have to integrate it with the rest of the parse functions, so we will add three tests to the same file. The first one tests that the natural priority of the exponentiation operator is higher than that of the multiplication

def test_parse_exponentiation_with_other_operators():
    p = cpar.CalcParser()
    p.lexer.load("3 * 2 ^ 3")

    node = p.parse_term()

    assert node.asdict() == {
        'type': 'binary',
        'operator': {
            'type': 'literal',
            'value': '*'
        },
        'left': {
            'type': 'integer',
            'value': 3
        },
        'right': {
            'type': 'exponentiation',
            'left': {
                'type': 'integer',
                'value': 2
            },
            'right': {
                'type': 'integer',
                'value': 3
            },
            'operator': {
                'type': 'literal',
                'value': '^'
            }
        }
    }

The second one checks that the parentheses still change the priority

def test_parse_exponentiation_with_parenthesis():
    p = cpar.CalcParser()
    p.lexer.load("(3 + 2) ^ 3")

    node = p.parse_term()

    assert node.asdict() == {
        'type': 'exponentiation',
        'operator': {
            'type': 'literal',
            'value': '^'
        },
        'left': {
            'type': 'binary',
            'operator': {
                'type': 'literal',
                'value': '+'
            },
            'left': {
                'type': 'integer',
                'value': 3
            },
            'right': {
                'type': 'integer',
                'value': 2
            }
        },
        'right': {
            'type': 'integer',
            'value': 3
        }
    }

And the third one checks that unary operators still have a higher priority than the exponentiation operator

def test_parse_exponentiation_with_negative_base():
    p = cpar.CalcParser()
    p.lexer.load("-2 ^ 2")

    node = p.parse_exponentiation()

    assert node.asdict() == {
        'type': 'exponentiation',
        'operator': {
            'type': 'literal',
            'value': '^'
        },
        'left': {
            'type': 'unary',
            'operator': {
                'type': 'literal',
                'value': '-'
            },
            'content': {
                'type': 'integer',
                'value': 2
            }
        },
        'right': {
            'type': 'integer',
            'value': 2
        }
    }

My advice is to add the first test and to try and pass that one changing the code. If your change doesn't touch too much of the existing parse methods, chances are that the following two tests will pass as well.
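
The third test deserves a note, because it makes small calc differ from Python, where the exponentiation binds more tightly than a unary minus on its left. A quick check in Python itself shows the two readings (nothing here uses smallcalc, and the variable names are only labels for the example).

```python
# Python reads -2 ** 2 as -(2 ** 2).
python_reading = -2 ** 2

# The test above asks small calc to read -2 ^ 2 as (-2) ^ 2.
smallcalc_reading = (-2) ** 2

print(python_reading)     # -4
print(smallcalc_reading)  # 4
```

Neither choice is wrong, but it is good to know which one the tests encode before writing the parser.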

Visitor¶

Last, we need to properly expose the exponentiation operation in the CLI, which means changing the Visitor in order to support nodes of type 'exponentiation'. The test that we need to add to tests/test_calc_visitor.py is

def test_visitor_exponentiation():
    ast = {
        'type': 'exponentiation',
        'left': {
            'type': 'integer',
            'value': 2
        },
        'right': {
            'type': 'integer',
            'value': 3
        },
        'operator': {
            'type': 'literal',
            'value': '^'
        }
    }

    v = cvis.CalcVisitor()
    assert v.visit(ast) == (8, 'integer')

And the change to the CalcVisitor class should be very easy as we need to simply process a new type of node.

Solution¶

The lexer can process the exponentiation operator ^ out of the box as a LITERAL token, so no changes to the code are needed.

The test test_parse_exponentiation can be passed adding a PowerNode class.

Note: After I wrote and committed the solution I realised that the class called PowerNode should have been called ExponentiationNode, the former being a leftover of a previous incorrect nomenclature. I will eventually fix it in one of the refactoring steps, trying to convert a mistake into a good example of TDD.

class PowerNode(BinaryNode):
    node_type = 'exponentiation'

and a method parse_exponentiation to the parser

    def parse_exponentiation(self):
        left = self.parse_factor()

        next_token = self.lexer.peek_token()

        if next_token.type == clex.LITERAL and next_token.value == '^':
            operator = self._parse_symbol()
            right = self.parse_exponentiation()

            return PowerNode(left, operator, right)

        return left

This allows the parser to explicitly parse the exponentiation operation, but when the operation is mixed with others the parser doesn't know how to deal with it, as parse_exponentiation is not called by any other method.
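
Note that parse_exponentiation calls itself to parse the right operand, which makes the operator right-associative: 2 ^ 3 ^ 2 is read as 2 ^ (3 ^ 2). The following sketch reproduces the same shape on a plain list of tokens. The functions parse_power and evaluate are invented for this example and use tuples instead of nodes.

```python
def parse_power(tokens, pos=0):
    """Parse `operand ^ operand ^ ...` from a list, nesting to the right."""
    left = tokens[pos]
    pos += 1

    if pos < len(tokens) and tokens[pos] == "^":
        # Recurse on the right-hand side, as parse_exponentiation does.
        right, pos = parse_power(tokens, pos + 1)
        return ("^", left, right), pos

    return left, pos


def evaluate(tree):
    """Compute the value of a tree built by parse_power."""
    if isinstance(tree, tuple):
        _, left, right = tree
        return evaluate(left) ** evaluate(right)
    return tree


tree, _ = parse_power([2, "^", 3, "^", 2])
print(tree)            # ('^', 2, ('^', 3, 2))
print(evaluate(tree))  # 512
```

A loop that accumulates on the left, like the one used for multiplication and division, would instead produce (2 ^ 3) ^ 2, which is 64.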

To pass the test_parse_exponentiation_with_other_operators test we need to change the method parse_term and call parse_exponentiation instead of parse_factor

    def parse_term(self):
        left = self.parse_exponentiation()

        next_token = self.lexer.peek_token()

        while next_token.type == clex.LITERAL\
                and next_token.value in ['*', '/']:
            operator = self._parse_symbol()
            right = self.parse_exponentiation()

            left = BinaryNode(left, operator, right)

            next_token = self.lexer.peek_token()

        return left

The full code of the CalcParser class is now

class CalcParser:

    def __init__(self):
        self.lexer = clex.CalcLexer()

    def _parse_symbol(self):
        t = self.lexer.get_token()
        return LiteralNode(t.value)

    def parse_integer(self):
        t = self.lexer.get_token()
        return IntegerNode(t.value)

    def _parse_variable(self):
        t = self.lexer.get_token()
        return VariableNode(t.value)

    def parse_factor(self):
        next_token = self.lexer.peek_token()

        if next_token.type == clex.LITERAL and next_token.value in ['-', '+']:
            operator = self._parse_symbol()
            factor = self.parse_factor()
            return UnaryNode(operator, factor)

        if next_token.type == clex.LITERAL and next_token.value == '(':
            self.lexer.discard_type(clex.LITERAL)
            expression = self.parse_expression()
            self.lexer.discard_type(clex.LITERAL)
            return expression

        if next_token.type == clex.NAME:
            t = self.lexer.get_token()
            return VariableNode(t.value)

        return self.parse_integer()

    def parse_exponentiation(self):
        left = self.parse_factor()

        next_token = self.lexer.peek_token()

        if next_token.type == clex.LITERAL and next_token.value == '^':
            operator = self._parse_symbol()
            right = self.parse_exponentiation()

            return PowerNode(left, operator, right)

        return left

    def parse_term(self):
        left = self.parse_exponentiation()

        next_token = self.lexer.peek_token()

        while next_token.type == clex.LITERAL\
                and next_token.value in ['*', '/']:
            operator = self._parse_symbol()
            right = self.parse_exponentiation()

            left = BinaryNode(left, operator, right)

            next_token = self.lexer.peek_token()

        return left

    def parse_expression(self):
        left = self.parse_term()

        next_token = self.lexer.peek_token()

        while next_token.type == clex.LITERAL\
                and next_token.value in ['+', '-']:
            operator = self._parse_symbol()
            right = self.parse_term()

            left = BinaryNode(left, operator, right)

            next_token = self.lexer.peek_token()

        return left

    def parse_assignment(self):
        variable = self._parse_variable()
        self.lexer.discard(token.Token(clex.LITERAL, '='))
        value = self.parse_expression()

        return AssignmentNode(variable, value)

    def parse_line(self):
        try:
            self.lexer.stash()
            return self.parse_assignment()
        except clex.TokenError:
            self.lexer.pop()
            return self.parse_expression()

The given test test_visitor_exponentiation requires the CalcVisitor to process nodes of type exponentiation. The code required to do this is

        if node['type'] == 'exponentiation':
            lvalue, ltype = self.visit(node['left'])
            rvalue, rtype = self.visit(node['right'])

            return lvalue ** rvalue, ltype

as a final case for the CalcVisitor class. The full code of the class is now

class CalcVisitor:

    def __init__(self):
        self.variables = {}

    def isvariable(self, name):
        return name in self.variables

    def valueof(self, name):
        return self.variables[name]['value']

    def typeof(self, name):
        return self.variables[name]['type']

    def visit(self, node):
        if node['type'] == 'integer':
            return node['value'], node['type']

        if node['type'] == 'variable':
            return self.valueof(node['value']), self.typeof(node['value'])

        if node['type'] == 'unary':
            operator = node['operator']['value']
            cvalue, ctype = self.visit(node['content'])

            if operator == '-':
                return - cvalue, ctype

            return cvalue, ctype

        if node['type'] == 'binary':
            lvalue, ltype = self.visit(node['left'])
            rvalue, rtype = self.visit(node['right'])

            operator = node['operator']['value']

            if operator == '+':
                return lvalue + rvalue, rtype
            elif operator == '-':
                return lvalue - rvalue, rtype
            elif operator == '*':
                return lvalue * rvalue, rtype
            elif operator == '/':
                return lvalue // rvalue, rtype

        if node['type'] == 'assignment':
            right_value, right_type = self.visit(node['value'])
            self.variables[node['variable']] = {
                'value': right_value,
                'type': right_type
            }

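            return None, None

        if node['type'] == 'exponentiation':
            lvalue, ltype = self.visit(node['left'])
            rvalue, rtype = self.visit(node['right'])

            return lvalue ** rvalue, ltype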

Intermezzo - Refactoring with tests¶

See? You just had to see it in context. - Scrooged (1988)

So our little project is growing, and the TDD methodology we are following gives us plenty of confidence in what we did. There are for sure bugs we are not aware of, but we are sure that the cases that we tested are correctly handled by our code.

As happens in many projects at a certain point it's time for refactoring. We implemented solutions to the problems that we found along the way, but are we sure we avoided duplicating code, that we chose the best solution for some algorithms, or more simply that the names we chose for the variables are clear?

Refactoring means basically to change the internal structure of something without changing its external behaviour, and tests are a priceless help in this phase. The tests we wrote are there to ensure that the previous behaviour does not change. Or, if it changes, that we are perfectly aware of it.

In this section, thus, I want to guide you through a refactoring guided by tests. If you are following this series and writing your own code this section will not add anything to the project, but I recommend that you read it anyway, as it shows why tests are so important in a software project.

If you want to follow the refactoring on the repository you can create a branch on the tag context-manager-refactoring and work there. From that commit I implemented the steps you will find in the next sections.

Context managers¶

The main issue the current code has is that the lexer cannot automatically recover a past status, that is we cannot easily try to parse something and, when we discover that the initial guess is wrong, go back in time and start over.

Let's consider the method parse_line

    def parse_line(self):
        try:
            self.lexer.stash()
            return self.parse_assignment()
        except clex.TokenError:
            self.lexer.pop()
            return self.parse_expression()

Since a line can contain either an assignment or an expression, the first thing this function does is to save the lexer status with stash and try to parse an assignment. If the code is not an assignment, somewhere in the code the TokenError exception is raised, and parse_line restores the previous status of the lexer and tries to parse an expression.

The same thing happens in other methods like parse_term

    def parse_term(self):
        left = self.parse_exponentiation()

        next_token = self.lexer.peek_token()

        while next_token.type == clex.LITERAL\
                and next_token.value in ['*', '/']:
            operator = self._parse_symbol()
            right = self.parse_exponentiation()

            left = BinaryNode(left, operator, right)

            next_token = self.lexer.peek_token()

        return left

where the use of lexer.peek_token and the while loop show that the lexer class requires too much control from its user.

Back to parse_line, it's clear that the code works, but it is not immediately easy to understand what the function does and when the old status is recovered. I would really prefer something like

# PSEUDOCODE
    def parse_line(self):
        ATTEMPT:
            return self.parse_assignment()

        return self.parse_expression()

where I used a pseudo-keyword ATTEMPT: to signal that somehow the lexer status is automatically stored at the beginning and retrieved at the end of it.

There's a very powerful concept in Python that allows us to write code like this, and it is called a context manager. I won't go into the theory and syntax of context managers here; please refer to the Python documentation or your favourite course/book/website to discover how they work.

If I can add context manager features to the lexer the code of parse_line might become

    def parse_line(self):
        with self.lexer:
            return self.parse_assignment()

        return self.parse_expression()

Lexer¶

The first move is to transform the lexer into a context manager that does nothing. The test in tests/test_calc_lexer.py is

def test_lexer_as_context_manager():
    l = clex.CalcLexer()
    l.load('abcd')

    with l:
        assert l.get_token() == token.Token(clex.NAME, 'abcd')

When this works we have to be sure that the lexer does not restore the previous state outside the with statement if the code inside the statement ended without errors. The new test is

def test_lexer_as_context_manager_does_not_restore_the_status_if_no_error():
    l = clex.CalcLexer()
    l.load('3 * 5')

    with l:
        assert l.get_token() == token.Token(clex.INTEGER, 3)

    assert l.get_token() == token.Token(clex.LITERAL, '*')

Conversely, we need to be sure that the status is restored when the code inside the with statement fails, which is the whole point of the context manager. This is tested by

def test_lexer_as_context_manager_restores_the_status_if_token_error():
    l = clex.CalcLexer()
    l.load('3 * 5')

    with l:
        l.get_token()
        l.get_token()
        raise clex.TokenError

    assert l.get_token() == token.Token(clex.INTEGER, 3)

When these three tests pass, we have a fully working context manager lexer, that reacts to TokenError exceptions going back to the previously stored status.
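The behaviour described by the three tests can be sketched with the two special methods __enter__ and __exit__. What follows is a minimal stand-alone example and not the code of CalcLexer: the class RewindableLexer, its list of ready-made tokens, and the stack of saved positions are assumptions made to keep the example short.

```python
class TokenError(ValueError):
    """Raised when the expected token cannot be found."""


class RewindableLexer:
    """Toy lexer over a list of tokens that rewinds when a with block fails."""

    def __init__(self, tokens):
        self._tokens = list(tokens)
        self._position = 0
        # A stack of positions, so that with blocks can be nested
        self._saved_positions = []

    def get_token(self):
        if self._position >= len(self._tokens):
            raise TokenError('No more tokens')

        current = self._tokens[self._position]
        self._position += 1
        return current

    def __enter__(self):
        # Remember where we were before the attempt starts
        self._saved_positions.append(self._position)
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        saved = self._saved_positions.pop()

        if exc_type is not None and issubclass(exc_type, TokenError):
            # The attempt failed: go back and swallow the exception
            self._position = saved
            return True

        # No error, or an unrelated one: keep the position, suppress nothing
        return False


lexer = RewindableLexer(['3', '*', '5'])

with lexer:
    lexer.get_token()
    lexer.get_token()
    raise TokenError

print(lexer.get_token())  # 3
```

Returning True from __exit__ is what tells Python to suppress the exception, so the code that follows the with statement runs as if the failed attempt had never happened.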

Parser¶

If the context manager lexer works as intended we should be able to replace the code of the parser without changing any test. The new code for parse_line is the one I showed before

    def parse_line(self):
        with self.lexer:
            return self.parse_assignment()

        return self.parse_expression()

and it works flawlessly.

The context manager part of the lexer, however, works only if the code inside the with statement raises a TokenError exception when it fails. That exception is the signal that tells the context manager that the parsing path is not leading anywhere and that the lexer shall go back to the previous state.

Manage literals¶

The method _parse_symbol is often used after some checks like

        if next_token.type == clex.LITERAL and next_token.value in ['-', '+']:
            operator = self._parse_symbol()

I would prefer to include the checks in the method itself, so that it might be included in a with statement. First of all the method can be renamed to _parse_literal, and being an internal method I don't expect any test to fail

    def _parse_literal(self):
        t = self.lexer.get_token()
        return LiteralNode(t.value)

The method should also raise a TokenError when the token is not a LITERAL, and when the values are not the expected ones

    def _parse_literal(self, values=None):
        t = self.lexer.get_token()

        if t.type != clex.LITERAL:
            raise clex.TokenError

        if values and t.value not in values:
            raise clex.TokenError

        return LiteralNode(t.value)

Note that using the default value for the values parameter I didn't change the current behaviour. The whole battery of tests still passes without errors.

Parsing factors¶

The next method that we can start changing is parse_factor. The first pattern this function tries to parse is a unary node

        if next_token.type == clex.LITERAL and next_token.value in ['-', '+']:
            operator = self._parse_literal()
            factor = self.parse_factor()
            return UnaryNode(operator, factor)

which may be converted to

        with self.lexer:
            operator = self._parse_literal(['+', '-'])
            content = self.parse_factor()
            return UnaryNode(operator, content)

while still passing the whole test suite.

The second pattern is an expression surrounded by parentheses

        if next_token.type == clex.LITERAL and next_token.value == '(':
            self.lexer.discard_type(clex.LITERAL)
            expression = self.parse_expression()
            self.lexer.discard_type(clex.LITERAL)
            return expression

and this is easily converted to the new syntax as well

        with self.lexer:
            self._parse_literal(['('])
            expression = self.parse_expression()
            self._parse_literal([')'])
            return expression

Parsing exponentiation operations¶

To change the method parse_exponentiation we first need to make the method _parse_variable raise a TokenError in case of a wrong token

    def _parse_variable(self):
        t = self.lexer.get_token()

        if t.type != clex.NAME:
            raise clex.TokenError

        return VariableNode(t.value)

then we can change the method we are interested in

    def parse_exponentiation(self):
        left = self.parse_factor()

        with self.lexer:
            operator = self._parse_literal(['^'])
            right = self.parse_exponentiation()

            return PowerNode(left, operator, right)

        return left

Doing this last substitution I also notice that there is some duplicated code in parse_factor, and I replace it with a call to _parse_variable. The test suite keeps passing, giving me the certainty that what I did does not change the behaviour of the code (at least the behaviour that is covered by tests).

Parsing terms¶

Now, the method parse_term will be problematic. To implement this function I used a while loop that keeps calling the method parse_exponentiation as long as the separation token is a LITERAL with value * or /

    def parse_term(self):
        left = self.parse_exponentiation()

        next_token = self.lexer.peek_token()

        while next_token.type == clex.LITERAL\
                and next_token.value in ['*', '/']:
            operator = self._parse_literal()
            right = self.parse_exponentiation()

            left = BinaryNode(left, operator, right)

            next_token = self.lexer.peek_token()

        return left

This is not a pure recursive call, then, and replacing the code with the context manager lexer would result in errors, because the context manager doesn't loop. The same situation is replicated in parse_expression.

This is another typical situation that we face when refactoring code. We realise that the required change is made of multiple steps and that multiple tests will fail until all the steps are completed.

There is no single solution to this problem, but TDD gives you some hints to deal with it. The most important "rule" that I follow when I work in a TDD environment is that there should be at most one failing test at a time. And when a code change makes multiple tests fail there is a simple way to reach this situation: comment out tests.

Yes, you should temporarily get rid of tests, so you can concentrate on writing code that passes the subset of active tests. Then you will add back one test at a time, fixing the code or the tests according to your needs. When you refactor it might be necessary to change the tests as well, as sometimes we test parts of the code that are not exactly an external boundary.

I can now change the code of the parse_term function introducing the context manager

    def parse_term(self):
        left = self.parse_exponentiation()

        with self.lexer:
            operator = self._parse_literal(['*', '/'])
            right = self.parse_exponentiation()

            return BinaryNode(left, operator, right)

        return left

and the test suite runs with one single failing test, test_parse_term_with_multiple_operations. I now have to work on it and try to understand why the test fails.

I decided to go for a pure recursive approach (no more while loops), which is what standard language parsers do. After working on it the new version of parse_term is

    def parse_term(self):
        left = self.parse_factor()

        with self.lexer:
            operator = self._parse_literal(['*', '/'])
            right = self.parse_term()

            return BinaryNode(left, operator, right)

        return left

And this change makes 3 tests fail: the same test_parse_term_with_multiple_operations that was failing before, plus test_parse_exponentiation_with_other_operators and test_parse_exponentiation_with_parenthesis. The last two actually test the method parse_exponentiation, which uses parse_term. This means that I can temporarily comment them out and concentrate on the first one.

What I discover is that changing the code to use the recursive approach changes the output of the functions. The previous output of parse_term applied to 2 * 3 / 4 was

{
    'type': 'binary',
    'left': {
        'type': 'binary',
        'left': {
            'type': 'integer',
            'value': 2
        },
        'right': {
            'type': 'integer',
            'value': 3
        },
        'operator': {
            'type': 'literal',
            'value': '*'
        }
    },
    'right': {
        'type': 'integer',
        'value': 4
    },
    'operator': {
        'type': 'literal',
        'value': '/'
    }
}

that is, the multiplication was stored first. Moving to a recursive approach makes the parse_term function return this

{
    'type': 'binary',
    'left': {
        'type': 'integer',
        'value': 2
    },
    'right': {
        'type': 'binary',
        'left': {
            'type': 'integer',
            'value': 3
        },
        'right': {
            'type': 'integer',
            'value': 4
        },
        'operator': {
            'type': 'literal',
            'value': '/'
        }
    },
    'operator': {
        'type': 'literal',
        'value': '*'
    }
}

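At first sight this structure, once visited, should return the same value as the previous one, as with real numbers multiplication and division can be regrouped freely. This is not true in general, though: the visitor implements / with floor division, so the two trees are not guaranteed to evaluate to the same number, and the same regrouping matters for chains of subtractions and when operators with different priority are involved, like sum and multiplication. For this test, which is about the way parse_term reads a sequence of operations, I accept the new structure and keep the grouping issue in mind.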
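The two shapes can be reproduced, and compared on actual numbers, with a small stand-alone sketch. The functions below work on a plain list of tokens and build tuples instead of nodes; their names are assumptions made for illustration. The function evaluate follows the visitor shown earlier in treating / as floor division.

```python
def parse_with_loop(tokens):
    """Group operators from the left, as the original while loop does."""
    left = tokens[0]
    pos = 1

    while pos < len(tokens):
        operator, right = tokens[pos], tokens[pos + 1]
        left = (left, operator, right)
        pos += 2

    return left


def parse_recursively(tokens):
    """Group operators from the right, as the recursive version does."""
    if len(tokens) == 1:
        return tokens[0]

    return (tokens[0], tokens[1], parse_recursively(tokens[2:]))


def evaluate(tree):
    """Evaluate a tree of '*' and '/' operations ('/' is floor division)."""
    if isinstance(tree, int):
        return tree

    left, operator, right = tree
    lvalue, rvalue = evaluate(left), evaluate(right)

    if operator == '*':
        return lvalue * rvalue
    if operator == '/':
        return lvalue // rvalue

    raise ValueError(operator)


tokens = [2, '*', 3, '/', 4]

print(parse_with_loop(tokens))    # ((2, '*', 3), '/', 4)
print(parse_recursively(tokens))  # (2, '*', (3, '/', 4))

print(evaluate(parse_with_loop(tokens)))    # 1
print(evaluate(parse_recursively(tokens)))  # 0
```

With other inputs, for example 8 * 4 / 2, both groupings give 16, so whether the difference shows up depends on the numbers involved. A test on the visitor would be the place to pin down the behaviour you actually want.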

This means that we may change the test and make it pass. Let me stress it once more: we have to understand why the test doesn't pass, and once we have understood the reason, and decided it is acceptable, we can change the test.

Tests are not immutable, they are mere safeguards that raise alarms when you change the behaviour. It's up to you to deal with the alarm and to decide what to do.

Once the test has been modified and the test suite passes, it's time to uncomment the first of the two remaining tests, test_parse_exponentiation_with_other_operators. This test uses parse_term to parse a string that contains an exponentiation, but the new code of the method parse_term doesn't call the parse_exponentiation function. So the test fails.

Parsing exponentiation¶

That test tries to parse a string that contains a multiplication and an exponentiation, so the method that we should use to process it is parse_term. The current version of parse_term, however, doesn't consider exponentiation, so the new code is

    def parse_term(self):
        left = self.parse_exponentiation()

        with self.lexer:
            operator = self._parse_literal(['*', '/'])
            right = self.parse_term()

            return BinaryNode(left, operator, right)

        return left

which makes all the active tests pass.

There is still one commented test, test_parse_exponentiation_with_parenthesis, that now passes with the new code.

Parsing expressions¶

The new version of parse_expression has the same issue we found with parse_term, that is the recursive approach changes the output.

    def parse_expression(self):
        left = self.parse_term()

        with self.lexer:
            operator = self._parse_literal(['+', '-'])
            right = self.parse_expression()

            left = BinaryNode(left, operator, right)

        return left

As before, we have to decide if the change is acceptable or if it is a real error. As happened for parse_term, the test can be safely changed to match the new code output.

Level 16 - Float numbers¶

So far, our calculator can handle only integer values, so it's time to add support for float numbers. This change shouldn't be complex: floating point numbers are easy to parse as they are basically two integer numbers separated by a dot.

Lexer¶

To test if the lexer understands floating point numbers it's enough to add this to tests/test_calc_lexer.py

def test_get_tokens_understands_floats():
    l = clex.CalcLexer()

    l.load('3.6')

    assert l.get_tokens() == [
        token.Token(clex.FLOAT, '3.6'),
        token.Token(clex.EOL),
        token.Token(clex.EOF)
    ]

Parser¶

To support float numbers it's enough to add that feature to the method that we use to parse integers. We can rename parse_integer to parse_number, fix the test test_parse_integer, and add test_parse_float

def test_parse_integer():
    p = cpar.CalcParser()
    p.lexer.load("5")

    node = p.parse_number()

    assert node.asdict() == {
        'type': 'integer',
        'value': 5
    }


def test_parse_float():
    p = cpar.CalcParser()
    p.lexer.load("5.8")

    node = p.parse_number()

    assert node.asdict() == {
        'type': 'float',
        'value': 5.8
    }

Visitor¶

The same thing that we did for the parser is valid for the visitor. We just need to copy the test for the integer numbers and adapt it

def test_visitor_float():
    ast = {
        'type': 'float',
        'value': 12.345
    }

    v = cvis.CalcVisitor()
    assert v.visit(ast) == (12.345, 'float')

This however leaves a bug in the visitor. The Test-Driven Development methodology can help you write and change your code, but cannot completely avoid bugs in it. Actually, if you don't test a case, TDD can't do anything to find and remove the bugs connected with it.

The bug I noticed after a while is that the visitor doesn't correctly manage an operation between a float and an integer, as it labels the result with the type of the right operand. For example, if you sum 5.1 with 4 you should get 9.1 with type float, but the visitor reports the type integer. To test this behaviour we can add this code

def test_visitor_expression_sum_with_float():
    ast = {
        'type': 'binary',
        'left': {
            'type': 'float',
            'value': 5.1
        },
        'right': {
            'type': 'integer',
            'value': 4
        },
        'operator': {
            'type': 'literal',
            'value': '+'
        }
    }

    v = cvis.CalcVisitor()
    assert v.visit(ast) == (9.1, 'float')

Solution¶

The first thing the lexer needs is a label to identify FLOAT tokens

FLOAT = 'FLOAT'

then the method _process_integer can be extended to process float numbers as well. To do this the method is renamed to _process_number, the regular expression is modified, and the token_type is chosen according to the presence of the dot.

    def _process_number(self):
        regexp = re.compile(r'[\d\.]+')

        match = regexp.match(
            self._text_storage.tail
        )

        if not match:
            return None

        token_string = match.group()

        token_type = FLOAT if '.' in token_string else INTEGER

        return self._set_current_token_and_skip(
            token.Token(token_type, token_string)
        )
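To experiment with the regular expression outside the lexer, the matching and the choice of the token type can be isolated in a small function. This is a stand-alone sketch: classify_number is a hypothetical helper, and it returns a plain tuple instead of a Token.

```python
import re

NUMBER_REGEXP = re.compile(r'[\d.]+')


def classify_number(text):
    """Return (token_type, token_string) for the number that opens text, or None."""
    match = NUMBER_REGEXP.match(text)

    if not match:
        return None

    token_string = match.group()

    # The presence of the dot is the only thing that tells floats from integers
    token_type = 'FLOAT' if '.' in token_string else 'INTEGER'

    return token_type, token_string


print(classify_number('3.6 + 2'))  # ('FLOAT', '3.6')
print(classify_number('42'))       # ('INTEGER', '42')
print(classify_number('abc'))      # None
print(classify_number('1.2.3'))    # ('FLOAT', '1.2.3')
```

The last call shows a limit of such a permissive pattern: a string like 1.2.3 is accepted as a FLOAT token, and it is only a later conversion with float() that would reject it with a ValueError. A stricter pattern, or a dedicated test, would be needed if you want the lexer to refuse it.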

Remember that the get_token function has to be modified to use the new name of the method. The new code is

    def get_token(self):
        eof = self._process_eof()
        if eof:
            return eof

        eol = self._process_eol()
        if eol:
            return eol

        self._process_whitespace()

        name = self._process_name()
        if name:
            return name

        number = self._process_number()
        if number:
            return number

        literal = self._process_literal()
        if literal:
            return literal

In the parser, first we need to add a new type of node

class FloatNode(ValueNode):
    node_type = 'float'

The new version of parse_integer, renamed parse_number, shall deal with both cases but also raise the TokenError exception if the parsing fails

    def parse_number(self):
        t = self.lexer.get_token()

        if t.type == clex.INTEGER:
            return IntegerNode(int(t.value))
        elif t.type == clex.FLOAT:
            return FloatNode(float(t.value))

        raise clex.TokenError

The change to the visitor to support float nodes is trivial: we just need to include them alongside the integer case

    def visit(self, node):
        if node['type'] in ['integer', 'float']:
            return node['value'], node['type']

To fix the missing type promotion when dealing with integers and floats it's enough to add

            if ltype == 'float':
                rtype = ltype

just before evaluating the operator in the binary nodes. The full code of the method visit is then

    def visit(self, node):
        if node['type'] in ['integer', 'float']:
            return node['value'], node['type']

        if node['type'] == 'variable':
            return self.valueof(node['value']), self.typeof(node['value'])

        if node['type'] == 'unary':
            operator = node['operator']['value']
            cvalue, ctype = self.visit(node['content'])

            if operator == '-':
                return - cvalue, ctype

            return cvalue, ctype

        if node['type'] == 'binary':
            lvalue, ltype = self.visit(node['left'])
            rvalue, rtype = self.visit(node['right'])

            operator = node['operator']['value']

            if ltype == 'float':
                rtype = ltype

            if operator == '+':
                return lvalue + rvalue, rtype
            elif operator == '-':
                return lvalue - rvalue, rtype
            elif operator == '*':
                return lvalue * rvalue, rtype
            elif operator == '/':
                return lvalue // rvalue, rtype

        if node['type'] == 'assignment':
            right_value, right_type = self.visit(node['value'])
            self.variables[node['variable']] = {
                'value': right_value,
                'type': right_type
            }

            return None, None

        if node['type'] == 'exponentiation':
            lvalue, ltype = self.visit(node['left'])
            rvalue, rtype = self.visit(node['right'])

            return lvalue ** rvalue, ltype
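In the binary case the result carries the type of the right operand unless the left one is a float, which amounts to saying that the result is a float when at least one operand is a float. The same rule can be stated symmetrically with a tiny helper. This is only a sketch: promote and add are hypothetical names, and they are not part of the CalcVisitor class.

```python
def promote(ltype, rtype):
    """Return 'float' if at least one operand type is 'float', else 'integer'."""
    if 'float' in (ltype, rtype):
        return 'float'

    return 'integer'


def add(left, right):
    """Sum two (value, type) pairs, as the visitor does for a '+' binary node."""
    lvalue, ltype = left
    rvalue, rtype = right

    return lvalue + rvalue, promote(ltype, rtype)


print(add((4, 'integer'), (5, 'integer')))   # (9, 'integer')
print(add((4, 'integer'), (5.5, 'float')))   # (9.5, 'float')
print(add((5.5, 'float'), (4, 'integer')))   # (9.5, 'float')
```

Note that the exponentiation case shown above returns the type of the left operand without any promotion, so an expression with an integer base and a float exponent is a case that might deserve its own test.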

Final words¶

This post showed you how powerful the TDD methodology is when it comes to refactoring, or in general when the code has to be changed. Remember that tests can be changed if there are good reasons, and that the main point is to understand what's happening in your code and in the cases that you already tested.

The code I developed in this post is available on the GitHub repository tagged with part4 (link).

Feedback¶

Feel free to reach me on Twitter if you have questions. The GitHub issues page is the best place to submit corrections.

A game of tokens: solution - Part 3

2017-10-31T12:00:00+00:00

This post originally contained my solution to the challenge posted here. I moved those solutions inside the post itself, under the "Solution" subsections.

A game of tokens: write an interpreter in Python with TDD - Part 3

2017-10-31T11:00:00+00:00

Introduction¶

This is the third instalment of a series of posts on how to write an interpreter in Python. In the first part we developed together a small command line calculator that could sum and subtract numbers, while in the second part we went further adding multiplication, division and unary plus and minus.

In this third part we will start adding variables to our calculator, moving towards a proper programming language.

Mezzanine - Refactoring¶

Often, after writing some code, you realise that some of the original choices you made are not perfect, especially when it comes to variable and function names. Furthermore, you may realise that some of your functions are too long, and you may consider splitting them into multiple functions to make the code easier to understand and to use.

It is time, then, to reconsider the code of smallcalc and see if we can improve it. Luckily, having all our tests in place, we may refactor it, that is we can change the code with a high degree of confidence, as the tests check that the behaviour of the whole system doesn't change.

The first change is the naming of the method parse_addsymbol, which now can be more aptly named _parse_symbol. As Martin Fowler says in his book "Refactoring" (a recommended reading): "Life being what it is, you won't get your names right the first time. [...] Remember your code is for a human first and a computer second. Humans need good names." The name of the method will be prefixed with an underscore because this method is used only internally, and shouldn't be used by third parties.

The proper way to change the name of a method involves calling the new method from the old one, but in this case we may safely rely on tests to tell us what needs to be fixed (this is because our codebase is small). We may thus open the file smallcalc/calc_parser.py and change the name to _parse_symbol. At this point, running the test suite, you should have 11 failures. You can fix them with a text replace action of your editor of choice, but I recommend you to make the tests fail before replacing the text. The 3 replacements are in the methods parse_factor, parse_term, and parse_expression.

I then wanted to add two methods, discard and discard_type, to the lexer, to better control what gets discarded. At the moment the code is using self.lexer.get_token, which doesn't allow us to explicitly check what we are dropping. These are the tests that I added to tests/test_calc_lexer.py

import pytest

def test_discard_tokens():
    l = clex.CalcLexer()
    l.load('3 + 5')

    l.discard(token.Token(clex.INTEGER, '3'))
    assert l.get_token() == token.Token(clex.LITERAL, '+')


def test_discard_checks_equality():
    l = clex.CalcLexer()
    l.load('3 + 5')

    with pytest.raises(clex.TokenError):
        l.discard(token.Token(clex.INTEGER, '5'))


def test_discard_tokens_by_type():
    l = clex.CalcLexer()
    l.load('3 + 5')

    l.discard_type(clex.INTEGER)
    assert l.get_token() == token.Token(clex.LITERAL, '+')


def test_discard_type_checks_equality():
    l = clex.CalcLexer()
    l.load('3 + 5')

    with pytest.raises(clex.TokenError):
        l.discard_type(clex.LITERAL)

As you can see, the idea is for both methods to require a parameter, either the token or the type. The code that passes these tests is made of a custom exception in smallcalc/calc_lexer.py

class TokenError(ValueError):
    """ The expected token cannot be found """

and, in the CalcLexer class in the same file

    def discard(self, token):
        if self.get_token() != token:
            raise TokenError(
                'Expected token {}, found {}'.format(
                    token, self._current_token
                ))

    def discard_type(self, _type):
        t = self.get_token()

        if t.type != _type:
            raise TokenError(
                'Expected token of type {}, found {}'.format(
                    _type, self._current_token.type
                ))
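
To see the two methods at work without the full lexer, here is a minimal sketch. ListLexer is a hypothetical stand-in for CalcLexer that serves tokens from a prepared list, and Token is a simplified namedtuple that I assume compares by type and value like the project's token.Token. Only discard and discard_type mirror the code above.

```python
from collections import namedtuple

# Simplified stand-in for token.Token (assumption: equality on type and value)
Token = namedtuple('Token', ['type', 'value'])

INTEGER = 'integer'
LITERAL = 'literal'


class TokenError(ValueError):
    """ The expected token cannot be found """


class ListLexer:
    """ Hypothetical stand-in for CalcLexer, fed with ready-made tokens """

    def __init__(self, tokens):
        self._tokens = list(tokens)
        self._current_token = None

    def get_token(self):
        # Consume and remember the next token
        self._current_token = self._tokens.pop(0)
        return self._current_token

    def discard(self, token):
        if self.get_token() != token:
            raise TokenError(
                'Expected token {}, found {}'.format(
                    token, self._current_token
                ))

    def discard_type(self, _type):
        t = self.get_token()

        if t.type != _type:
            raise TokenError(
                'Expected token of type {}, found {}'.format(
                    _type, t.type
                ))


def make_lexer():
    # Tokens for the text '3 + 5'
    return ListLexer([
        Token(INTEGER, '3'),
        Token(LITERAL, '+'),
        Token(INTEGER, '5'),
    ])


lexer = make_lexer()
lexer.discard(Token(INTEGER, '3'))   # matches, so it is silently dropped
next_token = lexer.get_token()       # the '+' literal

lexer = make_lexer()
try:
    lexer.discard_type(LITERAL)      # the first token is an integer
    error_message = None
except TokenError as exc:
    error_message = str(exc)
```

Note that both methods consume the token even when the check fails, so a caller that wants to retry after a TokenError needs a way to restore the state of the lexer.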

As I am satisfied with the code that I have now, I will move on to add new features.

Level 13 - Variables¶

I have been assigned by my strength and cunning. - Up (2009)

Variables are labels assigned to values, so what we need to add is a way for the user to make this assignment and then to use variables instead of actual values. The simplest syntax, used by many languages, is name = value and we will stick to this. Usually languages allow only a subset of symbols in the name of a variable, so we will support lower- and uppercase names that may also contain an underscore.

Lexer¶

We want the lexer to recognise a new token called NAME, so the test we have to add to tests/test_calc_lexer.py is

def test_get_tokens_understands_letters():
    l = clex.CalcLexer()

    l.load('somevar')

    assert l.get_tokens() == [
        token.Token(clex.NAME, 'somevar'),
        token.Token(clex.EOL),
        token.Token(clex.EOF)
    ]

This test checks only the support for lowercase letters. Since we also want to support uppercase letters and underscores we need another pair of tests

def test_get_tokens_understands_uppercase_letters():
    l = clex.CalcLexer()

    l.load('SomeVar')

    assert l.get_tokens() == [
        token.Token(clex.NAME, 'SomeVar'),
        token.Token(clex.EOL),
        token.Token(clex.EOF)
    ]


def test_get_tokens_understands_names_with_underscores():
    l = clex.CalcLexer()

    l.load('some_var')

    assert l.get_tokens() == [
        token.Token(clex.NAME, 'some_var'),
        token.Token(clex.EOL),
        token.Token(clex.EOF)
    ]

If you wonder why I created two tests instead of just one with both uppercase letters and underscores, the reason is that I generally prefer to have tests that focus on one specific feature. This obviously depends on the level of granularity that you want, and in this case we are discussing very simple features, so I would not argue if I saw both tested at the same time.

Parser¶

To support variables in expressions we need to change the behaviour of parse_factor, which is the method where we parse the building blocks like integers or unary operators. The test you need to add to tests/test_calc_parser.py is

def test_parse_factor_variable():
    p = cpar.CalcParser()
    p.lexer.load("somevar")

    node = p.parse_factor()

    assert node.asdict() == {
        'type': 'variable',
        'value': 'somevar'
    }

After this we want to provide support for variable assignments. Working on the parser we need only to output the correct node so the test is pretty straightforward

def test_parse_assignment():
    p = cpar.CalcParser()
    p.lexer.load("x = 5")

    node = p.parse_assignment()

    assert node.asdict() == {
        'type': 'assignment',
        'variable': 'x',
        'value': {
            'type': 'integer',
            'value': 5
        }
    }

This test tries to assign the value 5 to the variable x, but in general we want to support assignment with expressions, so we should test this behaviour as well, including the presence of variables

def test_parse_assignment_with_expression():
    p = cpar.CalcParser()
    p.lexer.load("x = 4 * (3 + 5)")

    node = p.parse_assignment()

    assert node.asdict() == {
        'type': 'assignment',
        'variable': 'x',
        'value': {
            'type': 'binary',
            'operator': {
                'type': 'literal',
                'value': '*'
            },
            'left': {
                'type': 'integer',
                'value': 4
            },
            'right': {
                'type': 'binary',
                'operator': {
                    'type': 'literal',
                    'value': '+'
                },
                'left': {
                    'type': 'integer',
                    'value': 3
                },
                'right': {
                    'type': 'integer',
                    'value': 5
                }
            }
        }
    }


def test_parse_assignment_expression_with_variables():
    p = cpar.CalcParser()
    p.lexer.load("x = y + 4")

    node = p.parse_assignment()

    assert node.asdict() == {
        "type": "assignment",
        "variable": "x",
        'value': {
            'type': 'binary',
            'operator': {
                'type': 'literal',
                'value': '+'
            },
            'left': {
                'type': 'variable',
                'value': 'y'
            },
            'right': {
                'type': 'integer',
                'value': 4
            },
        }
    }

Visitor¶

It is now time to implement the code that actually stores and retrieves variables, which is what happens in the visitor when an assignment or a variable node is processed. For the moment we do not have specific requirements for variables and we can treat the storage space as a big global dictionary.

The test we want to pass specifies the initial API of the storage space when we assign a value to a variable

def test_visitor_assignment():
    ast = {
        'type': 'assignment',
        'variable': 'x',
        'value': {
            'type': 'integer',
            'value': 5
        }
    }

    v = cvis.CalcVisitor()
    assert v.visit(ast) == (None, None)
    assert v.isvariable('x') is True
    assert v.valueof('x') == 5
    assert v.typeof('x') == 'integer'

We want the visitor to provide three new methods, isvariable, valueof, and typeof, that allow us to interact with the variables we defined.

The last change that the visitor requires is some code that allows it to read the value of variables to be able to use them when computing the result of an expression. The test is then

def test_visitor_variable():
    assignment_ast = {
        'type': 'assignment',
        'variable': 'x',
        'value': {
            'type': 'integer',
            'value': 123
        }
    }

    read_ast = {
        'type': 'variable',
        'value': 'x'
    }

    v = cvis.CalcVisitor()
    v.visit(assignment_ast)
    assert v.visit(read_ast) == (123, 'integer')

where two different ASTs have been created. The first one assigns a value to the variable, the second one reads it and returns its value. Note that the visitor returns both value and type of the variable, which seems reasonable to implement later checks of equality or other operations on variables.

Solution¶

To pass the test_get_tokens_understands_letters test I added a method _process_name to the CalcLexer class

    def _process_name(self):
        regexp = re.compile('[a-z]+')

        match = regexp.match(
            self._text_storage.tail
        )

        if not match:
            return None

        token_string = match.group()

        return self._set_current_token_and_skip(
            token.Token(NAME, token_string)
        )

and then added it to the method get_token. The new version of the latter is then

    def get_token(self):
        eof = self._process_eof()
        if eof:
            return eof

        eol = self._process_eol()
        if eol:
            return eol

        self._process_whitespace()

        name = self._process_name()
        if name:
            return name

        integer = self._process_integer()
        if integer:
            return integer

        literal = self._process_literal()
        if literal:
            return literal

At this point to pass the remaining tests test_get_tokens_understands_uppercase_letters and test_get_tokens_understands_names_with_underscores it is sufficient to change the regular expression we use in _process_name

    def _process_name(self):
        regexp = re.compile('[a-zA-Z_]+')

        match = regexp.match(
            self._text_storage.tail
        )

        if not match:
            return None

        token_string = match.group()

        return self._set_current_token_and_skip(
            token.Token(NAME, token_string)
        )
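
A quick way to check what the new pattern accepts is to try it outside the lexer. The helper leading_name below is hypothetical and only mimics what _process_name does with the tail of the text; the regular expression is the one used above.

```python
import re

# Same pattern used in _process_name: letters and underscores, no digits
regexp = re.compile('[a-zA-Z_]+')


def leading_name(text):
    # match() only looks at the beginning of the string,
    # just like the lexer does with the tail of its text
    match = regexp.match(text)
    return match.group() if match else None


results = {
    'somevar': leading_name('somevar'),
    'Some_Var = 5': leading_name('Some_Var = 5'),
    '3 + 5': leading_name('3 + 5'),
    'var1': leading_name('var1'),
}
print(results)
```

The last case shows a limitation of this pattern: digits are not part of it, so for var1 only var is matched as a name and the 1 is left in the text for the following token.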

The required change to parse_factor is simple, but since we will be returning a new type of node we have to define it

class VariableNode(ValueNode):

    node_type = 'variable'

We can then add the required if statement in parse_factor, which becomes

    def parse_factor(self):
        next_token = self.lexer.peek_token()

        if next_token.type == clex.LITERAL and next_token.value in ['-', '+']:
            operator = self._parse_symbol()
            factor = self.parse_factor()
            return UnaryNode(operator, factor)

        if next_token.type == clex.LITERAL and next_token.value == '(':
            self.lexer.discard_type(clex.LITERAL)
            expression = self.parse_expression()
            self.lexer.discard_type(clex.LITERAL)
            return expression

        if next_token.type == clex.NAME:
            t = self.lexer.get_token()
            return VariableNode(t.value)

        return self.parse_integer()

The second test that we have to pass checks if the parser can understand variable assignments. First of all we need to define AssignmentNode which is the node we will return to the visitor.

class AssignmentNode(Node):

    node_type = 'assignment'

    def __init__(self, variable, value):
        self.variable = variable
        self.value = value

    def asdict(self):
        return {
            'type': self.node_type,
            'variable': self.variable.value,
            'value': self.value.asdict(),
        }

At this point we need a method to parse a variable in CalcParser. This method is very similar to _parse_symbol and parse_integer

    def _parse_variable(self):
        t = self.lexer.get_token()
        return VariableNode(t.value)

Since the test is running parse_assignment we just need to add that method. We want the assignment to have a variable as its left member and an expression as its right member

    def parse_assignment(self):
        variable = self._parse_variable()
        self.lexer.discard_type(clex.LITERAL)
        value = self.parse_expression()

        return AssignmentNode(variable, value)

This code makes both the test_parse_assignment and the test_parse_assignment_with_expression tests pass.

As discussed in the introductory text before the test code the variable storage space can be a simple dictionary. The key will be the name of the variable, and the content will be another dictionary with value and type. This is sufficient for the moment and should also be extensible when future requirements arise.
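
As a small sketch of the shape of this storage, here is what the dictionary would contain after processing x = 5. The variable names used in the snippet are only illustrative.

```python
# Global storage: variable name -> {'value': ..., 'type': ...}
variables = {}

# What gets stored for the assignment "x = 5"
variables['x'] = {'value': 5, 'type': 'integer'}

# Reading back mirrors what isvariable, valueof, and typeof have to do
is_defined = 'x' in variables
value = variables['x']['value']
value_type = variables['x']['type']

# A name that was never assigned is simply a missing key
y_is_defined = 'y' in variables
```

With this layout, reading the value of a name that was never assigned raises a KeyError, which is something to keep in mind for future error handling.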

The CalcVisitor class can then be changed to get the new methods, and an __init__ that initializes the dictionary. I also added the relevant if statement to the method visit of the same class. The new CalcVisitor is then

class CalcVisitor:

    def __init__(self):
        self.variables = {}

    def isvariable(self, name):
        return name in self.variables

    def valueof(self, name):
        return self.variables[name]['value']

    def typeof(self, name):
        return self.variables[name]['type']

    def visit(self, node):
        if node['type'] == 'integer':
            return node['value'], node['type']

        if node['type'] == 'unary':
            operator = node['operator']['value']
            cvalue, ctype = self.visit(node['content'])

            if operator == '-':
                return - cvalue, ctype

            return cvalue, ctype

        if node['type'] == 'binary':
            lvalue, ltype = self.visit(node['left'])
            rvalue, rtype = self.visit(node['right'])

            operator = node['operator']['value']

            if operator == '+':
                return lvalue + rvalue, rtype
            elif operator == '-':
                return lvalue - rvalue, rtype
            elif operator == '*':
                return lvalue * rvalue, rtype
            elif operator == '/':
                return lvalue // rvalue, rtype

        if node['type'] == 'assignment':
            right_value, right_type = self.visit(node['value'])
            self.variables[node['variable']] = {
                'value': right_value,
                'type': right_type
            }

            return None, None

To pass the second test we need only to change the method visit adding an if statement for the variable nodes. The new version of the method is

    def visit(self, node):
        if node['type'] == 'integer':
            return node['value'], node['type']

        if node['type'] == 'variable':
            return self.valueof(node['value']), self.typeof(node['value'])

        if node['type'] == 'unary':
            operator = node['operator']['value']
            cvalue, ctype = self.visit(node['content'])

            if operator == '-':
                return - cvalue, ctype

            return cvalue, ctype

        if node['type'] == 'binary':
            lvalue, ltype = self.visit(node['left'])
            rvalue, rtype = self.visit(node['right'])

            operator = node['operator']['value']

            if operator == '+':
                return lvalue + rvalue, rtype
            elif operator == '-':
                return lvalue - rvalue, rtype
            elif operator == '*':
                return lvalue * rvalue, rtype
            elif operator == '/':
                return lvalue // rvalue, rtype

        if node['type'] == 'assignment':
            right_value, right_type = self.visit(node['value'])
            self.variables[node['variable']] = {
                'value': right_value,
                'type': right_type
            }

            return None, None
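
The important consequence of keeping the dictionary in the visitor instance is that variables survive between calls to visit. The following sketch shows this with MiniVisitor, a condensed and hypothetical version of CalcVisitor that supports only what the example needs; integer and binary are helper functions I introduced to build the AST dictionaries by hand.

```python
class MiniVisitor:
    """ Condensed sketch of CalcVisitor, limited to '+' and '*' """

    def __init__(self):
        self.variables = {}

    def visit(self, node):
        if node['type'] == 'integer':
            return node['value'], node['type']

        if node['type'] == 'variable':
            stored = self.variables[node['value']]
            return stored['value'], stored['type']

        if node['type'] == 'binary':
            lvalue, ltype = self.visit(node['left'])
            rvalue, rtype = self.visit(node['right'])
            operator = node['operator']['value']

            if operator == '+':
                return lvalue + rvalue, rtype
            if operator == '*':
                return lvalue * rvalue, rtype

        if node['type'] == 'assignment':
            value, value_type = self.visit(node['value'])
            self.variables[node['variable']] = {
                'value': value,
                'type': value_type
            }
            return None, None


def integer(value):
    return {'type': 'integer', 'value': value}


def binary(left, symbol, right):
    return {
        'type': 'binary',
        'left': left,
        'right': right,
        'operator': {'type': 'literal', 'value': symbol},
    }


v = MiniVisitor()

# x = 5
assignment_result = v.visit(
    {'type': 'assignment', 'variable': 'x', 'value': integer(5)}
)

# 2 * x + 4, evaluated by the same visitor instance
expression_result = v.visit(
    binary(
        binary(integer(2), '*', {'type': 'variable', 'value': 'x'}),
        '+',
        integer(4)
    )
)
```

The second call can resolve x only because it runs on the same instance that processed the assignment.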

Level 14 - Parsing expressions and assignments¶

Speak words we can all understand! - The Lord of the Rings: The Fellowship of the Ring (2001)

We are missing a final step. The CLI uses parse_expression as its default entry point, which means that it doesn't understand variable assignments for the time being. We need then to introduce a new entry point parse_line that we will use to process general language statements. The test for this goes in tests/test_calc_parser.py

def test_parse_line_supports_expression():
    p = cpar.CalcParser()
    p.lexer.load("2 * x + 4")

    node = p.parse_line()

    assert node.asdict() == {
        'type': 'binary',
        'left': {
            'type': 'binary',
            'left': {
                'type': 'integer',
                'value': 2
            },
            'right': {
                'type': 'variable',
                'value': 'x'
            },
            'operator': {
                'type': 'literal',
                'value': '*'
            }
        },
        'right': {
            'type': 'integer',
            'value': 4
        },
        'operator': {
            'type': 'literal',
            'value': '+'
        }
    }

and checks that parse_line can parse expressions (which can be solved by just wrapping parse_expression with it). The second test checks that parse_line can parse variable assignments and goes in the same file

def test_parse_line_supports_assigment():
    p = cpar.CalcParser()
    p.lexer.load("x = 5")

    node = p.parse_line()

    assert node.asdict() == {
        'type': 'assignment',
        'variable': 'x',
        'value': {
            'type': 'integer',
            'value': 5
        }
    }

At this point we can change the entry point in the CLI, using parse_line instead of parse_expression. The new CLI is then

from smallcalc import calc_parser as cpar
from smallcalc import calc_visitor as cvis


def main():
    p = cpar.CalcParser()
    v = cvis.CalcVisitor()

    while True:
        try:
            text = input('smallcalc :> ')

            # Skip empty lines instead of sending them to the parser
            if not text:
                continue

            p.lexer.load(text)

            node = p.parse_line()
            res = v.visit(node.asdict())

            print(res)

        except EOFError:
            print("Bye!")
            break


if __name__ == '__main__':
    main()

Fire up the CLI and enjoy a calculator with variables! Everything works, but you now know it is not magic, but the outcome of a good amount of code. And you wrote it, so you may proudly say that you created a simple but working programming language.

Solution¶

To pass the first test, as suggested, I added the method parse_line as a wrapper around parse_expression

    def parse_line(self):
        return self.parse_expression()

The second test requires some changes to parse_line. As I do not know whether the line contains an expression or an assignment, I decided to stash the state of the lexer and try one of the two. In case of error I just pop the state and try the second option

    def parse_line(self):
        try:
            self.lexer.stash()
            return self.parse_assignment()
        except clex.TokenError:
            self.lexer.pop()
            return self.parse_expression()
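
The implementation of stash and pop is not shown in this section, so the following is only a sketch of the idea under my own assumptions. BacktrackingLexer is a hypothetical class that reads from a list of ready-made tokens and treats the state to save as a plain position; the real lexer may store more than that.

```python
class BacktrackingLexer:
    """ Hypothetical sketch of a lexer that can go back to a saved state """

    def __init__(self, tokens):
        self._tokens = list(tokens)
        self._position = 0
        self._stashed = []

    def get_token(self):
        t = self._tokens[self._position]
        self._position += 1
        return t

    def stash(self):
        # Remember where we are before a tentative parse
        self._stashed.append(self._position)

    def pop(self):
        # Go back to the most recently remembered position
        self._position = self._stashed.pop()


lexer = BacktrackingLexer(['x', '*', '2'])

lexer.stash()
first = lexer.get_token()      # 'x', which might start an assignment
second = lexer.get_token()     # '*' instead of '=', so the attempt fails
lexer.pop()

restarted = lexer.get_token()  # 'x' again, ready for parse_expression
```

Without the call to pop, the fallback to parse_expression would start from the 2 and silently lose the first part of the line.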

At the same time parse_assignment has to be changed. The current code parses a variable and then discards a literal, which is too generic, as an expression like x * 2 will not raise an error. The new code for that method is then

    def parse_assignment(self):
        variable = self._parse_variable()
        self.lexer.discard(token.Token(clex.LITERAL, '='))
        value = self.parse_expression()

        return AssignmentNode(variable, value)

where I explicitly discard a literal = sign.

Final words¶

Managing variables may look like a very easy task, but as soon as we start implementing functions and local scopes we will have to move to something richer than a simple global dictionary. Memory management is another big topic that I didn't touch here; perhaps in the future I will discuss garbage collection and related problems.

The code I developed in this post is available on the GitHub repository tagged with part3 (link).

In the next issue we will tackle together the task of adding the power operator, support for floating point numbers, and a big refactoring with context managers that will greatly simplify the code.

Feedback¶

Feel free to reach me on Twitter if you have questions. The GitHub issues page is the best place to submit corrections.

A game of tokens: solution - Part 2

2017-10-17T13:00:00+01:00

This post originally contained my solution to the challenge posted here. I moved those solutions inside the post itself, under the "Solution" subsections.

A game of tokens: write an interpreter in Python with TDD - Part 2

2017-10-01T15:00:00+01:00

Introduction¶

Welcome to the second part of the series of posts about writing an interpreter with Python and TDD. In the first post we developed together a simple calculator that can handle integers, addition and subtraction. In this instalment I'll give you new tests that will guide you through the implementation of multiplication, division, parentheses, and unary operators. I will obviously reference the structure I used in my solution, but your mileage may vary, so feel free to ignore the comments or the suggested solutions in case your code is different.

Level 8 - Multiplication and division¶

"They're coming outta the walls. They're coming outta the goddamn walls." - Aliens (1986)

As you remember from the previous post our interpreter is made of three different components, the lexer, the parser, and the visitor. So, to implement the missing basic operations, multiplication and division, we need to start with the lexer and ensure that it understands the traditional symbols * and /

Lexer¶

Put the following tests in the tests/test_calc_lexer.py file

def test_get_tokens_understands_multiplication():
    l = clex.CalcLexer()

    l.load('3 * 5')

    assert l.get_tokens() == [
        token.Token(clex.INTEGER, '3'),
        token.Token(clex.LITERAL, '*'),
        token.Token(clex.INTEGER, '5'),
        token.Token(clex.EOL),
        token.Token(clex.EOF)
    ]


def test_get_tokens_understands_division():
    l = clex.CalcLexer()

    l.load('3 / 5')

    assert l.get_tokens() == [
        token.Token(clex.INTEGER, '3'),
        token.Token(clex.LITERAL, '/'),
        token.Token(clex.INTEGER, '5'),
        token.Token(clex.EOL),
        token.Token(clex.EOF)
    ]

Do the tests fail? Why? Please remember that when tests pass without requiring any code change you have to ask yourself "Why do they pass?", and be sure that you understood the answer before going further. Otherwise you might be adding tests that are wrong, or tests for things that have already been tested, and in either case you should act on them.

Parser¶

Now that the lexer understands the symbols we can start considering the parser. The parser has to output a sensible structure that represents the new operations, which is not different from what it outputs for the sum and the difference. Add the following tests to tests/test_calc_parser.py

def test_parse_term():
    p = cpar.CalcParser()
    p.lexer.load("2 * 3")

    node = p.parse_term()

    assert node.asdict() == {
        'type': 'binary',
        'left': {
            'type': 'integer',
            'value': 2
        },
        'right': {
            'type': 'integer',
            'value': 3
        },
        'operator': {
            'type': 'literal',
            'value': '*'
        }
    }


def test_parse_term_with_multiple_operations():
    p = cpar.CalcParser()
    p.lexer.load("2 * 3 / 4")

    node = p.parse_term()

    assert node.asdict() == {
        'type': 'binary',
        'left': {
            'type': 'binary',
            'left': {
                'type': 'integer',
                'value': 2
            },
            'right': {
                'type': 'integer',
                'value': 3
            },
            'operator': {
                'type': 'literal',
                'value': '*'
            }
        },
        'right': {
            'type': 'integer',
            'value': 4
        },
        'operator': {
            'type': 'literal',
            'value': '/'
        }
    }

This time you should have some failures, so go and edit the CalcParser class in order to pass the tests. As the two new binary operations are at this level the same as sum and difference you could change the method parse_expression (try it!). This will however make things harder later when we prioritise operations (multiplications have to be performed before sums), so my advice is to introduce a method parse_term in the parser, which is the method used in the tests.

Visitor¶

Now it's the visitor's turn, where the syntax tree gets analysed and actually executed. Add the following tests to the tests/test_calc_visitor.py file and then make them pass changing the CalcVisitor class accordingly.

def test_visitor_term_multiplication():
    ast = {
        'type': 'binary',
        'left': {
            'type': 'integer',
            'value': 5
        },
        'right': {
            'type': 'integer',
            'value': 4
        },
        'operator': {
            'type': 'literal',
            'value': '*'
        }
    }

    v = cvis.CalcVisitor()
    assert v.visit(ast) == (20, 'integer')


def test_visitor_term_division():
    ast = {
        'type': 'binary',
        'left': {
                'type': 'integer',
                'value': 11
        },
        'right': {
            'type': 'integer',
            'value': 4
        },
        'operator': {
            'type': 'literal',
            'value': '/'
        }
    }

    v = cvis.CalcVisitor()
    assert v.visit(ast) == (2, 'integer')

Solution¶

The tests we added for the lexer already pass. This is not surprising, as the lexer is designed to return everything it doesn't know as a LITERAL (smallcalc/calc_lexer.py:119). As we already instructed the lexer to skip spaces the new operators are happily digested. As I discussed in the previous post, I decided for this project not to assign operators a specific token, so from this point of view our lexer is pretty open and could already understand instructions like 3 $ 5 or 7 : 9, even though they do not have any meaning in our new language (yet, maybe).
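
The following sketch shows why such a fallback makes the lexer so permissive. The function tokenize is hypothetical and much simpler than CalcLexer (it returns plain tuples and leaves out the EOL and EOF tokens), but it follows the same rule: skip spaces, recognise integers, and return anything else as a LITERAL.

```python
import re

INTEGER = 'integer'
LITERAL = 'literal'

integer_regexp = re.compile(r'\d+')


def tokenize(text):
    # Hypothetical, simplified lexer loop: EOL and EOF tokens are left out
    tokens = []
    position = 0

    while position < len(text):
        if text[position].isspace():
            position += 1
            continue

        match = integer_regexp.match(text, position)
        if match:
            tokens.append((INTEGER, match.group()))
            position = match.end()
            continue

        # Fallback: anything not recognised becomes a one-character LITERAL
        tokens.append((LITERAL, text[position]))
        position += 1

    return tokens


multiplication = tokenize('3 * 5')
nonsense = tokenize('3 $ 5')
```

Both inputs produce the same sequence of token types, integer, literal, integer, which is why the decision about which symbols are meaningful is left entirely to the parser.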

The parser is not so merciful, and the two new tests do not pass. We are explicitly calling a method parse_term that is not defined, so a success would have been very worrying. In these two tests parse_term is called explicitly and there is no relationship with the other methods named parse_*, so we can implement it as a stand-alone processing.

We know that a term is an operation between two integers, so we can follow what we did with parse_expression. The first thing we do is to parse the first integer, then we peek the next token and we decide what to do. If the token is a LITERAL we suppose it is the operation symbol, otherwise we probably hit the end of the file and we will just return the previously read integer. The operation may be followed by another multiplication or division, so we keep looping: each time we parse the operator and the following integer, and we wrap what we have parsed so far in a BinaryNode that becomes the new left operand.

[Note: I noticed that the parse_addsymbol could be now named parse_literal but this wasn't done when I prepared the source code. Regardless of the name, however, what this method does is to just pack a literal in a LiteralNode and return it.]

The whole parser is now the following

class CalcParser:

    def __init__(self):
        self.lexer = clex.CalcLexer()

    def parse_addsymbol(self):
        t = self.lexer.get_token()
        return LiteralNode(t.value)

    def parse_integer(self):
        t = self.lexer.get_token()
        return IntegerNode(t.value)

    def parse_term(self):
        left = self.parse_integer()

        next_token = self.lexer.peek_token()

        while next_token.type == clex.LITERAL:
            operator = self.parse_addsymbol()
            right = self.parse_integer()

            left = BinaryNode(left, operator, right)

            next_token = self.lexer.peek_token()

        return left

    def parse_expression(self):
        left = self.parse_integer()

        next_token = self.lexer.peek_token()

        while next_token.type == clex.LITERAL:
            operator = self.parse_addsymbol()
            right = self.parse_integer()

            left = BinaryNode(left, operator, right)

            next_token = self.lexer.peek_token()

        return left

The visitor was instructed only to deal with sums and subtractions, and it treats everything that is not the former as the latter. This is why the new tests give as results 1 and 7. We just need to extend the if statement to include the new operations

class CalcVisitor:

    def visit(self, node):
        if node['type'] == 'integer':
            return node['value'], node['type']

        if node['type'] == 'binary':
            lvalue, ltype = self.visit(node['left'])
            rvalue, rtype = self.visit(node['right'])

            operator = node['operator']['value']

            if operator == '+':
                return lvalue + rvalue, rtype
            elif operator == '-':
                return lvalue - rvalue, rtype
            elif operator == '*':
                return lvalue * rvalue, rtype
            elif operator == '/':
                return lvalue // rvalue, rtype
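
The division branch uses // and not /, which is what keeps the result an integer. The following lines are just plain Python arithmetic, shown to make explicit both this choice and the wrong results that the old two-branch visitor produced for the new tests.

```python
# Floor division returns an integer, as test_visitor_term_division expects
floor_result = 11 // 4       # 2
true_result = 11 / 4         # 2.75, a float

# Floor division rounds towards negative infinity
negative_floor = -11 // 4    # -3

# Before the change every operator other than '+' was treated as '-'
old_multiplication = 5 - 4   # what the old visitor returned for 5 * 4
old_division = 11 - 4        # what the old visitor returned for 11 / 4
```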

Now we have a pretty simple but fully working calculator! Enjoy the cli.py, as YOU did it this time! I remember I was pretty excited the first time I ran a command line calculator that I had written myself. But hold tight, because you are going to learn and implement much more!

Level 9 - Mixing operators¶

"Don't cross the streams." - Ghostbusters (1984)

Ok, it's time to do some serious math. What happens if you mix sums and multiplications? Let's try it and see how our interpreter reacts. We already know that the lexer happily digests all the four symbols so we can head straight to the parser and add the following test to tests/test_calc_parser.py

def test_parse_expression_with_term():
    p = cpar.CalcParser()
    p.lexer.load("2 + 3 * 4")

    node = p.parse_expression()

    assert node.asdict() == {
        'type': 'binary',
        'left': {
            'type': 'integer',
            'value': 2
        },
        'right': {
            'type': 'binary',
            'left': {
                'type': 'integer',
                'value': 3
            },
            'right': {
                'type': 'integer',
                'value': 4
            },
            'operator': {
                'type': 'literal',
                'value': '*'
            }
        },
        'operator': {
            'type': 'literal',
            'value': '+'
        }
    }

Chances are that this will fail miserably. Probably you have to rework parse_expression a bit, as it is ignoring the new entry, parse_term. Please note that 2 * 3 + 4 must give 10 according to the standard math rules, and not 14. This happens because multiplication is performed before sum, and the order depends solely on the structure created by the parser, and not on the visitor (which is at this point a pretty dumb component).

Once the parser outputs the correct structure the visitor shouldn't have issues, as it is already behaving in a recursive way. If you want to check feel free to add relevant tests, however.
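
To convince yourself that the structure alone decides the result, you can evaluate two trees built from the same tokens. This is a minimal sketch: evaluate is a hypothetical reduced visitor that handles only + and * and returns just the value, while integer and binary are helpers I introduced to build the dictionaries.

```python
def integer(value):
    return {'type': 'integer', 'value': value}


def binary(left, symbol, right):
    return {
        'type': 'binary',
        'left': left,
        'right': right,
        'operator': {'type': 'literal', 'value': symbol},
    }


def evaluate(node):
    # Minimal recursive evaluation, in the spirit of CalcVisitor.visit
    if node['type'] == 'integer':
        return node['value']

    left = evaluate(node['left'])
    right = evaluate(node['right'])

    if node['operator']['value'] == '+':
        return left + right
    return left * right


# 2 + (3 * 4): the multiplication is nested, so it is computed first
correct_tree = binary(integer(2), '+', binary(integer(3), '*', integer(4)))

# (2 + 3) * 4: same tokens, different structure
wrong_tree = binary(binary(integer(2), '+', integer(3)), '*', integer(4))

correct_result = evaluate(correct_tree)   # 14
wrong_result = evaluate(wrong_tree)       # 20
```

The evaluation code is identical in both cases; the deeper a node sits in the tree, the earlier it is computed.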

Solution¶

Ouch! It looks like putting multiplications and sums in the same line is not really working. As you may recall we didn't link parse_term with the other methods, and we use a generic function to treat literals. This works in principle, but doesn't consider operator precedence.

When we try to evaluate 2 + 3 * 4 the output of the parser is

{
  "type": "binary",
  "left": {
    "type": "binary",
    "left": {
      "type": "integer",
      "value": 2
    },
    "right": {
      "type": "integer",
      "value": 3
    },
    "operator": {
      "type": "literal",
      "value": "+"
    }
  },
  "right": {
    "type": "integer",
    "value": 4
  },
  "operator": {
    "type": "literal",
    "value": "*"
  }
}

As you can clearly see the parser recognised the multiplication operator, but it used the whole sum 2 + 3 as its left operand. This gives the sum a greater precedence than that of the multiplication, which is against the mathematical rules we want to follow here. 2 + 3 * 4 shall be considered 2 + (3 * 4) and not (2 + 3) * 4.

To fix this we have to rework parse_term. First of all it shall accept only the * and / operators, then it shall return the left part if it finds a different literal. Even parse_expression shall change a bit: the first thing to do is to call parse_term instead of parse_integer and then to return the left part.

The new code is then

    def parse_term(self):
        left = self.parse_integer()

        next_token = self.lexer.peek_token()

        while next_token.type == clex.LITERAL\
                and next_token.value in ['*', '/']:
            operator = self.parse_addsymbol()
            right = self.parse_integer()

            left = BinaryNode(left, operator, right)

            next_token = self.lexer.peek_token()

        return left

    def parse_expression(self):
        left = self.parse_term()

        next_token = self.lexer.peek_token()

        while next_token.type == clex.LITERAL:
            operator = self.parse_addsymbol()
            right = self.parse_term()

            left = BinaryNode(left, operator, right)

            next_token = self.lexer.peek_token()

        return left

Let's see what happens parsing 2 * 3 + 4. The method parse_expression immediately runs parse_term. The latter recognises 2 and *, so it parses the following 3 and builds the binary node for the multiplication. This means that the multiplication is the first operation we return, the one with higher precedence. The loop then peeks + but doesn't know what to do with it, as we specifically consider only * and /, so parse_term just returns the node built so far. Back in parse_expression, the variable left contains the binary node that represents 2 * 3. The function then recognises +, calls parse_term again to get the 4, and finishes by building the binary node for the sum.

Take your time to understand the mechanism, perhaps trying with different operations like 2 + 4 * 6 - 8, which should return 18.
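If you want to experiment with the mechanism outside the project, the following is a minimal sketch of the same two-level structure. It is not the smallcalc code: the helper tokenize, the use of plain tuples instead of node classes, and the function evaluate are assumptions made only to keep the example self-contained.

```python
import re


def tokenize(text):
    # Integers become Python ints, any other non-blank character stays a string.
    return [int(t) if t.isdigit() else t for t in re.findall(r"\d+|\S", text)]


def parse_term(tokens):
    # term: integer (('*' | '/') integer)*
    left = tokens.pop(0)
    while tokens and tokens[0] in ("*", "/"):
        operator = tokens.pop(0)
        right = tokens.pop(0)
        left = (left, operator, right)
    return left


def parse_expression(tokens):
    # expression: term (('+' | '-') term)*
    left = parse_term(tokens)
    while tokens and tokens[0] in ("+", "-"):
        operator = tokens.pop(0)
        right = parse_term(tokens)
        left = (left, operator, right)
    return left


def evaluate(node):
    # Walk the nested tuples and compute the integer result.
    if isinstance(node, int):
        return node
    left, operator, right = node
    lvalue, rvalue = evaluate(left), evaluate(right)
    if operator == "+":
        return lvalue + rvalue
    if operator == "-":
        return lvalue - rvalue
    if operator == "*":
        return lvalue * rvalue
    return lvalue // rvalue


tree = parse_expression(tokenize("2 + 4 * 6 - 8"))
print(tree)            # ((2, '+', (4, '*', 6)), '-', 8)
print(evaluate(tree))  # 18
```

The multiplication ends up in the innermost tuple because parse_term builds it before parse_expression ever sees the + sign.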

Level 10 - Parentheses¶

"When nine hundred years old you reach, look as good you will not." - Return of the Jedi (1983)

Parentheses are curved brackets used in mathematics to change the order of operations. As this part is pretty important I will spend some time on it, because the order of operations will also be a concern when it comes to language operators, and not only when dealing with mathematical operations. As explained in the previous section almost everything at this point happens in the parser, as the resulting structure that we will give to the visitor is the one that rules the precedence of operations.

Let's start to check that the lexer understands the parentheses symbols ( and ).

def test_get_tokens_understands_parentheses():
    l = clex.CalcLexer()

    l.load('3 * ( 5 + 7 )')

    assert l.get_tokens() == [
        token.Token(clex.INTEGER, '3'),
        token.Token(clex.LITERAL, '*'),
        token.Token(clex.LITERAL, '('),
        token.Token(clex.INTEGER, '5'),
        token.Token(clex.LITERAL, '+'),
        token.Token(clex.INTEGER, '7'),
        token.Token(clex.LITERAL, ')'),
        token.Token(clex.EOL),
        token.Token(clex.EOF)
    ]

As our lexer is pretty open-minded it shouldn't raise any objections and should happily pass the test (why?).

Its neighbour the parser, as always, is not that forgiving, and I bet it will make a fuss. Let's try to feed it a simple expression with parentheses

def test_parse_expression_with_parentheses():
    p = cpar.CalcParser()
    p.lexer.load("(2 + 3)")

    node = p.parse_expression()

    assert node.asdict() == {
        'type': 'binary',
        'left': {
            'type': 'integer',
            'value': 2
        },
        'right': {
            'type': 'integer',
            'value': 3
        },
        'operator': {
            'type': 'literal',
            'value': '+'
        }
    }

To make this test pass my suggestion is to introduce a method parse_factor, where the term factor encompasses both integers and the expressions between parentheses. In the latter case, obviously, you will need to call parse_expression, which somehow breaks the hierarchical structure of methods in the parser.

Solution¶

Let's have some Lisp time here and introduce parentheses. As happened for the new mathematical operators, parentheses are already accepted by the lexer as simple literals, so the first test passes without any change in the code. The parser complains, however, as it always expects an integer (smallcalc/calc_parser.py:76).
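To make the "simple literals" point concrete, here is a minimal sketch of a lexer that treats every non-blank character that is not part of an integer as a generic literal. The function tokenize and the tuple representation of tokens are assumptions for this example, not the project's CalcLexer.

```python
import re

INTEGER, LITERAL = "INTEGER", "LITERAL"


def tokenize(text):
    tokens = []
    # A chunk is either a run of digits or a single non-blank character.
    for chunk in re.findall(r"\d+|\S", text):
        kind = INTEGER if chunk.isdigit() else LITERAL
        tokens.append((kind, chunk))
    return tokens


for tok in tokenize("3 * ( 5 + 7 )"):
    print(tok)
```

With a catch-all rule like this one, parentheses need no dedicated code: they fall into the same bucket as the operators.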

As I suggested, my idea is to introduce a method that parses a so-called factor, which can either be an integer or an expression between parentheses.

class CalcParser:

    def __init__(self):
        self.lexer = clex.CalcLexer()

    def parse_addsymbol(self):
        t = self.lexer.get_token()
        return LiteralNode(t.value)

    def parse_integer(self):
        t = self.lexer.get_token()
        return IntegerNode(t.value)

    def parse_factor(self):
        next_token = self.lexer.peek_token()

        if next_token.type == clex.LITERAL and next_token.value == '(':
            self.lexer.get_token()
            expression = self.parse_expression()
            self.lexer.get_token()
            return expression

        return self.parse_integer()

The method parse_term now has to call parse_factor for both operands, otherwise a term like 3 * (5 + 7) cannot be parsed

    def parse_term(self):
        left = self.parse_factor()

        next_token = self.lexer.peek_token()

        while next_token.type == clex.LITERAL\
                and next_token.value in ['*', '/']:
            operator = self.parse_addsymbol()
            right = self.parse_factor()

            left = BinaryNode(left, operator, right)

            next_token = self.lexer.peek_token()

        return left

And last we need to slightly change parse_expression, introducing a check on the literal token value. This happens because I decided to identify everything with a literal, so the method has to rule out every literal it is not interested in managing. If you introduce specific tokens for operations, parentheses, etc., this change is not required (but you won't use clex.LITERAL at that point).

    def parse_expression(self):
        left = self.parse_term()

        next_token = self.lexer.peek_token()

        while next_token.type == clex.LITERAL\
                and next_token.value in ['+', '-']:
            operator = self.parse_addsymbol()
            right = self.parse_term()

            left = BinaryNode(left, operator, right)

            next_token = self.lexer.peek_token()

        return left

Level 11 - Priorities¶

"You got issues, Quill." - Guardians of the Galaxy (2014)

As parentheses have been introduced to change the default priority rules between operators we need to be sure that this happens. We can test it easily with this code

def test_parse_parentheses_change_priority():
    p = cpar.CalcParser()
    p.lexer.load("(2 + 3) * 4")

    node = p.parse_expression()

    assert node.asdict() == {
        'type': 'binary',
        'left': {
            'type': 'binary',
            'left': {
                'type': 'integer',
                'value': 2
            },
            'right': {
                'type': 'integer',
                'value': 3
            },
            'operator': {
                'type': 'literal',
                'value': '+'
            }
        },
        'right': {
            'type': 'integer',
            'value': 4
        },
        'operator': {
            'type': 'literal',
            'value': '*'
        }
    }

Now when your parser passes this test you have a full-fledged calculator that supports parentheses. Make sure to test the new features in the CLI. Do multiple parentheses work? Why?

Solution¶

This is another feature that comes for free with the previous changes: the first thing that parse_expression does is to run parse_term, and the first thing the latter does is to run parse_factor, which in turn manages expressions between parentheses by calling parse_expression again. If the expression is not enclosed between parentheses, the method parse_factor doesn't call parse_expression and just returns the integer.
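The chain of calls can be tried in isolation with a small sketch. This is not the smallcalc parser: tokenize, the tuple-based tree, and the free functions are assumptions made to keep the example short, and in this sketch both operands of a term are parsed as factors.

```python
import re


def tokenize(text):
    # Integers become Python ints, any other non-blank character stays a string.
    return [int(t) if t.isdigit() else t for t in re.findall(r"\d+|\S", text)]


def parse_factor(tokens):
    if tokens[0] == "(":
        tokens.pop(0)  # discard '('
        expression = parse_expression(tokens)
        tokens.pop(0)  # discard ')'
        return expression
    return tokens.pop(0)


def parse_term(tokens):
    left = parse_factor(tokens)
    while tokens and tokens[0] in ("*", "/"):
        operator = tokens.pop(0)
        left = (left, operator, parse_factor(tokens))
    return left


def parse_expression(tokens):
    left = parse_term(tokens)
    while tokens and tokens[0] in ("+", "-"):
        operator = tokens.pop(0)
        left = (left, operator, parse_term(tokens))
    return left


print(parse_expression(tokenize("(2 + 3) * 4")))
# ((2, '+', 3), '*', 4)
print(parse_expression(tokenize("((2 + 3)) * (4 - 1)")))
# ((2, '+', 3), '*', (4, '-', 1))
```

Nested parentheses work because every opening parenthesis starts a fresh call of parse_expression, which returns as soon as it meets a token it does not manage, such as the closing parenthesis.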

Level 12 - Unary operators¶

"There can be only one!" - Highlander (1986)

Now it's time to introduce unary operators, which are very important in programming languages. Just think of not x and you will immediately understand why you need them. Unary operators do not fit in the current structure of our interpreter as the parser is always expecting either an integer or an open parenthesis as the first token.

Let's first write a test for the simplest unary operator, which is a minus (as in -2). Remember that we are testing the parser here, as the lexer is already able to recognise the minus sign.

def test_parse_factor_supports_unary_operator():
    p = cpar.CalcParser()
    p.lexer.load("-5")

    node = p.parse_factor()

    assert node.asdict() == {
        'type': 'unary',
        'operator': {
            'type': 'literal',
            'value': '-'
        },
        'content': {
            'type': 'integer',
            'value': 5
        }
    }

When your parser passes this test we have to make sure that the unary minus can be applied also to expressions between parentheses

def test_parse_factor_supports_negative_expressions():
    p = cpar.CalcParser()
    p.lexer.load("-(2 + 3)")

    node = p.parse_factor()

    assert node.asdict() == {
        'type': 'unary',
        'operator': {
            'type': 'literal',
            'value': '-'
        },
        'content': {
            'type': 'binary',
            'left': {
                'type': 'integer',
                'value': 2
            },
            'right': {
                'type': 'integer',
                'value': 3
            },
            'operator': {
                'type': 'literal',
                'value': '+'
            }
        }
    }

Once the parser is able to pass these two tests we are confident that the unary minus can be used in front of all the basic elements of our expressions. At this point it is time to execute the unary expressions produced by the parsing layer, so include this test for the visitor

def test_visitor_unary_minus():
    ast = {
        'type': 'unary',
        'operator': {
            'type': 'literal',
            'value': '-'
        },
        'content': {
            'type': 'binary',
            'left': {
                'type': 'integer',
                'value': 2
            },
            'right': {
                'type': 'integer',
                'value': 3
            },
            'operator': {
                'type': 'literal',
                'value': '+'
            }
        }
    }

    v = cvis.CalcVisitor()
    assert v.visit(ast) == (-5, 'integer')

Change the visitor to pass this test and you can go straight to the CLI and start using negative numbers or negative expressions. Can you execute something like --2 (minus minus 2)? What is the result? Why?

Now let's go back to the parser and ensure that the unary plus can be used as well. This is the test

def test_parse_factor_supports_unary_plus():
    p = cpar.CalcParser()
    p.lexer.load("+(2 + 3)")

    node = p.parse_factor()

    assert node.asdict() == {
        'type': 'unary',
        'operator': {
            'type': 'literal',
            'value': '+'
        },
        'content': {
            'type': 'binary',
            'left': {
                'type': 'integer',
                'value': 2
            },
            'right': {
                'type': 'integer',
                'value': 3
            },
            'operator': {
                'type': 'literal',
                'value': '+'
            }
        }
    }

and the code should be trivial, as you already manage the unary minus. The corresponding test for the visitor is

def test_visitor_unary_plus():
    ast = {
        'type': 'unary',
        'operator': {
            'type': 'literal',
            'value': '+'
        },
        'content': {
            'type': 'binary',
            'left': {
                'type': 'integer',
                'value': 2
            },
            'right': {
                'type': 'integer',
                'value': 3
            },
            'operator': {
                'type': 'literal',
                'value': '+'
            }
        }
    }

    v = cvis.CalcVisitor()
    assert v.visit(ast) == (5, 'integer')

Once your code passes all the tests head to the CLI and try to run something like -+--++-3. Does it work?

Solution¶

The minus unary operator uses a literal that we already manage in the lexer, so there is nothing to do there. The first test I gave you checks if the parser can process a factor in the form -5.

The current implementation of parse_factor processes either an expression enclosed between parentheses or an integer, and actually the test doesn't pass, complaining about the minus sign not being a valid integer with base 10. The solution is pretty straightforward, as it is enough to add another if that manages the minus sign. When we encounter such a sign, however, we have to return a different type of node, as the test states, so we also have to introduce the corresponding class.

class UnaryNode(Node):

    node_type = 'unary'

    def __init__(self, operator, content):
        self.operator = operator
        self.content = content

    def asdict(self):
        result = {
            'type': self.node_type,
            'operator': self.operator.asdict(),
            'content': self.content.asdict()
        }

        return result


class CalcParser:

    def __init__(self):
        self.lexer = clex.CalcLexer()

    def parse_addsymbol(self):
        t = self.lexer.get_token()
        return LiteralNode(t.value)

    def parse_integer(self):
        t = self.lexer.get_token()
        return IntegerNode(t.value)

    def parse_factor(self):
        next_token = self.lexer.peek_token()

        if next_token.type == clex.LITERAL and next_token.value == '-':
            operator = self.parse_addsymbol()
            factor = self.parse_factor()
            return UnaryNode(operator, factor)

        if next_token.type == clex.LITERAL and next_token.value == '(':
            self.lexer.get_token()
            expression = self.parse_expression()
            self.lexer.get_token()
            return expression

        return self.parse_integer()

The second test passes automatically because parse_factor intercepts the - literal before the ( one.

The visitor has to be updated with the new type of unary node. The new visitor is then

class CalcVisitor:

    def visit(self, node):
        if node['type'] == 'integer':
            return node['value'], node['type']

        if node['type'] == 'unary':
            operator = node['operator']['value']
            cvalue, ctype = self.visit(node['content'])

            if operator == '-':
                return - cvalue, ctype

        if node['type'] == 'binary':
            lvalue, ltype = self.visit(node['left'])
            rvalue, rtype = self.visit(node['right'])

            operator = node['operator']['value']

            if operator == '+':
                return lvalue + rvalue, rtype
            elif operator == '-':
                return lvalue - rvalue, rtype
            elif operator == '*':
                return lvalue * rvalue, rtype
            elif operator == '/':
                return lvalue // rvalue, rtype

Now the unary plus is easy to sort out, as we just need to take it into account in parse_factor along with the unary minus.

    def parse_factor(self):
        next_token = self.lexer.peek_token()

        if next_token.type == clex.LITERAL and next_token.value in ['-', '+']:
            operator = self.parse_addsymbol()
            factor = self.parse_factor()
            return UnaryNode(operator, factor)

        if next_token.type == clex.LITERAL and next_token.value == '(':
            self.lexer.get_token()
            expression = self.parse_expression()
            self.lexer.get_token()
            return expression

        return self.parse_integer()

And the visitor is missing a single return after the if statement that deals with the unary minus.

        if node['type'] == 'unary':
            operator = node['operator']['value']
            cvalue, ctype = self.visit(node['content'])

            if operator == '-':
                return - cvalue, ctype

            return cvalue, ctype
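Chains of unary operators such as --2 or -+--++-3 become nested unary nodes, one inside the other. The following sketch shows how a visitor with the same logic evaluates them. The helpers integer, unary and wrap are hypothetical names introduced here to build the dictionaries by hand, and visit is a cut-down function rather than the CalcVisitor class.

```python
def integer(value):
    return {"type": "integer", "value": value}


def unary(operator, content):
    return {
        "type": "unary",
        "operator": {"type": "literal", "value": operator},
        "content": content,
    }


def wrap(operators, value):
    # Build nested unary nodes; the first operator becomes the outermost node.
    node = integer(value)
    for operator in reversed(operators):
        node = unary(operator, node)
    return node


def visit(node):
    if node["type"] == "integer":
        return node["value"], node["type"]

    if node["type"] == "unary":
        cvalue, ctype = visit(node["content"])
        if node["operator"]["value"] == "-":
            return -cvalue, ctype
        # Unary plus leaves the value untouched.
        return cvalue, ctype

    raise ValueError("Unsupported node type: {}".format(node["type"]))


print(visit(wrap("--", 2)))       # (2, 'integer')
print(visit(wrap("-+--++-", 3)))  # (3, 'integer')
```

Each minus flips the sign once, so an even number of them gives back the original value.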

Final words¶

That's all for this post. If you feel brave or do not like to wait for the next post go and try adding new operators! Next time I will cover variables, assignments and new binary operators like the power operation (2^3).

The code I developed in this post is available on the GitHub repository tagged with part2 (link).

Updates¶

2017-12-24: test_parse_term_with_multiple_operations has been changed after Victor Uriarte spotted an error in the tree construction. See the updates section of the first post in the series for a full explanation of the issue.

Feedback¶

Feel free to reach me on Twitter if you have questions. The GitHub issues page is the best place to submit corrections.

A game of tokens: solution - Part 1

2017-07-12T10:00:00+01:00

This post originally contained my solution to the challenge posted here. I moved those solutions inside the post itself, under the "Solution" subsections.

A game of tokens: write an interpreter in Python with TDD - Part 1

2017-05-09T23:00:00+01:00

Introduction¶

Writing an interpreter or a compiler is usually considered one of the greatest goals that a programmer can achieve, and with good reason. I do not believe the importance of going through this experience is primarily due to its difficulty, though. After all, writing an efficient compiler is difficult, but the same is true for a good web framework, or a feature-rich editor.

Being able to write an interpreter is a significant skill mainly because of its recursive (or self-referring) nature. Think about it: you use a language to write a new language. And this new language, if it becomes sufficiently rich, can eventually be used to create its own compiler.

A language can be used to write the program that executes that same language.

Didn't this last sentence fire you with enthusiasm? It makes me eager to start!

Compilers have been the subject of academic research since the 50s, with the works of Hopper and Glennie, so trying to provide an overview in a few lines is basically impossible. I highly recommend checking the online resources listed at the bottom of the post if you are seriously interested in the matter.

In this series of posts I want to try an experiment. I want to guide you through the creation of a simple interpreter in Python using a pure TDD (Test-Driven Development) approach. The posts will be structured like a game, where every level is represented by a new test that I will add to the suite. If you are not confident with TDD, you will find more on it in the specific section.

Following this series you will learn about Python, compilers, interpreters, parsers, lexers, test-driven development, refactoring, coverage, regular expressions, classes, context managers. Wow, that's a lot!

Are you ready to start?

On the TDD game¶

This series of posts will introduce you to TDD with a sort of game. I'll give you the test, and you are supposed to write something that passes that test, finishing the level. Update: I decided to move the solutions into the same post where the challenge is given. You will find them in specific sections named "Solution" after each level.

My best advice for the TDD game is: remember that the easiest solution for a test that requires the output A is to write a function that returns exactly A.

Beautiful is better than ugly, but ugly and tested is better than beautiful and untested.
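As a tiny illustration of this advice, unrelated to the interpreter, consider a hypothetical function add developed in two steps. The names add_v1, add_v2 and the check functions are invented for this sketch.

```python
# Step 1: the only test so far requires the output 5.
def check_two_plus_three(add):
    assert add(2, 3) == 5


def add_v1(a, b):
    # The easiest solution: return exactly what the test asks for.
    return 5


# Step 2: a second test makes the hardcoded value insufficient.
def check_one_plus_one(add):
    assert add(1, 1) == 2


def add_v2(a, b):
    # Now the general implementation is the simplest code that passes both.
    return a + b


check_two_plus_three(add_v1)
check_two_plus_three(add_v2)
check_one_plus_one(add_v2)
print("all checks passed")
```

The hardcoded version is ugly but tested, and the next test is what forces it to become general.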

About the language¶

At the time of writing the language we are going to implement is a simple calculator with support for integers and floats, binary operators (addition, subtraction, multiplication, division, and power), unary operators (negation), nested expressions (parentheses) and variables.

The name smallcalc is a homage to one of the most innovative and influential languages ever conceived: Smalltalk.

I do not know if the final version will be something richer, it depends on how much fun you will find in the series. So, if you are interested, just ask! You can drop a line of appreciation on Twitter.

At the time of writing, then, the language grammar is

factor : ('+' | '-') factor | '(' expression ')' | variable | number
power : factor [ '^' power ]*
term : power [ ('*' | '/') term ]*
expression : term [ ('+' | '-') expression ]*
assignment : variable '=' expression
line : assignment | expression

The syntax of the grammar is pretty self-explanatory if you have some programming background. If you want to know more about grammars like the one above start from the links in the resources section.
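To give an idea of how a rule of the grammar maps to code, here is a minimal sketch of the power rule alone. It is an assumption about how such a rule could be written, not the code developed in the series: the factor is reduced to a plain integer, tokens are a plain list, and the tree is made of tuples.

```python
def parse_power(tokens):
    # power : factor [ '^' power ]
    # The factor is reduced to a plain integer to keep the sketch short.
    left = tokens.pop(0)
    if tokens and tokens[0] == "^":
        tokens.pop(0)
        # The right-hand side is parsed by the same rule, so it groups first.
        return (left, "^", parse_power(tokens))
    return left


def evaluate(node):
    if isinstance(node, int):
        return node
    base, _, exponent = node
    return evaluate(base) ** evaluate(exponent)


tree = parse_power([2, "^", 3, "^", 2])
print(tree)            # (2, '^', (3, '^', 2))
print(evaluate(tree))  # 512
```

Because the rule refers to itself on the right of the operator, 2^3^2 is read as 2^(3^2), which is the usual mathematical convention.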

TDD and refactoring¶

If you already know what TDD is feel free to skip this section.

TDD means Test-Driven Development, and in short it is a programming methodology that requires you to write a test for a feature before implementing the feature itself. Much has been said on the benefits of TDD elsewhere. I personally think it is one of the most effective ways to work on a programming task, and something that every programmer should know. I wrote a post on TDD with Python that you can find here.

A test, in TDD, is code that uses the code you are going to develop. You will start with a project skeleton and add the tests I will present in the posts one at a time. Once you add the test, you have to write the code that passes the test. Your code doesn't need to be beautiful or smart, it just needs to pass the test. Then you can move to the following test and start the cycle again.

After adding some tests you can start considering refactoring, which means changing the existing code in order to make it more beautiful, simpler or better organised. Every change has to be tested against the existing battery of tests. If the tests do not fail your change is correct, at least in terms of the behaviour that the tests are checking.

Coverage is the check of how much of your code is covered by your tests. We call some code "covered" by a test if executing the test makes that code run. So, for example, if you have a condition (an if block) you should write two tests: one to cover the first branch, and another to cover the second one. If you work with a strict TDD methodology your coverage is always going to be 100%, because you wrote just the code that makes the tests pass.
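A small example, with a hypothetical function sign that is not part of the project, shows why a single if needs two tests to be fully covered.

```python
def sign(value):
    if value < 0:
        return "negative"
    return "non-negative"


def test_sign_negative():
    # Runs the body of the if statement.
    assert sign(-3) == "negative"


def test_sign_non_negative():
    # Runs the line after the if statement.
    assert sign(3) == "non-negative"


test_sign_negative()
test_sign_non_negative()
```

With only the first test, the last line of sign would never run, and a coverage report with term-missing would list it as missing.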

You can find more on TDD on this blog here.

About the project¶

The main components of our interpreter are the following:

Token: a token is the minimal element of the language syntax, like an integer (not a digit, but a group of them), a name (not a letter but a group of them), or a symbol (like the mathematical operations).
Buffer: the input text (the program) has to be managed by a specific component. Parsing the input text has many requirements, among them being able to read upcoming parts of the text and to move back, or to move to specific locations.
Lexer: this is the first component of standard interpreters. Its job is to divide the stream of input characters into meaningful chunks called tokens. It will process a string like "123 + x" and output three tokens: an integer, a symbol and a variable name.
Parser: the second component of standard interpreters. It analyses the stream of tokens produced by the lexer and produces a data structure that represents the whole program.
Visitor: the output of the parser is processed by a component that will either write the equivalent in another language or execute it.
Command Line Interface (CLI): the whole stack can be directly used by a REPL (Read, Evaluate, Print Loop), a command line interface similar to the one Python provides. There each line is lexed, parsed, and visited, and the result is printed immediately.
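To make the job of the lexer more tangible, this is a rough sketch of how the string "123 + x" can be divided into three tokens with a regular expression. The names TOKEN_RE and lex, the group names, and the tuple representation are assumptions for this example; the lexer we are going to write in the series is built step by step from the tests instead.

```python
import re

# One named group per kind of token; blanks match nothing and are skipped.
TOKEN_RE = re.compile(r"(?P<INTEGER>\d+)|(?P<NAME>[A-Za-z_]\w*)|(?P<SYMBOL>\S)")


def lex(text):
    # m.lastgroup is the name of the group that matched.
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text)]


print(lex("123 + x"))
# [('INTEGER', '123'), ('SYMBOL', '+'), ('NAME', 'x')]
```

Note that 123 is a single token, a group of digits, and not three separate ones.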

I will provide two classes: Token and TextBuffer. These will save you from spending too much time creating the basic tools, and allow you to get straight into the game. Those classes obviously come with their own test suite, so you are free to develop them on your own. You should however start from the same tests that I used, otherwise your interface might end up being incompatible with the rest of the project.

Initial setup¶

I prepared this repository, which contains everything you need to start the project.

$ git clone https://github.com/lgiordani/smallcalc.git

Once you cloned the repository, set up a Python virtual environment using your favourite method/tool and install the testing requirements

pip install -r requirements/test.txt

At this point you should be able to run the test suite. For this project we are going to use pytest, so the command line is

pytest -svv

or, if you want to check your code coverage,

pytest -svv  --cov-report term-missing --cov=smallcalc

Tokens¶

The first class that I provide to start working on our interpreter is Token.

class Token:

    def __init__(self, _type, value=None, position=None):
        self.type = _type
        self.value = str(value) if value is not None else None
        self.position = position

    def __str__(self):
        if not self.position:
            return "Token({}, '{}')".format(
                self.type,
                self.value,
            )

        return "Token({}, '{}', line={}, col={})".format(
            self.type,
            self.value,
            self.position[0],
            self.position[1]
        )

    __repr__ = __str__

    def __eq__(self, other):
        return (self.type, self.value) == (other.type, other.value)

    def __len__(self):
        if self.value:
            return len(self.value)

        return 0

    def __bool__(self):
        return True

This represents one syntax unit in which we divide the input text. The token can contain information about its original position, which can be useful in case of syntax errors to print meaningful messages for the user. The class implements the method __eq__ to provide comparison between tokens.

The value of a token is always a string, and shall be converted into a different type by an external object according to the value that the token assumes. For example the string '123' can be interpreted as an integer, but could also be the name of a variable if our language supports such a feature.
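A short session can help to see these properties in practice. The class below is a trimmed copy of Token (without the string representation) included only to make the example self-contained, and the type names 'INTEGER' and 'EOF' are plain strings chosen for the example.

```python
class Token:
    # Trimmed copy of the class above, kept only to make the example runnable.
    def __init__(self, _type, value=None, position=None):
        self.type = _type
        self.value = str(value) if value is not None else None
        self.position = position

    def __eq__(self, other):
        return (self.type, self.value) == (other.type, other.value)

    def __len__(self):
        return len(self.value) if self.value else 0

    def __bool__(self):
        return True


a = Token("INTEGER", 123, position=(0, 0))
b = Token("INTEGER", "123", position=(4, 2))

print(a.value == "123")  # the value is stored as a string
print(a == b)            # the position is ignored by the comparison
print(int(a.value) + 1)  # the conversion is up to the code that uses the token

eof = Token("EOF")
print(len(eof), bool(eof))  # a token without value has length 0 but is still true
```

The method __bool__ is there for the last case: without it Python would fall back on __len__, and a token with no value would be considered false.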

Remember that everything you find in this class has been introduced to make one or more tests pass, so check the test suite to understand how the object can be used.

Buffer¶

The second element that you will find in the initial setup is the class TextBuffer, which provides a very basic manager for an input text file

class EOLError(ValueError):

    """ Signals that the buffer is reading after the end of a line."""


class EOFError(ValueError):

    """ Signals that the buffer is reading after the end of the text."""


class TextBuffer:

    def __init__(self, text=None):
        self.load(text)

    def reset(self):
        self.line = 0
        self.column = 0

    def load(self, text):
        self.text = text
        self.lines = text.split('\n') if text else []
        self.reset()

    @property
    def current_line(self):
        try:
            return self.lines[self.line]
        except IndexError:
            raise EOFError(
                "EOF reading line {}".format(self.line)
            )

    @property
    def current_char(self):
        try:
            return self.current_line[self.column]
        except IndexError:
            raise EOLError(
                "EOL reading column {} at line {}".format(
                    self.column, self.line
                )
            )

    @property
    def next_char(self):
        try:
            return self.current_line[self.column + 1]
        except IndexError:
            raise EOLError(
                "EOL reading column {} at line {}".format(
                    self.column, self.line
                )
            )

    @property
    def tail(self):
        return self.current_line[self.column:]

    @property
    def position(self):
        return (self.line, self.column)

    def newline(self):
        self.line += 1
        self.column = 0

    def skip(self, steps=1):
        self.column += steps

    def goto(self, line, column=0):
        self.line, self.column = line, column

As happened for the Token class, you can read the tests to understand how to use the class. Basically, however, the class can load an input text and extract the current_line, the current_char, and the next_char. You can also skip a given number of characters, goto a given position, extract the current position and read the tail, which is the remaining text from the current position to the end of the line.
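The following sketch walks through a two-line text. To keep it self-contained it uses a condensed stand-in for TextBuffer that implements only the members used here and leaves out the end-of-file handling, so treat it as an illustration of the interface rather than a replacement for the class above.

```python
class EOLError(ValueError):
    """Signals that the buffer is reading after the end of a line."""


class TextBuffer:
    # Condensed stand-in: only the members used below, no EOF handling.
    def __init__(self, text=None):
        self.load(text)

    def load(self, text):
        self.lines = text.split("\n") if text else []
        self.line = 0
        self.column = 0

    @property
    def current_char(self):
        try:
            return self.lines[self.line][self.column]
        except IndexError:
            raise EOLError(
                "EOL reading column {} at line {}".format(self.column, self.line)
            )

    @property
    def tail(self):
        return self.lines[self.line][self.column:]

    @property
    def position(self):
        return (self.line, self.column)

    def skip(self, steps=1):
        self.column += steps

    def newline(self):
        self.line += 1
        self.column = 0


buf = TextBuffer("12 + 3\n4")
print(buf.current_char)  # 1

buf.skip(3)
print(buf.current_char, repr(buf.tail), buf.position)  # + '+ 3' (0, 3)

buf.skip(3)
try:
    buf.current_char
except EOLError as error:
    print(error)  # EOL reading column 6 at line 0

buf.newline()
print(buf.current_char, buf.position)  # 4 (1, 0)
```

Reading past the end of a line raises an exception instead of returning an empty value, which is how a lexer can notice that the line is over.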

This class has not been optimized or designed to manage big files or continuous streams of text. This is perfectly fine for our current project, but be aware that for a real compiler you might want to implement something more powerful.

CLI¶

The third element I provide is a simple REPL (Read–eval–print loop) that at the moment just echoes any text you input and gracefully exits when you press Ctrl+D. There are and there will be no tests for the CLI. Testing endpoints like this is complex and not always worth the effort, as in this case.
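For reference, an echo loop of this kind can be sketched in a few lines. This is not the content of cli.py: the function names, the prompt, and the farewell message are assumptions, and the reader is passed as a parameter only so that the loop can be exercised without a terminal.

```python
def repl(read=input, write=print, prompt="smallcalc :> "):
    """Echo every line until the reader signals the end of the input."""
    while True:
        try:
            text = read(prompt)
        except EOFError:
            # input() raises EOFError when the user presses Ctrl+D.
            write("Bye!")
            break
        write(text)


def scripted_reader(lines):
    """Return a stand-in for input() that replays lines, then raises EOFError."""
    pending = list(lines)

    def read(prompt):
        if not pending:
            raise EOFError
        return pending.pop(0)

    return read


output = []
repl(read=scripted_reader(["1 + 2", "hello"]), write=output.append)
print(output)  # ['1 + 2', 'hello', 'Bye!']
```

Calling repl() without arguments uses input and print, which gives the interactive behaviour described above.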

The command line can be run from the project main directory with

python cli.py

Level 1 - End of file¶

"End? No, the journey doesn't end here." - The Lord of the Rings: The Return of the King (2003)

The first thing a Lexer shall be able to do is to load and process an empty text. This should return an EOF (End Of File) token. EOF is used to signal that the input buffer has ended and that there is no more text to process.

The method get_tokens returns all the tokens of the input stream in a single list.

Add this code to tests/test_calc_lexer.py

from smallcalc import tok as token
from smallcalc import calc_lexer as clex


def test_get_tokens_understands_eof():
    l = clex.CalcLexer()

    l.load('')

    assert l.get_tokens() == [
        token.Token(clex.EOF)
    ]

To avoid misleading errors you should also create the empty file smallcalc/calc_lexer.py, as without that file pytest will raise an ImportError.

This is our first test, and if you run the test suite now you will see that it fails. This is expected, as there is no code to pass the test.

$ pytest -svv  --cov-report term-missing --cov=smallcalc
================================== FAILURES ===================================
_______________________ test_get_tokens_understands_eof _______________________

    def test_get_tokens_understands_eof():
>       l = clex.CalcLexer()
E       AttributeError: module 'smallcalc.calc_lexer' has no attribute 'CalcLexer'

tests/test_calc_lexer.py:6: AttributeError
===================== 1 failed, 29 passed in 0.08 seconds =====================

Implement now a class CalcLexer in the file smallcalc/calc_lexer.py that makes the test pass. Remember that you just need the code to pass this test. So do not implement complex systems now and go for the simplest solution (hint: the test expects that specific output).

The EOF constant can be a simple string with the value 'EOF'.

It is worth executing the test suite with coverage (check the command line above), which will tell you if you over-engineered your code. You should aim for 100% coverage, always.

Solution¶

To pass the test, the class CalcLexer can use the provided text_buffer.TextBuffer class, which exposes a method load, and wrap that method in CalcLexer.load. The test is not providing any input so the easiest solution is just to return the required token. The test requires us to implement the method get_tokens, but I preferred to isolate the code in a method called get_token and to call the latter from get_tokens. The file smallcalc/calc_lexer.py is then

from smallcalc import text_buffer
from smallcalc import tok as token

EOF = 'EOF'


class CalcLexer:
    def __init__(self, text=''):

        self._text_storage = text_buffer.TextBuffer(text)

    def load(self, text):
        self._text_storage.load(text)

    def get_token(self):
        self._current_token = token.Token(EOF)
        return self._current_token

    def get_tokens(self):
        return [self.get_token()]

You can see here in practice what I mentioned in the introduction about TDD. The method get_token returns a hardcoded token.Token(EOF), because that is enough to pass the test. It is not enough to be a good Lexer, but if we write and pass the right tests, this will happen in time. Be smart, be strict: write the minimal code needed to pass the test.

Being really strict, however, this solution is already over-engineered. The code

    def get_tokens(self):
        return [token.Token(EOF)]

would be enough to pass the test. It would also be the first thing we change as soon as we add another test. So, let me amend the previous advice: be strict, with a pinch of salt.

Level 2 - Single digit integers¶

"You're missing just a couple of digits there." - Iron Man (2008)

The requirement for this section is

# The only accepted value for the input is one single digit between 0 and 9
integer: [0-9]

Lexer¶

Since a calculator has to deal with numbers let us implement support for integers (we will add floating point numbers later). The first thing that we need is to recognise single-digit integers. This is the test that you have to add to tests/test_calc_lexer.py

def test_get_token_understands_integers():
    l = clex.CalcLexer()

    l.load('3')

    assert l.get_token() == token.Token(clex.INTEGER, '3')

Note that here we are testing get_token and not get_tokens. This method will come in handy later, so it is worth testing it here. As soon as that works you can test the behaviour of get_tokens

def test_get_tokens_understands_integers():
    l = clex.CalcLexer()

    l.load('3')

    assert l.get_tokens() == [
        token.Token(clex.INTEGER, '3'),
        token.Token(clex.EOL),
        token.Token(clex.EOF)
    ]

Please note that now the lexer shall output both an EOF and an EOL token, as the current line of code ends. The biggest issue you have to face here is that when you recognise a token then you have to skip it in the source text.

After this test you may end up with some code duplication, as get_token and get_tokens perform similar tasks. If you haven't already, please call the former from the latter. It could also be worth doing some refactoring. Remember: you can confidently change your code, because as long as the tests pass your changes are correct! This is the true power of TDD.

If you refactor the code creating helper methods you should make them "private" by prefixing their name with an underscore. This also means that you do not need to test them, in principle (watch this talk by Sandi Metz on this subject).

Parser

Now that we have a working lexer that recognises integers let us work on the parser. This has to use the lexer to process a text and produce a tree of nodes that represent the syntactic structure of the processed code. Don't worry if it seems extremely complex, it is actually pretty simple if you follow the tests.

Edit the tests/test_calc_parser.py file and insert this code

from smallcalc import calc_parser as cpar


def test_parse_integer():
    p = cpar.CalcParser()
    p.lexer.load("5")

    node = p.parse_integer()

    assert node.asdict() == {
        'type': 'integer',
        'value': 5
    }

The node variable is an instance of a specific class that contains integers, IntegerNode (but you are free to name it as you want, as this is not tested). Please note that this class doesn't consider the value as a string any more, but as a proper (Python) integer ('value': 5). Now edit the file smallcalc/calc_parser.py and write some code that passes the test.

Does it work? Well, you just wrote your first parser! Congratulations! From here to something that understands C++ or Python the journey is pretty long, but the initial steps are promising.

Visitor

Let us consider the visitor, now. This is the run-time component of our language, the part that actually runs through the tree of nodes and executes it. This part, thus, is where most of the actual behaviour of the language happens. For instance, the fact that the symbol "+" actually sums integers is because the visitor implements that operation.

This can seem a trivial consideration, but if you think about the division between integers you immediately understand that the visitor has a great responsibility. Does the symbol / divide integers with or without floating point math? Python 3, for instance, opted for a floating point division, and introduced the // operator for the integer version of the operation, but other languages behave differently.
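
As a quick reminder of how Python 3 itself treats the two operators, here is a small sketch (the variable names are just illustrative):

```python
# Python 3: "/" always performs floating point division,
# while "//" performs floor division
true_division = 7 / 2
floor_division = 7 // 2

# Floor division rounds towards negative infinity, not towards zero
negative_floor_division = -7 // 2

print(true_division)            # 3.5
print(floor_division)           # 3
print(negative_floor_division)  # -4
print(type(4 / 2))              # <class 'float'>, even when the result is whole
```

A visitor for a different language is free to pick either behaviour for its own / symbol.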

I'll discuss this in more detail later, when we implement mathematical operations. For the time being, let us create the tests/test_calc_visitor.py file and introduce the following test

from smallcalc import calc_visitor as cvis


def test_visitor_integer():
    ast = {
        'type': 'integer',
        'value': 12
    }

    v = cvis.CalcVisitor()
    assert v.visit(ast) == (12, 'integer')

As you can see, at this stage the visitor has a trivial job, which is to just return the value and the type of the number that it finds in the tree. Note that the visitor provides a method visit which is type agnostic (i.e. it doesn't care about the type of the node). This is correct, as the visitor has to traverse the whole tree recursively and to react to the different nodes without a previous knowledge of what it should expect.

As simple as the visitor can be, now we can make our CLI interface use the parser and the visitor to understand and execute one simple command, which is to parse a single-digit integer and print it with its type. Change the cli.py file to

from smallcalc import calc_parser as cpar
from smallcalc import calc_visitor as cvis


def main():
    p = cpar.CalcParser()
    v = cvis.CalcVisitor()

    while True:
        try:
            text = input('smallcalc :> ')
            p.lexer.load(text)

            node = p.parse_integer()
            res = v.visit(node.asdict())

            print(res)

        except EOFError:
            print("Bye!")
            break

        if not text:
            continue


if __name__ == '__main__':
    main()

Test it to check that everything works. If your code passes the tests I gave you, the result is guaranteed.

$ python cli.py 
smallcalc :> 3
(3, 'integer')

Let me recap what we just created. We wrote a lexer, which is a component that splits the input text into different tokens with a meaning, and we instructed it to react to single-digit integers. Then, we created a parser, which is the component that tries to make sense of several tokens put together, applying syntactical rules. Last, the visitor runs through the output of the parser and actually performs the actions that the grammar describes. All this to just print out an integer? Seems overkill! It is, actually, but there is a lot to come, and this separation of levels will come in handy.

Solution

The two functions get_token and get_tokens have to evolve to deal with the new requirements, and to avoid having too much code in a single function I created some private helpers (where "private" has the Python meaning of "please don't use them").

The idea behind get_tokens is to call get_token until the EOF token is returned, even though we want the latter to be present in the final result.

    def get_tokens(self):
        t = self.get_token()
        tokens = []

        while t != token.Token('EOF'):
            tokens.append(t)
            t = self.get_token()

        tokens.append(token.Token('EOF'))

        return tokens

Then I decided to make get_token the central hub of my process with the following paradigm: the function tries to extract a specific token (_process_integer, in this case) and to return it; if the token cannot be extracted, the function tries the following one. At the moment I don't have any other type of token, but I will have them soon.

    def get_token(self):
        eof = self._process_eof()
        if eof:
            return eof

        eol = self._process_eol()
        if eol:
            return eol

        integer = self._process_integer()
        if integer:
            return integer

The three helpers shall just try to extract and return the token they have been assigned or None. After some refactoring I came up with three functions (two of them as properties) that simplify common tasks. _current_char and _current_line are just wrappers around two attributes of self._text_storage, while _set_current_token_and_skip is a bit more complex and ensures that the _current_token is always up to date.

    @property
    def _current_char(self):
        return self._text_storage.current_char

    @property
    def _current_line(self):
        return self._text_storage.current_line

    def _set_current_token_and_skip(self, token):
        self._text_storage.skip(len(token))

        self._current_token = token
        return token
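
The call len(token) in _set_current_token_and_skip only works if the token class defines __len__, and the tests compare tokens with ==. The following is a minimal sketch of a class with those properties. It is a hypothetical stand-in written for illustration, and the real class in smallcalc/tok.py may well be different.

```python
class Token:
    """Hypothetical stand-in for the Token class used by the lexer."""

    def __init__(self, _type, value=None):
        self.type = _type
        # Values are stored as strings, as in the tests ('3', '+', ...)
        self.value = str(value) if value is not None else None

    def __eq__(self, other):
        return (self.type, self.value) == (other.type, other.value)

    def __len__(self):
        # EOF and EOL carry no text, so they occupy 0 characters
        return len(self.value) if self.value else 0

    def __bool__(self):
        # Without this, a token of length 0 would be falsy and
        # checks like "if eof:" would silently discard it
        return True

    def __repr__(self):
        return f"Token({self.type!r}, {self.value!r})"


print(len(Token('INTEGER', '356')))  # 3
print(len(Token('EOF')))             # 0
print(bool(Token('EOF')))            # True
```

The __bool__ method matters here because Python falls back to __len__ to decide the truth value of an object when __bool__ is missing.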

Once these functions are in place I can write the actual helpers for the token extraction. The method _process_eol leverages self._text_storage, which raises an EOLError when the end of line has been reached. So all I need to do is to try to get the current char and return None if nothing happens. In case an EOLError exception is raised I run _set_current_token_and_skip with the end of line token.

    def _process_eol(self):
        try:
            self._current_char
            return None
        except text_buffer.EOLError:
            self._text_storage.newline()

            return self._set_current_token_and_skip(
                token.Token(EOL)
            )

The helper to process the end of file (_process_eof) is exactly like _process_eol, using self._current_line and text_buffer.EOFError.

    def _process_eof(self):
        try:
            self._current_line
            return None
        except text_buffer.EOFError:
            return self._set_current_token_and_skip(
                token.Token(EOF)
            )

At this point of the development the incoming token can only be EOL, EOF, or an integer, so the _process_integer function doesn't need to return None. Therefore, it is sufficient to create an integer token with the current char and return it.

    def _process_integer(self):
        return self._set_current_token_and_skip(
            token.Token(INTEGER, self._current_char)
        )

The above methods use two new global variables EOL and INTEGER that are defined at the beginning of the file along with EOF

EOL = 'EOL'
INTEGER = 'INTEGER'

CalcParser is the only class that is tested, but forecasting (actually, knowing) that we are going to manage multiple types of nodes, I isolated the code for the IntegerNode in its own class. There is no need to abstract things further for the time being, so IntegerNode doesn't inherit from any other class.

From a pure TDD point of view this is wrong, because I should have written some tests for the IntegerNode class before writing it. The purpose of this exercise, however, is to guide you through the creation of a simple compiler, so tests are already given, and I will turn a blind eye to my own exception to the rule (how convenient!).

from smallcalc import calc_lexer as clex


class IntegerNode:
    node_type = 'integer'

    def __init__(self, value):
        self.value = int(value)

    def asdict(self):
        return {
            'type': self.node_type,
            'value': self.value
        }


class CalcParser:

    def __init__(self):
        self.lexer = clex.CalcLexer()

    def parse_integer(self):
        t = self.lexer.get_token()
        return IntegerNode(t.value)

CalcVisitor is by far the simplest class at the moment, as the only node we are managing is the one with an integer type.

class CalcVisitor:

    def visit(self, node):
        if node['type'] == 'integer':
            return node['value'], node['type']

Level 3 - Binary operations: addition

"You're about to become a permanent addition to this archaeological find." - Raiders of the Lost Ark (1981)

Let's update the grammar of the language with addsymbol and expression

integer: [0-9]

# Label the symbol '+' with the name 'addsymbol'
addsymbol: '+'

# An expression is an integer followed by an addsymbol followed by another integer
expression: integer addsymbol integer

At the moment, our parser doesn't sound like an important component, as its output is just a refurbished version of the lexer one. The visitor, in turn, doesn't really perform any action but to print in a different format what the parser produces.

Let us try to introduce a simple mathematical operation, then, that should spice up our components. The new test for the lexer component (tests/test_calc_lexer.py) is

def test_get_tokens_understands_unspaced_sum_of_integers():
    l = clex.CalcLexer()

    l.load('3+5')

    assert l.get_tokens() == [
        token.Token(clex.INTEGER, '3'),
        token.Token(clex.LITERAL, '+'),
        token.Token(clex.INTEGER, '5'),
        token.Token(clex.EOL),
        token.Token(clex.EOF)
    ]

Please note that there are no spaces, as our lexer doesn't know how to deal with them yet. As you can see the output is straightforward, so go and change the CalcLexer class to make this test pass without breaking any of the ones you already wrote. Check for coverage, to spot possible overengineered parts, and if necessary refactor the class to keep methods as simple as possible.

The parser now has a more complex job than before, though not yet really challenging. The test for the parser is

def test_parse_expression():
    p = cpar.CalcParser()
    p.lexer.load("2+3")

    node = p.parse_expression()

    assert node.asdict() == {
        'type': 'binary',
        'left': {
            'type': 'integer',
            'value': 2
        },
        'right': {
            'type': 'integer',
            'value': 3
        },
        'operator': {
            'type': 'literal',
            'value': '+'
        }
    }

I want to resume here the discussion about mathematical operators and the role of the visitor that I started in the previous section. As you can see the expression is a generic binary operator, with a left term, a right term, and an operator. The operator, furthermore, is just a literal whose value is the symbol we use for that binary operation.

This parser, thus, is pretty ignorant of the different operations we can perform, giving the whole responsibility to the visitor. We could, however, implement the parser to make it produce something more specific, like for example a binary_sum or addition node, which represents only the addition, and which wouldn't need the 'operator' key, as it is implicit in the node type.
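
To make the comparison concrete, here is a sketch of the two tree shapes side by side. The 'addition' node type and the visit function are hypothetical and are not part of the tests of this tutorial.

```python
# Generic form: the node says "binary" and the operator is a separate literal
generic = {
    'type': 'binary',
    'left': {'type': 'integer', 'value': 2},
    'right': {'type': 'integer', 'value': 3},
    'operator': {'type': 'literal', 'value': '+'},
}

# Hypothetical specific form: the node type itself implies the operation
specific = {
    'type': 'addition',
    'left': {'type': 'integer', 'value': 2},
    'right': {'type': 'integer', 'value': 3},
}


def visit(node):
    """Illustrative visitor that understands both shapes."""
    if node['type'] == 'integer':
        return node['value']

    left = visit(node['left'])
    right = visit(node['right'])

    if node['type'] == 'addition':
        # Nothing to inspect: the parser already decided it is a sum
        return left + right

    if node['type'] == 'binary' and node['operator']['value'] == '+':
        # Here the decision is taken by the visitor
        return left + right

    raise ValueError(f"Unknown node: {node['type']}")


print(visit(generic))   # 5
print(visit(specific))  # 5
```

In the first form the knowledge about "+" lives in the visitor, in the second it is split between the parser and the visitor.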

The amount of work done by the parser and by the visitor is a peculiarity of the specific language or program, so feel free to experiment. For the moment you have to stick to one solution as you are guided by the tests that I wrote, but as soon as you have grasped the concepts and start writing a new language, you will be free to implement each component as you prefer.

Finally, the visitor shall implement the actual mathematical operation. The test is

def test_visitor_expression_sum():
    ast = {
        'type': 'binary',
        'left': {
            'type': 'integer',
            'value': 5
        },
        'right': {
            'type': 'integer',
            'value': 4
        },
        'operator': {
            'type': 'literal',
            'value': '+'
        }
    }

    v = cvis.CalcVisitor()
    assert v.visit(ast) == (9, 'integer')

As soon as you have changed the method visit to deal with 'binary' nodes, you can test the new syntax in the CLI. Since we changed visit internally, that part of the CLI doesn't require any modification. We have, however, to change the parser entry point from parse_integer to parse_expression, so the new cli.py file will be

from smallcalc import calc_parser as cpar
from smallcalc import calc_visitor as cvis


def main():
    p = cpar.CalcParser()
    v = cvis.CalcVisitor()

    while True:
        try:
            text = input('smallcalc :> ')
            p.lexer.load(text)

            node = p.parse_expression()
            res = v.visit(node.asdict())

            print(res)

        except EOFError:
            print("Bye!")
            break

        if not text:
            continue


if __name__ == '__main__':
    main()

And a quick test of the CLI confirms that everything works fine

$ python cli.py 
smallcalc :> 2+4
(6, 'integer')

Everything? Well, not exactly. If I type just a single integer in the CLI the whole program crashes with an exception. If your solution doesn't blow up with a single integer, it means that you (probably) overengineered it a little. This is fine, but if you had implemented just what was needed to pass the tests the result would have been an error in that case.

Why do we have an error? Because we now parse the input with parse_expression and this method expects its input to be a fully formed expression, not a single integer. Generally speaking, our parser's entry point should be able to parse different syntax structures. We will improve this behaviour later, when we address the problem of nested expressions.

Solution

The helper _process_literal does what _process_integer did before, which is to blindly return a token, this time with the LITERAL type.

LITERAL = 'LITERAL'

    def _process_literal(self):
        return self._set_current_token_and_skip(
            token.Token(LITERAL, self._current_char)
        )

The helper _process_integer, on the other hand, changes to return None when no integer can be parsed, which is easily checked with isdigit.

    def _process_integer(self):
        if not self._current_char.isdigit():
            return None

        return self._set_current_token_and_skip(
            token.Token(INTEGER, self._current_char)
        )

Last, the method get_token receives _process_literal as an additional case.

    def get_token(self):
        eof = self._process_eof()
        if eof:
            return eof

        eol = self._process_eol()
        if eol:
            return eol

        integer = self._process_integer()
        if integer:
            return integer

        literal = self._process_literal()
        if literal:
            return literal
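
As the list of helpers grows, the chain of if statements can also be written as a loop over the helpers, tried in order. This is only a sketch of that idea on a toy class: MiniLexer and its tuple tokens are stand-ins invented for the example, not the CalcLexer of this tutorial.

```python
class MiniLexer:
    """Toy stand-in: each helper returns a token or None."""

    def __init__(self, text):
        self.text = text
        self.pos = 0

    def _process_eof(self):
        if self.pos >= len(self.text):
            return ('EOF', None)
        return None

    def _process_integer(self):
        char = self.text[self.pos]
        if not char.isdigit():
            return None
        self.pos += 1
        return ('INTEGER', char)

    def _process_literal(self):
        # Catch-all: whatever is left is returned as a literal
        char = self.text[self.pos]
        self.pos += 1
        return ('LITERAL', char)

    def get_token(self):
        # Order matters: the first helper that returns a token wins
        helpers = [
            self._process_eof,
            self._process_integer,
            self._process_literal,
        ]
        for helper in helpers:
            result = helper()
            if result is not None:
                return result
        return None


lexer = MiniLexer('3+')
tokens = [lexer.get_token() for _ in range(3)]
print(tokens)  # [('INTEGER', '3'), ('LITERAL', '+'), ('EOF', None)]
```

The explicit version used in the solution is easier to follow at this stage, the loop only becomes attractive when many token types are involved.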

The parser needs a node that represents the literal, namely LiteralNode, and a node to represent a binary operation, called BinaryNode. To avoid duplicating methods I created the ValueNode class and made both IntegerNode and LiteralNode inherit from that.

from smallcalc import calc_lexer as clex


class Node:

    def asdict(self):
        return {}  # pragma: no cover


class ValueNode(Node):

    node_type = 'value_node'

    def __init__(self, value):
        self.value = value

    def asdict(self):
        return {
            'type': self.node_type,
            'value': self.value
        }


class IntegerNode(ValueNode):
    node_type = 'integer'

    def __init__(self, value):
        self.value = int(value)


class LiteralNode(ValueNode):

    node_type = 'literal'


class BinaryNode(Node):

    node_type = 'binary'

    def __init__(self, left, operator, right):
        self.left = left
        self.operator = operator
        self.right = right

    def asdict(self):
        result = {
            'type': self.node_type,
            'left': self.left.asdict()
        }

        result['right'] = None
        if self.right:
            result['right'] = self.right.asdict()

        result['operator'] = None
        if self.operator:
            result['operator'] = self.operator.asdict()

        return result

The most important change, however, is in CalcParser, where I added the methods parse_addsymbol and parse_expression.

class CalcParser:

    def __init__(self):
        self.lexer = clex.CalcLexer()

    def parse_addsymbol(self):
        t = self.lexer.get_token()
        return LiteralNode(t.value)

    def parse_integer(self):
        t = self.lexer.get_token()
        return IntegerNode(t.value)

    def parse_expression(self):
        left = self.parse_integer()
        operator = self.parse_addsymbol()
        right = self.parse_integer()

        return BinaryNode(left, operator, right)

The visitor has to add the processing code for binary nodes, which assumes the operation is a sum, so it just needs to visit the left and right nodes.

class CalcVisitor:

    def visit(self, node):
        if node['type'] == 'integer':
            return node['value'], node['type']

        if node['type'] == 'binary':
            lvalue, ltype = self.visit(node['left'])
            rvalue, rtype = self.visit(node['right'])

            return lvalue + rvalue, rtype

Level 4 - Multi-digit integers

"So many." - Braveheart (1995)

Let's move on and allow integers to be made of multiple digits.

# An integer is a sequence of digits, + here means `one or more`
integer: [0-9]+
addsymbol: '+'
expression: integer addsymbol integer

Up to now our language can handle only single-digit integers, so this part shall be enhanced before moving to more complex syntax structures. The only component that requires a change is the lexer, as it should emit one single token containing all the digits. The test, consequently, goes in tests/test_calc_lexer.py

def test_get_tokens_understands_multidigit_integers():
    l = clex.CalcLexer()

    l.load('356')

    assert l.get_tokens() == [
        token.Token(clex.INTEGER, '356'),
        token.Token(clex.EOL),
        token.Token(clex.EOF)
    ]

There are many ways to solve this problem. One of the simplest (and a perfectly valid one) is to use regular expressions (which are, if you think about it, another language).

If you do not know how to use regular expressions do yourself a favour and learn them! You can find a nice tutorial on them at RegexOne. If you already know the syntax but don't know how to use them in Python this Google for Education page and the official documentation are your friends.
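
If you have never used the re module, this short sketch shows the only two features that are needed to recognise a sequence of digits at the beginning of a string. The sample strings are arbitrary.

```python
import re

# \d+ means "one or more digits"; the raw string keeps the backslash intact
integer_re = re.compile(r'\d+')

# match() only looks at the beginning of the string
match = integer_re.match('356+12')
matched_text = match.group()
matched_length = match.end()

print(matched_text)    # 356
print(matched_length)  # 3

# When the string doesn't start with a digit the result is None
no_match = integer_re.match('+12')
print(no_match)  # None
```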

After this test the CLI should be able to handle expressions like 123+456. We don't need any change in the parser and in the visitor. Can you tell why?

Solution

To provide support for multi-digit integers we just need to change the method _process_integer of the lexer. The new version makes use of a very simple regular expression.

import re

    def _process_integer(self):
        regexp = re.compile(r'\d+')

        match = regexp.match(
            self._text_storage.tail
        )

        if not match:
            return None

        token_string = match.group()

        return self._set_current_token_and_skip(
            token.Token(INTEGER, token_string)
        )

The reason why we don't need to change the parser and the visitor is that nothing changed at that level. We altered the way the lexer identifies an integer token, but once that has been isolated the following steps are exactly the same as before.

Level 5 - Whitespaces

"Follow the white rabbit." - The Matrix (1999)

The second limitation that our language has at the moment is that it cannot handle whitespaces. If you try to input an expression like 3 + 4 in the CLI the program will crash with an exception (why?). Traditionally, whitespaces are completely ignored by programming languages: in Python, as well as in C and many other languages, writing 3+4, 3 + 4, 3+ 4, or 3 +4 doesn't change the meaning at all. In Python, however, whitespaces matter at the beginning of the line, as indentation is used in lieu of braces.
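
You can verify this behaviour of Python with the standard ast module, which exposes the tree produced by Python's own parser. The sketch below compares the trees of a few spellings of the same sum; the list of variants is arbitrary.

```python
import ast

variants = ['3+4', '3 + 4', '3+ 4', '3   +   4']

# ast.dump leaves out line and column information by default,
# so only the structure of the expression is compared
trees = [ast.dump(ast.parse(v, mode='eval')) for v in variants]

same_tree = all(tree == trees[0] for tree in trees)
print(same_tree)  # True
```

The spaces are gone by the time the parser builds the tree, which is exactly what we want from our lexer.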

How can we put such a behaviour in our language? Again, the lexer is the component in charge, as it should just skip whitespaces. So add this test to tests/test_calc_lexer.py

def test_get_tokens_ignores_spaces():
    l = clex.CalcLexer()

    l.load('3 + 5')

    assert l.get_tokens() == [
        token.Token(clex.INTEGER, '3'),
        token.Token(clex.LITERAL, '+'),
        token.Token(clex.INTEGER, '5'),
        token.Token(clex.EOL),
        token.Token(clex.EOF)
    ]

While you change the CalcLexer class to make it pass this test, ask yourself if the current structure of the class satisfies you or if it is the right time to refactor it, possibly even heavily rewriting some parts of it. You have a good test suite, now, so you can be sure that what you implemented is correct, at least according to the current requirements.

Note that we are hitting a limitation of unit testing here, which is that we should test that the language skips any amount of whitespaces, but it is impossible to write a test for this. We can test 1, 2, 100 whitespaces, but never any amount. The pragmatic solution here is to test that the language skips one whitespace and to leave further tests to be written only if specific errors arise in the future. The code should however try to provide a generic solution.
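
If you want a bit more confidence than a single case, a handful of representative amounts can be checked in a loop. This is a sketch on a hypothetical standalone helper called skip_spaces, not on the CalcLexer class.

```python
import re


def skip_spaces(text):
    """Hypothetical helper: return text without its leading spaces."""
    match = re.match(r' +', text)
    if not match:
        return text
    return text[match.end():]


# A few representative amounts: this samples the behaviour,
# it does not prove it for every possible amount
results = {amount: skip_spaces(' ' * amount + '5') for amount in (1, 2, 100)}
print(results)  # {1: '5', 2: '5', 100: '5'}

# A string without leading spaces is returned untouched
print(skip_spaces('5'))  # 5
```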

Solution

To process whitespaces I needed to add a helper called _process_whitespace with the same structure as the new _process_integer.

    def _process_whitespace(self):
        regexp = re.compile(r'\ +')

        match = regexp.match(
            self._text_storage.tail
        )

        if not match:
            return None

        self._text_storage.skip(len(match.group()))

Note that the solution here is '\ +' which skips any amount of whitespaces, even though '\ ' would have been enough to pass the test. As I said before, every time we have to test cases with "any" in them we have to be a bit less strict. TDD is not perfect, and remember that at the end of the day it's more important to have something that works and is not perfect than something that is perfect and doesn't work at all.

This time I am not interested in returning whitespace tokens, I just want to skip them. The helper is therefore called in get_token without a return statement.

    def get_token(self):
        eof = self._process_eof()
        if eof:
            return eof

        eol = self._process_eol()
        if eol:
            return eol

        self._process_whitespace()

        integer = self._process_integer()
        if integer:
            return integer

        literal = self._process_literal()
        if literal:
            return literal

Level 6 - Subtraction

"I can add, subtract. I can make coffee. I can shuffle cards." - The Bourne Identity (2002)

integer: [0-9]+

# An addsymbol can be the symbol '+' or the symbol '-'
addsymbol: '+' | '-'

expression: integer addsymbol integer

Now that we addressed two basic issues of our language we can start enhancing the higher level syntactical structures. Since we implemented the addition operation, the most natural step forward is to implement subtraction. As for the addition, this change will involve all three layers of the language: lexer, parser, and visitor.

Let us start teaching the lexer to understand the minus sign. The test that we need is the following (in tests/test_calc_lexer.py)

def test_get_tokens_understands_subtraction():
    l = clex.CalcLexer()

    l.load('3 - 5')

    assert l.get_tokens() == [
        token.Token(clex.INTEGER, '3'),
        token.Token(clex.LITERAL, '-'),
        token.Token(clex.INTEGER, '5'),
        token.Token(clex.EOL),
        token.Token(clex.EOF)
    ]

Does the lexer require any change? Why?

As you can see the decision to handle operators and symbols as LITERAL tokens allows us to introduce new symbols without the need to change the lexer. This obviously means that we will need to tell the symbols apart in a later stage, as nothing happens automatically. We could have decided to represent each symbol with a specific token, like PLUS and MINUS, but if you think about it, this would not have really changed the code in later stages, as PLUS is just another symbol, exactly like + is.

Using specific tokens, however, can simplify things if we want to handle multi-character literals. If we have an operator like -> (as in C) or // (like in Python), or something more complex, we could prefer to handle those in the lexer, emitting a single token with a specific name.
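
One possible way to recognise multi-character operators in a lexer is a regular expression built from a list of operators, with the longer ones listed first. The list OPERATORS and the function match_operator are hypothetical and only serve the example.

```python
import re

# Longer operators come first, otherwise '//' would be read as two '/'
# and '->' as a '-' followed by something else
OPERATORS = ['//', '->', '+', '-', '/']

# re.escape protects characters like '+' that are special in a regexp
operator_re = re.compile('|'.join(re.escape(op) for op in OPERATORS))


def match_operator(text):
    """Return the operator at the beginning of text, or None."""
    match = operator_re.match(text)
    return match.group() if match else None


print(match_operator('// 2'))  # //
print(match_operator('/ 2'))   # /
print(match_operator('->x'))   # ->
print(match_operator('5'))     # None
```

The order of the alternatives matters because the re module tries them from left to right and stops at the first one that matches.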

We could introduce in the lexer a table of accepted values for literals, which would lead to an earlier and better error reporting. At the moment our language accepts every literal between two integers (try with $, for example), but fails to process them in the parser, interpreting any literal as +. Feel free to expand the project in such a direction if you want.
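
Such a table can be as simple as a set of accepted symbols that is checked before the token is created. The names ACCEPTED_LITERALS, UnknownLiteralError, and check_literal below are assumptions made for this sketch and do not exist in the project.

```python
# Hypothetical table of the symbols that the language understands
ACCEPTED_LITERALS = {'+', '-'}


class UnknownLiteralError(ValueError):
    """Hypothetical exception for symbols that are not in the table."""


def check_literal(char):
    """Return char if it is an accepted literal, otherwise raise an error."""
    if char not in ACCEPTED_LITERALS:
        raise UnknownLiteralError(f"Unexpected symbol {char!r}")
    return char


print(check_literal('+'))  # +

try:
    check_literal('$')
except UnknownLiteralError as exc:
    print(exc)  # Unexpected symbol '$'
```

With a check like this the error is reported by the lexer, at the exact position of the offending character, instead of surfacing later as a wrong result.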

The test for the parser is the following

def test_parse_expression_understands_subtraction():
    p = cpar.CalcParser()
    p.lexer.load("2-3")

    node = p.parse_expression()

    assert node.asdict() == {
        'type': 'binary',
        'left': {
            'type': 'integer',
            'value': 2
        },
        'right': {
            'type': 'integer',
            'value': 3
        },
        'operator': {
            'type': 'literal',
            'value': '-'
        }
    }

which is a small variation of the previously implemented test_parse_expression.

The considerations made for the lexer are perfectly valid for the parser as well, so you should not need to change any node at this point. The last test we have to add is that of the visitor, which is again very similar to the previous one

def test_visitor_expression_subtraction():
    ast = {
        'type': 'binary',
        'left': {
            'type': 'integer',
            'value': 5
        },
        'right': {
            'type': 'integer',
            'value': 4
        },
        'operator': {
            'type': 'literal',
            'value': '-'
        }
    }

    v = cvis.CalcVisitor()
    assert v.visit(ast) == (1, 'integer')

This time, however, we are at the end of the processing chain, and we have to deal with the difference symbol, namely to actually subtract numbers. To make this test pass, thus, you will need to change something in your CalcVisitor class.

Once your code passes this test, a quick test of the CLI shows that everything works as intended

$ python cli.py 
smallcalc :> 4 - 6
(-2, 'integer')

and since we rely on Python to perform the actual subtraction we get negative numbers for free. Pay attention: we can have negative numbers in the results, but we cannot input negative numbers. This is something that we will have to add later.

Solution

Adding the addition binary operation changed code in the lexer, the parser, and in the visitor. That operation was however considered a generic binary operation, and only the visitor implements the actual + operation. So adding the subtraction works out of the box for the first two stages and requires me to change the visitor only, with a simple if condition on the value of the operator.

class CalcVisitor:

    def visit(self, node):
        if node['type'] == 'integer':
            return node['value'], node['type']

        if node['type'] == 'binary':
            lvalue, ltype = self.visit(node['left'])
            rvalue, rtype = self.visit(node['right'])

            operator = node['operator']['value']

            if operator == '+':
                return lvalue + rvalue, rtype
            else:
                return lvalue - rvalue, rtype

Level 7 - Multiple operations

"The machine simply does not operate as expected." - The Prestige (2006)

integer: [0-9]+
addsymbol: '+' | '-'

# An expression starts with a single integer and optionally
# contains an addsymbol and another expression
# (this is a recursive definition) 
expression: integer (addsymbol expression)

Before we dive into the fascinating but complex topic of nested operations, let us take a look at multiple operations and implement them, that is, the application of a chain of "similar" operators with the same priority.

Since this tutorial is a practical approach to the construction of an interpreter, I will not go too deep into the subject matter. Feel free to check the references if you are interested in such topics. For the moment, it is sufficient to understand that addition and subtraction are two operations that have the same precedence, which means that neither of them has to be executed before the other and that a chain of them is processed in the order in which it is written.

For instance: the expression 3 + 4 - 5 gives 2 as a result. In this case the result is the same if we perform (3 + 4) - 5 = 7 - 5 = 2 or 3 + (4 - 5) = 3 - 1 = 2, where the expressions between parentheses are executed first. Be aware that this does not hold for every chain: 3 - 4 + 5 gives 4 when read from left to right, while 3 - (4 + 5) gives -6.

From the interpreter's point of view, then, we can process a chain of additions and subtractions without being concerned about precedence, which greatly simplifies our job. As the output of the parser is a tree, however, we need to find a way to represent such a chain of operations in that form. One way is to nest expressions, which means that each operation is a single binary node, with the left term containing everything that has been processed so far and the right one the next integer. In the previous example 3 + 4 - 5 is represented by a subtraction between 3 + 4 and 5. 3 + 4, in turn, is another binary node, an addition between 3 and 4.
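
To get a feeling for how a nested tree is evaluated, here is a sketch that builds a few trees by hand and computes their value recursively. The helpers integer, binary, and evaluate are hypothetical and are not the parser or the visitor of this tutorial. The sketch also includes a chain, 3 - 4 + 5, where the two ways of nesting give different values, which is worth keeping in mind when choosing how to nest.

```python
def integer(value):
    return {'type': 'integer', 'value': value}


def binary(left, symbol, right):
    return {
        'type': 'binary',
        'left': left,
        'right': right,
        'operator': {'type': 'literal', 'value': symbol},
    }


def evaluate(node):
    """Recursively compute the value of a tree of nodes."""
    if node['type'] == 'integer':
        return node['value']

    left = evaluate(node['left'])
    right = evaluate(node['right'])

    if node['operator']['value'] == '+':
        return left + right
    return left - right


# 3 + 4 - 5 nested on the left, (3 + 4) - 5, and on the right, 3 + (4 - 5)
left_nested = binary(binary(integer(3), '+', integer(4)), '-', integer(5))
right_nested = binary(integer(3), '+', binary(integer(4), '-', integer(5)))
print(evaluate(left_nested), evaluate(right_nested))  # 2 2

# 3 - 4 + 5 nested on the left, (3 - 4) + 5, and on the right, 3 - (4 + 5)
left_chain = binary(binary(integer(3), '-', integer(4)), '+', integer(5))
right_chain = binary(integer(3), '-', binary(integer(4), '+', integer(5)))
print(evaluate(left_chain), evaluate(right_chain))  # 4 -6
```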

Let us start by checking if the lexer understands multiple operations

def test_get_tokens_understands_multiple_operations():
    l = clex.CalcLexer()

    l.load('3 + 5 - 7')

    assert l.get_tokens() == [
        token.Token(clex.INTEGER, '3'),
        token.Token(clex.LITERAL, '+'),
        token.Token(clex.INTEGER, '5'),
        token.Token(clex.LITERAL, '-'),
        token.Token(clex.INTEGER, '7'),
        token.Token(clex.EOL),
        token.Token(clex.EOF)
    ]

If the current version of your lexer doesn't pass this test, make the necessary changes to the code.

Now we should modify the parser. While expressing the test is very simple, actually creating the code that makes it pass is not that trivial. So, as an intermediate step, I will make you implement the code that allows the parser to check upcoming tokens.

One solution to this problem is to save the state of the lexer, get as many tokens as we need, and then restore the saved state. Inspired by Git, I called those methods stash and pop. Put the following test in tests/test_calc_lexer.py

def test_lexer_can_stash_and_pop_status():
    l = clex.CalcLexer()
    l.load('3 5')

    l.stash()
    l.get_token()
    l.pop()

    assert l.get_token() == token.Token(clex.INTEGER, '3')

As you can see, the get_token call between stash and pop doesn't leave any trace.

Once your code is working, implement the second test. Create a method peek_token that performs all the previous actions together

def test_lexer_can_peek_token():
    l = clex.CalcLexer()
    l.load('3 + 5')

    l.get_token()
    assert l.peek_token() == token.Token(clex.LITERAL, '+')

You can implement peek_token very easily leveraging stash and pop.

Now we are ready to face the test that covers multiple operations, which goes in tests/test_calc_parser.py

def test_parse_expression_with_multiple_operations():
    p = cpar.CalcParser()
    p.lexer.load("2 + 3 - 4")

    node = p.parse_expression()

    assert node.asdict() == {
        'type': 'binary',
        'left': {
            'type': 'binary',
            'left': {
                'type': 'integer',
                'value': 2
            },
            'right': {
                'type': 'integer',
                'value': 3
            },
            'operator': {
                'type': 'literal',
                'value': '+'
            }
        },
        'right': {
            'type': 'integer',
            'value': 4
        },
        'operator': {
            'type': 'literal',
            'value': '-'
        }
    }

A note of warning: probably the first version of the code that makes this test pass will be horrible, as the logic involved is not trivial. Remember that your first goal is to make the test pass and then, with the battery of tests in your arsenal, to tidy up the code.

As usual, the last test involves the visitor

def test_visitor_expression_with_multiple_operations():
    ast = {
        'type': 'binary',
        'left': {
            'type': 'binary',
            'left': {
                'type': 'integer',
                'value': 3
            },
            'right': {
                'type': 'integer',
                'value': 4
            },
            'operator': {
                'type': 'literal',
                'value': '-'
            }
        },
        'right': {
            'type': 'integer',
            'value': 200
        },
        'operator': {
            'type': 'literal',
            'value': '+'
        }
    }

    v = cvis.CalcVisitor()
    assert v.visit(ast) == (199, 'integer')

What changes do you need to make to the CalcVisitor class? Why?

Solution¶

I made no assumptions on the length of the token stream in get_tokens, so processing multiple tokens comes out of the box in the lexer.

Adding stash and pop is not very complex, as the tests show exactly what we need to save and retrieve. Here I leverage the position attribute and the goto method of the TextBuffer class.

class CalcLexer:

    def __init__(self, text=''):
        self._text_storage = text_buffer.TextBuffer(text)
        self._status = []
        self._current_token = None

    [...]

    @property
    def _current_status(self):
        status = {}
        status['text_storage'] = self._text_storage.position
        status['current_token'] = self._current_token
        return status

    def stash(self):
        self._status.append(self._current_status)

    def pop(self):
        status = self._status.pop()
        self._text_storage.goto(*status['text_storage'])
        self._current_token = status['current_token']

Once stash and pop are in place, implementing peek_token is trivial

    def peek_token(self):
        self.stash()
        token = self.get_token()
        self.pop()

        return token
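To see the stash and pop mechanism in isolation, here is a self-contained sketch. ToyLexer is a hypothetical stand-in and not the CalcLexer of this series: it splits the text on whitespace and saves only an index, whereas the real class saves the TextBuffer position and the current token.

```python
class ToyLexer:
    """Hypothetical stand-in for CalcLexer: tokens are whitespace-separated words."""

    def __init__(self, text=''):
        self._words = text.split()
        self._position = 0
        self._status = []  # Stack of saved positions

    def get_token(self):
        # Return the next word and advance, or None at the end of the input
        if self._position >= len(self._words):
            return None
        word = self._words[self._position]
        self._position += 1
        return word

    def stash(self):
        # Save the current position on top of the stack
        self._status.append(self._position)

    def pop(self):
        # Restore the most recently saved position
        self._position = self._status.pop()

    def peek_token(self):
        # Read one token and then rewind, so the read leaves no trace
        self.stash()
        token = self.get_token()
        self.pop()
        return token


lexer = ToyLexer('3 + 5')
first = lexer.get_token()
peeked = lexer.peek_token()
after_peek = lexer.get_token()
```

Since the saved states are kept in a list used as a stack, calls to stash and pop can be nested, which is handy when the parser needs to look ahead more than once.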

Finally, peek_token allows me to add support for multiple operations in the parser.

    def parse_expression(self):
        left = self.parse_integer()

        next_token = self.lexer.peek_token()

        while next_token.type == clex.LITERAL:
            operator = self.parse_addsymbol()
            right = self.parse_integer()

            left = BinaryNode(left, operator, right)

            next_token = self.lexer.peek_token()

        return left
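As for the question about CalcVisitor, here is a minimal sketch, assuming a visitor that calls itself recursively on the left and right children of a binary node. Under that assumption nested nodes need no extra code, because each child is visited in the same way regardless of its depth. ToyVisitor is a hypothetical class written for this illustration and is not the CalcVisitor of this series.

```python
class ToyVisitor:
    """Hypothetical recursive visitor over the dictionary AST used in the tests."""

    def visit(self, node):
        if node['type'] == 'integer':
            return node['value'], 'integer'

        if node['type'] == 'binary':
            # Children may be integers or other binary nodes
            lvalue, _ = self.visit(node['left'])
            rvalue, _ = self.visit(node['right'])

            operator = node['operator']['value']
            if operator == '+':
                return lvalue + rvalue, 'integer'
            if operator == '-':
                return lvalue - rvalue, 'integer'
            raise ValueError(f'Unknown operator: {operator}')

        raise ValueError(f"Unknown node type: {node['type']}")


# The same tree used in test_visitor_expression_with_multiple_operations
ast = {
    'type': 'binary',
    'left': {
        'type': 'binary',
        'left': {'type': 'integer', 'value': 3},
        'right': {'type': 'integer', 'value': 4},
        'operator': {'type': 'literal', 'value': '-'},
    },
    'right': {'type': 'integer', 'value': 200},
    'operator': {'type': 'literal', 'value': '+'},
}

result = ToyVisitor().visit(ast)
```

The inner node 3 - 4 is evaluated first and gives -1, then the root adds 200 to it.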

Final words¶

Phew! That was something, wasn't it? We went from nothing to a trivial calculator, but the engine we have under the bonnet is clearly powerful, so I'm already looking forward to implementing more complex syntax elements, like round brackets, multiplication, and division. Sooner or later this should become a language, so we will also need variables, functions, scopes, and so on.

The code I developed in this post is available on the GitHub repository tagged with part1 (link).

Well, see you in the next post of the series, then!

Resources¶

Some links on compilers history

Tutorials and analysis of compilers and parsers

The beautiful "Let’s Build A Simple Interpreter" series by Ruslan Spivak. Thanks Ruslan!
How to implement a programming language in JavaScript on Lisperator.net by Mihai Bazon.
Build Your Own Lisp
LL and LR Parsing Demystified by Josh Haberman.

Grammars

Updates¶

2017-12-24: Victor Uriarte (vmuriart) spotted an important issue in a previous version of the post. The last two tests (test_parse_expression_with_multiple_operations and test_visitor_expression_with_multiple_operations) used a right-growing tree instead of a left-growing one. The problem with a right-growing tree is that an operator affects everything that is on its right side, that is, the whole rest of the operation. Thus, an operation like 10 - 1 + 1 would become 10 - (1 + 1), and the result is obviously different. I fixed the tests and the solution I give in the next posts. You can read Victor's issue here. Thanks Victor for spotting it!
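The difference between the two groupings can be checked with a few lines of code. This is only an illustrative sketch: the functions apply_operator, eval_left, and eval_right are hypothetical names that work on plain lists of numbers and operators, not on the trees produced by the parser.

```python
def apply_operator(operator, a, b):
    # Only the two operators supported so far
    return a + b if operator == '+' else a - b


def eval_left(numbers, operators):
    # Left-growing grouping: ((n0 op0 n1) op1 n2) ...
    result = numbers[0]
    for operator, number in zip(operators, numbers[1:]):
        result = apply_operator(operator, result, number)
    return result


def eval_right(numbers, operators):
    # Right-growing grouping: n0 op0 (n1 op1 (n2 ...))
    if not operators:
        return numbers[0]
    rest = eval_right(numbers[1:], operators[1:])
    return apply_operator(operators[0], numbers[0], rest)


left = eval_left([10, 1, 1], ['-', '+'])    # (10 - 1) + 1
right = eval_right([10, 1, 1], ['-', '+'])  # 10 - (1 + 1)
```

Here left is 10 and right is 8, which is why the tree has to grow to the left.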

Feedback¶

Feel free to reach me on Twitter if you have questions. The GitHub issues page is the best place to submit corrections.