A game of tokens: write an interpreter in Python with TDD - Part 2
By Leonardo Giordani - Updated on
Introduction¶
Welcome to the second part of the series of posts about writing an interpreter with Python and TDD. In the first post we developed together a simple calculator that can handle integers, addition and subtraction. In this instalment I'll give you new tests that will guide you through the implementation of multiplication, division, parentheses, and unary operators. I will obviously reference the structure I used in my solution, but you mileage may vary, so feel free to ignore the comments or the suggested solutions in case your code is different.
Level 8 - Multiplication and division¶
"They're coming outta the walls. They're coming outta the goddamn walls." - Aliens (1986)
As you remember from the previous post our interpreter is made of three different components, the lexer, the parser, and the visitor. So, to implement the missing basic operations, multiplication and division, we need to start with the lexer and ensure that it understands the traditional symbols *
and /
Lexer¶
Put the following tests in the tests/test_calc_lexer.py
file
def test_get_tokens_understands_multiplication():
l = clex.CalcLexer()
l.load('3 * 5')
assert l.get_tokens() == [
token.Token(clex.INTEGER, '3'),
token.Token(clex.LITERAL, '*'),
token.Token(clex.INTEGER, '5'),
token.Token(clex.EOL),
token.Token(clex.EOF)
]
def test_get_tokens_understands_division():
l = clex.CalcLexer()
l.load('3 / 5')
assert l.get_tokens() == [
token.Token(clex.INTEGER, '3'),
token.Token(clex.LITERAL, '/'),
token.Token(clex.INTEGER, '5'),
token.Token(clex.EOL),
token.Token(clex.EOF)
]
Do the tests fail? Why? Please remember that when tests pass without requiring any code change you have to ask yourself "Why do they pass?", and be sure that you understood the answer before going further. Otherwise you might be adding tests that are wrong, or tests for things that have already been tested, and in either case you should act on them.
Parser¶
Now that the lexer understands the symbols we can start considering the parser. The parser has to output a sensible structure that represents the new operations, which is not different from what it outputs for the sum and the difference. Add the following tests to tests/test_calc_parser.py
def test_parse_term():
p = cpar.CalcParser()
p.lexer.load("2 * 3")
node = p.parse_term()
assert node.asdict() == {
'type': 'binary',
'left': {
'type': 'integer',
'value': 2
},
'right': {
'type': 'integer',
'value': 3
},
'operator': {
'type': 'literal',
'value': '*'
}
}
def test_parse_term_with_multiple_operations():
p = cpar.CalcParser()
p.lexer.load("2 * 3 / 4")
node = p.parse_term()
assert node.asdict() == {
'type': 'binary',
'left': {
'type': 'binary',
'left': {
'type': 'integer',
'value': 2
},
'right': {
'type': 'integer',
'value': 3
},
'operator': {
'type': 'literal',
'value': '*'
}
},
'right': {
'type': 'integer',
'value': 4
},
'operator': {
'type': 'literal',
'value': '/'
}
}
This time you should have some failures, so go and edit the CalcParser
class in order to pass the tests. As the two new binary operations are at this level the same as sum and difference you could change the method parse_expression
(try it!). This will however make things harder later when we will prioritise operations (multiplications have to be performed before sums), so my advice is to introduce a method parse_term
in the parser, which is the method used in the tests.
Visitor¶
Now it's the visitor's turn, where the syntax tree gets analysed and actually executed. Add the following tests to the tests/test_calc_visitor.py
file and then make them pass changing the CalcVisitor
class accordingly.
def test_visitor_term_multiplication():
ast = {
'type': 'binary',
'left': {
'type': 'integer',
'value': 5
},
'right': {
'type': 'integer',
'value': 4
},
'operator': {
'type': 'literal',
'value': '*'
}
}
v = cvis.CalcVisitor()
assert v.visit(ast) == (20, 'integer')
def test_visitor_term_division():
ast = {
'type': 'binary',
'left': {
'type': 'integer',
'value': 11
},
'right': {
'type': 'integer',
'value': 4
},
'operator': {
'type': 'literal',
'value': '/'
}
}
v = cvis.CalcVisitor()
assert v.visit(ast) == (2, 'integer')
Solution¶
The tests we added for the lexer already pass. This is not surprising, as the lexer is designed to return everything it doesn't know as a LITERAL
(smallcalc/calc_lexer.py:119
). As we already instructed the lexer to skip spaces the new operators are happily digested. As I discussed in the previous post, I decided for this project not to assign operators a specific token, so from this point of view our lexer is pretty open and could already understand instructions like 3 $ 5
or 7 : 9
, even though they do not have any meaning in our new language (yet, maybe).
The parser is not so merciful, and the two new tests do not pass. We are explicitly calling a method parse_term
that is not defined, so a success would have been very worrying. In these two tests parse_term
is called explicitly and there is no relationship with the other methods named parse_*
, so we can implement it as a stand-alone processing.
We know that a term
is an operation between two integers, so we can follow what we did with parse_expression
. The first thing we do is to parse the first integer, then we peek the next token and we decide what to do. If the token is a LITERAL
we suppose it is the operation symbol, otherwise we probably hit the end of the file and we will just return the previously read integer. The second element may be a simple integer or another multiplication or division, so we recursively call parse_term
and return a BinaryNode
with the result.
[Note: I noticed that the parse_addsymbol
could be now named parse_literal
but this wasn't done when I prepared the source code. Regardless of the name, however, what this method does is to just pack a literal in a LiteralNode
and return it.]
The whole parser is now the following
class CalcParser:
def __init__(self):
self.lexer = clex.CalcLexer()
def parse_addsymbol(self):
t = self.lexer.get_token()
return LiteralNode(t.value)
def parse_integer(self):
t = self.lexer.get_token()
return IntegerNode(t.value)
def parse_term(self):
left = self.parse_integer()
next_token = self.lexer.peek_token()
while next_token.type == clex.LITERAL:
operator = self.parse_addsymbol()
right = self.parse_integer()
left = BinaryNode(left, operator, right)
next_token = self.lexer.peek_token()
return left
def parse_expression(self):
left = self.parse_integer()
next_token = self.lexer.peek_token()
while next_token.type == clex.LITERAL:
operator = self.parse_addsymbol()
right = self.parse_integer()
left = BinaryNode(left, operator, right)
next_token = self.lexer.peek_token()
return left
The visitor was instructed only to deal with sums and subtractions, and it treats everything is not the former as the latter. This is why the new tests give as results 1
and 7
. We just need to extend the if
statement to include the new operations
class CalcVisitor:
def visit(self, node):
if node['type'] == 'integer':
return node['value'], node['type']
if node['type'] == 'binary':
lvalue, ltype = self.visit(node['left'])
rvalue, rtype = self.visit(node['right'])
operator = node['operator']['value']
if operator == '+':
return lvalue + rvalue, rtype
elif operator == '-':
return lvalue - rvalue, rtype
elif operator == '*':
return lvalue * rvalue, rtype
elif operator == '/':
return lvalue // rvalue, rtype
Now we have a pretty simple but fully working calculator! Enjoy the cli.py
, as YOU did it this time! I remember I was pretty excited the first time I run a command line calculator done by me. But hold tight, because you are going to learn and implement much more!
Level 9 - Mixing operators¶
"Don't cross the streams." - Ghostbusters (1984)
Ok, it's time to do some serious math. What happens if you mix sums and multiplications? Let's try it and see how our interpreter reacts. We already know that the lexer happily digests all the four symbols so we can head straight to the parser and add the following test to tests/test_calc_parser.py
def test_parse_expression_with_term():
p = cpar.CalcParser()
p.lexer.load("2 + 3 * 4")
node = p.parse_expression()
assert node.asdict() == {
'type': 'binary',
'left': {
'type': 'integer',
'value': 2
},
'right': {
'type': 'binary',
'left': {
'type': 'integer',
'value': 3
},
'right': {
'type': 'integer',
'value': 4
},
'operator': {
'type': 'literal',
'value': '*'
}
},
'operator': {
'type': 'literal',
'value': '+'
}
}
Chances are that this will fail miserably. Probably you have to rework a bit parse_expression
as it is ignoring the new entry, parse_term
. Please note that 2 * 3 + 4
must give 10
according to the standard math rules, and not 14
. This happens because multiplication is performed before sum, and the order depends uniquely on the structure created by the parser, and not by the visitor (which is at this point a pretty dumb component).
Once the parser outputs the correct structure the visitor shouldn't have issues, as it is already behaving in a recursive way. If you want to check feel free to add relevant tests, however.
Solution¶
Ouch! It looks like putting multiplications and sums in the same line is not really working. As you may recall we didn't link parse_term
with the other methods, and we use a generic function to treat literals. This works in principle, but doesn't consider operator precedence.
When we try to evaluate 2 + 3 * 4
the output of the parser is
{
"type": "binary",
"left": {
"type": "binary",
"left": {
"type": "integer",
"value": 2
},
"right": {
"type": "integer",
"value": 3
},
"operator": {
"type": "literal",
"value": "+"
}
},
"right": {
"type": "integer",
"value": 4
},
"operator": {
"type": "literal",
"value": "*"
}
}
As you can clearly see the parser recognised the multiplication operator, but then returns a nested sum (the oputput of a recursive call of parse_term
). This gives the sum a greater precedence that that of the sum, which is against the mathematical rules we want to follow here. 2 + 3 * 4
shall be considered 2 + (3 * 4)
and not (2 + 3) * 4
.
To fix this we have to rework parse_term
. First of all it shall accept only the *
and /
operators, then it shall return the left part if it finds a different literal. Even parse_expression
shall change a bit: the first thing to do is to call parse_term
instead of parse_integer
and then to return the left part.
The new code is then
def parse_term(self):
left = self.parse_integer()
next_token = self.lexer.peek_token()
while next_token.type == clex.LITERAL\
and next_token.value in ['*', '/']:
operator = self.parse_addsymbol()
right = self.parse_integer()
left = BinaryNode(left, operator, right)
next_token = self.lexer.peek_token()
return left
def parse_expression(self):
left = self.parse_term()
next_token = self.lexer.peek_token()
while next_token.type == clex.LITERAL:
operator = self.parse_addsymbol()
right = self.parse_term()
left = BinaryNode(left, operator, right)
next_token = self.lexer.peek_token()
return left
Let's see what happens parsing 2 * 3 + 4
. The test calls parse_expression
which tries immediately to run parse_term
. The latter recognises 2
and *
, so it calls itself recursively just before the 3
and returns the binary node. This means that the multiplication is the first operation we return, the one with higher precedence. The recursive call recognises 3
but then doesn't know what to do with +
as we specifically consider only *
and /
, so it just returns the integer value. Back to parse_expression
, then the variable left
will contain the binary node that represents 2 * 3
. The function will then finish adding the binary node for the sum.
Take your time to understand the mechanism, perhaps trying with different operations like 2 + 4 * 6 - 8
, which should return 18
.
Level 10 - Parentheses¶
"When nine hundred years old you reach, look as good you will not." - Return of the Jedi (1983)
Parentheses, are curved brackets used in mathematics to change the order of operations. As this part is pretty important I will spend some time on it, because the order of operations will be of concerns also when it comes to language operators, and not only when dealing with mathematical operations. As explained in the previous section almost everything at this point happens in the parser, as the resulting structure that we will give to the visitor is the one that rules the precedence of operations.
Let's start to check that the lexer understands the parentheses symbols (
and )
.
def test_get_tokens_understands_parentheses():
l = clex.CalcLexer()
l.load('3 * ( 5 + 7 )')
assert l.get_tokens() == [
token.Token(clex.INTEGER, '3'),
token.Token(clex.LITERAL, '*'),
token.Token(clex.LITERAL, '('),
token.Token(clex.INTEGER, '5'),
token.Token(clex.LITERAL, '+'),
token.Token(clex.INTEGER, '7'),
token.Token(clex.LITERAL, ')'),
token.Token(clex.EOL),
token.Token(clex.EOF)
]
As our lexer is pretty open-minded it shouldn't raise any objections and happily pass the test (why?).
As always, instead, its neighbour the parser is not that forgiving, and I bet it will make a fuss. Let's try and feed it with some simple expression with parentheses
def test_parse_expression_with_parentheses():
p = cpar.CalcParser()
p.lexer.load("(2 + 3)")
node = p.parse_expression()
assert node.asdict() == {
'type': 'binary',
'left': {
'type': 'integer',
'value': 2
},
'right': {
'type': 'integer',
'value': 3
},
'operator': {
'type': 'literal',
'value': '+'
}
}
To make this test pass my suggestion is to introduce a method parse_factor
, where the term factor encompasses both integers and the expressions between parentheses. In the latter case, obviously, you will need to call parse_expression
, which somehow breaks the hierarchical structure of methods in the parser.
Solution¶
Let's have some Lisp time here and introduce parentheses. As happened for the new mathematical operators, parentheses are already accepted by the lexer as simple literals, so the first test passes without any change in the code. The parser complains, however, as it always expects an integer (smallcalc/calc_parser.py:76
).
As I suggested, my idea is to introduce a method that parses a so-called factor, which can either be an integer of an expression between parentheses.
class CalcParser:
def __init__(self):
self.lexer = clex.CalcLexer()
def parse_addsymbol(self):
t = self.lexer.get_token()
return LiteralNode(t.value)
def parse_integer(self):
t = self.lexer.get_token()
return IntegerNode(t.value)
def parse_factor(self):
next_token = self.lexer.peek_token()
if next_token.type == clex.LITERAL and next_token.value == '(':
self.lexer.get_token()
expression = self.parse_expression()
self.lexer.get_token()
return expression
return self.parse_integer()
The method parse_term
now has to call parse_factor
def parse_term(self):
left = self.parse_factor()
next_token = self.lexer.peek_token()
while next_token.type == clex.LITERAL\
and next_token.value in ['*', '/']:
operator = self.parse_addsymbol()
right = self.parse_integer()
left = BinaryNode(left, operator, right)
next_token = self.lexer.peek_token()
return left
And last we need to slightly change parse_expression
introducing a check on the literal token value. This happens because I decided to identify everything with a literal, so the method has to rule out every literal it is not interested to manage. If you introduce specific tokens for operations, parentheses, etc., this change is not required (but you won't use clex.LITERAL
at that point).
def parse_expression(self):
left = self.parse_term()
next_token = self.lexer.peek_token()
while next_token.type == clex.LITERAL\
and next_token.value in ['+', '-']:
operator = self.parse_addsymbol()
right = self.parse_term()
left = BinaryNode(left, operator, right)
next_token = self.lexer.peek_token()
return left
Level 11 - Priorities¶
"You got issues, Quill." - Guardians of the Galaxy (2014)
As parentheses have been introduced to change the default priority rules between operators we need to be sure that this happens. We can test it easily with this code
def test_parse_parentheses_change_priority():
p = cpar.CalcParser()
p.lexer.load("(2 + 3) * 4")
node = p.parse_expression()
assert node.asdict() == {
'type': 'binary',
'left': {
'type': 'binary',
'left': {
'type': 'integer',
'value': 2
},
'right': {
'type': 'integer',
'value': 3
},
'operator': {
'type': 'literal',
'value': '+'
}
},
'right': {
'type': 'integer',
'value': 4
},
'operator': {
'type': 'literal',
'value': '*'
}
}
Now when your parser passes this test you have a full-fledged calculator that supports parentheses. Make sure to test the new features in the CLI. Does multiple parentheses work? Why?
Solution¶
Another feature that comes for free with the previous changes, as the first thing that parse_expression
does is to run parse_term
, and the first thing the latter does is to run parse_factor
, which in turn manages expressions between parentheses. If the expression is enclosed between parentheses the method parse_factor
doesn't call parse_expression
and just returns the integer.
Level 12 - Unary operators¶
"There can be only one!" - Highlander (1986)
Now it's time to introduce unary operators, which are very important in programming languages. Just think at not x
and you will immediately understand why you need them. Unary operators do not fit in the current structure of our interpreter as the parser is always expecting either an integer or an open parenthesis as the first token.
Let's first write a test for the most simple unary operator, which is a minus (as in -2
). Remember that we are testing the parser here, as the lexer is already able to parse the minus sign.
def test_parse_factor_supports_unary_operator():
p = cpar.CalcParser()
p.lexer.load("-5")
node = p.parse_factor()
assert node.asdict() == {
'type': 'unary',
'operator': {
'type': 'literal',
'value': '-'
},
'content': {
'type': 'integer',
'value': 5
}
}
When your parser passes this test we have to make sure that the unary minus can be applied also to expressions between parentheses
def test_parse_factor_supports_negative_expressions():
p = cpar.CalcParser()
p.lexer.load("-(2 + 3)")
node = p.parse_factor()
assert node.asdict() == {
'type': 'unary',
'operator': {
'type': 'literal',
'value': '-'
},
'content': {
'type': 'binary',
'left': {
'type': 'integer',
'value': 2
},
'right': {
'type': 'integer',
'value': 3
},
'operator': {
'type': 'literal',
'value': '+'
}
}
}
Once the parser is able to pass these two tests we are confident that the unary minus can be used in front of all the basic elements of our expressions. At this point it is time to execute the unary expressions produced by the parsing layer, so include this test for the visitor
def test_visitor_unary_minus():
ast = {
'type': 'unary',
'operator': {
'type': 'literal',
'value': '-'
},
'content': {
'type': 'binary',
'left': {
'type': 'integer',
'value': 2
},
'right': {
'type': 'integer',
'value': 3
},
'operator': {
'type': 'literal',
'value': '+'
}
}
}
v = cvis.CalcVisitor()
assert v.visit(ast) == (-5, 'integer')
Change the visitor to pass this test and you can go straight to the CLI and start using negative numbers or negative expressions. Can you execute something like --2
(minus minus 2)? What is the result? Why?
Now let's go back to the parser and ensure that the unary plus can be used as well. This is the test
def test_parse_factor_supports_unary_plus():
p = cpar.CalcParser()
p.lexer.load("+(2 + 3)")
node = p.parse_factor()
assert node.asdict() == {
'type': 'unary',
'operator': {
'type': 'literal',
'value': '+'
},
'content': {
'type': 'binary',
'left': {
'type': 'integer',
'value': 2
},
'right': {
'type': 'integer',
'value': 3
},
'operator': {
'type': 'literal',
'value': '+'
}
}
}
and the code should be trivial, as you already manage the unary minus. The relative test for the visitor is
def test_visitor_unary_plus():
ast = {
'type': 'unary',
'operator': {
'type': 'literal',
'value': '+'
},
'content': {
'type': 'binary',
'left': {
'type': 'integer',
'value': 2
},
'right': {
'type': 'integer',
'value': 3
},
'operator': {
'type': 'literal',
'value': '+'
}
}
}
v = cvis.CalcVisitor()
assert v.visit(ast) == (5, 'integer')
Once your code passes all the tests head to the CLI and try to run something like -+--++-3
. Does it work?
Solution¶
The minus unary operator uses a literal that we already manage in the lexer, so there is nothing to do there. The first test I gave you checks if the parser can process a factor in the form -5
.
The current implementation of parse_factor
processes either an expression enclosed between parentheses or an integer, and actually the test doesn't pass, complaining against the minus sign not being a valid integer with base 10. The solution is pretty straightforward, as it is enough to add another if
that manages the minus sign. When we encounter such a sign, however, we have to return a different type of node, as the test states, so we also have to introduce the relative class.
class UnaryNode(Node):
node_type = 'unary'
def __init__(self, operator, content):
self.operator = operator
self.content = content
def asdict(self):
result = {
'type': self.node_type,
'operator': self.operator.asdict(),
'content': self.content.asdict()
}
return result
class CalcParser:
def __init__(self):
self.lexer = clex.CalcLexer()
def parse_addsymbol(self):
t = self.lexer.get_token()
return LiteralNode(t.value)
def parse_integer(self):
t = self.lexer.get_token()
return IntegerNode(t.value)
def parse_factor(self):
next_token = self.lexer.peek_token()
if next_token.type == clex.LITERAL and next_token.value == '-':
operator = self.parse_addsymbol()
factor = self.parse_factor()
return UnaryNode(operator, factor)
if next_token.type == clex.LITERAL and next_token.value == '(':
self.lexer.get_token()
expression = self.parse_expression()
self.lexer.get_token()
return expression
return self.parse_integer()
The second test passes automatically because parse_factor
intercepts the -
literal before the (
one.
The visitor has to be updated with the new type of unary
node. The new visitor is then
class CalcVisitor:
def visit(self, node):
if node['type'] == 'integer':
return node['value'], node['type']
if node['type'] == 'unary':
operator = node['operator']['value']
cvalue, ctype = self.visit(node['content'])
if operator == '-':
return - cvalue, ctype
if node['type'] == 'binary':
lvalue, ltype = self.visit(node['left'])
rvalue, rtype = self.visit(node['right'])
operator = node['operator']['value']
if operator == '+':
return lvalue + rvalue, rtype
elif operator == '-':
return lvalue - rvalue, rtype
elif operator == '*':
return lvalue * rvalue, rtype
elif operator == '/':
return lvalue // rvalue, rtype
Now the unary plus is easy to sort out, as we just need to take it into account in parse_factor
along with the unary minus.
def parse_factor(self):
next_token = self.lexer.peek_token()
if next_token.type == clex.LITERAL and next_token.value in ['-', '+']:
operator = self.parse_addsymbol()
factor = self.parse_factor()
return UnaryNode(operator, factor)
if next_token.type == clex.LITERAL and next_token.value == '(':
self.lexer.get_token()
expression = self.parse_expression()
self.lexer.get_token()
return expression
return self.parse_integer()
And the visitor is missing a single return after the if
statement that deals with the unary minus.
if node['type'] == 'unary':
operator = node['operator']['value']
cvalue, ctype = self.visit(node['content'])
if operator == '-':
return - cvalue, ctype
return cvalue, ctype
Final words¶
That's all for this post. If you feel brave or do not like to wait for the next post go and try adding new operators! Next time I will cover variables, assignments and postfix-operators like the power operation (2^3
).
The code I developed in this post is available on the GitHub repository tagged with part2
(link).
Updates¶
2017-12-24: test_parse_term_with_multiple_operations
has been changed after Victor Uriarte spotted an error in the tree construction. See the updates section of the first post in the series for a full explanation of the issue.
Feedback¶
Feel free to reach me on Twitter if you have questions. The GitHub issues page is the best place to submit corrections.