A game of tokens: write an interpreter in Python with TDD - Part 3
By Leonardo Giordani - Updated on
Introduction¶
This is the third instalment of a series of posts on how to write an interpreter in Python. In the first part we developed together a small command line calculator that could sum and subtract numbers, while in the second part we went further adding multiplication, division and unary plus and minus.
In this third part we wil start adding variables to our calculator, moving towards a proper programming language.
Mezzanine - Refactoring¶
Often, after wroting some code, you realise that some of the original choices you did are not perfect, especially when it comes to variable and function names. Furthermore, you can realise that some of your functions are too long, and you may consider splitting them in mutiple functions to make the code easier to understand and to use.
It is time, then, to reconsider the code of smallcalc and see if we can improve it. Luckily, having all our tests in place, we may refactor it, that is we can change the code with a high degree of confidence, as the tests check that the behaviour of the whole system doesn't change.
The first change is the naming of the method parse_addsymbol
, which now can be more aptly named _parse_symbol
. As Martin Fowler says in his book "Refactoring" (a recommended reading): "Life being what it is, you won't get your names right the first time. [...] Remember your code is for a human first and a computer second. Humans need good names." The name of the method will be prefixed with an underscore because this method is used only internally, and shouldn't be used by third parties.
The proper way to change the name of a method involves calling the new method from the old one, but in this case we may safely rely on tests to tell us what needs to be fixed (this is because our codebase is small). We may thus open the file smallcalc/calc_parser.py
and change the name to _parse_symbol
. At this point, running the test suite, you should have 11 failures. You can fix them with a text replace action of your editor of choice, but I recommend you to make the tests fail before replacing the text. The 3 replacements are in the methods parse_factor
, parse_term
, and parse_expression
.
I then wanted to add two methods, discard
and discard_type
, to the lexer, to better control what gets discarded. At the moment the code is using self.lexer.get_token
which doesn't allow to explicitly check what we are dropping. These are the tests that I added to tests/test_calc_lexer.py
import pytest
def test_discard_tokens():
l = clex.CalcLexer()
l.load('3 + 5')
l.discard(token.Token(clex.INTEGER, '3'))
assert l.get_token() == token.Token(clex.LITERAL, '+')
def test_discard_checks_equality():
l = clex.CalcLexer()
l.load('3 + 5')
with pytest.raises(clex.TokenError):
l.discard(token.Token(clex.INTEGER, '5'))
def test_discard_tokens_by_type():
l = clex.CalcLexer()
l.load('3 + 5')
l.discard_type(clex.INTEGER)
assert l.get_token() == token.Token(clex.LITERAL, '+')
def test_discard_type_checks_equality():
l = clex.CalcLexer()
l.load('3 + 5')
with pytest.raises(clex.TokenError):
l.discard_type(clex.LITERAL)
As you can see the idea is for both methods to require a parameter, either the token or the type. The code that passes these tests is made by a custom exception in smallcalc/calc_lexer.py
class TokenError(ValueError):
""" The expected token cannot be found """
and, in the CalcLexer
class in the same file
def discard(self, token):
if self.get_token() != token:
raise TokenError(
'Expected token {}, found {}'.format(
token, self._current_token
))
def discard_type(self, _type):
t = self.get_token()
if t.type != _type:
raise TokenError(
'Expected token of type {}, found {}'.format(
_type, self._current_token.type
))
As I am satisfied with the code that I have now, I will move on to add new features.
Level 13 - Variables¶
I have been assigned by my strength and cunning. - Up (2009)
Variables are labels assigned to values, so what we need to add is a way for the user to make this assignment and then to use variables intead of actual values. The simplest syntax, used by many languages is name = value
and we will stick to this. Usually languages allow only a subset of symbols in the name of a variable so we will learn how to use lower- and uppercase names that may also contain an underscore.
Lexer¶
We want the lexer to recognise a new token called NAME
, so the test we have to add to tests/test_calc_lexer.py
is
def test_get_tokens_understands_letters():
l = clex.CalcLexer()
l.load('somevar')
assert l.get_tokens() == [
token.Token(clex.NAME, 'somevar'),
token.Token(clex.EOL),
token.Token(clex.EOF)
]
This test checks only the support for lowercase letters. Since we want to support also uppercase letters and underscores we need another pair of test
def test_get_tokens_understands_uppercase_letters():
l = clex.CalcLexer()
l.load('SomeVar')
assert l.get_tokens() == [
token.Token(clex.NAME, 'SomeVar'),
token.Token(clex.EOL),
token.Token(clex.EOF)
]
def test_get_tokens_understands_names_with_underscores():
l = clex.CalcLexer()
l.load('some_var')
assert l.get_tokens() == [
token.Token(clex.NAME, 'some_var'),
token.Token(clex.EOL),
token.Token(clex.EOF)
]
If you wonder why I created two tests instead of just one with both uppercase letters and underscores, the reason is that I generally prefer to have tests that focus on one specific feature. This obviously depends on the level of granularity that you want, and in this case we are discussing very simple features, so I would not argue if I saw both tested at the same time.
Parser¶
To support variables in expressions we need to change the behaviour of parse_factor
, which is the method where we parse the building blocks like integers of unary operators. The test you need to add to tests/test_calc_parser.py
is
def test_parse_factor_variable():
p = cpar.CalcParser()
p.lexer.load("somevar")
node = p.parse_factor()
assert node.asdict() == {
'type': 'variable',
'value': 'somevar'
}
After this we want to provide support for variable assignments. Working on the parser we need only to output the correct node so the test is pretty straightforward
def test_parse_assignment():
p = cpar.CalcParser()
p.lexer.load("x = 5")
node = p.parse_assignment()
assert node.asdict() == {
'type': 'assignment',
'variable': 'x',
'value': {
'type': 'integer',
'value': 5
}
}
This test tries to assign the value 5
to the variable x
, but in general we want to support assignment with expressions, so we should test this behaviour as well, including the presence of variables
def test_parse_assignment_with_expression():
p = cpar.CalcParser()
p.lexer.load("x = 4 * (3 + 5)")
node = p.parse_assignment()
assert node.asdict() == {
'type': 'assignment',
'variable': 'x',
'value': {
'type': 'binary',
'operator': {
'type': 'literal',
'value': '*'
},
'left': {
'type': 'integer',
'value': 4
},
'right': {
'type': 'binary',
'operator': {
'type': 'literal',
'value': '+'
},
'left': {
'type': 'integer',
'value': 3
},
'right': {
'type': 'integer',
'value': 5
}
}
}
}
def test_parse_assignment_expression_with_variables():
p = cpar.CalcParser()
p.lexer.load("x = y + 4")
node = p.parse_assignment()
assert node.asdict() == {
"type": "assignment",
"variable": "x",
'value': {
'type': 'binary',
'operator': {
'type': 'literal',
'value': '+'
},
'left': {
'type': 'variable',
'value': 'y'
},
'right': {
'type': 'integer',
'value': 4
},
}
}
Visitor¶
It is now time to implement the code that actually stores and retrieves variables, which is what happens in the visitor when an assignment
or a variable
node are processed. For the moment we do not have specific requirements for variables and we can treat the storage space as a big global dictionary.
The test we want to pass specifies the initial API of the storage space when we assign a value to a variable
def test_visitor_assignment():
ast = {
'type': 'assignment',
'variable': 'x',
'value': {
'type': 'integer',
'value': 5
}
}
v = cvis.CalcVisitor()
assert v.visit(ast) == (None, None)
assert v.isvariable('x') is True
assert v.valueof('x') == 5
assert v.typeof('x') == 'integer'
We want the visitor to provide three new methods, isvariable
, valueof
, and typeof
, that allow us to interact with the variables we defined.
The last change that the visitor requires is some code that allows it to read the value of variables to be able to use them when computing the result of an expression. The test is then
def test_visitor_variable():
assignment_ast = {
'type': 'assignment',
'variable': 'x',
'value': {
'type': 'integer',
'value': 123
}
}
read_ast = {
'type': 'variable',
'value': 'x'
}
v = cvis.CalcVisitor()
v.visit(assignment_ast)
assert v.visit(read_ast) == (123, 'integer')
where two different ASTs have been created. The first one assigns a value to the variable, the second one reads it and returns its value. Note that the visitor returns both value and type of the variable, which seems reasonable to implement later checks of equality or other operations on variables.
Solution¶
To pass the test_get_tokens_understands_letters
test I added a method _process_name
to the CalcLexer
class
def _process_name(self):
regexp = re.compile('[a-z]+')
match = regexp.match(
self._text_storage.tail
)
if not match:
return None
token_string = match.group()
return self._set_current_token_and_skip(
token.Token(NAME, token_string)
)
and then added it to the method get_token
. The new version of the latter is then
def get_token(self):
eof = self._process_eof()
if eof:
return eof
eol = self._process_eol()
if eol:
return eol
self._process_whitespace()
name = self._process_name()
if name:
return name
integer = self._process_integer()
if integer:
return integer
literal = self._process_literal()
if literal:
return literal
At this point to pass the remaining tests test_get_tokens_understands_uppercase_letters
and test_get_tokens_understands_names_with_underscores
it is sufficient to change the regular expression we use in _process_name
def _process_name(self):
regexp = re.compile('[a-zA-Z_]+')
match = regexp.match(
self._text_storage.tail
)
if not match:
return None
token_string = match.group()
return self._set_current_token_and_skip(
token.Token(NAME, token_string)
)
The required change to parse_factor
is simple, but since we will be returning a new type of node we have to define it
class VariableNode(ValueNode):
node_type = 'variable'
We can then add the required if
statement in parse_factor
, which becomes
def parse_factor(self):
next_token = self.lexer.peek_token()
if next_token.type == clex.LITERAL and next_token.value in ['-', '+']:
operator = self._parse_symbol()
factor = self.parse_factor()
return UnaryNode(operator, factor)
if next_token.type == clex.LITERAL and next_token.value == '(':
self.lexer.discard_type(clex.LITERAL)
expression = self.parse_expression()
self.lexer.discard_type(clex.LITERAL)
return expression
if next_token.type == clex.NAME:
t = self.lexer.get_token()
return VariableNode(t.value)
return self.parse_integer()
The second test that we have to pass checks if the parser can understand variable assignments. First of all we need to define AssignmentNode
which is the node we will return to the visitor.
class AssignmentNode(Node):
node_type = 'assignment'
def __init__(self, variable, value):
self.variable = variable
self.value = value
def asdict(self):
return {
'type': self.node_type,
'variable': self.variable.value,
'value': self.value.asdict(),
}
At this point we need a method to parse a variable in CalcParser
. This method is very similar to _parse_symbol
and parse_integer
def _parse_variable(self):
t = self.lexer.get_token()
return VariableNode(t.value)
Since the test is running parse_assignment
we just need to add that method. We want the assignment to have a variable as its left member and an expression as its right member
def parse_assignment(self):
variable = self._parse_variable()
self.lexer.discard_type(clex.LITERAL)
value = self.parse_expression()
return AssignmentNode(variable, value)
This code makes both the test_parse_assignment
and the test_parse_assignment_with_expression
tests pass.
As discussed in the introductory text before the test code the variable storage space can be a simple dictionary. The key will be the name of the variable, and the content will be another dictionary with value
and type
. This is sufficient for the moment and should be also extensible when future requirements will arise.
The CalcVisitor
class can be then changed to get the new methods, and a __init__
that initializes the dictionary. I also added the relevant if
statement to the method visit
of the same class. The new CalcVisitor
is then
class CalcVisitor:
def __init__(self):
self.variables = {}
def isvariable(self, name):
return name in self.variables
def valueof(self, name):
return self.variables[name]['value']
def typeof(self, name):
return self.variables[name]['type']
def visit(self, node):
if node['type'] == 'integer':
return node['value'], node['type']
if node['type'] == 'unary':
operator = node['operator']['value']
cvalue, ctype = self.visit(node['content'])
if operator == '-':
return - cvalue, ctype
return cvalue, ctype
if node['type'] == 'binary':
lvalue, ltype = self.visit(node['left'])
rvalue, rtype = self.visit(node['right'])
operator = node['operator']['value']
if operator == '+':
return lvalue + rvalue, rtype
elif operator == '-':
return lvalue - rvalue, rtype
elif operator == '*':
return lvalue * rvalue, rtype
elif operator == '/':
return lvalue // rvalue, rtype
if node['type'] == 'assignment':
right_value, right_type = self.visit(node['value'])
self.variables[node['variable']] = {
'value': right_value,
'type': right_type
}
return None, None
To pass the second test we need only to change the method visit
adding an if
statement for the variable
nodes. The new version of the method is
def visit(self, node):
if node['type'] == 'integer':
return node['value'], node['type']
if node['type'] == 'variable':
return self.valueof(node['value']), self.typeof(node['value'])
if node['type'] == 'unary':
operator = node['operator']['value']
cvalue, ctype = self.visit(node['content'])
if operator == '-':
return - cvalue, ctype
return cvalue, ctype
if node['type'] == 'binary':
lvalue, ltype = self.visit(node['left'])
rvalue, rtype = self.visit(node['right'])
operator = node['operator']['value']
if operator == '+':
return lvalue + rvalue, rtype
elif operator == '-':
return lvalue - rvalue, rtype
elif operator == '*':
return lvalue * rvalue, rtype
elif operator == '/':
return lvalue // rvalue, rtype
if node['type'] == 'assignment':
right_value, right_type = self.visit(node['value'])
self.variables[node['variable']] = {
'value': right_value,
'type': right_type
}
return None, None
Level 14 - Parsing expressions and assignments¶
Speak words we can all understand! - The Lord of the Rings: The Fellowship of the Ring (2001)
We are missing a final step. The CLI uses parse_expression
as its default entry point, which means that it doesn't understand variable assignments for the time being. We need then to introduce a new entry point parse_line
that we will use to process general language statements. The test for this goes in tests/test_calc_parser.py
def test_parse_line_supports_expression():
p = cpar.CalcParser()
p.lexer.load("2 * x + 4")
node = p.parse_line()
assert node.asdict() == {
'type': 'binary',
'left': {
'type': 'binary',
'left': {
'type': 'integer',
'value': 2
},
'right': {
'type': 'variable',
'value': 'x'
},
'operator': {
'type': 'literal',
'value': '*'
}
},
'right': {
'type': 'integer',
'value': 4
},
'operator': {
'type': 'literal',
'value': '+'
}
}
and checks that parse_line
can parse expressions (which can be solved just wrapping parse_expression
with it). The second test checks that parse_line
can parse variable assignments and goes in the same file
def test_parse_line_supports_assigment():
p = cpar.CalcParser()
p.lexer.load("x = 5")
node = p.parse_line()
assert node.asdict() == {
'type': 'assignment',
'variable': 'x',
'value': {
'type': 'integer',
'value': 5
}
}
At this point we can change the entry point in the CLI, using parse_line
instead of parse_expression
. The new CLI is then
from smallcalc import calc_parser as cpar
from smallcalc import calc_visitor as cvis
def main():
p = cpar.CalcParser()
v = cvis.CalcVisitor()
while True:
try:
text = input('smallcalc :> ')
p.lexer.load(text)
node = p.parse_line()
res = v.visit(node.asdict())
print(res)
except EOFError:
print("Bye!")
break
if not text:
continue
if __name__ == '__main__':
main()
Try to fire the CLI and enjoy a calculator with variables! Everything works, but you now know it is not magic, but the outcome of a good amount of code. And you wrote it, so you may proudly say that you created a simple but working programming language.
Solution¶
To pass the first test, as suggested, I added the method parse_line
as a wrapper around parse_expression
def parse_line(self):
return self.parse_expression()
The second test requires some changes to parse_line
. As I do not know if the next token is an expression or an assignment I decided to stash the status and try one of the two. In case of error I just pop the state and try with the second option
def parse_line(self):
try:
self.lexer.stash()
return self.parse_assignment()
except clex.TokenError:
self.lexer.pop()
return self.parse_expression()
At the same time parse_assignment
has to be changed. The current code parses a variable and then discards a literal, which is too generic, as an expression like x * 2
will not raise an error. The new code for that method is then
def parse_assignment(self):
variable = self._parse_variable()
self.lexer.discard(token.Token(clex.LITERAL, '='))
value = self.parse_expression()
return AssignmentNode(variable, value)
where I explicitly discard a literal =
sign.
Final words¶
Managing variables may look like a very easy task, but as soon as we will start implementing functions and local scopes we will have to move to something richer than a simple global dictionary. Memory management is another big topic that I didn't touch here, perhaps in the future I might discuss garbage collections and related problems.
The code I developed in this post is available on the GitHub repository tagged with part3
(link).
In the next issue I will face with you the task of adding the power operator, support for floating point numbers, and a big refactoring with context managers that will greatly simplify the code.
Feedback¶
Feel free to reach me on Twitter if you have questions. The GitHub issues page is the best place to submit corrections.
Part 3 of the A game of tokens series
Related Posts
TDD in Python with pytest - Part 5
Updated on
TDD in Python with pytest - Part 2
Updated on
TDD in Python with pytest - Part 1
Updated on