Compiler Development in Python – Lexical Analysis Part 2

In the last installment in the series, we talked about what a scanner is and how finite state machines representing a scanner’s rules can be converted into code very easily.  In this article, we’re going to look at how to construct the scanner with a view to keeping it easy to extend in the future.

Before we do that, let’s consider what our scanner needs to do –

  • Track the line number and position
  • Provide a way to increment the cursor
  • Watch for the end of file
  • Return a list of tokens

With that in mind, let’s start by creating some attributes to hold the information we will need and take a look at the function that keeps track of each of those values.

class Lexer(object):

    eof_marker = '$'
    whitespace = ' \t\n'
    newline = '\n'
    comment_marker = '#'

    def __init__(self, code):
        super(Lexer, self).__init__()

        self.code = code
        self.cursor = 0
        self.tokens = []

        self.lines = code.split(Lexer.newline)
        self.line_no = 0
        self.line_pos = 0

    def get_next_char(self):
        self.cursor += 1
        self.line_pos += 1
        if self.cursor > len(self.code):
            return Lexer.eof_marker

        return self.code[self.cursor - 1]

There should be nothing surprising about the initialiser – it takes a string representing the source code and sets up all the instance variables.

The get_next_char function is more interesting, though still very simple.  The first three items on our list are largely handled by this small function.  Each time the function is called, it performs the following actions –

  • Increment the cursor
  • Increment the line position
  • Check if we have reached the end of the code and, if so, return a specific end of file marker
  • Return the current character – remember we incremented the cursor first, so we return the character to the left of the cursor

We still need to deal with tracking line number and resetting the line position when we hit a newline.  This will be dealt with in the main scanner function when we handle whitespace.
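
To make the cursor behaviour concrete, here is a small, hypothetical check you could run once the class above is defined – it drains a two character string and shows the end of file marker appearing once the input runs out.

lexer = Lexer('ab')
print lexer.get_next_char()   # a  (cursor 1, line_pos 1)
print lexer.get_next_char()   # b  (cursor 2, line_pos 2)
print lexer.get_next_char()   # $  (cursor is past the end, so eof_marker is returned)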

Now, at the end of part 1, I asked you to expand the state diagram we discussed for identifiers to include the other token types.  Let’s look at what I came up with –

OK, so that’s definitely more complex than the first state diagram we looked at, but when you break it down, it’s telling you a very similar thing.  You can see the ident state diagram from part 1 in there (I highlighted it in green); now we’ve added transitions for number tokens and operator tokens.  While number tokens are very similar to ident tokens, operators are a little different.  You can see that they don’t have a transition back to themselves – this simply means that all operator tokens are going to be a single character long.  Additionally, the transition condition is a little different – the | (pipe) character means or, so the FSM transitions to the operator state if the input is + or - or / or * or =.  Easy!

One thing in this diagram that you might have missed when you went through this exercise is the transition back to the start state so that the scanner can accept multiple tokens – you can see the transition condition is that the scanner hasn’t hit the end of the file yet.
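
If you find it easier to read the diagram as data rather than as a picture, the same machine can be sketched as a transition table.  This is purely illustrative – the scanner we build below hard-codes the transitions as if/elif and while statements rather than driving them from a table – but it lists the states and conditions the diagram describes.

# an illustrative transition table for the state diagram above; the scanner
# itself does not use this, it is just the same information written as data
transitions = {
    ('start', 'letter'):    'ident',     # a-z or A-Z starts an identifier
    ('ident', 'letter'):    'ident',     # identifiers loop back to themselves
    ('start', 'digit'):     'number',    # 0-9 starts a number
    ('number', 'digit'):    'number',    # numbers loop back to themselves
    ('start', 'operator'):  'operator',  # + or - or / or * or =
    ('ident', 'other'):     'accept',    # any other character ends the token
    ('number', 'other'):    'accept',
    ('operator', 'any'):    'accept',    # operators are always one character
    ('accept', 'not eof'):  'start',     # back to start for the next token
}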

Before we start exploring the code for the main scanner function, let’s talk about whitespace.  Everyone knows that whitespace is important in coding – it makes code easier to read!  That said, there are a lot of different ways you can handle whitespace in your compiler –

  • Ignore it completely – like FORTRAN where GO TO and GOTO are exactly the same thing
  • Insignificant whitespace – whitespace will end a token but is otherwise ignored
  • Significant whitespace – whitespace has a specific role in the language and will need to be tokenised, for example Python’s indentation to signify code blocks
  • Other more esoteric schemes – for example Whitespace, a programming language in which spaces, tabs and linefeeds are the only meaningful characters

Our compiler project will be working on the insignificant whitespace principle – so we will be discarding whitespace.
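
To see what that means in practice, here is a small, hypothetical check you can run once the tokenise function below is finished – extra whitespace changes nothing except the reported positions, and a change of character class (or a space) is what ends each token.

# these two inputs produce the same sequence of token types and values;
# only the reported column positions differ
for token in Lexer('a=b*2').tokenise():
    print token

for token in Lexer('a   =   b   *   2').tokenise():
    print token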

Let’s check out some code –

    def tokenise(self):
        char = self.get_next_char()
        while char != Lexer.eof_marker:

            # ignore whitespace
            if char in Lexer.whitespace:
                if char in Lexer.newline:
                    self.line_no += 1
                    self.line_pos = 0
                char = self.get_next_char()

This is the start of the main scanner code.  First, we get the current character – this is our start state.  Then we loop until we hit the end of the file, representing the transition from the accept state back to the start state.

The remainder of the code in this block deals with whitespace.  In the case of a newline, we increment the line counter and reset the line position counter to 0; in all cases we discard the character and load up the next one.  This could have been represented on the state diagram if we had wanted to – we would have a transition to a whitespace state that would then transition to a reject state – however, in the spirit of not overcomplicating the theory, we’ll just deal with it in the code and leave the state diagram to show tokens only.
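
As a quick, hypothetical check of the line tracking (assuming the full tokenise function developed over the rest of this article, plus the Token class from the final listing), a two line input reports the second identifier on line 2 at the column it actually occupies –

for token in Lexer('a\n  b').tokenise():
    print token

# 1:1       IDENT          a
# 2:3       IDENT          b
# 2:3       EOF            $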

We are going to add one more thing to the scanner before we attack actual tokens, and that’s comments.  Commenting is important for understanding code and for recording copyright and authorship information.  In a manner similar to whitespace, we’re going to ignore comments.  For this compiler, we’ll only be implementing single line comments, so we’re going to ignore everything from the comment marker up to the end of the line.

            # comment
            elif char in Lexer.comment_marker:
                while char not in Lexer.newline and char != Lexer.eof_marker:
                    char = self.get_next_char()
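
As a small, hypothetical check, everything from the # to the end of that line simply never becomes a token –

for token in Lexer('a = 1  # set a to one\nb = 2').tokenise():
    print token

# only a, =, 1, b, = and 2 come back (plus the end of file token);
# nothing from the comment appears in the token list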

You’ll recall the code below from the previous installment as the code that we derived straight from the state diagram for identifiers.  This code simply consumed the input – it was your homework to add to it so that we could create a Token instance and store the token for later use.

if char in string.ascii_letters:
    char = next_char()
    while(char in string.ascii_letters):
        char = next_char()

Here is the actual code from the scanner, built around the lines from above.  You can see that translating from the state diagram to an actual implementation is very straightforward.  Most of the lines are exactly the same as above; we’re just storing the result this time.

            # identifier token
            elif char in string.ascii_letters:
                match = char
                char = self.get_next_char()
                while char in string.ascii_letters:
                    match += char
                    char = self.get_next_char()

                token = Token(Token.ident, match, self.lines[self.line_no], self.line_no, self.line_pos)
                self.tokens.append(token)

I’m sure you realise that matching numbers will be almost exactly the same, and operators are even easier since they are single characters –

            # number token
            elif char in string.digits:
                match = char
                char = self.get_next_char()
                while char in string.digits:
                    match += char
                    char = self.get_next_char()

                token = Token(Token.number, match, self.lines[self.line_no], self.line_no, self.line_pos)
                self.tokens.append(token)

            # operators
            elif char in '+-*/=':
                token = Token(Token.operator, char, self.lines[self.line_no], self.line_no, self.line_pos)
                self.tokens.append(token)
                char = self.get_next_char()
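
A quick, hypothetical spot check of both new branches – digits accumulate into a number the same way letters accumulate into an identifier, and every operator comes back as its own single character token.  Note that no whitespace is needed between the tokens here, because a change of character class is enough to end each one.

for token in Lexer('x+12*y').tokenise():
    print token

# x and y come back as IDENT tokens, 12 as a single NUMBER, + and * as OP tokens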

The last thing we need to do is create an end of file token and return the list of tokens.  This of course happens once the while loop exits having received an end of file character.

        # end of file token
        token = Token(Token.eof, char, None, self.line_no, self.line_pos)
        self.tokens.append(token)

        return self.tokens

That’s all there is to it.  Let’s test out our newly created scanner and look at the output.

lexer = Lexer('a = b * 2')
for token in lexer.tokenise():
    print token

The output is similar to the output from the Python scanner that we looked at in the first installment on lexical analysis.  You can clearly see how the input source is being tokenised into discrete components so the parser knows what it’s dealing with.

C:\Python27\python.exe D:/Projects/compiler/lexer.py
1:1       IDENT          a
1:2       OP             =
1:5       IDENT          b
1:6       OP             *
1:9       NUMBER         2
1:9       EOF            $

Process finished with exit code 0

This completes the lexical scanning part of the series; however, there are a couple of changes and additions that I have left as exercises to consolidate your knowledge.

The additional token types that your scanner will need to be able to detect and return –

  • Keywords – if, else, while and output
  • Start and end block – { }

These two new token types allow the compiler to work with more than math expressions and give us the basis of a fully functional language.  You will need to create the new token types and the code to detect them.  Think carefully about how you implement keywords.
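
If you would like a hint before peeking at the solution: one common approach, and the one the full listing at the end of this article takes, is to scan a keyword exactly as if it were an identifier and then reclassify the finished match by checking it against a list of reserved words.  A minimal, hypothetical sketch of that idea –

keywords = ['if', 'else', 'while', 'output']

def classify(match):
    # an identifier whose spelling is reserved becomes a keyword token
    return 'KEYWORD' if match in keywords else 'IDENT'

print classify('while')    # KEYWORD
print classify('whilst')   # IDENT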

There are two final changes that should be made to complete the scanner in preparation for moving onto parsing.  They are –

  • Consider that currently identifiers can consist only of letters.  Most programming languages require only that the first character of an identifier is a letter, and then allow digits as well.  Modify the code detecting identifiers to allow this form.
  • The scanner should raise an error should it come across an invalid character.

As a test, provide the following input source code to your scanner and check it against the output provided.

a = 0
# comment here
while a < 10 {
    output a
    a = a + 1
}

The tokens your scanner should output –

C:\Python27\python.exe D:/Projects/compiler/lexer2.py
1:1       IDENT          a
1:2       OP             =
1:5       NUMBER         0
3:18      KEYWORD        while
3:24      IDENT          a
3:25      OP             <
3:28      NUMBER         10
3:30      START          {
4:21      KEYWORD        output
4:28      IDENT          a
5:21      IDENT          a
5:22      OP             =
5:25      IDENT          a
5:26      OP             +
5:29      NUMBER         1
6:17      END            }
6:18      END-OF-FILE    $

Process finished with exit code 0

Below is the full code for the scanner we’ve developed over the last two installments of the series, as well as the complete scanner implementing the extra features listed above.  Please consider trying to work through these problems yourself before looking at my solution – they are quite simple if you’re a competent coder and follow the same principles we discussed throughout.

The additional code supporting the homework exercises appears in the second of the two listings.

import string


class Token(object):
    eof = 'EOF'
    ident = 'IDENT'
    number = 'NUMBER'
    operator = 'OP'

    def __init__(self, type, value, line, line_no, line_pos):
        self.type = type
        self.value = value
        self.line = line
        self.line_pos = line_pos - len(value)
        self.line_no = line_no

    def __str__(self):
        return '{0}:{1}'.format(self.line_no + 1, self.line_pos).ljust(10) + self.type.ljust(15) + self.value


class Lexer(object):

    eof_marker = '$'
    whitespace = ' \t\n'
    newline = '\n'
    comment_marker = '#'

    def __init__(self, code):
        super(Lexer, self).__init__()

        self.code = code
        self.cursor = 0
        self.tokens = []

        self.lines = code.split(Lexer.newline)
        self.line_no = 0
        self.line_pos = 0

    def get_next_char(self):
        self.cursor += 1
        self.line_pos += 1
        if self.cursor > len(self.code):
            return Lexer.eof_marker

        return self.code[self.cursor - 1]

    def tokenise(self):
        char = self.get_next_char()
        while char != Lexer.eof_marker:

            # ignore whitespace
            if char in Lexer.whitespace:
                if char in Lexer.newline:
                    self.line_no += 1
                    self.line_pos = 0
                char = self.get_next_char()

            # comment
            elif char in Lexer.comment_marker:
                while char not in Lexer.newline and char != Lexer.eof_marker:
                    char = self.get_next_char()

            # identifier token
            elif char in string.ascii_letters:
                match = char
                char = self.get_next_char()
                while char in string.ascii_letters:
                    match += char
                    char = self.get_next_char()

                token = Token(Token.ident, match, self.lines[self.line_no], self.line_no, self.line_pos)
                self.tokens.append(token)

            # number token
            elif char in string.digits:
                match = char
                char = self.get_next_char()
                while char in string.digits:
                    match += char
                    char = self.get_next_char()

                token = Token(Token.number, match, self.lines[self.line_no], self.line_no, self.line_pos)
                self.tokens.append(token)

            # operators
            elif char in '+-*/=':
                token = Token(Token.operator, char, self.lines[self.line_no], self.line_no, self.line_pos)
                self.tokens.append(token)
                char = self.get_next_char()

        # end of file token
        token = Token(Token.eof, char, None, self.line_no, self.line_pos)
        self.tokens.append(token)

        return self.tokens

And here is the extended scanner, implementing the additional keyword and block tokens, the relaxed identifier rule and the error handling described above –

import string


class Token(object):
    eof = 'END-OF-FILE'
    ident = 'IDENT'
    number = 'NUMBER'
    operator = 'OP'
    keyword = 'KEYWORD'
    block_start = 'START'
    block_end = 'END'

    keywords = ['if', 'else', 'while', 'output']

    def __init__(self, type, value, line, line_no, line_pos):
        self.type = type
        self.value = value
        self.line = line
        self.line_pos = line_pos - len(value)
        self.line_no = line_no

    def __str__(self):
        return '{0}:{1}'.format(self.line_no + 1, self.line_pos).ljust(10) + self.type.ljust(15) + self.value


class Lexer(object):

    eof_marker = '$'
    whitespace = ' \t\n'
    newline = '\n'
    comment_marker = '#'

    def __init__(self, code):
        super(Lexer, self).__init__()

        self.code = code
        self.cursor = 0
        self.tokens = []

        self.lines = code.split(Lexer.newline)
        self.line_no = 0
        self.line_pos = 0

    def get_next_char(self):
        self.cursor += 1
        self.line_pos += 1
        if self.cursor > len(self.code):
            return Lexer.eof_marker

        return self.code[self.cursor - 1]

    def tokenise(self):
        char = self.get_next_char()
        while char != Lexer.eof_marker:

            # ignore whitespace
            if char in Lexer.whitespace:
                if char in Lexer.newline:
                    self.line_no += 1
                    self.line_pos = 0
                char = self.get_next_char()

            # comment
            elif char in Lexer.comment_marker:
                while char not in Lexer.newline and char != Lexer.eof_marker:
                    char = self.get_next_char()

            # identifier token
            elif char in string.ascii_letters:
                match = char
                char = self.get_next_char()
                while char in (string.ascii_letters + string.digits):
                    match += char
                    char = self.get_next_char()
                token = Token(Token.ident, match, self.lines[self.line_no], self.line_no, self.line_pos)

                if match in Token.keywords:
                    token.type = Token.keyword

                self.tokens.append(token)

            # number token
            elif char in string.digits:
                match = char
                char = self.get_next_char()
                while char in string.digits:
                    match += char
                    char = self.get_next_char()

                token = Token(Token.number, match, self.lines[self.line_no], self.line_no, self.line_pos)
                self.tokens.append(token)

            # operators
            elif char in '+-*/=<>':
                token = Token(Token.operator, char, self.lines[self.line_no], self.line_no, self.line_pos)
                self.tokens.append(token)
                char = self.get_next_char()

            # start block
            elif char == '{':
                token = Token(Token.block_start, char, self.lines[self.line_no], self.line_no, self.line_pos)
                self.tokens.append(token)
                char = self.get_next_char()

            # end block
            elif char == '}':
                token = Token(Token.block_end, char, self.lines[self.line_no], self.line_no, self.line_pos)
                self.tokens.append(token)
                char = self.get_next_char()

            else:
                raise ValueError('Unexpected character found: {0}:{1} -> {2}\n{3}'.format(self.line_no + 1, self.line_pos, char, self.lines[self.line_no]))

        # end of file token
        token = Token(Token.eof, char, None, self.line_no, self.line_pos)
        self.tokens.append(token)

        return self.tokens
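
As a usage note, the extended scanner is driven exactly like the original.  The snippet below feeds it the test program from earlier and prints the resulting tokens – bear in mind that the column numbers you see depend on any leading whitespace in the source string you pass in.

source = '''a = 0
# comment here
while a < 10 {
    output a
    a = a + 1
}'''

lexer = Lexer(source)
for token in lexer.tokenise():
    print token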