Compiler Development in Python – Lexical Analysis Part 1

Today we’re going to talk about lexical analysis.  Lexical analysis is the process of turning a stream of input characters into a stream of keywords, numbers, identifiers and potentially other types of tokens.  The code that performs this step goes by a number of names – scanner, tokeniser, lexer, etc. – we’ll stick with scanner.

If we have a look at the output from the Python scanner, you’ll see that it breaks the source code up into its simplest components.  Saved as tokenize-example-1.py, this script tokenises itself (note that this is Python 2’s tokenize module, which prints each token by default when given a readline function):

import tokenize

file = open("tokenize-example-1.py")

tokenize.tokenize(file.readline)

The first few lines of output look like this –

1,0-1,6:     NAME    'import'
1,7-1,15:    NAME    'tokenize'
1,15-1,16:   NEWLINE '\012'
2,0-2,1:     NL      '\012'
3,0-3,4:     NAME    'file'
3,5-3,6:     OP      '='
3,7-3,11:    NAME    'open'
3,11-3,12:   OP      '('
3,12-3,35:   STRING  '"tokenize-example-1.py"'
3,35-3,36:   OP      ')'
3,36-3,37:   NEWLINE '\012'
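
If you’re following along in Python 3, tokenize no longer prints for you and it expects a bytes-mode readline, so a rough equivalent (a sketch – the exact output format differs) is:

import tokenize

# Python 3's tokenize works on bytes, so open the file in binary mode
# and print each token ourselves.
with open("tokenize-example-1.py", "rb") as f:
    for token in tokenize.tokenize(f.readline):
        print(token)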

You can see that each token has some information –

  • The line and position at which the token was found
  • The type of token
  • The value of the token

We will be storing one other thing – the actual line of code that the token was found in.  This will make it easier to generate error messages later on when we are parsing the code.  With that in mind, let’s create a Token class for our scanner.

class Token(object):
    eof = 'END-OF-FILE'
    ident = 'IDENT'
    number = 'NUMBER'
    operator = 'OP'

    def __init__(self, type, value, line, line_no, line_pos):
        self.type = type
        self.value = value
        self.line = line    # the full source line, kept for error messages
        # line_pos is the scanner's position just past the token, so we
        # back up by the token's length to record where it starts
        self.line_pos = line_pos - len(value)
        self.line_no = line_no

    def __str__(self):
        return '{0}:{1}'.format(self.line_no + 1, self.line_pos).ljust(10) + self.type.ljust(15) + self.value

There really isn’t much to this class beyond storing each of the token attributes we discussed above.  There are two extra things we haven’t yet looked at – the __str__ special method and the token types.

The __str__ special method simply returns a string representation of the token.  This is useful during testing to see what tokens the scanner is spitting out and doesn’t really have a use in the compiler itself.
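
For example, here’s a made-up token just to show the format (the arguments follow the __init__ signature above):

tok = Token(Token.ident, 'count', 'count = 1', 0, 5)
print(tok)   # 1:0       IDENT          count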

To keep the scanner simple for now (and to leave room for you to extend it yourself as an exercise), we are scanning for just three types of tokens – identifiers, numbers and operators.  The eof token type is, unsurprisingly, used to designate the end of the file.  Each token type has a specific format –

  • identifiers – these are things like variable and function names – one or more letters ([a-zA-Z]+)
  • numbers – one or more digits ([0-9]+)
  • operators – one of +, -, /, *, =

Remember from the last installment that our initial compiler is going to operate solely on integers, so we don’t need to do anything more complex with numbers.  If you don’t know what the bracketed expressions at the end of the first two token types are, they are regular expressions – a way of defining the format of a string.  They’re not required knowledge, however regular expressions are a type of parser in themselves and are worth looking into for further study.
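
If you’d like to try those formats out, here’s a quick sketch using Python’s re module (just for experimenting – our scanner itself won’t be built out of regular expressions):

import re

# One pattern per token format described above.
ident_re    = re.compile(r'[a-zA-Z]+')
number_re   = re.compile(r'[0-9]+')
operator_re = re.compile(r'[+\-*/=]')

print(ident_re.match('count').group())    # count
print(number_re.match('42').group())      # 42
print(operator_re.match('+').group())     # +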

Before we look at the code for the lexical scanner, I’d like to go over a small amount of theory which, I believe, makes it much more obvious how to construct the scanner.

Lexical scanners are an implementation of a model of computation called a Finite State Machine (FSM).  From Wikipedia –

It is conceived as an abstract machine that can be in one of a finite number of states. The machine is in only one state at a time; the state it is in at any given time is called the current state. It can change from one state to another when initiated by a triggering event or condition; this is called a transition. A particular FSM is defined by a list of its states, and the triggering condition for each transition.

That’s a bit jargon-filled, so let’s break it down by looking at and analysing the image below.

[State diagram: a start state, an ident state and an accept state; a letter moves the machine from start to ident, another letter loops ident back to itself, and anything else moves it to accept]

Each circle represents a state.  Don’t overthink here – states are literally just the current state of the (imaginary) machine.  Just as a flipped coin has two states, heads and tails (it’s not landing on an edge!), this finite state machine has three states.  FSMs generally have two special states – the start state, which is exactly what you expect, and the accept state, which tells us we have accepted the input to the machine.

Each arrow represents a transition – that is, a way to go from one state to another.  In the state diagram above, I’ve labelled each transition with the condition that is required to move from one state to the next.

In this example, we need either a lowercase or uppercase alphabetic character to transition from the start state to the ident state.  This translates directly to code as an if statement checking whether the current character in the stream matches the transition.  You’ll notice that the ident state can transition to itself, meaning it can repeat, and that this self-transition has the same condition as the transition from the start state.  This maps nicely to a while statement in the scanner code.  When the condition can no longer be met, the only transition the state machine can make is to the accept state, where we create a token instance, store it and return to the start state to continue scanning.  (Obviously, there will be more than one transition away from the start state once we are handling multiple token types.)

Below is some example code that would implement the state diagram we just discussed.  It does nothing with the characters it comes across other than consume them.  next_char() can be assumed to return the next character in the stream.

import string

if char in string.ascii_letters:         # start → ident transition
    char = next_char()
    while char in string.ascii_letters:  # ident transitioning to itself
        char = next_char()
    # no more letters – the machine moves to the accept state

You can see how directly the state diagram maps to code as we discussed previously.
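
If you’d like to run that fragment yourself, here’s one possible way to supply next_char – a hypothetical helper that walks a string and hands back a sentinel once the input runs out:

import string

def make_next_char(text):
    chars = iter(text)
    def next_char():
        # Return the next character, or '\0' at end of input.
        return next(chars, '\0')
    return next_char

next_char = make_next_char('count = 1')
char = next_char()
if char in string.ascii_letters:
    char = next_char()
    while char in string.ascii_letters:
        char = next_char()
print(repr(char))   # ' ' – the first character after the identifier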

You might wonder why I chose to explain this bit of theory (in what is supposed to be a theory-lite version of compiler development).  There are a few reasons –

  1. If you go on to more formal learning or read any of the myriad compiler construction textbooks available, you will come across FSMs and state diagrams in their discussions of scanners.
  2. It’s a nice example of a case of the theory directing the implementation.
  3. State diagrams are a nice alternative to regular expressions for representing token formats.
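
To see point 2 in action once more, the Wikipedia definition above – “a list of its states, and the triggering condition for each transition” – can be written down almost verbatim as data.  This is just a sketch to make the idea concrete, not the approach our scanner will take:

import string

def condition(char):
    # Classify the triggering condition for a transition.
    return 'letter' if char in string.ascii_letters else 'other'

# The FSM as a literal list of states and triggering conditions:
# (current state, condition) -> next state.
transitions = {
    ('start', 'letter'): 'ident',
    ('ident', 'letter'): 'ident',   # the self-transition
    ('ident', 'other'):  'accept',
}

state = 'start'
for char in 'ab ':
    state = transitions[(state, condition(char))]
print(state)   # accept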

We’ll leave it there for now.  Your job between now and the next installment is to consider how to extend the toy code we created from the state diagram so that it stores the text it comes across and creates a Token instance to be saved for use in the parser.  Also have a think about (and draw, if you can) what the number and operator state diagrams might look like, then combine the state diagrams to create an overall FSM for the three token types (identifier, number, operator) we’re going to be initially using.
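
If you want a starting point for that exercise, one possible skeleton for the combined machine is below – the three transitions away from the start state, with the state bodies deliberately left for you to fill in:

import string

if char in string.ascii_letters:
    pass   # the ident state goes here
elif char in string.digits:
    pass   # the number state goes here
elif char in '+-/*=':
    pass   # the operator state goes here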
