Compiler Development in Python – Starting Out

Modern compilers can be very complex pieces of software.  I’d lay bets on the GNU Compiler project, for example, being in the top 10 most complex pieces of software ever developed .  That said, there are lots of tools and shortcuts that can significantly decrease the amount of work invovled in writing a compiler.   While we’ll avoid the tools (at least in the shortterm), we’re going to use some trickery towards the end when we go from an interpreted language to a fully compiled language.  Arguably, the back end of the compiler is the least interesting part anyway and realistically, is probably the least applicable knowledge to have.

That’s a roundabout way to say that we’re not going to be generating x86 machine code ourselves, partly because I think it’s somewhat pointless and partly because it’s a lot of effort for not much return. So, what we are going to do is create a compiler that compiles from our source language to C.  This has some nice benefits –

  1. We get macros, so the transition from the stack-based virtual machine to compiled code can be done in an unexpectedly easy way.
  2. We get the GNU optimiser for free (I don’t plan on covering implementation of optimisations, only what optimisations a compiler performs and the benefits)
  3. We don’t have to worry about a tonne of low level theory and  implementation detail that I am desperately trying to avoid.
  4. It’s quick and easy!

Now that we got that out of the way, let’s look at what is happening behind the scenes when we run a compiler.

compiler-1

There are four primary stages to compiling a program, with a number of ancilliary and substages that we’re going to ignore for now.

  • Lexical analysis/scanning takes the stream of input code and breaks it up into logical, bite-sized pieces called tokens.  Each token get’s a type, which might be number or identifier or keyword, the value of the token and a collection of other information which is useful later, such as filename, line number and line position.  There is some basic error checking which hapens in this stage – primarily, ensuring that there are no extraneous characters that the rest of the compiler can’t handle, and that strings are all terminated.
  • Syntax analysis or parsing takes the stream of tokens from the scanner and ensures that they conform to the structure of the language.  I think it’s useful to think of a lexical scanner as a word checker and a parser as a sentence checker.  Typically a parser will generate a parse tree of some kind as this is the easiest way to apply optimisations, however, it is quite possible to generate code straight out of the parser. The image below represents the parse tree of the expression  6 + (a * a + a * 0) * 8
  • Code generation is the process of taking the parse tree and outputting code in the target language.  In modern compilers, this is often an intermediate form of code as this allows the separation of the front and backends of the compiler, making it retargetable.  What this means is that all the hard stuff like optimsations and generating binaries can be left up to the backend which takes the intermediate form (which is much simpler in form than a typical language).  It’s now much easier to make compilers for many different languages as the compiler writer need only deal with the front end of the compiler.  The downside of doing it this way is complexity of course.  In our compiler, we’ll initially be generating stack machine code which we will later sneakily convert to C using macro expansion.
  • Runtime support can mean a few different things.  In the case of an interpreter it refers to the virtual machine we need to run the code we generate.  For compilers it might mean runtime support packages such as those required by Visual Studio binaries (MSCVRTxx.dll for example) or it might be compiled into the binary itself as is the case for GNU C compiled binaries.  In our case, we’ll be looking at interpreted code first and will thus be creating a a stack-based virtual machine to run our generated code.

Let’s talk briefly about the language we’ll be developing a compiler for.  Because I am steering clear of the theory initially, it would be best if you stick to the same source language as me as we go through the series.  There are certain cases where it’s not possible to parse a given language using the type of parser we’re going to create and so I don’t want you to get down the track and find out it’s just not going to work.  At the end of the series, I’ll come back and flesh out some of the theory to give you a better idea of what is and isn’t possible when it comes to languages parsed with the type of parser we’re using.

That said, here is the program that I will be using as the primary test case for our compiler.  There will be other smaller test cases as we incrementally add features to the compiler, however this will be the first major goal.

function fib(n) {
	if n == 1 or n == 2 {
		return 1
	}
	return fib(n - 1) + fib(n - 2)
}

set x = 0
set n = input 'n:'

while x != n {
	output fib(n)
	set x = x + 1
}

You’ll probably notice it’s quite similar to C (indeed the code highlighter on this page is using the C parser to highlight the program).  This is not deliberate so much as me trying to create a simple, easy to parse language as a starting point.

The test case is specifically designed to test two things –

  • Control flow – if/while statements and function definitions/calls, and
  • Recursion – a function that calls itself

These two things are important to being able to effectively express the solution to nearly all problems and form the general requirements of a language to be widely useful.

In the next installment, we’re going to start looking at lexical scanning and how we formally define the language we’re writing a compiler for.

Compiler Development in Python – Starting Out

Leave a Reply