Building your own simple programming language

Have you wondered what it takes to create your own programming language? Have you wondered what it takes to write an interpreter? I certainly have and recently got inspired by this blog post series.

As it turns out, it is not that complicated. Only three steps are involved:

Lexing - split text into words and symbols (tokens)
Parsing - create a program structure (AST)
Interpreting - traverse and execute the structure

The full code is available on Github at https://github.com/timonback/smol with a demo at https://repl.it/@timonback/smol.

Smol

Let’s first start with the language itself, which Drew as the creator and author of the blog series calls Smol. A simple addition in Smol is coded as:

let a = 1;
let b = 2;

let add = |x, y| { x + y; };
let result = add(a, b);

print("The result is: ");
print(result);

This code is interpreted by the smol interpreter which is written in Javascript (compiled from Typescript). Using a scripting language has some advantages like relaxed types as well as faster prototyping cycles.

Smol has its own syntax and so you might be surprised by the fact that the add function parameters are enclosed using the pipe | symbol. Since parentheses are already used for passing parameters to a function call, it simplifies the parsing step to define and use different symbols for different actions. As this is a custom language, you are free to redefine any symbols as you like.

Some smaller adjustments have been made to the original code. On the one hand side, the result of each processing step is written to a file, which helps to understand the structure. We will see this in the next sections. Also, support for recursion and multi line comments was added. Check the related commits at my Github.

Drew explains the code well in his posts as well as the steps that are required. So, if you are interested in the inner workings, have a look at his posts. Instead, I will have a closer look at the outputs of each step.

Lexer

As a first step, the program code is handed over to the lexer. The lexer breaks the code up into tokens that can later be used to create structure from it. The tokens have no hierarchy, but are already associated with a type to know how to parse each token. Each type as a different meaning to the parser. Lexing the first line of the addition example earlier results in this structure:

[
  {
    "type": "keyword",
    "value": "let"
  },
  {
    "type": "identifier",
    "value": "a"
  },
  {
    "type": "operator",
    "value": "="
  },
  {
    "type": "number",
    "value": "1"
  },
  {
    "type": "terminator",
    "value": ";"
  }
]

Parser

Next, the structure is parsed into an abstract syntax tree (AST). The AST is a structural representation of the code. As part of parsing the structure, the parser checks the grammar and fails if parenthesis are missing or a token does make sense at a specific location. Therefore, the code has to follow a specific schema. Still, a successfully parsed code can still contain errors.

Also non relevant details for the execution are removed; for example, termination symbols or comments. White spaces outside of strings and new lines were already filtered in the previous step as they are not relevant in Smol. Notice, however that indentation can have a relevance in other languages.

The resulting AST of the previous shown tokens is shown in the enxt code block. It creates a variable a and assigns it the value 1; later the variable is printed:

[
  {
    "type": "variable-decleration",
    "name": "a",
    "varType": "let"
  },
  {
    "type": "function-call",
    "function": "=",
    "args": [
      {
        "type": "number",
        "value": 1
      },
      {
        "type": "identifier",
        "name": "a"
      }
    ]
  },
  ...
  {
    "type": "function-call",
    "function": "print",
    "args": [
      {
        "type": "identifier",
        "name": "a"
      }
    ]
  }
]

Interpreter

Based on the AST, the interpreter will step through it and run the commands. While the AST only contains actions and references to other functions, the interpreter contains all the necessary logic to “understand” and run it. Additionally, a simple stack and memory for variables is managed.

In addition to Drew’s implementation, the interpreter is able to handle recursions successfully, check out the fibonacci example for this. Also, the parser filters out comments and converts them to a nop (no-operation).

Conclusion

Building a custom, simple programming language with its own interpreter is quite easy. Smol supports handling of (global) variables, methods with parameters and return values, basic algebra with infix notation, recursion, comments and printing data to the console. To my understand the language is even turing complete - assuming an infinite memory band in the Javascript engine.

Finally, I invite you to also explore the language, write own simple codes and maybe even add another feature to the language. Lexing, parsing and interpreting code is straight-forward and Drew did a great job explaining the code. A demo and playground is available at https://repl.it/@timonback/smol

Smol#

Lexer#

Parser#

Interpreter#

Conclusion#

Smol

Lexer

Parser

Interpreter

Conclusion