All Articles

Building a programming language from scratch

Today, I’m going to start working on a new project: I’m going to build my own programming language from scratch.

In many projects, we often only see the final result, we don’t the see the journey and struggles in the middle (which as I repeatedly mention in my internship post is what’s invaluable) so I’m going to document my process in detail.

Day 1

Technically speaking, my experiences with programming languages are quite limited. I have followed the Let’s Build an Interpreter From Scratch for the first ~8 episodes so I’m broadly familiar with the high level components of an interpreter. We have a lexer which is used to convert our code or text into individual tokens, we have a parser to figure out what the tokens are (is this a function declaration, is this an operation, is this a number etc.), we have an interpreter to then do what is asked of us (i.e. call this function, do this operation etc). We have lexical grammars which specify the rules for how we interpret the language.

I also know that many programming languages create an abstract syntax tree from the language and that doing some kind of traversal on the tree is how we then figure out what to actually do from the source code. To figure out the map from the territory, Linus’s post has been helpful. As of right now, I’m not sure what type of language to create. Should I explore functional programming, similar to Lisp? Should I explore concurrency? Should I explore memory allocation or readability? I don’t know.

I’m going to start working through Crafting Interpreters and get a better picture of the map. I’ll figure out what later. I’m going to write the language in Go because, although I’m more proficient in Python, I like Go from my experience building athena, carly, and vibely. So I’m going to run with it. Also because Go compiles code into a single executable which can be run anywhere, it’s more likely I’d be able to take advantage of this down the line if I wanted to use my language to build and ship other side projects.

Update, I’m switching to Writing an Interpreter in Go. Since I’m familiar with the high-level ideas, I want to make sure I follow good practices for how to build the language in Go - since it’s not as object-oriented as Java, I don’t want to start off the bat with bad practices.

Let’s get into lexical anaylsis.

Things that have been simplified that will need to be adapted

  1. Attach filenames and line numbers to tokens (to track down errors more easily)

What I did today:

  1. Laid foundation for lexer
  2. Defined token and initial simple list of token types
  3. Wrote a test to make sure our nextToken function in our lexer correctly works.

Day 2

I’ve decided I want to explore functional programming with my programming language, taking from Lisp. Specifically, I think I’d like it to have a similar flavor to Clojure. Given that I also don’t know Lisp or Clojure, I’m going to hold off on building the interpreter and do some research about the language, how it works, why it’s so powerful, and really get into the meat of the design of the language.

We’re starting here

Notes
  • Clojure = modern dialect of Lisp
  • Runs as a hosted language on the Java Virtual Machine (and other platforms)
  • Opinionated approach (different to Java, C++, similar to Javascript)
  • Lots of brackets, strips away most other syntax (no special characters, no operator overloading, no semi-colons) - pure minimalism
  • Main theme: Every piece of source code is a definition of an executable tree structure

    • Writing in Clojure = defining nested branches of the tree, identified by brackets called S-expressions
  • Operations first - prefix notation
  • One rule no exceptions

    (function param1 param2... paramN)
  • Variables don’t really exist in Clojure
  • Core concept: Symbols

    • Used to name things - can be bound to value
    • Like variables but immutable - bound to fixed value which cannot be changed after definition
    • Omits need for core object-oriented concepts like encapsulation
  • Clojure data is immutable by default
  • Functions first class entities, treated like any data/type
  • Functions which consume or produce other functions Higher Order Functions (similar to higher order components)
  • namespaces = similar to modules
  • Vars = named storage containers with out data (called anywhere from namespace)

    • different from a variable which is direct mapping from symbol to data
    • symbols map to var objects which then can access data
    • Defined with def
    (def a 4)
  • (if test true_expression falseexpression)
  • In Clojure, data = code and code = data

    • Referred as homoiconicty, lends itself to metaprogramming or code generation which is why Lisp and Lisp variants are so powerful
  • To evaluate data as data, symbolically, use quote or ’
  • Lists don’t allow direct access to any other elements (only access at head)
  • Vectors

    • More traditional list-like structure similar to arrays or lists in other languages

Did a lot of research today. It turns out designing a programming language is a much more complex process than simply building an interpreter or compiler. In order to make concrete progress, I’m going to break this into sub-goals.

  1. I want to better understand Lisp / Clojure
  2. I want my newfound understanding to help me architect my language

To accomplish 1, I’m going to start by building a Lisp interpreter in Go. To do that I’m going to complete the make a lisp interpreter from mal.

Setting everything up took ages, at least 3 hours. Felt like a big time noob learning about how to use makefiles for the first time.

Day 3

I want to iterate quickly and I don’t want to get bogged down. So I’m to going to get started building the lisp interpreter in the easiest way possible here. Only difference is I’ll implement it in Go. Hardest part has been to start doing it. It’s infinitely easier to keep finding the best place or way to get started. There is no best way. The best way is whatever gets you started, so I’ll swallow my own pill and get started. Some motivation has been this

“Compilers take a stream of symbols, figure out their structure according to some domain-specific predefined rules, and transform them into another symbol stream.”

Schema Notes
  • Numbers & Symbols (a for example) & operators (e.g. >, +, -) are atomic expressions

    • Can’t be broken up
  • Everything else is a list expression (…) with … being some list of things
  • Special form list = one that starts with a keyword like if (meaning depends on keyword)
  • List that with starts with a Non-keyword = a function
  • 5 keywords and 8 syntactic forms

Task 1: Implement a simple lispy calculator

Update: The tutorial was very interesting but many core features were implemented with python abstractions that did the heavy lifting. Go is more low-level so doing this kind of thing would not transfer easily. I’m now going back to crafting interpreters + reverse engineering Ale to help guide me. The crafting interpreters narrative is helping cement the fundamentals while the Ale source code will be a good tool for bridging some of the more complex parts.

So slight pivot, I’m basically gonna do this from first principles now. Using all of the resources to help orient and guide me but, I want to build every layer from the ground up.

Notes

Key point about context-free grammars

  • Distinction between

    • lexical analysis

      • comprised of characters which form tokens or lexemes
      • implemented by a scanner
    • syntactic grammar

      • comprised of tokens which form expressions
      • implemented by a parser
      • operates on the outputs of the scanner

To write grammar, compromised of

  • terminals (literals, can’t be simplified down further e.g. 205)
  • nonterminals (potentially other expressions which can be called/nested)
  • Need to apply precedence clearly to avoid ambiguity in the parsing → same expression producing different syntax trees and thus different code

What needs to be done:

  • First create a list of tokens - this is the Scanner’s or Lexer’s job
  • Then onto parsing, given this list of tokens, construct the abstract syntax tree based on the grammar specified

    • For our simple lisp interpreter, this means when we see a ”(”, we have a new expression so we would need to “recurse” or “nest” our result!

Day 4

Made progress on lexer/scanner and parsing to tokens works. Still not entirely sure how to implement the parsing in Lisp - I found a potential grammar to work with, but we’re still trying to figure things out.

Resources being used:

Lisp/Clojure recall has Symbols, Numbers, Atoms , List, Expr, Env

  • Symbol is `[anything here]
  • Numbers is self-explanatory
  • Atoms = blocks of independent data
  • Lists = starts with a ( followed by 0 or more items

    • First item may be an operand, function call, variable etc. But always comes first

Good summary of Lisp

I’m trying to find a grammar to work with for lisp and a possible implementation to get inspiration but things are very scattered. Debating whether I should just dive in or keep looking. I think diving in would be the better and possibly harder option, which is why I keep trying to see how it’s been implemented by others. The problem I’m having is that for Lisp specifically (less so more broadly, but especially for Lisp), I haven’t bridged the gap with exactly how the lexer interfaces with the parser. I think this is because I don’t know exactly how to handle a list - my intuition is that we represent the actual data of the as well as an operation so something like

type ListOperator struct {
	operator TokenType
  data []byte
}

Looks like S-expression are handled as an interface. I’m going to run with it.

Day 5

I was struggling in the morning but we finally made some progress! I’m building the calculator bit of Lisp at the moment and I have a working parser for simple expressions like (+ 2 5) or (+ 2 (+ 2 5)).

I’m officially making progress now. Diving in head first made me make naive assumptions, however when I was forced to start thinking about how to improve on them, I was able to appreciate what some of the other Lisp interpreters were doing. Now I’m updating the parser and it’s off to the races (for now).

Parser is done, now the struggle has shifted to the evaluating section. Similar problem here, I don’t know how to handle these lists. Normally, the nodes would consist of things like binary operations, unary operations, assignments, function calls - all of which you evaluate the children then do the operation at the parent node. In the list implementation, the operator will be the first term of the list and every other term needs to have this operation carried on it. E.g. (+ 5 3 2) ⇒ 5 + 3 + 2. And our AST’s underlying structure looks like List: {Symbol, Number, Number, …} for example

Edit: finally got the calculator to work after many hours. Tomorrow is going to be adding state, bindings, and possibly conditions :) Things have been hard, partly because I’m thinking from first-principles, partly because of gaps in my knowledge that I have to fill as I do research (e.g. today I learned about closures, type assertions, and concurrency in Go), but it has been rewarding.

Also, you can follow along my progress here!

Day 6

Notes
  • Distinction between expressions and statements. Expressions evaluate to smt, statements, depending on the language (e.g. in many functional languages) do or don’t

  • Environment = data structure used to associate variables/functions with their values

    • Boils down to basically a hashmap
    • Scope and environment are close cousins, environment is the machinery that implements scope
    • Scope implemented like linked list - start in current scope and check, then get environment of next highest scope (each env has one reference to its parent scope) then check and so on and so forth
  • Pattern of defining a node in the syntax tree for handling different types (unary op, binary op, assignment, function) then having a visitor with an interpret method in evaluating where you put domain-specific code

Things to work on today

  • State with variables
  • Expanding expressions with truth, false, and statements that evaluate to these expressions e.g. (< 5 2) = false

So I just added a data structure to encapsulate environment but I think I need to adapt my old code for how I was handling calculations. For example, a definition in lisp e.g. def a 5 def is a symbol. But it’s logic should take the key it gives it and the value and store it in the hashmap. How am I supposed to handle the different functionality with my current code, where when I evaluate a symbol, I always assume it’s a binary operation so it should return an int? I’m not exactly sure without lots of different functions and lots of casting which, I get the impression, is probably not best practice. I think I’ll implement it like that for now and refactor later.

Looks like the intuition was right inspecting some of other people’s code. I need to encapsulate the logic and only return a result when evaluating the nodes, as opposed to returning a function which then carries the operation out at the node.

Refactored the code and things are starting to make sense! Now adding additional functionality should be more straightforward :)

Added state! Well for variables with integers and strings right now.

define

Added relational operators, logical operators, and truthiness!

Edit - added conditional statements and println, with some bug fixes. All in all, a solid day. Tomorrow we’re gonna get real with functions and then it’s off to the races.

To the 🚀