Build a WebAssembly Language for Fun and Profit: Lexing

6 min readAug 19, 2022

A Courier pigeon sits in the middle of a swirl representing the Fibonacci sequence.

WebAssembly (wasm) is a high performance assembly-like format optimized for the web. Code targeting WebAssembly can run at near-native speeds while still benefiting from the safe environment of a browser VM. Wasm has opened up a whole new world of demanding desktop-class apps that can comfortably run in the browser. For example, AutoCAD was able to port decades of code to the web using wasm. Cases like AutoCAD’s have made it clear that wasm will be a major disruptive force in how web apps are developed.

To facilitate the adoption of wasm, WebAssembly team developed a powerful compiler toolchain library called binaryen. Binaryen does a huge amount of heavy lifting for compiler authors. It offers dead code removal, code size reduction, and various performance optimizations out of the box.

As someone who has long been interested in programming languages, this piqued my interest. Writing compiled languages has always been a daunting task. What I found is binaryen made it incredibly fun and easy to build new languages that are shockingly speedy.

That's why I decided to create this guide and provide a simple overview designed to help get your feet wet in building languages and exploring the inner workings of wasm.

Here's a quick taste of the lisp inspired language we'll build, wispy:

 
(fn fib:i32 [val:i32]
  (if (lt_i32 val 2)
    val
    (add_i32 (fib (sub_i32 val 1)) (fib (sub_i32 val 2)))))(fn main:i32 [] (fib 15))

This simple function calculates values of the fibonacci sequence, a sequence of numbers that appears in surprising places through mathematics and nature. It's one of my favorite illustrations of how elegantly patterns of the universe can be described in code.

This guide is designed for intermediate to advanced software developers looking for a fun side project to challenge themselves with. By the end, we’ll have built a working compiler and runtime for wispy.

The guide will be broken down into three articles:

Setup and Lexing (this article): the process of converting the characters of code into meaningful units called tokens
Parsing: the process of converting the tokens into a logical tree known as an AST.
Compiling (or code generation): the process of converting the AST into the binary instructions run by our computer

Setup

In this guide, we will be using TypeScript and NodeJS. The concepts are highly portable, so feel free to use the environment you're most comfortable with. Our only major dependency, binaryen, has a simple C API. You are welcome to skip ahead to the next section if you're using a different language.

Requirements:

NodeJS v16+
Git

Quick Start

bash
git clone git@github.com:drew-y/wispy.git
cd wispy
git checkout quickstart
npm i

Manual Setup

I've included manual setup instructions as an alternative to the quick start, in case you want to know exactly how the project was set up or just like doing things from scratch. If you've already done the quick start, skip to the next section.

Open a terminal window and make a new directory:

bash
mkdir wispy
cd wispy

Initialize package.json:

bash
npm init -y # Be sure to have NodeJS 16+ installed

Install the project dependencies:

npm i @types/node binaryen typescript

Add these two fields to the package.json

"type": "module", // Binaryen uses the new module format so we must follow suit
"bin": {
  "wispy": "dist/index.mjs" // This will allow us to easily run the compiler from our terminal
},

Create a tsconfig file:

npx tsc init .

Set the following fields in tsconfig.json:

"module": "ES2022","rootDir": "./src","moduleResolution": "node","outDir": "./dist"

Lexing

Lexing is the process of digesting each individual character of our program into a set of tokens. A token is a group of characters that take on a special meaning when put together. Take the following snippet of wispy:

(add 1 2)

There are five unique tokens in that snippet (, add, 1, 2 and ). The lexer's job is simply to identify and list those tokens in order.Lexing is typically the first step in turning human readable code into something closer to what a computer can understand.

Defining Our Tokens

We'll start by defining our tokens in a new file:

# mts extension is important, it tells typescript to create a corresponding mjs file, so Node knows to use modulesmkdirp -p src/types/token.mts

First up is the IntToken. This token represents whole numbers like 1045:

// src/types/token.mtsexport type IntToken = { type: "int"; value: number };

Next up is the FloatToken. This token represents numbers that may have a decimal, like 1.8:

// src/types/token.mtsexport type FloatToken = { type: "float"; value: number };/** Previously defined tokens omitted for brevity */

Now, let's define some identifier tokens. In wispy, an identifier can represent either the name of a function, or the name of a function parameter. We have two types of identifier tokens, a standard IdentifierToken and a TypedIdentifierToken.

An IdentifierToken is used in the body of a function to refer to the function's parameters or to call another function.

A TypedIdentifierToken is used when defining a function or a parameter. Typed identifiers take the form identifier:type. For example, val:i32 defines a parameter that is a 32-bit integer. When defining a function, the type represents the function's return type. For example, fib:i32 is a function that returns a 32-bit integer.

Here are the definitions:

// src/types/token.mtsexport type IdentifierToken = { type: "identifier"; value: string };export type TypedIdentifierToken = { type: "typed-identifier"; value: string };/** Previously defined tokens omitted for brevity */

Up next is BracketToken. Wispy uses S-expression syntax, like lisp. So brackets are very important. To keep things simple we allow two kinds of brackets () and []. To keep things even more simple the compiler will treat () and [] as interchangeable. In actual use we will only use [] to define parameters.

// src/types/token.mtsexport type BracketToken = { type: "bracket"; value: Bracket };export type Bracket = "(" | ")" | "[" | "]";/** Previously defined tokens omitted for brevity */

Finally we define the top level Token type:

// src/types/token.mtsexport type Token = BracketToken | IntToken | FloatToken | IdentifierToken | TypedIdentifierToken;/** Previously defined tokens omitted for brevity */

Token is a discriminated union. Discriminated Unions are an incredibly powerful programming language construct. They represent a value that can be one of many types. In our case, a Token can be any one of the more specific token types we defined earlier, such as IntToken or FloatToken. You'll notice that each of these tokens have a unique type field, such as type: "int" in the case of IntToken. This is the discriminator. Down the line you can pass a Token to a function and that function can use the type field to figure out which specific token it's working with.

At this point src/types/token.mts is finished and should look like a file.

To make our new types easily accessible, export them from a new index.mts file:

// src/types/index.mtsexport * from "./token.mjs";

The Lex Function

Now that we have our tokens defined we can write the actual lex function. The lex function will take a string (i.e. a .wispy file) and output an array of tokens (Token[]):

Make a new lex file:

mkdirp -p src/lexer.mts

Define the lex function:

// src/lexer.mtsimport { Bracket, Token } from "./types/index.mjs";export const lex = (input: string): Token[] => {const chars = input// Remove any leading or trailing whitespace for simplicity.trim()// Break up the file into single characters.split("");// This array stores our tokensconst tokens: Token[] = [];// The loop continues as long as we have characters to consume

With that we have a working lexer. We can break our code down into tokens. This is a good place to break for now. In the next article, we’ll move onto parsing these tokens into a logical tree that will ultimately be converted to wasm.