Impera Progress Check #1

I’ve promised updates on my progress with Impera, so here it is: so far, I’ve managed to design a pretty thorough lexer, with support for the following tokens:

package token;

public enum TokenType {
    ILLEGAL, EOF, INT, FLOAT,

    ASSIGN, PLUS, MINUS, BANG,
    ASTERISK, POWER,
    SLASH, ARROW,

    LT, LTE, GT, GTE, EQ, NOT_EQ,
    COMMA, SEMICOLON,

    LPAREN, RPAREN, LBRACE, RBRACE,

    IDENT, FUNCTION, LET,
    TRUE, FALSE,
    IF, ELSE,
    RETURN,

    STRING, LBRACKET, RBRACKET, COLON
}

I used an enum to represent the tokens, since I’m using the heavily OOP Java as my language of choice. I began writing it in C++, but I didn’t feel as comfortable with it as with Java, and manual memory management is a pain when one is learning things for the first time. I was trying to decide between Ocaml, Kotlin, and Rust, but realized that the only language I was really comfortable with was Java.

Of course, I took the opportunity to use Java 15’s preview features (e.g., sealed classes, records, pattern matching) to compensate for the loss in expressibility found in the other languages I mentioned.

I keep hearing about how sealed classes give Java Algebraic Data Types to some extent, but I haven’t really figured out how to use them in a way that isn’t contrived. I know Kotlin has had them for a while, and I don’t really understand their full value yet. The fact that sealed classes prevent arbitrary inheritance may be useful when type checking at compile time for pattern matching exhaustivity, I think. Here’s some of the code that uses patern matching (and Java’s limited “higher-order functions” in the form of predicates):

package lexer;

import token.Token;
import token.TokenType;

import java.nio.charset.StandardCharsets;
import java.util.function.Predicate;

public class Lexer {
    private final String input;
    private final byte[] inputBytes;

    private final Predicate<Byte> isLetter = Character::isLetter,
            isDigit = Character::isDigit,
            isUnderscore = b -> b == '_';

    public Token nextToken() {

        this.skipWhitespace();

        var tok =  switch (this.ch) {
            case ';' -> new Token(TokenType.SEMICOLON, byteToString(this.ch));
            case ',' -> new Token(TokenType.COMMA, byteToString(this.ch));

            // Arithmetic operators
            case '+' -> new Token(TokenType.PLUS, byteToString(this.ch));
            case '-' -> new Token(TokenType.MINUS, byteToString(this.ch));
            case '/' -> new Token(TokenType.SLASH, byteToString(this.ch));
            case '*' -> {
                if (this.peekChar() == '*') {
                    byte b = moveForward();
                    yield new Token(TokenType.POWER, byteToString(b) + byteToString(this.ch));
                } else yield new Token(TokenType.ASTERISK, byteToString(this.ch));
            }

            case '=' -> {
                if (this.peekChar() == '=') {
                    byte b = moveForward();
                    yield new Token(TokenType.EQ, byteToString(b) + byteToString(this.ch));
                } else if (this.peekChar() == '>') {
                    byte b = moveForward();
                    yield new Token(TokenType.ARROW, byteToString(b) + byteToString(this.ch));
                } else yield new Token(TokenType.ASSIGN, byteToString(this.ch));
            }

            case '!' -> {
                if (this.peekChar() == '=') {
                    byte b = moveForward();
                    yield new Token(TokenType.NOT_EQ, byteToString(b) + byteToString(this.ch));
                } else yield new Token(TokenType.BANG, byteToString(this.ch));
            }

            case '<' -> {
                if (this.peekChar() == '=') {
                    byte b = moveForward();
                    yield new Token(TokenType.LTE, byteToString(b) + byteToString(this.ch));
                } else yield new Token(TokenType.LT, byteToString(this.ch));
            }

            case '>' -> {
                if (this.peekChar() == '=') {
                    byte b = moveForward();
                    yield new Token(TokenType.GTE, byteToString(b) + byteToString(this.ch));
                } else yield new Token(TokenType.GT, byteToString(this.ch));
            }

            case '(' -> new Token(TokenType.LPAREN, byteToString(this.ch));
            case ')' -> new Token(TokenType.RPAREN, byteToString(this.ch));

            case '{' -> new Token(TokenType.LBRACE, byteToString(this.ch));
            case '}' -> new Token(TokenType.RBRACE, byteToString(this.ch));

            case '[' -> new Token(TokenType.LBRACKET, byteToString(this.ch));
            case ']' -> new Token(TokenType.RBRACKET, byteToString(this.ch));

            case 0 -> new Token(TokenType.EOF, "");

            default -> {
                if (isLetter.or(isUnderscore).test(this.ch))
                    yield Token.lookupIdent(readToken(isLetter.or(isUnderscore).or(isDigit)));

                String ident = null;
                if (isDigit.test(this.ch)) {
                    var a = readToken(isDigit.or(isUnderscore).or(isLetter).or(c -> c == '.'));

                    if (a.chars().allMatch(c -> Character.isDigit(c) || c == '_'))
                        yield new Token(TokenType.INT, a.replaceAll("_", ""));

                    if (a.chars().allMatch(c -> Character.isDigit(c) || c == '_' || c == '.'))
                        // Ensure there is only one decimal point in the number
                        if (a.chars().filter(c -> c == '.').count() == 1)
                            yield new Token(TokenType.FLOAT, a.replaceAll("_", ""));

                    ident = a;
                }

                yield new Token(TokenType.ILLEGAL, ident);
            }
        };
//        this.readChar();

        return tok;
    }

    private byte peekChar() {
        if (this.readPosition >= this.input.length())
            return 0;
        else
            return this.inputBytes[this.readPosition];
    }

    private String readToken(Predicate<Byte> p) {
        var position = this.position;
        while (p.test(this.ch))
            this.readChar();
        return this.input.substring(position, this.position);
    }
}

The end goal is to eventually compile this into native code with LLVM, so I may port this to Kotlin, or even better, Rust because of it’s experimental LLVM backend. I can’t find any up to date Java bindings for it.

I shall continue to keep you updated on how things are going with Impera.


Govind Gnanakumar image
Govind Gnanakumar

Hunting Flutter devs through the multiverse