Impera Progress Check #1
I’ve promised updates on my progress with Impera, so here it is: so far, I’ve managed to design a pretty thorough lexer, with support for the following tokens:
package token;
public enum TokenType {
ILLEGAL, EOF, INT, FLOAT,
ASSIGN, PLUS, MINUS, BANG,
ASTERISK, POWER,
SLASH, ARROW,
LT, LTE, GT, GTE, EQ, NOT_EQ,
COMMA, SEMICOLON,
LPAREN, RPAREN, LBRACE, RBRACE,
IDENT, FUNCTION, LET,
TRUE, FALSE,
IF, ELSE,
RETURN,
STRING, LBRACKET, RBRACKET, COLON
}
I used an enum to represent the tokens, since I’m using the heavily OOP Java as my language of choice. I began writing it in C++, but I didn’t feel as comfortable with it as with Java, and manual memory management is a pain when one is learning things for the first time. I was trying to decide between Ocaml, Kotlin, and Rust, but realized that the only language I was really comfortable with was Java.
Of course, I took the opportunity to use Java 15’s preview features (e.g., sealed classes, records, pattern matching) to compensate for the loss in expressibility found in the other languages I mentioned.
I keep hearing about how sealed classes give Java Algebraic Data Types to some extent, but I haven’t really figured out how to use them in a way that isn’t contrived. I know Kotlin has had them for a while, and I don’t really understand their full value yet. The fact that sealed classes prevent arbitrary inheritance may be useful when type checking at compile time for pattern matching exhaustivity, I think. Here’s some of the code that uses patern matching (and Java’s limited “higher-order functions” in the form of predicates):
package lexer;
import token.Token;
import token.TokenType;
import java.nio.charset.StandardCharsets;
import java.util.function.Predicate;
public class Lexer {
private final String input;
private final byte[] inputBytes;
private final Predicate<Byte> isLetter = Character::isLetter,
isDigit = Character::isDigit,
isUnderscore = b -> b == '_';
public Token nextToken() {
this.skipWhitespace();
var tok = switch (this.ch) {
case ';' -> new Token(TokenType.SEMICOLON, byteToString(this.ch));
case ',' -> new Token(TokenType.COMMA, byteToString(this.ch));
// Arithmetic operators
case '+' -> new Token(TokenType.PLUS, byteToString(this.ch));
case '-' -> new Token(TokenType.MINUS, byteToString(this.ch));
case '/' -> new Token(TokenType.SLASH, byteToString(this.ch));
case '*' -> {
if (this.peekChar() == '*') {
byte b = moveForward();
yield new Token(TokenType.POWER, byteToString(b) + byteToString(this.ch));
} else yield new Token(TokenType.ASTERISK, byteToString(this.ch));
}
case '=' -> {
if (this.peekChar() == '=') {
byte b = moveForward();
yield new Token(TokenType.EQ, byteToString(b) + byteToString(this.ch));
} else if (this.peekChar() == '>') {
byte b = moveForward();
yield new Token(TokenType.ARROW, byteToString(b) + byteToString(this.ch));
} else yield new Token(TokenType.ASSIGN, byteToString(this.ch));
}
case '!' -> {
if (this.peekChar() == '=') {
byte b = moveForward();
yield new Token(TokenType.NOT_EQ, byteToString(b) + byteToString(this.ch));
} else yield new Token(TokenType.BANG, byteToString(this.ch));
}
case '<' -> {
if (this.peekChar() == '=') {
byte b = moveForward();
yield new Token(TokenType.LTE, byteToString(b) + byteToString(this.ch));
} else yield new Token(TokenType.LT, byteToString(this.ch));
}
case '>' -> {
if (this.peekChar() == '=') {
byte b = moveForward();
yield new Token(TokenType.GTE, byteToString(b) + byteToString(this.ch));
} else yield new Token(TokenType.GT, byteToString(this.ch));
}
case '(' -> new Token(TokenType.LPAREN, byteToString(this.ch));
case ')' -> new Token(TokenType.RPAREN, byteToString(this.ch));
case '{' -> new Token(TokenType.LBRACE, byteToString(this.ch));
case '}' -> new Token(TokenType.RBRACE, byteToString(this.ch));
case '[' -> new Token(TokenType.LBRACKET, byteToString(this.ch));
case ']' -> new Token(TokenType.RBRACKET, byteToString(this.ch));
case 0 -> new Token(TokenType.EOF, "");
default -> {
if (isLetter.or(isUnderscore).test(this.ch))
yield Token.lookupIdent(readToken(isLetter.or(isUnderscore).or(isDigit)));
String ident = null;
if (isDigit.test(this.ch)) {
var a = readToken(isDigit.or(isUnderscore).or(isLetter).or(c -> c == '.'));
if (a.chars().allMatch(c -> Character.isDigit(c) || c == '_'))
yield new Token(TokenType.INT, a.replaceAll("_", ""));
if (a.chars().allMatch(c -> Character.isDigit(c) || c == '_' || c == '.'))
// Ensure there is only one decimal point in the number
if (a.chars().filter(c -> c == '.').count() == 1)
yield new Token(TokenType.FLOAT, a.replaceAll("_", ""));
ident = a;
}
yield new Token(TokenType.ILLEGAL, ident);
}
};
// this.readChar();
return tok;
}
private byte peekChar() {
if (this.readPosition >= this.input.length())
return 0;
else
return this.inputBytes[this.readPosition];
}
private String readToken(Predicate<Byte> p) {
var position = this.position;
while (p.test(this.ch))
this.readChar();
return this.input.substring(position, this.position);
}
}
The end goal is to eventually compile this into native code with LLVM, so I may port this to Kotlin, or even better, Rust because of it’s experimental LLVM backend. I can’t find any up to date Java bindings for it.
I shall continue to keep you updated on how things are going with Impera.