bolingblog: Finite State Machine Tokenizers: Finite State Machine Tokenizers

"If you solve a problem with a regex, now you have two problems"

I have a dream! That one day software developers who have to do string processing will be just as likely to reach for a finite state machine as they are to start writing regular expressions. FSM's are not a magic bullet, but for some problems they're perfect: Easy to write, easy to understand, easy to test.

For the sake of this article, lets say that we want to grab the comments out of a c or java source file. These are the rules:

A line comment starts with a double slash and ends at the first newline
A block comment starts with slash star and ends at the first star slash
A double slash within a string literal does not start a comment
Slash star within a string literal does not start a comment
A string literal starts with doublequote and ends at the first double quote
A string literal does not end at a double quote preceded by a backslash

This is the definition of a FSM tokenizer that grabs the comments. This is not a real format, but I think it might be valid YAML. I also think it could be a fascinating project in java to implement a general purpose FSM tokenizer where the FSM is defined in yaml.

Start in the CODE state and for each character in the string do what your current state tells you, including moving to a new state.

---
CODE: {
'/': ONESLASH
'"': INQUOTE
?: CODE
}

ONESLASH: {
'/': LINECOMMENT
'*': BLOCKCOMMENT
?: CODE
}

LINECOMMENT: {
'\n': [endToken, CODE]
?: [appendTheChar, LINECOMMENT]
}

BLOCKCOMMENT: {
'*': BLOCKSTAR
?: [appendTheChar, BLOCKCOMMENT]
}

BLOCKSTAR: {
'/': [endToken, CODE]
?: [append:'*', appendTheChar; BLOCKCOMMENT]
}

INQUOTE: {
'"': CODE
'\\': ESCAPE
?: INQUOTE
}

ESCAPE: {
'"': INQUOTE
?: INQUOTE
}

Some of you are already familiar with this method of string processing. Some of you think that this type of FSM is better expressed with the characters at the root level. That is, all the state transitions that can be triggered by (say) a slash are grouped together, rather than all the transitions that start in (say) a quoted string being grouped together. "You're wrong and I hate you".

One of my teams has recently started to use the language Kotlin, so over the bank holiday weekend I gave myself a personal coding project: To write a FSM tokenizer in Kotlin and to make it as easy to read as I could. This what I ended up with.

/**

* base class for Finite State Machine based tokenizers (lexers)

abstract class FSMTokenizer {

var token = StringBuilder(); // the current token

var tokens = ArrayList<String>(); // all the tokens

operator fun plus(c: Char?){ // add a character to the current token

token.append(c);

}

operator fun plus(s: String){ // add a string to the current token

token.append(s);

}

fun split() { // end one token and start the next

tokens.add(token.toString());

token = StringBuilder();

}

fun tokenize(s: String): List<String> { // feed the characters of a string into a Finite State Machine

for (c in s) {

step(c)

}

step(null) // the FSM is given a null after the last character - it might have some tidying up to do

return tokens;

}

abstract fun step(c: Char?) // deal with one character. You need to implement this yourself

}

* tokeniser which grabs only the comments out of a java source file

class FSMComments(): FSMTokenizer() {

var state: States = States.CODE; // start the machine in the CODE state

override fun step(c: Char?) = state.step(this, c) // pass this tokenizer into the enum class

enum class States { // FSM model

CODE, ONESLASH, LINECOMMENT, BLOCKCOMMENT, BLOCKSTAR, INQUOTE, ESCAPE;

fun step(tokens: FSMComments, c: Char?) {

tokens.state = when(tokens.state) { // our implementation of step() updates the state of the machine

CODE ->

when (c) {

'/' -> ONESLASH

'"' -> INQUOTE

else -> CODE

}

ONESLASH ->

when (c) {

'/' -> LINECOMMENT

'*' -> BLOCKCOMMENT

else -> CODE

}

LINECOMMENT ->

when (c) {

'\n' -> {tokens.split(); CODE}

else -> {tokens+c; LINECOMMENT}

}

BLOCKCOMMENT ->

when (c) {

'*' -> BLOCKSTAR

else -> {tokens+c; BLOCKCOMMENT}

}

BLOCKSTAR ->

when (c) {

'/' -> {tokens.split(); CODE}

else -> {tokens+'*'; tokens+c; BLOCKCOMMENT}

}

INQUOTE ->

when (c) {

'"' -> CODE

'\\' -> ESCAPE

else -> INQUOTE

}

ESCAPE ->

when (c) {

'"' -> INQUOTE

else -> INQUOTE

}

* tokeniser which splits a string on commas. It eats leading whitespace and multiple commas

class FSMCommas(): FSMTokenizer() {

var state: States = States.COMMA; // start the machine in the COMMA state

// extra operators can be implemented here

override fun step(c: Char?) = state.step(this, c)

enum class States { // FSM model

TEXT, COMMA;

fun step(tokens: FSMCommas, c: Char?) {

tokens.state = when(tokens.state) {

TEXT ->

when (c) {

',', null -> {tokens.split(); COMMA}

else -> {tokens+c; TEXT}

}

COMMA ->

when (c) {

',', ' ', '\t', '\n' -> COMMA

else -> {tokens+c; TEXT}

}

fun main() {

val helloWorld = """

/* The code below will print the words Hello World

to the screen, and it is amazing */

#include <stdio.h>

int main() {

// printf() displays the string inside quotation

printf("Hello, World!");

printf("//shhh this isn\"t a comment");

return 0;

}

/* this block comment has a * in it */

""".trimIndent();

val list = """

one, two,

three, four""";

for (t in FSMCommas().tokenize(list)) {

println(t);

}

println("---")

for (t in FSMComments().tokenize(helloWorld)) {

println(t);

}

Richard "Blogger is rubbish for code samples" B

bolingblog

Saturday, 3 September 2022

Finite State Machine Tokenizers

No comments:

Post a Comment

Followers

Contributors