You could have invented Parser Combinators

01/12/2014

In my book Build Your Own Lisp I use a Parser Combinator library I wrote for C called mpc to teach readers about languages and parsing. Lots of people have asked me how this works, curious as to if they might be able to do it themselves. A few even dived into the mpc source code!

But unfortunately the implementation of mpc in C is quite complex due to various language limitations. Additionally most resources online about Parser Combinators can be quite cryptic for a beginner. This is because much of the text is geared toward people who already know Haskell, the language they were invented using. For those unfamiliar with Haskell, following along can be pretty difficult.

But the ideas behind Parser Combinator libraries are really quite simple. It is more than possible for a beginner to implement a Parser Combinator library in a high level language such as Javascript in just a few hundred lines of code. In fact, you could have invented Parser Combinators...

You start with a function which reads the first character from some input. If this first character is a, then it returns a and advances the input by one. If it isn't, it doesn't advance the input, and returns a failure.

But your boss wants this function to work for any character, not just a. Now normally you'd add a parameter to this function which specifies which character to look for, but today you try something different. You create a new function which takes this character as input, and returns the function you had previously - adapted to look for this specific character. This function you call lit.

function lit(c) {
	
	return (function(input) {
		var r = inputRead(input);
		if (r == c) {
			inputAdvance(input, 1);
			return c;
		} else {
			return failure;
		}
	});
	
}

Your boss thinks you're mad, but you think you're pretty clever, because instead of writing the function you needed, you generated it using another function.

var parser = lit('a');
var result = parser(input);

Now your boss wants you to check if the first character is one of two given characters. He tells you to stop messing around and update the original function to take two character parameters, but instead you decide to again write a function that generates the one your boss was talking about. This one takes as input two functions generated from lit, and returns a function that tries each one in turn, returning the first one that succeeds. This new function you call or.

function or(parser0, parser1) {

	return (function (input) {
		var result0 = parser0(input);
		if (result0 != failure) { return result0; }
		var result1 = parser1(input);
		if (result1 != failure) { return result1; }
		return failure;
	});
	
}

You use this in combination with lit to do what your boss asked of you.

var parser = or(lit('a'), lit('b'));
var result = parser(input);

You also make a function that takes as input two calls to lit, and returns a new function that tries each in series. If any of them fail, it fails and rewinds the input, while if both of them succeed it returns the string of both the characters joined together. This you call and.

function and(parser0, parser1) {

	return (function (input) {
		var pos = inputGet(input);
		var result0 = parser0(input);
		if (result0 == failure) { inputSet(pos); return failure; }
		var result1 = parser1(input);
		if (result1 == failure) { inputSet(pos); return failure; }
		return result0 + result1;
	});
	
}

You show your boss that this can be used with lit when you want to parse something and then something else.

var parser = and(lit('a'), lit('b'));
var result = parser(input);

Your boss says he has a new request that will break your approach. He asks you to check for either ab or cd at the start of the input, and if either one is present convert the characters to integers and return a list of them. He says this can, and should, be done with just one function, the conventional way.

Annoyingly all your functions are set up to deal with characters and strings, not integers. You wonder if you can make them more general to work across different types. You think that or is probably already quite general - it just returns the result of one parser or the other. So the problem is just with the and parser, but this can be made more general if it returns a list of the results. Now neither and or or rely on character types as output of the parsers.

function and(parser0, parser1) {

	return (function (input) {
		var pos = inputGet(input);
		var result0 = parser0(input);
		if (result0 == failure) { inputSet(pos); return failure; }
		var result1 = parser1(input);
		if (result1 == failure) { inputSet(pos); return failure; }
		return [result0, result1];
	});
	
}

This means and and or can take as input any functions that act in a given way - these are any functions that take some input and either fail, or produce some output. These type of functions you call parsing functions and they're what you've been generating. But wait a second - the functions returned by and and or are parsing functions themselves! All they do is take some input and either fail or produce output. So we can construct a parser from and or or, and because the output itself would be a parsing function we could give it as input back to and or or without worrying things will break.

var parser = or(
	and(lit('a'), lit('b')),
	and(lit('c'), lit('d')));
	
var result = parser(input);

Even so, you are still a bit unsure about how to approach conversion to integers. It would be nice if you could have a function which calls some other function with the result of a parse. Then you could just pass in a integer conversion function to it and use lit as before. No worries - we'll generate it again. You write a function which takes as input some parsing function, and some other function, and returns a function that runs the parsing function and, providing it didn't fail, returns the other function applied to the result. This function you call apply.

function apply(f, parser) {
	
	return (function (input) {
		var result = parser(input);
		if (result == failure) { return failure; }
		else { return f(result); }
	});

}

Now you can finally write your boss' request as the following.

var parser = or(
	and(apply(toInt, lit('a')), apply(toInt, lit('b'))),
	and(apply(toInt, lit('c')), apply(toInt, lit('d'))));
	
var result = parser(input);

You can even tidy up all those repeated function calls by making a lit_to_int function.

function lit_to_int(c) {
	return apply(toInt, lit(c));
}

var parser = or(
	and(lit_to_int('a'), lit_to_int('b')),
	and(lit_to_int('c'), lit_to_int('d')));
	
var result = parser(input);

To deal with daily requests you add more and more functions to your library. You add a function that takes one parser and returns a function that runs it zero or more times, collecting all the results in a list (many). You add a function that takes three parsers and returns a function that runs them in order and returns the result of the middle one (between). You add utility functions for matching identifiers and for ignoring whitespace around other parsers. The more you use your library the more you realize the code you write is more or less identical to the textual specification from your boss. For complex parsing tasks you can just look up a grammar and map it more or less directly to your library. It is like you just give it the problem and it finds the solution as if by magic.

A sweet name is only one thing is left before you release your invention upon the world - Parser Combinators.