Proper Parsing with RE2C

A project log for UBASIC And The Need For Speed

I basic, 'ubasic', they basic, we basic; wouldn't you like a speedy BASIC, too?

ziggurat29 06/20/2018 at 15:55

Jun-16 12:09 PM

It's a little inconvenient for the construction of this back-formed log, but chronologically I had started investigating a potential way of replacing the get_next_token() implementation just prior to the events of the last log entry, wherein I had discovered some data anomalies that changed my opinion about what work would result in the biggest gains for the least costs.  But having got past that diversion, I was now more resolved to actually implement and measure results.

My background is not in Computer Science -- I am self-taught in software engineering -- and consequently there are holes in my understanding of various formal Computer Science topics.  Amongst these is all the theory around the construction of language processors.  But I'm at least cursorily familiar with their existence, and I can construct useful web queries.

Parsing is one of my least favorite activities, right up there with keyboard debouncing and bitblt routines.  Doubtlessly this is because I have always hand-crafted parsers.  You can get by with that for simple stuff, but it rapidly becomes an un-fun chore.  Many years back I decided to bite the bullet and learn about 'regular expressions' as the term applies to software, which are a computerification of some stuff that Noam Chomsky and others did ages ago in the field of linguistics.

Suffice it to say that regular expressions (or 'regexes') are a succinct way of expressing a pattern for matching against a sequence of symbols.  Most will be familiar with a very simple form of pattern matching for doing directory listings; e.g. 'ls *.txt' to list all files that end with '.txt'.  Regexes are like that on steroids.

My boredom with parsing is evidently something of a cultural universal, because there have been tools created to automate the generation of parsers since time immemorial, and they pretty much all use regexes for the first part:  'lexing', or breaking stuff up into tokens ('lexemes').  I.e., exactly what we are trying to do in get_next_token().  (The second part is applying grammar, which we are leaving in the existing ubasic implementation).  There are a lot of tools for all the usual reasons:  folks make new versions with modern capabilities, folks make versions with more power, folks make versions suited to different execution environments.  So, time for shopping...

I didn't spend a lot of time researching options, but I did spend a little, and I'll spare you the shopping tales.  I decided on using 're2c' to generate the lexer/tokenizer.

This tool is pretty simple in what it does:  take a list of regular expressions, each associated with a chunklette of C code to invoke when it matches.  It doesn't really even create a complete piece of code, but rather is expected to have its output spewed into the middle of a function that you write, which is the lexical scanner.  The code that it generates is a monster switch statement that represents a compiled and optimized Deterministic Finite Automaton (DFA) for all the patterns that you have specified.  It is held in high regard, and apparently results in more efficient code than some alternative approaches, such as the table-driven approach used in Lex/Flex.  It is also minimalistic in that it just spews out pure C code, with no particular library dependencies (not even stdlib, really), since you are expected to do all the I/O needed to provide symbols to it.  So it seemed a good match, and a-learnin' I did go.

I had not used this tool before, so I did the usual cut-and-paste of the most simple examples I could find for it, and then found the 'hole' where I stick in my regexes.  I figured that out fairly quickly, so then I needed to see how to marry all this with the existing ubasic code.

There was a prevailing desire to not re-engineer ubasic, but rather to surgically replace the sub-optimal component.  Helpfully, the ubasic codebase was simple, and moreover had a separate component called 'tokenizer' which did all the tokenizing work.  Nice!

The only function I required to replace was get_next_token() as per my profiling results.  However, looking over tokenizer.c's implementation, this seemed like it might be an awkward task since it had some coupling with neighboring functions' implementations.  On the other hand, the module as a whole had a relatively simple interface defined in tokenizer.h:

//this sets up pointers to start the process, and implicitly sets up to return the first token
void tokenizer_init(const char *program);
//this advances to the next token (if possible)
void tokenizer_next(void);
//this gets the token ID of current token
int tokenizer_token(void);
//this interprets the current token as an integer
int tokenizer_num(void);
//this interprets the current token as a variable name, returned as an index 0-25 (A-Z)
int tokenizer_variable_num(void);
//this interprets the current token as a quoted string, with the quotes removed
void tokenizer_string(char *dest, int len);

//this indicates that we are done with the input symbol stream
int tokenizer_finished(void);
//this causes the implementation to spew out some debug info, if it is a debug build
void tokenizer_error_print(void);

I decided it would be easier to simply re-implement those methods in terms of the new lexical scanner I would be creating.  A little more work, but not much.  Less work relative to trying to preserve carnal assumptions within the implementation of those public methods.

The re2c tool takes an input file of your creation, conventionally named with an '.re' suffix, and injects generated code into a space demarcated by a special comment.  The result is your C code with a state machine stuck into a hole you created.  Most of the .re file I created was some definitions of a structure of my creation named 'Scanner', which contained the scanner's state.  This state consisted of a bunch of pointers into 'where are we now', 'where were we when we started this token', 'some scratch stuff that re2c needs if it needs to back up', 'what token did we find', etc.

typedef struct Scanner {
	const char* top;	//start of token
	const char* _pchCur;	//cursor; will be just past token when emitting tokens
	const char* ptr;	//used by re2c to store position for backtracking
	const char* pos;	//
	const char* _pchBackup;	//if we need to do backup from a lookahead
	int line;		//line no in source file
	int current_token;
} Scanner;

Also needed are some macros that re2c expects to exist.  These macros are what allows re2c code to interface to your code in a well-defined way, so that you can adapt all sorts of I/O mechanisms (e.g. serial ports, whatever) as a source of data into the scanner in a uniform way.  In this case it's really easy, since all the program text is in one contiguous memory region, so even I can code that correctly!

#define YYCTYPE		char
#define YYCURSOR	s->_pchCur
#define YYMARKER	s->ptr
#define YYCTXMARKER	s->_pchBackup

What does all that mean?  Use the docco!
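For contrast, a buffered source (a serial port, say) would enable yyfill and point the macros at a refillable window instead.  A sketch, where 'Buffer' and 'refill_buffer()' are hypothetical names of my own invention:

```c
/* All names here (Buffer, buf, refill_buffer) are hypothetical illustrations. */
typedef struct Buffer {
    unsigned char* cur;     /* YYCURSOR: next character to scan */
    unsigned char* limit;   /* YYLIMIT: one past the last valid character */
    unsigned char* marker;  /* YYMARKER: backtrack position */
} Buffer;

static Buffer buf;

#define YYCTYPE    unsigned char
#define YYCURSOR   buf.cur
#define YYLIMIT    buf.limit
#define YYMARKER   buf.marker
#define YYFILL(n)  refill_buffer(&buf, (n))  /* fetch at least n more chars */
```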

Anyway, the rest of the file was the C code that provides the implementation of the public functions that are being replaced.  I won't bore you with that here, just take it as read that they were relatively simple implementations that worked off the state data in 'struct Scanner', of which there was defined a global instance.  The tasty bits are the re2c patterns that make up your lexer.  Again, the full definition is long, so I'm just going to show a piece of it:

int scan(Scanner* s)
{
regular:
	if ('\0' == *s->_pchCur)
		return (TOKENIZER_ENDOFINPUT);
	s->top = s->_pchCur;

/*!re2c
	re2c:yyfill:enable = 0;	// buffer contains all text, so 'no'

	whitespace = [ \t\v\f]+;
	dig = [0-9];
	any = [\000-\377];
*/

/*!re2c
	dig+		{ return (TOKENIZER_NUMBER); }
	["]([^"]+)["]	{ return (TOKENIZER_STRING); }
	[a-zA-Z]	{ return (TOKENIZER_VARIABLE); }
	"\r\n"|"\n"	{
			    ++s->line;
			    s->pos = s->_pchCur;
			    return (TOKENIZER_CR);
			}
	","		{ return (TOKENIZER_COMMA); }
	";"		{ return (TOKENIZER_SEMICOLON); }
	"+"		{ return (TOKENIZER_PLUS); }

//... more symbols

	'let'		{ return (TOKENIZER_LET); }
	'print'		{ return (TOKENIZER_PRINT); }
	'println'	{ return (TOKENIZER_PRINTLN); }
	'if'		{ return (TOKENIZER_IF); }
	'then'		{ return (TOKENIZER_THEN); }
	'else'		{ return (TOKENIZER_ELSE); }

//... more keywords

	whitespace	{ goto regular; }

	any		{
			    printf("unexpected character: %c\n", *s->_pchCur);
			    return (TOKENIZER_ERROR);
			}
*/
}


So, you can see that the tool works by having a conventional C comment block with a special token '!re2c' that demarcates some stuff that the re2c tool should process.  You can have as many of these blocks as you like, wherever you like.  Of course, it only really makes sense in the context of a scanner function.

The first '!re2c' block contains some directives and 'character class' definitions.  Really, much of this scanner was cut-and-pasted from an example I found, and I hacked it down to the minimum I needed.  Hence the label 'regular' exists because there used to be some other labels for some different scanner states I didn't need.  The same goes for the inconsistent naming of the member variables in 'struct Scanner'.  Since this was a proof-of-concept exercise, I didn't devote as much time to polishing for posterity.

Anyway, most of the interesting stuff is that you use regexes to define patterns to match abstract things, like 'text enclosed in quotation marks' or 'a well-formed number' or 'a case-insensitive string to match'.  These patterns, when matched, result in some code that is executed.  Usually, this code is simply to emit the token id.  But it can be arbitrary things, too.  For example, the CRLF or LF pattern match will emit the TOKENIZER_CR code, but before it does, it will increment a line counter, and set the current position to be after the found pattern.  This was done to support some debug output -- it's not required for what we are doing, but I used it in a testing function I created that barfs out all the parsed tokens for the input program.  I very much needed this to validate that my patterns were working correctly.  And similarly, the 'whitespace' match does a goto 'regular:' to effectively ignore whitespace matches.

Before moving on from the re2c syntax, some things seem worth mentioning:

The result of running the .re file through the tool is a C language file (ostensibly named '*.c' (or '.*\.c' if you want to get regexy with it)).  You're not meant to scrutinize that file, but it's interesting to take a gander to see how the sausage is made.  Here's some excerpts:

int scan(Scanner* s)
{
regular:
	if ('\0' == *s->_pchCur)
		return (TOKENIZER_ENDOFINPUT);
	s->top = s->_pchCur;

	YYCTYPE yych;
	unsigned int yyaccept = 0;
	yych = *YYCURSOR;
	switch (yych) {
	case '\t':
	case '\v':
	case '\f':
	case ' ':	goto yy54;
	case '\n':	goto yy23;
	case '\r':	goto yy25;
	case '"':	goto yy4;
	case '%':	goto yy42;
	case '&':	goto yy34;
	case '(':	goto yy44;
	case ')':	goto yy46;
	case '*':	goto yy38;
	case '+':	goto yy30;
	case ',':	goto yy26;
	case '-':	goto yy32;
	case '/':	goto yy40;
	case '0':
	case '1':
	case '2':
	case '3':
	case '4':
	case '5':
	case '6':
	case '7':
	case '8':
	case '9':	goto yy2;
	case ';':	goto yy28;
	case '<':	goto yy48;
	case '=':	goto yy52;
	case '>':	goto yy50;
	case 'A':
	case 'B':
	case 'D':
	case 'H':
	case 'J':
	case 'M':
	case 'Q':
	case 'V':
	case 'X':
	case 'Y':
	case 'Z':
	case 'a':
	case 'b':
	case 'd':
	case 'h':
	case 'j':
	case 'm':
	case 'q':
	case 'v':
	case 'x':
	case 'y':
	case 'z':	goto yy6;
	case 'C':
	case 'c':	goto yy8;
	case 'E':
	case 'e':	goto yy9;
	case 'F':
	case 'f':	goto yy10;
	case 'G':
	case 'g':	goto yy11;
	case 'I':
	case 'i':	goto yy12;
	case 'K':
	case 'k':	goto yy13;
	case 'L':
	case 'l':	goto yy14;
	case 'N':
	case 'n':	goto yy15;
	case 'O':
	case 'o':	goto yy16;
	case 'P':
	case 'p':	goto yy17;
	case 'R':
	case 'r':	goto yy18;
	case 'S':
	case 's':	goto yy19;
	case 'T':
	case 't':	goto yy20;
	case 'U':
	case 'u':	goto yy21;
	case 'W':
	case 'w':	goto yy22;
	case '|':	goto yy36;
	default:	goto yy56;
	}
yy2:
	++YYCURSOR;
	yych = *YYCURSOR;
	goto yy216;
yy3:
	{ return (TOKENIZER_NUMBER); }

//... more stuff

yy215:
	++YYCURSOR;
	yych = *YYCURSOR;
yy216:
	switch (yych) {
	case '0':
	case '1':
	case '2':
	case '3':
	case '4':
	case '5':
	case '6':
	case '7':
	case '8':
	case '9':	goto yy215;
	default:	goto yy3;
	}
}
So, you can see that the first part is stuff copied verbatim from our input file, as expected.  Then the tool squirts in its own stuff.  That stuff comprises a state machine that traverses the paths defined by the sequences of symbols towards a completed match, at which point it invokes the code we defined in the .re file.

For example, if we started with pulling a digit, then we would goto label yy2: that would pull another character and proceed to label yy216: that would test to see if it is also a digit.  If it is, it would proceed to yy215: that pulls another character and falls through to yy216: that checks for digits.  Ultimately, a non-digit will be pulled, which will cause transfer to yy3: that will emit the TOKENIZER_NUMBER symbol.  Pointers in the 'struct Scanner' instance are maintained along the way that can be used to demarcate where the matched pattern begins and ends.

Why is this useful?

An obvious benefit is that you can express the things you want in terms of patterns, rather than literal code.  In this case, much of tokenizer.c was matching against a list of string literals for keywords -- that is easy.  But there was also special code to match against quote-enclosed strings, variable names, and numeric sequences.  Those were manually coded.  For ubasic, this was easy enough, but what if we supported anything fancier, e.g. floating point numbers or scientific notation?  What a bear!

A perhaps less obvious benefit is that this is much more efficient than sequentially strcmp()'ing against a list of literals.  Not so much because of the magic of regular expressions, but more because of the magic of having a constant set of patterns against which to match (the language, lol).  Consider a pseudo-language whose keywords include, say:

	ABS
	AND
	ANDALSO

If one were to use the strcmp-and-loop approach with the input 'ANDALSO', what would happen would be something like:

	strcmp(input, "ABS")		fails at the second character
	strcmp(input, "AND")		fails at the fourth character
	strcmp(input, "ANDALSO")	matches

The salient point being that we compared the first character to 'A' three times, and the second and third to 'ND' twice.  If all this testing were reconfigured as a tree of possible symbol sequences, then those comparisons can be factored, and done fewer (and often once) times.

So, in a way, you can look at the individual patterns as defining a little symbol sequence graph, and the lexer generator tool as a mechanism for merging a collection of those graphs into a master graph.  After having done so, there are algorithms that may be applied for further reducing the complexity of that aggregate graph.

For anything of more than minor complexity, you probably should reconsider hand coding this stuff.  There are exceptions, of course.  With this tool in particular, there is probably going to be a code size burden that is higher than looping over a list of string literals and repeatedly calling a stdlib routine.  But how much of a burden?  I suspect that this tool would be hard-pressed to generate anything useful for a PIC18, but I also suspect that it could generate usable code for a LPC1114.

In the end, the re2c-generated lexer was emitted, and verified against the test program by checking the generated token sequence.  It took a few iterations to get all the patterns right.  But because I was newly hot on the trail of improving the jump_linenum() logic, I proceeded towards that implementation before settling down to profile this one.