
Line Number Cache

A project log for UBASIC And The Need For Speed

I basic, 'ubasic', they basic, we basic; wouldn't you like a speedy BASIC, too?

ziggurat29 06/20/2018 at 16:53

Jun-16 12:09 PM

As mentioned before, the core 'goto' implementation is in jump_linenum(), and works by restarting parsing of the entire program until the desired line is found.  Here is a compacted form:

static void jump_linenum(int linenum)
	{
	tokenizer_init(program_ptr);
	while(tokenizer_num() != linenum)
		{
		do
			{
//... eat tokens; emit an error if we reach end-of-stream
			}
		while(tokenizer_token() != TOKENIZER_NUMBER);
		}
	}

I did not want to do a major re-engineering of ubasic, so I decided to keep the rest of the system as-is and only rework this method.

The strategy I used here was to make a 'cache' of line-number-to-program-text-offset mappings.  Then, when this function is invoked, the cache is consulted to find the offset into the program text, and the parser is set up at that point.  Because tokenizer_init() is a cheap call, I used it to set the parser up by simply pretending that the program starts at the target line we found, rather than fiddling with the internals of the parser state.  A hack, but a convenient one; ultimately this exercise was to prove the concept that improving jump_linenum() would have a material impact, and to measure how much.

For simplicity, my 'cache' is just an array of structs, my 'lookup' is a linear search, and the structs are allocated as a fixed block.  In a production system, this could be improved in all sorts of ways, of course, but here I just wanted to see if it was a worthwhile approach.

The struct I used is simple, and a 'setup' method was added that is called at program initialization time to populate the cache:

//HHH a line number lookup cache; this is just proof-of-concept junk
typedef struct
{
    uint16_t _nLineno;    //the line number
    uint16_t _nIdx;       //offset into program text
} LNLC_ENT;
LNLC_ENT g_nLines[1000];    //mm-mm-mm. just takin' memory like it's free, and limiting the program to 1000 lines
int g_nLineCount = -1;
#define COUNTOF(arr) (sizeof(arr)/sizeof((arr)[0]))

void __setupGotoCache(void)
{
    tokenizer_init ( program_ptr );
    g_nLineCount = 0;    //the beginning is a very delicate time

    while ( tokenizer_token () != TOKENIZER_ENDOFINPUT )
    {
        int nLineNo;
        const char* pTok;
        const char* __getAt (void);    //(added to tokenizer.c)

        //we must be on a line number
        if ( tokenizer_token () != TOKENIZER_NUMBER )
        {
            g_nLineCount = -1;    //purge 'goto' cache
            sprintf ( err_msg, "Unspeakable horror\n" );
            stdio_write ( err_msg );
            tokenizer_error_print ();
            longjmp ( jbuf, 1 );
        }
        //remember where we are
        nLineNo = tokenizer_num ();
        pTok = __getAt ();
        //create an entry
        g_nLines[g_nLineCount]._nLineno = nLineNo;
        g_nLines[g_nLineCount]._nIdx = pTok - program_ptr;

        //whizz on past this line
        while ( tokenizer_token () != TOKENIZER_ENDOFINPUT &&
                tokenizer_token () != TOKENIZER_CR )
        {
            tokenizer_next ();
        }
        if ( TOKENIZER_CR == tokenizer_token () )
        {
            tokenizer_next ();
        }

        //next entry
        ++g_nLineCount;
        //too many lines?
        if ( g_nLineCount >= COUNTOF ( g_nLines ) )
        {
            g_nLineCount = -1;    //purge 'goto' cache
            sprintf ( err_msg, "Program too long to cache!\n" );
            stdio_write ( err_msg );
            tokenizer_error_print ();
            longjmp ( jbuf, 1 );
        }
    }
    //did we get anything?  if not, purge 'goto' cache
    if ( 0 == g_nLineCount )    //pretty boring program!
        g_nLineCount = -1;      //purge 'goto' cache

    //be kind, rewind
    tokenizer_init ( program_ptr );
}

So, we simply whizz through the program, tossing away tokens (much like the original jump_linenum() implementation), and take note of the text offset each time we reach a start-of-line.  But...  we do this only once, instead of every time there is a transfer-of-control.
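To make the idea concrete, here is a minimal standalone sketch (not the ubasic code itself; the names and the toy program are mine) that builds the same kind of line-number-to-offset table over a program image held in a string, using plain C library calls instead of the tokenizer:

/* Standalone illustration (not part of ubasic): build a line-number -> offset
   table over a program image held in a string, in one pass, front to back. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct { unsigned short lineno; unsigned short idx; } ENT;

int main(void)
{
    const char *prog = "10 gosub 100\n20 for i = 1 to 3\n30 next i\n40 end\n100 return\n";
    ENT cache[16];
    int count = 0;
    const char *p = prog;

    while ( *p != '\0' && count < (int)(sizeof(cache)/sizeof(cache[0])) )
    {
        char *after;
        long n = strtol ( p, &after, 10 );    //each line starts with its number
        cache[count].lineno = (unsigned short)n;
        cache[count].idx    = (unsigned short)(p - prog);
        ++count;
        p = strchr ( after, '\n' );           //skip the rest of the line
        if ( NULL == p )
            break;
        ++p;                                  //step past the newline
    }

    for ( int i = 0; i < count; ++i )
        printf ( "line %u at offset %u\n", cache[i].lineno, cache[i].idx );
    return 0;
}

The real __setupGotoCache() does the same bookkeeping, except it walks the token stream so that it lands on line numbers the same way the interpreter does.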

The jump_linenum() implementation was altered thusly:

/*static*/ void jump_linenum(int linenum)
{
    //alternative implementation uses line number lookup cache
    int nIdx;
    for ( nIdx = 0; nIdx < g_nLineCount; ++nIdx )
    {
        //linear scan; a binary search would be better, but this is proof-of-concept
        if ( g_nLines[nIdx]._nLineno == linenum )
        {
            //our hacky 'goto' implementation is to re-init the tokenizer
            //at the start of the line we found.  This works, and saves a
            //lot of haquery in the tokenizer module to reposition within
            //the existing program image.
            tokenizer_init ( &program_ptr[g_nLines[nIdx]._nIdx] );
            break;
        }
    }
    if ( nIdx == g_nLineCount )
    {
        sprintf ( err_msg, "Unreachable line %d called from %d\n", linenum, last_linenum );
        stdio_write ( err_msg );
        tokenizer_error_print ();
        longjmp ( jbuf, 1 );    //jumps back to where setjmp was called - making setjmp now return 1
    }
}

Ta-da!

So, obviously there are numerous improvements that can/should be made.  Some that come to mind are sizing the cache to the actual program (or at least allocating it dynamically) rather than grabbing a fixed block of 1000 entries, and replacing the linear scan of the cache with a binary search, since the entries are recorded in line-number order anyway.  A sketch of the binary-search variant is below.
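For instance, here is a hedged sketch (my names, not ubasic's) of what a bsearch()-based lookup over the cache could look like, assuming the entries were recorded in ascending line-number order, which holds when the program text itself is sorted by line number:

/* Sketch only: binary-search lookup over a cache sorted by line number.
   Assumes entries were recorded in ascending order, which holds if the
   program text itself is in line-number order. */
#include <stdlib.h>

typedef struct { unsigned short lineno; unsigned short idx; } LNLC_ENT;

static int cmp_lineno(const void *key, const void *elem)
{
    unsigned short k = *(const unsigned short *)key;
    unsigned short e = ((const LNLC_ENT *)elem)->lineno;
    return (k > e) - (k < e);    //-1, 0, or +1 without overflow
}

/* returns the program-text offset of 'linenum', or -1 if not found */
static long lookup_offset(const LNLC_ENT *cache, size_t count, unsigned short linenum)
{
    const LNLC_ENT *hit = bsearch ( &linenum, cache, count, sizeof(*cache), cmp_lineno );
    return hit ? (long)hit->idx : -1;
}

jump_linenum() would then call something like lookup_offset() and, on a hit, do the same tokenizer_init ( &program_ptr[offset] ) trick as the linear version.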

But this should be good enough for now.  On to quantifying the impact of this stuff...
