
here's an idea... parsing

A project log for limited-code hacks/ideas

When you're running out of code-space... sometimes yah's gots to hack. Here are some ideas.

Eric Hertz, 12/09/2016 at 17:34 (5 Comments)

This is just a random-realization while working on my project... maybe it's obvious to everyone in-the-know.

Say you're parsing something, like commands from a terminal-window...

Say you've got a whole bunch of commands, but they mostly all follow the same handful of formats, like Ax and Ay, Bx and By, etc.

Lemme think of an example...

Say motor commands (terminated with '\n'):

SMn = stop motor number n (where n is one character, 0-9)

AMnx = advance motor number n by x millimeters (where x is any number)

FMnx = move motor number n forwards at x mm/sec

and so-forth.

One could, obviously, parse each character as it comes through, then do a whole bunch of nested if-then statements.

Another idea is to combine the first and second characters into a single 16-bit variable, then use a switch() statement. Maybe not ideal for *this* example, but I've found it useful at times.

So, that's one consideration, here's another:

Say this motor-system also has LED-commands:

BLnx = blink LED n x times per second

Obviously the n and the x, here, don't apply to a motor...

So, then, the whole nested-if statement thing makes sense, again, right?

But we're worried about *size* here, not speed... (baud-rate's way slower than your processor, right?)

So, then, maybe it makes sense to parse 'n' and 'x' and store them in argument-variables, and only *after that* handle the actual Command characters.

"But wait! 'SMn' doesn't have an x!"... Right... but here's the idea:

Say everything's stored in a string-buffer... and whatever sits in the buffer after the '\n' may very well be leftover data from a previous command... But your numeric-parser for x terminates as soon as anything non-numeric comes through (the '\n', or just replace the '\n' with '\0' and it'll terminate there instead)... For 'SMn' that happens right away, so the variable for argument x gets filled with 0, but even that doesn't matter, because it's not being used in this case...
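A minimal sketch of that kind of number-parser, assuming unsigned decimal integers only. The post never shows parseNumber() itself, so this body is a guess at one way it could behave, not the project's actual code:

```c
#include <stdint.h>

/* Hypothetical integer stand-in for the post's parseNumber():
 * accumulate decimal digits and stop at the first non-digit
 * ('\n', '\0', a letter, anything). If there are no digits at
 * all, the loop never runs and the result is simply 0. */
static uint16_t parseNumber(const char *s)
{
    uint16_t result = 0;

    while (*s >= '0' && *s <= '9') {
        /* Overflow is not checked; it wraps modulo 65536. */
        result = (uint16_t)(result * 10u + (uint16_t)(*s - '0'));
        s++;
    }
    return result;
}
```

With "SM3\n" in the buffer, parseNumber(&string[3]) lands directly on the '\n' and hands back 0, which the stop-motor case then ignores.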

Then why parse it if it's not even part of the command, and isn't even there in the first place?

So, first, here it is without parsing everything up-front (each case parses x only if it needs it)...

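// Pack two command characters into one 16-bit value, usable as a case label
#define commandify(a,b) \
        ((uint16_t)(a) | ((uint16_t)(b))<<8)

uint16_t command = commandify(string[0], string[1]);

// n is a single digit; x (if the command has one) starts right after it
uint8_t deviceNum = string[2] - '0';
char *value = &(string[3]);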

switch(command)
{
    case commandify('S','M'):
        motor_stop(deviceNum);
        break;
    case commandify('A','M'):
        mm = parseNumber(value);
        motor_advance(deviceNum, mm);
        break;
    ...
    case commandify('B','L'):
        rate = parseNumber(value);
        led_blink(deviceNum, rate);
        break;
    ...
}

So, now, for the example described, you've either got to call 'parseNumber()' three times for the three commands that use it, or explicitly handle the 'S' case separately from the switch (makes sense, unless there are *several* such cases, in which case your test becomes quite large, maybe even a second switch-statement).

Or, just parse the number from the start, and don't use it if you don't need to.

And, let's make it even more interesting, what if there's another command:

PSs = Print string s

Could *still* call parseNumber, *and* fill deviceNum (both with garbage) and have a really simple switch-statement (maybe even a lookup-table, at this point):

[Note 2023: WHOOPS! This is glitchy!]

#define commandify(a,b) \
        ((uint16_t)(a) | ((uint16_t)(b))<<8)

uint16_t command = commandify(string[0], string[1]);

uint8_t deviceNum = string[2] - '0';

//### No way you're gonna get away with floats
// in a 1K project, without an FPU ;)
// x starts at string[3], right after the device digit
float value = parseNumber(&(string[3]));

switch(command)
{
    case commandify('S','M'):
        motor_stop(deviceNum);
        break;
    case commandify('A','M'):
        motor_advance(deviceNum, value);
        break;
    ...
    case commandify('B','L'):
        led_blink(deviceNum, value);
        break;
    ...
    case commandify('P','S'):
        printf("%s", &(string[2]));
        break;
    ...
}
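The lookup-table variant mentioned above could look something like the following. This is only a sketch: the handler names are stubs invented here (they just record their arguments), the value is parsed as an integer rather than a float, and the check for a short line is an added assumption about how the buffer is terminated.

```c
#include <stddef.h>
#include <stdint.h>

#define commandify(a, b) ((uint16_t)(a) | ((uint16_t)(b) << 8))

/* Every handler takes the same pre-parsed arguments and simply
 * ignores the ones its command has no use for. */
typedef void (*handler_t)(uint8_t deviceNum, uint16_t value, const char *text);

/* Hypothetical stub handlers: they only record what they were given. */
static char        lastCommand;
static uint8_t     lastDevice;
static uint16_t    lastValue;
static const char *lastText;

static void record(char c, uint8_t n, uint16_t v, const char *t)
{
    lastCommand = c; lastDevice = n; lastValue = v; lastText = t;
}

static void motorStop(uint8_t n, uint16_t v, const char *t)    { record('S', n, v, t); }
static void motorAdvance(uint8_t n, uint16_t v, const char *t) { record('A', n, v, t); }
static void ledBlink(uint8_t n, uint16_t v, const char *t)     { record('B', n, v, t); }
static void printString(uint8_t n, uint16_t v, const char *t)  { record('P', n, v, t); }

/* One row per command: the packed two-character code and its handler. */
static const struct {
    uint16_t  code;
    handler_t handler;
} commandTable[] = {
    { commandify('S', 'M'), motorStop    },
    { commandify('A', 'M'), motorAdvance },
    { commandify('B', 'L'), ledBlink     },
    { commandify('P', 'S'), printString  },
};

/* Returns 1 if the command was found and run, 0 otherwise.
 * Assumes 'line' is NUL-terminated. */
static int dispatch(const char *line)
{
    uint16_t command;
    uint16_t value = 0;
    uint8_t deviceNum = 0;
    const char *p;
    size_t i;

    if (line[0] == '\0' || line[1] == '\0')
        return 0;                       /* too short to hold a command */

    command = commandify((uint8_t)line[0], (uint8_t)line[1]);

    /* Parse n and x unconditionally; for commands that lack them these
     * are just garbage (or 0) that the handler never looks at. */
    if (line[2] != '\0') {              /* don't walk past the terminator */
        deviceNum = (uint8_t)(line[2] - '0');
        for (p = &line[3]; *p >= '0' && *p <= '9'; p++)
            value = (uint16_t)(value * 10u + (uint16_t)(*p - '0'));
    }

    for (i = 0; i < sizeof commandTable / sizeof commandTable[0]; i++) {
        if (commandTable[i].code == command) {
            commandTable[i].handler(deviceNum, value, &line[2]);
            return 1;
        }
    }
    return 0;
}
```

Adding a command is then one more table row plus its handler, and dispatch() never changes. Whether the table actually comes out smaller than the switch depends on the compiler and the number of commands, so it's worth checking the map file both ways.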

Discussions

Bharbour wrote 03/26/2023 at 23:58

A motor control project that I worked on back in the mid 2000's used a DSP processor (a Mot 56000 variant) that did not support byte addressing, only 16 bit quantities. I did something similar to this rather than waste half the text constant space. That is the only processor that I have worked on that does not support byte addressing.


Eric Hertz wrote 03/27/2023 at 01:13

Oh, interesting point! So it'd usually store strings in the code-space as 16bits per character? Wait, does that mean if you declared a uint8_t in RAM it'd actually put it in 16 bits?!

Now that you mention it, I wonder how AVRs pull off 8bit data in the code-space... I'm almost certain I recall the code-space addresses are two bytes apiece. I'll have to look that up.

Update: heh, "special handling"... the AVR's LPM instruction takes a byte address in the Z pointer, and the lowest bit of that address selects the high/low byte of the 16bit program word. IOW, it uses byte addresses.


Eric Hertz wrote 03/27/2023 at 01:20

Gah! This opens up a HUGE can of worms. If a uint8_t is actually stored in a 16bit memory location, what happens with wraparound math?! E.g.

uint8_t a =0xff;

a++;

printf("%d, %d", (int)a, *((int*)&a));


Bharbour wrote 03/27/2023 at 02:22

Yes, it put a uint8_t variable in a 16 bit space. I only did one project with that processor before work/life got in the way. When I got back to those kinds of home projects, the ARM processors were showing up with quadrature encoders and I never went back to it.

I don't remember how it dealt with byte overflows. Probably with something like sign extension.


Eric Hertz wrote 03/22/2023 at 05:21

Just ran into this again the other day...

Ran some experiments with typecasting the data at the string's start-address directly to a uint16_t, thus the uint16 contains both the characters already shifted.

It's not easy, gcc doesn't like the idea unless you typecast pointers in between, but it is doable. But, beware of [I'm guessing] WHY gcc doesn't make it easy:

Endianness varies from machine to machine... That's the more obvious caveat.

Sometimes endianness is not an all-big vs. all-little thing; some procs use big for some things and little for others! E.g. what direction do its stack addresses go when items, like a bunch of characters in a string, are added?

And, next we have bit-width and alignment.

E.g. in a 32bit proc you might find that 'char string[5]' always starts aligned at a 4byte boundary, which would probably work great with the typecasting. But what if it doesn't, and instead starts at the last byte of a 4byte word? Can that address be typecast as a uint16 extending past the 4byte boundary? I believe this is processor-dependent. Some might allow it, and issue two separate reads, whereas the compiler for another might complain. I dunno.

It's definitely NOT *portable* to do it this way.

The way shown in the code, above, should be.

But on an 8bitter it likely also means copying the two bytes, individually, into two new registers rather than just using the ones it's already in, so it's likely not as efficient. I dunno... Maybe not, if the string's in RAM anyhow, it'd have to copy to regs for comparison anyhow, eh? Might be *exactly* as-efficient, either way. With one's being architecture-independent and the other's not. You decide!
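For comparison, here's a rough sketch of the two approaches side by side. It swaps the pointer typecast for memcpy(), which is one common way to do a raw two-byte load without the alignment and pointer-cast trouble; the function names are made up for illustration. The byte-order dependence stays either way:

```c
#include <stdint.h>
#include <string.h>

/* Portable: the shift fixes the byte order, so 'S','M' packs to the
 * same number on every machine (this is what commandify() does). */
static uint16_t packShift(const char *s)
{
    return (uint16_t)((uint8_t)s[0] | ((uint16_t)(uint8_t)s[1] << 8));
}

/* Raw load: memcpy() copies the two bytes exactly as they sit in
 * memory. No alignment requirement and no pointer typecast, but the
 * resulting number depends on the machine's byte order. */
static uint16_t packRaw(const char *s)
{
    uint16_t v;
    memcpy(&v, s, sizeof v);
    return v;
}

/* Returns 1 on a little-endian machine, 0 otherwise. */
static int isLittleEndian(void)
{
    const uint16_t probe = 1;
    uint8_t first;
    memcpy(&first, &probe, 1);
    return first == 1;
}
```

On a little-endian machine packRaw("SM") matches packShift("SM"); on a big-endian one the two characters come out swapped, so the case labels would have to be built the other way around.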
