HTML, It’s Bigger on the Inside

From late March 2014

Before I solved the problem of unreliable downloading [1], I did a lot of messing around with linear and ring buffers to try and speed things up. I had some nice code for a ring buffer and was happily using that for the incoming HTML but, like the One Ring, it had the same problem. It was too heavy (in terms of KBs used).

So I had to toss that code into the volcano.

Then I had a radical thought, as mentioned in my previous post: Don't use a buffer at all. I thought I might just parse the HTML as it streamed into the Arduino.

It was at this point that I actually thought to do a web search to see if anyone else had created some HTML parsing code for an embedded CPU.

Of course no-one had. It's a silly thing to do. [2]

I needed something, not only for compact storage but it's much easier and faster to check the match of a single numerical value versus some arbitrary length string.

So, roll up sleeves and get thinking. Some kind of hash like md5 seemed in order. I had some criteria, which meant I couldn't use 'big' hashes. No, that would just be too easy.

It needed to be:

Fast to run on an Arduino (no complex maths)
Able to return a numerical hash
Able to fit in one byte, ideally, but definitely no more than two bytes

On the plus side, it only needs to return unique hash codes for the set of HTML tags I would be implementing.

After a bit of thinking, referring to HTML tags and experimental hacking I settled on the following simple function, which iterates over each char (c) of a string.

tagHash = (tagHash << 1) + ((int) toLowerCase (c)) - 32;

ASCII values less that 32 are not used, so most of the time the character values will be between 0 - 94 and HTML is not case sensitive anyway. [3]

The right-hand side of the function handles uniqueness between similar valued characters and the left-hand side handles spread for longer tags (by doubling the current tag value). This results in integer magnitude values rather than bytes, but I could live with that. A function to squeeze HTML into a single byte would be too complex.

Yes, with hind-sight, it would have been better to convert all characters to upper case, rather than lower case. This would have resulted in an input range of 0-64. [4]

Note that the enclosing angle brackets < > of a tag are not included in the hash (but I often include them in code comments for clarity). Not only are they not unique, but ignoring them allows me to do some interesting stuff later on.

Once I had the hashed HTML tags, I could use them for logic or match and replace them with a special ASCII code for writing to the SD card.

It's about time for some actual code. My definitions and conversion tables are below.

// Use non-printable ASCII characters for tag codes

#define TAG_CR       14
#define TAG_HEADING1 15
#define TAG_HEADING2 16
#define TAG_BOLD1    17
#define TAG_BOLD2    18
#define TAG_LINK1    19
#define TAG_LINK2    20
#define TAG_LINK3    21
#define TAG_HR       22
#define TAG_LIST     23
#define TAG_PRE      24
#define TAG_PRE2     25
#define TAG_HTTP     26

// Hash -> tag & flag mappings

#define NUMTAGS 76
static const uint16_t tag_codes[] PROGMEM = {
161, TAG_HEADING1, // <h1>
162, TAG_HEADING1, // <h2>
163, TAG_HEADING1, // <h3>
164, TAG_HEADING1, // <h4>
80,  TAG_CR,       // <p>
226, TAG_HR,       // <hr>
246, TAG_CR,       // <ul>
234, TAG_CR,       // <ol>
225, TAG_LIST,     // <li>
553, TAG_PRE,      // <pre>
221, TAG_HEADING2, // </h1>
222, TAG_HEADING2, // </h2>
223, TAG_HEADING2, // </h3>
224, TAG_HEADING2, // </h4>
443, TAG_CR,       // <br/>
214, TAG_CR,       // <br>
110, TAG_CR,       // </p>
294, TAG_CR,       // </ol>
306, TAG_CR,       // </ul>
285, TAG_CR,       // </li>
673, TAG_PRE2,     // </pre>
66,  TAG_BOLD1,    // <b>
96,  TAG_BOLD2,    // </b>
5199, TAG_BOLD1,   // <strong>
6159, TAG_BOLD2,   // </strong>
65,   TAG_LINK1,   // <a>
2253, TAG_LINK2,   // URL
95,   TAG_LINK3,   // </a>

1428, 38,  // Ampersand -> &
2682, 96,  // 8216; Curly single quote -> `
2683, 39,  // 8217; Curly single quote -> '
6460, 128, // nbsp;
2680, 34,  // 8220; Curly double quotes -> "
2681, 34,  // 8221; Curly double quotes -> "
2677, 45,  // 8211; Hyphens en -> -
2678, 45,  // 8212; Hyphens em -> -
368,  62,  // Greater than -> >
388,  60   // Less than -> <
};

While I was at it, there was something else I'd need to hash in a similar way. Every ampersand escaped HTML character would also need to be converted to a chosen ASCII equivalent.

An advantage of using this table for matching is that I can also use it to map any type of heading tag <h1>, <h2>, etc into just on heading tag. The table above also gives clues to the tags I was going to support for output.

Now, as the characters stream in I can flag a '<' character and start converting all following characters into the hashed value. If I hit a break character I set another flag and stop hashing. After reading a '>' character I can output the hash value, reset the flags and output plain text again.

Yeah, that should work.

[1] Or made it slightly LESS unreliable.

[2] Refer to my first post.

[3] See my last post. Those codes are for HTML tags aliases.

[4] There's no way I'm going back to change all my HTML hash references and changing the function now. It works as it.

Leaves on the line

A picture is worth 2 re-writes

Discussions

Become a Hackaday.io Member