Dealing with re-alignment and 2 string types only

A project log for byte string format

Is it novel or not ? I won't wait for prior art to study it and make my own library.

Yann Guidon / YGDESYann Guidon / YGDES 05/17/2020 at 15:040 Comments

The version 0.1 has some little issues when splitting an existing string. Nothing severe but there is room for improvements.

Oh and WHO uses strings longer than 64K ? If a block gets THAT long, there are many benefits to spitting it (such as : less frequent re-alignment).

I've also been playing with the idea of adding an offset field to the header/descriptor but that would break the whole principle of "the pointer is the start of the character string area".

I'm left with a couple of easy and practical formats :

Then the rules become even simpler :

Note :

In #gdups  I define the strings with this structure :

/* description d'un chemin */
struct noeud_chemin {
  /* identifiant unique du répertoire : */
  dev_t st_dev;
  ino_t inode;
  /* pointeur pour aller vers la racine */
  struct noeud_chemin * parent;
  /* le nom */
  short int taille_nom;
  char nom[2];

The first versions of gdups declared "nom" as type "[]" so it was an array of characters of undefined/unbounded size. Later compilers complained and I had to define a minimal size, then I cast-ed the pointer to the start of nom. That's ugly... That is why even though I can describe the variable format as an union of structs, this restricts the code too much and I will not use the union in practice (also because the pointer does NOT point to the union itself but one of its elements).

struct Str8_t {
  uint8_t size8;
  char str[] __attribute__((packed)) ;

struct Str16_t {
  uint16_t size16;
  char str[] __attribute__((packed)) ;

typedef union StrStruct {
  struct Str8_t;
  struct Str16_t;
} StrStruct;

The above code is descriptive but not practical because we don't use pointers to StrStruct : the pointer itself gives the type, without the need to check a "type" field in the struct.

One benefit is that we can "prepare" or prefetch some code while the actual data is being fetched, so it increases parallelism.

char * some_strptr; // this is not a pointer to the union !


int size;
char a = *some_strptr; // trigger the fetch from memory

// meanwhile we can calculate the pointer and type of the size prefix
if (some_strptr & 1) {
  // LSB=1 : odd pointer means str8
  size = * (uint8_t *) (some_strptr - 1)); // clear the LSB to get the prefix byte

else {
  // LSB=0 : even pointer means str16
  size = * (uint16_t *) (some_strptr - 1); // subtraction is necessary


This is pretty good for OOO cores. And to add more icing on the cake, most cores love loads with an immediate offset so it is good to not have boolean operations at the last stage of address calculation.

In-order cores don't like to jump around so a linearised version is required. This version "re-aligns" the pointer :

prefix_ptr= (some_strptr - 1) & (~1)
int size = * (uint16_t *) prefix_ptr; // some heavy casting here
if (some_strptr & 1)
  size &= 0xFF;  // keep only the lower byte (LittleEndian assumed)

 In assembly language, this can be reduced to only a few instructions if the processor supports predicates. The #YASEP Yet Another Small Embedded Processor  has a strange "load" method that drops the LSB so this would be even more simple.

Using only the LSB of the pointer simplifies a LOT of things and removes a lot of duplicate code and corner cases.

But we are left with no higher-level view of a composite, multi-part string : this would be a different layer of code to design.