Dealing with re-alignment and 2 string types only

Version 0.1 has a few minor issues when splitting an existing string. Nothing severe, but there is room for improvement.

Oh, and WHO uses strings longer than 64K? If a block gets THAT long, there are many benefits to splitting it (such as less frequent re-alignment).

I've also been playing with the idea of adding an offset field to the header/descriptor but that would break the whole principle of "the pointer is the start of the character string area".

I'm left with a couple of easy and practical formats :

Str8 has 1 byte of prefix and is odd-aligned (LSB=1)
Str16 has 2 bytes of prefix and is even-aligned (LSB=0)

Then the rules become even simpler :

If your string is odd-aligned, make it a Str8, otherwise make it a Str16.
If your string is longer than the format allows, you can split it into suitably aligned strings and store their pointers in a vector/array...
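
As a minimal sketch of the first rule: the helper below writes the size prefix just before the characters and picks the format from the parity of the destination pointer. The name str_store is made up for illustration, the caller is assumed to have reserved the prefix bytes in front of dst, and the Str16 prefix is stored in the host's native byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy len characters to dst and write the size
   prefix in the 1 or 2 bytes just before dst.
   Assumption: the caller reserved those bytes. */
static char *str_store(char *dst, const char *src, size_t len)
{
    if ((uintptr_t)dst & 1) {
        /* odd pointer: Str8, 1-byte prefix at dst - 1 */
        assert(len <= UINT8_MAX);
        *(uint8_t *)(dst - 1) = (uint8_t)len;
    } else {
        /* even pointer: Str16, 2-byte prefix at dst - 2 */
        assert(len <= UINT16_MAX);
        uint16_t n = (uint16_t)len;
        memcpy(dst - 2, &n, sizeof n);  /* native byte order */
    }
    memcpy(dst, src, len);
    return dst;  /* still "the start of the character string area" */
}
```

The returned pointer is the only thing the rest of the code needs to keep: its LSB says which prefix sits in front of it.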

Note :

When you split a string, you can control the alignment of the second sub-element by choosing the evenness of the size of the first sub-element.
You can make a Str16 shorter than 256 bytes because there is no smart-ass encoding to consider : the size field does not try to use every trick in the bag, such as adding an offset of 255, as some serialised formats do to boost entropy by avoiding degenerate cases.
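
The parity trick can be sketched as follows, under one assumption that is mine and not part of the format: the sub-strings are packed back to back, so the prefix of the next sub-string starts right after the last character of the current one. In that layout the next prefix must land on an even address for both formats (Str8 : even prefix byte, odd characters; Str16 : even 2-byte prefix, even characters). The helper name split_len is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: p is where the characters of the current
   sub-string start and wanted is the length we would like to give it.
   Returns a length that fits the format implied by p and that leaves
   p + length on an even address, ready for the next prefix. */
static size_t split_len(const char *p, size_t wanted)
{
    /* the parity of p selects the format, hence the maximum size */
    size_t max = ((uintptr_t)p & 1) ? UINT8_MAX : UINT16_MAX;
    size_t n = wanted < max ? wanted : max;

    assert(wanted >= 2);  /* so that trimming never yields 0 */
    if (((uintptr_t)p + n) & 1)
        n--;              /* flip the parity of the size */
    return n;
}
```

With this layout a Str8 always gets an odd size and a Str16 an even one; the characters of the next sub-string then start at p + n + 1 (odd, Str8) or p + n + 2 (even, Str16), whichever the caller prefers.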

In #gdups I define the strings with this structure :

/* description of a path */
struct noeud_chemin {
  /* unique identifier of the directory : */
  dev_t st_dev;
  ino_t inode;
  /* pointer towards the root */
  struct noeud_chemin * parent;
  /* the name */
  short int taille_nom;
  char nom[2];
};

The first versions of gdups declared "nom" with type "[]", so it was an array of characters of unspecified size. Later compilers complained and I had to declare a minimal size, then cast the pointer to the start of nom. That's ugly... That is why, even though I can describe the variable format as a union of structs, this restricts the code too much and I will not use the union in practice (also because the pointer does NOT point to the union itself but to one of its elements).

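struct Str8_t {
  uint8_t size8;
  char str[];
} __attribute__((packed));

struct Str16_t {
  uint16_t size16;
  char str[];
} __attribute__((packed));

/* descriptive only : a union of structs that end with a
   flexible array member relies on a compiler extension */
typedef union StrStruct {
  struct Str8_t s8;
  struct Str16_t s16;
} StrStruct;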

The above code is descriptive but not practical because we don't use pointers to StrStruct : the pointer itself gives the type, without the need to check a "type" field in the struct.

One benefit is that we can "prepare" or prefetch some code while the actual data is being fetched, so it increases parallelism.

char * some_strptr; // this is not a pointer to the union !

...

int size;
char a = *some_strptr; // trigger the fetch from memory

// meanwhile we can calculate the pointer and type of the size prefix
if ((uintptr_t) some_strptr & 1) {
  // LSB=1 : odd pointer means Str8
  size = * (uint8_t *) (some_strptr - 1); // the -1 clears the LSB to get the prefix byte

  ...
}
else {
  // LSB=0 : even pointer means str16
  size = * (uint16_t *) (some_strptr - 2); // step back over the 2-byte prefix

  ...
}

This is pretty good for OOO cores. And to add more icing on the cake, most cores love loads with an immediate offset so it is good to not have boolean operations at the last stage of address calculation.
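
For reference, the branching lookup can be written as a small compilable function. This is only a sketch: the name str_size is made up, the pointer-to-integer cast is assumed to preserve the low address bit (true on mainstream flat-memory targets), and the Str16 prefix is read with memcpy in the host's byte order to stay clear of aliasing issues.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: return the size stored in front of the
   characters, using only the LSB of the pointer to pick the format. */
static unsigned str_size(const char *s)
{
    assert(s != NULL);
    if ((uintptr_t)s & 1) {
        /* LSB=1 : Str8, the prefix is the byte just before s */
        return *(const uint8_t *)(s - 1);
    } else {
        /* LSB=0 : Str16, the prefix is the 2 bytes before s */
        uint16_t n;
        memcpy(&n, s - 2, sizeof n);
        return n;
    }
}
```

Both loads use a constant offset from the string pointer (-1 or -2), which is the "immediate offset" addressing mentioned above.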

In-order cores don't like to jump around so a linearised version is required. This version "re-aligns" the pointer :

uintptr_t prefix_ptr = ((uintptr_t) some_strptr - 1) & ~(uintptr_t) 1;

If some_strptr was odd (Str8), the -1 clears the LSB and the mask does nothing.
If some_strptr was even (Str16), the -1 makes the pointer odd, and the mask then rounds it down to the even address 2 bytes before the string.

int size = * (uint16_t *) prefix_ptr; // some heavy casting here
if ((uintptr_t) some_strptr & 1)
  size &= 0xFF;  // keep only the lower byte (LittleEndian assumed)
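
Here is a sketch of the linearised lookup as a self-contained function, with the final test replaced by a mask so that no branch is left. The name str_size_linear is hypothetical. It assumes a little-endian host, as above, and that the byte after a Str8 prefix is readable, since the 16-bit load also picks up the first character of the string.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: branch-free size lookup.
   Assumptions: little-endian host, first character readable. */
static unsigned str_size_linear(const char *s)
{
    uintptr_t addr = (uintptr_t)s;
    /* "re-align" : odd -> addr - 1, even -> addr - 2 */
    uintptr_t prefix_addr = (addr - 1) & ~(uintptr_t)1;
    uint16_t raw;

    assert(s != NULL);
    memcpy(&raw, (const void *)prefix_addr, sizeof raw);

    /* 0x00FF for an odd pointer (Str8), 0xFFFF for an even one (Str16) */
    unsigned mask = 0xFFFFu >> ((addr & 1) << 3);
    return raw & mask;
}
```

For a Str8 the upper byte of the 16-bit word is the first character of the string, which the mask throws away.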

In assembly language, this can be reduced to only a few instructions if the processor supports predicates. The #YASEP (Yet Another Small Embedded Processor) has a strange "load" method that drops the LSB, so this would be even simpler.

Using only the LSB of the pointer simplifies a LOT of things and removes a lot of duplicate code and corner cases.

But we are left with no higher-level view of a composite, multi-part string : this would be a different layer of code to design.
