• Context

    Yann Guidon / YGDES05/19/2020 at 16:31 0 comments

    Any technology has a "preferred" or "appropriate" application domain/range. The aligned string format has been thought to address needs that arise when dealing with ASCII-Z strings during mundane operations, mainly POSIX API and other low level operations that must remain lightweight.

    It is meant to be "somewhat" compatible with POSIX but would ideally replace it (if I had the chance to redo it all over). This is why the trailing NUL is "optional" : recommended for ease and compatibility for working with Linux for example, but would be dropped in later implementations.

    It is not meant as an internal representation format for language because memory management is not considered (though it could be used when encapsulated with pointers, reference counts and other necessary attributes)

    The fundamental element is the 8-bits byte, or uint8_t in today's C. This deals with ASCII and UTF-8 only. But since the proposed format deals nicely with NUL chars inside the string, this is not inherently limited to text.

    The "lightweight" aspect is important : the code must run on microcontrollers (where the usual POSIX string.h is considered a nuisance)

    The most frequently used string function is xxprintf() which internally often performs string concatenation with variable formats. Otherwise we would use write() directly, right ?

    strlen() is a nuisance : it shouldn't have to exist, at least as an O(n) function.

    To prevent buffer overflows, sprintf() must not be used when coding defensively (the default approach, normally). This is why we have the snprintf() variant and its siblings, but they give the result after the fact ! So we must allocate a potentially large buffer, check the result and eventually resize the buffer. If we had the length before the concatenation, we would check early for overflow then directly allocate the right buffer size.

    Concatenation is one of the most frequent operations on strings because we often need to join several fields to create a message, for example to display the respective value of several numbers or even build the header of a HTTP packet.

    Wikipedia describes many operations :

    but their significance differs substantially from a text processor for example. The typical POSIX functions perform them with close O(n) costs but a binary tree structure is naturally a more appropriate higher-level representation. This is why the aligned strings are only the lower layer of a string system and only few operations are performed on them (those that allow the above list to be implemented).

    To conserve memory and speed, operations must be performed in place as much as possible. Moving blocks around is painful. With the aligned strings, it's even more important because the alignment is critical. Fortunately it is easy to re-align the start of a string by splitting it and fixing/using the appropriate alignment and size in the small prefix block that is generated.

    Str8 and Str16 are a subset of the 4 possible LSB codes. If we want to design a practical binary tree system, the pointer itself must say if it points to a node or a leaf... so we then have 2 variants :

    • aligned aStr8 and aStr16, that can also encode aStr24 and aStrNode
    • unaligned uStr8 and uStr16 that may not be directly pointed to by aStrNode in one half of the cases. If there are 2 unallocated bytes in front of a uStr8, the prefix can be turned into aStr24, otherwise the string would be realigned with a split and the pointer would point to a aStrNode that points to the moved first part and the remaining part of the original string.

    The difference between aStr and uStr is only at the syntactical level because the representation is the same but it prevents confusion and hints the type checker of the compiler that there is a potential incompatibility (that can be bypassed by a cast if needed).     

    To be continued...

    ...

  • Dealing with re-alignment and 2 string types only

    Yann Guidon / YGDES05/17/2020 at 15:04 0 comments

    The version 0.1 has some little issues when splitting an existing string. Nothing severe but there is room for improvements.

    Oh and WHO uses strings longer than 64K ? If a block gets THAT long, there are many benefits to spitting it (such as : less frequent re-alignment).

    I've also been playing with the idea of adding an offset field to the header/descriptor but that would break the whole principle of "the pointer is the start of the character string area".

    I'm left with a couple of easy and practical formats :

    • Str8 has 1 byte of prefix and is odd-aligned (LSB=1)
    • Str16 has 2 bytes and is even aligned (LSB=0)

    Then the rules become even simpler :

    • If your string is odd-aligned, make it a Str8, otherwise make it a Str16.
    • if your string is longer than the format, you can split it into suitably aligned strings and store their pointers in a vector/array...

    Note :

    • When you split a string, you can control the alignment of the second sub-element by choosing the evenness of the size of the first sub-element.
    • You can make a Str16 shorter than 256 bytes because there is no smart-ass encoding to consider : the size field does not try to use every trick in the bag, such as adding a 255 offset as in serialised formats that boost entropy by avoiding any degenerate case.

    In #gdups  I define the strings with this structure :

    /* description d'un chemin */
    struct noeud_chemin {
      /* identifiant unique du répertoire : */
      dev_t st_dev;
      ino_t inode;
      /* pointeur pour aller vers la racine */
      struct noeud_chemin * parent;
      /* le nom */
      short int taille_nom;
      char nom[2];
    };
    

    The first versions of gdups declared "nom" as type "[]" so it was an array of characters of undefined/unbounded size. Later compilers complained and I had to define a minimal size, then I cast-ed the pointer to the start of nom. That's ugly... That is why even though I can describe the variable format as an union of structs, this restricts the code too much and I will not use the union in practice (also because the pointer does NOT point to the union itself but one of its elements).

    struct Str8_t {
      uint8_t size8;
      char str[] __attribute__((packed)) ;
    }
    
    struct Str16_t {
      uint16_t size16;
      char str[] __attribute__((packed)) ;
    }
    
    typedef union StrStruct {
      struct Str8_t;
      struct Str16_t;
    } StrStruct;

    The above code is descriptive but not practical because we don't use pointers to StrStruct : the pointer itself gives the type, without the need to check a "type" field in the struct.

    One benefit is that we can "prepare" or prefetch some code while the actual data is being fetched, so it increases parallelism.

    char * some_strptr; // this is not a pointer to the union !
    
    ...
    
    int size;
    char a = *some_strptr; // trigger the fetch from memory
    
    // meanwhile we can calculate the pointer and type of the size prefix
    if (some_strptr & 1) {
      // LSB=1 : odd pointer means str8
      size = * (uint8_t *) (some_strptr - 1)); // clear the LSB to get the prefix byte
    
      ...
    }
    else {
      // LSB=0 : even pointer means str16
      size = * (uint16_t *) (some_strptr - 1); // subtraction is necessary
    
      ...
    }

    This is pretty good for OOO cores. And to add more icing on the cake, most cores love loads with an immediate offset so it is good to not have boolean operations at the last stage of address calculation.

    In-order cores don't like to jump around so a linearised version is required. This version "re-aligns" the pointer :

    prefix_ptr= (some_strptr - 1) & (~1)
    • If some_strptr was odd/Str8, the -1 clears the LSB and the mask does nothing.
    • If some_strptr was even/Str16, the -1 "odds" the pointer, which then is "evened" by the mask.
    int size = * (uint16_t *) prefix_ptr; // some heavy casting here
    if (some_strptr & 1)
      size &= 0xFF;  // keep only the lower byte (LittleEndian assumed)

     In assembly language, this can be reduced to only a few instructions if the processor supports predicates. The #YASEP Yet Another Small Embedded Processor  has a strange "load" method that drops the LSB so this...

    Read more »