Is it novel or not ? I won't wait for prior art to study it and make my own library.
To make the experience fit your profile, pick a username and tell us what interests you.
We found and based on your interests.
Any technology has a "preferred" or "appropriate" application domain/range. The aligned string format has been thought to address needs that arise when dealing with ASCII-Z strings during mundane operations, mainly POSIX API and other low level operations that must remain lightweight.
It is meant to be "somewhat" compatible with POSIX but would ideally replace it (if I had the chance to redo it all over). This is why the trailing NUL is "optional" : recommended for ease and compatibility for working with Linux for example, but would be dropped in later implementations.
It is not meant as an internal representation format for language because memory management is not considered (though it could be used when encapsulated with pointers, reference counts and other necessary attributes)
The fundamental element is the 8-bits byte, or uint8_t in today's C. This deals with ASCII and UTF-8 only. But since the proposed format deals nicely with NUL chars inside the string, this is not inherently limited to text.
The "lightweight" aspect is important : the code must run on microcontrollers (where the usual POSIX string.h is considered a nuisance)
The most frequently used string function is xxprintf() which internally often performs string concatenation with variable formats. Otherwise we would use write() directly, right ?
strlen() is a nuisance : it shouldn't have to exist, at least as an O(n) function.
To prevent buffer overflows, sprintf() must not be used when coding defensively (the default approach, normally). This is why we have the snprintf() variant and its siblings, but they give the result after the fact ! So we must allocate a potentially large buffer, check the result and eventually resize the buffer. If we had the length before the concatenation, we would check early for overflow then directly allocate the right buffer size.
Concatenation is one of the most frequent operations on strings because we often need to join several fields to create a message, for example to display the respective value of several numbers or even build the header of a HTTP packet.
Wikipedia describes many operations :
but their significance differs substantially from a text processor for example. The typical POSIX functions perform them with close O(n) costs but a binary tree structure is naturally a more appropriate higher-level representation. This is why the aligned strings are only the lower layer of a string system and only few operations are performed on them (those that allow the above list to be implemented).
To conserve memory and speed, operations must be performed in place as much as possible. Moving blocks around is painful. With the aligned strings, it's even more important because the alignment is critical. Fortunately it is easy to re-align the start of a string by splitting it and fixing/using the appropriate alignment and size in the small prefix block that is generated.
Str8 and Str16 are a subset of the 4 possible LSB codes. If we want to design a practical binary tree system, the pointer itself must say if it points to a node or a leaf... so we then have 2 variants :
The difference between aStr and uStr is only at the syntactical level because the representation is the same but it prevents confusion and hints the type checker of the compiler that there is a potential incompatibility (that can be bypassed by a cast if needed).
To be continued...
...
The version 0.1 has some little issues when splitting an existing string. Nothing severe but there is room for improvements.
Oh and WHO uses strings longer than 64K ? If a block gets THAT long, there are many benefits to spitting it (such as : less frequent re-alignment).
I've also been playing with the idea of adding an offset field to the header/descriptor but that would break the whole principle of "the pointer is the start of the character string area".
I'm left with a couple of easy and practical formats :
Then the rules become even simpler :
Note :
In #gdups I define the strings with this structure :
/* description d'un chemin */
struct noeud_chemin {
/* identifiant unique du répertoire : */
dev_t st_dev;
ino_t inode;
/* pointeur pour aller vers la racine */
struct noeud_chemin * parent;
/* le nom */
short int taille_nom;
char nom[2];
};
The first versions of gdups declared "nom" as type "[]" so it was an array of characters of undefined/unbounded size. Later compilers complained and I had to define a minimal size, then I cast-ed the pointer to the start of nom. That's ugly... That is why even though I can describe the variable format as an union of structs, this restricts the code too much and I will not use the union in practice (also because the pointer does NOT point to the union itself but one of its elements).
struct Str8_t {
uint8_t size8;
char str[] __attribute__((packed)) ;
}
struct Str16_t {
uint16_t size16;
char str[] __attribute__((packed)) ;
}
typedef union StrStruct {
struct Str8_t;
struct Str16_t;
} StrStruct;
The above code is descriptive but not practical because we don't use pointers to StrStruct : the pointer itself gives the type, without the need to check a "type" field in the struct.
One benefit is that we can "prepare" or prefetch some code while the actual data is being fetched, so it increases parallelism.
char * some_strptr; // this is not a pointer to the union !
...
int size;
char a = *some_strptr; // trigger the fetch from memory
// meanwhile we can calculate the pointer and type of the size prefix
if (some_strptr & 1) {
// LSB=1 : odd pointer means str8
size = * (uint8_t *) (some_strptr - 1)); // clear the LSB to get the prefix byte
...
}
else {
// LSB=0 : even pointer means str16
size = * (uint16_t *) (some_strptr - 1); // subtraction is necessary
...
}
This is pretty good for OOO cores. And to add more icing on the cake, most cores love loads with an immediate offset so it is good to not have boolean operations at the last stage of address calculation.
In-order cores don't like to jump around so a linearised version is required. This version "re-aligns" the pointer :
prefix_ptr= (some_strptr - 1) & (~1)
int size = * (uint16_t *) prefix_ptr; // some heavy casting here
if (some_strptr & 1)
size &= 0xFF; // keep only the lower byte (LittleEndian assumed)
In assembly language, this can be reduced to only a few instructions if the processor supports predicates. The #YASEP Yet Another Small Embedded Processor has a strange "load" method that drops the LSB so this...
Read more »
Create an account to leave a comment. Already have an account? Log In.
this would be possible if the context was well confined in a language framework for example (such as Python, JS and others).
In Delphi for example, there is a 32b prefix but everything is aligned.
In my case I want to perform operations that differ from the safe&sane environment of managed languages and I want to avoid "shifting" blocks all the time. I want to do thing "in place" as much as possible. So the byte prefix is useful BUT often not large enough. 2 bytes forces alignment of the start of the character area.
I'll have to write a log about the use cases & environment...
Notes for later : Some random googling and readings...
https://en.m.wikipedia.org/wiki/Null-terminated_string
https://stackoverflow.com/questions/7648947/declaring-pascal-style-strings-in-c
https://stackoverflow.com/questions/4418708/whats-the-rationale-for-null-terminated-strings
https://en.m.wikipedia.org/wiki/String_(computer_science)#Length-prefixed
https://en.m.wikipedia.org/wiki/Rope_(data_structure)
https://queue.acm.org/detail.cfm?id=2010365
https://wiki.freepascal.org/Character_and_string_types
http://web.archive.org/web/20151125114234/http://www.codexterity.com/delphistrings.htm
https://wiki.freepascal.org/TStringList-TStrings_Tutorial
http://www.paulgraham.com/hundred.html
https://bobobobo.wordpress.com/2008/01/28/how-to-use-variable-argument-lists-va_list/
This one will be handy later :-)
Actually you can make the format more useful by extending the record backwards:
-4: uint_24: capacity, uint_8: flags
0: length (0-24b), uint_8 data[]
length: \0
This allows you to carry information about the capacity of the buffer in the record, for passing to str(n)* and snprintf routines. The flags could encode options like extend on demand, etc. which wrapper functions could use. Thus you still have compatibility with C library functions but can support enhanced behaviour.
Since this constitutes public disclosure, this extension cannot be patented from now on. 😉
yes, it is of course possible to extend backwards.
Now, I don't know how to tell a C compiler how to do it.
in C, I used to use a struct for Pascal-type strings, and it starts at address 0.
My idea is to use a 3rd bit of the pointer to flag the extra fields, which will take 4 more bytes like in you case.
There are various ways. One is to use an enclosing struct:
struct {
uint_32 capflags;
uint_8 data[];
}
Then mask the low 2 bits and subtract sizeof(uint_32) from the address you have to get to the capflags.
If you align to 8 bytes, you don't need to subtract the offset from the address :-)
Yes, but that might be a waste of memory and you would have to make sure that any heap allocation routines return that alignment.
@Ken Yap you can always re-align a pointer to a block you allocated.
Or even allocate a large block and make your own allocation system with the desired granularity :-)
Of course, but you have to overallocate in malloc then round up to the desired alignment. So 4 byte alignment may already be the default and you have to ensure 8 byte alignment with this method.
never mind : i'm about to settle for 16-bits alignment :-P
You can probably define a C++ class to handle these "ystrings" and to convert to and from C strings. Conversion to a C string is easy, just return the pointer. Conversion from a C string is more work, you have to allocate space and do a copy to ensure you have the alignment you want. It makes operations like ystrlen O(1) rather O(n) complexity.
Another consideration is whether you will normalise ystrings after an operation. For example all three represent the same string:
0 0 5 h e l l o \0
0 5 h e l l o \0
5 h e l l o \0
C++ is not for me... I'm having more fun with its bastard grandkid JavaScript, who rediscovers life's rules and hardships after turning 20 ;-)
"ystring" does not roll on my tongue but... nah :-P
Yes, conversion to C is easy. That's the point. It would ease communication with POSIX stuff (hopefully).
Conversion from C is less frequent and would occur in more deterministic situations. Most of the runtime cost (for my cases) is when strings are concatenated, when stuff is assembled, so the net gain should be positive. "reshifting" would occur only at the last moment when the whole batch is serialised/streamed/concatenated, normally only once.
You made a mistake with the degenerate cases : it's little endian ;-)
5 h e l l o \0
5 0 h e l l o \0
5 0 0 h e l l o \0
This really helps because the function only has to read the first 32-bits word and mask according to the pointer.
so normalisation is not critical : there is only one or two bytes of penalty for using a larger format, while enforcing a strict resize all the time is costly due to the constant reshifting.
If there is an overflow, you can avoid the cost of shifting by making a tree of strings.
But concatenation of C strings is already O(n) so can't get any better than that. Only if you have spare space behind one string can you reduce that to the length of one string. Anyway most of the work in s(n)printf is converting the data to characters so saving a bit of string manupulation won't make much difference to CPU work.
managing memory during multiple concatenations can be a P.I.T.A. and it happens a lot...
Remember : I am implementing a HTTP server, which has to manage quite a number of strings of various and variable sizes...
You can only do so much with snprintf() in this case : you can say what is too much but this is arbitrary... and you have to deal with the eventual error anyway.
If you concatenate with a tree first, you get the size of the whole string to allocate before serialising it. And you can process/manage the error before the serialisation happen.
@Ken Yap : this format actually can get better than that. It can concatenate two strings into one "non-ASCIIZ-compatible two-pointer" format "ystring" in O(1) time. You are right that creating an ASCIIZ string of length N by concatenation will inevitably take at least O(N) no matter how you do it. However, concatenating N separate strings of average length M with strcat() requires O( M*N^2 ) ( Schlemiel the Painter's Algorithm https://www.joelonsoftware.com/2001/12/11/back-to-basics/ ) but O( N + N*M ) by making a tree of two-pointer nodes, and flatting down to one long ASCIIZ string only once at the end.
I like this. I may use it myself in the future. Please do not patent! lol
There is no risk : a patent can't apply on a format. Only on methods.
I write everything under AGPLv3 (unless required by a client).
Many software use a prefixed format with size being 4 bytes, which is a waste but keeps the code short and fast. I'm trying to find a compromise...
Tell us what you think :-)
I had occasion to use pascal-style strings just yesterday. It wouldn't have helped in that particular case because all the data were small so I could get away with a fixed size of 8-bits for the length, but I like your adaptable solution for other cases where it might need to spill over to 16-bits or more occasionally, rather than bumping up all lengths to 16-bit just because a few happened to be long.
Vaguely it reminds me of how UTF8 achieves bisexuality with the ANSI world, but your use of the lower two bits of the pointer to encode the length of the length is particularly clever.
I was joking about patents, of course, however this is in fact a 'method'.
Feel free to use that trick :-)
If you only have 1 and 2 bytes for the size, you can simply test the LSB of the pointer and select the appropriate processing code. I am considering that "degraded mode" for 16-bits platforms.
I am thinking a lot about it, now, it's my new pet idea and a lot of things are flashing through my head, we'll see in the next days and weeks after the brightest ideas have percolated to the top...
Using the LSB to encode meta-informations is not new : I think some LISP and IA machines use this technique (for example, add a flag to indicate the type of a node in a linked list or tree) but the "smart" idea here is that the pointer serves 2 purposes at once ! It can indicate the format and still point to the start of the string ! It sounded so cool that I wondered why nobody did it before and I am still digesting a lot of documentation on the subject.
Yes I was thinking that a better analogy than UTF8 might be ARM Thumb instructions: the LSB of the address indicates whether the target code (of a call or interrupt vector) is actually Thumb instructions. Since instructions can never be on an odd address, they re-purposed that otherwise unused bit. (I know this well from disassembling, lol)
Your encoding technique is still useful on 32-bit platforms.
It will be interesting to hear your thoughts about it's applicability to trees and linked lists when you get to that.
Yes, playing with pointers is a risky business...
I've been a CPU architecture geek for a long while and it is always tempting to use "some bits here and there" for "this and that"... And it often backfires.
Here it might be a sweet spot, we'll see. Anything is better than sprintf() ;-)
Become a member to follow this project and never miss any updates
Why not just use a 16-bit length field all the time and avoid the complexity of deciding whether which variant it is, in return for wasting one byte? What you lose in data space you may save in code space.