Aligned Strings format

Description

This string format (now called Aligned String(s)) is partially compatible with ASCIIZ formatted strings and has a size prefix (like "Pascal strings") which ranges from 0 to 64KiB (or 0 to 255 to save a byte in the constant area).
The trick is that the prefix has a variable size from 1 to 2 bytes, the type is not encoded in the prefix itself but the 2 LSB of the pointer to the string. And this encoding is congruent with the real start of the byte string !
Additionally, type 3 encodes a "list" structure that contains pointers to sub-strings, to ease with concatenation and corner cases.
The last format is type 0 which encodes a single Unicode point (for use in the lists, not as a real pointer)

An extended version adds "Flexible" strings (through the 3rd LSB of the pointer) that can be dynamically resized within the range that was previously allocated.

This solves quite a few problems inherent with the existing Pascal and ASCIIZ formats !

Details

This project documents the design and implementation of a string format that is flexible and

efficient in code space
efficient in data space
efficient in code execution

This is in part due to

its flexible implementation that instantiates the required features only
a definition that is adapted to static and dynamic processing
compatibility with POSIX's ASCIIZ format (that can be disabled)
code and data that can be accessed at any level of abstraction.

Unicode is supported as UTF-8 byte sequences in types 3/F3 lists, and as integer code points in type UP.

This makes it suitable from basic, barebone utilities to more elaborate applications.

To define constant strings, the basic structure is defined as:

typedef struct {
  uint8_t len  __attribute__ ((aligned (4)));
  char text[];
} aString_8;

A variant (with bit 0 of the pointer set) adds another field to keep the allocated size, helping to perform variable-sized operations:

// Same as above but with an extra 32-bit size field:
typedef struct {
  uint32_t allocated __attribute__ ((aligned (8)));
  uint8_t len;
  char text[];
} Flex_aString_8;

The same goes for the 16-bit version (types 2 and F2) and the lists (types 3 and F3 ).

-o-O-0-O-o-

Logs:
1. Dealing with re-alignment and 2 string types only
2. Context
3. 2023 : a new version
4. Evolution...
5. Merge works
6. Holding back a bit
7. More food for thoughts
8. Article !
9. Another possible extension
10. More than an error
11. Fuzzing and safety
.

Files

AS20230117.tgz

gracefully handles NULL, plus found a stupid break bug

x-compressed-tar - 5.03 kB - 01/17/2023 at 06:37

Download

AS20230112.tgz

can merge Unicode points !

x-compressed-tar - 4.76 kB - 01/12/2023 at 07:44

Download

AS20230110.tgz

latest source code with example programs for concat and merge.

x-compressed-tar - 2.60 kB - 01/10/2023 at 13:44

Download

aligned_strings.h

New, better version

x-chdr - 2.56 kB - 01/09/2023 at 18:09

Download

PascalZ.h

First definitions for the 2LSB and 3LSB variants

x-chdr - 1.12 kB - 01/04/2023 at 16:29

Download

Project Logs

Collapse

Fuzzing and safety
Yann Guidon / YGDES • 03/22/2024 at 01:14 • 0 comments

The aligned format is great, while it remains used inside a safe and controlled context.

It can get ugly though when an "aligned string" pointer is transmitted through an untrusted piece of code. This unsafe code could be prevented from dereferencing the string's value but this is not enough. If the pointer itself is modified, all kinds of problems arise.
Receiving a pointer to an aligned string from a dubious source can be less dangerous if the type is restricted. The type 2 (16-bit length) is the safest and it's easy to filter. The Type 3 creates indirect (de)references and the flexible types should be cast back to constant strings (it might not be possible to modify or reallocate the target anyway).
Use Type 2.
More than an error
Yann Guidon / YGDES • 03/11/2024 at 03:04 • 0 comments

Type 0 is NULL/Error and detected from the three LSB.
But the remaining MSB are not checked...
So if the MSB are not cleared, this could lead to yet another type, 8-byte aligned. What could it be ?
Another possible extension
Yann Guidon / YGDES • 09/25/2023 at 01:12 • 0 comments

The recent article Faster String Processing With Bloom Filters got my attention. It refers to https://codeconfessions.substack.com/p/cpython-bloom-filter-usage. And I thought: What if...

It is possible to create a larger structure that contains an even larger field, to hold either a Bloom sieve or a hash/signature, for 2 use cases : sub-string search or string matching. For example, if a string is determined to be tested for equality at some later point, a programming language / Compiler would "tag" the string so the hash is created once the value is assigned. Then the actual test becomes a simple scalar value comparison, followed by a real comparison in case of a hit, which might save some CPU cycles when misses are likely.

Similarly, before performing a sub-string search, the strings (possibly many, again) can be pre-processed so only the 64-bit Bloom filter is used at first, to sieve candidates faster than the naïve method.

However, with already 3 LSB used, such an extension is impractical. The memory losses from alignment might become ridiculous. The hash or sieve values would be handled by the programmer, case-by-case at first, in a separate structure.

And speaking of hashes, I might have another trick up my sleeve, look up #PEAC Pisano with End-Around Carry algorithm !
Article !
Yann Guidon / YGDES • 07/01/2023 at 17:25 • 0 comments

I have published a thorough article on this subject in GNU/Linux Magazine France: "Sus à l’ASCIIZ, vive les chaînes alignées !" in March 2023, with a summary of all the design decisions and more.
More food for thoughts
Yann Guidon / YGDES • 01/17/2023 at 02:59 • 1 comment

I have to credit @Gravis for challenging me in the comments. Even though it does not go in the direction he seems to want, feedback is progress and this has brought some indirect useful bits. Even though the broad lines are defined, there are still details to perfect.

A quirky little detail I didn't properly consider before is how to manage or handle the NULL value. It belongs to the type 0 so the 2 LSB are cleared. There is no incentive to swap it with type 3 again, as type 0 is not dereferenced (so no risk of a NULL pointer crashing the program) and type 3 has an easy mask-based routine to fetch the total length up to 24 bits.

However type 0 gives code points and NULL translates to a character of value 0. This potentially causes confusion in cases where a pointer is returned as an error code, like malloc() failing. It shouldn't insert a 0 in the stream (eventually a Unicode glyph that says "error").

Furthermore a "placeholder" value is defined to indicate that the entry in the list must be skipped. Currently this is defined as value 4 (100b).

Something has to be shuffled around to make the whole thing safer. It is easier to re-attribute the less-used placeholder value, giving the following table :

Attribution MSB other bits
b2 b1-b0
NULL

0 00
Unicode point (including value 0)
0
Unicode value
1 00
Placeholder / skip marker
1

1 00

The value of the placeholder/skip can now be -4 because Unicode values can't affect the MSB.

I don't know yet how to display a NULL value. Maybe ⚠ - U+26A0 - ⚠ ? The Replacement character U+FFFD � is already used for a different purpose and should be avoided to prevent confusion. U+1F6AB 🚫 or U+1F5F2 🗲 could also represent an error.

The summary of this change is below:

Type UP means "Unicode Point or Placeholder". Now let's transform the specification into working code.
Holding back a bit
Yann Guidon / YGDES • 01/13/2023 at 03:02 • 0 comments

These have been awesome 2 weeks of development and maturation of this project that languished for 2 years already. I now have to take a step back for a while because it uncovered another C flaw that is at least as important.

So I just started another project #A software stack..

It will help with dynamic allocation without a garbage collector for a lightweight framework. This will in turn solve quite a few issues with memory allocation I encountered during the design of this code.

I'll have to implement the lists (type 3) (it's pretty easy now since merge() implements most of the logic already) and some printf-like features (school-level number conversion stuff) but they'll depend on the above stack mechanism, so be patient. Meanwhile, enjoy AS20230112.tgz !
Merge works
Yann Guidon / YGDES • 01/10/2023 at 13:53 • 0 comments

I just uploaded a new version ! AS20230110.tgz can now merge strings of types 1 and 2, even multiple in a row. This involved solving many programming challenges and overcoming the reticence of the compiler. For example, declaring and allocating a string is a crazy mess and it is still far from perfect.
But the basic features are functional.
Evolution...
Yann Guidon / YGDES • 01/09/2023 at 18:15 • 0 comments

Progress is strong these days.

I shuffled things around : types 0 and 3 are swapped so optimised code can more easily read the size of lists.

The Flexible strings are also adapted a bit.

The whole thing is now defined as a "descriptor" or "handler" type, called aStr_t and declared in C as uintptr_t

As a result we get this new map :

Type F0 is undefined and should be congruent with type 0 "points".
I uploaded the latest aligned_strings.h here.

I now move on to implement the required support functions.
2023 : a new version
Yann Guidon / YGDES • 01/04/2023 at 17:40 • 0 comments

I have explored the whole system again these last days and came to a new definition and allocation. So you can scrap parts of the 2 year old versions and take this instead:
We have new and refined features, such as an optional "Flexible" flag and a direct Unicode point (shifted) to be used in the type 0 lists !
I have uploaded a first definition at PascalZ.h though it's still preliminary.
Context
Yann Guidon / YGDES • 05/19/2020 at 16:31 • 0 comments
Any technology has a "preferred" or "appropriate" application domain/range. The aligned string format has been thought to address needs that arise when dealing with ASCII-Z strings during mundane operations, mainly POSIX API and other low level operations that must remain lightweight.

It is meant to be "somewhat" compatible with POSIX but would ideally replace it (if I had the chance to redo it all over). This is why the trailing NUL is "optional" : recommended for ease and compatibility for working with Linux for example, but would be dropped in later implementations.

It is not meant as an internal representation format for language because memory management is not considered (though it could be used when encapsulated with pointers, reference counts and other necessary attributes)

The fundamental element is the 8-bits byte, or uint8_t in today's C. This deals with ASCII and UTF-8 only. But since the proposed format deals nicely with NUL chars inside the string, this is not inherently limited to text.

The "lightweight" aspect is important : the code must run on microcontrollers (where the usual POSIX string.h is considered a nuisance)

The most frequently used string function is xxprintf() which internally often performs string concatenation with variable formats. Otherwise we would use write() directly, right ?

strlen() is a nuisance : it shouldn't have to exist, at least as an O(n) function.

To prevent buffer overflows, sprintf() must not be used when coding defensively (the default approach, normally). This is why we have the snprintf() variant and its siblings, but they give the result after the fact ! So we must allocate a potentially large buffer, check the result and eventually resize the buffer. If we had the length before the concatenation, we would check early for overflow then directly allocate the right buffer size.

Concatenation is one of the most frequent operations on strings because we often need to join several fields to create a message, for example to display the respective value of several numbers or even build the header of a HTTP packet.

Wikipedia describes many operations :
- 2.1 Index ...
- 2.2 Concat ...
- 2.3 Split ...
- 2.4 Insert ...
- 2.5 Delete ...
but their significance differs substantially from a text processor for example. The typical POSIX functions perform them with close O(n) costs but a binary tree structure is naturally a more appropriate higher-level representation. This is why the aligned strings are only the lower layer of a string system and only few operations are performed on them (those that allow the above list to be implemented).

To conserve memory and speed, operations must be performed in place as much as possible. Moving blocks around is painful. With the aligned strings, it's even more important because the alignment is critical. Fortunately it is easy to re-align the start of a string by splitting it and fixing/using the appropriate alignment and size in the small prefix block that is generated.

Str8 and Str16 are a subset of the 4 possible LSB codes. If we want to design a practical binary tree system, the pointer itself must say if it points to a node or a leaf... so we then have 2 variants :
- aligned aStr8 and aStr16, that can also encode aStr24 and aStrNode
- unaligned uStr8 and uStr16 that may not be directly pointed to by aStrNode in one half of the cases. If there are 2 unallocated bytes in front of a uStr8, the prefix can be turned into aStr24, otherwise the string would be realigned with a split and the pointer would point to a aStrNode that points to the moved first part and the remaining part of the original string.
The difference between aStr and uStr is only at the syntactical level because the representation is the same but it prevents confusion and hints the type checker of the compiler that there is a potential incompatibility (that can be bypassed by a cast if needed).
To be continued...
...

View all 11 project logs

Discussions

Yann Guidon / YGDES wrote 02/10/2023 at 13:53

https://hackaday.com/2023/02/10/modernizing-c-arrays-for-greater-memory-safety/

https://developers.redhat.com/articles/2022/09/29/benefits-limitations-flexible-array-members#nonconforming_compiler_extensions

https://people.kernel.org/kees/bounded-flexible-arrays-in-c

Are you sure? yes | no

Yann Guidon / YGDES wrote 02/02/2023 at 21:26

https://www.joelonsoftware.com/2001/12/11/back-to-basics/

Are you sure? yes | no

Yann Guidon / YGDES wrote 01/22/2023 at 12:04

https://v8.dev/blog/pointer-compression :-)

Are you sure? yes | no

Yann Guidon / YGDES wrote 01/22/2023 at 12:05

from the link :

The tag bits serve a dual purpose: they signal either strong/weak pointers to objects located in V8 heap, or a small integer. Hence, the value of an integer can be stored directly in the tagged value, without having to allocate additional storage for it.

V8 always allocates objects in the heap at word-aligned addresses, which allows it to use the 2 (or 3, depending on the machine word size) least significant bits for tagging. On 32-bit architectures, V8 uses the least significant bit to distinguish Smis from heap object pointers. For heap pointers, it uses the second least significant bit to distinguish strong references from weak ones:

Are you sure? yes | no

Yann Guidon / YGDES wrote 01/15/2023 at 23:51

15 years ago : http://www.and.org/vstr/comparison lists 28 string libraries. I don't see any with a similar pointer trick.

Are you sure? yes | no

Gravis wrote 01/13/2023 at 15:05

Unless I'm missing something, the benefit of type 1 is so low that differentiating between it an type 2 is almost certainly going to require more code than the single bytes saved in each string.

However, if you keep type 1 then you need to bulletproof implementation that will recoup the cost of the extra code needed in only a couple dozen strings. However, if you have dozens of strings then it's unlikely to be a small program making the savings moot. Finally, if you every merge a type 1 and 2 then the code for that is going to require you to use hundreds of type 1 strings to recoup the savings. It would be logical do do away with type 1 as it's almost certain that type 2 strings will be used for a "how to use this program" aka "--help" string.

Are you sure? yes | no

Yann Guidon / YGDES wrote 01/13/2023 at 16:36

The cost for having both type 1 and 2 is a few mask & shift operations.

inline int aStr_length_fast(aStr_t p) {
uint32_t *q = (uint32_t *)(p & ~3);
int LSB = p & 3;
return (*q) & ~((~0) << (LSB<<3));
}

This works for types 1 , 2 and 3 (with type 3 limited to 24b/16M).

Are you sure? yes | no

Gravis wrote 01/13/2023 at 19:06

The problem is that you have to do it for all the functions. It may only require a handful of operations but when it's in multiple functions, you start needing dozens of strings to balance the space savings.

Are you sure? yes | no

Yann Guidon / YGDES wrote 01/14/2023 at 01:05

@Gravis OTOH don't you already have to call strln() at these places ?

In modern OoOx CPUs (since the Pentium III) what costs time is random memory accesses and execution flow breaks (calls, jumps, conditional branches in particular).

In the above case, there is no branch, and memory access is to the very same cache line (by construction) so there is only a marginal cost.

Are you sure? yes | no

Gravis wrote 01/14/2023 at 02:27

@Yann Guidon / YGDES

There is more to programming than ye olde C strings. There are C++ strings and NT has the UNICODE_STRING type. The days of relying on strlen are long gone.

Are you sure? yes | no

Yann Guidon / YGDES wrote 01/14/2023 at 03:52

@Gravis I know this well, what is your point ?

https://en.wikipedia.org/wiki/String_(computer_science)#Representations

https://en.wikipedia.org/wiki/Comparison_of_programming_languages_(string_functions)

The last link compares about 50 languages.

Furthermore, the class of applications I try to address, including the constraints, is probably not well stated. I swap between many projects and drop ideas here and there...

Are you sure? yes | no

Gravis wrote 01/14/2023 at 06:04

@Yann Guidon / YGDES

"if you keep type 1 then you need to bulletproof implementation that will recoup the cost of the extra code needed in only a couple dozen strings. However, if you have dozens of strings then it's unlikely to be a small program making the savings moot." - Me

"The cost for having both type 1 and 2 is a few mask & shift operations." - You

"The problem is that you have to do it for all the functions." - Me

The point being that the benefits of type 1 are outweighed by it's requirements.

You went on about needing strlen and I pointed out that that's no longer the case as many string implementations exist that hold the length of the string making the benefit a moot point.

Are you sure? yes | no

Yann Guidon / YGDES wrote 01/14/2023 at 20:14

@Gravis

"many string implementations exist that hold the length of the string making the benefit a moot point."

FORTH calls them "counted strings" and these were already well accepted in the 60s (before PDP-7 dwellers at Bell Labs selected null-terminated for their pet project and language). That's also called "Pascal style" and certainly something else by other languages.

So yes there are many string implementations, I never pretended inventing them. Now, maybe you don't have to interact with POSIX often, or related tools or interfaces.

Due to alignment, Type 1 provides a marginal benefit, but what would you do with a byte prefix instead ?

The project is not as documented and elaborated as some of my other projects so you might miss some context and other elements. At least there is preliminary example code and I'm working on more elaborated features. They probably don't fit your needs and you are luckily not affected by the tyranic anachronism of <string.h>, and that's good for you.

Are you sure? yes | no

Gravis wrote 01/15/2023 at 22:29

@Yann Guidon / YGDES

I'm more acquainted with POSIX than you may have thought. There are many POSIX compatible implementations already and there is literally no need to rely on POSIX string functions.

Here's the description for the "Better String" library: https://github.com/websnarf/bstrlib/blob/master/bstrlib.txt

If nothing else, you could use bstrlib as a better starting point.

Are you sure? yes | no

Yann Guidon / YGDES wrote 01/15/2023 at 23:28

@Gravis

Thanks for the link ! And I presume nothing about your skills and knowledge.

"there is literally no need to rely on POSIX string functions."

I'd have a few things to say about this but I'll try to stay on-topic.

Of course I know I'm not the first one to complain about strings. I didn't know Paul's lib, which has not taken over the traditional libs yet. His author Paul Hsieh is not a mere beginner as I used to read him on c.l.a.x back in the Usenet days. I also used his references for x86 asm optimisation and followed his exploration of hashing algos. So I have deep respect for him.

Now let's simply read the link you sent :

A bstring is basically a header which wraps a pointer to a char buffer. Lets start with the declaration of a struct tagbstring:

struct tagbstring {
int mlen;
int slen;
unsigned char * data;
};

OK that's it, just like in almost every other HLL or interpreter. Delphi adds a refcounter, but so far nothing breathtaking:

* you have to carry a whole struct around, which is less convenient to return from a function than a plain pointer,

* constant strings have mlen=slen,

* the pointer uses 4/8 more bytes to the struct,

* the effort is mostly the same,

* and it adds one level of indirection.

I'm sure Paul's work is a few leagues above mine and he takes a reasonable idea to turn it into a great piece of code that I'd be happy to see integrated into POSIX for example. But you failed to convice me and, replying to

"you could use bstrlib as a better starting point."

I know it's better than standard C (and there is quite a lot to get inspiration from) but I see no benefit over the aligned strings. And it's not a NIH syndrome.

Are you sure? yes | no

Gravis wrote 01/16/2023 at 00:47

@Yann Guidon / YGDES

"you have to carry a whole struct around, which is less convenient to return from a function than a plain pointer,"

What are you talking about? It's a pointer to a struct. Your stuff can be represented the exact same way. You're giving me the feeling that you don't quite grasp how compilers deal with structs.

Are you sure? yes | no

Yann Guidon / YGDES wrote 01/16/2023 at 01:46

@Gravis

"You're giving me the feeling that you don't quite grasp how compilers deal with structs."

Feelings are treacherous. The C syntax allows it and it opens a whole can of worms. If it could, should you? You know C's x86 ABI returns the function's result in EAX (or RAX), which clearly can't fit a struct. You can and will play around. Each compiler will do their own thing, copying or dereferencing, sometimes in inefficient or unexpected ways. Whether it returns a pointer (to a struct allocated where ?) or bulk data (that are filled/copied where?) creates secondary issues.

Funnily, I used to decompile and single-step binaries generated from Ada-like VHDL (have a deep look at GHDL) and that crazy language family has no problem returning a struct or more, using other techniques even on a GCC backend. It makes me feel "too tight" when I must code in C again.

"It's a pointer to a struct. Your stuff can be represented the exact same way."

No.

Read again:

struct tagbstring {
int mlen;
int slen;
unsigned char * data;
};

You see, at the end, it's a pointer.

Now:

typedef struct {
uint32_t allocated __attribute__ ((aligned (8)));
uint16_t len;
char text[];
} Flex_aString_16;

You should see that "text" is not a pointer, it's the start of the array, using C99's "Flexible array" notation, that was indeed designed for this purpose. No indirection. You can printf() the pointer directly with a simple (char*) cast to the value, which points to Flex_aString_16.text. The pointer is self-sufficient and that's the whole point.

Anyway, that's a stimulating discussion, thank you.

Are you sure? yes | no

Yann Guidon / YGDES wrote 01/13/2023 at 16:40

OTOH, compilers have to align, sometimes more than required so the saving of Type1 is almost irrelevant but it keeps compatibility with some other format.

And with the forced alignment the savings are not huge and only for size=4n+3

Are you sure? yes | no

Yann Guidon / YGDES wrote 01/13/2023 at 17:24

If you dropped type 1, what would you use if for ? I think I have covered the bases so far.

Are you sure? yes | no

Gravis wrote 01/13/2023 at 19:24

Honestly, I think you are attacking the wrong problem. Strings aren't really a problem when it comes to memory with the exception of very special programs. I would say the real issue is efficient allocation and management of memory pools in order to minimize memory fragmentation.

Are you sure? yes | no

Yann Guidon / YGDES wrote 01/14/2023 at 01:09

@Gravis I agree that memory allocation is a huge .... "problem" doesn't even come close to the scale of this catastrophic mess.

I'm working on it too, already.

Are you sure? yes | no

Gravis wrote 01/14/2023 at 02:33

@Yann Guidon / YGDES

Then you should look at "two-level segregated fit" because it's a good solve. You will need to be exceptionally clever to make an allocation method that could even compete with it.

Are you sure? yes | no

Yann Guidon / YGDES wrote 01/14/2023 at 03:56

I started to work on another aspect/side of the problem.

Are you sure? yes | no

Thomas Castello wrote 01/10/2023 at 09:47

Hello word world!

Are you sure? yes | no

Yann Guidon / YGDES wrote 01/10/2023 at 09:56

Welcome !

Right now I write example code and the primary functions, to be uploaded here soon.

Are you sure? yes | no

Yann Guidon / YGDES wrote 01/09/2023 at 22:23

https://www.youtube.com/watch?v=Qyn1qxi73u8

Even Dave's Garage is talking about this obsolete dumpster fire...

Are you sure? yes | no

Yann Guidon / YGDES wrote 01/04/2023 at 06:36

I'm back on the subject, dudes ! With more hindsight and it's slowly coming together...

Are you sure? yes | no

Ken Yap wrote 05/18/2020 at 00:22

Why not just use a 16-bit length field all the time and avoid the complexity of deciding whether which variant it is, in return for wasting one byte? What you lose in data space you may save in code space.

Are you sure? yes | no

Yann Guidon / YGDES wrote 05/18/2020 at 17:33

this would be possible if the context was well confined in a language framework for example (such as Python, JS and others).
In Delphi for example, there is a 32b prefix but everything is aligned.

In my case I want to perform operations that differ from the safe&sane environment of managed languages and I want to avoid "shifting" blocks all the time. I want to do thing "in place" as much as possible. So the byte prefix is useful BUT often not large enough. 2 bytes forces alignment of the start of the character area.

I'll have to write a log about the use cases & environment...

Are you sure? yes | no

Yann Guidon / YGDES wrote 05/18/2020 at 00:11

Notes for later : Some random googling and readings...

https://en.m.wikipedia.org/wiki/Null-terminated_string
https://stackoverflow.com/questions/7648947/declaring-pascal-style-strings-in-c
https://stackoverflow.com/questions/4418708/whats-the-rationale-for-null-terminated-strings
https://en.m.wikipedia.org/wiki/String_(computer_science)#Length-prefixed
https://en.m.wikipedia.org/wiki/Rope_(data_structure)
https://queue.acm.org/detail.cfm?id=2010365
https://wiki.freepascal.org/Character_and_string_types
http://web.archive.org/web/20151125114234/http://www.codexterity.com/delphistrings.htm
https://wiki.freepascal.org/TStringList-TStrings_Tutorial
http://www.paulgraham.com/hundred.html

Are you sure? yes | no

Yann Guidon / YGDES wrote 05/18/2020 at 00:31

https://bobobobo.wordpress.com/2008/01/28/how-to-use-variable-argument-lists-va_list/

This one will be handy later :-)

Are you sure? yes | no

Ken Yap wrote 05/15/2020 at 09:00

Actually you can make the format more useful by extending the record backwards:

-4: uint_24: capacity, uint_8: flags

0: length (0-24b), uint_8 data[]

length: \0

This allows you to carry information about the capacity of the buffer in the record, for passing to str(n)* and snprintf routines. The flags could encode options like extend on demand, etc. which wrapper functions could use. Thus you still have compatibility with C library functions but can support enhanced behaviour.

Since this constitutes public disclosure, this extension cannot be patented from now on. 😉

Are you sure? yes | no

Yann Guidon / YGDES wrote 05/15/2020 at 10:51

yes, it is of course possible to extend backwards.

Now, I don't know how to tell a C compiler how to do it.

in C, I used to use a struct for Pascal-type strings, and it starts at address 0.

My idea is to use a 3rd bit of the pointer to flag the extra fields, which will take 4 more bytes like in you case.

Are you sure? yes | no

Ken Yap wrote 05/15/2020 at 11:13

There are various ways. One is to use an enclosing struct:

struct {

uint_32 capflags;

uint_8 data[];

}

Then mask the low 2 bits and subtract sizeof(uint_32) from the address you have to get to the capflags.

Are you sure? yes | no

Yann Guidon / YGDES wrote 05/15/2020 at 11:21

If you align to 8 bytes, you don't need to subtract the offset from the address :-)

Are you sure? yes | no

Ken Yap wrote 05/15/2020 at 11:34

Yes, but that might be a waste of memory and you would have to make sure that any heap allocation routines return that alignment.

Are you sure? yes | no

Yann Guidon / YGDES wrote 05/17/2020 at 20:56

@Ken Yap you can always re-align a pointer to a block you allocated.

Or even allocate a large block and make your own allocation system with the desired granularity :-)

Are you sure? yes | no

Ken Yap wrote 05/17/2020 at 23:01

Of course, but you have to overallocate in malloc then round up to the desired alignment. So 4 byte alignment may already be the default and you have to ensure 8 byte alignment with this method.

Are you sure? yes | no

Yann Guidon / YGDES wrote 05/18/2020 at 00:10

never mind : i'm about to settle for 16-bits alignment :-P

Are you sure? yes | no

Ken Yap wrote 05/14/2020 at 23:36

You can probably define a C++ class to handle these "ystrings" and to convert to and from C strings. Conversion to a C string is easy, just return the pointer. Conversion from a C string is more work, you have to allocate space and do a copy to ensure you have the alignment you want. It makes operations like ystrlen O(1) rather O(n) complexity.

Another consideration is whether you will normalise ystrings after an operation. For example all three represent the same string:

0 0 5 h e l l o \0

0 5 h e l l o \0

5 h e l l o \0

Are you sure? yes | no

Yann Guidon / YGDES wrote 05/15/2020 at 00:11

C++ is not for me... I'm having more fun with its bastard grandkid JavaScript, who rediscovers life's rules and hardships after turning 20 ;-)

"ystring" does not roll on my tongue but... nah :-P

Yes, conversion to C is easy. That's the point. It would ease communication with POSIX stuff (hopefully).

Conversion from C is less frequent and would occur in more deterministic situations. Most of the runtime cost (for my cases) is when strings are concatenated, when stuff is assembled, so the net gain should be positive. "reshifting" would occur only at the last moment when the whole batch is serialised/streamed/concatenated, normally only once.

You made a mistake with the degenerate cases : it's little endian ;-)

5 h e l l o \0

5 0 h e l l o \0

5 0 0 h e l l o \0

This really helps because the function only has to read the first 32-bits word and mask according to the pointer.

so normalisation is not critical : there is only one or two bytes of penalty for using a larger format, while enforcing a strict resize all the time is costly due to the constant reshifting.

If there is an overflow, you can avoid the cost of shifting by making a tree of strings.

Are you sure? yes | no

Ken Yap wrote 05/15/2020 at 00:22

But concatenation of C strings is already O(n) so can't get any better than that. Only if you have spare space behind one string can you reduce that to the length of one string. Anyway most of the work in s(n)printf is converting the data to characters so saving a bit of string manupulation won't make much difference to CPU work.

Are you sure? yes | no

Yann Guidon / YGDES wrote 05/15/2020 at 10:48

managing memory during multiple concatenations can be a P.I.T.A. and it happens a lot...

Remember : I am implementing a HTTP server, which has to manage quite a number of strings of various and variable sizes...

You can only do so much with snprintf() in this case : you can say what is too much but this is arbitrary... and you have to deal with the eventual error anyway.
If you concatenate with a tree first, you get the size of the whole string to allocate before serialising it. And you can process/manage the error before the serialisation happen.

Are you sure? yes | no

David wrote 12/01/2020 at 17:58

@Ken Yap : this format actually can get better than that. It can concatenate two strings into one "non-ASCIIZ-compatible two-pointer" format "ystring" in O(1) time. You are right that creating an ASCIIZ string of length N by concatenation will inevitably take at least O(N) no matter how you do it. However, concatenating N separate strings of average length M with strcat() requires O( M*N^2 ) ( Schlemiel the Painter's Algorithm https://www.joelonsoftware.com/2001/12/11/back-to-basics/ ) but O( N + N*M ) by making a tree of two-pointer nodes, and flatting down to one long ASCIIZ string only once at the end.

Are you sure? yes | no

Yann Guidon / YGDES wrote 01/09/2023 at 18:21

Well, @Ken Yap it seems I have settled on a flexible mix of features, look at the latest version :-)

I'm currently writing design decision docs, I'll post a summary here later.

Are you sure? yes | no

Ken Yap wrote 01/09/2023 at 22:49

I hope you will thoroughly audit your implementation for weaknesses because imagine what havoc an explotable bug in such a fundamental data structure shared library would cause if it's widely adopted. You could become famous for the wrong reasons. ☹

Are you sure? yes | no

Yann Guidon / YGDES wrote 01/09/2023 at 23:05

@Ken Yap that's what regression tests and fuzzing are, right ?

Are you sure? yes | no

Ken Yap wrote 01/09/2023 at 23:08

That's only the start of it. If you look at CVEs even things like Perl's sort becoming quadratic on certain inputs was regarded as a weakness.

Are you sure? yes | no

ziggurat29 wrote 05/14/2020 at 17:52

I like this. I may use it myself in the future. Please do not patent! lol

Are you sure? yes | no

Yann Guidon / YGDES wrote 05/14/2020 at 18:25

There is no risk : a patent can't apply on a format. Only on methods.
I write everything under AGPLv3 (unless required by a client).
Many software use a prefixed format with size being 4 bytes, which is a waste but keeps the code short and fast. I'm trying to find a compromise...

Tell us what you think :-)

Are you sure? yes | no

ziggurat29 wrote 05/14/2020 at 19:01

I had occasion to use pascal-style strings just yesterday. It wouldn't have helped in that particular case because all the data were small so I could get away with a fixed size of 8-bits for the length, but I like your adaptable solution for other cases where it might need to spill over to 16-bits or more occasionally, rather than bumping up all lengths to 16-bit just because a few happened to be long.
Vaguely it reminds me of how UTF8 achieves bisexuality with the ANSI world, but your use of the lower two bits of the pointer to encode the length of the length is particularly clever.
I was joking about patents, of course, however this is in fact a 'method'.

Are you sure? yes | no

Yann Guidon / YGDES wrote 05/14/2020 at 19:44

Feel free to use that trick :-)

If you only have 1 and 2 bytes for the size, you can simply test the LSB of the pointer and select the appropriate processing code. I am considering that "degraded mode" for 16-bits platforms.

I am thinking a lot about it, now, it's my new pet idea and a lot of things are flashing through my head, we'll see in the next days and weeks after the brightest ideas have percolated to the top...

Using the LSB to encode meta-informations is not new : I think some LISP and IA machines use this technique (for example, add a flag to indicate the type of a node in a linked list or tree) but the "smart" idea here is that the pointer serves 2 purposes at once ! It can indicate the format and still point to the start of the string ! It sounded so cool that I wondered why nobody did it before and I am still digesting a lot of documentation on the subject.

Are you sure? yes | no

ziggurat29 wrote 05/14/2020 at 20:03

Yes I was thinking that a better analogy than UTF8 might be ARM Thumb instructions: the LSB of the address indicates whether the target code (of a call or interrupt vector) is actually Thumb instructions. Since instructions can never be on an odd address, they re-purposed that otherwise unused bit. (I know this well from disassembling, lol)
Your encoding technique is still useful on 32-bit platforms.
It will be interesting to hear your thoughts about it's applicability to trees and linked lists when you get to that.

Are you sure? yes | no

Yann Guidon / YGDES wrote 05/14/2020 at 20:54

Yes, playing with pointers is a risky business...

I've been a CPU architecture geek for a long while and it is always tempting to use "some bits here and there" for "this and that"... And it often backfires.

Here it might be a sweet spot, we'll see. Anything is better than sprintf() ;-)

Are you sure? yes | no

Aligned Strings format

Description

Details

Files

AS20230117.tgz

AS20230112.tgz

AS20230110.tgz

aligned_strings.h

PascalZ.h

Project Logs

Collapse

Fuzzing and safety

More than an error

Another possible extension

Article !

More food for thoughts

Holding back a bit

Merge works

Evolution...

2023 : a new version

Context

Discussions

Similar Projects

Simple Compiler

Adventures in Rust

NS 32000 cross-assembler in C

Argentum programming language

Attribution	MSB	other bits	b2	b1-b0
NULL			0	00
Unicode point (including value 0)	0	Unicode value	1	00
Placeholder / skip marker	1		1	00

Aligned Strings format

Become a Hackaday.io member

Just one more thing

Description

Details

Files

Project Logs Collapse

Enjoy this project?

Discussions

Become a Hackaday.io Member

Similar Projects

Does this project spark your interest?

Report project as inappropriate

Send message

Remove Member

Project Logs

Collapse