Project | n00n - Real Time Music Sensor Streaming Protocol


Non-volatile configuration parameters
06/25/2023 at 00:14 • 0 comments
Each device (in particular the sensors such as keyboards, pedals...) needs to store parameters in non-volatile memory, for example battery-backed SRAM or EEPROM. All active devices (those that do not affect the stream) should be able to answer the -3 Label and -4 Serial messages with the proper UTF-8 payload.
- The Serial is a guaranteed unique string that is used by the DAW, for example, under the hood. It is not meant to be user-editable and is typically stored by the manufacturer in the firmware. It is a free-form UTF-8 string that describes the manufacturer, model, revision, serial number, date of manufacture...
- The Label is a user-editable and performance-dependent UTF-8 string with a short description of the device in the context of the current n00n stream: "left keyboard" "pedal 1" "mixer"... This should be stored in the device and can be edited directly or through commands sent by a DAW.
- The last 16-bit address/ID should also be stored in non-volatile memory to preserve known-good values. Even better, the last 3 valid addresses are stored, for convenience and quick recovery if the device goes back and forth from one environment to another.
- More values may need to be stored. For example, a keyboard needs calibration values for min and max positions of the keys or the analog wheel, or the current transposition.
(note : the Serial string can be used as a seed to generate the candidate 16-bit IDs)
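As a rough sketch of what such a non-volatile record could look like: the struct below groups the Serial, the Label, the last 3 adopted IDs and a pair of calibration values. The names (`n00n_nv_config`, `n00n_nv_remember_id`), the field sizes and the "most recent first" ordering are my assumptions for illustration, nothing here is fixed by the protocol.

```c
#include <stdint.h>

#define N00N_ID_HISTORY 3    /* "the last 3 valid addresses" */
#define N00N_LABEL_MAX  255  /* longest label carried by a Label message */

/* Hypothetical layout of the record kept in EEPROM or battery-backed SRAM. */
typedef struct {
    char     serial[128];               /* factory-set UTF-8 string (size is an assumption) */
    uint8_t  label_len;                 /* number of valid bytes in label[] */
    char     label[N00N_LABEL_MAX];     /* user-editable UTF-8 string */
    uint16_t last_ids[N00N_ID_HISTORY]; /* known-good IDs, most recent first */
    uint8_t  id_count;                  /* how many entries of last_ids[] are valid */
    uint16_t cal_min, cal_max;          /* example calibration values */
} n00n_nv_config;

/* Put a newly adopted ID at the front of the history, without duplicates.
 * When the history is full, the oldest entry is dropped. */
static void n00n_nv_remember_id(n00n_nv_config *cfg, uint16_t id)
{
    uint8_t pos = cfg->id_count;        /* id_count means "not found" */
    for (uint8_t i = 0; i < cfg->id_count; i++) {
        if (cfg->last_ids[i] == id) { pos = i; break; }
    }
    if (pos == cfg->id_count) {         /* new ID: grow, or reuse the oldest slot */
        if (cfg->id_count < N00N_ID_HISTORY) cfg->id_count++;
        pos = (uint8_t)(cfg->id_count - 1);
    }
    for (uint8_t i = pos; i > 0; i--)   /* shift the more recent entries down */
        cfg->last_ids[i] = cfg->last_ids[i - 1];
    cfg->last_ids[0] = id;
}
```

On power-up, a device could try `last_ids[0]` first, then the older entries, before falling back to fresh pseudo-random candidates.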
Addressing and naming
06/24/2023 at 20:54 • 0 comments
First drafted in log 3. Even more drafting

n00n is not a traditional network protocol and this is obvious when looking at the naming system. It borrows a few principles from MIDI of course, and some fuzzy ideas from ATM and DHCP might also linger somewhere.

n00n has 3 complementary levels or types of naming:
1. the device-level ID or address,
2. the logic-level Serial string
3. the user-defined Label for convenience.
-o-O-0-O-o-
At the very bottom, there is only one 16-bit ID (or "address") field in the packets (of course associated to only one device). As noted before, this is overkill for the expected type of application (a dozen devices at most, maybe) but the first argument is "hey, they are available". And from there, it enables new constructs, such as dynamically allocated addresses (unlike MIDI which requires careful manual configuration because the range is pretty narrow: only 16 channels !). A 16-bit space is great because it keeps the chances of collisions low if the addresses are "chosen at random" (but see The Birthday problem). This way, a device can be plugged and operational in the n00n signal path with almost no effort : a new stream appears from a given address and that would be all.

In practice it's a bit more complex because collisions remain possible and must be avoided. The first draft proposed that an "upstream" device could force downstream devices to reallocate to new addresses, because the serial link is considered as a "one-way path" and an upstream device can not know what's going on at the end of the daisy chain. An ID reallocation could eventually cause an avalanche of reallocations downstream so it's not a preferred method.

This is solved by closing the daisy-chain and turning it into a ring. The Global Timestamp Generator can also serve as the closing point, filtering all the raw instrument data and letting the control messages pass for one more round in the ring (eventually setting a bit to prevent more than one round). This way, a device can "probe" the ring by sending a "-2: ping" to the desired address, and if it comes back unaltered, then the candidate address is adopted (otherwise, try another pseudo-random address).

Of course the system is different for a star topology.

The candidate address can be generated from any source of entropy, one of them is consecutive checksums of the unique Serial string. When it is adopted, the known-good address is stored in non-volatile memory to reduce the chances of collisions in the future.
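Here is a minimal sketch of that probing loop. The hash is a stand-in (FNV-1a folded to 16 bits) chosen only because it is short; the project's own checksum could be chained the same way. The probe callback, `demo_ring` and all the function names are hypothetical: on real hardware the probe would send a "-2: ping" and wait for it to come back around the ring.

```c
#include <stdint.h>
#include <stddef.h>

/* Derive the next candidate ID from the Serial string and the previous try.
 * Stand-in hash (FNV-1a folded to 16 bits), not the project's checksum. */
static uint16_t n00n_next_candidate(const char *serial, uint16_t previous)
{
    uint32_t h = 2166136261u ^ previous;   /* FNV offset basis, perturbed by the last try */
    for (size_t i = 0; serial[i] != '\0'; i++) {
        h ^= (uint8_t)serial[i];
        h *= 16777619u;                    /* FNV prime */
    }
    return (uint16_t)(h ^ (h >> 16));      /* fold 32 bits into 16 */
}

/* Probe callback: returns nonzero when the ping came back unaltered (ID is free). */
typedef int (*n00n_probe_fn)(uint16_t id, void *ctx);

/* Try up to max_tries candidates; returns 1 and stores the adopted ID on success. */
static int n00n_pick_id(const char *serial, uint16_t start,
                        n00n_probe_fn probe, void *ctx,
                        int max_tries, uint16_t *out)
{
    uint16_t id = start;
    for (int t = 0; t < max_tries; t++) {
        id = n00n_next_candidate(serial, id);
        if (probe(id, ctx)) { *out = id; return 1; }
    }
    return 0;                              /* give up, the caller may reseed */
}

/* Demo only: simulates a ring where the listed IDs are already taken. */
typedef struct { const uint16_t *taken; size_t count; } demo_ring;

static int demo_probe(uint16_t id, void *ctx)
{
    const demo_ring *ring = (const demo_ring *)ctx;
    for (size_t i = 0; i < ring->count; i++)
        if (ring->taken[i] == id) return 0; /* the owner consumed the ping */
    return 1;
}
```

A device would normally probe its stored known-good address first and only run this loop when that probe fails.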


There is no "special" address. There is no "broadcast" system because the daisy-chain or ring transmits all the data to all the devices downstream, unless one device recognises its own address : the 16-bit ID is thus both a source and destination address. Of course this makes a driver a bit more delicate for star topologies like USB and Ethernet but it's not impossible to solve with enough embedding and extra parameters.


The device ID (address) is implicitly handled in the protocol stack, however the Label and the Serial are explicit higher-level messages whose values are stored in non-volatile memories. Here are relevant Packet types:

-3 : Label

This packet allows getting and setting the "label" of a device (when possible). This packet can be emitted when the user changes the label on the device itself, so the DAW can update its display. But the DAW can also send this packet to inquire and/or change the label remotely.

When receiving a Label message, check if the ID matches, otherwise forward.

If the ID matches, check the set/get flag : if the flag is "set" then update the device's local label.

Then confirm by sending a Label message containing the device's label (1 to 256 bytes in total: one size prefix byte followed by 0 to 255 UTF-8 bytes).
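The three steps above could be sketched like this. The position of the set/get flag is not defined yet, so `LABEL_FLAG_SET` is a made-up bit, and `device_state`, `label_action` and `handle_label` are illustrative names. The payload is assumed to be one size-prefix byte followed by the UTF-8 bytes.

```c
#include <stdint.h>
#include <string.h>

#define LABEL_MAX      255
#define LABEL_FLAG_SET 0x0004u  /* hypothetical bit, the real flag is TBD */

typedef enum { LABEL_FORWARD, LABEL_REPLY } label_action;

typedef struct {
    uint16_t id;                /* current 16-bit address of this device */
    uint8_t  label_len;
    char     label[LABEL_MAX];
} device_state;

/* Returns LABEL_FORWARD when the packet is for someone else, and LABEL_REPLY
 * when the device must answer with a Label message carrying its own label. */
static label_action handle_label(device_state *dev, uint16_t packet_id,
                                 uint16_t type_flags,
                                 const uint8_t *payload, size_t payload_len)
{
    if (packet_id != dev->id)
        return LABEL_FORWARD;                 /* not ours: pass it downstream */

    if ((type_flags & LABEL_FLAG_SET) && payload_len >= 1) {
        size_t n = payload[0];                /* size prefix, 0 to 255 */
        if (n > payload_len - 1)
            n = payload_len - 1;              /* never read past the payload */
        memcpy(dev->label, payload + 1, n);
        dev->label_len = (uint8_t)n;          /* then store to non-volatile memory */
    }
    return LABEL_REPLY;                       /* confirm with the current label */
}
```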

-4 : Serial

Get the non-volatile identification of the device : manufacturer, model, revision, date, serial number...

When receiving an empty Serial message with an ID that matches the device, the device sends a Serial message with a UTF-8 payload containing the detailed information.

More options or details could be obtained or selected using the Flags field. TBD.
-o-O-0-O-o-
So the label is what gets displayed on the DAW's GUI, next to the relevant information. It could change from project to project, it's totally dynamic but this ephemeral short string is meant to be practical and configurable. It could be displayed on the device's screen and/or edited there as well.

The Serial string is guaranteed unique and fixed so it is certain to identify a single device, and it doesn't change, so this is what the DAW uses to internally identify the device. The volatile ID and label are associated to this internal string.
In the end, the whole system built from these devices should be "plug and play", with a very short automatic setup. Any new data stream that is detected from an unknown address can be immediately classified by the DAW which will enquire the capabilities, the eventual existing label, the Serial string, and deduce some temporary label name from the type of packets that are received, which the user can later edit in the project. The DAW can keep a small archive of already plugged devices to remember the label and address, as well.
The only problem with the ring topology is when a device is plugged or unplugged, which breaks the ring and interrupts the stream. Each device must be able to detect when its input is "disconnected" (nothing is received) so that it alone sends replacement timestamps. Otherwise, other devices downstream may also decide to fire their own timestamps and that would break all the timings. But these are considerations for another log.
Timestamps
06/23/2023 at 22:51 • 0 comments
Timestamps have been discussed in the logs 2. More drafting and 9. High-resolution timestamps. They should be reasonably precise, useful and convenient. For example: time arithmetic is straightforward, shifting a stream in time is a simple 32-bit addition with saturation.

This is different from many timing systems, such as sample-based timing which is too dependent on sampling rates, and n00n is meant to mix several streams with possibly irregular sampling rates. A fractional Hertz system is the easiest common denominator and rounding to the next few points adds only minimal jitter.

65536Hz is also more fine-grained than the usual 44100Hz and 48000Hz sampling rates so the stream can be synchronised to within one sample.
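To make the arithmetic concrete, here is a small sketch that treats the timestamp as a signed 32-bit value (seconds in the upper 16 bits, 1/65536 s in the lower 16). The names are mine, and two choices are assumptions: the invalid "min" value described in this log is propagated untouched, and negative saturation stops one step above it.

```c
#include <stdint.h>

typedef int32_t n00n_ts;             /* seconds:fraction as a signed 16.16 value */

#define N00N_TS_INVALID INT32_MIN       /* 0x8000:0000, "no timestamp" */
#define N00N_TS_MIN     (INT32_MIN + 1) /* assumed lowest usable value */
#define N00N_TS_MAX     INT32_MAX       /* 0x7FFF:FFFF */

/* Shift a timestamp by delta ticks (1/65536 s), saturating instead of wrapping. */
static n00n_ts n00n_ts_shift(n00n_ts t, int32_t delta)
{
    if (t == N00N_TS_INVALID)
        return N00N_TS_INVALID;      /* an invalid timestamp stays invalid */
    int64_t sum = (int64_t)t + delta;
    if (sum > N00N_TS_MAX) return N00N_TS_MAX;
    if (sum < N00N_TS_MIN) return N00N_TS_MIN;
    return (n00n_ts)sum;
}

/* Nearest audio sample index for a timestamp t >= 0 at the given sampling rate.
 * (For negative t the division truncates toward zero, so rounding differs.) */
static int64_t n00n_ts_to_sample(n00n_ts t, uint32_t sample_rate)
{
    return ((int64_t)t * (int64_t)sample_rate + 32768) / 65536;
}
```

For example, timestamp 0x0001:0000 (one second) maps to sample 48000 at 48000Hz, and a single tick is already smaller than one sample period.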

There are two special values :
- Time zero (0x0000 : 0x0000) is the start of an actual track's contents.
- Time "min" (minimal) (0x8000 : 0x0000) corresponds to the least possible timestamp and means that the value is not valid so it is not to be trusted or taken into account (like a "no timestamp") and the schedule is taken from context and external data.
So there are two ranges of 32768 seconds, that's 546 minutes or a bit more than 9 hours, which should be long enough for long outtakes, rehearsals, you name it. And if it's not enough, it can be doubled because negative timestamps are totally fine, so the recording could be started from (0x8001 : 0x0000) and proceed up to (0x7FFF : 0xFFFF). If the recording is longer than 18 hours and 12 minutes, the counter overflows to (0x8000 : 0x0000) to signal an invalid timestamp.
- The overflow is a corner case that must be properly managed by the DPLL,
- Erratic values must also be filtered out. The timestamp can suffer from some jitter and lag from networks but the overall speed should be within +/- 5% of actual real time to not be rejected.
- Gaps should also be gracefully handled. Maybe the timestamp is reset to minimal sometime during the recording, which could also be a case of two recordings that are merged.
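The +/- 5% rule can be checked with integer arithmetic only. This is just one possible reading of it, with a hypothetical function name: compare how far the received timestamp advanced against how far the local clock advanced over the same interval, both counted in 1/65536 s ticks.

```c
#include <stdint.h>

/* Accept the remote timestamp advance when it is within +/-5% of the locally
 * measured elapsed time. Both arguments are in 1/65536 s ticks. */
static int n00n_rate_plausible(uint32_t remote_elapsed, uint32_t local_elapsed)
{
    uint64_t lo = (uint64_t)local_elapsed * 95;   /* 95% of local, scaled by 100 */
    uint64_t hi = (uint64_t)local_elapsed * 105;  /* 105% of local, scaled by 100 */
    uint64_t r  = (uint64_t)remote_elapsed * 100;
    return r >= lo && r <= hi;
}
```

A rejected value would then simply be ignored by the DPLL instead of pulling the local clock off course.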
Here are relevant Packet types:

-1 : Tick

It's just a header that is passed along the chain to synchronise all the other clocks downstream. No data payload. Upon reception :
- Make sure the time code is coherent (sample for 1 second ?)
- Update the local clock (and/or the soft PLL)
- Make sure that the Channel_ID does not collide (otherwise, rename yourself randomly and send a rename message downstream)
If no tick received within 1 second, send your own tick (8-16Hz ?)

-7 : JUMP

This special packet could be sent to warn about the shift to a different timestamp range, backwards or forward. The local timestamp counter is reset to a new value, but DPLL parameters should not be changed. Could be used to extend the duration of a recording or stream.

Streams vs packets
06/23/2023 at 02:58 • 0 comments
n00n data are sequences of packets that can be embedded in a continuous stream, such as digital sound data, carried in a .WAV or over S/PDIF, or burned on a CD. That's why n00n is structured in words that are pairs of 16-bit values, mapped to left/right channels.

Transforming the packets to/from the continuous streams is not difficult, it's just a matter of proper "framing".
- The header has 32 constant bits for header alignment, backed by 16 bits of checksum, so that's 2^-48 chances of random sync.
- Then there are 32 more bits for the payload checksum, which help confirm that the header is good if the payload is good.
This is a bit overkill when transmitting data over UDP packets for example but the format remains the same in any case.

When using a fixed bandwidth, continuous stream (such as S/PDIF or a high-speed serial link), packets are separated by at least one 0 word (32 bits), and more if/when there is temporal alignment. However the real timestamp is in the header and the stream should not be trusted for absolute timing, as it could be interrupted or tampered with. In fact, several simultaneous "recordings" of the stream could happen on different unrelated files, and the header's timestamp helps recover and realign all the streams so they can be remultiplexed. That's why negative timestamps are possible.
On the above diagram, the packet is surrounded by 0h padding, which is present in a "continuous stream". UDP datagrams however do provide the necessary framing so the padding is removed.
The payload checksum is filled in the header, which is then itself checksummed to protect the whole packet.
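The first step of the framing, finding a candidate header in a continuous stream, could look like the sketch below. It scans 16-bit words in left/right pairs, so zero padding is skipped for free. The packing of the signature characters (first character in the high byte) is my assumption, and a hit is only a candidate: it still has to be confirmed by the header checksum, which is not reproduced here.

```c
#include <stdint.h>
#include <stddef.h>

/* Signature 'N','0','0','n', assuming the first character sits in the high byte. */
#define N00N_SIGN1 ((uint16_t)(('N' << 8) | '0'))
#define N00N_SIGN2 ((uint16_t)(('0' << 8) | 'n'))
#define N00N_HEADER_WORDS16 10   /* 20-byte header = ten 16-bit values */

/* Return the index of the first 16-bit value of the next candidate header,
 * or -1 if no complete header fits in the buffer. 'start' must be even so
 * the scan stays aligned on 32-bit (left/right) pairs. */
static long n00n_find_header(const uint16_t *stream, size_t count, size_t start)
{
    for (size_t i = start; i + N00N_HEADER_WORDS16 <= count; i += 2) {
        if (stream[i] == N00N_SIGN1 && stream[i + 1] == N00N_SIGN2)
            return (long)i;      /* candidate only: verify the checksums next */
    }
    return -1;
}
```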
Filtering
06/22/2023 at 20:31 • 0 comments
The n00n protocol is destined to be used in a modern Max/MSP-style system that is totally software configurable, or like a VST-style virtual mesh that processes the sensor data instead of sounds.
n00n is designed to be easy to stream as digital audio (just don't listen to it !) so some tricks are possible. Many functions can be implemented at the sensor level or at any level downstream, recording taps can be placed at any place in the stream... The idea is that a performance can be "recorded" right at the sensor level, and then "replayed" exactly as is, but all the technical parameters can be readjusted later (a bit like what photographers do with "raw" files versus .jpeg files).
So physical interfaces can be as simple as pots (for example) directly tied to an ADC and a microcontroller or FPGA reads it continuously, with only minimal processing. The stream can then be turned into useful data with a DAW that manages "filters" to get the desired user response, for example from a keyboard:
1. Scaling (requires calibration of min and max values to fit the range of 0 to 65535)
2. Curve (could be linearisation, logarithm, exponential, sigmoid...)
3. integration/low pass filtering (to reduce noise, smooth the data, increase accuracy)
4. Acceleration, triggering, hysteresis, ADSR-like shaping...
5. Eventually some time-dependent processing to add "echo", bouncing, delays etc. for added artistic effects...
6. Interpolation/extrapolation to match the synthesiser sampling rate, to provide a smooth parameter.
Then the filtered value can be routed to a synthesis parameter input or filter banks, to be turned into continuous waves.
Due to all the possibilities offered by this type of filtering, the sensor could have a raw 8- or 10-bit range (padded with 0s in the LSBs) but this is increased through the filtering and the oversampling. A high sampling rate is preferred to absolute precision because precision can be recreated during the last step of interpolation, and the synthesizer units work more smoothly with high frequency updates of the parameters (finer steps create fewer artefacts and they are pushed higher in the spectrum). The user can tune the level of sensor smoothing to apply while mixing...
Sensor filtering could take place next to the sensor but this would also increase the complexity of the unit, because all the parameters need to be input somehow (with a fancy physical interface or through a dedicated protocol). Raw sensor output should still be available anyway (enabled through an option) if a keyboard provides a filter. The only critical parameter for a sensor (potentiometer, optical, capacitive, inductive ...) is the calibration : making sure min and max values are correctly set internally so the output values are properly scaled to the whole 16-bit range. This means that the sensor's controller must be able to perform at least efficient multiplication.
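As an illustration of why a multiplier is enough, the sketch below precomputes a 16.16 fixed-point gain once at calibration time, so the per-sample scaling is a subtraction, a multiplication and a shift. A one-pole low-pass is added as the simplest example of smoothing. The struct and function names are hypothetical.

```c
#include <stdint.h>

typedef struct {
    uint16_t min, max;   /* calibrated raw readings at both ends of travel */
    uint32_t gain;       /* 65535 / (max - min), in 16.16 fixed point */
} n00n_cal;

/* Done once, at calibration time: the only division in the whole path. */
static n00n_cal n00n_cal_make(uint16_t min, uint16_t max)
{
    n00n_cal c = { min, max, 0 };
    if (max > min)
        c.gain = (UINT32_C(65535) << 16) / (uint32_t)(max - min);
    return c;
}

/* Per sample: clamp, then multiply. The product always fits in 32 bits
 * because (raw - min) is smaller than (max - min). */
static uint16_t n00n_scale(const n00n_cal *c, uint16_t raw)
{
    if (raw <= c->min) return 0;
    if (raw >= c->max) return 65535;
    return (uint16_t)(((uint32_t)(raw - c->min) * c->gain) >> 16);
}

/* One-pole low-pass: move 1/2^shift of the way toward the new input. */
static uint16_t n00n_smooth(uint16_t previous, uint16_t input, unsigned shift)
{
    int32_t diff = (int32_t)input - (int32_t)previous;
    return (uint16_t)((int32_t)previous + diff / (int32_t)(1u << shift));
}
```

With a 10-bit ADC calibrated between 100 and 900, a raw reading of 500 lands in the middle of the 16-bit range (32767).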
Updated header structure
06/21/2023 at 02:25 • 0 comments
I decided to change the structure of the header : it now contains the checksum of the payload, so it is not appended after the payload. This "protects" the checksum with the header's checksum.
The checksums are nested now, which imposes a sequence for the checksums :
1. The payload's checksum must be computed first (while the header is created)
2. The header's checksum is computed when all the header's fields are filled.
This creates a new situation. The header is now 20 bytes long and the payload checksum might be cleared when there is no payload.
But if the payload has a length of one word (32 bits) or less, then it could be stored in the header instead...
The source code has been modified, it was not complex. The log 4. Header and payload checksums with PEAC is thus deprecated for the detailed description and code, though many aspects are still relevant. The header is now:
```
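#include <stdint.h>

typedef struct {
  uint16_t
     Sign1,  // 'N', '0'
     Sign2,  // '0', 'n'
     Timecode_Frac,      // fractional part of the timestamp (1/65536 s)
     Timecode_Sec,       // seconds part of the timestamp
     Channel_ID,         // 16-bit device ID/address
     Type_Flags,         // packet type and flags
     Payload_checksum1,  // payload checksum, first half
     Payload_checksum2,  // payload checksum, second half
     Payload_Words,      // payload length in words
     Header_checksum;    // computed last, once the other fields are filled
} N00N_header_struct;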
```
That's still 16 bytes of actual data, the first 4 bytes are only for static validation and resync. Everything remains 32-bit-aligned as before, and the payload is appended as is after the header.
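The required ordering can be sketched as follows. Beware: `demo_sum16` is a plain 16-bit sum used as a placeholder, NOT the PEAC checksum of the project, and the helper names are mine. I also assume that `Payload_Words` counts 32-bit words and that the two payload checksum fields cover the left and right 16-bit values respectively. Only the sequence (payload checksum first, header checksum last) comes from this log.

```c
#include <stdint.h>
#include <stddef.h>

typedef struct {
    uint16_t Sign1, Sign2, Timecode_Frac, Timecode_Sec, Channel_ID,
             Type_Flags, Payload_checksum1, Payload_checksum2,
             Payload_Words, Header_checksum;
} N00N_header_struct;

/* PLACEHOLDER checksum: plain sum of every 'stride'-th value. Not PEAC. */
static uint16_t demo_sum16(const uint16_t *w, size_t n, size_t stride)
{
    uint16_t s = 0;
    for (size_t i = 0; i < n; i += stride)
        s = (uint16_t)(s + w[i]);
    return s;
}

/* Checksum over the nine fields that precede Header_checksum. */
static uint16_t demo_header_sum(const N00N_header_struct *h)
{
    const uint16_t f[9] = { h->Sign1, h->Sign2, h->Timecode_Frac,
                            h->Timecode_Sec, h->Channel_ID, h->Type_Flags,
                            h->Payload_checksum1, h->Payload_checksum2,
                            h->Payload_Words };
    return demo_sum16(f, 9, 1);
}

/* payload holds 2 * words32 16-bit values (left/right pairs). */
static void n00n_seal(N00N_header_struct *h, const uint16_t *payload,
                      uint16_t words32)
{
    size_t n = (size_t)words32 * 2;
    h->Payload_Words = words32;
    /* 1. payload checksum first (cleared when there is no payload) */
    h->Payload_checksum1 = n ? demo_sum16(payload, n, 2) : 0;         /* left values  */
    h->Payload_checksum2 = n ? demo_sum16(payload + 1, n - 1, 2) : 0; /* right values */
    /* 2. header checksum last, it also protects the payload checksum */
    h->Header_checksum = demo_header_sum(h);
}

static int n00n_header_ok(const N00N_header_struct *h)
{
    return h->Header_checksum == demo_header_sum(h);
}
```

A receiver can run `n00n_header_ok` as soon as the 20 header bytes are in, before the payload has arrived.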
High-resolution timestamps
06/20/2023 at 15:26 • 0 comments

One quite unusual aspect of the protocol is the use of a power-of-two resolution timestamp.

Usually one would use a 10MHz-derived signal, and end up with powers of ten, but that doesn't fit well with a 16-bit field that is so characteristic of the format. Using 65536Hz looks like a natural choice and the trick is that the timestamp can be simply set to null (MSB set, others cleared) in case the device does not implement a timestamp (in that case, its stream could be mixed with other timestamped streams which will fill the empty field). The unusually high resolution of the timestamp helps with dealing with multiple data sources sampled at irregular intervals or with wildly different rates, so this relaxes the requirement of a tightly coupled system, making it much more resilient. You could sample something at 37Hz and something else at 1337Hz, who cares.

A 65536Hz clock or timebase generator is ... rare. However : stable, cheap, available 32768Hz sources are made for watch and timekeeping purposes. Many Dallas Semiconductor chips, marketed as clocks and calendars with integrated temperature-compensated oscillators (such as DS3231) provide a digital 32768Hz output signal for auxiliary use. Thus there is no need for a PLL : if the duty cycle is 50% then every transition creates a rather good 65536Hz event : an FPGA can detect a level change and update an internal counter for example.

The log Integrated 32KHz clock source shows the use of the DS32KHZ integrated oscillator and how easy it is (more details at #ScoPower ). Thus an autonomous device can generate a very precise timestamp signal to synchronise the whole setup, at clock stability and resolution.

Of course, not all devices need to follow such a high precision, because accuracy, precision and resolution are not the same things. The format simply provides headroom and more or less bits can be exploited. Just be careful of the jitter. There is also significant overhead and lag all over the system : data processing, buffers, transmissions... Small packets get emitted faster and more often so they suffer less jitter. It also depends on when you count a packet as emitted or received : when the header is detected, or when the whole packet is received ?

The main timestamp is emitted at a rate of at least 1Hz. It can be increased to 2, 4, 8Hz or whatever but this increases the overhead over the links so it's always a matter of compromises. The timestamp generator is considered lost if 3 consecutive packets are not received as expected (then the local generator takes over the chain). A downtime of 1 second is the target for maximal recovery time in case of failure so 4Hz or 8Hz is a reasonable compromise. Each device in a ring can take over as a master clock in case it doesn't receive it but they also add their own timestamp down the chain, increasing the resolution for downstream devices. For a star network, the global timestamp must be broadcast : easy for Ethernet/UDP/IP, less so for USB but doable.
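The "3 missed packets" rule amounts to a small watchdog. In this sketch (hypothetical names), local time is counted in 1/65536 s ticks and the device becomes its own tick source once more than 3 expected periods have passed without a tick. Handing the role back as soon as an upstream tick reappears is my assumption, not something settled here.

```c
#include <stdint.h>

typedef struct {
    uint32_t period;     /* expected tick period, in 1/65536 s (8192 for 8Hz) */
    uint32_t last_tick;  /* local time of the last received tick */
    int      is_master;  /* nonzero while this device emits its own ticks */
} n00n_tick_watchdog;

/* Call whenever a Tick arrives from upstream. */
static void n00n_wd_on_tick(n00n_tick_watchdog *wd, uint32_t now)
{
    wd->last_tick = now;
    wd->is_master = 0;   /* assumption: yield to the upstream generator */
}

/* Call periodically. Returns nonzero when the device should generate ticks.
 * The unsigned subtraction keeps working when the local counter wraps. */
static int n00n_wd_poll(n00n_tick_watchdog *wd, uint32_t now)
{
    if ((uint32_t)(now - wd->last_tick) > 3u * wd->period)
        wd->is_master = 1;
    return wd->is_master;
}
```

At 8Hz the takeover happens after 0.375 second of silence, comfortably inside the 1 second recovery target.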

So what matters most is that each device has a pretty accurate local clock with a high-resolution timer (multi-MHz ?) to create a sort of digital PLL in phase with the received global clock reference. A first-order filter is necessary for basic operation, maybe a 2nd order error filter will help in case of a loss of the main signal during more than 10 seconds. When a device takes over, the PLL parameters are slowly adjusted to reach a fixed model-dependent value that creates a "known-good 1Hz" period.
Ring-or-star : General dataflow structure of a device
06/16/2023 at 05:01 • 0 comments
Software blocks should be reused as much as possible, if, how and where needed. The same code is used for the generators (timing/tick/global clock), the transceivers (instruments, recorders) or the receivers (synthesisers). This should even be independent from the actual interface or topology of the links between the devices.
- The only variable module is the "Transport" which interfaces with either SPDIF, USB, serial, Ethernet... This Transport Module manages the topology, the eventual buffers and multiplexing, connexion/disconnect events, and crude filtering. There can even be a "null" transport inside the computer, VST-like.
- Incoming and outgoing packets then go through the checksum module : the checker/generator(s) can be separate if needed but it's basically the same thing.
- From there on, each data packet is considered destined to/coming from the device, inside the memory space of the processor. The null transport can connect directly here, bypassing the checksum module. The packet is assembled and/or dispatched by a mux/demux. There are at least 3 sub-modules that do the actual processing (we're getting there at last):
  - The name/address module manages the filtering, the serial number, the dynamic address, the user label... so it is closely linked to the transport layer.
  - The timestamp module receives clock events, keeps an internal timebase at 65536 Hz (or more) so it can generate fractional timestamps for internal events, in sync with the external timebase. It contains a sort of DCO or "digital PLL".
  - The instrument module does the real data stuff. It can receive external configuration messages, or data streams, but mostly emits data.
  - The Instrument module streams can go through an optional compression module to save bandwidth.
Note: some internal messages could be created for the init, connect and disconnect events...

With the modular architecture, each module can be included depending on the type of device : timestamp generator, sensor, recorder, filter, synthesiser...
Furthermore, the packet format has a separate checksum for the payload and the header so the addressing and the timestamp modules can be addressed without waiting for the payload checksum to be completely received.

Topology
12/15/2022 at 22:56 • 0 comments
Similarly to MIDI, N00N can use a model where the devices are chained, which creates a hierarchy and enforces some priorities. But this is not the only possible topology. This mostly depends on the type of physical transmission interface, and we suppose a unidirectional stream.

Each device should have an input and an output so it can be integrated in a chain, like below, otherwise it is forced to be at one end or another of the chain, which makes it harder to integrate smoothly.
"Hubs" or concentrators are also possible but they require enough CPU to resynchronise the incoming streams. This does not solve many issues and introduces more of them.

N00N over IP (UDP) is possible but this requires more complex methods to configure each device. Static or dynamic addresses are only one easy aspect, devices must also "discover" where to send their own stream... The LSB of the IP address then matches the device's ID. 10BaseT and 100BaseT have latencies in the millisecond range when taking all the OS overhead into account, and it's a big source of jitter... S/PDIF has a much lower latency because data are processed at sample speed, in very small chunks and with very short FIFO.

I was able to reach sub-millisecond latencies with Wiznet modules in a barebones configuration but I'm not sure it applies here.

It would be interesting to use RJ45+50ohms twisted pairs for connection between devices because the parts and cables are very widely spread, though there is always a risk of compatibility with other standards (and PoE could burn the devices). RCA and 75 ohms is used in some older video applications and is not expensive.

___________________

Back to the daisy chain.

The simplest configuration is a source device (like a keyboard) directly connected to a sink device (an expander).

The chain can be expanded by adding more sources (more keyboards, more knobs) upstream, and more sinks downstream. The ends of the chain allow this addition, though there is the risk of breaking it in the middle.

Performance recorders can be added at the end of the chain to capture (and replay) the stream, there can be sequencers as well... But most probably, a computer will act as a DAW at the ends of the chain :
- upstream : it will generate the Tick (beware of OS jitter !) and send configuration messages (label, ID etc.)
- downstream : it will be able to receive all the source data, massage and process the raw information, implement flexible synthesis...
Packet types
12/15/2022 at 20:32 • 0 comments
Here I will try to list all the types of packets I can think of, to allocate numerical identifiers.

Each data source, or device, can generate any type of packet. A device can generate "keyboard" and "knobs" packets, but can only have one set thereof. So if your device has 2 keyboards, you have to emulate 2 separate devices. Note : after mixing tables were suggested to me, it seems I'll have to create structured packets with sub-fields, a "composite packet", but since it is not yet defined, the former approach will be simpler at first.

There are 16 bits in the header for the type and flags fields so each could take one byte at first glance, though there is no strict boundary. Flags could supplement the type and implement subtypes, for example, or an important type could use fewer bits and allocate more flags.

The 2 LSB are not reserved but that's where the compression flags are located :
- 00 : no compression, raw data
- 01 : 16-bit 3R
- 1x : reserved, could be LZW or others TBD.
Compression is optional and provided to save bandwidth so it might be ignored by some data sinks. That's why the source should send raw data once per second. Furthermore : compression does not work all the time and could eventually expand data, so raw transmission is always possible as a failsafe.

There are 4 big classes of packet types, using the 2 MSB :
- 00 : Instrument (actual useful data)
- 01 : Experimental (playground in the protocol, expect things to break)
- 10 : Reserved (leave it alone)
- 11 : Management (non-music data that keeps the system working)
As usual, if a packet type is not recognised or understood, it is ignored.
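A decoding sketch for the 16-bit type/flags field, under the "one byte each" reading: the type is taken as the signed high byte (so the management types written -1, -2... further down in this log come out as negative numbers), the class is the 2 MSB and the compression flags are the 2 LSB. Since there is no strict boundary between type and flags, treat this split and the names as assumptions.

```c
#include <stdint.h>

typedef enum {
    N00N_CLASS_INSTRUMENT   = 0,  /* 00 */
    N00N_CLASS_EXPERIMENTAL = 1,  /* 01 */
    N00N_CLASS_RESERVED     = 2,  /* 10 */
    N00N_CLASS_MANAGEMENT   = 3   /* 11 */
} n00n_class;

enum { N00N_COMP_RAW = 0, N00N_COMP_3R16 = 1 };  /* 2 and 3 are reserved */

/* Class: the 2 most significant bits of the field. */
static n00n_class n00n_class_of(uint16_t type_flags)
{
    return (n00n_class)(type_flags >> 14);
}

/* Compression: the 2 least significant bits of the field. */
static unsigned n00n_compression_of(uint16_t type_flags)
{
    return type_flags & 3u;
}

/* Type: high byte read as a signed value, -128 to 127 (assumed split). */
static int n00n_type_of(uint16_t type_flags)
{
    int t = type_flags >> 8;
    return (t >= 128) ? t - 256 : t;
}
```

With this layout, 0xFF00 decodes as a raw Management packet of type -1, and 0x0201 as an Instrument packet of type 2 with 16-bit 3R compression.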

-o-O-0-O-o-

The Instrument class :

Type 0 is invalid (just in case).
1. Knobs : just a collection of 16-bit values that could represent potentiometers, buttons, slide pots, ribbons, whatever : it's the most generic type.
  Format : 16-bit size prefix, followed by as many 16-bit unsigned values (that may be compressed, see the compression flags).
  The data sink is in charge of "patching"/associating each value with a meaning or function.
2. Keys : a string of 16-bit values representing each key's absolute position across the whole keyboard.
  Format : 1 byte of length prefix (the number of keys),
  1 byte of offset (the position of the lowest note/key)
  Followed by as many 16-bit unsigned values (that may be compressed, see the compression flags).
  Note : no other flags than the 2 compression LSB are defined so there is a lot of room to play with.
3. Aftertouch : a shadow of the Keys type that adds pressure information. Could be redundant but it's available anyway. Same format as Keys.
4. Bend : another shadow of the Keys for lateral pressure on the keys. A fun novelty. Same format as Keys.
  A second Bend might be possible, like BendY and BendZ...
5. Mixing table : a collection of Knobs packets ?
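To illustrate the Keys format (which Aftertouch and Bend share), here is a parsing sketch for an uncompressed payload. The payload is seen as 16-bit values, and I assume the length byte sits in the high half of the first value with the offset in the low half. That byte order is a guess, as are the names.

```c
#include <stdint.h>
#include <stddef.h>

typedef struct {
    uint8_t         count;      /* number of keys in the packet */
    uint8_t         offset;     /* position of the lowest note/key */
    const uint16_t *positions;  /* count unsigned 16-bit key positions */
} n00n_keys;

/* Parse a raw (uncompressed) Keys payload made of 'words16' 16-bit values.
 * Returns 1 on success, 0 when the payload is shorter than announced. */
static int n00n_parse_keys(const uint16_t *payload, size_t words16,
                           n00n_keys *out)
{
    if (words16 < 1)
        return 0;
    uint8_t count  = (uint8_t)(payload[0] >> 8);    /* assumed: high byte */
    uint8_t offset = (uint8_t)(payload[0] & 0xFF);  /* assumed: low byte  */
    if ((size_t)count + 1 > words16)
        return 0;                                   /* truncated packet */
    out->count     = count;
    out->offset    = offset;
    out->positions = payload + 1;
    return 1;
}
```

Key number `offset + i` is then found at `positions[i]`, with 0 meaning fully released and 65535 fully pressed once the calibration is right.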

Well, that's about it for this class, there is not much else to add, but there is room for extension. Knobs can represent about anything anyway, so other types would be required if a different data format is required, for example events for drums ?

Other possible type : Tunnel (to encapsulate other types of traffic, such as MIDI, files, sound, network data...)

-o-O-0-O-o-

The Management class

That's where the kludges are.

For convenience, these type numbers count down from the top of the range, so I'll write them as negative numbers for now.

-1 : Tick

It's just a header that is passed along the chain to synchronise all the other clocks downstream. No data payload. Upon reception :
- Make sure the time code is coherent (sample for 1 second ?)
- Update the local clock (and/or the soft PLL)
- Make sure that the Channel_ID does not collide (otherwise, rename yourself randomly and send a rename message downstream)
If no tick received within 1 second, send your own tick (8-16Hz ?)

-2 : Ping

This is another header with no payload, that is used to enumerate the devices and their IDs in a chain.

When a device receives this header, it must send its own Ping header before forwarding the one(s) it received. This is to prevent storms of packets, FIFO stuffing/overflow, and it also preserves the order of the chain, unlike other methods. The device must then ignore other Pings for a second or until it receives several packets that are not Pings.

-3 : Label

This packet allows getting and setting the "label" of a device (when possible). This packet can be emitted when the user changes the label on the device itself, so the DAW can update its display. But the DAW can also send this packet to inquire and/or change the label remotely.

When receiving a Label message, check if the ID matches, otherwise forward.

If the ID matches, check the set/get flag : if the flag is "set" then update the device's local label.

Then confirm by sending a Label message containing the device's label (1 size prefix byte and 0 to 255 UTF-8 bytes).

-4 : Serial

Get the read-only identification of the device : manufacturer, model, revision, serial number...

When receiving an empty Serial message with an ID that matches the device, the device sends a Serial message with a UTF-8 payload containing the detailed information.

More options or details could be obtained or selected using the Flags field. TBD.

-5 : Capabilities

This is where it gets somewhat MIDI-ish but capability enumeration is not strictly necessary because the available data streams are output anyway.

Maybe this could even be used to "tune"/enable/disable/configure each stream type from the device (sensitivity, max/min values, speed...)

-6 : Rate

Manage the update frequency, the relative priority for stream insertion...

-7 : JUMP

The timestamp counter is reset to a new value. Could also be used to extend the duration of a recording or stream.


-o-O-0-O-o-

This numbering with 2 ends (management at the top of the range, and instruments at the bottom) allows the numbering and attributions to grow towards the middle of the range, at their own pace, while allowing efficient table-driven decoding of the most frequent messages.
