-
Decoder architecture
12/12/2025 at 06:47 • 0 comments
These days, the ideas are:
- combine the running sum with the error detection/correction bits
- borrow some design principles from TCM
- implement the Viterbi-like decoder using an array of asynchronous gates, a sort of physical trellis
- move to 3 or 4 bits per sample instead of a simple 3-level input
- let the asynchronous gate trellis handle clock resync
The size of the codes is still undetermined. The ratio could be at worst 1B1T, at best around 3B2T.
.
        1T   2T   3T   4T   5T   6T
        3    9    27   81   243  729
1B=2    X    X    X    X    X    X
2B=4         X    X    X    X    X
3B=8         X    X    X    X    X
4B=16             X    X    X    X
5B=32                  X    X    X
6B=64                  X    X    X
7B=128                      X    X
There is an "upper diagonal" (quite arbitrary) where B=T, and a lower slope forced by 3^T >= 2^B.
If possible the words should be short to keep the circuit small, but this reduces the encoding efficiency. The remaining candidates are:
- 1B1T : huh...
- 2B2T : like above.
- 3B2T : 9/8=1.125 => very little margin for ECC and no way to fight baseline wander
- 4B3T : 27/16=1.6875 => first efficient BLW code but no room left for ECC
- 4B4T : 81/16=5.0625 => one added trit for ECC but the internal state is getting very large !
- 5B4T : 81/32=2.53125 => looks interesting, there is a bit more than 1 bit of extra data
- 5B5T : 243/32=7.59375 => looks like overkill, almost 3 bits of added data
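The ratios in the list above are easy to sanity-check in software. This little sketch recomputes them and the "spare" information (in bits) each candidate leaves for ECC and BLW control; `surplus_bits` is just a helper name for illustration.

```python
# Sanity-check the nBmT candidates above: mapping B bits onto T trits
# needs 3**T >= 2**B, and the surplus log2(3**T / 2**B) is the room
# left over for ECC and baseline-wander control.
from math import log2

def surplus_bits(B, T):
    """Extra information capacity (in bits) of T trits beyond B bits."""
    return T * log2(3) - B

for B, T in [(3, 2), (4, 3), (4, 4), (5, 4), (5, 5)]:
    ratio = 3**T / 2**B
    print(f"{B}B{T}T: 3^{T}/2^{B} = {ratio:.5f}, spare = {surplus_bits(B, T):.2f} bits")
```

For 5B4T this prints a surplus of about 1.34 bits, which is the "bit more than 1 bit of extra data" noted above.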
So how many bits do we need for both ECC and BLW? => this sets the ratio between B and T
And what could/should the ADC resolution be? => this sets the number of pins. 4 pins / 16 values looks like the maximum, but 4 pairs also make a 5-value flash ADC, and that's 8 pins already...
-
Ternary Viterbi
12/02/2025 at 04:53 • 0 comments
Over at the #miniMAC project, there was the log 108. Error correction and the realisation that the channel coding should be done directly in ternary. The encoder is pretty straightforward (see the above log) but the decoder is a different beast. Still, I know it's possible since Gigabit Ethernet uses it (TCM) over 4 simultaneous channels, so there should be a way, right?
http://pl91.ddns.net/viterbi/algrthms2.html has some good ideas for the binary case and I don't want the system to get out of hand (complexity, size, latency). But I'm somehow glad that others have already studied the subject of conversion to ternary.
- 2023: "Design of Ternary Convolutional Code Using Reconfigurable Architecture", https://www.ijfmr.com/papers/2023/2/1757.pdf
- 2017-2021: Bharath Rao Madela, https://dspace.library.uvic.ca/server/api/core/bitstreams/faafcf78-7e00-4fe4-b599-95f97f9cafa5/content
- 2022: Henri Mertens and Marc Van Droogenbroeck, "Error-rate in Viterbi decoding of a duobinary signal in presence of noise and distortions: theory and simulation", https://arxiv.org/pdf/2209.01360
- 2002-2014: Khmaies Ouahada, several publications starting from the thesis, university work and more later. Also found on ResearchGate; great reading that converges with several of my conclusions.
- https://www.researchgate.net/publication/266389989_Viterbi_Decoding_of_Ternary_Line_Codes
- https://www.vodafone-chair.org/pbls/legacy/gerhard-fettweis/High-rate_Viterbi_processor_a_systolic_array_solution.pdf (not ternary though)
So the subject is not novel (which is both sad and great), and I have actual data to crunch. Note that several studies stop at FPGA compilation and simulation tests; real field tests are lacking, and I would love to see actually measured waveforms. Where are the oscilloscopes?
.
"Soft-decision Viterbi" seems to point to using more input bits, like a 3- or 4-bit flash ADC but this becomes impractical. Actually, that's where TCM leads us. And with 2 input bits, that's a 4-bit ADC. If the 2 differential inputs are used as 4 single-ended ones, that's a 4-bit, 16-level Flash ADC.
.
So far the idea is to do the DSP part in ternary though I doubt I could achieve Viterbi decoding at 50M trits/second (equivalent to about 60Mbps of useful bandwidth). Some parallelism becomes necessary. And other tricks too.
.
Anyway, using a convolutional code solves quite a few things and the Viterbi decoder is somewhat simplified (a tiny bit) because the binary-to-ternary (3B2T) conversion leaves one code unused (8 out of 9 codes). However, the mechanism against baseline wander goes out the window. Unless I can make a wander-reduction mechanism that also acts as a convolutional code? Then the extra data would work for both effects?
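To keep the add-compare-select structure in mind, here is a minimal hard-decision Viterbi sketch over an explicit ternary trellis. The `toy_trellis` (state = previous trit, output mixing input and state) is a made-up example, NOT the project's actual code; the point is only the survivor-path bookkeeping that a physical trellis of gates would have to mimic.

```python
# Minimal hard-decision Viterbi over a ternary trellis (toy example).
# Branch metric = 0/1 symbol mismatch (Hamming distance).

TRITS = (-1, 0, 1)

def toy_trellis(state, t):
    """Next state is the input trit; output mixes input and previous trit."""
    # "sum mod 3" of input and state, remapped onto {-1, 0, 1}
    sym = ((t + state + 1) % 3) - 1
    return t, sym

def encode(trits, trellis):
    state, out = 0, []
    for t in trits:
        state, sym = trellis(state, t)
        out.append(sym)
    return out

def viterbi(received, trellis):
    # path metric and survivor history per reachable state
    paths = {0: (0, [])}
    for sym in received:
        nxt = {}
        for state, (metric, hist) in paths.items():
            for t in TRITS:               # try every branch (add)
                ns, out = trellis(state, t)
                m = metric + (0 if out == sym else 1)
                if ns not in nxt or m < nxt[ns][0]:   # compare-select
                    nxt[ns] = (m, hist + [t])
        paths = nxt
    return min(paths.values())[1]         # best survivor path
```

With a clean channel, `viterbi(encode(msg, toy_trellis), toy_trellis)` recovers `msg`; the real decoder would of course use the project's own trellis and soft metrics.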
-
Gearbox
11/30/2025 at 21:01 • 0 comments
The latest log 108. Error correction over at #miniMAC - Not an Ethernet Transceiver brings an interesting idea for the adjustment of transmission parameters: using a LUT with different GCR patterns, so that speed is not the only adjustment knob to turn.
The LUT can be hardwired or reprogrammable and contains basic binary patterns (such as tweaked Manchester codes) or ternary (MLT3, 4B3T, 3B2T...)
A link is established with the basic code and frequency increases until the error rate becomes significant, then the GCR LUT is changed to a more efficient one as long as the peer supports it.
I'm not sure yet how to place the FEC in the pipeline; it would be an extra, optional layer...
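The negotiation described above can be sketched as a simple control loop. Everything here is a placeholder: the code list, the step size, the BER threshold and `measure_ber()` (standing in for whatever error counter the real link exposes) are all assumptions for illustration.

```python
# Sketch of the "gearbox": raise the frequency while the link has margin,
# then swap the GCR LUT for a denser code, until no knob is left to turn.

CODES = ["manchester", "MLT3", "4B3T", "3B2T"]   # simplest to densest

def negotiate(measure_ber, f_start=10e6, f_step=5e6, f_max=100e6, ber_max=1e-6):
    code_idx, freq = 0, f_start
    while True:
        if freq + f_step <= f_max and \
                measure_ber(CODES[code_idx], freq + f_step) < ber_max:
            freq += f_step                  # speed still has margin
        elif code_idx + 1 < len(CODES) and \
                measure_ber(CODES[code_idx + 1], freq) < ber_max:
            code_idx += 1                   # swap the LUT for a denser code
        else:
            return CODES[code_idx], freq    # no knob left to turn
```

A real implementation would of course probe tentatively and fall back on errors instead of trusting a single measurement.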
-
4B3T
07/10/2025 at 19:35 • 0 comments
Baseline wander (BLW) is a very concerning and underrated problem, covered in the last log 4B4T: An extended ternary Manchester code and its implications.
MLT-3 has two compelling aspects:
- the spectrum is shifted toward lower frequencies (EMI compliance) and
- the coder/decoder looks quite simple. Roughly.
The MLT-3 encoder is very simple but the decoder is very sophisticated due, indeed, to BLW. So the spectrum spread and the AFE complexity are linked: higher code efficiency and reduced EMI require more high-speed DSP effort.
This project focuses on low cost and ease of implementation with very affordable, simple parts, not on absolute speed, so the receiver's AFE must be kept as simple as possible to keep the BOM very low. Not much analog magic is possible, like active filters. One of the first logs (AGC) shows that the AGC could be done almost entirely with passive parts (and a few diodes), but I'm not sure this feat can be repeated for other functions.
If needed, the clock speed can be reduced to conform to EMI rules. But it is important that overall noise and BLW remain low enough that analog filtering is almost unnecessary. I'd say the level of BLW must stay below half the difference between the 0 and + levels.
This is why I should evaluate the performance of the 4B3T FoMoT code and see if it fits my requirements. The 4-level Running Digital Sum (RDS) (actually -3/+3) looks promising.
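The RDS bookkeeping itself is trivial to model. This sketch accumulates the running digital sum of a trit stream and flags any excursion outside a bound (the ±3 range mentioned above); the trit stream in the usage example is arbitrary.

```python
# Running Digital Sum (RDS) bookkeeping: every transmitted trit moves the
# sum by -1/0/+1, and keeping the RDS bounded is what limits baseline
# wander. The +/-3 bound mirrors the range mentioned in the log.

def rds_trace(trits, lo=-3, hi=3):
    """Accumulate the RDS and flag any excursion outside [lo, hi]."""
    rds, trace = 0, []
    for t in trits:
        rds += t
        if not lo <= rds <= hi:
            raise ValueError(f"RDS excursion to {rds}: baseline wander risk")
        trace.append(rds)
    return trace
```

For example, `rds_trace([1, 1, 0, -1, 1, -1, -1])` yields `[1, 2, 2, 1, 2, 1, 0]`: the word ends balanced and never strays far from zero, which is exactly what a 4B3T-style encoder enforces by construction.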
I also consider a protocol (passive/implicit or active/explicit) to "drift" the clock and adapt to the line's capacitance/inductance. Transmission would start at a standard, low frequency (10MHz ?) and slowly increase as long as both receivers see some "margin".
-
4B4T: An extended ternary Manchester code and its implications
06/04/2025 at 23:56 • 0 comments
So the last log re-emphasised the importance of baseline wander (BLW) on the design of the AFE.
Modern designs have sophisticated hyper-fast ADCs and perform complex DSP to compensate for many line effects, including droop and BLW. That is totally out of the realm of possibility here: the miniPHY must be very simple.
At the other end of the spectrum, 10Mbps Ethernet uses the Manchester code, which is very inefficient: 2 bauds/bit, the bit value followed by its inverse. However it has a wonderful property: there is no room for BLW, as each code is "neutral" by definition.
Hybrid_ternary_code has an intriguing and very simple encoding scheme at 1 bit/baud. Not great, not bad: it's a baseline.
The 3B2T code (9 symbols) is pretty efficient (density/packing = 1.5 bit/baud) but its balance/neutrality is data-dependent. Trying to preprocess the data to prevent unwanted patterns is hard and expensive: the hardware overhead is significant (it adds latency and bloats the circuit) and the packing density is still unclear. Adding one bitrit worth of information (8 symbols) to 7 bitrits would reduce BLW by "a certain amount" but it's still too data-dependent and has insufficient effect/leverage: 1/7th overhead (14%) cannot ensure DC balance in all cases.
4B3T has a slightly worse density (1.3b/baud) but can ensure DC balance, using a 3-bit running disparity counter, a reasonably-sized LUT and a pretty simple decoder. It links consecutive words/nibbles but it looks like it's the smallest such scheme, simpler than the 2-LUT 8b/10b system.
Some interesting analysis can be found at Block_Coding_with_4B3T_Codes
![]()
Let's say "it's interesting"...
...
But what if we don't want to link nibbles? We then need a scheme where all the codes are DC-balanced, just like Manchester. HTC (see above) also has to link consecutive bits to work. In ternary, we can get the equivalent of Manchester with a triplet of codes: +- / -+ / 00. But then long runs of 0s must be prevented, so it's basically Manchester (2 bauds/bit) with an extra S code.
Going to 3 trits, we get 6 non-null balanced codes: +0- / -0+ / 0+- / 0-+ / +-0 / -+0, which amounts to about 2.58 bits per 3 trits. Not great.
Four trits get interesting though: 9 non-zero invertible codes (18 total) give something like 4B/4T:
00+- +00- +-00   /   00-+ -00+ -+00
0+0- 0+-0 +0-0   /   0-0+ 0-+0 -0+0
++-- +--+ +-+-   /   --++ -++- -+-+
This gives 16 data codes, 2 control codes and one "quiet/silent/same" marker. This almost looks like something!
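A quick enumeration confirms the count: among all 4-trit words, exactly 19 sum to zero, and dropping the all-zero word leaves the 18 balanced codes listed above.

```python
# Enumerate all 4-trit words whose levels sum to zero: these are exactly
# the DC-balanced codewords usable for the stateless "4B/4T" idea above.
from itertools import product

balanced = [w for w in product((-1, 0, 1), repeat=4) if sum(w) == 0]
nonzero  = [w for w in balanced if any(w)]
# 19 balanced words in total; dropping 0000 leaves the 18 listed above
# (16 data codes + 2 control codes), with 0000 kept as the "same" marker.
print(len(balanced), len(nonzero))   # -> 19 18
```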
Packing-wise, it has 33% overhead compared to 4B3T so the data bandwidth drops by 25%. It is stateless though and the LUT is smaller.
But compared to HTC, the density is almost the same: 1 bit/baud! The control codes are nice and not a significant bandwidth cost, but HTC is way simpler.
I intend the miniPHY to have various (incompatible) versions, so it is good to start with the simplest possible code. HTC does not have a "Same/Silence" code, though, which helps with signalling and the protocol, so let's skip it.
So the development course would be :
- Start with 4B4T, simple/easy/low bandwidth which can be implemented in either 2send/2receive or 1send/3receive if bandwidth matters, and see how it works in practice.
- Increase the bandwidth usage with 4B3T, as a simple upgrade on the FPGA side
- Meanwhile, see if I can figure out a balancing scheme to retrofit into 3B2T with a smaller overhead than 4B3T.
This whole analysis has brought a lower bound of coding overhead to bring DC balance. Looking at 8B6T, it does not look like this packing ratio can be easily improved.
From there, if a line frequency of 30MHz cannot be exceeded, and each MHz carries 2 bauds,
- 4B4T will bring about 60Mbits per lane (hypothetically and unlikely, let's say 20Mbps)
- 4B3T increases to 75Mbps (ok let's say 25 or 30Mbps)
- 3B2T could reach 80 (25 to 33Mbps in good conditions)
The cool thing with a custom miniPHY is that the clock frequency could be adjusted according to the line's characteristics (length, capacitance...) and we could add lanes as needed...
-
Reverse antibias
06/04/2025 at 01:23 • 0 comments
The last log has shown that the running disparity of a whole word can be computed in parallel, but at a high cost.
Wouldn't it be better to compute only one word's disparity then deduce the correction ?
That's what the previous systems (NRZ and MLT3) enabled, with simple parity as well as mod4. Yet it was still not satisfying.
The baseline wander can be seen as a "random walk" with no limit on the excursion; limiting it requires extra coding, which I'd like to minimise. This is the territory of 4B3T and its cousin 8B/6T, with a very short-range disparity and very low excursion, hence a high overhead. I'd like to keep it at or below 3 bits / 8 codes per 20-bit word, so the idea of tweaking the data at the source is pretty interesting.
I don't know why, but what I imagine right now is the bias evaluation starting from the middle of the word, going in both directions, seeing how the wander evolves, then at 1/4, 1/2 and 3/4, "swapping" something to invert the bias slope. Thus the disparity counter can reach higher values but clumps can be broken up. I think. Aaaand it looks a bit (from afar) like Knuth's idea (D. E. Knuth, "Efficient Balanced Codes", IEEE Transactions on Information Theory, vol. IT-32, no. 1, January 1986).
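Knuth's binary trick is easy to sketch (a hypothetical helper, not this project's circuit): for any even-length binary word there is always a prefix length k such that inverting the first k bits balances the word, and transmitting k is the whole overhead.

```python
# Knuth's balancing trick, binary version: as k goes from 0 to n, the
# count of 1s changes by +/-1 at each step and moves from s to n-s, so
# it must pass through n/2 for even n (discrete intermediate value).

def knuth_balance(bits):
    n = len(bits)                            # n must be even
    for k in range(n + 1):
        w = [b ^ 1 for b in bits[:k]] + bits[k:]
        if sum(w) * 2 == n:                  # as many 1s as 0s
            return k, w
    raise AssertionError("unreachable for even n")
```

Only log2(n) extra bits are needed to transmit k, which is why this scheme is so cheap in binary.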
And there are 7 trits so it's not as easy.
------------------
3B2T is very efficient, 4B3T is less dense but provides relatively easy and very short-term DC balance, which would be good for the analog front-end. There is a tension/compromise between coding efficiency (bandwidth usage) and BLW resilience...
BLW and code disparity are fortunately studied at length. Howard Johnson has an interesting analysis at https://sigcon.com/vault/pdf/7_09_addenda.pdf
https://imapsource.org/api/v1/articles/57229-line-coding-methods-for-high-speed-serial-links.pdf
But one thing I have not yet seen covered is "clipping". Clipping with a pair of diodes adds some non-linearity and some hysteresis, but it reduces the absolute excursion. Another trick is to use the midpoint tap of the transformer. Absolute levels don't seem to matter much; the amplitude and direction of the pulses count the most.
For now, the emphasis is on the simplicity of the analog front-end, where a good portion of the manufacturing complexity and cost lies. At this stage, even Manchester coding (like 10BASE-T) would be nice, though would it work at a higher speed (say, 30MHz), and how could this principle be applied to ternary coding?
-
Drift/Bias evaluation
05/25/2025 at 19:04 • 0 comments
The constellation has a nice property that has already been highlighted:
encoding:
bits   trits   weight
000    - 0     -    \____ NOR2
001    0 -     -    /
010    + +     ++   ---- NOR+AND
011    - -     --   ---- ANDN+AND
100    + 0     +    \____ ANDN
101    0 +     +    /
110    - +     0
111    + -     0
The net sum of the levels does not need a lot of gates to evaluate: the circuit takes about 4 gates.
![]()
Of course, let us not forget the activation/enable signal, coming from the circuit we have already designed in the last log 2. The "Same" circuit. Since 11x totals 0, we can simply OR the result on b2 and b1 as in the circuit below:
![]()
Now the goal is to evaluate the total bias of the encoded 20-bit word, for each of 8 "fumblings" of the 7 tribits. Initially I imagined an incrementer, but there is something much simpler: XOR each tribit with the output of a 3-bit counter. The winning count then gets encoded along with the others. The cost is one more layer of XOR2 at the input of the circuit:
![]()
And from there we can simply add a popcount7 for each of the 7 outputs and combine them into a 32-bit "weight". But even though it's already "done", we can simplify it a bit by noticing that neighbours can cancel each other. So let's introduce another new circuit: the reduction. To simplify it, I need to add a "zero" output to the bias decoder. And then things took a weird turn. Here is the new circuit that combines 2 tribits:
![]()
Now there is a big binary encoder and the 4 bits require about 7 gates of propagation. The output is a signed number so no need to process negative and positive values separately.
This circuit uses 52 gates to process 2 bitrits; it simply amounts to a 64×4 ROM, and it must be replicated 3.5×, so it's not very compact, and the rest of the adders require even more gates.
Furthermore, the running disparity must also be injected somehow: that's the 8th value to add, since there are 7 bitrits and the adder tree would otherwise be unbalanced. So the running disparity accumulator from the last word is added along with the 7th bitrit.
The circuit needs to be run (pipelined) 8 times, once for each possible counter value, while the serialiser outputs 8 bitrits (7 data, 1 counter). The phases could overlap, but the evaluation must be complete before serialising can start: it's a pipeline (eval, serialise) with 8 sub-cycles.
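The whole 8-phase search is easy to model in software. This sketch XORs every tribit with each counter value 0..7, looks the result up in the weight table given above, and keeps the counter that minimises the word's total bias, with the running disparity injected as the 8th addend; `best_rotation` is an illustrative name, not the circuit's.

```python
# Software model of the 8-phase bias search. WEIGHT maps each 3-bit
# tribit to the level-sum of its two trits, per the constellation table
# in the log (000 -> -1 ... 111 -> 0).

WEIGHT = {0b000: -1, 0b001: -1, 0b010: +2, 0b011: -2,
          0b100: +1, 0b101: +1, 0b110:  0, 0b111:  0}

def best_rotation(tribits, running_disparity=0):
    """Pick the 3-bit counter value that best balances the word."""
    def bias(c):
        return running_disparity + sum(WEIGHT[t ^ c] for t in tribits)
    return min(range(8), key=lambda c: abs(bias(c)))
```

The encoder would then serialise the 7 XORed tribits plus the winning counter value, which is exactly the 8-bitrit frame described above.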
....
Reducing the bias apparently requires a lot of effort. More than would be reasonable, probably.
Modern links rely on a scrambler to even things out; methods like 8b/10b have been out of fashion for a decade now.
Better DSP front-ends can digitally handle the droops and wanders... I can't afford that though.
Adding a bitrit expands the words to 16 bauds, to transmit 16 bits : the ternary recoding allows the 50% expansion. And there is still one unused bit.
I'd like to avoid the above circuit, but I know my AFE is lousy and would need some serious help. Is the expense/complexity/latency justified? Is there a simpler method?
-
The "Same" circuit
05/25/2025 at 17:26 • 0 comments
The symbol/BiTrit "S" (0,0) during a data word means "repeat the preceding biTrit". I have limited the number of repetitions of this meta-sequence to allow clock recovery and reduce droop, but I have found that limiting to 1 S is too hard on the circuit if it is parallelised. This matters because we want to be able to evaluate the "droop" of a whole word in parallel, and this info is important.
So we get 20 bits, extended to 21 with the LSB cleared for now, and compare pairs of tribits. The first tribit is not subject to substitution, so there are 6 such comparisons, each with 3×XOR2 and 1×NOR3. Here is the circuit:
![]()
So far, nothing weird.
The trick is to "length-limit" the sequences of Tx=1. Of the 64 possible cases, 20 have a "suppressed" bit:
000111 -> 00011x   001110 -> 0011x0   001111 -> 0011x1   010111 -> 01011x
011100 -> 011x00   011101 -> 011x01   011110 -> 011x10   011111 -> 011x11
100111 -> 10011x   101110 -> 1011x0   101111 -> 1011x1   110111 -> 11011x
111000 -> 11x000   111001 -> 11x001   111010 -> 11x010   111011 -> 11x011
111100 -> 11x100   111101 -> 11x101   111110 -> 11x110   111111 -> 11x11x
There is only one case where the suppression occurs twice; with a limit of a single "Same", this would be more complex.
With 2 consecutive "Same" bitrits allowed, the circuit is somewhat shorter:
![]()
Total depth: 7 gates.
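The substitution table above boils down to one rule: scanning left to right, a third consecutive "Same" flag is suppressed (forced to 0) and the run counter restarts. A quick software model to cross-check the table:

```python
# Run-length limiter for the "Same" flags: at most max_run consecutive
# 1s survive; the next 1 becomes the "x" (forced to 0) in the table
# above, and the run count restarts after a suppression.

def limit_same_runs(bits, max_run=2):
    out, run = [], 0
    for b in bits:
        if b and run == max_run:
            out.append(0)                # the suppressed "x"
            run = 0
        else:
            out.append(b)
            run = run + 1 if b else 0
    return out
```

Applying it to all 64 six-bit inputs reproduces the 20 substituted cases, including 111111 -> 11x11x, the only double suppression.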
See you in the next log.
-
Rotating constellations
05/25/2025 at 13:34 • 0 comments
Following the previous logs:
Today's concern is "optimising" a word's coding to reduce droop: there should be approximately as many -1 as +1 levels. This increases the number of bits per word, but we have one spare bit:
Words (so far) are 20 bits wide, and symbols represent 3 bits; 3×7=21, so we could select one encoding among two... What could this bit affect? It can't be a XOR of the input bits: the constellation seems to be quite symmetrical, so XORing the inputs would just flip the signal's polarity and nothing would be solved.
Let's go back to the current constellation:
![]()
The corresponding tables:
encoding:
bits   trits
000    - 0
001    0 -
010    + +
011    - -  \
100    + 0  /
101    0 +
110    - +
111    + -

decoding:
trits   pos   neg
- -     011   010
- 0     000   100
- +     110   111   <= requirement for polarity sense
0 -     001   101
0 +     101   001
+ -     111   110   <=
+ 0     100   000
+ +     010   011
As previously noted, the table is almost symmetrical, but not completely (and this is on purpose), so the polarity is handled at the output of the comparators, not at the table level.
- One possibility to affect the balance of output levels would be to "rotate" the constellation by 90°, as it preserves the Hamming distances between consecutive codes. But then it amounts to swapping the first and second trits, which does not change much of anything.
- The second possibility is to rotate by 45° (clockwise or anti, it's mostly the same overall, I think) so a second table is required. Expensive.
- The 3rd possibility would be to rotate the input bits of the table: one table, one weighing circuit (2×popcount), 6 input permutations... and several cycles to elect the best one, possibly also considering the past values, to reduce baseline wander. This is possible since it still operates rather "slowly" and there is no need to test all permutations in parallel.
6 permutations require a whole bitrit to encode, and 2 symbols are lost (out of 8 tribits).
- The 4th possibility is to "increment" all the tribits and test 8 cycles => a whole bitrit is used. This would help in cases where long strings of 0s or 1s are sent, or any repeated 3-bit pattern.
- The balancing system should account for the "Same" symbol, which outputs 00 when the tribit is the same as the previous one.
.
Some of the permutations/increments should have some symmetries and could be discarded, I guess.
Each tribit is expanded to four bits, 2 pairs that represent one trit each, with the encoding
00 => 0 10 => + 01 => -
=> It is possible to replicate a popcount circuit for the droop estimate: 20 bits => 14 trits => 4 bits per polarity suffice, and such a circuit has already been designed for the ParPop circuit. Well, the sum can vary from -14 to +14, so that's 5 bits total.
A pre-decoding table can be used, derived from the original encoding table.
encoding:
bits   trits   weight
000    - 0     -
001    0 -     -
010    + +     ++
011    - -     --
100    + 0     +
101    0 +     +
110    - +     0
111    + -     0
Nice, the LSB is almost unused. The 5 output codes need 3 bits, or 4 bits to encode them separately (-, +, -- and ++) to be processed by a less dense circuit.
Yann Guidon / YGDES