
miniPHY

the thing you need to plug to a miniMAC

(hopefully) simple, full-duplex coder/decoder, analog front-end and clock (re)generator for a twisted-pair transceiver, a module connected to the all-digital MAC in a different project.
It can come standalone for one pair of pairs (a 10/100 MagJack with 2 transformers), or duplicated for two pairs of pairs (a Gigabit MagJack with 4 transformers).

Spinoff from #Not an Ethernet Transceiver

I need to split the serdes/coder/AFE part from the scrambler/parity/buffer/FSM part to keep them manageable; there are so many aspects to handle at once... See earlier discussions at:

and much more...

 
-o-O-0-O-o-
 

Logs:
1. Rotating constellations
2. The "Same" circuit
3. Drift/Bias evaluation
4. Reverse antibias
5. 4B4T: An extended ternary Manchester code and its implications


  • 4B4T: An extended ternary Manchester code and its implications

    Yann Guidon / YGDES • 06/04/2025 at 23:56 • 0 comments

    So the last log re-emphasised the importance of BaseLine Wander (BLW) for the design of the AFE.

    Modern designs have sophisticated hyper-fast ADCs and perform complex DSP to compensate for many line effects, including droop and BLW. That is totally out of the realm of possibility here: the miniPHY must be very simple.

    On the other end of the spectrum, 10Mbps Ethernet uses Manchester code, which is very inefficient (2 baud/bit: each bit value is followed by its inverse). However it has a wonderful property: there is no room for BLW, as each code is "neutral" by definition.

    Hybrid_ternary_code has an intriguing and very simple encoding scheme with 1bit/baud. Not great, not bad, it's a baseline.

    The 3B2T code (9 symbols) is pretty efficient (a packing density of 1.5 bit/baud) but the balance/neutrality is data-dependent. Trying to preprocess the data to prevent unwanted patterns is hard and expensive... The hardware overhead is significant (it adds latency and bloats the circuit) and the resulting packing density is still unclear: adding one bitrit worth of information (8 symbols) to 7 bitrits would reduce BLW by "a certain amount" but it remains too data-dependent and has insufficient leverage. A 1/7th overhead (14%) cannot ensure DC balance in all cases.

    4B3T has a slightly worse density (1.3b/baud) but can ensure DC balance, using a 3-bit running disparity counter, a reasonably-sized LUT and a pretty simple decoder. It links consecutive words/nibbles but it looks like it's the smallest such scheme, simpler than the 2-LUT 8b/10b system.
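
    To give the flavour of that principle, here is a generic little Python sketch of disparity-controlled selection (toy codewords, not the actual 4B3T/MMS43 tables): each data value has two candidate codewords of opposite weight, and the running disparity counter picks whichever pulls it back toward zero.

    def send(word, candidates, disparity):
        """candidates[word] = (codeword, inverted codeword); pick the one
        whose weight pulls the running disparity back toward zero."""
        a, b = candidates[word]
        code = a if abs(disparity + sum(a)) <= abs(disparity + sum(b)) else b
        return code, disparity + sum(code)

    # Toy alphabet: two 3-trit codewords per 2-bit value (made up, for illustration)
    CAND = {0b00: ((+1, -1, 0), (-1, +1, 0)),
            0b01: ((+1, 0, +1), (-1, 0, -1)),
            0b10: ((0, +1, +1), (0, -1, -1)),
            0b11: ((+1, +1, -1), (-1, -1, +1))}

    disparity = 0
    for word in (0b01, 0b01, 0b01, 0b10):      # a deliberately biased stream
        code, disparity = send(word, CAND, disparity)
        print(word, code, disparity)           # the disparity stays within a few units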

    Some interesting analysis can be found at Block_Coding_with_4B3T_Codes

    Let's say "it's interesting"...

    ...

    But what if we don't want to link nibbles? We end up needing a scheme where all the codes are DC balanced, just like Manchester. HTC (see above) also has to link consecutive bits to work. In ternary, we can get the equivalent of Manchester with a triplet of codes: +- / -+ / 00. But then long runs of 0s must be prevented. So it's basically Manchester (2 baud/bit) with an added "S" code.

    Going to 3 trits, we get 6 non-null codes: +0- / -0+ / 0+- / 0-+ / +-0 / -+0, which amounts to about 2.5 bits per 3 trits. Not great.

    Four trits get interesting, though: 9 non-zero invertible codes (18 total) give something like 4B4T:

    00+-  +00-  +-00  /  00-+  -00+  -+00
    0+0-  0+-0  +0-0  /  0-0+  0-+0  -0+0
    ++--  +--+  +-+-  /  --++  -++-  -+-+

    This gives 16 data codes, 2 control codes and one "quiet/silent/same" marker. This almost looks like something!
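
    As a quick sanity check, a minimal Python sketch (not project code) that enumerates the DC-neutral 4-trit codes confirms the count: 18 non-zero balanced codes plus the all-zero marker.

    from itertools import product

    # Every 4-trit word whose levels sum to zero is DC-neutral by construction
    neutral = [w for w in product((-1, 0, +1), repeat=4) if sum(w) == 0]
    nonzero = [w for w in neutral if any(w)]

    print(len(neutral), len(nonzero))   # 19 and 18: 16 data + 2 control + 1 "same"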

    Packing-wise, it needs 33% more bauds than 4B3T, so the data bandwidth drops by 25%. It is stateless, though, and the LUT is smaller.

    But compared to HTC, the density is almost the same: 1 bit/baud! The control codes are nice but not a significant bandwidth concern, and HTC is way simpler.

    I intend the miniPHY to have various (incompatible) versions, so it is good to start with the simplest possible code. HTC does not have a "Same/Silence" code, though, which helps with the signalling and the protocol, so let's skip HTC.

    So the development course would be:

    1. Start with 4B4T: simple/easy/low-bandwidth, implementable as either 2-send/2-receive or 1-send/3-receive if bandwidth matters, and see how it works in practice.
    2. Increase the bandwidth usage with 4B3T, as a simple upgrade on the FPGA side
    3. Meanwhile, see if I can figure out a balancing scheme to retrofit into 3B2T with a smaller overhead than 4B3T.

    This whole analysis has established a lower bound on the coding overhead required for DC balance. Looking at 8B6T, it does not seem that this packing ratio can easily be improved.

    From there, if a line frequency of 30MHz cannot be exceeded, and 1MHz corresponds to 2Mbaud,

    1. 4B4T will bring about 60Mbit/s per lane (hypothetical and unlikely; let's say 20Mbps in practice)
    2. 4B3T increases this to 75Mbps (OK, let's say 25 or 30Mbps)
    3. 3B2T could reach 80Mbps (25 to 33Mbps in good conditions)

    The cool thing with a custom miniPHY is that the clock frequency could be adjusted according to the line's characteristics (length, capacitance...) and we could add lanes...


  • Reverse antibias

    Yann Guidon / YGDES • 06/04/2025 at 01:23 • 0 comments

    The last log has shown that the running disparity of a whole word can be computed in parallel, but at a high cost.

    Wouldn't it be better to compute only one word's disparity and then deduce the correction?

    That's what the previous systems (NRZ and MLT3) enabled, with simple parity as well as mod4. Yet it was still not satisfying.

    The baseline wander can be attributed to a "random walk" with no limit on the excursion, and limiting it requires extra coding, which I'd like to minimise. This is the territory of 4B3T and its cousin 8B/6T, with a very short-range disparity and very low excursion, hence high overhead. I'd like to keep it at or below 3 bits/8 codes per 20-bit word, so the idea of tweaking the data from the source is pretty interesting.
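
    To illustrate the random-walk point, here is a toy Python simulation (nothing more): the peak excursion of the running disparity of unconstrained random ternary symbols keeps growing with the length of the stream, roughly as its square root.

    import random

    random.seed(1)
    for n in (1_000, 10_000, 100_000):
        disparity, peak = 0, 0
        for _ in range(n):
            disparity += random.choice((-1, 0, +1))   # uncoded, unbalanced symbol
            peak = max(peak, abs(disparity))
        print(n, peak)   # the excursion keeps growing: no bound without coding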

    I don't know why, but what I imagine right now is the bias evaluation starting from the middle of the word, going in both directions, seeing how the wander evolves, then at 1/4, 1/2 and 3/4, "swapping" something to invert the bias slope. Thus the disparity counter can reach higher values but clumps can be broken up. I think. Aaaand it looks a bit (from afar) like Knuth's idea (D.E. Knuth, "Efficient Balanced Codes", IEEE Transactions on Information Theory, vol. IT-32, no. 1, January 1986).
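
    For reference, Knuth's trick in its original binary form looks like this (a toy Python sketch, assuming an even-length word; the prefix length k would be transmitted separately):

    def knuth_balance(bits):
        """Invert the first k bits so the word gets as many 1s as 0s.
        Such a k always exists for an even-length word (Knuth 1986)."""
        n = len(bits)
        for k in range(n + 1):
            candidate = [1 - b for b in bits[:k]] + list(bits[k:])
            if sum(candidate) == n // 2:          # zero disparity reached
                return k, candidate
        raise ValueError("word length must be even")

    print(knuth_balance([1, 1, 1, 1, 0, 1, 1, 1]))   # -> (3, [0, 0, 0, 1, 0, 1, 1, 1])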

    And there are 7 trits so it's not as easy.

    ------------------

    3B2T is very efficient; 4B3T is less dense but provides relatively easy and very short-term DC balance, which would be good for the analog front-end. There is a tension/compromise between coding efficiency (bandwidth usage) and BLW resilience...

    BLW and code disparity have fortunately been studied at length. Howard Johnson has an interesting analysis at https://sigcon.com/vault/pdf/7_09_addenda.pdf

    https://imapsource.org/api/v1/articles/57229-line-coding-methods-for-high-speed-serial-links.pdf

    But one thing I have not yet seen covered is "clipping". Clipping with a pair of diodes adds some non-linearity and some hysteresis, but it reduces the absolute excursion. Another trick is to use the midpoint tap of the transformer. Absolute levels don't seem to matter much; the amplitude and direction of the pulses count the most.

    For now, the emphasis is on the simplicity of the analog front-end, where a good portion of the manufacturing complexity and cost lies. In fact, at this stage, even Manchester coding (as in 10Base-T) would be nice, but would it work at a higher speed (say, 30MHz), and how could this principle be applied to ternary coding?

  • Drift/Bias evaluation

    Yann Guidon / YGDES • 05/25/2025 at 19:04 • 0 comments

    The constellation has a nice property that has already been highlighted:

    encoding:
    bits  trits  weight
    000    - 0   -  \____NOR2
    001    0 -   -  /
    010    + +   ++ ---NOR+AND
    011    - -   -- ---ANDN+AND
    100    + 0   +  \____ANDN
    101    0 +   +  /
    110    - +   0
    111    + -   0

    The net sum of the levels does not need a lot of gates to evaluate: the circuit takes about 4 gates.
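
    As a cross-check, a small Python model of the truth table (not the actual gates) confirms that the four annotated expressions reproduce the weight column, assuming the bit order b2 b1 b0 as written above.

    # Encoding table copied from above: 3 bits -> (first trit, second trit)
    enc = {0b000: (-1, 0), 0b001: (0, -1), 0b010: (+1, +1), 0b011: (-1, -1),
           0b100: (+1, 0), 0b101: (0, +1), 0b110: (-1, +1), 0b111: (+1, -1)}

    for bits, trits in enc.items():
        b2, b1, b0 = (bits >> 2) & 1, (bits >> 1) & 1, bits & 1
        minus1 = (not b2) and (not b1)        # NOR2        -> weight -1 (000, 001)
        plus1  = b2 and (not b1)              # ANDN        -> weight +1 (100, 101)
        plus2  = b1 and not (b2 or b0)        # NOR + AND   -> weight +2 (010)
        minus2 = (not b2) and b1 and b0       # ANDN + AND  -> weight -2 (011)
        assert sum(trits) == -minus1 + plus1 + 2 * plus2 - 2 * minus2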

    Of course, let us not forget the activation/enable signal, coming from the circuit already designed in the previous log (2. The "Same" circuit). Since 11x totals 0, we can just OR the result onto b2 and b1, as in the circuit below:

    Now the goal is to evaluate the total bias of the encoded 20-bit word, for each of the 8 "fumblings" of the 7 tribits. Initially I imagined an incrementer, but there is something much simpler: XOR each tribit with the output of a 3-bit counter. The winning counter value then gets encoded along with the others. The cost is one more layer of XOR2 gates at the input of the circuit:
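
    Here is a rough Python model of that selection step (it ignores the "Same" substitution and the control codes, and the names are made up):

    # Weight of each tribit, from the encoding table above
    WEIGHT = {0b000: -1, 0b001: -1, 0b010: +2, 0b011: -2,
              0b100: +1, 0b101: +1, 0b110:  0, 0b111:  0}

    def best_fumbling(tribits, running_disparity=0):
        """tribits: the 7 3-bit groups of a word. Try the 8 counter values,
        XOR each of them into every tribit, keep the one with the smallest bias."""
        best = None
        for ctr in range(8):
            bias = running_disparity + sum(WEIGHT[t ^ ctr] for t in tribits)
            if best is None or abs(bias) < abs(best[1]):
                best = (ctr, bias)
        return best            # (counter value to transmit, resulting disparity)

    # An all-zero data word would give a bias of -7 uncorrected; XORing with 110
    # turns every tribit into a zero-weight code:
    print(best_fumbling([0b000] * 7))    # -> (6, 0)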

    And from there we can simply add a popcount7 for each of the 7 outputs and combine them into a 32-bit "weight". But even though it's already "done", we can simplify it a bit by noticing that neighbours can cancel each other. So let's introduce another new circuit: the reduction. To simplify it, I need to add a "zero" output to the bias decoder. And then, things took a weird turn. Here is the new circuit that combines 2 tribits:

    Now there is a big binary encoder and the 4 bits require about 7 gates of propagation. The output is a signed number so no need to process negative and positive values separately.

    This circuit uses 52 gates to process 2 bitrits; it simply amounts to a 64×4 ROM. It must be replicated 3.5×, so it's not very compact, and the rest of the adders require even more gates.

    Furthermore, the running disparity must also be injected somehow: it becomes the 8th value to add, since there are only 7 bitrits and the adder tree would otherwise be unbalanced. So the running disparity accumulator from the last word is added along with the 7th bitrit.

    The circuit needs to be run (pipelined) 8 times, once for each of the 8 possible counter values, while the serialiser outputs 8 bitrits (7 data, 1 counter). The phases could overlap, but the evaluation must be complete before serialising can start: it's a pipeline (eval, serialise) with 8 sub-cycles.

    ....

    Reducing the bias apparently requires a lot of effort. More than would be reasonable, probably.

    Modern links rely on a scrambler to even things out; methods like 8b/10b have been out of fashion for a decade now.

    Better DSP front-ends can digitally handle the droops and wanders... I can't afford that though.

    Adding a bitrit expands the words to 16 bauds, to transmit 16 bits: the ternary recoding allows the 50% expansion. And there is still one unused bit.

    I'd like to avoid the above circuit, but I know my AFE is lousy and will need some serious help. Is the expense/complexity/latency justified? Is there a simpler method?

  • The "Same" circuit

    Yann Guidon / YGDES • 05/25/2025 at 17:26 • 0 comments

    The symbol/BiTrit "S" (0,0) during a data word means "repeat the preceding BiTrit". I have limited the number of repetitions of this meta-sequence to allow clock recovery and reduce droop, but I have found that limiting it to a single S is too hard on the circuit if it is parallelised. This matters because we want to be able to evaluate the "droop" of a whole word in parallel.

    So we get 20 bits, extended to 21 with the LSB cleared for now, and compare pairs of tribits. The first tribit is not subject to substitution, so there are 6 such comparisons, each with 3×XOR2 and 1×NOR3. Here is the circuit:

    So far, nothing weird.

    The trick is to "length limit" the sequences of Tx=1. Of the 64 possible cases, 20 have a "suppressed" bit:

    000111 00011x
    001110 0011x0
    001111 0011x1
    010111 01011x
    011100 011x00
    011101 011x01
    011110 011x10
    011111 011x11
    100111 10011x
    101110 1011x0
    101111 1011x1
    110111 11011x
    111000 11x000
    111001 11x001
    111010 11x010
    111011 11x011
    111100 11x100
    111101 11x101
    111110 11x110
    111111 11x11x

    There is only one case where the suppression occurs twice; it would be more complex if only 1 Same were allowed.
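
    Here is a little behavioural sketch in Python (not the gate-level circuit) of the rule behind the table: a Tx flag is suppressed when the two previously kept flags are already 1, so at most two consecutive "Same" bitrits get emitted. It finds the same 20 cases, with 111111 as the only double suppression.

    def limit(flags):                       # 6 Tx flags, left to right as printed
        kept = []
        for f in flags:
            if f and len(kept) >= 2 and kept[-1] and kept[-2]:
                kept.append(0)              # a suppressed bit ("x" in the table)
            else:
                kept.append(f)
        return kept

    suppressed = 0
    for n in range(64):
        flags = [(n >> (5 - i)) & 1 for i in range(6)]
        if limit(flags) != flags:
            suppressed += 1
    print(suppressed)                       # 20 of the 64 cases, as in the table
    print(limit([1, 1, 1, 1, 1, 1]))        # -> [1, 1, 0, 1, 1, 0], i.e. 11x11x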

    With 2 consecutive "Same" bitrits, the circuit is somewhat shorter:

    Total depth: 7 gates.

    See you in the next log.

  • Rotating constellations

    Yann Guidon / YGDES • 05/25/2025 at 13:34 • 0 comments

    Following the logs

    Today's concern is "optimising" a word's coding to reduce droop: there should be approximately as many -1 as +1 levels. This increases the number of bits per word, but we have one spare bit available:

    Words (so far) are 20 bits wide and symbols represent 3 bits each, 3×7=21, so we could use this extra bit to select one encoding among two... What could this bit affect? It can't be a simple XOR of the input bits, since the constellation is quite symmetrical: XORing the inputs would just flip the signal's polarity and solve nothing.

    Let's go back to the current constellation:

    The corresponding tables:

    encoding:
    bits  trits
    000    - 0
    001    0 -
    010    + +
    011    - -  \
    100    + 0  /
    101    0 +
    110    - +
    111    + -
    
    decoding:
    trits   pos  neg
     - -    011  010
     - 0    000  100
     - +    110  111   <= requirement for polarity sense
     0 -    001  101
     0 +    101  001
     + -    111  110   <=
     + 0    100  000
     + +    010  011

    As previously noted, the table is almost symmetrical, but not completely (and this is on purpose), so the polarity is handled at the output of the comparators, not at the table level.

    • One possibility to affect the balance of output levels would be to "rotate" the constellation by 90°, as this would preserve the Hamming distances between consecutive codes. But it then amounts to swapping the first and second trits, which does not change much of anything.
    • The second possibility is to rotate by 45° (clockwise or anticlockwise, it's mostly the same overall, I think), so a second table is required. Expensive.
    • The 3rd possibility would be to rotate the input bits of the table: one table, one weighting circuit (2×popcount), 6 input permutations... and several cycles to elect the best one, possibly also considering the past values, to reduce baseline wander. This is possible since the system still operates rather "slowly" and there is no need to test all permutations in parallel.

    6 permutations require a whole bitrit to encode, and 2 of its 8 possible symbols are lost.

    • The 4th possibility is to "increment" all the tribits and test 8 cycles => a whole bitrit is used. This would help in the cases where long strings of 0s or 1s are sent, or any repeated 3-bit pattern.
    • The balancing system should account for the "Same" symbol, which outputs 00 when the tribit is the same as the previous one.


    Some of the permutations/increments should have some symmetries and could be discarded, I guess.

    Each tribit is expanded to four bits, 2 pairs that represent one trit each, with the encoding

    00 => 0
    10 => +
    01 => -

    => It is possible to replicate a popcount circuit for the droop estimate: 20 bits => 14 trits => 4 bits per polarity suffice, and such a circuit has already been designed for the ParPop circuit. Well, the sum can vary from -14 to +14, so that's 5 bits total.
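
    A quick Python model of that estimate (made-up names, not the ParPop circuit itself): with each trit carried as a (+,-) wire pair, the word bias is simply the popcount of the "+" wires minus the popcount of the "-" wires.

    TRIT_PAIR = {0: (0, 0), +1: (1, 0), -1: (0, 1)}   # 00 => 0, 10 => +, 01 => -

    def word_bias(trits):          # 14 trits for a 20-bit word
        plus  = sum(TRIT_PAIR[t][0] for t in trits)   # popcount of the "+" wires
        minus = sum(TRIT_PAIR[t][1] for t in trits)   # popcount of the "-" wires
        return plus - minus        # ranges from -14 to +14: 5 signed bits

    print(word_bias([+1, +1, 0, -1] * 3 + [+1, 0]))   # 14 trits -> bias +4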

    A pre-decoding table can be used, derived from the original encoding table.

    encoding:
    bits  trits  weight
    000    - 0   -
    001    0 -   -
    010    + +   ++
    011    - -   --
    100    + 0   +
    101    0 +   +
    110    - +   0
    111    + -   0

    Nice, the LSB is almost unused. The 5 output weights need 3 bits, or 4 bits if encoded separately (-, +, -- and ++) to be processed by a less dense circuit.
