A Bit of Hindsight: Part 2: Byte Lanes vs. Width Hierarchies

A project log for Backbone Bus

Backbone is my proposal for an off-chip, Wishbone-inspired backplane interconnect that supports multiple bus masters.

Samuel A. Falvo II — 05/31/2016 at 18:35

Backbone, as it's currently defined, is basically Wishbone exposed to the world. It's an almost purpose-built bus interface just for the Kestrel-3's hardware development as I work towards a single-board version of the computer. Its mission, and thus its criteria for success, are:

  1. It lets me explore different pieces of the Kestrel-3 in isolation from other components. With an SBC, this is not possible; I'd have to refab the entire board if I changed even just one circuit.
  2. It lets me explore bus architecture design. This is already a resounding success; I don't even have a board fabbed yet, and have already identified two things I would do differently next time I need a parallel bus. I've already documented one of these things in the previous log; this log is devoted to the second.

One characteristic of the Wishbone bus is that, per the specification, wide interfaces must be qualified with one or more select signals; these select signals serve the same role as BEx on Intel CPUs, DSx on 68K CPUs, etc. SEL0, when asserted, means that valid data appears on DAT0-DAT7; SEL1 means data appears on DAT8-DAT15, and so on. (All assuming an 8-bit granular interface, of course.) This also implies that the address bus is split into two parts: ADR0..ADRx is literally hidden from the outside world, since it is combined with the desired transfer size to calculate the proper SEL line settings, while ADRx+1..ADRy (where y is your highest address bit; typically 15, 31, or 63 for 16-, 32-, or 64-bit address spaces) is exposed as usual. More concretely, a 64-bit wide, 8-bit granular bus will not expose A0, A1, or A2, since these bits determine which one of SEL0 through SEL7 is asserted for byte transfers, which pair is asserted for half-word transfers, etc.
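The SEL derivation described above can be sketched in a few lines. This is a minimal illustration (the function name is mine, not anything from the Wishbone spec), assuming a 64-bit port with 8-bit granularity:

```python
# Sketch: how a Wishbone-style master derives its SEL lines for a 64-bit,
# 8-bit-granular port.  The low three address bits (A0-A2) never leave the
# master; they only pick which byte lanes get selected.

def sel_lines(addr: int, size: int) -> int:
    """Return an 8-bit SEL mask for a naturally aligned `size`-byte
    transfer at `addr`.  Bit n of the result asserts SELn, which
    qualifies data lanes DAT(8n)..DAT(8n+7)."""
    assert size in (1, 2, 4, 8) and addr % size == 0, "unaligned transfer"
    lane = addr & 0b111             # the hidden low bits A0-A2
    return ((1 << size) - 1) << lane
```

A byte access to address 5 asserts only SEL5, while a half-word access to address 6 asserts SEL6 and SEL7 together.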

This is a great optimization if you're addressing memory. Memory is inherently amenable to such a row/column decomposition of the address space, so it makes perfect sense. The problem is that literally everything else you'd ever want to talk to on the bus is not so amenable.

Consider the KIA, which I introduced first for the Kestrel-2, which also used a Wishbone bus. Its registers are only 8 bits wide, and the core has only a single address input. You'd expect its registers to appear at KIA+0 and KIA+1; however, this is a mistake. Because A0 is not exposed to the world, it does not participate in address decoding. Instead, A1 is attached (the Kestrel-2 is a 16-bit CPU and bus system), which means its registers are actually located at KIA+0 and KIA+2. So what appears at KIA+1 and KIA+3? Nothing. If the KIA had writable control registers and you attempted to write to those locations, you'd run the real risk of loading garbage into those control registers, since the state of the byte lanes those registers listen to would be completely undefined.
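The aliasing effect can be made concrete with a toy decoder. The base address below is hypothetical; the point is only that the core's single address input is wired to A1, not A0:

```python
# Illustration (hypothetical base address): on a 16-bit bus that hides A0,
# a one-address-bit core like the KIA sees A1 as its address input, so its
# two registers land at KIA+0 and KIA+2, not KIA+0 and KIA+1.

KIA_BASE = 0xFF00   # made-up base address for illustration

def kia_register(cpu_addr: int):
    """Which KIA register, if any, a CPU address decodes to."""
    offset = cpu_addr - KIA_BASE
    if offset not in range(4):
        return None                 # outside the KIA's decode window
    if offset & 1:
        return "unmapped"           # odd addresses hit no register at all
    return f"register {offset >> 1}"  # A1 drives the core's address input
```

Writing to one of the "unmapped" odd addresses is exactly the garbage-load hazard described above: no register is selected on the lane the CPU drives.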

A much better approach is to use High Enables instead. Instead of a linear decomposition of the bus lanes (where a 64-bit bus has 8 lanes of 8 bits each), a logarithmic decomposition is used (a 64-bit bus has one 32-bit high word, one 16-bit high half-word, one 8-bit high byte, and one low byte). Such a bus allows 8-bit devices to focus just on D0-D7 without any concern for which byte lane to attach to, 16-bit devices on D0-D15, and so forth.

It is also naturally supportive of upward compatibility. To illustrate, let's start with a simple nybble-wide bus.


Pretty simple; it allows us to read or write any nybble in a 16-nybble address space. We can expand the address space easily by just tacking on more address bits: this doesn't affect old hardware, since it just ignores the upper address bits.


But, if we now want to address bytes, we need to tack on another set of data bits. The CPU would tell the addressed peripheral that it wants to transfer a full byte by using a "Nybble High Enable" (NHE) control signal.


We need to know if D0-D3 or if D0-D7 are valid. That's the purpose of NHE, and it behaves like so:

A0   NHE   D0-D3        D4-D7
0    0     Nybble A     --
0    1     Nybble A     Nybble A+1
1    0     Nybble A+1   --
1    1     Impossible condition.

If NHE is negated, then A0-A7 determines what value appears on D0-D3, just like on the old 4-bit bus. But if NHE is asserted, then A1-A7 (NOTE! A0 not involved!) determines which byte to read from or write to. A0 will always be zero, since that makes the address byte-aligned. Accessing data with both NHE and A0 set would be an alignment violation.
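The truth table above can be modeled directly. This is a sketch under the same conventions as the table: A is the byte-aligned base address, and `None` marks a lane no device drives:

```python
# Sketch of the NHE truth table: given A0 and NHE, which nybble addresses
# appear on D0-D3 and D4-D7.  `a` is the byte-aligned base address "A"
# from the table; None means the lane is idle.

def nhe_decode(a0: int, nhe: int, a: int):
    """Return the (D0-D3, D4-D7) sources for one bus transaction."""
    if a0 == 0 and nhe == 0:
        return (a, None)        # single nybble, low lane only
    if a0 == 0 and nhe == 1:
        return (a, a + 1)       # full byte: both lanes carry data
    if a0 == 1 and nhe == 0:
        return (a + 1, None)    # odd nybble still travels on D0-D3
    raise ValueError("A0=1 with NHE asserted: alignment violation")
```

Note that the odd nybble appears on D0-D3, not D4-D7: a narrow device never has to care which lane it sits on, which is the whole point of the scheme.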

This can be expanded upwards to support a 16-bit bus as well, and it can be done in a completely backward compatible manner:

A1   A0   BHE   NHE   D0-D3        D4-D7        D8-D15
0    0    0     0     Nybble A     --           --
0    1    0     0     Nybble A+1   --           --
1    0    0     0     Nybble A+2   --           --
1    1    0     0     Nybble A+3   --           --
0    0    0     1     Nybble A     Nybble A+1   --
0    1    0     1     Impossible condition.
1    0    0     1     Nybble A+2   Nybble A+3   --
1    1    0     1     Impossible condition.
-    -    1     0     Impossible condition.
0    0    1     1     Nybble A     Nybble A+1   Byte A+2
0    1    1     1     Impossible condition.
1    -    1     1     Impossible condition.

Trivia: why must BHE and NHE be asserted at the same time? Because all byte accesses are also nybble accesses. Likewise, all 16-bit word accesses are also byte and nybble accesses. NHE needs to be asserted because hardware unaware of BHE will not know to drive D4-D7 during a byte- or word-sized transaction.
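The full 16-bit table, including its impossible conditions, can be captured in one small decoder. This is my own sketch, not anything from a spec; addresses are in nybbles, and the third result is the nybble address where the byte on D8-D15 begins:

```python
# Sketch of the combined BHE/NHE table for the 16-bit bus.  Returns the
# nybble address sourced on each lane group, or None for an idle lane.

def bus16_decode(a1: int, a0: int, bhe: int, nhe: int, a: int):
    """Return (D0-D3, D4-D7, D8-D15 start) for a transfer at base `a`."""
    if bhe and not nhe:
        raise ValueError("BHE without NHE: a byte access is also a nybble access")
    base = a + 2 * a1 + a0          # nybble actually being addressed
    if not nhe:
        return (base, None, None)   # plain nybble transfer
    if a0:
        raise ValueError("alignment violation: NHE asserted with A0 set")
    if not bhe:
        return (base, base + 1, None)       # byte transfer
    if a1:
        raise ValueError("alignment violation: BHE asserted with A1 set")
    return (base, base + 1, base + 2)       # full 16-bit word
```

Every "Impossible condition." row in the table corresponds to one of the three raised errors; everything else falls out of the natural-alignment rule.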

And this keeps scaling up and up. I used nybbles to illustrate in a more or less convenient way, but in the real world, you'd typically use Byte Enables instead of Nybble Enables. If you just widen everything by 4 bits above, you'll notice that we described a 32-bit bus with the same number of total signals as a byte-lane type bus, but which retains full backward compatibility with a simple 8-bit bus.

Once you go beyond 32-bits, though, this is where the savings come in big. To widen the bus to 64 bits, you need one new high-enable, and another 32-bit data lane. Let me repeat that: you have a total of three high enables, not eight like you'd have with a typical laned bus. For a 128-bit bus, you'll add a 64-bit data lane, and one more high enable. If we compare bus data and lane select bits, we see the following trend (assuming a 64KB address space; add pins as needed):

Laned (SELx) bus:

Data bits   8    16    32    64    128
Addr bits   16   15    14    13    12
SEL bits    0    2     4     8     16
Totals      24   33    50    85    156

High-enable bus:

Data bits   8    16    32    64    128
Addr bits   16   16    16    16    16
HE bits     0    1     2     3     4
Totals      24   33    50    83    148

In the worst case, you're at parity with the number of signals you need to route; in the best case, the savings are potentially quite large.
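The pin counts in the tables follow directly from the bus width. This sketch reproduces them, assuming the same 64KB address space and 8-bit granularity used above:

```python
# Reproducing the pin-count comparison: a laned bus hides log2(lanes) low
# address bits but pays one SEL line per lane; a high-enable bus keeps the
# full address but only pays log2(lanes) high-enable lines.

def laned_pins(width: int) -> int:
    """Total data + address + SEL signals for a laned bus of `width` bits."""
    lanes = width // 8
    hidden = (lanes - 1).bit_length()    # low address bits not routed
    sel = lanes if lanes > 1 else 0      # an 8-bit bus needs no SEL lines
    return width + (16 - hidden) + sel

def high_enable_pins(width: int) -> int:
    """Total data + address + HE signals for a high-enable bus."""
    he = ((width // 8) - 1).bit_length()  # one HE per doubling past 8 bits
    return width + 16 + he
```

The gap opens at 64 bits (85 vs. 83) and widens from there (156 vs. 148 at 128 bits), matching the tables.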

In terms of compatibility, you can certainly make something like a packed KIA address layout work with a laned bus too; but, the target hardware has to be aware of the bus architecture for this to work right. In the worst case, you'd basically need a new hardware spin with each widening of the bus (except in those cases where the base address remains naturally aligned with the bus word size). In the best possible case, you need a "bus bridge" to perform lane management on behalf of the older peripheral hardware. You'll need to recover lower address bits based on received SEL lines, and that assumes no illegal bit patterns!
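The lane-management job such a bridge would do can be sketched as the inverse of SEL generation. This is my own illustration for a 64-bit laned bus; note how many SEL patterns it must reject as illegal:

```python
# Sketch of a bus-bridge helper: recover the hidden low address bits and
# transfer size from an 8-bit SEL mask on a 64-bit laned bus.  Legal masks
# are contiguous, naturally aligned runs of 1, 2, 4, or 8 asserted lines.

def recover_low_addr(sel: int):
    """Return (low_addr, size_bytes) encoded by SEL; raise if illegal."""
    if sel == 0:
        raise ValueError("no lanes selected")
    low = (sel & -sel).bit_length() - 1   # index of lowest asserted SEL
    size = bin(sel).count("1")            # number of asserted lanes
    if sel != ((1 << size) - 1) << low or size not in (1, 2, 4, 8) or low % size:
        raise ValueError(f"illegal SEL pattern {sel:#010b}")
    return low, size
```

Of the 256 possible SEL patterns, only 15 are legal here; everything else is the "illegal bit patterns" hazard mentioned above, which the bridge has to detect or silently misdecode.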

All in all, using a logarithmic bus decomposition with high-enables seems to offer a ton of advantages over a flatly decomposed lane-based bus. Probably about the only time a laned bus will demonstrate any superiority is in those cases where the bus controller write-combines non-adjacent transactions. Except for video controllers, I can't think of any time you'd want to do this. Maybe I'm wrong though.

EDIT: Looking at the tables above, it's clear to me now why the Wishbone B4 spec limits the port size to 64 bits.