You might have guessed already that this post is not directly related to the main subject of this project but I'm not feeling (yet) the need to spin it off to a new separate project. So here it is: I think I have solved the problem with the "token ring".
A "ring" has been chosen for many reasons:
- it is directly scalable: just plug in more units and they will deal with addressing themselves.
- a ring is probably the most routable topology, giving the most freedom to the P&R software of the FPGA. With manual placement, a Hamiltonian path can be built.
- There might be 2 separate rings because each is synchronous to the node's clock, and some nodes might run faster with the use of the Math block.
- Latency is not critical in this application so having 1000 nodes in series does not change much. At a supposed speed of 400MHz, a chain of 1000 nodes is scanned at 400kHz, or every 2.5µs. Good enough.
- I need the fewest wires, gates and control signals. The system must be really lightweight, because a gate used for interco is wasted for raw computations. Saving 1 gate will allow one more compute node to be instantiated.
- The same wires will transport the commands from the host controller, and the results of the jobs.
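The latency figures above can be checked quickly. This is only a sketch: it assumes each node adds exactly one register stage (one clock cycle) to the chain, and the function names are mine, not part of the design.

```python
def scan_period_us(nodes: int, clock_hz: float) -> float:
    """Time for one word to cross `nodes` stages, in microseconds.

    Assumes one clock cycle per node (one register stage each).
    """
    return nodes / clock_hz * 1e6


def scan_rate_khz(nodes: int, clock_hz: float) -> float:
    """How often the whole chain is scanned, in kHz (same assumption)."""
    return clock_hz / nodes / 1e3


print(scan_period_us(1000, 400e6))  # period of one full scan, in µs
print(scan_rate_khz(1000, 400e6))   # scan rate, in kHz
```

Doubling the number of nodes simply doubles the scan period, which is why the node count matters so little here.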
Now, if the basic idea of a ring bus or a token ring interco sounds simple and easy, the reality is much less so. In particular, classic ring networks use "slots" of fixed sizes so a data packet hops on the ring when one slot is free. For our jobs, this does not work as well because there are 2 sizes of packets:
- The "write" packets that load a new value for computation on a given node:
  - w bits of address (plus a command word for extra features if needed, such as reset or force read),
  - w bits of value.
- The "read" packets are sent by the node when the job succeeded:
  - w bits of address (plus some extra data if needed),
  - w bits for the resulting X value,
  - 2w bits for the cycle counter.
Let's say we have a datapath with w bits: a write packet then takes 2 words and a read packet takes 4, so there are already 6 types of words, plus the empty word. The ring is also synchronous to the main compute clock to keep things simple. And to keep up with 400MHz, the nodes HAVE to be simple.
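To make the word count concrete, here is a small sketch that cuts both packet kinds into w-bit words. The width `W = 16`, the function names and the LSB-before-MSB order of the counter are assumptions for illustration only.

```python
W = 16                 # assumed datapath width; the post leaves w open
MASK = (1 << W) - 1


def write_packet(address, xinit):
    """2 words: the address (or command), then the initial X value."""
    return [("write address", address & MASK), ("write Xinit", xinit & MASK)]


def read_packet(address, xres, cycles):
    """4 words: address, result, then the 2w-bit counter split in two."""
    return [
        ("read address", address & MASK),
        ("read Xres", xres & MASK),
        ("read counter LSB", cycles & MASK),          # assumed order: LSB first
        ("read counter MSB", (cycles >> W) & MASK),
    ]


kinds = {kind for kind, _ in write_packet(0, 0) + read_packet(0, 0, 0)}
print(len(kinds))  # number of distinct word kinds, not counting the empty word
```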
The challenge is to insert the words in the shift register of the ring without any risk of creating collisions. With a naive scheme, a sender must look at up to 4 previous units to see if they request to send data. This is not desirable and this has left me scratching my head for a long while...
The first inspiration was the token ring system but this couldn't work as is. The idea sounds good however and the trick is to adapt how the "token" is managed, to make sure there are no collisions. Anyway the principle remains the same: whichever node "has" the token can insert any stream of data on the ring.
The ring is in fact not really one, it's open (a bit like a JTAG scan chain):
- on one end, the host enqueues "write" commands,
- on the other end, the host reads a stream of "read" packets.
So the classic "token ring" system must be adapted. The behaviour, nature and algorithm of the token are changed a bit...
- The host enqueues requests (2 words) on the "ring" (going through a RAM buffer because messages must have consecutive words and the host might be slow)
- The host terminates the string of requests with a "token" packet, then fills the "ring" with empty packets.
- Each node looks for an "address write" packet and compares the address with its own (constant) position.
- If a match is found, the cycle counter is cleared, Y is cleared, Xinit is written from the ring and the job can start.
- When the job is over, the node stops and waits for the appearance of the "token" type packet.
- When encountered, this token is held in place while the node serialises (MUX4...) the 4 words of the packet.
- When the 4 words are sent, the token is released, so more nodes downstream can insert their data after this node's.
- At the end of the chain, the write packets are discarded and the read packets are stored in a buffer/RAM so the host can read the results later.
- When the end receives the "token", it sends a signal to the host.
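The steps above can be tried out with a stream-level model. This is a rough sketch, not the hardware: each node is a filter over the stream of words, empty words and clock cycles are not modelled, every job is assumed to finish before the token arrives, and `fake_job`, the string tags and `W = 16` are made up for the example.

```python
W = 16                      # assumed word width; the post leaves w open
MASK = (1 << W) - 1

# Word tags (plain strings here; the real ring would use a few tag bits)
TOKEN, W_ADDR, W_X = "token", "write_addr", "write_x"
R_ADDR, R_X, R_LSB, R_MSB = "read_addr", "read_x", "read_cnt_lsb", "read_cnt_msb"
READ_TAGS = {R_ADDR, R_X, R_LSB, R_MSB}


def fake_job(xinit):
    """Hypothetical placeholder for the real computation: (Xres, cycle count)."""
    return (3 * xinit + 1) & MASK, 70000 + xinit


def host_send(jobs):
    """Enqueue one 2-word write packet per job, then terminate with the token."""
    for address, xinit in jobs:
        yield (W_ADDR, address)
        yield (W_X, xinit)
    yield (TOKEN, 0)


def node(address, stream):
    """One node as a stream filter: forwards every word, inserts before the token."""
    selected = False   # previous word was a write address matching this node
    result = None      # finished job waiting for the token
    for tag, value in stream:
        if tag == W_X and selected:
            result = fake_job(value)          # job assumed done before the token
        selected = (tag == W_ADDR and value == address)
        if tag == TOKEN and result is not None:
            xres, cycles = result
            yield (R_ADDR, address)           # the token is held back while
            yield (R_X, xres)                 # the 4 read words go out
            yield (R_LSB, cycles & MASK)
            yield (R_MSB, (cycles >> W) & MASK)
            result = None
        yield (tag, value)                    # forward the word (or release the token)


def host_receive(stream):
    """Keep read packets, drop everything else, stop at the token."""
    results, words = [], []
    for tag, value in stream:
        if tag == TOKEN:
            break
        if tag in READ_TAGS:
            words.append(value)
            if tag == R_MSB:
                address, xres, lsb, msb = words
                results.append((address, xres, (msb << W) | lsb))
                words = []
    return results


def run(jobs, n_nodes):
    stream = host_send(jobs)
    for address in range(n_nodes):            # node addresses = chain positions
        stream = node(address, stream)
    return host_receive(stream)


print(run([(2, 10), (0, 5)], n_nodes=4))
```

With these two jobs the host gets back `[(0, 16, 70005), (2, 31, 70010)]`: the results come out in chain order, not in the order the jobs were sent, because each node inserts its packet just ahead of the token. Since the token is the last meaningful word of the stream, only empty words follow it, so holding it for 4 cycles cannot collide with anything.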
The packet types could be encoded in 3 additional bits, representing:
- 000 : empty word
- 001 : token word
- 010 : write address word
- 011 : write Xinit word
- 100 : read address word
- 101 : read Xres word
- 110 : read cycle counter LSB
- 111 : read cycle counter MSB
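As a sketch of how such tagged words could be handled in a software model: the 3 type bits are placed above the w data bits here, which is my assumption (the post does not say where they go), as are `W = 16` and the names `WordType`, `pack` and `unpack`.

```python
from enum import IntEnum

W = 16  # assumed payload width; the post leaves w open


class WordType(IntEnum):
    EMPTY = 0b000
    TOKEN = 0b001
    WRITE_ADDR = 0b010
    WRITE_XINIT = 0b011
    READ_ADDR = 0b100
    READ_XRES = 0b101
    READ_CNT_LSB = 0b110
    READ_CNT_MSB = 0b111


def pack(kind: WordType, payload: int, w: int = W) -> int:
    """Build a (w+3)-bit ring word: type in the top 3 bits (assumed layout)."""
    return (int(kind) << w) | (payload & ((1 << w) - 1))


def unpack(word: int, w: int = W):
    """Split a ring word back into (type, payload)."""
    return WordType(word >> w), word & ((1 << w) - 1)


word = pack(WordType.READ_XRES, 0xBEEF)
print(hex(word), unpack(word))
```

One nice property of this numbering is that the top bit alone separates host traffic (0xx) from read words (1xx), so the end of the chain only needs to test one bit to decide what to store.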
It's a lot for today, stay tuned and come back tomorrow!