LRU

Scratching the itch of Least-Recently Used algorithms and topologies. I found something funny.

LRU is an extension and modification of FIFO that is critical to cache replacement algorithms, sorting algorithms and compression algorithms (and by extension, crypto). It's important both to gather existing resources on this fundamental family of algorithms and to keep track of my own explorations and musings.
Pseudo-LRU is essential for CPU design, and in April 2024 I discovered an interesting family of systems (algorithms/circuits) that looks like a worthwhile alternative to the other LRU algos I know.
So this project is where I document my explorations and collect interesting data.

Let's start with a simple algorithm, pointed to by Wikipedia, that seems to have been widely used:

pseudo-LRU

two-way set associative - one bit

   indicates which line of the two has been referenced more recently


four-way set associative - three bits

   each bit represents one branch point in a binary decision tree; let 1
   represent that the left side has been referenced more recently than the
   right side, and 0 vice-versa

              are all 4 lines valid?
                   /       \
                 yes        no, use an invalid line
                  |
                  |
                  |
             bit_0 == 0?            state | replace      ref to | next state
              /       \             ------+--------      -------+-----------
             y         n             00x  |  line_0      line_0 |    11_
            /           \            01x  |  line_1      line_1 |    10_
     bit_1 == 0?    bit_2 == 0?      1x0  |  line_2      line_2 |    0_1
       /    \          /    \        1x1  |  line_3      line_3 |    0_0
      y      n        y      n
     /        \      /        \        ('x' means       ('_' means unchanged)
   line_0  line_1  line_2  line_3      don't care)

   (see Figure 3-7, p. 3-18, in Intel Embedded Pentium Processor Family Dev.
    Manual, 1998, http://www.intel.com/design/intarch/manuals/273204.htm)


note that there is a 6-bit encoding for true LRU for four-way set associative

  bit 0: bank[1] more recently used than bank[0]
  bit 1: bank[2] more recently used than bank[0]
  bit 2: bank[2] more recently used than bank[1]
  bit 3: bank[3] more recently used than bank[0]
  bit 4: bank[3] more recently used than bank[1]
  bit 5: bank[3] more recently used than bank[2]

  this results in 24 valid bit patterns within the 64 possible bit patterns
  (4! possible valid traces for bank references)

  e.g., a trace of 0 1 2 3, where 0 is LRU and 3 is MRU, is encoded as 111111

  you can implement a state machine with a 256x6 ROM (6-bit state encoding
  appended with a 2-bit bank reference input will yield a new 6-bit state),
  and you can implement an LRU bank indicator with a 64x2 ROM
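
This encoding is easy to model in software. Here is a small Python sketch of mine (not from the manual) that implements the six pairwise bits, derives the LRU bank, and confirms the 24-of-64 valid patterns:

  # Each bit records the pairwise order "bank[i] more recently used than bank[j]".
  PAIRS = [(1, 0), (2, 0), (2, 1), (3, 0), (3, 1), (3, 2)]   # bits 0..5

  def touch(state, r):
      """Update the 6-bit state after a reference to bank r."""
      for k, (i, j) in enumerate(PAIRS):
          if i == r:
              state |= 1 << k        # r is now more recent than j
          elif j == r:
              state &= ~(1 << k)     # i is no longer more recent than r
      return state

  def lru_bank(state):
      """Return the bank that every other bank is more recent than."""
      for b in range(4):
          if all(((state >> k) & 1) == (1 if j == b else 0)
                 for k, (i, j) in enumerate(PAIRS) if b in (i, j)):
              return b

  # The quoted example: trace 0 1 2 3 (0 is LRU, 3 is MRU) encodes as 111111.
  s = 0
  for ref in (0, 1, 2, 3):
      s = touch(s, ref)
  assert s == 0b111111 and lru_bank(s) == 0

  # Only 24 of the 64 bit patterns are reachable (the 4! orderings):
  seen, todo = {s}, [s]
  while todo:
      cur = todo.pop()
      for r in range(4):
          nxt = touch(cur, r)
          if nxt not in seen:
              seen.add(nxt)
              todo.append(nxt)
  print(len(seen))   # 24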

Of course the 1998 link on the Intel website has long been broken, but this gives us a first approximation:

  • 2-way uses 1 bit. It couldn't be simpler or easier and the logic is truly minimal. Go for it every time you can :-)
  • 4-way is more complex. Only 3 bits are needed if pseudo-LRU is good enough for you, but true LRU has to be distinguished and grows as N!, so you'll need 6 bits and a 256×6 ROM.

How can one build larger systems?

Wikipedia lists many strategies, but the goal is to get "most" of the true-LRU benefits without the size, time and cost.


Logs:
1. 4-LRU
2. 4-way Pseudo-LRU
3. PLRU4
4. MRU mode
5. The other LRU
6. MRU in L1
7. Hit or miss
8. More complete myLRU

Files:

circuit-20240502-0227.circuitjs.txt - LRU+MRU interactive circuit (plain, 4.61 kB, 05/02/2024 at 00:27)

  • More complete myLRU

    Yann Guidon / YGDES • 5 days ago

    Once again, by writing a log, I realise something I missed before. This new version of myLRU (I decided to give it a cute little egotistical name) is more complete now.

    I take the previous version and add 2 signals :

    • a HIT input flag
    • a WAY output field

    This only adds a few more gates and makes it easier to incorporate into a whole cache system.

    The HIT input enables the match/comparators, so spurious invalid inputs don't affect the LRU algorithm. It's only a matter of extending the NOR to 3 inputs.

    ......

    Yeah that was a bit premature. So here I start again from scratch:

    ....

    In this circuit, the matching way is encoded to 2 bits and fed to the LRU circuit, which works in parallel.

    • If there is a hit, the "way selector" output is a copy of the input (no need to change it).
    • If there is a miss, the "way selector" output is the decoded A field.

    Meanwhile,

    • if MRU is set, LRU will keep A and B as is
    • otherwise A<=B and B<=new_B

    ....

    So overall, there are 4 cases:

    1. A<=A, B<=B when MRU=0 & A!=W & B!=W (hit C or D)
    2. A<=A, B<=new_B when MRU=0 & B=W  (hit B)
    3. A<=B, B<=new_B when MRU=0 & B!=W  (hit A)
    4. A<=W, B<=A when MRU=1 & HIT=1 & A!=W

    Note: A=W and B=W is the error condition so there are only 3 relevant conditions for MRU=0, respectively:

    1. A!=W & B!=W
    2. A!=W & B=W
    3. A=W & B!=W

    This updates the previous list such that

    1. A<=A, B<=B when MRU=0 & A!=W & B!=W (hit C or D) or MRU=0 and A=W (hit A)
    2. A<=A,  B<=new_B when MRU=0 & A!=W & B=W  (hit B)
    3. A<=B, B<=new_B when MRU=0 & A=W & B!=W  (hit A)
    4. A<=W, B<=A when MRU=1 & HIT=1 & A!=W (hit B, C or D)

    .

    temporary playground.

    .

    The effect of HIT is not well defined for the MRU=0 case.

  • Hit or miss

    Yann Guidon / YGDES • 6 days ago

    The LRU and pseudo-LRU methods behave differently depending on whether the access is a hit or a miss. The influence of read vs. write is not yet taken into account.

    Considering the ideal, true LRU: the tag bits are only updated if the last accessed way is different from the MRU field. My LRU behaves this way too, but with fewer fields to compare, hence fewer chances of an update.

    The whole cache cycle is a 4-step dance (a behavioural sketch follows the list):

    1. take a portion of the address bits to address one set of the cache, and read it all, latch it in a large buffer.
    2. compare the (4) address tag fields with the upper part of the desired address, encode the result down to 3 bits : 2 "way" bits (2× OR2) and 1 "hit" bit (OR4).
    3. LRU magic : the just read tag bits are mixed with the "way" bits to create new LRU tag bits.
    4. any new data is sent, along with any new address, to the selected way, and the LRU tag bits are updated for the whole set (when applicable).
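
    Here is a behavioural sketch of the dance in Python (my own illustration, with made-up parameters: 4 ways, 64 sets, 32-byte lines; the replacement policy is a plain ideal-LRU queue standing in for whichever policy the other logs discuss):

    WAYS, SETS, LINE = 4, 64, 32

    class CacheLine:
        def __init__(self):
            self.valid, self.tag, self.data = False, 0, None

    cache = [[CacheLine() for _ in range(WAYS)] for _ in range(SETS)]
    lru_order = [list(range(WAYS)) for _ in range(SETS)]    # LRU ... MRU, per set

    def access(addr, write_data=None):
        index = (addr // LINE) % SETS                  # 1. select one set and
        tag = addr // (LINE * SETS)                    #    read it all at once
        ways, order = cache[index], lru_order[index]
        hit_way = next((w for w in range(WAYS)         # 2. compare the tag fields:
                        if ways[w].valid and ways[w].tag == tag), None)
        hit = hit_way is not None                      #    "way" bits + "hit" bit
        way = hit_way if hit else order[0]             #    on a miss, take the LRU way
        order.remove(way)                              # 3. "LRU magic": update the
        order.append(way)                              #    replacement state
        if not hit:                                    # 4. refill / write the way and
            ways[way].valid, ways[way].tag = True, tag #    write the state back
        if write_data is not None:
            ways[way].data = write_data
        return hit, way

    print(access(0x1234), access(0x1234))   # miss, then hit on the same line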

    Note: here I don't consider the (unlikely) case of a partial access; the whole line is used, 16 or 32 bytes at once.

    The link between steps 3 and 4 has some implicit assumptions that must be clarified. It all depends on whether there was a hit or a miss, and a read or a write.

    ......

    to be continued

  • MRU in L1

    Yann Guidon / YGDES • 6 days ago

    I have seen that a lot of research has been done on enhancing the behaviour of L2 and L3 caches, in particular to prevent "thrashing", using deeper histories and complex replacement policies. I see few references on this subject concerning L1; I have only found mentions of "minimising the LRU bits", which was a legitimate concern 30 years ago, but today... what's one more bit?

    A better hit rate in L1 reduces the pressure on L2, which is often seen as a compensation for the weaknesses of L1. This is why an MRU mode in L1 is interesting: it helps when thrashing is detected. But thrashing occurs across consecutive lines, so the LRU bits of a single set can't do much on their own...

    So a memory data read/write port must keep some statistics on the hit/miss ratio and the miss sequences (a shift register? a resettable saturating register?) and issue an LRU/MRU signal to the appropriate cache block. My algorithm (and pureLRU) can accommodate this and help reduce thrashing.
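
    Just to fix the idea, here is a strawman sketch of mine (not part of the design): a resettable saturating counter that turns recent miss activity into the LRU/MRU mode signal. The width and threshold are made up.

    class ThrashDetector:
        def __init__(self, width=4, threshold=12):
            self.max = (1 << width) - 1
            self.threshold = threshold
            self.count = 0

        def record(self, hit):
            # count up on a miss, down on a hit, saturating at both ends
            if hit:
                self.count = max(self.count - 1, 0)
            else:
                self.count = min(self.count + 1, self.max)

        def mru_mode(self):
            # asserted towards the cache block when misses dominate recently
            return self.count >= self.threshold

        def reset(self):
            self.count = 0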

  • The other LRU

    Yann Guidon / YGDES • 05/02/2024 at 19:49

    If I want to be sure that my "invented here" LRU4 is both valid and better than other systems, I must compare them, which means I must also look into the existing systems.

    The first obvious choice is the IBM-invented 3-bit tree-LRU (as named and described on the Wikipedia page), which is actually a recursive version of the obvious 1-bit LRU for a 2-way associative cache. You can't make it simpler than that: it's an inverter!

    But for 4 ways, it gets a bit more complex, as quoted in the project description. Fortunately a sort of lookup table is provided:

    state | replace      ref to | next state
    ------+--------      -------+-----------
     00x  |  line_0      line_0 |    11_
     01x  |  line_1      line_1 |    10_
     1x0  |  line_2      line_2 |    0_1
     1x1  |  line_3      line_3 |    0_0
       ('x' means don't care)
       ('_' means unchanged)
    

    This can be condensed into a few gates:
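
    For simulation, the table also transcribes to a few lines of Python (a sketch of mine, not the gate-level version); the state is the bit triple (b0, b1, b2):

    def plru3_victim(b0, b1, b2):
        if b0 == 0:
            return 0 if b1 == 0 else 1      # 00x -> line_0,  01x -> line_1
        return 2 if b2 == 0 else 3          # 1x0 -> line_2,  1x1 -> line_3

    def plru3_touch(b0, b1, b2, line):
        if line == 0: return (1, 1, b2)     # ref to line_0 -> 11_
        if line == 1: return (1, 0, b2)     # ref to line_1 -> 10_
        if line == 2: return (0, b1, 1)     # ref to line_2 -> 0_1
        return (0, b1, 0)                   # ref to line_3 -> 0_0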

    No wonder this is a favourite method, but is it actually good? It seems to have fallen out of fashion over the last 20 years.

    This is to be compared to the "true LRU", which uses 5 bits in the optimised case.

    An MRU mode could be implemented by not updating the bits.

    ------------------------------------------------------

    Another, more modern method (Itanium, SPARC...) is the 4-bit "bit-LRU" with a pretty weird scheme. Quoting Wikipedia:

    Bit-PLRU

    Bit-PLRU stores one status bit for each cache line. These bits are called MRU-bits. Every access to a line sets its MRU-bit to 1, indicating that the line was recently used.
    At this point, OK, I get it, pretty simple and it makes sense: "mark the accessed way as used". But then it gets different, because after a while everything will be marked. The method here looks trivial but I had never heard of it:

    Whenever the last remaining 0 bit of a set's status bits is set to 1, all other bits are reset to 0.
    WHAT ? you just clear the whole history ???

    So far this is the corresponding circuit:

    At cache misses, the leftmost line whose MRU-bit is 0 is replaced.

    There is a priority encoder now? Why not a random value, to prevent biases?

    This increases the logic depth, unless a ROM is used. The priority encoder is added below:

    This circuit provides both the binary encoded number of the new set to replace, as well as the 4-bit set. The logic depth is 5 here (like the one I designed).

    And I'm not sure how to design a MRU mode with it.

    Maybe the trick for this system is to send 5 signals to the target line: 4 "set" signals and one "reset all" signal (overridden by the one "set" signal). Set-reset flip-flops take fewer transistors (about half) than DFF gates.

    The bias might be that, over time, some lines get moved toward the "left", creating a sort of natural LRU where the "rightmost" lines are evicted more often. I have only vague ideas about it; it should be tested as well as documented.
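
    For later comparisons, here is a behavioural sketch of mine built from the two quoted rules, for one 4-way set (bits[i] is the MRU-bit of line i):

    def bitplru_touch(bits, line):
        bits = list(bits)
        bits[line] = 1
        if all(bits):                  # the last remaining 0 was just set to 1...
            bits = [0] * len(bits)     # ...so the whole history is cleared,
            bits[line] = 1             # except for the line just accessed
        return bits

    def bitplru_victim(bits):
        return bits.index(0)           # leftmost line whose MRU-bit is 0

    # Example: hits on lines 0, 1, 2 then 3 wipe the history:
    s = [0, 0, 0, 0]
    for line in (0, 1, 2, 3):
        s = bitplru_touch(s, line)
    assert s == [0, 0, 0, 1] and bitplru_victim(s) == 0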

    ---------------------------------------------------

    Update:

    I looked at the referenced paper from 2010, where it's called "NRU" (Not Recently Used) and the description is "slightly different" !

    So the position of the first candidate rotates, preventing a bias. Is the counter global, or stored with the lines? The latter would take 2 more bits per line. And a counter is more "concerning" than a PRNG.

    ...............................................................................................................

    And since we're in 2024, power consumption matters, and here signal activity (level changes) is significant.
    So the rate of bit toggles must be included in the comparisons.

    Tree-LRU changes the polarity of 2 signals at every cycle ! (and only uses 8 of the possible 24 permutations).

    Bit-LRU might swap 4 bits at once from time to time, only setting one most of the time.

    ...............................................................................................................

    Another detail I seem to have missed is the Hit/Miss behaviour. That means another log about this subject...

  • MRU mode

    Yann Guidon / YGDES • 04/30/2024 at 13:19

    To implement on-the-fly MRU, it's simple:

    • if A!=S then B <= A (and enable update) (else B)
    • S goes to A.

    That's it.

    This takes precedence over the rest, so it adds a layer of MUX and some latency...

    .

    preliminary interactive playground here.

    .

    The added MUX layer is quite OK for the A output:

    A <= S when MRU='1'
      else B when A=S
         else A;

    Easy !

    But for B the equation is more complex (B must keep the old A only when A actually moves away, per the rule above).
    If MRU='1' then
      if A=S then
        B <= B;
      else
        B <= A;
      end if;
    else
      if ... then
        B <= newB;
      else
        B <= B;
      end if;
    end if;
    

    This "folding" else is not obvious to redesign. Ideally only a pair of MUX2 should exist.
    And let's not forget the WB output.

    Here is another temporary version.

    .....

    But I have finally found a trick by swapping the MUX, giving more time to the decoding logic and reducing redundancies. The current version is here:

    The logic depth is about 5 gates and it supports both LRU and MRU modes. The approximate number of CMOS gates is 30, though going down to NOR/NAND logic could bring some enhancements, by sharing inverters for example.

    Can I consider the case closed ?

    I don't think so, because I'll have to transcribe it to VHDL and then stress-test it with simulated workloads.

    That will come later :-P

  • PLRU4

    Yann Guidon / YGDES • 04/28/2024 at 18:51

    I'm starting to play with circuitjs. You can try a first playground here. It's still buggy and incomplete but you get the idea of the actual complexity.

    For now I have dropped the requirement for B to avoid the 00 state, as well as the XOR. When the circuit detects A=B, it outputs an "error" status bit and forces the update: this makes the circuit self-correcting after a couple of cycles, like the old 4017.

    This changes the boolean equations that I already found the other night, and I must redesign them. There are only 64 cases to consider, with many possibilities, but few are minimal.

    The following lookup table shows the values injected into B, since A is usually a copy of the old B.

          |  S:
    A: B: |  00        01        10        11
    ------+-----------------------------------------
    00 00 |  R:01,10,11...........................
    00 01 |  A:10,11   B:10,11   x         x
    00 10 |  A:01,11   x         B:01,11   x
    00 11 |  A:01,10   x         x         B:01,10
    
    01 00 |  B:10,11   A:10,11   x         x
    01 01 |  R:00,10,11...........................
    01 10 |  x         A:00,11   B:00,11   x
    01 11 |  x         A:00,10   x         B:00,10
    
    10 00 |  B:01,11   x         A:01,11   x
    10 01 |  x         B:00,11   A:00,11   x
    10 10 |  R:00,01,11...........................
    10 11 |  x         x         A:00,01   x
    
    11 00 |  B:01,10   x         x         A:01,10
    11 01 |  x         B:00,10   x         A:00,10
    11 10 |  x         x         B:00,01   A:00,01
    11 11 |  R:00,01,10............................
    
    x : no modification
    R: error (A=B) =>      B=new (A=B implicit, harmless)
    A: Match A     => A=B, B=new
    B: Match B     =>      B=new
    

    In all cases, the output A is either input_A or input_B. So there is nothing to compute on this part.

    The only thing to compute is the new_B : 2 bits depending on 6 input bits. Make it 7 for the random input. And an 8th input for MRU mode but it doesn't affect the computation of new_B. And the table shows that the possible outputs only depend on the A and B inputs (4 bits only).

    Condensed array:

    A: B: A^B |  Possible values:
    ----------+------------------
    00 00  00 |     01,10,11
    00 01  01 |        10,11
    00 10  10 |     01,   11
    00 11  11 |     01,10
    
    01 00  01 |        10,11
    01 01  00 |  00,   10,11
    01 10  11 |  00,      11
    01 11  10 |  00,   10
    
    10 00  10 |     01,   11
    10 01  11 |  00,      11
    10 10  00 |  00,01,   11
    10 11  01 |  00,01
    
    11 00  11 |     01,10
    11 01  10 |  00,   10
    11 10  01 |  00,01
    11 11  00 |  00,01,10
    

    All entries have at least 2 possible outputs, to be selected by a random bit.

    The possible outputs must not be A or B so A^B is a very interesting value.

    The table shows that ~(A^B) is a valid value for most entries except for B=11, where A^B is valid, so we have something here! A NAND and a few XOR will work.

    Version 2 seems to not work correctly though:

    It seems that my initial boolean analysis was wrong.

    A: B: A^B /A^B |  Possible values:
    ---------------+---------------------
    00 00  00   11 |     01,10,11
    00 01  01   10 |        10,11
    00 10  10   01 |     01,   11
    00 11  11   00 |     01,10
    
    01 00  01   10 |        10,11
    01 01  00   11 |  00,   10,11
    01 10  11   00 |  00,      11
    01 11  10   01 |  00,   10
    
    10 00  10   01 |     01,   11
    10 01  11   00 |  00,      11
    10 10  00   11 |  00,01,   11
    10 11  01   10 |  00,01
    
    11 00  11   00 |     01,10
    11 01  10   01 |  00,   10
    11 10  01   10 |  00,01
    11 11  00   11 |  00,01,10
    

    There is obviously a requirement to use a "diagonal" of the squares (even permuted) because focusing on a column would create an unacceptable bias of the sequences, particularly if 11 is constantly output (except 00 when A=11). So both bits should be inverted when A!=11.
    But there is more happening with B=11 that requires special care.

    In the new circuit, A=00 B=11 and A=11 B=00 both resolve to the unwanted value 11.
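
    A quick exhaustive check in Python (mine) agrees: these are exactly the two pairs where neither ~(A^B) nor A^B avoids the values already stored in A and B.

    for A in range(4):
        for B in range(4):
            if A == B:
                continue                      # error states, handled separately
            inv = (A ^ B) ^ 0b11              # ~(A^B) on 2 bits
            ok_inv = inv not in (A, B)
            ok_xor = (A ^ B) not in (A, B)
            if not ok_inv:
                print(f"A={A:02b} B={B:02b}: ~(A^B) collides,"
                      f" A^B {'works' if ok_xor else 'collides too'}")
    # ~(A^B) fails exactly when A=11 or B=11; A^B rescues those cases except
    # the two pairs (00,11) and (11,00), which are the ones needing extra gates.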

    For A=00 B=11 this creates the forbidden state 11 11 when the desired set S is also 11. This case only occurs 1/4 of the time and can be recovered from later, because the next state will become 11 00. Adding an extra gate can avoid the copy of B into A2.

    For A=11 B=00, the forbidden state 11 11 is also output when S=00.

    A few more gates (A[0] xor A[1], NAND3 and XOR3 replacing XOR2) solve this and we get:

    • A=00 B=11 S=00 => A2=11, B2=01
    • A=11 B=00 S=11 => A2=00, B2=01

    And that's it.

    A: B: | value:
    ------+---------
    00 00 |  11
    00 01 |  10
    00 10...

  • 4-way Pseudo-LRU

    Yann Guidon / YGDES • 04/28/2024 at 16:23

    The last post was the first ever in this project, which was created about 3 years ago as a placeholder: I just wanted to keep track of interesting algos and have fun with them. I didn't expect to stumble upon what I just described... So I'm starting to take it more seriously, not as a one-sleepless-night curiosity.

    As always, the goal is to get a fairly good replacement behaviour with the least resources and latency. I found something with a good ratio: at 4 bits of history, it's 1 more bit than the IBM tree-PLRU used in the 486 and others, but that is not a significant overhead because each line already needs 2 MESI bits anyway. So 4 ways require (2×4)+4 = 12 bits per set of ways instead of 11, no big deal here. And it's not far from the ideal 5.

    The decoding can be done with logic gates only, no ROM required. And it can be simply switched from LRU to MRU on the fly (by forcing a hit on a field).

    The algo only compares one half of the ways: that's just 2×2 bits to compare. And it compares only the ones soon to be evicted, leaving the uncertainty on the MRU half. So there are fewer bad surprises with lines that get evicted too early: there are at least 2 turns before that happens.

    I also like that it is scalable and easily configured. Higher numbers of ways can be handled with little effort, since only one half of the fields are compared.

    It is also energy efficient, since repeated hits on the MRU half do not change the state: there is no need to write it back. Extra energy is only needed when an actual replacement occurs.

    -o-O-0-O-o-

    The system is based on the ideal LRU queue. For N entries, there are N fields (A, B, C, D, E... from the LRU to the MRU) that act as a FIFO until it is full. Let's have an 8-deep queue with slots A to H containing a permutation of numbers 0 to 7, initialised in increasing order:

    input sequence: 0 1 2 3 4 5 6 7
    
          LRU                  MRU
    slot:  A  B  C  D  E  F  G  H
    val:   0  1  2  3  4  5  6  7

    When a new value is added from the sequence, it is inserted at the MRU level, and the other values are shifted down the queue, until the inserted value is found. For example let's add 2:

          LRU                  MRU
    slot:  A  B  C  D  E  F  G  H
    val:   0  1  3  4  5  6  7  2 <==insertion

    The value 2 is found in slot C, so all the values in the slots to its right move down one position, overwriting the 2, which is re-inserted at the MRU slot H.
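
    In software, the whole update is the classic move-to-the-MRU-end operation; a two-line Python illustration:

    def touch(queue, value):
        queue.remove(value)       # the slots to its right shift down by one...
        queue.append(value)       # ...and the value is re-inserted at the MRU end
        return queue

    print(touch(list(range(8)), 2))    # [0, 1, 3, 4, 5, 6, 7, 2]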

    Here is the problem:

    • The comparisons and shifting get heavy very fast as the queue depth increases.
    • We only care about the LRU field because this is the one we need to replace an old entry.

    Here is the solution:

    Only keep the M least recently used slots.

    For example, for N=8 we can set M to 3. This requires 3×3 bits, and 3 three-bit comparisons, for the queue. It also means that a given entry must remain unreferenced for M turns before it gets evicted.
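
    As a behavioural model (a sketch of mine; the "guessed" insertion discussed further below is simply drawn at random here):

    import random

    def update(tracked, way, n_ways):
        # tracked holds the M least recently used way numbers; tracked[0] is slot A
        if way in tracked:                        # hit on a tracked slot: the slots
            tracked.remove(way)                   # shift down and a guessed value,
            guess = random.choice([w for w in range(n_ways)
                                   if w not in tracked and w != way])
            tracked.append(guess)                 # not among the inputs, fills the hole
        return tracked                            # hit in the MRU half: no change

    def victim(tracked):
        return tracked[0]                         # slot A is the eviction candidate

    # N=8, M=3: assume ways 0, 1, 2 are currently the least recently used.
    t = update([0, 1, 2], 1, 8)
    print(t, victim(t))    # e.g. [0, 2, 5] 0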

    The other beauty is that neither N nor M needs to be a power of 2. A 5-way example (N=5, M=3) would require 6 bits of history since:

    • A needs 5 values
    • B needs 4 values
    • C needs 3 values

    B is encoded in 2 bits, A and C are encoded together in 4 bits because 5×3=15.
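
    For instance, one possible packing (an illustration only, not necessarily the cheapest in gates): with A in 0..4 and C reduced to an index in 0..2, the pair fits one 4-bit code.

    def pack(a, c):
        return a * 3 + c             # 0..14, fits in 4 bits since 5 * 3 = 15

    def unpack(code):
        return code // 3, code % 3   # back to (A, C)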

    Here is the new problem:

    Since the MRU slots are ignored, during shifting there must be one new value inserted. The value could be random but must not be one of the inputs. This is a less trivial question, but it is still limited since only log2(N) bits must be "guessed". The wide range of possible values makes it a good candidate for boolean simplifications (where creativity can be unleashed). The guess only depends on the existing M×log2(N) inputs, not on the desired/matching set.

    • For N=4 M=2 it requires a 16×2 lookup and 2×2 comparisons.
    • For N=8 M=3, there is a 512×3 lookup and 3×3 comparisons.
    • For N=8 M=4, there is a 4096×3 lookup and 4×3 comparisons.

    Of course, some heuristics can further decrease the LUT size, using common boolean techniques.

  • 4-LRU

    Yann Guidon / YGDES • 04/28/2024 at 03:16

    The permutation demon is biting again while I investigate a related caching subject. I am now considering the full LRU for 4 ways, using 5 bits arranged in 2-2-1 fields.

    Field A: 2 bits, fully encoded (4 values from 00 to 11), directly pointing to the least recently used way. It is handy because the value is directly available.

    Field B: the middle 2-bit field has only 3 values, 00 being forbidden (it marks a cleared set), so the number of the 2nd-to-least recently used way is obtained by a XOR with field A. The XOR is not required after all, but it is not too heavy on the circuit anyway.

    Field C: 1 bit, selects which one of the 2 remaining sets comes first. The logic gets complex here, but now I wonder if that part matters at all. Maybe this field can be simply skipped and removed, because it does not affect the LRU much.

    In fact what matters is the least recently used end, not the most recently used, so the ordering of the 2 most recently used ways can be skipped.

    All one has to do is compare a new set with the fields A and B, and if there is no match, no update.

    If there is a match then the matched field is replaced with the one on the left : this acts like a sort of shift register. And given the fields A and B, some boolean logic is enough to restore one of the missing set numbers.

    This greatly reduces the size of the LRU logic, probably as small as the IBM 3-bit tree system but with slightly better LRU behaviour. And there is no large lookup table like the one suggested in the project intro.

    ***************************************

    What's crazy is : while writing it, it all made sense and it was totally unexpected.

    Here is some pseudocode:

    A : 2 bits \ from LRU
    B : 2 bits /
    B' : 2 bits, is actual B.
    S : 2 bits (new set to update)
    A2, B2 : updated LRU fields.
    
    B' = A xor B;
    
    If S != A and S != B' : Do nothing.
    Else
       if S == A then
         A2 <= B'
         B2 <= newB(A, B', S)
       else
         if S == B' then
           -- A is not modified
           B2 <= newB(A, B', S)
    

     So that's actually a few XOR and MUX.

    The secret sauce is in the function newB that creates the new value of B, derived from A, B' and S. It is used in both branches of the IF, so we implicitly know that either S==A or S==B', as well as A!=B. A first approximation uses B, because it's already the difference, which is never 0! (though this could add a bias...)

    Here is the updated pseudo-VHDL code:

    B' := A xor B;
    newB := B;
    
    A2 <= B'          when  S == A               else A;
    B2 <= newB xor A2 when (S == A) or (S == B') else B;
    
    update if (S == A) or (S == B')
    

    The correct formula of B2 is not yet defined but you get the idea : it's quite simple. No weird LUT to precompute.

    newB can have two sets of values:

    • when B is 11, newB can be A^1, A^2, B^1 or B^2 (whichever you want),
    • when B is 01 or 10, newB can be /A or /B' (whichever you like),
    • B can't be 00 by the way. But if it is, let's set newB to /A2 to force a sane value.

    There is the risk that choosing only one field all the time (always A or always B) creates a bias that will disturb and affect the cache behaviour. This can be solved by using a 5th bit (resurrect field C) but that might be overkill so a random source is used.

    Some boolean simplifications could propagate through the circuit. At this point we do not really care that field B has only values 01, 10 or 11 so we spare some gates at the output. However B' =A^B is still required by newB.

    .

    Fun fact: this method can choose between LRU and MRU on the fly (for example when a stream is detected) by forcing a "hit" on field A (field B can have a match but will behave the same as with A).
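
    To prepare for the stress tests mentioned in another log, here is a behavioural model in Python: it is my own reading of the scheme, working on plain way numbers rather than on the xor-encoded B field, picking newB at random as discussed above, and including the on-the-fly MRU trick.

    import random

    def update(a, second, s, mru_mode=False):
        # a is the LRU way, second the 2nd-to-LRU way; the 2 MRU ways are untracked
        if mru_mode:                     # "force a hit on field A": S becomes the
            return (s, a) if s != a else (a, second)    # next victim, old A is kept
        if s not in (a, second):
            return a, second             # hit on the untracked MRU half: no change
        new_a = second if s == a else a
        # the new 2nd-to-LRU is guessed among the two untracked ways
        new_second = random.choice([w for w in range(4) if w not in (a, second)])
        return new_a, new_second

    print(update(2, 0, 2))    # -> (0, 1) or (0, 3): way 2 was hit, way 0 becomes LRU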



Discussions

Yann Guidon / YGDES wrote 04/28/2024 at 02:24:

5-way seems easy :

2-way is 1 bit (easy).

3-way (6 permutations) requires 1+2 bits, with the 2-bit field using only 3 codes.

4-way (24) adds another 2-bit field, the new one is fully used.

5-way would require another 3-bit field that is not fully used. But we notice that 3×5=15, which fits in 4 bits. So the 3-bit field and the first 2-bit field can be merged.

So the 5-way full LRU permutations fit in 1+2+4=7 bits, which makes sense since 5!=120=2^7 - 8.

Interesting.

The remaining 8 codes can be used to indicate reset status for example.


Yann Guidon / YGDES wrote 09/06/2021 at 06:10:

https://arxiv.org/abs/1512.00727 :

[Submitted on 2 Dec 2015 , last revised 3 Dec 2015 (this version, v2)]

TinyLFU: A Highly Efficient Cache Admission Policy

Gil Einziger, Roy Friedman, Ben Manes

This paper proposes to use a frequency based cache admission policy in order to boost the effectiveness of caches subject to skewed access distributions. Given a newly accessed item and an eviction candidate from the cache, our scheme decides, based on the recent access history, whether it is worth admitting the new item into the cache at the expense of the eviction candidate.

Realizing this concept is enabled through a novel approximate LFU structure called TinyLFU, which maintains an approximate representation of the access frequency of a large sample of recently accessed items. TinyLFU is very compact and light-weight as it builds upon Bloom filter theory.

We study the properties of TinyLFU through simulations of both synthetic workloads as well as multiple real traces from several sources. These simulations demonstrate the performance boost obtained by enhancing various replacement policies with the TinyLFU eviction policy. Also, a new combined replacement and eviction policy scheme nicknamed W-TinyLFU is presented. W-TinyLFU is demonstrated to obtain equal or better hit-ratios than other state of the art replacement policies on these traces. It is the only scheme to obtain such good results on all traces.

A much earlier and shorter version of this work appeared in the Euromicro PDP 2014 conference

