Libre Gates

A Libre VHDL framework for the static and dynamic analysis of mapped, gate-level circuits for FPGA and ASIC. Design For Test or nothing !

As the project "VHDL library for gate-level verification" (https://hackaday.io/project/162594 ) was progressing, more features and more abstraction were developed, such that libraries other than the ProASIC3 family could be implemented. ASIC libraries such as sxlib or wsclib are good candidates, and more will appear with the surge in DIY ASIC projects spurred by Google & Skywater.

This project brings its pure GHDL-driven code to more technologies, allowing users to check their circuit, verify BIST coverage and eventually generate test vectors with only a set of really Libre files in pure VHDL (thus avoiding expensive and restrictive license fees from niche proprietary vendors). Oh, did I mention how awesome GHDL is ? But you could use any simulator that is fully VHDL'93 compliant.

Import from, and export to other netlist formats are also in the air...

This project contains a set of tools that process VHDL files mapped to FPGA/ASIC gates, so it is useful as a step between the synthesis of a circuit and the place&route operation.

You can:

  • Simulate the circuit (for example, if your synthesiser has mapped the gates to a given PDK but you don't have the corresponding gates in VHDL)
  • Perform static analysis of the netlist (unconnected inputs or outputs, and other common mistakes)
  • Extract dynamic activity statistics (how often does a wire flip state, if at all ?)
  • Verify that any internal state can be reached (thus helping with logic simplifications)
  • Alter any boolean function, inject arbitrary errors and prove your BIST strategy
  • Extract logic traversal depth and estimate speed/latency
  • Inspect logic cones, see what inputs and outputs affect what
  • Help with replacing DFFs with transparent latches
  • Ensure that the circuit is correctly initialised with the minimal amount of /RESET signals
  • Detect and break unexpected logic loops or chains

Some day, it could be extended to

  • Pipeline a netlist and choose the appropriate strategy (will require detailed timing information)
  • Transcode/Transpile a netlist from one family/technology to another
  • Import/export to EDIF or others ?

 
Note: Since the tool typically processes netlists before place&route, no wiring parasitics data are available yet, so precise timing extraction is impossible and the tool doesn't even try. It can still help, in particular by extracting the criticality of each path and then mapping the gates to the proper fanout.

The project started as #VHDL library for gate-level verification but the scope keeps extending and greatly surpasses the mere ProASIC3 domain. For example I also study the addition of the minimalist OSU FreePDK45. More unrelated libraries would be added in the near future, depending on applications : Skywater PDK and Alliance could follow. Contact me if you need something !


Logs:
1. First upload
2. Second upload
3. Rewrite
4. More features ! (one day)
5. Another method to create the wrapper
6. inside out
7. OSU FreePDK45 and derivatives
8. The new netlist scanner
9. Chasing a "unexpected feature" in GHDL
10. Polishing and more bash hacking
11. Completeness of a simple heuristic
12. Benchmarking results
13. Wrapper rewrite
14. A smarter sinks list allocator
15. Strong typing snafu

LibreGates_20201202.tbz

netlist probe ok

x-bzip-compressed-tar - 224.07 kB - 12/02/2020 at 03:17

LibreGates_20201128.interim.tbz

rebuilding the netlist probe code.

x-bzip-compressed-tar - 234.94 kB - 11/28/2020 at 09:32

LibreGates_20201121.tbz

New wrapper generator, new brute-force test/benchmark...

x-bzip-compressed-tar - 232.98 kB - 11/21/2020 at 20:17

LibreGates_20201116.tbz

Faster fault test with parallel execution, weird bash bug squashed.

x-bzip-compressed-tar - 199.16 kB - 11/16/2020 at 03:17

LibreGates_20201108.tbz

self-tests pass OK, OSU barely started, wrapper generator OK with split source files, netlist-probe remains to be done, some obsolete stuff will be removed later.

x-bzip-compressed-tar - 198.16 kB - 11/08/2020 at 05:05


  • Strong typing snafu

    Yann Guidon / YGDES • 3 days ago • 0 comments

    VHDL is a strongly typed language : you can define types that, though similar, can't intermix unless you cast them. This is good for code robustness. As long as you get them right. My recent code used strong typing to prevent mixing two similar-looking types, that I finally illustrated on this diagram:

    The two types share the "signedness" trick where a number less than 1 indicates a port. The differences are significant though:

    • One points to an input while the other points to an output.
    • A positive sink number must also specify the gate's input number, encoded in the 2 LSB.

    At one point I must have been confused and the corresponding diagram made no sense... sink_number_type was used instead of driver_number_type and sinks would point to sinks. That's a virtue of thorough documentation: it helps catch logic errors :-)

    The above diagram and the one below are the keys to understanding how the netlist is structured. Each sink points to its driver, and each driver points to one sink or a list of sinks. Easy to say, delicate to code :-)
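
    For readers more at ease with C than VHDL, here is a minimal sketch of the same idea. C has no strong typedefs, so each number is wrapped in its own struct and the compiler refuses to mix them. Only the names sink_number_type and driver_number_type come from the project: the `_t` types, the helper functions, the numbering of gates from 1 and the limit of 4 inputs per gate are assumptions made for this illustration, and the encoding of ports (numbers less than 1) is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Two distinct wrapper types: the compiler rejects passing one for the other,
   which mimics the protection given by VHDL's strong typing. */
typedef struct { int32_t n; } sink_number_t;    /* points to an input  */
typedef struct { int32_t n; } driver_number_t;  /* points to an output */

/* Shared "signedness" trick: a number less than 1 designates a port. */
int sink_is_port(sink_number_t s)     { return s.n < 1; }
int driver_is_port(driver_number_t d) { return d.n < 1; }

/* ASSUMPTION: gates are numbered from 1 and have at most 4 inputs,
   so the input number fits in the 2 LSB of a positive sink number. */
sink_number_t make_gate_sink(int32_t gate, int32_t input)
{
    assert(gate >= 1 && input >= 0 && input <= 3);
    sink_number_t s = { gate * 4 + input };
    return s;
}

/* Decode a positive sink number back into its two fields. */
int32_t sink_gate(sink_number_t s)  { assert(!sink_is_port(s)); return s.n / 4; }
int32_t sink_input(sink_number_t s) { assert(!sink_is_port(s)); return s.n % 4; }
```

    With these wrappers, calling sink_gate() on a driver_number_t is a compile-time error, which is exactly the kind of mix-up described above.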

    So the confusion is now cleared and I can resume development.

    And the result is there : LibreGates_20201202.tbz

  • A smarter sinks list allocator

    Yann Guidon / YGDES • 6 days ago • 0 comments

    I don't know why I heard about the radix sort algorithm only recently. I have seen it mentioned, among many other algorithms, but I had not looked at it, leaving this subject to the sorting nerds. Until I watched this :

    This is very smart and useful but I'm not building a sorting algorithm now. However I can reuse some of these tricks to build, or compile, a semi-dynamic data structure (write-only once) that is both compact and efficient.

    Let's go back to my first version of the netlist probe:

    Each signal driver must manage a list of its sinks and I first implemented it as a linked list, borrowing some words of memory from each sink structure. The memory overhead is minimal because one sink can be linked to only one source. The "pointer" of the linked list is both the desired value and the pointer to the next sink (unless it's the last). It avoids any malloc() and meshes directly with the existing structures, but in return they become more complex and harder to manage. The insertion and scan algorithms are a bit cumbersome...

    The new version allocates a single chunk of memory to store simple lists of sink addresses (called sinks_list).

    This allows the sink descriptors to be a single number, which is the address of the source/driver. It's easy to manage. As shown in log 3. Rewrite, the driver only contains a fanout number and an address:

    • If the fanout is 1, the address is a direct sink address.
    • If the fanout is 2 or more, the address is an index in the unified array of addresses, pointing to fanout× contiguous sink addresses.

    Building an appropriate compact structure in one pass is not impossible but would require many inefficient re-allocations. However, with 3 passes, it's easy !

    1. After the first netlist scan, all the drivers have their fanout count updated.
    2. A second pass creates a counting variable; then, for each driver with more than 1 sink, the fanout is added to the counter, which is then stored in the driver's address. This is the "prefix sum".
    3. The large memory chunk can now be allocated because we know exactly how many items it will contain, and the intermediary values of the counter point just beyond the end of each sub-list. The third pass re-scans all the sinks, checks the driver, then (if fanout > 1) gets the index, decrements it and finally stores the sink address in the main list array. As explained in the video, everything falls into place neatly because pre-decrementing ensures that the last write is to the first element of the list.

    With this method, there are more loops but they are simpler, and each lookup is faster, creating fewer cache faults for example.
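
    The three passes can be sketched in C as follows. This is not the project's code (which is written in VHDL): driver_t, build_sinks_list() and the sink_driver[] input array (the driver index of each sink) are hypothetical names chosen for the illustration. Only sinks_list, the fanout and the address come from the description above.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct {
    int fanout;   /* number of sinks, counted in pass 1 */
    int address;  /* fanout == 1: the sink itself; fanout > 1: index in sinks_list */
} driver_t;

/* sink_driver[s] is the index of the driver that feeds sink s (hypothetical input).
   Returns the malloc'ed sinks_list (to be freed by the caller), or NULL on failure. */
int *build_sinks_list(driver_t *drivers, int nb_drivers,
                      const int *sink_driver, int nb_sinks, int *list_size)
{
    int d, s, counter = 0;

    /* Pass 1: count the fanout of each driver. */
    for (d = 0; d < nb_drivers; d++) {
        drivers[d].fanout = 0;
        drivers[d].address = 0;
    }
    for (s = 0; s < nb_sinks; s++) {
        assert(sink_driver[s] >= 0 && sink_driver[s] < nb_drivers);
        drivers[sink_driver[s]].fanout++;
    }

    /* Pass 2: prefix sum. Each address ends up one past the end of its sub-list. */
    for (d = 0; d < nb_drivers; d++) {
        if (drivers[d].fanout > 1) {
            counter += drivers[d].fanout;
            drivers[d].address = counter;
        }
    }

    /* The exact size is now known: a single allocation is enough. */
    int *sinks_list = malloc((counter > 0 ? counter : 1) * sizeof *sinks_list);
    if (sinks_list == NULL)
        return NULL;

    /* Pass 3: pre-decrement then store, so the last write of each
       sub-list lands on its first element. */
    for (s = 0; s < nb_sinks; s++) {
        driver_t *drv = &drivers[sink_driver[s]];
        if (drv->fanout == 1)
            drv->address = s;                      /* direct sink address */
        else
            sinks_list[--drv->address] = s;
    }

    *list_size = counter;
    return sinks_list;
}
```

    After the call, a driver with a fanout of 2 or more finds its sinks at sinks_list[address] to sinks_list[address + fanout - 1]. A side effect of the pre-decrement is that each sub-list holds its sinks in reverse scan order, which should not matter for a netlist.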

    This wouldn't have been considered possible if I had never watched the video above, and I think I'll apply this method in more places :-)

  • Wrapper rewrite

    Yann Guidon / YGDES • 11/21/2020 at 19:39 • 0 comments

    Good news everyone !

    The benchmarking results are encouraging and made possible thanks to a new, rewritten version of the wrapper, which now even handles some essential generics ! You can expose generics of integers and string-based types, including std_logic, text and SLVx.

    The core of this tool relies on GHDL's XML output, which is then parsed by a crude bash script. This is part of the new release :-)

  • Benchmarking results

    Yann Guidon / YGDES • 11/20/2020 at 17:27 • 0 comments

    The results are finally available !

    [yg@localhost test6_bigLFSR]$ ./test.sh 
    Simple gate version setup (RAM+time) :
    290 4260 0:00.00
    580 4528 0:00.00
    1160 4800 0:00.00
    2030 5576 0:00.00
    2900 6132 0:00.00
    5800 8272 0:00.01
    11600 12296 0:00.02
    20300 18568 0:00.03
    29000 24744 0:00.05
    58000 45264 0:00.09
    116000 86364 0:00.18
    203000 148588 0:00.30
    290000 210440 0:00.44
    580000 417476 0:00.91
    1160000 831120 0:01.77
    2030000 1451852 0:03.11
    Detailed gate version setup (RAM+time) :
    290 5480 0:00.00
    580 6652 0:00.01
    1160 8668 0:00.01
    2030 12052 0:00.03
    2900 15336 0:00.04
    5800 26764 0:00.08
    11600 49428 0:00.16
    20300 83392 0:00.28
    29000 117340 0:00.39
    58000 230732 0:00.79
    116000 457372 0:01.53
    203000 797488 0:02.70
    290000 1137588 0:03.81
    580000 2270924 0:07.51
    1160000 4538012 0:15.17
    2030000 7938396 0:27.71
    Benchmark : OK

    I wanted to test several things :

    • Time and RAM are roughly linear, which is good news. Note that this is only the setup performance.
    • Setup time is about 10× longer with the detailed version, and I don't even run an iteration !
    • The detailed version uses about 6× more RAM, which amounts to about 4KB for a single gate !

    This means that with 16GB RAM, it is possible to simulate approx. 20M gates and analyse 3M gates.
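
    For those who want to redo the arithmetic from the last line of each table, here is a small C sketch. The function names are made up for this example, and it assumes that the second column is the resident memory in KB and that 16GB means 16×1024×1024 KB.

```c
#include <assert.h>

/* Average memory cost of one gate, in KB, from one benchmark line. */
double kb_per_gate(double ram_kb, double gates)
{
    assert(gates > 0.0);
    return ram_kb / gates;
}

/* Linear extrapolation: how many gates fit in a given amount of RAM. */
double max_gates(double ram_kb, double cost_kb_per_gate)
{
    assert(cost_kb_per_gate > 0.0);
    return ram_kb / cost_kb_per_gate;
}
```

    With the 2030000-gate lines, kb_per_gate() gives about 0.72KB (simple) and 3.9KB (detailed) per gate. The raw extrapolation to 16GB then lands around 23M and 4.3M gates, but that ignores the OS and everything else that lives in RAM, so the rounder figures above are the safer ones.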

    You can run this test manually with the newer archives. It's a stress test for the system and the behaviour will change depending on your computer configuration. I don't assume your CPU speed or RAM size, so run it cautiously.
    ---------------------------------------

    Update 20201121:

    I managed to run the design inside the wrapper, and the overhead is marginal (5% size, <10% time)

    290 5516 0:00.00
    580 6808 0:00.01
    1160 9200 0:00.01
    2030 12676 0:00.03
    2900 16228 0:00.04
    5800 28088 0:00.08
    11600 51880 0:00.17
    20300 87528 0:00.29
    29000 123120 0:00.41
    58000 241696 0:00.78
    116000 479028 0:01.55
    203000 835028 0:02.78
    290000 1191172 0:03.92
    580000 2377824 0:07.98
    

    The graphs will be auto-generated if you have installed gnuplot on your system.

    I still have to perform dynamic comparisons and I have not even started re-implementing the gates probes.

    Anyway the 10x speed&size gain with the "simple" version vindicates the choice and efforts to make 2 versions.

  • Benchmarking with a HUGE LFSR

    Yann Guidon / YGDES • 11/16/2020 at 08:57 • 0 comments

    After I solved the weird issues of the log 9. Chasing a "unexpected feature" in GHDL, it's time to put the lessons into practice and implement that huge fat ugly LFSR. It's not meant to be useful, beyond unrolling many, many LFSR stages and seeing how your computer and my code behave. So I created test6_bigLFSR/ in the project.

    The LFSR's poly is finally chosen, thanks to https://users.ece.cmu.edu/~koopman/lfsr/index.html which contains a huge collection of primitive polynomials. For 32 bits, I downloaded 32.dat.gz (186MB) which expands to 600MB. It's huge but practical because you can grep all you want inside it :-) The densest poly is 0xFFFFFFFA, which is also the last. It contains 29 consecutive XORs, which makes coding easy !

    For a quick test, I wrote lfsr.c which helps visualise the behaviour. The code kernel is a 2-steps dance, with rotation followed by selective XORing.

    U32 lfsr() {
      U32 u=LFSR_reg;
      if (LFSR_reg & 1)
        u ^= LFSR_POLY;
      LFSR_reg = (u >> 1) | (u << 31);
      return LFSR_reg;
    }

    To help put this code in perspective, I also created a small LFSR with circuitjs using a 5-tap with another ultra-dense poly 0x1E.

    One of the subtleties of LFSR poly notation is that the MSB (which is always 1) describes the link from the LSB to the MSB and does not imply a XOR gate.

    Unrolling the LFSR is pretty easy with copy-paste. It is however crucial to keep the connections accurate.

    Each column of XOR2s has its own 5 signals so all is fine and should work. However we have seen already that GHDL has some issues with massive assignments, so a shortcut is necessary. It's easy to spot when we move the wires around : there is no need to copy a stage to the other, just get the value from the appropriate previous stage directly.

    Still 5 wires between each stage but only 3 need to be stored, the others are retrieved "from the past". From there the rule is obvious : the benchmark needs only as many storage elements as there are XOR gates, which is 29 for the 0xFFFFFFFA poly. The new issue now is that the 0x1E poly is not totally like 0xF...A : there is one bit of difference. I will now illustrate it with a reduced version 0xFA and extrapolate from there. Here it is with circuitjs:

    By coincidence, 0xFA is also a primitive poly so it also provides a 255-cycles loop, just try it ! The 32-bits implementation will simply add 6×4 consecutive XORs to the circuit.
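
    If you don't have circuitjs at hand, the period can also be checked with a few lines of C. Beware: this sketch uses the textbook Galois form (shift right, then XOR the poly when the bit that fell out was 1), which is not exactly the rotate-after-XOR kernel of lfsr.c shown above. It assumes that the MSB of the poly marks the width of the register and that the seed fits in that width. The names lfsr_step() and lfsr_period() are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* One step of a right-shifting Galois LFSR.
   ASSUMPTION: the MSB of poly is the feedback link from the LSB to the MSB. */
uint32_t lfsr_step(uint32_t reg, uint32_t poly)
{
    uint32_t lsb = reg & 1u;
    reg >>= 1;
    if (lsb)
        reg ^= poly;
    return reg;
}

/* Count the steps needed to come back to the (non-zero) seed. */
uint32_t lfsr_period(uint32_t poly, uint32_t seed)
{
    uint32_t reg = seed, steps = 0;
    assert(seed != 0);
    do {
        reg = lfsr_step(reg, poly);
        steps++;
    } while (reg != seed);
    return steps;
}
```

    lfsr_period(0xFA, 1) walks through the 255 non-zero states of the 8-bit register, and lfsr_period(0x1E, 1) through the 31 states of the 5-bit one. Don't try it casually with 0xFFFFFFFA: that is more than 4 billion steps.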

    Unrolling is very similar. The critical part is to get the connections "right". Fortunately, the only difference is the absence of a XOR just above the LSB, which is translated by sending the result to the cycle after the current cycle. The resulting circuit is :

    Note: there are 2×3=6 stages, while the LFSR has a period of 255=3×5×17 so the resulting circuit still has a period of 255. Not that it matters but 1) it's good to know in case you encounter this situation 2) it motivates me and brings challenging practical constraints into the benchmark :-)

    So all there is to do now is to add as many taps as necessary to get back to 0xFFFFFFFA. Oh, and also deal with the initial and final taps... So let's map the 32 taps to the 29-xor vector, called XOV, at time t:

    • All gates receive one signal from XOV(t-1)(0)
    • The other signal comes from t-1, t-2 and t-3:
      • For XOV(t)(0) : XOV(t-2)(1)
      • For XOV(t)(1 to 'last-1) : XOV(t-1)(2 to 'last)
      • For XOV(t)('last) : XOV(t-3)(0)

    This list of spatio-temporal links is illustrated below:

    From the theory point of view, this shows how the Galois and Fibonacci structures are 2 ways to express the same thing or process:

    • The Galois performs in parallel, all the elements are available for one point in time.
    • The Fibonacci structure is serialised, with only one value changed but with visibility into the "past", the previous values.

    The circuit described here is in the weird crossover region between these approaches. The above list shows how to wire the XOR gates and the outputs.

    Connecting the inputs is a bit less trivial...

  • Completeness of a simple heuristic

    Yann Guidon / YGDES • 11/16/2020 at 04:38 • 0 comments

    The archive contains several tests, including some exhaustive fault injection scans. The scan algorithm ignores all the gates with fewer than 2 inputs because

    • No-input gates are constants and are not really implemented in ASIC. Nothing more to say about it.
    • 1-input gates can only be inverters or buffers and they amount to a wire: the logic value is propagated (even if inverted) but if the gate is altered, it then behaves like a fixed value (a no-input gate) which can be detected.

    Let's consider a gate BUF (input A, output Y) with LUT2(0,1):

    • altering bit 0 will flip the 1 to 0, giving the LUT(0,0) and working like a GND,
    • altering bit 1 will flip the 0 to 1, giving the LUT(1,1) and working like a VCC.

    So as long as the A input is toggled, the Y output will change (or not if there is a fault). This change is propagated by

    • output ports, or
    • gate sinks which will in turn toggle output ports.

    In conclusion, only gates with 2 or more inputs need "LUT bit flips" to check the circuit.
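
    The argument can be illustrated with a small C sketch where the LUT is stored as a bitmask. This is only an illustration: the function names are hypothetical, and the bit ordering (entry i is the output when the inputs, read as a binary number, equal i) is an assumption that may differ from the numbering used in the VHDL code.

```c
#include <assert.h>
#include <stdint.h>

/* ASSUMPTION: entry i of the LUT is the output for the input combination i. */
int lut_eval(uint16_t lut, unsigned inputs)
{
    assert(inputs < 16u);
    return (lut >> inputs) & 1;
}

/* Inject a fault by flipping a single entry of the LUT. */
uint16_t lut_flip(uint16_t lut, unsigned entry)
{
    assert(entry < 16u);
    return (uint16_t)(lut ^ (1u << entry));
}

/* Does a gate with nb_inputs inputs (0 to 4) ignore all of them ? */
int lut_is_constant(uint16_t lut, unsigned nb_inputs)
{
    assert(nb_inputs <= 4u);
    uint32_t entries = 1u << nb_inputs;        /* 1, 2, 4, 8 or 16 */
    uint32_t mask = (1u << entries) - 1u;      /* the meaningful bits */
    uint32_t bits = lut & mask;
    return bits == 0u || bits == mask;
}
```

    With this ordering a BUF is 0x2: flipping either of its two entries gives 0x0 or 0x3, a constant that shows up as soon as the input toggles. A XOR2 (0x6) with one flipped entry becomes another 2-input function, for example 0x7 (OR2), which only misbehaves for one input combination. That is why these gates need the explicit bit flips.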

    This is a stark contrast with verification methods from the 60's where logic was wired (often manually) and the connections themselves were delicate. Any fault needed to be identified, located and fixed, so the automated systems focused on the observability of each wire, sometimes forcing the addition of extra "observability wires" to circuits.

    This old method has been carried over to IC design but the needs have changed: we only need to know if a circuit works correctly, we don't care much about why or where it fails (except for batch reliability analysis) so there is no need to focus on the wires.

    However, some inference algorithms are shared because we still have to determine 2 things:

    • How to observe a gate's output
    • How to force a gate's input to a given value

    This is where things will be difficult.

  • Polishing and more bash hacking

    Yann Guidon / YGDES • 11/12/2020 at 11:35 • 9 comments

    Before digging too deep into the netlist extractor, I wanted to review and massage the files a bit. @llo helped and installed the latest GHDL on Fedora over WSL: several warnings uncovered some variable naming issues, now solved. Thanks Laura !

    The archive seems to be almost easy to use, but speed could still be better. Building the library is essentially linear, but the various tests could be run in parallel. It does not matter much because they are pretty short, except for the longest step in ALU8 where all the 500+ faults are injected, and it takes a while... The fault verification of INC8 could also use less time if all 4 cores could run simultaneously in this "trivially parallel" application.

    I still don't want to use make. There are other paralleling helpers such as xargs or GNU parallel but they create too many complications for passing information back and forth. I also want to avoid semaphores, critical sections or locks... and I finally found the right ingredients using only bash! Here is the structure:

    # if NBTHREADS is not set or invalid, then query the system
    [[ $(( NBTHREADS + 0 )) -eq 0 ]] &&
      NBTHREADS=$( { which nproc > /dev/null 2>&1 ; } && nproc ||
         grep -c ^processor /proc/cpuinfo )
    
    echo "using $NBTHREADS parallel threads"
    
    function TheWorkload () {
      echo "starting workload #$1"
      sleep $(( (RANDOM % 5) + 1 ))
      echo "end of workload #$1"
    }
    
    {
      threads=0
      for i in $( seq 1 20 )
      do
        threads=$(( $threads + 1 ))
        echo "loop #$i, $threads running threads"
        TheWorkload $i &
        if [[ $threads -ge $NBTHREADS ]] ; then
          echo "waiting ..."
          wait -n
          threads=$(( $threads - 1 ))
        fi
      done
    
      wait
      echo "End of script !"
    } 2> /dev/null # hide bash's termination messages

    Something is still missing: the loop must stop if one workload fails. It's quite delicate with more sophisticated tools but in bash, it's as easy as detecting the fault and breaking the loop. This leads to this new version:

    [[ $(( NBTHREADS + 0 )) -eq 0 ]] &&
      NBTHREADS=$( { which nproc > /dev/null 2>&1 ; } && nproc ||
         grep -c ^processor /proc/cpuinfo )
    echo "using $NBTHREADS parallel threads"
    
    err_flag=0
    
    function TheWorkload () {
      echo "starting workload #$1"
      sleep $(( (RANDOM % 5) + 1 ))
      echo "end of workload #$1"
      # make the 12th run break the loop
      [[ "$1" -eq "12" ]] && {
        err_flag=1
        echo "$1 reached !"
      }
    }
    
    {
      threads=0
      for i in $( seq 1 20 )
      do
        threads=$(( $threads + 1 ))
        echo "loop #$i, $threads running threads"
        TheWorkload $i &
        if [[ $threads -ge $NBTHREADS ]] ; then
          echo "waiting ..."
          wait -n && {
            echo "Got an error !" ; err_flag=1 ; break ; }
          [[ $err_flag -eq "1" ]] && {
            echo "found an error !" ; break ; }
          threads=$(( $threads - 1 ))
        fi
      done
    
      wait
      [[ $err_flag -eq "0" ]] && { echo "it worked."
               } ||   echo "premature termination"
    } 2> /dev/null # hide bash's job termination messages
    

    Since everything is controlled by a single loop in the script, and the flag can only be written by the workload, there is no risk of race condition, so no need for a semaphore or anything else... The only tricky part is to ensure that the remaining tasks complete and that their errors are checked too.

    I have seen other similar examples that use the jobs command to list the spawned threads but there is the risk of catching sub-sub-threads if the workload forks... Using a thread counter is faster (no fork-exec of other tools) and limits the scope of the test, which removes interferences.

    There are more issues to solve such as adapting the result of each invocation of the GHDL-generated binary. But another place where significant time can be saved is during the simulation setup: the binary is restarted all the time even though only tiny changes are made. The VHDL simulation system must also be updated...

    --------------------------------------------------------------------------

    Hmmm it seems I botched the script and made wrong assumptions. It is not possible to send a value back to the main script once a workload has forked. Yet this code works because...

  • Chasing a "unexpected feature" in GHDL

    Yann Guidon / YGDES • 11/09/2020 at 00:31 • 0 comments

    I know, I know, I should update my SW but if a 2017 build of GHDL misbehaves in significant ways, I wonder why it took so long to become apparent.

    So I'm adding a new "loading" test to the self-check suite, because I want to know how many gates I can reasonably handle, and I get weird results. It seems that GHDL's memory allocation is exponential and this should not be so. So I sent a message to Tristan, who sounds curious, asks for a repro, and here I start to rewrite the test as a self-contained file for easy reproduction.

    The test is simple : to exercise my simulator, I create an array of fixed std_logic_vector's, with a size defined by a generic. Then, I connect elements of one line to other elements of the next line. Finally I connect the first and last lines to an input and output port, so the circuit can be controlled and observed by external code.

    The result will surprise you.

    So let's define the test code.

    -- ghdl_expo.vhdl
    -- created lun. nov.  9 00:56:57 CET 2020 by Yann Guidon whygee@f-cpu.org
    -- repro of "exponential memory allocation" bug
    
    Library ieee;
        use ieee.std_logic_1164.all;
    
    entity xor2 is
      port (A, B : in  std_logic;
               Y : out std_logic);
    end xor2;
    
    architecture simple of xor2 is
    begin
      Y <= A xor B;
    end simple;
    
    --------------------------------
    
    Library ieee;
        use ieee.std_logic_1164.all;
    Library work;
        use work.all;
    
    entity ghdl_expo is
      generic (
        test_nr : integer := 0;
        layers : positive := 10
      );
      port(
        A : in  std_logic_vector(31 downto 0) ;
        Y : out std_logic_vector(31 downto 0));
    end ghdl_expo;
    
    architecture unrolled of ghdl_expo is
      subtype SLV32 is std_logic_vector(31 downto 0);
      type ArSLV32 is array (0 to layers) of SLV32;
      signal AR32 : ArSLV32;
    begin
      AR32(0) <= A;
    
      t1: if test_nr > 0 generate
        l1: for l in 1 to layers generate
          AR32(l)(0) <= AR32(l-1)(31);
    
          t2: if test_nr > 1 generate
            l2: for k in 0 to 30 generate
    
              t3: if test_nr = 3 generate
                x: entity xor2 port map(
                   A => AR32(l-1)(k), B=>AR32(l-1)(31),
                   Y => AR32(l)(k+1));
              end generate;
    
              t4: if test_nr = 4 generate
                AR32(l)(k+1) <= AR32(l-1)(k) xor AR32(l-1)(31);
              end generate;
    
              t5: if test_nr = 5 generate
                AR32(l)(k+1) <= AR32(l-1)(k);
              end generate;
    
            end generate;
          end generate;
        end generate;
      end generate;
    
      Y <= AR32(layers);
    end unrolled;

    There are 2 generics :

    • test_nr selects the code path to test.
    • layers selects the number of elements of the array.

    And now let's run it :

    rm -f *.o *.cf ghdl_expo log*.txt
    ghdl -a ghdl_expo.vhdl &&
    ghdl -e ghdl_expo &&
    
    # first test, doing nothing but allocation.
    ( for i in $(seq 2 2 100 )
    do
      echo -n $[i]000
      ( /usr/bin/time ./ghdl_expo -glayers=$[i]000 ) 2>&1 |
          grep elapsed |sed 's/max.*//'|
           sed 's/.*0[:]/ /'| sed 's/elapsed.*avgdata//'
    done ) | tee log1.txt

    The results (time and size) are logged into log1.txt, which gnuplot will happily display :

    set xlabel 'size (K vectors)'
    set ylabel 'seconds'
    set yr [0:1.5]
    set y2label 'kBytes'
    set y2r [0:800000]
    plot "log1.txt" using 1:2 title "time (s)" w lines, "log1.txt" using 1:3 axes x1y2 title "allocated memory (kB)" w lines
    

    And the result is quite as expected : linear.

    The curve starts at a bit less than 4MB for the naked program, which allocated 10 lines of 32 std_logic. Nothing to say here.

    I would object that I expected better than 1.13 seconds to allocate 100K lines, which should occupy 3.2MB for themselves (so less than 8MB total). Instead the program eats 710MB ! This means a puzzling expansion factor > 200 ! Anyway, I'm grateful it works so far.

    Now, I activate the test 1:

    ( for i in $(seq 2 2 100 )
    do
      echo -n $[i]00
      ( /usr/bin/time ./ghdl_expo -gtest_nr=1 -glayers=$[i]00 ) 2>&1 | 
      grep elapsed |sed 's/max.*//'| sed 's/.*0[:]/ /' | 
         sed 's/elapsed.*avgdata//'
    done ) | tee log2.txt

    The result looks different :

    Notice that I have reduced the number of elements by a factor of 10 and the max. run time is similar. Worse : the 10K vectors now use 3.2GB ! The expansion ratio has gone to 100K !...

  • The new netlist scanner

    Yann Guidon / YGDES • 11/06/2020 at 08:33 • 0 comments

    See the beginning at 3. Rewrite as well as 35. Internal Representation in v2.9 for more dirty details !


    After a hiatus, it's finally time to work again on the implementation of the algorithm explained in the log 36. An even faster and easier algorithm to map the netlist !

    The linear function 5x+3 can be configured to trade off security for speed, with POLY_OFFSET and POLY_FACTOR. The higher the factor, the better the discrimination of snafus, but that increases the number of steps. A generic with default value is appropriate in this case.

    The principle of serialising and de-serialising through wire-type symbols is shown in the picture below. The original idea, around 2001, used binary codes for real wires but std_logic provides 8 useful symbols, plus the "refresh" one ('U').

    Conversion between integers and the enumerated std_logic type is not as trivial as in other languages but still very easy, when you "get" how to dance around the strong typing constraints, as shown in this code:

    Library ieee;
        use ieee.std_logic_1164.all;
    entity test_std is
    end test_std;
    
    architecture arch of test_std is
    begin
      process is
        variable i : integer;
        variable s : std_logic;
      begin
        for j in std_logic'low to std_logic'high loop
          if j = std_logic'high then
             s := 'X';  -- not 'U', to show a custom wrap-around
          else
             s := std_logic'succ(j);
          end if;
          report integer'image(std_logic'pos(j))
                & " : " & std_logic'image(j)
                & " : " & std_logic'image(s);
        end loop;
    
        for i in 0 to 7 loop
          s := std_logic'val(i+1);
          report std_logic'image(s);
        end loop;
    
        wait;
      end process;
    end arch;
    

    The result:

    $ rm -f test_std *.o *.cf && ghdl -a test_std.vhdl && ghdl -e test_std && ./test_std 
    test_std.vhdl:24:7:@0ms:(report note): 0 : 'U' : 'X'
    test_std.vhdl:24:7:@0ms:(report note): 1 : 'X' : '0'
    test_std.vhdl:24:7:@0ms:(report note): 2 : '0' : '1'
    test_std.vhdl:24:7:@0ms:(report note): 3 : '1' : 'Z'
    test_std.vhdl:24:7:@0ms:(report note): 4 : 'Z' : 'W'
    test_std.vhdl:24:7:@0ms:(report note): 5 : 'W' : 'L'
    test_std.vhdl:24:7:@0ms:(report note): 6 : 'L' : 'H'
    test_std.vhdl:24:7:@0ms:(report note): 7 : 'H' : '-'
    test_std.vhdl:24:7:@0ms:(report note): 8 : '-' : 'X'
    test_std.vhdl:31:7:@0ms:(report note): 'X'
    test_std.vhdl:31:7:@0ms:(report note): '0'
    test_std.vhdl:31:7:@0ms:(report note): '1'
    test_std.vhdl:31:7:@0ms:(report note): 'Z'
    test_std.vhdl:31:7:@0ms:(report note): 'W'
    test_std.vhdl:31:7:@0ms:(report note): 'L'
    test_std.vhdl:31:7:@0ms:(report note): 'H'
    test_std.vhdl:31:7:@0ms:(report note): '-'
    

    So it's pretty trivial to convert an int to std_logic and vice versa. It's a bit less so to extract bit fields from an integer because VHDL does not provide shift and boolean operators for integers :-( As I have tested a decade ago, going with bitvectors is too slow. The only way is to divide or multiply but the semantics are somewhat lost and this does not work correctly with negative numbers (due to inappropriate rounding). I don't want to use my own shift&bool routines because they link to C code and that might break somehow in the future with new revisions to GHDL.
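
    The rounding problem is easy to reproduce in C, whose integer division truncates toward zero (since C99), just like VHDL's integer "/". Here is a minimal sketch of the usual workaround, with hypothetical helper names and a divisor assumed to be positive (typically a power of two standing in for a shift or a mask):

```c
#include <assert.h>

/* Floor division for a positive divisor: rounds toward minus infinity,
   like an arithmetic right shift would. */
int floor_div(int a, int b)
{
    assert(b > 0);
    int q = a / b;          /* truncates toward zero */
    if ((a % b) < 0)
        q--;                /* fix up the negative, non-exact cases */
    return q;
}

/* Matching modulo: always in 0 .. b-1, like masking the low bits. */
int floor_mod(int a, int b)
{
    assert(b > 0);
    int r = a % b;          /* has the sign of a */
    return (r < 0) ? r + b : r;
}
```

    For example -5 / 4 gives -1 and -5 % 4 gives -1, whereas floor_div(-5, 4) gives -2 and floor_mod(-5, 4) gives 3, which is what a shift and a mask on a two's complement number would return.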

    There is also a constraint on the amount of data that can be stored in the descriptor of each gate. For example a complete precomputed std_logic_vector would take too much room and the size would not be easy to determine before running the whole thing.

    One compromise could be to store one std_logic element that is precomputed before each new pass. There are already 2 such variables in the record of each gate: curOut and prevOut but then, what about the input gates ?

    curOut,                 -- the last result of lookup
    prevOut : std_logic;    -- the previous result of the lookup
    changes : big_uint_t;   -- how many times the output changed
    LUT     : std_logic_vector(0 to 15); -- cache of the gate's LUT.
    
    sinks : sink_number_array(0 to 3);
    

    The variable changes can be used to accumulate the number to serialise, but the LUT can't be overwritten because it's necessary for the following steps of the algorithm.

    sinks is...

  • OSU FreePDK45 and derivatives

    Yann Guidon / YGDES • 09/20/2020 at 18:07 • 0 comments

    One easy library to add to the collection is OSU's FreePDK45, found at https://vlsiarch.ecen.okstate.edu/flows/ (direct download link : https://vlsiarch.ecen.okstate.edu/flows/FreePDK_SRC/OSU_FreePDK.tar.gz : only 1.7MB, mirrored here)

    It's a good candidate because it's a really surprisingly tiny library : 33 cells only !

    FILL (no logic input or output)

    BUFX2 BUFX4 CLKBUF1 CLKBUF2 CLKBUF3 INVX1 INVX2 INVX4 INVX8 TBUFX1 TBUFX2 (1 input, 1 output)

    AND2X1 AND2X2 HAX1 NAND2X1 OR2X1 OR2X2 NOR2X1 XNOR2X1 XOR2X1 (2 inputs)

    AOI21X1 FAX1 MUX2X1 NAND3X1 NOR3X1 OAI21X1 (3 inputs)

    AOI22X1 OAI22X1 (4 inputs)

    DFFNEGX1 DFFPOSX1 DFFSR LATCH (non-boolean)

    This is close to the minimum described at http://www.vlsitechnology.org/html/cell_choice2.html but should be enough for basic circuits. In fact we have the 1st order and most of the 2nd order gates, which I covered in the log "31. v2.9 : introducing 4-input gates". OAI211 and AOI211 are missing, although they are very useful for incrementers and adders...

    The site also provides these same basic standard cells for AMI 0.6um, AMI 0.35um, TSMC 0.25um, and TSMC 0.18um released in 2005 at https://vlsiarch.ecen.okstate.edu/flows/MOSIS_SCMOS/iit_stdcells_v2.3beta/iitcells_lib_2.3.tar.gz. A single bundle packages 4 technologies ! The library seems to have evolved to reach v2.7 and included a LEON example project: https://vlsiarch.ecen.okstate.edu/flows/MOSIS_SCMOS/osu_soc_v2.7/ (but the latest archive looks broken)

    I love how minimal, compact and simple these gates are, accessible to beginners and targeting from 45nm to 0.5µm. It looks like a "RISC" approach to VLSI ;-) Note also that there are only 2 gates with two output drives and 2 inputs : AND and OR, which are 2nd order gates, merging a NAND/NOR with an X1 or X2 inverter. The rest of the fanout issues are solved by inserting INVX gates at critical fanout points. This means that physical mapping/synthesis should start by examining the fanouts, inserting the inverters/buffers and then deducing which logic function to bubble-push.

    I'm not sure how to create the files but I can easily derive them from the A3P files in 3 ways:

    • Modify the file generator
    • Modify the generated files
    • Create a "wrapper" file

    Furthermore, the precedence of the inputs of the sequential gates is not yet clear.

    Extracting the logical functions was as simple as a grep and some sed :

    $ grep -r '>Y=' * |sed 's/.*data//'|sed 's/<tr.*FFF//'|sed 's/<.*//'|sed 's/[.]html.*Y/: Y/'
    CLKBUF2: Y=A
    AND2X2: Y=(A&B)
    NAND3X1: Y=!(A&B&C)
    NOR3X1: Y=!(A|B|C)
    XOR2X1: Y=(A^B)
    BUFX4: Y=A
    MUX2X1: Y=!(S?(A:B))
    OR2X1: Y=(A|B)
    AND2X1: Y=(A&B)
    TBUFX2: Y=(EN?!A:'BZ)
    /INVX8: Y=!A
    /CLKBUF3: Y=A
    INVX1: Y=!A
    AOI21X1: Y=!((A&B)|C)
    XNOR2X1: Y=!(A^B)
    BUFX2: Y=A
    OAI22X1: Y=!((C|D)&(A|B))
    TBUFX1: Y=(EN?!A:'BZ)
    OR2X2: Y=(A|B)
    OAI21X1: Y=!((A|B)&C)
    NAND2X1: Y=!(A&B)
    AOI22X1: Y=!((C&D)|(A&B))
    CLKBUF1: Y=A
    NOR2X1: Y=!(A|B)
    INVX4: Y=!A
    INVX2: Y=!A
    

    For the VHDL version, only a few more simple substitutions are required:

    $ grep -r '>Y=' * |sed 's/.*data[/]/"/'|sed 's/<tr.*FFF//'|sed 's/<.*//'|sed 's/[.]html.*Y=/": /' |sed 's/[!]/not /g'  |sed 's/[|]/ or /g'  |sed 's/[&]/ and /g' |sed 's/\^/ xor /g' |sort
    "AND2X1": (A and B)
    "AND2X2": (A and B)
    "AOI21X1": not ((A and B) or C)
    "AOI22X1": not ((C and D) or (A and B))
    "BUFX2": A
    "BUFX4": A
    "CLKBUF1": A
    "CLKBUF2": A
    "CLKBUF3": A
    "INVX1": not A
    "INVX2": not A
    "INVX4": not A
    "INVX8": not A
    "MUX2X1": not (S?(A:B))
    "NAND2X1": not (A and B)
    "NAND3X1": not (A and B and C)
    "NOR2X1": not (A or B)
    "NOR3X1": not (A or B or C)
    "OAI21X1": not ((A or B) and C)
    "OAI22X1": not ((C or D) and (A or B))
    "OR2X1": (A or B)
    "OR2X2": (A or B)
    "TBUFX1": (EN?not A:'BZ)
    "TBUFX2": (EN?not A:'BZ)
    "XNOR2X1": not (A xor B)
    "XOR2X1": (A xor B)
    

    Notice that MUX2X1 inverts its output, so it should be called MUXI, and its A and B inputs are swapped relative to the A3P lib. Some other corner cases are easily translated by hand as well. The TBUFs however fall outside of the purely boolean realm...
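The substitutions above can be wrapped in a small filter that also marks the corner cases. This is a minimal sketch, assuming the input is the "NAME: Y=expr" text produced by the first pipeline; the function name `to_vhdl` and the TODO marker are hypothetical, not part of the project.

```shell
#!/bin/sh
# to_vhdl : reads "NAME: Y=expr" lines on stdin and prints "NAME": vhdl-expr.
# A stray leading '/' (as on the INVX8 and CLKBUF3 lines) is dropped.
# Lines that still contain '?' (MUX2X1, TBUFs) are not purely boolean,
# so they are only flagged for manual translation.
to_vhdl() {
    sed -e 's/^\/*\([A-Z0-9]*\): Y=/"\1": /' \
        -e 's/!/not /g' \
        -e 's/|/ or /g' \
        -e 's/&/ and /g' \
        -e 's/\^/ xor /g' \
        -e '/?/s/$/  -- TODO: translate by hand/' |
    sort
}

# Example with a few lines taken from the listing above.
printf '%s\n' \
    'XOR2X1: Y=(A^B)' \
    '/INVX8: Y=!A' \
    'AOI21X1: Y=!((A&B)|C)' \
    'MUX2X1: Y=!(S?(A:B))' | to_vhdl
```

Flagging the ternary expressions instead of rewriting them keeps the filter honest: the inverting MUX and the tri-state buffers need a human decision about pin order and high-impedance handling.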


    Some gates have 2 outputs...


View all 16 project logs

  • 1
    Download

    Get the latest package version from https://hackaday.io/project/174585/files

  • 2
    Build

    Execute the run_tests.sh script.

    This will build the libraries and run many self-tests.

    The examples in the tests directory also show the various ways to use this library.

  • 3
    Link

    You can directly use this library with all the ProASIC3 standard files, either without the analytics system (the "simple" version) or with the full analytics system (the standard version).

     

    In the "simple" case, at full simulation speed and no analysis, your VHDL source code contains these lines :

    Library proasic3;
        use proasic3.all;

    Then you point GHDL to the right library with this command line:

    ghdl -Psomepath/LibreGates/proasic3/simple my_file.vhdl
     

    If you need full analysis, then use the standard version and add these invocations to the VHDL testbench:

    Library LibreGates;
        use LibreGates.all;
        use LibreGates.Gates_lib.all;

    Then you modify the inclusion path :

    ghdl -Psomepath/LibreGates/proasic3 my_file.vhdl
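The command lines above only show the `-P` library path. A complete GHDL run normally also needs a command, such as `-a` to analyse a file and `--elab-run` to elaborate and simulate a top unit. The wrapper below is a sketch under those assumptions: `run_sim`, the `GHDL` and `LIBREGATES_DIR` variables and the top entity name are hypothetical, and the libraries are assumed to have been built already by run_tests.sh.

```shell
#!/bin/sh
# Illustrative defaults; adjust to the local installation.
GHDL="${GHDL:-ghdl}"
LIBREGATES_DIR="${LIBREGATES_DIR:-somepath/LibreGates}"

# run_sim MODE FILE TOP
#   MODE : "simple" (full speed, no analysis) or "full" (analytics enabled)
#   FILE : VHDL source to analyse
#   TOP  : name of the top-level entity (testbench) to simulate
run_sim() {
    case "$1" in
        simple) libdir="$LIBREGATES_DIR/proasic3/simple" ;;
        full)   libdir="$LIBREGATES_DIR/proasic3" ;;
        *)      echo "usage: run_sim simple|full file.vhdl top_entity" >&2
                return 2 ;;
    esac
    # Analyse the source, then elaborate and run the top unit,
    # pointing GHDL at the chosen precompiled library each time.
    "$GHDL" -a "-P$libdir" "$2" &&
    "$GHDL" --elab-run "-P$libdir" "$3"
}
```

For example, `run_sim simple my_file.vhdl my_tb` would simulate against the "simple" gates, and `run_sim full my_file.vhdl my_tb` against the standard version with analysis.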

View all 4 instructions


Discussions

Dylan Brophy wrote 11/22/2020 at 06:25:

"Design For Test or nothing !" - YES, good practice. Need more of this IMO.


Yann Guidon / YGDES wrote 11/22/2020 at 08:18:

It must be the default.
Not just for HW but also SW.


frenchie68 wrote 11/22/2020 at 10:59:

I wonder what you meant (wrt. software that is).


Yann Guidon / YGDES wrote 11/22/2020 at 11:13:

I mean that systematic unit testing, thorough proofing/stressing/benchmarking are faint afterthoughts in most SW projects. People only query Google when they have a bug, a question or just want to learn how to do a specific task. Nowhere do I see tutorials about testbenches, it's mostly a culture where "it works so it's done".

One of the few exceptions is GCC where it is (was?) shipped with a suite of conformance self-tests.

But take any user-facing program and testing is "eventual" and "manual".

