Hardware Introduction

The CAT-644 is a simple computer using a 20 MHz ATMega644 microcontroller as its CPU. I am using the DIP-40 package, which makes it breadboard and hobbyist friendly. Large sections of this project can be built and run entirely on a breadboard, without any soldering. The ATMega644 offers four 8-bit GPIO ports, with each pin configurable as an input (with or without an internal pull-up resistor) or an output. Many pins also have special hardware functions that can be enabled.

This is the current use for each pin in the CAT-644:

Full Schematic (image)

SPI Programming (later replaced by bootloader)


The CPU used in the CAT-644 belongs to the same AVR series used by the Arduino. There are two things that make the Arduino special: the bootloader, and the common hardware interface. Even non-AVR hardware has been sold as 'Arduino compatible' just by keeping the same I/O pinout and form factor. A plain, factory-fresh AVR has no bootloader. The AVR has an SPI (Serial Peripheral Interface) port. SPI is a clocked serial protocol, with the clock driven by the Master. When the RESET pin of the AVR is held low, the AVR is put in Slave mode and behaves somewhat like a flash memory chip, allowing the program to be replaced.

In the early stages of this project, I used SPI programming exclusively, with no bootloader. Even after the serial port was functional, SPI was still what I preferred since it was already working. Later, I did move on to using the Chip45 bootloader. My only complaint about Chip45 is that it is not open source. An eventual goal is to replace Chip45 with something open or self-written.

Serial (RS-232)

The first interface I got running on the AVR (besides the programming interface) was the serial port. The AVR communicates using RS-232 TTL, which is a 5 volt version of the RS-232 protocol. Proper RS-232 uses positive and negative voltages, up to about 12V. Fortunately, interfacing TTL to RS-232 levels is a common task, and someone has made an IC for it. I used the Maxim MAX232 chip. (Actually I used a generic copy...) It generates about +/- 8 volts on the output (enough to meet the minimum voltage requirements) and is tolerant of the full spec RS-232 voltage. It uses 'magic' to generate +/- voltages greater than the supply. No, not magic. Imagine charging a couple of capacitors in parallel up to about 5v. Then put the capacitors in series. You have 10v now. Or flip them around and connect the (+) side to ground and leave the (-) disconnected. The (-) is now 10v below the (+) side, and since the (+) side is now ground, the (-) side is now -10v. That's basically what this chip does, just really fast. The +10 and -10 outputs are smoothed out by a capacitor, and you get a steady + and - 8v. Nice trick.

Keyboard

The PS/2 keyboard protocol is well documented, old, slow, and easy. Perfect for a simple microcontroller. The keyboard is connected to the computer through a clocked serial bus. It was very simple in concept, and 'mostly worked.' Trying to run keyboard and video processing at the same time proved to be extremely difficult.

The PS/2 protocol consists of 2 signals: CLK and DATA. These form an open-collector bus running at TTL (5v) levels. Open collector refers to the way the transistors are arranged, but that isn't that important here. What is important are three things:

In normal operation, the keyboard controls the clock. Whenever a key is pressed, a scan code is transmitted from the keyboard to the host. One important detail is that DATA is considered valid on the falling edge of the CLK signal. This means the keyboard puts out a data bit, THEN drops the clock. When watching the clock, we read the data when the high-to-low transition occurs.

11 bits make up a PS/2 keyboard frame:
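As a rough host-side illustration, the standard device-to-host PS/2 frame is one start bit (0), eight data bits sent least-significant bit first, one odd-parity bit, and one stop bit (1). The sketch below decodes such a frame in Python; the function names are made up for this example, and it models only the bit layout, not the clock-edge sampling the AVR has to do.

```python
def decode_ps2_frame(bits):
    """Decode one 11-bit device-to-host PS/2 frame.

    bits: 11 ints (0 or 1) in the order they were sampled on falling
    CLK edges. Returns the scan code byte, or raises ValueError.
    """
    if len(bits) != 11:
        raise ValueError("a frame is exactly 11 bits")
    start, data, parity, stop = bits[0], bits[1:9], bits[9], bits[10]
    if start != 0:
        raise ValueError("start bit must be 0")
    if stop != 1:
        raise ValueError("stop bit must be 1")
    # Odd parity: data bits plus parity bit contain an odd number of 1s.
    if (sum(data) + parity) % 2 != 1:
        raise ValueError("parity error")
    # Data bits arrive least-significant bit first.
    return sum(bit << i for i, bit in enumerate(data))


def encode_ps2_frame(byte):
    """Build the 11 bits a keyboard would send for one byte (test helper)."""
    data = [(byte >> i) & 1 for i in range(8)]
    parity = 1 - (sum(data) % 2)
    return [0] + data + [parity, 1]
```

Running every byte value through `encode_ps2_frame` and back through `decode_ps2_frame` is a quick way to sanity-check a receiver before wiring up a real keyboard.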

Additional PS/2 Keyboard information

External SRAM

The ATMega-644 has no external SRAM support.  Any interface to external SRAM has to be completely user-programmed.  This is a 'plus' for this project for two reasons:  defining your own bus makes this more of a 'computer design' project and less of a mindless soldering exercise.  It also makes it possible to design something exactly to what you need.  

PORTB of the AVR is being used as the address bus. The high and low parts of the address bus are multiplexed; a D-latch holds the upper portion of the address. This is in contrast to the AVR microcontrollers that have a 'real' external memory bus: in those designs, the LOW part of the address is latched. I chose to latch the upper part simply because the upper part changes less often than the lower part. When accessing sequential addresses (such as running bytecode programs, drawing sprites, handling strings, etc), only the low part of the address must change.

PORTC of the AVR is used as the data bus.  This is where 1 byte at a time is written/read to and from SRAM.

PORTA.5 is the Address 16 line (the 17th address bit), allowing more than 16 bits of address space. Flipping the A16 line switches between two BANKS of 64k memory. When processing video, this can be thought of as the video page selector. Alternatively, you may choose to consider the SRAM as consisting of 64k words, with this line as the BYTE selector.

PORTD.3 is the Page latch.  When high, the contents of PORTB are latched in the upper 8 bits of the address.

A memory byte address is formed from A16 (1 bit) : PAGE LATCH (8bits) : PORTB (8 bits).
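The address layout can be sketched in a few lines of Python. This is only a host-side illustration of the bit fields described above; the function names are invented for the example.

```python
def split_address(addr):
    """Split a 17-bit SRAM byte address into (a16, page, low)."""
    if not 0 <= addr < (1 << 17):
        raise ValueError("address must fit in 17 bits")
    a16 = (addr >> 16) & 0x01   # driven on PORTA.5
    page = (addr >> 8) & 0xFF   # held in the page latch
    low = addr & 0xFF           # driven directly on PORTB
    return a16, page, low


def join_address(a16, page, low):
    """Rebuild the byte address as A16 : PAGE LATCH : PORTB."""
    return ((a16 & 1) << 16) | ((page & 0xFF) << 8) | (low & 0xFF)
```

For example, address 0x1ABCD lives in the upper bank (A16 = 1), page 0xAB, low byte 0xCD.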

PORTD.6 is the (/WE) WRITE ENABLE line of the SRAM.  When this is LOW, the value on PORTC is written to A16:PAGELATCH:PORTB

PORTD.2 is the (/OE) OUTPUT ENABLE line of the SRAM. When this is LOW, the value stored at address A16:PAGELATCH:PORTB is output onto PORTC. PORTC must be an input while /OE is low, or damage might result!

To favor reading, the Cat-644 keeps the /OE line low almost all the time, with PORTC configured as an input.
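The signals described so far can be modeled on a PC to get a feel for why latching the upper address byte pays off. The following is a toy Python model, not the real access routines: the class and function names are hypothetical, the latch is assumed to be transparent while PORTD.3 is high, and the /OE and /WE handling is reduced to plain method calls.

```python
class SramBus:
    """Toy model: 128 KiB SRAM, transparent page latch, A16 bank line."""

    def __init__(self):
        self.mem = bytearray(1 << 17)
        self.portb = 0       # address bus (low byte, or page byte while latching)
        self.page = 0        # D-latch output: upper 8 address bits
        self.a16 = 0         # PORTA.5
        self.latch_en = 0    # PORTD.3

    def set_portb(self, value):
        self.portb = value & 0xFF
        if self.latch_en:            # latch follows PORTB while enable is high
            self.page = self.portb

    def set_latch(self, level):
        self.latch_en = level
        if level:
            self.page = self.portb

    def address(self):
        return (self.a16 << 16) | (self.page << 8) | self.portb

    def read(self):
        # Models /OE low with PORTC as an input: the SRAM drives the data bus.
        return self.mem[self.address()]

    def write(self, value):
        # Models a /WE pulse; real code must first raise /OE and make PORTC an output.
        self.mem[self.address()] = value & 0xFF


def select_page(bus, a16, page):
    """Put the page byte on PORTB and pulse the latch enable."""
    bus.a16 = a16
    bus.set_portb(page)
    bus.set_latch(1)
    bus.set_latch(0)


def read_run(bus, a16, page, start, count):
    """Read successive bytes in one page: latch once, then only PORTB changes."""
    select_page(bus, a16, page)
    out = []
    for low in range(start, start + count):
        bus.set_portb(low)
        out.append(bus.read())
    return bytes(out)
```

In this model a run of reads within one page costs one latch pulse plus one PORTB update per byte, which is the access pattern the latched-upper-byte design is meant to favor.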

To read a byte:

To read successive addresses in the same page:

Note: AVRs with external memory busses need 3 clocks to read 1 byte! 

To write a byte:

To write successive addresses in the same page:


Video

Video generation is probably the most complicated part of this computer. First, there's almost no video hardware. Two I/O pins from the ATMega connect to the Hsync and Vsync VGA pins. Fortunately, VGA monitor sync signals are 5V TTL. A crude 2-bit-per-channel 2R2 DAC is connected across the RAM data bus, through a 74HCT244 buffer. (This could just as easily have been a 74HCT245, but I happened to have this...) This buffer does two things: 1) it isolates the effects of the 2R2 DAC from the RAM data lines, and 2) it allows the DAC to be turned off. VGA requires that nothing is output on the analog R, G and B lines during vertical and horizontal blanking. One monitor of mine doesn't care; another one I have won't sync if the vertical blanking isn't blank. (The smart LCD monitor guts are trying to work out where the video frame starts and ends.)

Generating video is tricky. If you generate H and V sync signals with the right timing, the monitor will sync to them and display black. That is fairly easy. Now to generate the picture. To output a line of video, we simply 'count' across the RAM address bus. This steps through RAM addresses, and the values looked up go out through the buffer and the video DAC. Do this with the right timing, and red, green and blue dots appear on the screen in the right place.

The AVR runs at 20 MHz, and the increment and the port output each take 1 clock, so the address on the address bus can be updated at a rate of 10 MHz. A standard VGA signal (640x480) uses a 25 MHz pixel clock: 1 clock per pixel. So 10/25*640 = 256 pixels across. It is extremely convenient to have 256 possible pixels in a row and to be able to output 256 values across an 8-bit port! VGA is an analog protocol, so the monitor does not care that the actual pixels have only changed 256 times instead of 640; each pixel is just wider than it should be. Some LCD monitors might scale poorly, but the two I've tried it with displayed a nice picture.

For 256 pixels across, 240 is a more appropriate resolution than 480. 240 is also convenient because it fits in a single byte. Using the timings for 640 by 480, the Cat-644 displays a 256 by 240 pixel image. 256 by 240 is 60KB of memory. The Cat-644 has 2 banks of 64k SRAM, leaving one bank with 4k, and the other bank unused. OR, for double-buffered graphics, 120k is used for the two video buffers, and 8k (4k in each bank) is left unused.
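The resolution and memory figures can be checked with a few lines of arithmetic. This Python sketch uses the article's round 25 MHz pixel clock; the constant names are just labels for this example.

```python
CPU_HZ = 20_000_000
CLOCKS_PER_ADDRESS = 2         # one increment plus one port write
VGA_PIXEL_HZ = 25_000_000      # round figure used in the text
VGA_ACTIVE_PIXELS = 640
LINES = 240
BANK_BYTES = 64 * 1024

address_rate_hz = CPU_HZ // CLOCKS_PER_ADDRESS                        # 10 MHz
pixels_per_line = VGA_ACTIVE_PIXELS * address_rate_hz // VGA_PIXEL_HZ
framebuffer_bytes = pixels_per_line * LINES                           # one byte per pixel
spare_bytes_per_bank = BANK_BYTES - framebuffer_bytes

print(pixels_per_line, framebuffer_bytes // 1024, spare_bytes_per_bank // 1024)
```

With one frame per bank, double buffering uses 2 x 60k = 120k and leaves 4k spare in each bank, matching the figures above.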

Prior Art of AVR driving VGA signals

Lucid Science VGA generator: http://www.lucidscience.com/pro-vga%20video%20generator-1.aspx

Quinn Dunki's Veronica 6502 project  http://quinndunki.com/blondihacks/?p=1121


Higher resolution Video

One thing that might separate the Cat-644 from a toy and make it something more usable would be an 80 column display. If the Cat-644 could display 80 columns instead of 32 or 40, I could take it to work and actually use it at the office to edit code, logging into the Linux box on my desk through the serial port. Ok, that's still a little silly, but one of the 'cool' things to have in the 80's was an 80 column text display. Business computers displayed 80 columns. 40 columns was for home computers. To legibly display 80 columns, we need more horizontal resolution. Going from 256 to 512 pixels across would help a lot. But the overworked AVR is already outputting addresses as fast as it can, right?

It turns out, it is not. PORTB can also be used as a timer output. A timer can count clock cycles and turn pins on and off. The fastest possible rate the timer can fire is every clock, so one of the timer pins on PORTB can be toggled every clock cycle. This means that while the video interrupt is executing 'increment, output, increment, output, ...' as fast as it can, one of those pins can be going 'high, low, high, low'. The address on port B can therefore change every clock cycle. How awesome would it be if the toggled pin was B.0? Well, I've been lucky in this project so far, but not that lucky. Pin 3 is the toggling pin. This means that if port B is counting, the AVR is really outputting an 'odd' sequence of addresses. If the picture is laid out in memory in the matching sequence of addresses, the picture can be 512 pixels across. There are just two additional details:


1. We can't simply add '1' when we increment. Bit 3 is toggling on its own, so we want to 'skip' past this bit when we count. The easy way to do that is to add an extra '8': if one out of every 8 increments adds 9 instead of 1, PORTB counts in the required sequence. Fortunately, 'increment' and 'add immediate value' each take 1 clock, so we can add any amount (up to 255) that we need to.


2. 512 pixels across. We only have 8 address lines we can control through PORTB, but the Cat-644 has two banks of 64k RAM, and the A16 address line is tied directly to PORTA. All that has to happen is that halfway through the scanline (after we have written to the bus 128 times and the hardware toggler has output 256 addresses), we toggle the A16 address line and do the whole thing again. The Cat-644 can now output 512 unique addresses in 1 scan line of time.
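To make the 'odd' address sequence concrete, here is a small Python model of one scanline. It is a sketch under two assumptions that are not spelled out above: the timer-driven bit is low during the clock in which software writes PORTB and high during the following clock, and the column index is simply A16 followed by the 8 PORTB bits. The function names are invented for the example.

```python
TIMER_BIT = 1 << 3   # PORTB bit 3, toggled by the timer every clock


def software_values():
    """The 128 bytes the code writes to PORTB during half a scanline."""
    value = 0
    for step in range(128):
        yield value
        # Normally add 1; every 8th step add 9 to carry across the timer bit.
        value = (value + (9 if step % 8 == 7 else 1)) & 0xFF


def half_line_addresses():
    """Low address bytes seen by the SRAM: each written value, then the
    same value with the timer bit set one clock later."""
    for value in software_values():
        yield value
        yield value | TIMER_BIT


def scanline_addresses():
    """All 512 column indices for one scanline, A16 low half then high half."""
    for a16 in (0, 1):
        for low in half_line_addresses():
            yield (a16 << 8) | low
```

The first few low bytes come out as 0, 8, 1, 9, 2, 10, ..., so a picture stored for this mode has to be laid out in that order rather than left to right by address.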

This also means the SRAM of the Cat-644 is being read (and dumped to the screen) at 20 megabytes a second.

Disk

The CAT-644 has an SD card for disk. Currently I'm using a 1GB SD card, but SD (not SDHC) cards of 2GB should also function. Apparently, there is such a thing as a 4GB SD card, but I've never encountered one, as 4GB is usually the smallest SDHC card out there.

An SD card has a few different modes of operation, but the one used in this project is the SPI mode.

An SD card can read or write 512 bytes at a time. There are a few different commands, but the simpler the better. The CAT-644 cannot access external SRAM at the same time the SPI bus is in use, because the pins are shared. So, the disk buffer must reside in the internal SRAM of the AVR. The ATMega644 has 4k of SRAM, so this is not that bad.
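For a sense of what goes over the wire, every SD command in SPI mode is a 6-byte frame: a command byte (0x40 plus the command index), a 4-byte big-endian argument, and a CRC7 byte with the low bit set. The Python sketch below builds such frames on a PC. It is illustrative only, not the CAT-644 driver; the helper names are invented, and it assumes a standard-capacity (non-SDHC) card, where the READ_SINGLE_BLOCK (CMD17) argument is a byte address.

```python
def crc7(data):
    """CRC-7 with polynomial x^7 + x^3 + 1, as used in SD command frames."""
    crc = 0
    for byte in data:
        for _ in range(8):
            crc = (crc << 1) & 0xFF
            if (byte ^ crc) & 0x80:   # incoming data bit XOR bit shifted out
                crc ^= 0x09
            byte = (byte << 1) & 0xFF
    return crc & 0x7F


def sd_command(index, argument):
    """Build the 6-byte frame for command 'index' with a 32-bit argument."""
    body = bytes([0x40 | (index & 0x3F)]) + argument.to_bytes(4, "big")
    return body + bytes([(crc7(body) << 1) | 1])


def read_block_command(block_number):
    """CMD17 for a standard-capacity card: argument is block_number * 512."""
    return sd_command(17, block_number * 512)
```

`sd_command(0, 0)` reproduces the familiar reset frame `40 00 00 00 00 95`, which is a handy check that the CRC routine is right.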

Sound

Sound is generated using the PWM feature of Timer 2. PWM stands for Pulse Width Modulation, and is essentially just turning a port pin on and off really fast. Pulse Width, AKA Duty Cycle, refers to how long the pin is on; it can be on anywhere between 0% of the time and 100% of the time. Timer 2 is an 8-bit timer, so there are 256 possible PWM duty cycles. The frequency is set really high: in this case the timer increments every clock cycle, so a full PWM cycle runs at 20 MHz / 256 = about 78 kHz. This can easily be filtered with a capacitor, leaving a crude approximation of an audio DAC. The Cat-644 has been tested with audio at a sample rate of 11 kHz. 11 kHz at 8-bit does not seem particularly high quality by today's standards, but it is about the same sample rate and bit depth as the original PC Sound Blaster (8-bit, 11.025 kHz).
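The PWM numbers work out as follows. This is plain arithmetic in Python with made-up constant names; the duty-cycle function is an approximation, since the exact behavior at the extremes depends on the PWM mode chosen.

```python
CPU_HZ = 20_000_000
TIMER_STEPS = 256        # Timer 2 is 8 bits wide
SAMPLE_HZ = 11_025       # sample rate quoted for the test audio

pwm_hz = CPU_HZ / TIMER_STEPS                # one full PWM cycle per 256 clocks
pwm_cycles_per_sample = pwm_hz / SAMPLE_HZ   # how many PWM cycles each sample is held


def duty_fraction(sample):
    """Approximate fraction of each PWM cycle the pin is high for an 8-bit sample."""
    return (sample & 0xFF) / TIMER_STEPS
```

Each audio sample is held for roughly seven PWM cycles, which is part of why a single capacitor is enough to smooth the output.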

Fast VM Interpreter

There is a near-fatal flaw with basing your computer design on the AVR: it cannot execute code outside of flash. The flash can be reprogrammed thousands of times, and can be partially reprogrammed by the program itself, so you can sort-of write an OS that hot-loads programs on request. It would take a long time for the chip to wear out this way, especially if the programs are small and you have a system that keeps several of them in flash at once and caches the most commonly used ones (rather than mindlessly reflashing programs that are already stored).

The workaround is an interpreter. Several VM interpreters have been written for the AVR platform by various programmers; some of them are emulators for existing classic processors, such as the 6502. Some others are new abstract machines. A common construction in these interpreters is a jump table. This form is very common:


jumptable:
rjmp vm_add
rjmp vm_sub
rjmp vm_mul
...


This form of jumptable consists of literal jump instructions. An opcode (with or without a multiplier) can be added to the offset of the jumptable. The processor can then IJMP to the jump instruction in the jump table, which then in turn jumps to the handler. This wastes a lot of clocks.

//assuming each instruction is 1 byte, X contains the interpreter's instruction pointer, and r0, r1 contain the address of the jump table

LD  ZL, X+    //load next instruction opcode into ZL  (2 clocks)
LDI ZH, 0     //zhigh is zero  (1 clock)
ADD ZL, r1    //add low offset of jump table  (1 clock)
ADC ZH, r0    //add high offset of jump table, with carry  (1 clock)
IJMP          //jump to Z  (2 clocks)

//at address Z in jump table:
RJMP handler  //(2 clocks)

//total: fetch and dispatch: 9 clocks.  Plus you still need to do some work in the handler.



An alternate form of the jump table holds the addresses of the handlers in RAM, rather than RJMP instructions:

//note: table entries are 2 bytes each, so this assumes the opcode is already a multiple of 2
LD  YL, X+    //load next instruction opcode into YL  (2 clocks)
LDI YH, 0     //yhigh is zero  (1 clock)
ADD YL, r1    //add low offset of jump table  (1 clock)
ADC YH, r0    //add high offset of jump table, with carry  (1 clock)
LD  ZL, Y+    //get low address of handler  (2 clocks)
LD  ZH, Y     //get high address of handler  (2 clocks)
IJMP          //jump to Z  (2 clocks)

By the counts above this takes 11 clocks, which is no better than the first form, and it wastes SRAM.


I believe I have come up with the fastest possible fetch/dispatch method. It disposes of the jump table and uses the opcode directly as a partial address: the opcode becomes the high byte of the handler's address. This relies on each handler having the same low address byte, so all the handlers are exactly 256 program words apart:

//before 1st instruction:

LDI ZL, low8(pm(instruction0))   //low address of 1st instruction handler (also, low address of ANY instruction handler)

// to fetch/dispatch 1 instruction
dispatcher:
LD  ZH, X+  //load opcode as high address byte of next handler into Z (2 clocks)
IJMP        //jump to handler (2 clocks)

To avoid the 'JMP dispatcher' at the end of each instruction handler, we can simply 
use a style called 'threaded code', popular in FORTH interpreters.  The end of each handler
just has a copy of the dispatch code:


.org VM_INSTR_ADDR( VM_ADD_B)
vm_add_b:
add  A_LOW, B_LOW       // 1 clock
adc  A_HIGH, B_HIGH     // 1 clock
LD   ZH, X+             // 2 clocks (load high address byte of next handler)
IJMP                    // 2 clocks (jump to next handler)

.org VM_INSTR_ADDR( VM_ADD_C)
vm_add_c:
add  A_LOW, C_LOW
adc  A_HIGH, C_HIGH
LD   ZH, X+
IJMP
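The dispatch idea can be modeled in Python to show how the opcode alone selects the handler. This is a toy simulation, not the real interpreter: the opcode values, the shared low byte, and the HALT opcode are all hypothetical, and a dictionary keyed by word address stands in for flash.

```python
HANDLER_LOW = 0x40                                # hypothetical shared low address byte
OP_ADD_B, OP_ADD_C, OP_HALT = 0x01, 0x02, 0x03    # hypothetical opcode values


def handler_address(opcode):
    """Word address the indirect jump lands on: ZH = opcode, ZL = fixed low byte."""
    return ((opcode & 0xFF) << 8) | HANDLER_LOW


def run(program, a, b, c):
    """Interpret 'program' (bytes of opcodes) on 16-bit registers A, B, C."""
    regs = {"a": a, "b": b, "c": c}

    def add_b():
        regs["a"] = (regs["a"] + regs["b"]) & 0xFFFF

    def add_c():
        regs["a"] = (regs["a"] + regs["c"]) & 0xFFFF

    # "Flash": handlers placed exactly 256 words apart.
    flash = {handler_address(OP_ADD_B): add_b,
             handler_address(OP_ADD_C): add_c}

    x = 0                                   # interpreter instruction pointer (X register)
    while True:
        opcode = program[x]                 # fetch opcode into ZH, post-increment X
        x += 1
        if opcode == OP_HALT:
            return regs["a"]
        flash[handler_address(opcode)]()    # indirect jump through Z
```

One consequence worth noting: the ATMega644 has 64k bytes (32k words) of flash, so only opcode values whose handler address falls inside that range are usable with this scheme.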

The simple 16-bit ALU operations take 6 AVR clocks to fetch, dispatch and execute, IF the program being executed is in the internal SRAM. 20 MHz / 6 = 3.3 interpreted 16-bit MIPS. Video generation in 'fast' mode (skips every other scanline) takes about 50% of the CPU, so we are looking at running 1.6 16-bit MIPS in the interpreter. In SLOW mode (all scanlines are drawn) about 95% of the CPU is used for video. 20 MHz / 6 * .05 = .166 MIPS. That is not a lot, but remember the Commodore 64 runs at about .2 (8-bit) MIPS, and CPU-intensive operations such as sprite drawing routines need not run in the interpreter: these can be syscalls made from the interpreter into optimized native AVR code stored in flash. There is still potential for a fairly fast machine; at least fast enough for arcade style video games of the 80's.
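The throughput estimates above are easy to reproduce. A small Python helper (the name is invented for this example) makes it simple to try other clock counts or video loads:

```python
CPU_HZ = 20_000_000
CLOCKS_PER_OP = 6        # fetch, dispatch and execute one simple 16-bit ALU op


def interpreter_mips(cpu_fraction_free, clocks_per_op=CLOCKS_PER_OP):
    """Interpreted MIPS when 'cpu_fraction_free' of the CPU is not drawing video."""
    return CPU_HZ / clocks_per_op * cpu_fraction_free / 1e6


no_video = interpreter_mips(1.0)     # about 3.3
fast_mode = interpreter_mips(0.50)   # about 1.6 to 1.7
slow_mode = interpreter_mips(0.05)   # about 0.17
```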

The cost of this faster interpreter, of course, is a lot of wasted flash space. Note that the handlers themselves are fairly short, so the gaps between handlers could hold lots of other code, as long as we jump around the handlers as necessary.

Slow VM Interpreter during active scanlines 

(not implemented, but theoretically possible)

So far, the CAT-644 alternates between interpreting VM programs and generating video, but never does both at the same time. When generating a video signal, during an active scanline, 100% of the CPU is in use. VM programs can only be interpreted during the horizontal or vertical blanking intervals, or during 'black' scanlines in FAST mode. There is a plan to generate video WHILE interpreting a VM program. This is a variation on the technique used to generate high resolution video.

When generating high-res video, a hardware timer was used to flip one of the address bits faster than the software itself could. This allowed 10 PORTB writes performed in 20 clocks to generate 20 pixels. However, the hardware timer does not have to flip the port EVERY clock; a divider can be set so that the port is flipped every 2 clocks instead. 2 clocks is 1 pixel in low-res mode, so this snippet of code will generate low-res video, using less of the CPU:

inc r16          //increment to next address
out PORTB, r16   //output port (changes pixel address)
                 //(timer0 also sets one bit low)
nop
nop              //(timer0 sets one bit high, changing the address)
inc r16          //increment to next address
out PORTB, r16   //output port (changes pixel address)
                 //(timer0 also sets one bit low)
nop
nop              //(timer0 sets one bit high, changing the address)

The nops can be replaced with useful instructions, including dispatch/execute, as long as the next address is written to port B on every 4th clock:

out PORTB, r16       //(clock 0)
ld  ZH, X+           //(clock 1,2)
inc r16              //(clock 3)
out PORTB, r16       //(clock 0)
ijmp                 //(clock 1,2)


one_of_the_instruction_handlers:

inc r16              //(clock 3)
out PORTB, r16       //(clock 0)
add r3, r5           //(clock 1)
adc r4, r6           //(clock 2)
inc r16              //(clock 3)
out PORTB, r16       //(clock 0)
ld  ZH, X+           //(clock 1,2)
inc r16              //(clock 3)
out PORTB, r16       //(clock 0)
ijmp                 //(clock 1,2)

In the above example, a 16-bit add operation, including fetch, dispatch, and active video output, takes 12 clocks.
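Interleavings like this are easy to get wrong by one clock, so it can help to check a candidate handler mechanically before committing it to flash. The Python sketch below is a hypothetical checker, not part of the project: it takes a list of instruction kinds, uses the clock counts quoted in the listing, and reports on which clocks the port writes land.

```python
# Clock cost of each instruction kind, as quoted in the listing above.
CYCLES = {"out": 1, "inc": 1, "add": 1, "adc": 1, "load": 2, "ijmp": 2}

# The example handler: 'load' is the fetch of the next opcode into ZH.
HANDLER = ["inc", "out", "add", "adc", "inc", "out", "load", "inc", "out", "ijmp"]


def schedule(ops, start_clock):
    """Return (clocks on which each 'out' executes, total clocks used)."""
    clock = start_clock
    out_clocks = []
    for op in ops:
        if op == "out":
            out_clocks.append(clock)
        clock += CYCLES[op]
    return out_clocks, clock - start_clock


def keeps_video_timing(ops, start_clock):
    """True if every port write falls on a multiple-of-4 clock."""
    out_clocks, _ = schedule(ops, start_clock)
    return all(c % 4 == 0 for c in out_clocks)
```

The handler is entered on clock 3 of the 4-clock pattern, so `schedule(HANDLER, 3)` gives the port-write clocks and the 12-clock total.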

There are limitations to running sequences of instructions while generating video. First, these instruction sequences cannot access external SRAM, since it is in use generating the picture. Second, some development is necessary to STOP outputting video at the right point across the scanline; there will have to be a 'pixel counter' register that says when to suspend video generation. Incrementing and comparing this counter will also take some cycles, so realistically, a 16-bit operation will probably take 14 or 15 clocks while generating video. This is still better than nothing, and may do useful work. In 'slow' mode (all scanlines drawn), 95% of the CPU is taken up processing video. 95% * 20 MHz / 15 = about 1.2 MIPS 'recovered' from the video interrupt, which would otherwise be wasted.

These are the current estimates for interpreter MIPS:

video 'fast' mode:  50% of cpu is processing video:

total: 2.2 MIPS

video 'slow' mode (95% of cpu is processing video)

total: 1.3 MIPs

This does not sound like a lot by today's standards, but:

There is also the challenge of the hosted VM program keeping track of when it is in the blanking region and external SRAM is available versus when it is in the active region and SRAM is blocked.  The easiest solution is to split the user program into two threads:  one thread is internal SRAM only, and runs during active video scanlines, and the second thread only runs during blanking.