Attempting to implement Forth on the Gigatron
A friend asked me how close I was to done. I explained that normal operation of a Forth Interpreter is (from the standard):
- Skip leading spaces and parse a name;
- Search the dictionary name space. If a definition name matching the string is found:
- if interpreting, perform the interpretation semantics of the definition, and continue at 1).
- if compiling, perform the compilation semantics of the definition, and continue at 1).
- if compiling, perform the compilation semantics of the definition, and continue at 1).
- if compiling, perform the compilation semantics of the definition, and continue at 1).
- If a definition name matching the string is not found, attempt to convert the string to a number. If successful:
- if interpreting, place the number on the data stack, and continue at 1);
- if compiling, compile code that when executed will place the number on the stack, and continue at 1);
and that out of all of this process, I currently have implemented "if".
Actually though, that doesn't describe where I am at all. Because my Forth, like most Forths is itself implemented in Forth, this loop will almost be the last thing that I write, and almost word I implement brings me closer to being able to write this interpreter loop.
Since the last update I've managed to get the cross-compiler / bootstrap process working for simple words. It's full of hacks, but the code it outputs is correct, and it's accelerated the rate of implementing the standard Forth words no end. Each thing I add unlocks a bunch more easy words to implement. At times it felt like I was in the middle of a great book that I couldn't put down.
I've added ASM words for conditional and unconditional branching, which combined with a mechanism in the bootstrap Forth for marking branch targets (using assembler labels) gave me the ability to use IF, ELSE and THEN, which unlocked ?DUP and more.
I added 0=, and the various binary bitwise words, which unlocked INVERT, 0<>, =, NEGATE, 0< and more
I added +, which unlocked -, < etc.
It took me more than a month before I'd written enough assembler code to fill a ROM page last year. Earlier this year I filled one in two days. Some might say that's because my encoding is not very dense, but I'd rather think of it as incredible progress! I currently have 34 of the 168 words in the core and core ext. lists, so maybe I'm really 20% of the way there. I think in truth I'm much closer.
I've recently taken a bit of a break from working on this project (following a few further trips down yak-shaving rabbit holes). My next step will be to get input and output working, by providing a means of wrapping the SYS functions built in to the Gigatron ROM as Forth words. Part of this is having a means of entering and re-entering Forth from vCPU, while making sure that all of the timings line up correctly. Going back to worrying about timing is one of the things that has put me off a bit.
Hopefully my next update here will be an explanation of how Forth words work in ROM, as I think my previous update wasn't clear, and it will be very useful for anyone who wants to look at my code.
Before Christmas I made slow but steady progress implementing various core words. I have written various stack manipulation words like ROT, SWAP, DROP and DUP, the memory words @, !, C@, C!, and the arithmetic words 1+ and 1-. I have also written the code for entering and exiting a thread: DOCOL and EXIT.
The experience of writing all of this code in assembly has been a mixture of fun and frustration. On the one hand it's a nice exercise in that it's just creative enough to be interesting, without allowing much space for analysis paralysis. The sort of task where you can keep putting one foot in front of another. On the other hand it's so slow! I don't think I've ever got a single routine correct the first time, and often my first effort is hilariously wrong. Even with my fairly decent test approach, working out what I've done wrong always seems to involve single stepping my code, usually multiple times, which is a very laborious process. Often I'm working on this in odd half-hours, and this really tasks that feel like they should be done in one session end up taking several. I'm really glad I've aimed for good unit test coverage, otherwise I'd be lost.
I'm now at a point where I can start to think about defining words in terms of other words. For example, a definition of NIP could be:
: NIP SWAP DROP ;
I could re-write this in Python as
def nip(): label("forth.core.NIP") docol() for word in ["SWAP", "DROP", "EXIT"]: st(lo("forth.core." + word), [Y, Xpp]) jmp(Y, "forth.next3.rom-mode-tail") st(hi("forth.core." + word), [Y, Xpp])
However if I follow that style, I will end up writing a whole Forth in Python-generated assembly, which isn't really the path I want to go down.
Instead I'd like a tool that can process the Forth definition above, and issue the same Python function calls. Lots of definitions for Forth words in Forth syntax would be quite complicated, so really this needs to be a full Forth implementation in Python. Of course, I want to write as much of this in Forth as possible too, because otherwise I have to write everything twice (or perhaps even three times). None of the existing Forth in Python projects that I've seen quite for the bill.
That Forth in Python is what I've been working on so far this year (although without spending much time on it), and I've finally got to a point where it does something. I can run simple definitions in an interactive session, and see the results, and while there are many words written in Python, the interpreter loop and several core words are written in Forth and loaded during a bootstrapping process. This includes ;, ['] and \. I think it has promise. My next step is to get it to generate Gigatron code, and NIP is the first word I shall be trying to compile.
There's still a lot of assembly to write, but it will be good to be able to write Forth too.
Well it happened: I went down a yak-shaving rabbit hole.
The thinking went like this:
I actually shaved yaks 1 to 3. The git script was fun, and has proved useful, and taught me a lot. The Python 3 port is also nice - and is useful to the wider Gigatron project.
I spent a lot of time vacillating between the "compiler" and the asm.py changes, and worrying that both are a waste of time. On the one hand it's kinda stupid to fee guilty about wasting time when the whole project is just for fun; why not follow an impulse to write some code if I want to? On the other hand, I would actually like to complete this at some point!
I think a big problem for me right now is that I don't really know how hard it's going to be to get this finished without shaving the yaks, I don't know how much work these side-lines would be, and so I don't know how much benefit they'll provide. But if I push on without them and it turns out that they would have been useful, I have missed an opportunity to benefit from the effort.
For now I've put both projects to one side, and am trying to make progress on implementing more words. More on this in the next update. I think I might well come back to the instruction counting work soon, but I'm going to go without the stack operation optimiser. I think this may come back in the form of an inliner and peep-hole optimiser to convert words written in Forth to Gigatron code. I've had an itch to write a compiler for a year now, and this might be the opportunity to scratch it.
[This log relates to the state of the project as of commit b58f04048e080c8397b921c5705d2b3de66b05e4]
I can't believe that I've been working on this for 24 days, with such a small amount to show for it. I guess they weren't joking when they said implementing Forth in the native instruction set would be hard! It's not that I haven't been putting any effort into it - I really have!
I've finally finished implementing the various NEXT routines, with pretty good tests. However I'm now convinced that my original design was just bad. All of the fail-fast tests, and jumping from one routine to another leads to dispatch having a very large cost - whilst also causing me to do a huge amount of thinking just to work out what the code should do. As someone who is not good at arithmetic, and gets easily confused by pointers, this has been proving to be a challenge.
My original design goal was to maximise the number of instructions a native-forth word can be without breaking the time window. The image above is from a spreadsheet I have made to find out how much the overhead is going to be, and if the system will even work in terms of the timings (does the cost of dispatch dwarf the instructions a word can have). What happens if you enter a time-slice with different length words in the 'W' register (the X-axis in the chart). In this particular case I'm examining the '---D line 0' time-slice, which is the shortest at 114 cycles, and we're assuming that we're executing code from ROM, and the word ends with a call to NEXT. All of these can be changed in the spreadsheet.
As you can see, it works: If we run a word with 0 - 32 "real-work" instructions, we'll finish it, dispatch to the next word, and possibly start another word all within our window. If our word is between 33 and 48 instructions, we will load the W register with the address of the next word, but won't dispatch to it (the steps are various points it can fail along the path), If our word is 48 to 82 cycles long we will complete the work, and queue the dispatch, but won't attempt it. Greater than that, and we can't do the required book keeping, so we don't enter the word.
I definitely want to come back to this area of the code, and get that big red stripe narrower, but for now, it's going to be good enough.
Next DOCOL and EXIT. I'm thinking of writing some Python utilities to do the code-counting for me - doing it by hand is pretty tedious!
P.S. Spreadsheets are a trap. It's so easy to get started, but a nightmare to keep track once they get a little bit big.
I've been making slow but steady progress over a little while, having implemented next1, next2 and next1-renter. I've even written pretty good unit tests for some of them. This has been a hard process, getting the tests right is harder than getting the code right!
Yesterday it occurred to me that I could string all the routines together, saving on the jumps. I tried it out, and while neat, it actually made the code slower. So then I decided to go the other way - inline everything. This makes the code faster in some cases, but massively changes the invariants of the routines, meaning the tests need to be rewritten. This whole business took most of my available time for a day.
So now I have a dilemma: carry on refactoring, for only possible benefit (let's face it, there's no real world benefit to any of this) or go back to my suboptimal code, and push forward with that. Neither sounds good. I also have come up with a few more approaches to saving time in dispatch.
This is what I meant about shaving yaks!
Part of the problem is that I don't really have a good way of evaluating any solution right now.
I'm going to commit my rewritten code as a branch, and revert.
I'm the forum Marcel posted a vCPU implementation of parts of a Forth kernel, and it's remarkably small compared to mine. I intend to post in the forum when my version reaches parity with his.
I've documented my design for the NEXT routines, which I've committed to Github here. It's a little sketchy, and I can already see mistakes. I haven't run, or even assembled any of the code yet, but hopefully I haven't made up any of the op-codes, even if some of my assembly syntax is invented! I think the big ideas are clear and I shan't try to explain it again here. I originally wrote that document to post to the Gigatron forum, but I decided it was a bit long to post at this stage.
One thing I will say about that design is that I'm really disappointed that the ROM-mode version of NEXT3 is so expensive. I'd really like running threads in ROM to be cheaper than running threads in RAM. I'm aware that I can encode jumps inline in my instructions, which would vastly speed things up, by removing the double-jump dispatch loop, however it would add 50% to the space usage. I think the current design will probably work, and I can switch to inline jumps later if it turns out that speed is more valuable than space. Also, I really love the double-jump trick - it's so cool!
I'd like to talk a little more about the general approach I'm planning to follow in developing my Forth. I think it can be summarised as:
In practice, I intend to lean very heavily on Python, following the existing approach of using it as "macro assembler", but also trying to use it to unit test my code, and probably even having a Forth compiler in Python for generating embedded Forth threads.
The first 'law' about not shaving yaks is important. I find it easy to get bogged down in minor details and lose velocity. I'm going to try to enjoy, and even celebrate kicking cans down the road. To that end, here are some yaks I have left unshaven so far:
The CFFI wrapping for gtemu.c is pretty neat actually; I hadn't used CFFI before. Here's an example of using it from iPython:
In : import _gtemu In : f_rom = open('dev.rom', 'rb') In...Read more »
My knowledge of Forth originally comes from the really fascinating jonesforth by Richard W.M. Jones, which I first saw through Lambda the Ultimate back in 2007. It takes the form of two files, one assembly file, which provides just enough machinery to stand up a Forth runtime, and one Forth file which builds upon it to provide a usable (although almost certainly incomplete) environment. The code is written in a literate style, being so heavily commented as to be at least as much a document as a program. I strongly recommend checking it out for an interesting read.
This style of writing a small assembly kernel in order to bootstrap a Forth system is apparently a common approach, and it's this that makes me think that a native Gigatron Forth might be a tractable endeavour. Hopefully I can fairly quickly get to the point where I'm copying existing Forth implementations of core words instead of writing lots of ASM by hand.
I'm also alive to the scepticism of the other posters on the Gigatron Forum, and must acknowledge that I'm likely to fail. This is also the first truly low-level programming I've ever done.
I've also read (many times) the Moving Forth papers by Brad Rodriguez, which explain (amongst other things) the difference between a few important implementation strategies, particularly Direct Threaded and Indirect Threaded code (jonesforth follows the more traditional indirect threaded code model).
I'm not going to try to re-explain any of what either of these excellent sources cover. To follow what is to come, you'll probably want to be familiar with Forth terminology like 'thread', 'word', etc. Ideally not too familiar though, as I'm likely to misuse the terms - I'm no Forth programmer myself!
What both jonesforth and Moving Forth make clear is that the heart of a Forth implementation is 'NEXT', the code which dispatches to the next word in the currently executing thread, and moves the interpreter pointer (IP) forward. NEXT for the Gigatron could never be as short as the two instructions in jonesforth, because the instruction set is not so rich as i386, but ours will be a lot more complicated because of the timing constraints.
My Forth is effectively going to be a third virtual CPU for the Gigatron, and as such must follow the same rules as the vCPU and the v6502 - which are basically that you are told how much time to run for (in ticks - two clocks, which means two instructions), and you must run for that long, and not an instruction more or less. In the vCPU this is done by the dispatch loop (itself called NEXT - no coincidence, I'd guess) seeing if there's time for the longest possible op-code to complete. However we won't know what the longest 'native' Forth word is (well perhaps we could, but it could be pretty long) and we don't really want to scatter all of our words with book-keeping and continue/abort decisions. So my plan is to have each native word describe it's worst-case runtime before starting, and NEXT will have to decide whether to enter it, or not. Upon completion the word will report the actual time taken (they could be different where there is code that loops or branches). This is similar to how SYS routines work. All of this book keeping and decision making will make NEXT a lot more costly, and the shortest call to the vCPU is 36 ticks - so there isn't much time to get much of anything done!
There's another fly in the ointment: to be usable as an interactive environment we need to be able to compile Forth code to RAM, and to make use of other Forth code in RAM, but native code needs to live in ROM, and we'll certainly want compiled Forth code in ROM as well - so sometimes we'll be executing a thread in RAM where the addresses are to other threads in RAM, sometimes we'll be executing a thread in RAM, but the addresses are to words in ROM, and sometimes we'll be executing threads in ROM, so our NEXT implementation needs to keep all of this straight.
Read more »
I've sketched out a design for all of this, which I'll post about...
I'm writing this retrospectively, as I've already done some work on this project.
I've been taking a lot of interest in the Gigatron project since I saw a video about it on YouTube, and then watched the Hacker Hotel video explaining its fascinating architecture - I don't own one yet, but I'm hoping to have some money after Christmas!
On the Gigatron forum a user posted about porting Forth to the Gigatron. I've also been interested in Forth, and stack-based language runtimes for a while, but haven't quite had the impetus to write one myself. The original post asked if people were interested, and questioned whether the vCPU, or v6502 was the better choice for implementing it, whilst leaning towards the v6502. I replied that I thought that the native 8-bit Gigatron instruction set would make for the cooler hack. Marcel and monsonite (the original poster) both thought that this was too hard. But I couldn't stop thinking about it... so here I am, setting out to write a ROM based native Forth for the Gigatron.
Who knows if this will result in any useful code, but I already feel like I've learned a few things.
The next update will be on what I've got so far, and my general approach.