Turns out there is still room for some optimisations.
I found a way to squeeze a third register address into the 16-bit ALU instruction, so operations that take two operands and save the result into a third register are now possible. For example, ADD a0 b1 c2 adds the values stored in registers 1 and 2 together and stores the result in register 0. This in no way breaks the previous mode, where ops looked like ADD a0 b1: the values in registers 0 and 1 are added and the result overwrites register 0. The new operations became possible by slightly changing the addressing wiring in the Register File. Because it maintains compatibility with the previous instructions, the new mode has a limitation: only registers 2 to 7 can be used as operand C.
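To make the difference between the two modes concrete, here is a minimal behavioural sketch in C++. It models only what the instructions do, not how the bits are encoded or how the Register File is wired. The names `RegisterFile`, `add2` and `add3`, the 8-register count and the 16-bit word width are all assumptions made for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical model: 8 registers of 16 bits each (both sizes are assumptions).
using RegisterFile = std::array<std::uint16_t, 8>;

// Two-operand mode, e.g. ADD a0 b1: regs[a] = regs[a] + regs[b].
void add2(RegisterFile& regs, unsigned a, unsigned b) {
    assert(a < regs.size() && b < regs.size());
    regs[a] = static_cast<std::uint16_t>(regs[a] + regs[b]);
}

// Three-operand mode, e.g. ADD a0 b1 c2: regs[a] = regs[b] + regs[c].
// Only registers 2 to 7 are accepted as operand C, as described in the text.
void add3(RegisterFile& regs, unsigned a, unsigned b, unsigned c) {
    assert(a < regs.size() && b < regs.size());
    assert(c >= 2 && c <= 7);
    regs[a] = static_cast<std::uint16_t>(regs[b] + regs[c]);
}
```

With registers 1 and 2 holding 5 and 7, `add3(regs, 0, 1, 2)` leaves 12 in register 0 and both sources untouched, while a following `add2(regs, 0, 1)` overwrites register 0 with 17.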
Other changes to ALU operations required some modification of the ALU itself. The wiring was changed slightly, so one-operand instructions (such as shifts) can now store the result in another register without overwriting the contents of the source.
A barrel rotator was also added to the ALU; it can rotate the word left by up to 15 bits.
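As an illustration of what a barrel rotator computes, the sketch below rotates a word left in four conditional stages (by 1, 2, 4 and 8 bits), one stage per bit of the rotation amount. This is one common way to structure a barrel rotator and not necessarily how the simulated ALU is wired. The 16-bit word width and the name `barrel_rotate_left` are assumptions.

```cpp
#include <cstdint>

// Rotates a 16-bit word left by `amount` bits (only the low 4 bits of
// `amount` are used, giving rotations of 0 to 15 bits).
// Each loop iteration stands in for one stage of the barrel rotator:
// stage i either passes the word through or rotates it by 2^i bits.
std::uint16_t barrel_rotate_left(std::uint16_t word, unsigned amount) {
    std::uint32_t w = word;  // widen so the left shift cannot overflow
    for (unsigned stage = 0; stage < 4; ++stage) {
        const unsigned shift = 1u << stage;  // 1, 2, 4, 8
        if (amount & shift) {
            w = ((w << shift) | (w >> (16u - shift))) & 0xFFFFu;
        }
    }
    return static_cast<std::uint16_t>(w);
}
```

Because bits pushed out on the left re-enter on the right, `barrel_rotate_left(0x8001, 1)` gives `0x0003`, and a rotation by 15 is the same as a rotation right by 1.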
The encoding of the new register addresses for the above operations, as well as of the rotation bit count, was achieved by utilising previously unused bits of the instruction (or by conditionally reinterpreting some bits). This turned out to require surprisingly small changes to the ALU/Register File wiring and decode logic.
One encoding did change: the +/- bit is now used to indicate shift direction, replacing the previously dedicated Left/Right bit. All the other changes are additions to the previously existing set.
On another note, I found a way to speed up execution by shaving clock cycles off some of the instruction types. This is done by staggering instruction execution, so that while the last step of the current instruction cycle is going on, the fetch of the next instruction begins.
This is possible for instructions that don't write to the Memory Address Register (MAR) in their last step, so no bus contention is created.
The instructions affected are ALU, MOV and Load/Store instructions. When they follow one another in succession, ALU and MOV instructions take only 2 clock cycles instead of 3, and Load/Store operations take 3-4 clocks instead of 4-5, depending on the operation.
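The effect of the staggering on total clock count can be estimated with a small model. The sketch below is not the simulation itself. It assumes that an instruction whose last step does not write to MAR saves exactly one clock whenever another instruction follows it, which is one reading of the description above. The names `Instr`, `alu_or_mov`, `load_store`, `other` and `total_clocks` are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// One instruction in the model: its clock count without any overlap, and
// whether its last step may overlap with the fetch of the next instruction.
struct Instr {
    unsigned clocks;
    bool overlaps_next_fetch;
};

// ALU and MOV: 3 clocks without overlap (figure from the text).
Instr alu_or_mov() { return {3, true}; }
// Load/Store: 4 or 5 clocks without overlap, depending on the operation.
Instr load_store(unsigned clocks) { return {clocks, true}; }
// Any instruction that writes to MAR in its last step: no overlap.
Instr other(unsigned clocks) { return {clocks, false}; }

// Sums the clocks of a sequence, subtracting one clock for every
// overlappable instruction that is followed by another instruction.
unsigned total_clocks(const std::vector<Instr>& program) {
    unsigned total = 0;
    for (std::size_t i = 0; i < program.size(); ++i) {
        total += program[i].clocks;
        const bool has_successor = i + 1 < program.size();
        if (program[i].overlaps_next_fetch && has_successor) {
            total -= 1;  // last step runs alongside the next fetch
        }
    }
    return total;
}
```

Under this model three back-to-back ALU instructions cost 7 clocks instead of 9: 2 clocks each for the first two, plus the full 3 for the last one, which has nothing after it to overlap with.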
The changes needed to achieve this were also surprisingly small: 3 or 4 logic gates and a couple of wires were added to the simulated processor.
The files with the updated instruction set, as well as the simulation, can be found in the Computer04.7z archive in the "Files" section.
Things to add to the simulation later:
1. Start-up sequence -- for the first cycle of 8 clocks nothing is done and Reset is applied to the whole system, then execution commences with the instruction at address 0x000000.
2. Interrupt handling. For now it is still a somewhat obscure topic for me.
3. More memory :) For now a placeholder is used, with 256-word ROM and RAM. For testing purposes that is enough for now.
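The start-up sequence from item 1 could be modelled along these lines. This is only a sketch of the planned behaviour, since nothing like it exists in the simulation yet. The `StartupSequencer` type and its members are hypothetical, and the 24-bit program counter is held in a 32-bit integer.

```cpp
#include <cstdint>

// Hypothetical start-up sequencer: holds Reset for the first 8 clocks,
// then releases it with the program counter at address 0x000000.
struct StartupSequencer {
    static constexpr unsigned kResetClocks = 8;
    unsigned clocks_seen = 0;
    std::uint32_t program_counter = 0xFFFFFF;  // arbitrary power-on value

    // Advances one clock. Returns true while Reset is still asserted.
    bool tick() {
        if (clocks_seen < kResetClocks) {
            ++clocks_seen;
            program_counter = 0x000000;  // Reset forces the start address
            return true;
        }
        return false;  // Reset released, normal execution may begin
    }
};
```

Calling `tick()` eight times returns true each time; the ninth call returns false, at which point the first fetch would use address 0x000000.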
Right now I need to write an assembler (however crude) in C++, so I won't need to type 0's and 1's into a spreadsheet by hand and look up every bit pattern; it has become apparent that this process is incredibly slow and error-prone.
Quick Update: today it occurred to me that I can merge the first two steps of the instruction cycle into one by bypassing the MAR when addressing from the Program Counter or Stack Pointer. This can be done by rerouting some buses and using two 4:1 24-bit multiplexers instead of one 8:1. So, in the coming days I'll try to implement this. It will not change any instruction encoding, but it will make all instruction cycles 1 clock cycle shorter. This, combined with the recently implemented pre-fetching (described above), will let ALU and MOV operations effectively run at 1 instruction per clock cycle in some situations.