OK, I guess I'll do another log since all I have to do is copy and paste some stuff from a reddit post: https://www.reddit.com/r/m68k/comments/63ov3e/fast_68000_block_memory_move_routine_comparison/
Simple memory move routines based around MOVE and DBRA are often regarded as fairly fast on the 68000:
    .loop:
            MOVE.L  (A0)+,(A1)+     ; do the long moves
            DBRA    D0,.loop
This runs decently fast on the 68000, and a bit faster on the '010 and above thanks to the fast loop feature they added for DBRA, especially if your RAM is otherwise slow. However, it's not the most efficient way on the '000 or the '010, and certainly not on the higher members of the 68000 family.
Enter the MOVEM instruction. This single instruction can load up to 16 long words while paying the instruction fetch overhead of only one instruction. The problem? You can't use the postincrement mode for the destination, only the source, so you have to manually (in a separate instruction) add to the destination pointer. An additional complication is that using the stack pointer, or your source and destination pointers, to hold data within a MOVEM is probably not a good idea:
MOVEM.L (A0)+,D0-D7/A0-A6 ; not a good idea, unless you're sure this is what you want. behavior varies by CPU
So in reality, assuming you don't want to use your stack pointer either (also not a good idea, especially if you're running in supervisor mode), you should set aside 4 registers: one for source, one for destination, your stack pointer, and a loop counter. That leaves 12 registers to carry data in each MOVEM.
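Here's a minimal sketch of that register budget for a single block, assuming A0 is the source, A1 is the destination, and D0 is left alone as the (here unused) loop counter. The label copy48_preserve is made up for illustration, and the save/restore around the copy is my assumption about what a caller would want, not something the timed routines do.

```asm
; Hypothetical helper: copy one 48-byte block from (A0) to (A1)
; without disturbing any of the caller's data-carrying registers.
; Reserved registers: A0 = source, A1 = destination, A7 = stack, D0 = counter.
copy48_preserve:
        MOVEM.L D1-D7/A2-A6,-(SP)   ; save the 12 registers used to carry data
        MOVEM.L (A0)+,D1-D7/A2-A6   ; load 12 longs, A0 advances by 48
        MOVEM.L D1-D7/A2-A6,(A1)    ; store 12 longs, A1 is not changed
        ADDA.W  #48,A1              ; advance the destination by hand
        MOVEM.L (SP)+,D1-D7/A2-A6   ; restore the caller's registers
        RTS
```

The save and restore cost two extra MOVEMs, so in a real block move you'd do them once around the whole loop rather than once per block.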
Anyway, I started getting curious whether or not a MOVEM-based routine would be faster than a simple MOVE; DBRA loop like the first example. The MOVE; DBRA loop has virtually no instruction fetch time, but the MOVEM routine has less loop overhead because it transfers a large block per loop iteration rather than a single byte/word/long. I decided to run a test on real hardware. I set up the routines shown below to transfer a 256K block 256 times, and timed them with a stopwatch. A stipulation on both routines is that the source and destination must be word aligned.
    ; 68K block memory move using MOVEM
    ; this routine is roughly 2.5x faster on a Motorola 68332 running out of its fast 2-cycle TPURAM
    ; the test was performed by moving a 256K block approximately 256 times (might have been an off by one in there.)
    ; the block was moved from SRAM to SRAM each time by each routine. Both routines ran out of fast 2-cycle TPURAM.
    ; in theory, the above doesn't matter much for the 2nd routine, since it should run entirely in the special loop mode.
    memmovem:
    ; parameters:
    ;   A0 - source (word alignment assumed)
    ;   A1 - destination (word alignment assumed)
    ;   D0 - size in words (0 is undefined)
    ; variables
    ;   D0 - size in MOVEM blocks (rounded down)
    ;   D1 - number of words not included by above
    ;   D1 - reused for MOVEM
            DIVUL.L #24,D1:D0           ; 24 words per MOVEM
            SUBQ    #1,D0               ; needed for the way DBRA works
            SUBQ    #1,D1
            BMI     .loop               ; if D1 negative, we happen to have round number of MOVEMs
            LSR.W   #1,D1               ; convert from word count to long count
            BCC     .longloop
            MOVE.W  (A0)+,(A1)+         ; correct for word
    .longloop:
            MOVE.L  (A0)+,(A1)+         ; take care of long moves first
            DBRA    D1,.longloop
            JMP     .loop
    .outerloop:                         ; do MOVEM sized moves
            SWAP    D0
    .loop:
            MOVEM.L (A0)+,D1-D7/A2-A6   ; 12 long words
            MOVEM.L D1-D7/A2-A6,(A1)
            ADDA.W  #48,A1
            DBRA    D0,.loop
            SWAP    D0
            DBRA    D0,.outerloop
            RTS

    memmove:
    ; parameters:
    ;   A0 - source
    ;   A1 - destination (word alignment assumed)
    ;   D0 - size in words (0 is undefined)
            LSR.W   #1,D0               ; convert to long count
            BCC     .longloop
            MOVE.W  (A0)+,(A1)+         ; correct for word
            JMP     .longloop
    .outerloop:
            SWAP    D0
    .longloop:
            MOVE.L  (A0)+,(A1)+         ; do the long moves
            DBRA    D0,.longloop
            SWAP    D0
            DBRA    D0,.outerloop
            RTS
These routines were run from inside my 68332's TPURAM, a very fast 2-cycle SRAM internal to the 68332. This reduced instruction fetch times for the MOVEM routine as much as possible, and is not an unreasonable use of that RAM, in my opinion. The 68332 was running at ~16MHz. The 68332 is based around the CPU32 core, and it has the fast DBRA loop feature. The results? Clear as crystal: 20 seconds for the MOVEM-based loop, and 48 seconds for the simple MOVE-based loop, so the MOVEM version is about 2.4 times faster. You'll notice that in both cases I made sure to take advantage of long transfers, despite the 16-bit bus, to reduce looping overhead.
Another potential speedup for moving big blocks of data around exists on the '040: the MOVE16 instruction. However, this has its own stipulations (mumbles something about cache blocks) and I don't have an '040, so I won't go into it.
Another note is that the MOVEM routine doesn't work so well for moving from a single I/O address, unless you specifically design your I/O mapping to allow it. You can set up the I/O decoding to ignore the lowest 6 bits of the address and use ADDR6, ADDR7, etc. in place of the low order bits. You can then safely MOVEM to/from the first occurrence of the I/O register you intend to transfer to/from, and the subsequent transfers will use higher addresses, which still correspond to the same I/O register. This could be useful for fast peripherals that can run more or less at the speed of the bus (and will make the bus wait if they can't) like a PATA HDD, for example (or CF, same thing almost).
Additionally, moves of a constant size using MOVEM can potentially be unrolled without taking as much code space as unrolling a simple MOVE.L would. Some smaller transfers may take only a single pair of MOVEM to complete as well, meaning no loop overhead at all.
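As a rough sketch of what an unrolled constant-size move could look like, here's a 96-byte copy done with two MOVEM pairs and no loop. The label copy96 and the 96-byte size are made up for illustration, and it assumes the same conventions as above: A0 is the source, A1 is the destination, both word aligned, and D1-D7/A2-A6 are free to clobber.

```asm
; Hypothetical fixed-size copy: 96 bytes (24 longs) from (A0) to (A1), no loop.
; Clobbers D1-D7/A2-A6. A0 ends up advanced by 96, A1 is left unchanged.
copy96:
        MOVEM.L (A0)+,D1-D7/A2-A6   ; load bytes 0..47
        MOVEM.L D1-D7/A2-A6,(A1)    ; store them at A1+0
        MOVEM.L (A0)+,D1-D7/A2-A6   ; load bytes 48..95
        MOVEM.L D1-D7/A2-A6,48(A1)  ; store them at A1+48 via a displacement
        RTS
```

Using a displacement on the second store avoids the separate ADDA. Not counting the RTS, the four MOVEMs should come to 18 bytes of code, against 48 bytes for 24 unrolled copies of MOVE.L (A0)+,(A1)+. A transfer of 48 bytes or less needs only the first pair, with a shorter register list.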