fast 68000 block memory move routine comparison

OK, I guess I'll do another log since all I have to do is copy and paste some stuff from a reddit post: https://www.reddit.com/r/m68k/comments/63ov3e/fast_68000_block_memory_move_routine_comparison/

Simple memory move routines based around MOVE and DBRA are often regarded as fairly fast on the 68000:

.loop:
    MOVE.L  (A0)+,(A1)+ ; do the long moves
    DBRA    D0,.loop

This runs decently fast on the 68000, and a bit faster on the '010 and the CPU32, whose loop mode lets a tight DBRA loop run without refetching its instructions (the '020 and later get a similar effect from their instruction cache). The gain is biggest if your RAM is otherwise slow. However, it's not the most efficient way on the '000 or the '010, and certainly not on the higher variants of the 68000.

Enter the MOVEM instruction. This single instruction can load up to 16 long words for the instruction fetch overhead of a single instruction. The problem? Postincrement mode is only available when loading from memory, not when storing to it, so you have to advance the destination pointer manually with a separate instruction. A further complication is that using the stack pointer, or your source and destination pointers, to hold data within a MOVEM is probably not a good idea:

MOVEM.L (A0)+,D0-D7/A0-A6 ; not a good idea, unless you're sure this is what you want. behavior varies by CPU

So in reality, assuming you don't want to use your stack pointer for data either (a particularly bad idea if you're running in supervisor mode), you should set aside 4 registers: one for the source, one for the destination, your stack pointer, and a loop counter. That leaves 12 registers for data, so each MOVEM moves 12 long words (48 bytes).

Anyway, I got curious whether a MOVEM-based routine would be faster than a simple MOVE; DBRA loop like the first example. In loop mode the MOVE; DBRA loop has virtually no instruction fetch time, but the MOVEM routine has less loop overhead because it transfers a large block per loop iteration rather than a single byte/word/long. I decided to run a test on real hardware. I set up the routines shown below to transfer a 256K block 256 times, and timed them with a stopwatch. Both routines require word-aligned source and destination.

; 68K block memory move using MOVEM
; this routine is roughly 2.5x faster than the simple MOVE.L/DBRA routine (memmove, further down)
; on a Motorola 68332 running out of its fast 2-cycle TPURAM
; the test was performed by moving a 256K block approximately 256 times (might have been an off by one in there.)
; the block was moved from SRAM to SRAM each time by each routine. Both routines ran out of fast 2-cycle TPURAM.
; in theory, the above doesn't matter much for the 2nd routine, since it should run entirely in the special loop mode.
memmovem:
; parameters:
; A0 - source (word alignment assumed)
; A1 - destination (word alignment assumed)
; D0 - size in words (0 is undefined)

; variables
; D0 - size in MOVEM blocks (rounded down)
; D1 - number of words not included by above
; D1 - reused for MOVEM

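    DIVUL.L #24,D1:D0   ; 24 words (12 longs) per MOVEM; D0 = blocks, D1 = leftover words
                        ; (DIVUL.L needs a CPU32 or '020+, it is not a plain 68000 instruction)

    LSR.W   #1,D1   ; convert leftover from word count to long count, carry = odd word
    BCC .even

    MOVE.W  (A0)+,(A1)+ ; correct for word

.even:
    SUBQ.W  #1,D1   ; needed for the way DBRA works
    BMI .blocks     ; no leftover longs

.longloop:
    MOVE.L  (A0)+,(A1)+ ; take care of leftover long moves first
    DBRA    D1,.longloop

.blocks:
    SUBQ.L  #1,D0   ; 32-bit count: the high word feeds the outer DBRA
    BMI .done       ; fewer than 24 words, no MOVEM blocks at all
    BRA .loop
.outerloop: ; do MOVEM sized moves
    SWAP D0
.loop:
    MOVEM.L (A0)+,D1-D7/A2-A6   ; 12 long words
    MOVEM.L D1-D7/A2-A6,(A1)
    ADDA.W  #48,A1  ; no postincrement for the store, so advance by hand
    DBRA    D0,.loop
    SWAP    D0
    DBRA    D0,.outerloop

.done:
    RTS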

memmove:
; parameters:
; A0 - source (word alignment assumed)
; A1 - destination (word alignment assumed)
; D0 - size in words (0 is undefined)

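    LSR.L   #1,D0   ; convert to long count (all 32 bits, the high word feeds the outer DBRA)
    BCC .even

    MOVE.W  (A0)+,(A1)+ ; correct for word

.even:
    SUBQ.L  #1,D0   ; needed for the way DBRA works
    BMI .done       ; size was a single word, nothing left to move
    BRA .longloop

.outerloop:
    SWAP    D0

.longloop:
    MOVE.L  (A0)+,(A1)+ ; do the long moves
    DBRA    D0,.longloop
    SWAP    D0
    DBRA    D0,.outerloop

.done:
    RTS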

These routines were run from inside my 68332's TPURAM, a very fast 2-cycle SRAM internal to the 68332. This reduced instruction fetch times for the MOVEM routine as much as possible, and is not an unreasonable use of that RAM, in my opinion. The 68332 was running at ~16MHz. It is based around the CPU32 core, which has the fast DBRA loop feature. The results? Clear as crystal: 20 seconds for the MOVEM-based loop, and 48 seconds for the simple MOVE-based loop, a factor of about 2.4. You'll notice that in both cases I made sure to take advantage of long transfers, despite the 16-bit bus, to reduce looping overhead.
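
For reference, a test like the one described (a 256K block moved 256 times) could be driven by a loop along these lines. This is only a sketch, not the harness that produced the timings: bench, SRC, DST and WORDS are made-up names, and the addresses are placeholders for wherever your SRAM lives. The pass counter is kept on the stack because memmovem overwrites D0-D7 and A0-A6.

```asm
SRC     EQU $100000     ; hypothetical source address
DST     EQU $140000     ; hypothetical destination address
WORDS   EQU 131072      ; 256K bytes expressed as a word count

bench:                          ; note: destroys the caller's D0-D7/A0-A6
    MOVE.W  #256-1,-(SP)        ; pass counter on the stack, DBRA-style (N-1)
.pass:
    LEA     SRC,A0              ; reload the parameters, memmovem consumes them
    LEA     DST,A1
    MOVE.L  #WORDS,D0
    BSR     memmovem
    SUBQ.W  #1,(SP)             ; one pass done
    BPL     .pass               ; loop until the counter goes negative
    ADDQ.L  #2,SP               ; drop the counter
    RTS
```

Changing the BSR target to memmove times the simple loop instead.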

Another potential speedup for moving big blocks of data around exists on the '040: the MOVE16 instruction. However, this has its own stipulations (mumbles something about cache blocks) and I don't have an '040, so I won't go into it.

Another note is that the MOVEM routine doesn't work so well for moving from a single I/O address, unless you specifically design your I/O mapping to allow it. You can set up the I/O decoding to ignore the lowest 6 bits of the address and use ADDR6, ADDR7, etc. in place of the low order bits. You can then safely MOVEM to/from the first occurrence of the I/O register you intend to transfer to/from, and the subsequent transfers will use higher addresses, which still correspond to the same I/O register. This could be useful for fast peripherals that can run more or less at the speed of the bus (and will make the bus wait if they can't) like a PATA HDD, for example (or CF, same thing almost).
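
A minimal sketch of the write direction, assuming a 16-bit peripheral register decoded as described. PORT and port_write are hypothetical names, and the address is a placeholder; the decoder is assumed to ignore address bits 0-5, so PORT through PORT+63 all select the same register. MOVEM.W is used so that each register carries exactly one word for the port.

```asm
PORT    EQU $00F00000           ; hypothetical, must sit on a 64-byte boundary

; A0 - source buffer (word alignment assumed)
; D0 - number of 12-word groups, minus 1 (16-bit DBRA count)
port_write:
    LEA     PORT,A1             ; A1 stays fixed, so no ADDA is needed
.loop:
    MOVEM.W (A0)+,D1-D7/A2-A6   ; 12 words from the buffer (sign-extended, harmless here)
    MOVEM.W D1-D7/A2-A6,(A1)    ; 12 word writes to PORT, PORT+2, ... PORT+22
    DBRA    D0,.loop
    RTS
```

Only the write direction is sketched here. For reads, check whether your CPU performs an extra bus read at the end of a memory-to-register MOVEM (the original 68000 is documented as doing so), since that would consume an extra word from a FIFO-style port.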

Additionally, moves of a constant size using MOVEM can potentially be unrolled without taking as much code space as unrolling a simple MOVE.L would. Some smaller transfers may take only a single pair of MOVEM to complete as well, meaning no loop overhead at all.
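
As a sketch of the unrolled case, here is a fixed 96-byte copy built from two MOVEM pairs with no loop. The copy96 name and the 96-byte size are just for illustration; it assumes word-aligned pointers and a caller that doesn't need D1-D7/A2-A6 preserved.

```asm
; A0 - source (word alignment assumed), advanced by 96 on return
; A1 - destination (word alignment assumed), unchanged on return
copy96:
    MOVEM.L (A0)+,D1-D7/A2-A6   ; long words 0-11
    MOVEM.L D1-D7/A2-A6,(A1)
    MOVEM.L (A0)+,D1-D7/A2-A6   ; long words 12-23
    MOVEM.L D1-D7/A2-A6,48(A1)  ; displacement takes the place of the ADDA
    RTS
```

The four MOVEMs come to 18 bytes of code, versus 48 bytes for 24 unrolled MOVE.L (A0)+,(A1)+ instructions.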

Discussions

Lars Brinkhoff wrote 10/05/2017 at 05:21

In the olden Atari ST days, we'd code this kind of thing in assembly language and painstakingly tally up the cycle count of every instruction. It was pretty clear that MOVEM was a win for block moves.

The sound chip registers worked the way you describe. There were two registers at FF8800 and FF8802, but they were shadowed across the entire FF8800-FF88FF area.
