DMA on the STM32H7

DMA on the STM32H7 is a beast, with each incremental improvement in the hardware represented by a different interface.  There's the BDMA, the regular DMA, & finally the MDMA.  The mane one used for accessing GPIOs is the regular DMA.

The mane use of DMA is making a parallel bus by firing data at all 16 lines of a GPIO register & using a timer as the clock.  Despite the 400MHz core, bit banging a GPIO only goes at 16MHz, so you need some kind of hardware support.  There are a few limitations.  The only timers which can drive DMA transfers over GPIOs are TIM1 & TIM8.  Only DMA2 can access the GPIOs.  The most useful information came from:

https://community.st.com/thread/41701-stm32f7-dma-memory-to-gpio-by-timer-problem

That thread has a complete listing which actually works, once you move the address pointer to AXI RAM & fix all the mistakes he discovered.  The STM32F7 code is interchangeable with the STM32H7.

https://community.st.com/thread/48054-stm32h7-spi-does-not-work-with-dma

This one has a note about the address pointer.

The TIM_HandleTypeDef has an array of DMA_HandleTypeDefs which cause various timer events to trigger DMA transfers.

FIFOMode must be DMA_FIFOMODE_ENABLE & FIFOThreshold is key to maximizing the bandwidth.  DMA_FIFO_THRESHOLD_1QUARTERFULL gave the best results.

MemBurst only worked with DMA_MBURST_SINGLE.
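A minimal sketch of that configuration, assuming DMA2_Stream5 triggered by the TIM1 update event (the stream, handle names, & request are placeholders; adapt them to your wiring):

DMA_HandleTypeDef gpio_dma;
TIM_HandleTypeDef tim1_handle;

gpio_dma.Instance = DMA2_Stream5;                 // any DMA2 stream
gpio_dma.Init.Request = DMA_REQUEST_TIM1_UP;      // routed to the TIM1 update event
gpio_dma.Init.Direction = DMA_MEMORY_TO_PERIPH;   // RAM -> GPIO ODR
gpio_dma.Init.PeriphInc = DMA_PINC_DISABLE;
gpio_dma.Init.MemInc = DMA_MINC_ENABLE;
gpio_dma.Init.PeriphDataAlignment = DMA_PDATAALIGN_HALFWORD;
gpio_dma.Init.MemDataAlignment = DMA_MDATAALIGN_HALFWORD;
gpio_dma.Init.Mode = DMA_CIRCULAR;
gpio_dma.Init.Priority = DMA_PRIORITY_HIGH;
gpio_dma.Init.FIFOMode = DMA_FIFOMODE_ENABLE;                   // required
gpio_dma.Init.FIFOThreshold = DMA_FIFO_THRESHOLD_1QUARTERFULL;  // best bandwidth
gpio_dma.Init.MemBurst = DMA_MBURST_SINGLE;                     // only value that worked
gpio_dma.Init.PeriphBurst = DMA_PBURST_SINGLE;
HAL_DMA_Init(&gpio_dma);

// put it in the timer's DMA handle array so update events trigger the stream
__HAL_LINKDMA(&tim1_handle, hdma[TIM_DMA_ID_UPDATE], gpio_dma);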



HAL_DMA_Start is the command which provides the src & dst addresses.  You have to call SCB_CleanInvalidateDCache(); before & after this, since DMA doesn't touch the cache.  The address for a GPIO output is (uint32_t)&(GPIOC->ODR) & for an input it's (uint32_t)&(GPIOC->IDR).

__HAL_TIM_ENABLE_DMA is the command which starts the actual data transfer, when using timer triggers.
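Putting those together, a sketch of starting a memory to GPIO transfer, using the placeholder handles from above & a waveform buffer in AXI RAM:

uint16_t *waveform = (uint16_t*)0x24000000;   // must be AXI RAM, not DTCM

SCB_CleanInvalidateDCache();
HAL_DMA_Start(&gpio_dma,
    (uint32_t)waveform,           // src
    (uint32_t)&(GPIOC->ODR),      // dst: GPIO output register
    SAMPLES);                     // number of halfword transfers
SCB_CleanInvalidateDCache();

__HAL_TIM_ENABLE_DMA(&tim1_handle, TIM_DMA_UPDATE);  // timer update events now drive the stream
__HAL_TIM_ENABLE(&tim1_handle);                      // start the timer if it isn't already free running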

When using multiple timers to drive clock pins & DMA streams, you have to synchronize the timers.  This is most easily done by setting all the timer_handle.Instance->CNT registers to starting values based on probing with a scope.  All the CNT registers have to be set inside a __disable_irq(); __enable_irq(); block.  Similarly, all the __HAL_TIM_ENABLE_DMA calls need to be made with the IRQs disabled.
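A sketch of that synchronization, assuming TIM1 & TIM8 handles; the CNT offsets are placeholders which have to come from probing with a scope:

__disable_irq();
tim1_handle.Instance->CNT = 0;    // placeholder phase offsets from scope probing
tim8_handle.Instance->CNT = 2;
__HAL_TIM_ENABLE_DMA(&tim1_handle, TIM_DMA_UPDATE);
__HAL_TIM_ENABLE_DMA(&tim8_handle, TIM_DMA_UPDATE);
__enable_irq();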

You must call HAL_DMA_Abort, HAL_DMA_DeInit, & HAL_DMA_Init to restart a DMA transfer.
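So a restart ends up looking something like this, with the same placeholder handle as above:

HAL_DMA_Abort(&gpio_dma);
HAL_DMA_DeInit(&gpio_dma);
HAL_DMA_Init(&gpio_dma);
HAL_DMA_Start(&gpio_dma, (uint32_t)waveform, (uint32_t)&(GPIOC->ODR), SAMPLES);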

In the STM32H7, GPIO to DMA operations now have to be done in the AXI RAM (0x24000000) or SRAM1, SRAM2, SRAM3 domanes, but not the DTCM-RAM (0x20000000). 


Speed limitations 

The mane problem is that a single DMA stream writing a GPIO from AXI-RAM maxes out at 28.5MHz.  Any higher & the GPIO stalls every 8 samples.  The DMA doesn't directly access memory, but goes through a FIFO.  The FIFO appears to get starved if the timer fires too fast.  The network analyzer project needs 1 writer DMA stream & 2 reader DMA streams to move 10 bits out & 20 bits in.

With 3 DMA streams moving 30 GPIO lines, the speed drops to 11.7MHz & the streams just lock up if they go any faster.  It's disappointing that a 400MHz core has such slow I/O.  The good news is you can copy data to DTCM-RAM (0x20000000) with the CPU & perform calculations without interfering with the DMA transfers.

It should be noted that 11.7MHz is a lot higher than 28.5MHz / 3, so running more DMA streams in parallel does buy some aggregate bandwidth.  There was more speed to be had.

Overclocking the STM32H7

In the 3 DMA stream case of 11.7MHz, it would be nice to get an even 12MHz.  You can get a few percent more clock cycles through overclocking.  In your SystemClock_Config function, the core clock speed is defined by

external crystal / PLLM * PLLN / PLLP

The lion kingdom got it to 408MHz with 32MHz / 4 * 102 / 2, which gave a 12MHz DMA.  It's very important to use an external crystal that can be desoldered when testing overclocks.  Without a reset halt function in the OpenOCD debugger, the chip is bricked if it can't execute past SystemClock_Config.
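A sketch of the relevant PLL fields for the 408MHz case, assuming a 32MHz crystal & the usual HAL clock divider setup around it:

RCC_OscInitTypeDef osc = {0};
osc.OscillatorType = RCC_OSCILLATORTYPE_HSE;
osc.HSEState = RCC_HSE_ON;
osc.PLL.PLLState = RCC_PLL_ON;
osc.PLL.PLLSource = RCC_PLLSOURCE_HSE;
osc.PLL.PLLM = 4;      // 32MHz / 4 = 8MHz PLL input
osc.PLL.PLLN = 102;    // 8MHz * 102 = 816MHz VCO
osc.PLL.PLLP = 2;      // 816MHz / 2 = 408MHz core
osc.PLL.PLLQ = 2;
osc.PLL.PLLR = 2;
osc.PLL.PLLRGE = RCC_PLL1VCIRANGE_3;   // 8-16MHz PLL input range
osc.PLL.PLLVCOSEL = RCC_PLL1VCOWIDE;
HAL_RCC_OscConfig(&osc);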

Assigning DMA streams to different SRAM domanes

The SRAM is divided into many domanes described in the electrical specifications document page 14 & the reference manual page 109.  AXI-SRAM, SRAM1, SRAM2, SRAM3 can all feed DMA streams in parallel, but SRAM4 is extremely slow.  By pointing 3 DMA streams at different domanes, the speed increases to 22MHz.  With the core overclocked to 432MHz, the DMA increased to an even 24MHz.
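Spreading the buffers out is just a matter of picking base addresses in different domanes.  These are the STM32H743 addresses; check the memory map for your part:

uint16_t *write_buf = (uint16_t*)0x24000000;   // AXI-SRAM, D1 domain
uint16_t *read_buf1 = (uint16_t*)0x30000000;   // SRAM1, D2 domain
uint16_t *read_buf2 = (uint16_t*)0x30020000;   // SRAM2, D2 domain
// SRAM3 at 0x30040000 also works.  SRAM4 at 0x38000000 is extremely slow.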

You might be better off splitting 16 signals between 2 GPIOs & using 2 DMA streams than having a single GPIO firing all 16 lines from a single DMA stream.  A single GPIO register can't be shared between 2 DMA streams without slowing down.

Timing errors when DMA reads a GPIO

Things were looking good when benchmarking with 1 writer stream, but with the 2 other streams reading, the reader FIFOs would once again stall until the speed was reduced to 12MHz.  More importantly, there were significant timing errors when reading from the GPIO, causing bit errors anywhere above 1MHz.

Skipping DMA & just bit banging

Simply bit banging the IDR & ODR registers managed to hit 6MHz without any bit errors.  Careful synchronization with a free running PWM generator, careful ordering of the GPIO bit banging, & overclocking to 432MHz managed to hit 10MHz.  DMA has some indeterminate timing errors when multiple streams try to read GPIOs.  With GCC, you have some control over when the GPIO is accessed.

uint8_t gpioc = GPIOC->IDR;
uint16_t gpioe = GPIOE->IDR;

*dst1++ = gpioc;
*dst2++ = gpioe;

This actually gets GCC to generate 2 consecutive loads from the GPIOs, then 2 consecutive stores into RAM.  The trick is to set timer_handle.Instance->CNT so the clock pins fire & the ADC latches new data during the stores.

There doesn't seem to be any way to force when multiple DMA streams read a GPIO, so it's worthless for reading all but the slowest parallel data.

Introducing the MDMA

The MDMA is not a drug, but a new DMA module on the STM32H7.  The datasheet says the MDMA can read any RAM area & write directly to any peripheral address faster than the DMA2.  There are no examples of this actually being done.  The closest example they provide transfers from a UART to SRAM using the conventional DMA, then from SRAM to SRAM using the MDMA.

There is no Request value which causes a timer to trigger the MDMA.  The MDMA takes a software trigger (HAL_MDMA_Start_IT) & fires data to the peripheral as fast as possible until an entire buffer is transferred.  In the case of writing 16 GPIO pins from a buffer, you can do:

#define SAMPLES 32768

uint16_t *waveform = (uint16_t*)0x24000000;

MDMA_HandleTypeDef dac_mdma;

dac_mdma.Instance = MDMA_Channel0;
dac_mdma.Init.Request = MDMA_REQUEST_SW;
dac_mdma.Init.TransferTriggerMode = MDMA_REPEAT_BLOCK_TRANSFER;
dac_mdma.Init.Priority = MDMA_PRIORITY_HIGH;
dac_mdma.Init.Endianness = MDMA_LITTLE_ENDIANNESS_PRESERVE;
dac_mdma.Init.DataAlignment = MDMA_DATAALIGN_PACKENABLE;
dac_mdma.Init.SourceBurst = MDMA_SOURCE_BURST_128BEATS;
dac_mdma.Init.DestBurst = MDMA_DEST_BURST_128BEATS;
dac_mdma.Init.BufferTransferLength = 128;
dac_mdma.Init.SourceDataSize = MDMA_SRC_DATASIZE_HALFWORD;
dac_mdma.Init.DestDataSize = MDMA_DEST_DATASIZE_HALFWORD;
dac_mdma.Init.SourceInc = MDMA_SRC_INC_HALFWORD;
dac_mdma.Init.DestinationInc = MDMA_DEST_INC_DISABLE;
dac_mdma.Init.SourceBlockAddressOffset = 0;
dac_mdma.Init.DestBlockAddressOffset = 0;

result = HAL_MDMA_Init(&dac_mdma);

result = HAL_MDMA_Start_IT(&dac_mdma, (uint32_t)waveform, (uint32_t)&(GPIOC->ODR), SAMPLES * 2, 1);

That fires 65536 bytes at the GPIO, 128 bytes at a time.  It achieves 160 megasamples/sec, much faster than the conventional DMA.  You can only adjust the speed by changing the system clock speeds or oversampling the data.

The problem is the MDMA doesn't have a proper FIFO, so it transfers a block of 128 bytes, waits to load another block, & transfers again.  Since the GPIO is 16 bits, that's 64 GPIO updates between delays.  The datasheet has a note about ping ponging 2 FIFOs, but there's no evidence of how to enable this.  It's acceptable for just transferring data, but not for driving a DAC.

The MDMA shares the same FIFO among all its channels, so 3 channels banging 48 GPIO lines will have very long delays between blocks.  Although the MDMA can read from DTCM RAM, the DTCM RAM is much slower than the AXI RAM.  Increasing AHBCLKDivider slows down the GPIO writes, but also makes the FIFO reload take longer.  Decreasing AHBCLKDivider bricks the chip.  Recovery required desoldering the crystal, which only worked because the firmware wasn't using the internal oscillator by default.

The MDMA can only transfer 65536 bytes in a single buffer & can only resend the buffer 8 times before another call to HAL_MDMA_Start_IT is required.  Using a BufferTransferLength of 2, the MDMA can fire continuously at 16MHz until its buffer resends are exhausted.  When MDMA & DMA2 are run simultaneously, DMA2 ends up clobbering the MDMA bandwidth after a while.
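One way to keep a waveform going past the resend limit is to restart the transfer from the completion callback.  A minimal sketch, assuming the dac_mdma handle & waveform buffer from the listing above, with the MDMA interrupt enabled in the NVIC & HAL_MDMA_IRQHandler called from its ISR:

void mdma_done(MDMA_HandleTypeDef *hmdma)
{
    // restart the transfer; there is still a short gap between buffers
    HAL_MDMA_Start_IT(hmdma, (uint32_t)waveform, (uint32_t)&(GPIOC->ODR), SAMPLES * 2, 1);
}

// during setup, after HAL_MDMA_Init:
HAL_MDMA_RegisterCallback(&dac_mdma, HAL_MDMA_XFER_CPLT_CB_ID, mdma_done);
HAL_MDMA_Start_IT(&dac_mdma, (uint32_t)waveform, (uint32_t)&(GPIOC->ODR), SAMPLES * 2, 1);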

Impedance matching

The network analyzer proved very noisy.  Part of the problem was noisy analog reference voltages.  A capacitive multiplier might have fixed that.

Interestingly, the lion kingdom was berated by a CTO on the matter of the GPIO lines experiencing reflections & causing radio interference with nearby analog signals.  On a board measuring microvolt analog signals like this one, the radiation at even 10MHz would be a problem.  The impedance matching fix is typically a series resistor, forming a voltage divider which divides down the reflections as they travel back & forth.  10R to 100R are usual values.

https://electronics.stackexchange.com/questions/7709/why-put-a-resistor-in-series-with-signal-line