
[E3][R] Renesas RA8D1 and RAM

A project log for Tetent [gd0090]

A water resistant, 300WPM input peripheral for end-to-end workflows.

kelvinA • 06/07/2024 at 20:07 • 0 Comments

It sounds like Renesas is coming out with a chip specifically to "bridge the gap between MCUs and MPUs": the RA8D1, which achieves 6.4 CoreMark/MHz using a Cortex-M85 (so it's more likely to have embedded Rust support). On its single 480MHz core, it eclipses the ESP32P4 and allegedly gets 3000 CoreMark. It also has a (relatively large) 176-pin QFP alongside its smaller BGA:

I wish these chip manufacturers also bridged the gap between QFN and BGA with something like a dual-perimeter BGA:

Even a 3-perimeter BGA seems fanout-able with 2 layers: 

This is a 180-ball package that's only 12 x 12mm despite its 0.8mm ball spacing. The U5G7's 100 LQFP is 14 x 14 and the RA8D1 is 24 x 24 for the 176 LQFP. I've also heard that BGAs were invented back in the day because these large QFPs were more fragile.

Anyway, it does actually exist on Digikey in the same price range as the U5G7 and it seems that they're fresh on shelves because practically 0 data has been added for them meaning that they would've been insta-filtered out during my searches:

There's no mention of vector support, but the datasheet does say:

The 2D Drawing Engine (DRW) provides flexible functions that can support almost any object geometry rather than being bound to only a few specific geometries such as lines, triangles, or circles.

That sounds like pseudo vector support.

There's only 1MB of on-chip RAM and, for the as-yet-unavailable 2MB Flash + MIPI DSI BGA version (highlighted in green below), the extra 1MB of flash will cost £1 more than its otherwise identical counterpart (right at the bottom of the list). Considering that I was planning to just put the firmware assets on the MicroSD card like Linux for a Raspberry Pi, the 1MB flash option is likely fine (as long as it's actually cheaper, which currently isn't the case for the available options).

Since the RAM is a bit low, the best Octo-SPI options are £1.70 for 8MB and £3.80 for 32MB, both of which are BGA.

Dropping down to Quad-SPI has many more options and at lower price points, such as 

Considering that this is partially serving as a testbench, it makes the most sense to go with the 8WSON packages:

The 32MB BGA does make some sense too though, again considering that I just have to place the chip on the pad and heat it to solder it, skipping a step.

From what I can understand, Octo-SPI is more ideal as an entire byte can be transferred. Because of this, I suspect that it also has different names such as "Parallel" and "HyperBus". Searching using that, I get cheaper and larger options

The latter of which has a copious amount of unused pins:

I counted 24 not connected. Oh, but this is the TSOP not the BGA
I count 24 pins actually being used.

Why have chip manufacturers done this???

Anyway, I've also noticed that the QSPI is clocked at 104 - 133MHz but the OSPI is usually 200MHz, so there's probably a large speed difference between them. This is confirmed in an ISSI leaflet:

ISSI’s Octal flash delivers 400 MB/s of read bandwidth, which is over 4x times faster than a Quad SPI Flash.
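The 400 MB/s figure can be sanity-checked from bus width and clock alone. This is only a rough sketch: it assumes the octal part runs double data rate at 200MHz and the quad parts run single data rate at the 104 - 133MHz clocks mentioned above, and it ignores command, address and latency cycles. The function name and the operating points are my own, not from any datasheet.

```python
def peak_bandwidth_mb_s(bus_bits: int, clock_mhz: float, ddr: bool) -> float:
    """Peak payload rate in MB/s, ignoring command, address and latency cycles."""
    transfers_per_clock = 2 if ddr else 1
    return bus_bits / 8 * clock_mhz * transfers_per_clock


# Assumed operating points (not taken from any specific part's datasheet).
octal = peak_bandwidth_mb_s(bus_bits=8, clock_mhz=200, ddr=True)
quad_fast = peak_bandwidth_mb_s(bus_bits=4, clock_mhz=133, ddr=False)
quad_slow = peak_bandwidth_mb_s(bus_bits=4, clock_mhz=104, ddr=False)

print(f"Octal DDR @ 200MHz: {octal:.1f} MB/s")
print(f"Quad SDR @ 133MHz:  {quad_fast:.1f} MB/s (octal is {octal / quad_fast:.1f}x faster)")
print(f"Quad SDR @ 104MHz:  {quad_slow:.1f} MB/s (octal is {octal / quad_slow:.1f}x faster)")
```

Under those assumptions the octal bus comes out at 400 MB/s, which matches the leaflet, and the gap to quad is comfortably "over 4x". A quad part with DDR would narrow it.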

Anyway, it's only because of the leaflet that I found out that there's Octo FLASH in addition to Octo RAM, and it seems most of the chips I just found are of the FLASH variety. The 4MB HyperBus is indeed RAM though. The next cheapest option is some 64MB DDR2 for 198p:

In the STM32 application note, it sounds like it doesn't really matter if it's FLASH or RAM:

The wording is a bit different though. Sounds like the framebuffer can live inside RAM but can only take assets from FLASH.

Not all too sure about how it's done in the RA8 though. Meanwhile in the U5Gx, the interface can even accept a 16-bit bus at up to 160MHz. I think it's Double Data Rate too, meaning that it would be as fast as on-chip, 32-bit SRAM. There's seemingly only 2 chips that do this at Mouser:

Conclusions

All this research has given me even more confidence to just stick with the U5G7 (or potentially the BGA U5G9 if I'm feeling daring). Literally every option vs the U5Gx looks like:

I do like the spikes and misty blues, but the STM32 route is just so smooth that there might even be a travelator built into the floor.

Essentially, all the other options either have uncertainties (P4: release date, vector performance) or only performance benefits that I'd need to know in advance that Tetent needs to justify the drawbacks.

[June 9]

I was thinking of the potential to use the U5Gx as the "main MCU" and a chip like the RA8D1 as a "discrete accelerator MCU" after thinking about the "over 3000 CoreMark" claim. For a point of reference, the Raspberry Pi 3 gets 3800 single-core CoreMark points.

Currently on Digikey, it seems that only the 176-pin is available, with a different but slower 400MHz 144-pin chip £1 cheaper (that I don't think has MIPI). CoreMark points would be 3067 and 2556 respectively, and it seems that, with cache on, the current consumption is 318 uA/MHz, or 152.6mA at 480MHz. There is another measurement in the datasheet under "Calculation guide of maximum current" that says that the current is 147mA for CoreMark.
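For reference, those numbers are just per-MHz figures scaled by clock speed. In this quick sketch the 6.39 CoreMark/MHz value is my assumption, back-calculated from the 3067 score (this log rounds it to 6.4), and the function names are made up for illustration.

```python
COREMARK_PER_MHZ = 6.39   # assumption: inferred from 3067 / 480, rounded to 6.4 in this log
RUN_CURRENT_UA_PER_MHZ = 318  # cache-on figure quoted above


def coremark_score(clock_mhz: float, per_mhz: float = COREMARK_PER_MHZ) -> float:
    """Estimated CoreMark score, assuming it scales linearly with clock."""
    return per_mhz * clock_mhz


def run_current_ma(clock_mhz: float, ua_per_mhz: float = RUN_CURRENT_UA_PER_MHZ) -> float:
    """Estimated run current in mA from a uA/MHz figure."""
    return ua_per_mhz * clock_mhz / 1000


for clock in (480, 400):
    print(f"{clock}MHz: {coremark_score(clock):.0f} CoreMark, "
          f"{run_current_ma(clock):.1f}mA")
```

Linear scaling is itself an assumption; memory wait states and cache behaviour can bend it in practice.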

The thing is that there's also the STM32H7 series, and the H7R3 is essentially the 2976 CoreMark cousin of the U5G7 without the conveniently large storage nor vector accelerator. It's £6.88. It also consumes 71mA. Half price, half power, same performance (specifically on CoreMark). From what I understand, the Cortex M85 is much faster than the Cortex M7 in machine learning and digital signal processing applications.

Ideally, I'd just use the H7R3 as memory is probably straightforward enough to add via dual OSPI (allowing for an addressable space of 4.7MB RAM, 8MB FLASH for £3), but the VG is doing quite a lot:

• Vector graphic acceleration
  – Path drawing (lines, polygons, rectangles, arcs, ellipses, circles)
  – Bezier curves (cubic and quadratic)
  – Path transformation (3x3 matrix)
  – Path stroking
  – Filling (even-odd and non-zero with 8x MSAA anti-aliasing)
  – Gradient generation (linear, radial, conic)

It's also still much more likely that I only get Tetent finished to an MVP status and hastily redirect my attention to other things, meaning that there is a low probability that I'd ever get to program anything that would actually use the performance.

This strategy is the idea that I can go with the U5Gx for everything I need and then extend the performance with whatever chip is suitable where, by that time, ST might have already come out with an M85 (or better) chip.

[June 11]

If I search Hyperbus only, there's actually a 16MB HSPI-compatible RAM chip for not much more than the 4MB OSPI-compatible:

W957D6NBX5I, which unfortunately is a 0.5mm pitch.

Since these RAM chips and the display prism both run at 1.8V, and the MCU current consumption is unaffected by the input voltage, it seems to make most sense to use a 1.8V main logic rail.

Another thing to note is that, even with the lower power consumption of HyperRAM, the power envelope is still in the mW range:

Thus, I looked in the 128Mbit datasheet and the read/writes are 13mA, so 23.4mW at 1.8V. This current consumption is the same as the entire U5Gx running at 140MHz (peripherals disabled, SRAM enabled).

Since I don't need MIPI-DSI for any of Tetent, Tetrescent or Tetinerary, and 2.4MB of application RAM (see image below) is relatively huge for embedded systems, it sounds like the STM32U5G7 has won this.

I'm planning to dedicate SRAM6 to the frame buffer (if they need one continuous bank of RAM) and then have SRAM1, 3 and 5 for application code.

[June 13] I'm looking through an ST presentation and found this:

I looked into why 2 framebuffers are used, and both the TouchGFX site and the LTDC application note say it's to prevent screen tearing:

Double buffering is a technique that uses double framebuffers to avoid displaying what is being written to the framebuffer.

Additionally, there is talk of a "partial frame buffer" requiring RAM on the display itself. Since I'm targeting 400 x 480 (a 400x400 square-widget-like GUI and 400x80 header/footer for things like a clock), it means that a good chunk of the frame buffers might be black space.

So I'm now reading. A cool thing is that it supports resolutions up to 4096 across and 2048 down with an 83MHz pixel clock (so the refresh rate at max resolution would be about 9.9Hz). The 3.5" display supports up to 27MHz, typically 19.8MHz.
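The refresh rate is simply the pixel clock divided by the number of pixel periods per frame. In this sketch the blanking arguments default to zero because I don't have the porch timings to hand, so the results are upper bounds. Pairing the 400 x 480 target with the display's pixel clocks is also my assumption.

```python
def refresh_hz(pixel_clock_hz: float, width: int, height: int,
               h_blank: int = 0, v_blank: int = 0) -> float:
    """Frames per second for a given pixel clock.

    h_blank / v_blank are the extra pixel periods per line and extra lines
    per frame (sync + porches). Left at 0 here, which gives an upper bound.
    """
    total_pixels = (width + h_blank) * (height + v_blank)
    return pixel_clock_hz / total_pixels


# Maximum supported resolution at the maximum pixel clock.
print(f"4096 x 2048 @ 83MHz:  {refresh_hz(83e6, 4096, 2048):.2f} Hz")

# Assumed pairing: the 400 x 480 target driven at the display's typical/max clock.
print(f"400 x 480 @ 19.8MHz: {refresh_hz(19.8e6, 400, 480):.1f} Hz")
print(f"400 x 480 @ 27MHz:   {refresh_hz(27e6, 400, 480):.1f} Hz")
```

Real timings include blanking, so the achievable rates will be somewhat lower than these.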

However, it doesn't seem that there are any exposed pins on the display to configure anything, so it's not like I can tell the screen to expect a lower resolution.

Also, the RAM needs to be "contiguous". Considering that ST's demo kit is 800 x 480 at 24bpp, I assume that it doesn't mean that the entire framebuffer needs to fit in a single SRAM block because that would be 1,152KB each.
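As a quick check of the framebuffer arithmetic, here is a small sketch. It assumes pixels are tightly packed (no per-line padding or alignment) and uses decimal KB (1KB = 1000 bytes), which is what the 1,152KB figure corresponds to. The colour depths tried for the 400 x 480 target are just illustrative.

```python
def framebuffer_bytes(width: int, height: int, bits_per_pixel: int,
                      buffers: int = 1) -> int:
    """Bytes needed for tightly packed framebuffers (no line padding assumed)."""
    total_bits = width * height * bits_per_pixel * buffers
    return (total_bits + 7) // 8  # round up to a whole byte


# ST demo kit resolution, single buffer.
demo = framebuffer_bytes(800, 480, 24)
print(f"800 x 480 @ 24bpp: {demo / 1000:,.0f}KB ({demo / 1024:,.0f}KiB)")

# 400 x 480 target, double buffered, at a few illustrative depths.
for bpp in (16, 18, 24):
    size = framebuffer_bytes(400, 480, bpp, buffers=2)
    print(f"2 x 400 x 480 @ {bpp}bpp: {size / 1000:,.0f}KB")
```

If the display controller needs each pixel padded to a whole number of bytes, the 18bpp case would cost the same as 24bpp.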

For the first point, it seems that ST is already ahead of me; a quick search soon led me to the GFXMMU application note, which deals with this black-unused-screen stuff, even for circular displays:

• Lower memory usage according to display shape
• Fully configurable display shape
• Transparent integration
• Works with any memory of the system

So I can probably assume that 2 framebuffers at 18bpp (which is the max colour depth the 3.5" supports, and if it looks good on that then I doubt I'd need 24bpp on Tetinerary) would be 864KB of RAM and there'd be 2.1MB left available for the system. Since the first and last visible pixel per line is input to set up this peripheral, rounded corners (and probably even a PearPhone) should work to shave just a little bit more off.

Source

Unfortunately, there's "granularity" meaning that I might not save much:

There's also this in the HSPI application note:

HSPI addressable space is from 0xA000 0000 to 0xAFFF FFFF.

This seems to imply that the max RAM supported is 268 million... something. It might be 256MB, or perhaps it's 16-bit words, thus 512MB. In practice, it's basically 256MB. Why? Well this article says:

But if your Flash chip is larger than the 256MB of internal memory space dedicated to the QSPI peripheral, you’ll need to use the indirect read mode to read data which is located after the first 256MB.

and the errata sheet says:

In indirect write mode, if the address is greater than 256 Mbytes, the indirect write is not performed at the targeted address {...} Actually, this write operation takes place within the 256-Mbyte memory space, thus corrupting the memory content.

Indirect read operations are not impacted.

RAM you can only read from is useless. Good thing I stumbled on an errata sheet for some other chip to find out such a document even exists.
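The "268 million... something" is the size of the memory-mapped window in bytes, which works out to exactly 256MB (binary megabytes). A minimal sketch of that arithmetic follows; the constant and function names are mine, and it assumes the window is byte-addressed.

```python
HSPI_FIRST = 0xA000_0000  # first address of the window quoted above
HSPI_LAST = 0xAFFF_FFFF   # last address of the window quoted above


def window_bytes(first: int, last: int) -> int:
    """Number of byte addresses in an inclusive range."""
    return last - first + 1


def is_memory_mapped(device_offset: int) -> bool:
    """True if a byte offset into the external chip fits inside the window."""
    return 0 <= device_offset < window_bytes(HSPI_FIRST, HSPI_LAST)


size = window_bytes(HSPI_FIRST, HSPI_LAST)
print(f"{size:,} bytes = {size // 2**20}MB")

# A 16MB chip fits entirely; anything past 256MB would not be directly addressable.
print(is_memory_mapped(16 * 2**20 - 1))
print(is_memory_mapped(256 * 2**20))
```

So with the errata in mind, anything up to 256MB can be read and written through the window, and larger parts would only be fully usable for reads.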

Considering the maker of #SDA - The best new PDA managed with only 0.2MB of RAM, I think I don't need to worry as much:

On a side note, I've now got to look into using removable phone batteries to see if I can replace the non-removable lipo I was planning on.
