MPEG-1 was the standard for VCDs, which held an entire film on a 700-MB disc. That should be quite efficient, and indeed, at 1 Mbps the visual artefacts are still less disruptive than MJPEG's. MPEG-1 is also likely to decode faster than MJPEG, since its reuse of image blocks cuts down on the computation-intensive steps involved (mostly the 8×8 IDCT).
That addresses the storage-efficiency requirement well. Decoding is more of a concern, as Ben had to modify the library and switch to greyscale for 240×240. On closer inspection, the biggest hurdle is the three frame buffers used: one for the current frame, one for the forward-prediction (P) frame, and one for the backward-prediction (B) frame. Each takes 240×240 (L) + 120×120×2 (Cb, Cr) = 86400 bytes. While Ben chose to eliminate the Cb and Cr planes, I think we can instead get by without B-frames, which also reduces the memory footprint by a third, but without compromising video appearance.
Off to porting. Dropping the B-frame buffer was mostly straightforward. I then modified the library to take a user-supplied buffer for frames (which will make memory allocation/reuse, DMA'ing, etc. more flexible later in development and optimization).
Meanwhile, the library was apparently not designed with embedded environments in mind, as its dynamic allocations present a hindrance: apart from the large frame buffers, there are smaller bitstream buffers, one of which grows dynamically with differently-sized packets. That is less than ideal, but does not stop us from running a preliminary test.
On a single RP2040 core at 133 MHz, decoding a four-second excerpt (frames 1476~1572) of the Umiyuri animation takes 6521 ms (1.63× its actual duration), excluding the final conversion from YUV420 to RGB565. (Corresponding commit: ebdc3f2)
This fell short of the real-time goal, but not by much. First, RP2040 has two cores, which in the best case can cut decoding time in half; by that optimistic estimate, we already fit into the time budget. Furthermore, pl_mpeg encapsulates the decoder state in a large struct and passes it around, resulting in many redundant pointer indirections across deeply nested subroutine calls (e.g., the plm_buffer_t * pointer is dereferenced on every call to plm_buffer_read(), which is invoked from a wide range of subroutines). In an embedded context we only need a singleton, so that can probably be optimized away. It also seems there could be a faster, lighter algorithm for decoding the bitstreams, one without the growing buffers.
I will admit that I am quite visually driven; I surely aspired for the best possible playback within the constraints. Given the vast space of possible optimizations, I made the audacious decision to diverge onto the path less travelled by: writing my own decoder. The hypothesis is that a decoder working with static buffers, in singleton mode, with optimized bitstream readers, can complete the heavy lifting with considerably less strain; we will see whether this works out.
A pin apparently had a manufacturing defect (?) that caused it to break open right at the IC package, so the pins had to be manually shorted and mapped in an odd way in firmware. But once that was sorted out, driving the display through DMA'ed SPI, and the audio amplifier through I²S (PIO), was mostly smooth sailing. The core problem, as we predicted, would be efficient video storage and decoding, and in turn, the selection of a video codec.
In the following steps, we will use the animation video for a well-known track, Umiyuri Kaiteitan (ウミユリ海底譚, Tale of the Deep-sea Lily; composed by n-buna, video by Awashima). The animation contains a lot of moving, blurred backgrounds and objects, which makes it an ideal sample for quickly profiling codec approaches. (Anecdote: on streaming websites where weekly Vocaloid compilations are released, excerpts from this animation are often taken by the audience as an indicator of video quality.) We preprocess it by scaling down to 240×240 and masking out content outside the central circular region (the display viewport). The frame rate is kept at 24 fps.
The first experiment is with the QOI format. QOI is a very simple lossless image codec with a decent compression ratio, comparable to that of PNG. Applying it to our video frame by frame, we get a lossless video encoded at ~15 Mbps. Downscaling by half (120×120) yields a much more acceptable 4~5 Mbps:
At the original scale (240×240), decoding is computationally heavy and only runs at 12 fps, but at half scale it is fast enough to run comfortably within RP2040's default 133 MHz system clock. Combining that with QOA-encoded audio processed by my previous implementation uQOA, we get a first working prototype. Here is a recording of the result:
We must admit that this is less than ideal. The downscaled video is blurry and still takes a lot of storage (a two-minute video takes 120 MiB), which adds cost and complexity in storage and means a longer wait during user uploads.
A straightforward idea is to optimize or modify QOI. QOI works by encoding each RGB pixel with one of several possible shortcuts, with a dedicated optimization for consecutive identical runs. Profiling shows that much of the time is spent in its 64-element hash table, which serves as the dictionary of recently seen pixels, but this is largely a tradeoff between space and time (where we aim to optimize both). The encoding scheme specializes in individual RGB8 pixels; modifying it to work in YUV420 would require more extensive work, and the outcome (performance in both time and space) is not easy to predict.
A low-hanging-fruit alternative is MJPEG, which achieves 1~2 Mbps at the original scale and, as a rough estimate for now, should be on par with QOI in decoding speed (while being more flexible and tunable). But if we are already decoding JPEG, why not go for MPEG? Here again, I will be retracing a trodden path.
This started as a birthday gift for a close friend. My envisioned outcome would be a circular little trinket that could play video — similar to a button badge or a bag charm — self-contained, battery-powered, and rechargeable over USB. It should at least support video lasting a few minutes (enough for a music video or a short animation), ideally uploadable through USB at a reasonable speed.
Surely someone must have done this before. Indeed, it has been implemented multiple times with ESP32-series microcontrollers (including a kit on Adafruit) as well as the more lightweight RP2040 (Ben's 2023 Supercon badge hack and a follow-up revision). However, ESP32 does not excel at power-efficient sustained-load operation, while Ben's MPEG-1 approach had to compromise on appearance (either going greyscale or using a smaller screen). RP2040's official Popcorn demo plays QVGA smoothly, but its compression is rudimentary at ~20 Mbps (~40% compression ratio compared to raw 24-bit RGB). Similar commercial products are listed online with a decent battery life of 10+ hours, but they only support seconds-long animations and are priced at CNY 100 (USD 14) or more.
None of the existing solutions quite matched what I wanted: several minutes of smooth, colourized video, running for hours on a small battery. How greedy I am >_< And I am atoning for it by suffering my own prophecy, confining myself to the workbench trying to coalesce with the almighty numen of computation, enmeshed in endless rises and falls of the aetheric force...
After another round of searching for microcontrollers, I decided to retrace the path of RP2040. Its fast system clock combined with the versatile PIO block makes it a perfect fit for smooth video playback, standing out in its range of complexity, power, and price.
I already have RP2040 development boards and a spare 1.28" 240×240 display at hand. Audio is less of a concern; an I²S-interfaced MAX98357 module covers all needs.
Given practice from previous projects, this setup is well within my comfort zone. Still, unknowns remain: how smooth can playback get? The only way to find out is to forge a manifestation.