## Why Deploy Video Detection Models on Embedded Devices?
When we talk about visual AI, many people first think of high-precision models running on servers. In real-world scenarios, however, a large share of video analysis happens at the edge: abnormal-behavior alerts from smart cameras, road-condition prediction in in-vehicle systems, and so on. These scenarios have hard requirements for **low latency** (to avoid delayed decisions), **low power consumption** (devices often run on batteries), and **small size** (the model must fit into the hardware). Streaming video frames to the cloud for processing not only adds network latency but can also drop data when bandwidth is limited, whereas processing locally on the embedded device avoids these problems. **Slimming down** video detection models and deploying them at the edge has therefore become a core requirement for putting them into production.
## Isn't YOLO Sufficient for Visual Detection?
The YOLO series (You Only Look Once) is the benchmark for 2D object detection and is famous for its real-time efficiency, but it is essentially a **single-frame image detector**. When processing video, YOLO can only analyze frame by frame and cannot capture the **spatiotemporal correlations** between frames: a "waving" gesture, for example, may be misjudged as a static raised hand in a single frame, while the continuous motion trajectory across multiple frames makes the intention clear.
In addition, video tasks (such as action recognition and behavior prediction) often require understanding a dynamic process rather than isolated static targets. In a smart-home scenario, for example, recognizing a "pouring water" action requires analyzing the continuous interaction between the hand and the cup, which is difficult for 2D models like YOLO because they cannot model the time dimension.
## Basic Knowledge of Video Detection Models: From 2D to 3D
A video is essentially four-dimensional "time + space" data (width × height × time × channel). Early video analysis often used a hybrid "2D CNN + temporal model" scheme: 2D convolutions extract per-frame spatial features, and a temporal model such as an LSTM then captures the relationships between frames. However, this kind of scheme does not couple the spatial and temporal dimensions tightly enough.
**3D Convolutional Neural Networks (3D CNNs)** perform convolution operations directly in three-dimensional space (width × height × time), and extract both spatial features (such as object shape) and temporal features (such as motion trajectory) through sliding 3D convolution kernels. For example, a 3×3×3 convolution kernel will cover a 3×3 spatial area in a single frame and also span the time dimension of 3 consecutive frames, thus naturally adapting to the dynamic characteristics of videos.
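To make the shape bookkeeping concrete, here is a minimal PyTorch sketch (my own illustration, assuming a 16-frame 112×112 RGB clip, a common input size for these models) of a single 3×3×3 convolution applied to a video tensor:

```python
import torch
import torch.nn as nn

# A single 3D convolution: at every sliding position, the 3x3x3 kernel
# covers a 3x3 spatial patch across 3 consecutive frames.
conv3d = nn.Conv3d(in_channels=3, out_channels=64,
                   kernel_size=(3, 3, 3), stride=1, padding=1)

# Dummy video clip: (batch, channels, time, height, width)
clip = torch.randn(1, 3, 16, 112, 112)

features = conv3d(clip)
print(features.shape)  # torch.Size([1, 64, 16, 112, 112])
```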
## Why Introduce Efficient 3DCNNs Today?
Although 3D CNNs can effectively model video spatiotemporal features, traditional models (such as C3D and I3D) have huge parameter counts and high computational cost (often billions of FLOPs), making them difficult to deploy on embedded devices with limited compute (such as ARM-based chips).
The **Efficient 3DCNNs** proposed by Köpüklü et al. are designed to solve this pain point:
1. **Extremely lightweight design**: Through techniques such as 3D depthwise separable convolution and channel shuffle, parameter counts and computational load are cut by 1-2 orders of magnitude while precision stays high (for example, the FLOPs of 3D ShuffleNetV2 are only about 1/10 of ResNet-18); a rough sketch of these building blocks follows at the end of this section;
2. **Hardware friendliness**: Model complexity can be scaled through a **width multiplier** (such as 0.5x or 1.0x) to match embedded devices with different compute budgets;
3. **Plug-and-play engineering capability**: The open-source project provides complete pre-trained models (supporting datasets such as Kinetics and UCF101), training/fine-tuning scripts, and FLOPs calculation tools, which greatly reduce the threshold for edge deployment.
For scenarios that need real-time video analysis on embedded devices (such as action recognition and gesture control), Efficient 3DCNNs strike an exemplary balance between precision and efficiency.
- **Multi-model comparison and verification**: The paper implements 8 types of 3D CNNs (lightweight models as well as mainstream ones such as ResNet-18/50 and ResNeXt-101) and benchmarks them uniformly on datasets such as UCF101 and Kinetics. The results show that the lightweight models (such as 3D-ShuffleNetV2) offer the best balance of precision and efficiency; on the Jester gesture-recognition dataset, for example, its FLOPs are only about 1/10 of ResNet-18 while the accuracy reaches 92.3%.
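As a rough illustration of points 1 and 2 above (a sketch of the general techniques, not the exact blocks from the official repository), the following PyTorch code shows a 3D depthwise-separable convolution, a ShuffleNet-style channel shuffle, and a width multiplier that scales the channel counts:

```python
import torch
import torch.nn as nn

def channel_shuffle_3d(x: torch.Tensor, groups: int) -> torch.Tensor:
    """Shuffle channels across groups, ShuffleNet-style, for 5D video tensors."""
    n, c, t, h, w = x.shape
    x = x.view(n, groups, c // groups, t, h, w)
    x = x.transpose(1, 2).contiguous()
    return x.view(n, c, t, h, w)

class DepthwiseSeparable3d(nn.Module):
    """3x3x3 depthwise conv followed by a 1x1x1 pointwise conv."""
    def __init__(self, in_ch: int, out_ch: int, width_mult: float = 1.0):
        super().__init__()
        in_ch = max(1, int(in_ch * width_mult))
        out_ch = max(1, int(out_ch * width_mult))
        self.depthwise = nn.Conv3d(in_ch, in_ch, kernel_size=3,
                                   padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv3d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm3d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.depthwise(x)   # per-channel spatiotemporal filtering
        x = self.pointwise(x)   # cheap channel mixing
        return self.act(self.bn(x))

# A 0.5x width multiplier halves both channel counts, roughly quartering
# the pointwise FLOPs.
block = DepthwiseSeparable3d(in_ch=32, out_ch=64, width_mult=0.5)
x = torch.randn(1, 16, 16, 56, 56)  # input channels must match 32 * 0.5 = 16
y = channel_shuffle_3d(block(x), groups=4)
print(y.shape)  # torch.Size([1, 32, 16, 56, 56])
```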
## Test Analysis
(Benchmark table from the paper: inference speed of each model on Titan XP and Jetson TX2, and accuracy on Kinetics-600, Jester, and UCF-101.)
Before digging into the numbers, we need to understand the two test platforms; they provide the baseline for later model selection and tuning.
1. **Titan XP: Representative of High-performance Desktop GPUs**
NVIDIA's former flagship desktop GPU offers strong floating-point throughput and memory bandwidth: a large number of CUDA cores, up to 336 GB/s of memory bandwidth, and about **11.3 TFLOPS** of single-precision (FP32) compute. It handles high-complexity 3D convolutions with ease and serves as a "performance upper bound" reference platform, showing what the models can do under ideal hardware conditions.
2. **Jetson TX2: Pioneer in Embedded AI Computing**
Jetson TX2 is a platform for embedded and edge computing, focused on low power consumption and lightweight deployment. It integrates an NVIDIA Pascal-architecture GPU (256 CUDA cores) alongside a dual-core Denver 2 CPU and a quad-core ARM Cortex-A57 CPU, with overall power consumption of 7.5-15 W. Although its compute (about **1.33 TFLOPS** at FP16) is far below that of the Titan XP, it matches the "edge deployment" requirements of real scenarios and is used to test how practical the models are in resource-constrained environments.
We hope the video analysis algorithm can really run on our smart cameras, and the two major issues we need to weigh are **computational efficiency / real-time performance** and **precision**.
1. Computational Efficiency and Real-time Performance
Smart cameras usually need a detection speed of at least 15 fps (frames per second) so that detection stays real-time without noticeable lag.
- **Titan XP platform**: From the table data, Titan XP is extremely fast: 3D-ShuffleNetV1 0.5x processes 398 video clips per second (cps) on this platform, and even the relatively heavy 3D-SqueezeNet reaches 682 cps. However, Titan XP is a desktop-class GPU with a large footprint and high power consumption, so it cannot be deployed inside a smart camera. Although it shows the models' peak computational efficiency, it has little practical relevance for the smart-camera scenario.
- **Jetson TX2 platform**: Jetson TX2 is an embedded computing platform and is much closer to the hardware conditions of a smart camera. Several lightweight models meet the real-time requirement on it: 3D-ShuffleNetV1 0.5x reaches 69 cps, 3D-ShuffleNetV2 0.25x reaches 82 cps, and 3D-MobileNetV1 0.5x reaches 57 cps, all above the 15 fps target, so real-time detection is feasible. Some models fall short, however; 3D-MobileNetV2 0.7x manages only 13 cps on Jetson TX2 and cannot meet the real-time requirement, as the quick check below illustrates.
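As a back-of-the-envelope check (my own sketch, not from the paper), the quoted clips-per-second figures can be converted to per-clip latency and compared against the 15 fps target, assuming a sliding-window setup that runs one clip inference per output frame:

```python
# Turn clips-per-second (cps) into latency and a pass/fail check against a
# target frame rate. The numbers below are the Jetson TX2 figures quoted above.
TARGET_FPS = 15

jetson_tx2_cps = {
    "3D-ShuffleNetV1 0.5x": 69,
    "3D-ShuffleNetV2 0.25x": 82,
    "3D-MobileNetV1 0.5x": 57,
    "3D-MobileNetV2 0.7x": 13,
}

for model, cps in jetson_tx2_cps.items():
    latency_ms = 1000.0 / cps        # time per clip inference
    realtime = cps >= TARGET_FPS     # one inference per output frame
    print(f"{model:24s} {latency_ms:6.1f} ms/clip  real-time: {realtime}")
```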
2. Model Precision
When smart cameras perform real-time detection they also need adequate precision; otherwise false detections and missed detections increase and the results become unreliable.
- On the Kinetics-600 dataset, lightweight models such as 3D-ShuffleNetV2 2.0x reach 55.17% accuracy. There is a gap compared with heavier models (such as ResNeXt-101 at 68.30%), but given the limited computing resources this is already enough for smart-camera applications that do not demand extreme precision, such as simple behavior classification and detection.
- On the Jester dataset, the 3D-ShuffleNetV2 series is generally accurate: 3D-ShuffleNetV2 2.0x reaches 93.71%, and the 3D-MobileNetV2 series reaches roughly 86%-93%. For gesture recognition, this level of accuracy can reliably distinguish different gestures and support applications such as intelligent interaction.
- On the UCF-101 dataset, accuracy is also sufficient for many practical needs; 3D-ShuffleNetV2 2.0x reaches 83.32%, which is enough for applications such as human action recognition.
Back to the RV1126B platform: its NPU delivers up to 3 TOPS @ INT8. Its real-time performance and power consumption may well be better than Jetson TX2's, but accuracy will drop to some degree. 3D CNNs are computationally heavy models, yet according to the test results the lightweight networks already exceed 76% accuracy on the two relatively simple action datasets, Jester and UCF-101. Even if the quantized model loses some accuracy on RV1126B, I believe that with targeted training and fine-tuning the Efficient 3DCNNs can still handle these video analysis tasks.
The biggest difficulty, however, is platform adaptation. The RV1126B NPU only accepts model files in the rknn format, and at present there is no official tool that converts Efficient 3DCNNs models to rknn. That means developers have to implement the missing operators and do the platform adaptation themselves, which is also work that reCamera may take on in the future.
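If one were to attempt that adaptation, a plausible first step (my own sketch, using a placeholder network rather than the real Efficient 3DCNNs model) would be exporting the PyTorch model to ONNX, since Rockchip's conversion tools generally take an ONNX graph as input; the hard part, unsupported 3D operators, would still remain:

```python
import torch
import torch.nn as nn

# Placeholder stand-in for an Efficient 3DCNNs network; the real model would
# be built from the open-source repository and loaded with trained weights.
model = nn.Sequential(
    nn.Conv3d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.AdaptiveAvgPool3d(1),
    nn.Flatten(),
    nn.Linear(16, 27),  # e.g. the 27 Jester gesture classes
)
model.eval()

# Common clip format for these models: 16 frames of 112x112 RGB.
dummy_clip = torch.randn(1, 3, 16, 112, 112)

torch.onnx.export(
    model, dummy_clip, "efficient_3dcnn.onnx",
    input_names=["clip"], output_names=["logits"], opset_version=12,
)
# Converting the resulting ONNX graph to .rknn would still require the
# Rockchip toolchain to support the 3D operators (Conv3d, 3D pooling),
# which, as noted above, is not yet the case for this model family.
```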
In general, RV1126B has enough computing power and performance to support the Efficient 3DCNNs video analysis algorithms, but more work is needed on platform adaptation.
Deng MingXi