After trying it with some sacrilegious videos playing on a monitor, the mane point is it falls apart when 2 humans are close together in a scene. There's a lot more noise in the positions & the accuracy is definitely worse than when tracking a single lion standing up. The camera would be constantly moving around like an unstabilized phone cam.
The windowing algorithm creates a lot of noise & causes the hits to move around even when the video is paused.
Normally, it only detects 1 of the 2 humans or it detects both humans as a single very noisy hit. It probably detects only 1 human at a time when they're together, with the hit oscillating between the 2 humans.
The most desirable horizontal body positions aren't detected at all. It's a bigger problem in practical video because most of the poses are horizontal.
Lowpass filtering & optical flow could suppress some of the noise, but the servos already provide a lot of lowpass filtering. The lion kingdom has long dreamed of running a high fidelity model at 1fps & scanning the other frames with optical flow. This would entail a 1 second lag for that high fidelity model. Every second, after getting an inference, it would have to use optical flow to fill in 1 second of frames to catch up to the present. It might be doable with low res images.
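The catch-up step described above can be sketched without any real model or optical flow library. This is only a simulation under loud assumptions: the names LaggedTracker, catch_up & simulate are hypothetical, the optical flow is reduced to a single (dx, dy) displacement per frame at the tracked point, & the slow model is assumed to return a perfect position exactly 1 second (30 frames) late.

```python
from collections import deque


def catch_up(detection, flows):
    """Advance a stale (x, y) detection through the buffered per-frame motion."""
    x, y = detection
    for dx, dy in flows:
        x += dx
        y += dy
    return (x, y)


class LaggedTracker:
    """Hypothetical tracker: slow model results arrive lag_frames late."""

    def __init__(self, lag_frames):
        # Only the motion since the frame being inferred needs to be kept.
        self.flows = deque(maxlen=lag_frames)
        self.position = None

    def on_frame(self, flow, detection=None):
        """flow: (dx, dy) from the previous frame to this one.
        detection: model result for the frame lag_frames ago, if one arrived."""
        self.flows.append(flow)
        if detection is not None:
            # Replay the buffered motion to bring the old result to the present.
            self.position = catch_up(detection, self.flows)
        elif self.position is not None:
            # Between inferences, just follow the per-frame motion.
            self.position = (self.position[0] + flow[0],
                             self.position[1] + flow[1])
        return self.position


def simulate(lag=30, frames=60, velocity=(2, 1)):
    """Assumed scenario: target moves at a constant velocity from (0, 0)."""
    tracker = LaggedTracker(lag)
    estimates = []
    for n in range(frames):
        flow = velocity if n > 0 else (0, 0)
        detection = None
        if n >= lag and n % lag == 0:
            # The model reports where the target was lag frames ago.
            old = n - lag
            detection = (velocity[0] * old, velocity[1] * old)
        estimates.append(tracker.on_frame(flow, detection))
    return estimates


if __name__ == "__main__":
    est = simulate()
    print(est[29], est[30], est[59])
```

In a real build the per-frame displacement would come from optical flow on low res images, & its errors would accumulate over the second between inferences, which this simulation ignores.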
Face tracking might give better results, but it would lose the ability to center on the body.
The Intel Movidius is the only embedded neural engine still made. It's a lot more expensive than the equivalent amount of computing power 3 years ago, but that might be a fair price if computing power is permanently degraded.
The only useful benchmark lions could find showed it getting 12fps on an undisclosed efficientdet model. It can be inferred from the words "up to 12fps" in his sales pitch that it's the fastest efficientdet model. That's 50% faster than the fastest efficientdet in software, not really enough to improve upon the quality seen.
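As a quick check of what that comparison implies, the arithmetic can be worked backwards. This assumes "50% faster" means 1.5x the frame rate; the function name is made up for illustration.

```python
def implied_baseline_fps(accel_fps, percent_faster):
    """Frame rate of the slower system, given the faster one & its advantage."""
    return accel_fps / (1 + percent_faster / 100)


software_fps = implied_baseline_fps(12.0, 50)   # implied software frame rate
software_ms = 1000 / software_fps               # time per frame in software
accel_ms = 1000 / 12.0                          # time per frame on the accelerator

if __name__ == "__main__":
    print(software_fps, software_ms, round(accel_ms, 1))
```

Under that assumption the software figure works out to 8fps, so the accelerator would only shave about 42ms off each 125ms frame.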
All GPU systems are all or nothing. They can't split a single model's work between the CPU & the GPU, but offloading the whole model would free the CPU to run a 2nd face detection model or optical flow concurrently. That would give more like a 250% increase.