Subject tracking is rapidly becoming the next big thing, 1st on quad copters, then on digital assistants. It's long been a dream to have an autonomous camera operator that tracks a subject. The Facebook Portal was the 1st sign lions saw that the problem was finally cracked. The problem is all existing tracking cameras are operated by services which collect the video & either sell it or report it to government agencies.
Compiling & fixing a machine vision library to run as fast as possible on a certain computer is such a monumental task, it's important to reuse it as much as possible. To simplify the task of a tracking camera, the same code is used to count reps & track a subject. The countreps program was a lot more complicated & consumed most of the time.
Previous work way back in Aug 2016 involved LIDAR.
There were other failed attempts with chroma keying, luma keying, difference keying. Then it lay dormant for 2 years.
The last year saw an explosion in CNNs for subject tracking. The key software is openpose. It theoretically allows a camera to track a whole body or focus in on a head, but it doesn't differentiate between bodies. Differentiating bodies would still require chroma keying or face matching.
If a subject takes off its clothing, as subjects do in the kind of thing that this camera would be used for, chroma keying would get thrown off. Face tracking doesn't work so well when the subject looks away.
The example videos show it doing a good job. Instead of a spherical camera or wide angle lens, it manages to track only by what's in its narrow field of view. This requires it to move very fast, resulting in jerky panning.
It isolates the subject from a background of other humans, recognizes paw gestures, & smartly tracks whatever part of the body is in view without getting thrown off. In the demos, it locks onto the subject with only a single frame of video rather than a thorough training set. It recognizes as little as an arm showing from behind an obstacle. Based on the multicolored clothing, they're running several simultaneous algorithms: a face tracker, a color tracker, & a pose tracker. The junk laptop would have a hard time just doing pose tracking.
The image sensor is an awful Chinese one. It would never do in a dim hotel room. Chinese manufacturers are not allowed to use any imported parts. The neural network processor is not an NVidia but an indigenously produced HiSilicon Hi3559A. China's government is focused on having no debt, but how's that working in a world where credit is viewed as an investment in the future? They can't borrow money to import a decent Sony sensor, so the world has to wait for China's own sensor to match Sony.
It's strange that tracking cameras have been on quad copters for years, now are slowly emerging on ground cameras, but have never been used in any kind of production & never replicated by any open source efforts. There has also never been any tracking for higher end DSLR cameras. It's only been offered on consumer platforms.
After living with a macbook that only did 3fps & a desktop which ran the neural network over a network, the lion kingdom obtained a gaming laptop with a GTX 970 that fell off a truck. The GTX 970 was much more powerful than the macbook's GT 750M & the desktop's GTX 1050, while the rest of the laptop was far behind. Of course, the rest was a quad 2.6GHz i7 with 12GB RAM. To an old timer, it's an astounding amount of power, just not comparable to the last 5 years.
Most surprising was how the GTX 970 had 50% more memory & 2x more cores than the GTX 1050 but lower clock speeds. They traded clock speed for parallelism to make it portable, implying clock speed used more power than transistor count.
Pose tracking on a tablet using cloud computing was a failure. The network was never reliable enough. It desperately needed a laptop with more horsepower than the macbook.
The mane problem was it quickly overheated. After some experimentation, 2 small igloo bars directly under the fans were enough to keep it cool. Even better would be a sheet of paper outlining where to put the igloo bars & laptop. Igloo bars may actually be a viable way to use the power of a refrigerator to cool CPUs.
Panning to follow standing humans is quite good. Being tall & narrow animals, humans have a very stable X as body parts come in & out of view. The Y is erratic. The erratic direction reverses when they lie down. The next question is what body part is most often in view & can we rank body parts to track based on amount of visibility?
It depends on how high the camera is. A truly automated camera needs to be on a jib with a computer controlling height. Anything else entails manual selection of the body part to track. With a waist height camera, the butt is the best body part to track. With an eye level camera, the head is the best body part to track.
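The panning described above can be sketched as a proportional loop with a deadband, which exploits the stable X of a standing human. This is only a minimal sketch: the constants & function names are hypothetical, not from the real code.

```python
# Minimal sketch of a proportional panning loop with a deadband.
# All names & constants here are hypothetical, not from the real code.

DEADBAND = 0.05   # ignore errors smaller than this fraction of the frame
GAIN = 0.5        # fraction of the error to correct per step

def pan_step(subject_x, frame_width):
    """Return a signed pan adjustment (in normalized units) toward the subject.

    subject_x: X pixel of the tracked body part
    frame_width: width of the frame in pixels
    """
    error = subject_x / frame_width - 0.5   # signed distance from frame center
    if abs(error) < DEADBAND:
        return 0.0                          # X is stable; don't jitter the servo
    return GAIN * error                     # proportional correction
```

The deadband is what keeps the servo from buzzing when body parts come in & out of view & the detected X wobbles by a few pixels.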
Lacking Z information or enough computing power to simultaneously track the viewfinder, the only option is adding a fixed offset to the head position. For the offset to be fixed, the camera has to always be at eye level, so there's no point in having tilt support. There's no plan to ever use head tracking.
The servocity mount doesn't automatically center, either. There needs to be a manual interface for the user to center it.
Autofocus has been bad.
It's all a bit less automatic than hoped. For the intended application, butt tracking would probably work best.
Rotating the Escam 45 degrees & defishing buys a lot of horizontal range, but causes it to drop lions far away. The image has to be slightly defished to detect any of the additional edge room, which shrinks the center.
The unrotated, unprocessed version does better in the distance & worse in the edges, but covers more of the corners. Another idea is defishing every other rotated frame, giving the best of both algorithms 50% of the time. The problem is the pose coordinates would alternate between the 2 algorithms, making the camera oscillate. It would have to average when the 2 algorithms were producing a match. When it went from 1 algorithm to 2 algorithms, there would be a glitch.
In practical use, the camera is going to be in the corner of a cheap motel room, so the maximum horizontal angle is only 90 degrees, while longer distance is desirable. The safest solution is neither rotating nor defishing.
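If the rotated mounting were used, the pose coordinates would come back in the rotated frame & have to be rotated back around the image center before driving the servos. A sketch of just that coordinate rotation, ignoring the fisheye warp; the function is hypothetical.

```python
import math

# Hypothetical sketch: with the camera mounted rotated 45 degrees, rotate
# each detected point back by -45 degrees around the image center before
# using it. Ignores the fisheye projection entirely.

def unrotate(x, y, cx, cy, degrees=45.0):
    """Rotate a detected point back around the image center (cx, cy)."""
    a = math.radians(-degrees)
    dx, dy = x - cx, y - cy
    return (cx + dx * math.cos(a) - dy * math.sin(a),
            cy + dx * math.sin(a) + dy * math.cos(a))
```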
Another discovery about the Escam is if it's powered up below 5V, it ends up stuck in night vision B&W mode. It has to be powered up at 5V to go into color mode. From then on, it can work properly below 5V. So the USB hub has to be powered off a battery before plugging it into the laptop. Not sure if openpose works any better in color, but it's about getting your money's worth.
The next problem is the Escam can only approximate what the DSLR sees, so there's parallax error. There's too much error to precisely place the head at the top of the frame.
After much work to get the power supply, servo control, escam mount, & DSLR mount barely working, there was just enough data to see the next problems. Getting smooth motion is a huge problem. Body parts coming in & out of view is a huge problem. Parallax error made it aim high.
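One way to attack the smooth motion problem is an exponential moving average on the target position that also coasts through frames where the tracked part vanishes. A hypothetical sketch; the constant & class are made up, not from the real code.

```python
# Hypothetical sketch of smoothing the jerky target position: an
# exponential moving average that coasts on the last estimate when the
# tracked body part drops out of view for a frame.

ALPHA = 0.3   # higher = snappier, lower = smoother

class SmoothedTarget:
    def __init__(self):
        self.value = None

    def update(self, measurement):
        """measurement: new target X, or None when the part is out of view."""
        if measurement is None:
            return self.value            # coast on the last estimate
        if self.value is None:
            self.value = measurement     # lock on with the 1st detection
        else:
            self.value += ALPHA * (measurement - self.value)
        return self.value
```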
With 4x3 cropping, it was 4fps, but detected a lot of positions it couldn't with the Samsung in 4x3. Decided this was as cropped as the lion kingdom would go.
The escam is a lot narrower than 180 degrees, maybe only slightly wider than the gopro. Detection naturally fails near the edges. Because so much of the lens is cropped, the next step would be mounting it diagonally & stretching it in software.
The answer to whether the Escam can be a USB webcam is no. It doesn't provide a USB webcam interface, despite claims on the internet. It has no USB device port & doesn't even allow access to the SD card over USB. We can assume the venerable USB webcam is dead & everything now uses TCP/IP.
It only provides a delayed H.264 stream over wifi. It claims to support a standardized protocol for security cams called ONVIF. More confusingly, IP cams have shifted to being marketed only as security cams, rather than a replacement for the venerable USB webcam.
It requires an access point to stream video. It initializes as an access point only to allow the user to configure it to use an external access point. Then, it converts to a station for the rest of its life.
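RTSP, the protocol under ONVIF, is text based like HTTP, so the stream negotiation is easy to see by hand. A sketch of the 1st request a client sends to a cam; the URL is hypothetical, not the Escam's actual path.

```python
# Sketch of a minimal RTSP DESCRIBE request, the 1st step of pulling the
# H.264 stream. The URL is hypothetical; a real cam listens on port 554 &
# answers with an SDP description of its streams.

def describe_request(url, cseq=1):
    """Build an RTSP DESCRIBE request asking the cam for its stream info."""
    return ("DESCRIBE %s RTSP/1.0\r\n"
            "CSeq: %d\r\n"
            "Accept: application/sdp\r\n"
            "\r\n" % (url, cseq))
```

Sending that over a plain TCP socket to the cam & reading the SDP reply is also a quick way to verify the stream exists before blaming the wifi.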
Once the ugly case was removed, a much smaller Hi3518 board was revealed, with RTL8188 USB wifi dongle.
The serial port for debugging was easily spotted. Alas, the internet doesn't actually develop with it. They only use the debugging console to run ifconfig or list directories. The debugging console outputs diagnostics for RTSP requests, the IP address, & the MAC address, which are very useful for troubleshooting.
There are some notes on using the debugging console.
The latency with H.264 & the app varies from 1-10 seconds. The chip physically supports JPEG for no latency, but it's not exposed in the app. The macbook can't provide a wifi access point unless it's hard wired to the internet. A raspberry pi would have to be used as a portable access point. It's not as bad as it sounds, considering the only alternative would require connecting the USB host on the Hi3518 to a USB bridge before connecting it to the macbook.
The complete workout with all the body positions shows how reliable that counter eventually became. No more lost counts. More attention spent on the TV. Just can't use wifi for anything else.
Made the GPU server just send coordinates for the tablet to overlay on the video. This was within the bandwidth limitations, but it still occasionally got behind by a few reps. It might be transient wifi usage. The next step would be reducing the JPEG quality from 90. Greyscale doesn't save any bandwidth. There's also omitting the quantization tables, but by then, you're better off using pipes with a native x264 on the tablet.
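Each pose is only a handful of 16 bit X/Y pairs, so sending coordinates instead of JPEG frames shrinks the readback from tens of kilobytes to a few dozen bytes. A hypothetical sketch of such a packet format; the magic number & layout are made up, not the program's actual protocol.

```python
import struct

# Hypothetical sketch of coordinate packets replacing JPEG readbacks:
# a 2 byte marker, a 2 byte count, then signed 16 bit X/Y pairs.

MAGIC = 0x504F  # hypothetical packet marker

def pack_pose(keypoints):
    """keypoints: list of (x, y) pixel pairs. Returns a tiny binary packet."""
    payload = struct.pack("<HH", MAGIC, len(keypoints))
    for x, y in keypoints:
        payload += struct.pack("<hh", x, y)   # signed, so -1 can mean "not seen"
    return payload

def unpack_pose(data):
    """Inverse of pack_pose; returns the list of (x, y) tuples."""
    magic, n = struct.unpack_from("<HH", data, 0)
    assert magic == MAGIC
    return [struct.unpack_from("<hh", data, 4 + 4 * i) for i in range(n)]
```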
It never ceases to amaze lions how that algorithm tracks body poses with all the extra noise in the image, even though it has many errors. It's like a favorite toy as a kid, but a toy consisting of an intelligence.
The pose classifier eventually was bulletproof. This system had already dramatically improved the workout despite all its glitches. You know it's a game changer because no-one watches the video. Just like marriage & politics, the most mundane gadgets no-one cares about are the most revolutionary while the most exciting gadgets everyone wants are the least revolutionary.
The mane problem became connectivity. Lockups lasting several minutes & periods of many dropped frames continued. It seemed surmountable compared to the machine vision. Finally did the long awaited router upgrade.
Its days of computing ended 6 years ago. It is now the apartment complex's most powerful router. Seem to recall this was the laptop lions used while dating ... the single women. Lions wrote firmware on it while the single women watched TV or ate their doritos.
The other problem was for the pose classifier to work, it can't drop any frames. The tablet needs to capture only as many frames as the GPU can handle & the GPU needs to process all of them. Relying on the GPU to drop frames caused it to drop a lot of poses when a large number of frames got buffered & received in the time span of processing a single frame.
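The frame pacing above amounts to a single-slot mailbox between capture & GPU: the capture thread always overwrites the slot with the newest frame, so the GPU never chews through a backlog of stale frames after a stall. A minimal sketch, assuming a threaded capture loop; the class is hypothetical.

```python
import threading

# Hypothetical sketch of a single-slot "latest frame" buffer. The capture
# thread calls put() for every frame; the GPU thread calls get() when it's
# ready. Old unconsumed frames are silently dropped at the source.

class LatestFrame:
    def __init__(self):
        self._frame = None
        self._cond = threading.Condition()

    def put(self, frame):
        with self._cond:
            self._frame = frame          # overwrite any unconsumed frame
            self._cond.notify()

    def get(self):
        with self._cond:
            while self._frame is None:
                self._cond.wait()        # GPU thread sleeps until a frame lands
            frame, self._frame = self._frame, None
            return frame
```

For the rep counter this policy has to live on the tablet side instead, since every captured frame must reach the classifier; the mailbox only fits the tracking case where stale frames are worthless.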
The hunt for connectivity eventually led to wifi not being fast enough to send JPEG frames both ways at 6fps. The reason skype can send video 2 ways is H.264, which entails a native build of x264, so isn't worth it for this program. Wifi is also bad at multiplexing 2 way traffic. Slightly better results came from synchronizing the frame readbacks, but only after eliminating them completely did the rep counts have no delay & the lost connections go away. It counted instantaneously, even on the ancient broken tablet, but it didn't look as good without the openpose outlines. There's also bluetooth for frame readbacks or drawing vectors & statistics on the tablet, if H.264 is that bad.
The trick with the tracker is to use a spherecam on USB, but opencv can't deterministically select the same camera, scale the input, & display in fullscreen mode on a mac.
The options were running a network client on virtualbox to handle all the I/O, or fixing opencv. Running a network client in a virtual machine to fix the real machine is ridiculous, so the decision was made to fix opencv: uninstall the brew version & recompile it, caffe, & openpose again from scratch.
brew remove opencv
Openpose on the mac can't link to a custom opencv. You have to hack some cmake files or just link it manually.
Discovered openpose depends heavily on the aspect ratio & lens projection. It works much faster but detects fewer poses on a 1x1 aspect ratio. It works much slower & detects more poses on a 16x9 crop of the center. It works much slower on 512x512 & much faster on 480x480. Pillarboxing & stretching to fit the entire projection in 16x9 don't work.
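The winning 16x9 center crop is easy to compute for an arbitrary source frame. A sketch, with a hypothetical function name; the real program may do this differently.

```python
# Hypothetical sketch of the largest centered 16x9 crop of a source frame,
# the aspect ratio that detected the most poses.

def center_crop_16x9(width, height):
    """Return (x, y, w, h) of the largest centered 16x9 rectangle."""
    crop_w = width
    crop_h = crop_w * 9 // 16
    if crop_h > height:          # source is taller than 16x9: clamp height
        crop_h = height
        crop_w = crop_h * 16 // 9
    return ((width - crop_w) // 2, (height - crop_h) // 2, crop_w, crop_h)
```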
Spherical projections are as good as equirectangular projections. The problem is detecting directly above & below with the 16x9 cropping. It's not going to detect someone looking down on it like the Facebook Portal, but the current application doesn't need to see directly above & below. There's also the option of scanning each frame in 2 passes.
Despite paying a fortune to ship an ESCAM Q8 from China in 7 days, it was put on the same boat as the free shipping, 2 weeks ago.
An edited video provided the 1st documentation of a computer counting reps. The reality was it only made 3.5fps instead of 4.5 & this wasn't fast enough to detect hip flexes.
It detected mane hair instead of arms.
Then it froze for several minutes before briefly hitting 4.5, then dropping back to 3.5. The only explanation was thermal throttling.
If only obsolete high end graphics cards would drop to $30 like they did 20 years ago. Tried running it again on the GTX 1050 with the framerate at 5fps. This actually allowed it to play 1920x1080 video with an acceptable amount of stuttering, while also processing machine vision. GPUs actually have some form of task switching, but PCI express is a terrible way to transfer data.
As it has always been with GPU computing, we have 32GB of mane memory on the fastest bus being used for nothing while all the computations are done on 2GB of graphics memory on the PCI bus.