Body_25 using FP16

While porting BodyPartConnectorCaffe, it became clear that the easiest solution was to link openpose against the tensorrt library (nvinfer) rather than trying to move the openpose stuff into the trt_pose program. In theory, that would involve replacing just the use of spNets in poseExtractorCaffe.cpp. It would still be work, though, & most of the unhappy path was already done.

Porting BodyPartConnectorCaffe entailed copying most of openpose. In the end, the lion kingdom's attempt to port just a subset of openpose became a complete mess. Having said that, the port covered just the GPU functions. The post processing is so slow that the CPU functions aren't any use on a jetson nano.

It should be noted that all the CUDA buffers are float32 & all the CUDA functions use float32. No fp16 data types from the tensorrt engine are exposed. INT8 started gaining favor as a next step, since it would affect just the engine step, though the value ranges could change.

Another important detail is that they instantiate a lot of templates in .cpp files with every possible data type (classType className<unsigned short>;). Despite this efficiency move, they redefine enumClasses.hpp in 9 places.
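For anyone who hasn't seen the pattern, here is a minimal sketch of explicit template instantiation in standard C++. Scaler & scale are hypothetical names invented for illustration, not openpose classes. In a real project the class declaration would sit in a header, while the member definition & the instantiation lines would sit in a .cpp file.

```cpp
#include <vector>

// Hypothetical class template, not part of openpose.
// Only this declaration would normally go in the header.
template <typename T>
class Scaler
{
public:
    explicit Scaler(T factor) : mFactor(factor) {}
    std::vector<T> scale(const std::vector<T>& input) const;

private:
    T mFactor;
};

// Member definition that would live in the .cpp file.
template <typename T>
std::vector<T> Scaler<T>::scale(const std::vector<T>& input) const
{
    std::vector<T> output;
    output.reserve(input.size());
    for (const T& value : input)
        output.push_back(static_cast<T>(value * mFactor));
    return output;
}

// Explicit instantiations: the compiler emits code for these types here,
// so other translation units can link against them without ever seeing
// the member definition above.
template class Scaler<float>;
template class Scaler<double>;
template class Scaler<unsigned short>;
```

The payoff is that the template body gets compiled once per listed type instead of once per file that includes it. The cost is that any type missing from the list fails at link time.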

The non maximum suppression function outputs a table with what are obviously the 25 body parts as rows & 3 columns of some kind of data. It's interesting how close FP16 & FP32 came, yet 2 rows are completely different. The rows must correspond to POSE_BODY_25_BODY_PARTS + 2. Row 9 must be LWrist. Row 26 must be RHeel. Neither of those is really visible. The difference doesn't come from RGB vs BGR, brightness or contrast, or the downscaling interpolation, but from some way the FP16 model was trained.
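The row guessing above can be sketched in a few lines. This assumes the table row number is the BODY_25 part index plus 2, which is the inference made in this log rather than anything confirmed by openpose. partForRow, differingRows, Row & kRowOffset are hypothetical helpers written for this example. The part names follow openpose's BODY_25 index order.

```cpp
#include <algorithm>
#include <array>
#include <cmath>
#include <cstddef>
#include <string>
#include <vector>

// BODY_25 part names in openpose's index order, 0 through 24.
const std::array<const char*, 25> kBody25Parts = {{
    "Nose", "Neck", "RShoulder", "RElbow", "RWrist",
    "LShoulder", "LElbow", "LWrist", "MidHip", "RHip",
    "RKnee", "RAnkle", "LHip", "LKnee", "LAnkle",
    "REye", "LEye", "REar", "LEar", "LBigToe",
    "LSmallToe", "LHeel", "RBigToe", "RSmallToe", "RHeel"}};

// Assumption taken from the text: table row = part index + 2.
constexpr int kRowOffset = 2;

// Name of the body part for a printed table row, or "unknown" if out of range.
std::string partForRow(int row)
{
    const int index = row - kRowOffset;
    if (index < 0 || index >= static_cast<int>(kBody25Parts.size()))
        return "unknown";
    return kBody25Parts[static_cast<std::size_t>(index)];
}

// One table row: 3 columns of float data.
using Row = std::array<float, 3>;

// Positions (0-based, within the vectors) where any column differs
// by more than the tolerance between the two tables.
std::vector<std::size_t> differingRows(const std::vector<Row>& fp32,
                                       const std::vector<Row>& fp16,
                                       float tolerance)
{
    std::vector<std::size_t> result;
    const std::size_t count = std::min(fp32.size(), fp16.size());
    for (std::size_t i = 0; i < count; ++i)
    {
        for (std::size_t c = 0; c < 3; ++c)
        {
            if (std::fabs(fp32[i][c] - fp16[i][c]) > tolerance)
            {
                result.push_back(i);
                break;
            }
        }
    }
    return result;
}
```

With that offset, partForRow(9) gives LWrist & partForRow(26) gives RHeel, matching the 2 rows called out above. differingRows returns positions within the vectors, so the number of the first printed row has to be added before passing a result to partForRow.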

After 1 month invested in porting body_25 to FP16, the result was a 3.3fps increase at the same network size. The model itself can run at 9fps, but the post processing slows it down. The GUI slows it down by 0.3fps. The FP32 version did 5fps with a 224x128 network. The FP16 version hit 6.5fps with a 256x144 network & 8.3fps with a 224x128 network. It's still slower than what lions would consider enough for camera tracking.

Results are somewhat better if we match the parameters exactly. With a 128x224 network & 640x360 video, the framerate doubles in FP16. The size of the input video has a dramatic effect. There is less accuracy in FP16, as noted by the NMS table.

