Here's a video on the position sensor output (the CSV output from the phone is visualized and manually synced with the video). The position sensor outputs position (x, z in meters), orientation (rad), and velocity (m/s).
For the position estimation, features in the camera images are tracked. From these tracked features you can estimate the displacement between two images (also known as visual odometry). The position estimate gets fused with the phone's accelerometer readings, which allows an output rate as high as your accelerometer rate (e.g. one sample every 2 ms).
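The displacement step can be pictured like this: each tracked corner moves a little between two frames, and the camera motion shows up as a common motion of all the corners. A minimal sketch (not the app's actual code, which is certainly more robust against outliers) that just averages the per-feature motion vectors:

```java
// Minimal sketch: estimate the image-plane displacement between two
// frames by averaging the motion vectors of the tracked corners.
// prev[i] and curr[i] are the pixel coordinates {x, y} of the same
// tracked corner in the previous and current frame.
public class DisplacementSketch {
    static double[] estimateDisplacement(double[][] prev, double[][] curr) {
        double dx = 0, dy = 0;
        for (int i = 0; i < prev.length; i++) {
            dx += curr[i][0] - prev[i][0];
            dy += curr[i][1] - prev[i][1];
        }
        // Mean motion of all corners = apparent camera-induced image shift.
        return new double[] { dx / prev.length, dy / prev.length };
    }

    public static void main(String[] args) {
        double[][] prev = { {100, 120}, {200, 80}, {160, 200} };
        double[][] curr = { {103, 121}, {203, 81}, {163, 201} };
        double[] d = estimateDisplacement(prev, curr);
        System.out.println(d[0] + " " + d[1]); // prints 3.0 1.0
    }
}
```

In practice you would reject outliers (e.g. with RANSAC) and also estimate rotation, but the principle is the same: corner tracks in, one displacement out.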
What's the problem?
Only one camera is used, so depth information is not available. You can still estimate a displacement that is correct in direction and relative magnitude, but it differs from the real-world displacement by an unknown scale. And there is no way around this, as long as you do not get some distance information from elsewhere (like from speed sensors over time, or a lidar, or whatever). Even super cool AI mono camera visual odometry cannot solve this.
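You can see the scale ambiguity directly from the pinhole projection: scale the whole scene and the camera translation by the same factor, and every point lands on exactly the same pixel. A tiny demo (illustrative numbers, not from the app):

```java
public class ScaleAmbiguity {
    // Pinhole projection: a 3D point (x, y, z) in camera coordinates maps
    // to pixel (f*x/z, f*y/z), with focal length f in pixels.
    static double[] project(double f, double x, double y, double z) {
        return new double[] { f * x / z, f * y / z };
    }

    public static void main(String[] args) {
        double f = 500.0; // assumed focal length for illustration
        // A point at (0.4, 0.2, 2.0) m ...
        double[] near = project(f, 0.4, 0.2, 2.0);
        // ... and the same scene scaled up by 10 ...
        double[] far = project(f, 4.0, 2.0, 20.0);
        // ... land on the same pixel, so the camera alone cannot tell
        // whether it moved 0.1 m in a small scene or 1 m in a big one.
        boolean same = Math.abs(near[0] - far[0]) < 1e-9
                    && Math.abs(near[1] - far[1]) < 1e-9;
        System.out.println(same); // prints true
    }
}
```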
A simple workaround: assume you are moving over an even surface and you know your height above that surface. As long as this assumption holds, the output is correct.
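With that assumption, similar triangles fix the scale. For a camera looking straight down at a flat floor from height h, one pixel of image motion corresponds to h/f meters of real motion (f = focal length in pixels). A sketch with assumed numbers, not the app's actual code:

```java
public class GroundPlaneScale {
    // Camera looking straight down at a flat floor from height h:
    // similar triangles give metersPerPixel = h / f, so an image shift
    // of du pixels corresponds to du * h / f meters of real motion.
    static double metricDisplacement(double duPixels, double heightMeters,
                                     double focalPixels) {
        return duPixels * heightMeters / focalPixels;
    }

    public static void main(String[] args) {
        // Assumed values for illustration: f = 500 px, phone held 1.5 m
        // above the floor, 25 px of tracked image shift between frames.
        double d = metricDisplacement(25.0, 1.5, 500.0);
        System.out.println(d); // prints 0.075 (meters)
    }
}
```

A tilted camera works the same way, just with a per-pixel depth along the ground plane instead of a single constant; the key point is that the known height turns the scale-free displacement into meters.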
The phone used in the video is a Huawei P10, which works really well. Computation time is ~10 ms for the corner tracking (320x240, done in RenderScript) and another ~5 ms for the displacement estimation and fusion (done in Java).
Tested with a Huawei P10, Samsung S6, LG V30 and my favorite, the Samsung S3, which does not work so well; you can barely see anything in the images, lol.