Stereo Visual Odometry

A project log for SNAP: Augmented Echolocation

Sightless Navigation And Perception (SNAP) translates surroundings into sound, providing continuous binaural feedback about the environment.

Dan Schneider, 10/21/2017 at 07:55

Stereo visual odometry (SVO) is looking to be our next sensor of choice. I want to discuss some of the pros and cons of this method, as well as compile some project learning resources.

To define the desired feature set of the vision sensor, we should first recognize that the sensor unit consists of not just the physical sensor parts, but the entire sensing system from the outside world right up to the depth map input where SNAP's modular feedback software takes over. We would like this system to feature the following, in no particular order of importance:

There are a few other requirements, such as ergonomics, which we will take as a given. This list is in essence why SVO looks so good. The biggest shortcoming is in surface compatibility, as SVO has a hard time with featureless, clear, and reflective surfaces. Since that is true of all high-resolution SLAM systems, it's hard to count it as a negative. One thing to consider is that most of the SVO tutorials and tools are focused on SLAM techniques and are dead set on absolute positioning, which we don't care about. That might mean that we can save processing (and coding) time by skipping those steps.

Chris Beall at Georgia Tech put out this extraordinarily awesome overview of what SVO entails, which has made the process actually easy to understand, and makes it look deceptively easy to accomplish. It makes sense to discuss methodology by following Mr. Beall's step-by-step process, so here goes:

1) Epipolar (Stereo) Rectification

There is a nifty overview of image rectification in OpenCV using the "Pinhole Camera Model", which assumes the image is not distorted by lenses before being captured. This is never strictly the case, but if a camera is small enough, and the distance between the lens and image sensor is negligible, we can use the pinhole model with relatively little error. Adjustments can then be made for lens effects, as discussed in this paper on rectifying images between >180° fisheye lens cameras.

My biggest question here is whether this is done in real time for each frame, or whether you simply establish transformation arrays once that are characteristic of the pair of cameras.
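As far as I can tell, the usual OpenCV pattern is the second option: build the remapping tables once from the calibration (cv2.initUndistortRectifyMap), then apply them to every frame (cv2.remap). Here is a minimal, dependency-free sketch of that split. The function names and the one-row shift are hypothetical stand-ins, not a real calibration.

```python
def build_remap_table(width, height, transform):
    """Precompute, once, the source pixel for every rectified pixel.

    `transform` maps rectified (x, y) to source (x, y). Here it is a
    hypothetical stand-in for the mapping derived from calibration.
    """
    table = []
    for y in range(height):
        row = []
        for x in range(width):
            sx, sy = transform(x, y)
            row.append((int(round(sx)), int(round(sy))))
        table.append(row)
    return table


def apply_remap(frame, table, fill=0):
    """Per-frame step: a nearest-neighbour lookup, no geometry recomputed."""
    height, width = len(frame), len(frame[0])
    out = []
    for row in table:
        out_row = []
        for sx, sy in row:
            inside = 0 <= sx < width and 0 <= sy < height
            out_row.append(frame[sy][sx] if inside else fill)
        out.append(out_row)
    return out


# Hypothetical correction: image content needs to move up by one row.
def shift_rows(x, y):
    return x, y + 1


table = build_remap_table(width=4, height=3, transform=shift_rows)  # done once
frame = [[10, 11, 12, 13], [20, 21, 22, 23], [30, 31, 32, 33]]
rectified = apply_remap(frame, table)  # done for every frame
```

If this holds for our cameras, the per-frame cost of rectification is just a table lookup, as long as the two cameras stay rigidly fixed relative to each other.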

Something to note on rectification: keeping your cameras closer together makes rectification easier, but also makes distance calculations more prone to error. Human vision accomplishes depth perception astoundingly well, but it does so by combining stereo depth perception with lens distortion and higher-level spatial awareness. In our application we can likely accept more error in distance measurements than most SLAM enthusiasts, so long as we don't cause excessive noise in the far field.
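To put rough numbers on the baseline trade-off: for an ideal rectified pinhole pair, depth is Z = f * B / d (focal length in pixels, baseline, disparity in pixels), so a disparity error of dd pixels gives a depth error of roughly Z^2 / (f * B) * dd. The sketch below just evaluates that. The 700 px focal length and the two baselines are assumed values for illustration, not measurements from our hardware.

```python
def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Pinhole stereo depth: Z = f * B / d."""
    return focal_px * baseline_m / disparity_px


def depth_error(focal_px, baseline_m, depth_m, disparity_error_px=1.0):
    """First-order depth uncertainty: dZ ~= Z^2 / (f * B) * dd."""
    return depth_m ** 2 / (focal_px * baseline_m) * disparity_error_px


if __name__ == "__main__":
    focal_px = 700.0  # assumed focal length in pixels
    for baseline_m in (0.06, 0.12):  # assumed camera spacings in metres
        for depth_m in (1.0, 3.0, 6.0):
            err = depth_error(focal_px, baseline_m, depth_m)
            print(f"B={baseline_m:.2f} m  Z={depth_m:.0f} m  error ~ {err:.2f} m")
```

The error grows with the square of the distance and shrinks in proportion to the baseline, which is why a narrow camera spacing mostly hurts in the far field.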

2) Feature Extraction

We recognized SIFT from OpenCV, but while reading up on their site, I was happy to find that there is also a SURF implementation. The Feature Detection and Description page has a great overview of both.

The real challenges here, so far as I can predict, will be in maintaining frame rate while casting enough points to prevent voids in the depth map, and writing our own feature extraction routines to home in on objects of interest. There is a good chance we will end up wanting to mesh or best-fit planar surfaces so as not to drown our user's hearing in the sound of blank walls.
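One way to find a planar surface in a cloud of 3-D points is RANSAC: repeatedly fit a plane through three random points and keep the plane that the most points agree with. This is just one common choice, not something we have committed to. The sketch below assumes the points are already triangulated into metres, and the tolerance and iteration count are guesses.

```python
import random


def plane_from_points(p, q, r):
    """Return (unit normal, offset) of the plane through three 3-D points,
    or None if the points are (nearly) collinear."""
    ux, uy, uz = (q[i] - p[i] for i in range(3))
    vx, vy, vz = (r[i] - p[i] for i in range(3))
    n = (uy * vz - uz * vy, uz * vx - ux * vz, ux * vy - uy * vx)
    length = sum(c * c for c in n) ** 0.5
    if length < 1e-9:
        return None
    n = tuple(c / length for c in n)
    return n, -sum(n[i] * p[i] for i in range(3))


def ransac_plane(points, iterations=100, tolerance=0.02, seed=0):
    """Return the indices of the points lying on the best plane found."""
    rng = random.Random(seed)
    best = []
    for _ in range(iterations):
        plane = plane_from_points(*rng.sample(points, 3))
        if plane is None:
            continue
        n, d = plane
        inliers = [
            i for i, pt in enumerate(points)
            if abs(sum(n[k] * pt[k] for k in range(3)) + d) <= tolerance
        ]
        if len(inliers) > len(best):
            best = inliers
    return best


# A flat "wall" 2 m away plus a few closer points standing in for objects.
wall = [(x * 0.5, y * 0.5, 2.0) for x in range(5) for y in range(5)]
clutter = [(0.3, 0.7, 1.1), (1.2, 0.4, 0.6), (0.9, 1.6, 1.5)]
points = wall + clutter
wall_indices = ransac_plane(points)
```

Points flagged as belonging to the wall could then be rendered as a single quiet surface, leaving the remaining points to stand out in the audio feedback.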

3) Stereo Correspondence 

Once again OpenCV comes to the rescue with a page on Stereo Correspondence. This step seems to be more straightforward.

Unless I simply misunderstand, stereo matching is necessary to triangulate distances, but not much else. While there's a chicken-and-egg problem at hand, we might be able to skip matching some points to save time if we are running plane detection.
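To make the matching and triangulation concrete: once the images are rectified, a point in the left image should appear on the same row of the right image, shifted left by its disparity, so the search is one-dimensional. The sketch below does a toy version of that search on a single scanline using the sum of absolute differences (SAD), then converts the disparity to depth. The window size, search range, focal length, and baseline are all assumed values.

```python
def match_disparity(left_row, right_row, x, half_window=1, max_disparity=8):
    """Find the disparity at column x of a rectified scanline pair.

    Compares a small window around left_row[x] with windows in right_row
    shifted left by 0..max_disparity pixels, using the sum of absolute
    differences (SAD), and returns the shift with the lowest cost.
    """
    best_d, best_cost = 0, float("inf")
    for d in range(max_disparity + 1):
        lo_left = x - half_window
        lo_right = lo_left - d
        # Skip windows that would fall outside either row.
        if lo_right < 0 or x + half_window >= len(left_row):
            continue
        cost = sum(
            abs(left_row[lo_left + k] - right_row[lo_right + k])
            for k in range(2 * half_window + 1)
        )
        if cost < best_cost:
            best_d, best_cost = d, cost
    return best_d


def triangulate_depth(disparity_px, focal_px=700.0, baseline_m=0.1):
    """Z = f * B / d; returns None when there is no measurable disparity."""
    if disparity_px <= 0:
        return None
    return focal_px * baseline_m / disparity_px


# The same small feature (9, 5, 9) seen 3 pixels further left in the right row.
left_row = [0, 0, 0, 0, 0, 9, 5, 9, 0, 0]
right_row = [0, 0, 9, 5, 9, 0, 0, 0, 0, 0]
disparity = match_disparity(left_row, right_row, x=6)
depth_m = triangulate_depth(disparity)
```

Skipping points, as suggested above, would simply mean calling the matcher for fewer values of x.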

4) Temporal Matching

Here's the OpenCV resource you've come to expect. As I understand it, this is expensive from a processing standpoint, but I am still fuzzy on the details. It seems to me that if you are matching between cameras quickly, you could match between frames just as easily using the same method. Depending on whether rectification is done in real time or established in advance, as I asked earlier, this may not be too terribly expensive.

We should remember that because we aren't looking for absolute positioning, the absolute magnitude of velocity is of secondary importance. A much less precise temporal matching (or temporal rectification) method could be employed to save time with little to no impact on the usability of the device. 
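As a rough illustration of matching features between frames: if each feature carries a descriptor vector (as SIFT and SURF features do), the previous and current frames can be matched by brute force, keeping only matches whose best candidate is clearly better than the runner-up. OpenCV offers this kind of matching through cv2.BFMatcher; the dependency-free sketch below uses made-up three-element descriptors and an assumed ratio of 0.75.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two descriptor vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def match_descriptors(prev_desc, curr_desc, ratio=0.75):
    """Brute-force matching with a ratio test.

    Returns (prev_index, curr_index) pairs whose best match is clearly
    better than the second best. Ambiguous features are dropped.
    """
    matches = []
    for i, d in enumerate(prev_desc):
        ranked = sorted(
            (squared_distance(d, c), j) for j, c in enumerate(curr_desc)
        )
        if len(ranked) < 2:
            continue
        (best, j), (second, _) = ranked[0], ranked[1]
        # Compare squared distances, so the ratio is squared as well.
        if best < (ratio ** 2) * second:
            matches.append((i, j))
    return matches


# Made-up descriptors: two distinctive features and one ambiguous one.
prev_desc = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.5, 0.5, 0.0)]
curr_desc = [(0.0, 0.9, 0.1), (0.9, 0.1, 0.0), (0.0, 0.0, 1.0)]
matches = match_descriptors(prev_desc, curr_desc)
```

Loosening or tightening the ratio is one knob for trading precision against speed and match count, which fits the idea that we can tolerate a less precise temporal match.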

5) Other SLAM Topics

Due to our lack of interest in absolute positioning, relative pose estimation and absolute orientation are not necessary.