The RealSense is extremely easy to work with, and has provided a great development platform to jump off from, but for a number of reasons this device is simply not suited to the application. Here we discuss the shortcomings of the R200, what we have learned from it, and what we think we should do in the future.
1) Narrow field of view:
The R200 sports a 59° azimuthal and 46° vertical Field of View (FOV). This narrow perspective makes it difficult to keep track of what objects are directly to the sides, and undermines our utilization of binaural localization. Ideally we would like to have a full hemispherical FOV, providing 180° feedback in both axis, although the sensors required to achieve this might be ungainly. To limit bulk, it is likely acceptable to restrict the vertical axis to 110°, with an offset as shown below. This corresponds roughly to the FOV of human vision, for which we have a lot of data.
2) Poor Resolution
The depth camera has limited resolution which prevents us from increasing the audio resolution. The image below is of a wall when viewed at an angle. The visible striping (red lines added) is the space between measured depth layers. This becomes much more apparent when you watch video of the output. Intel likely left gaps in the resolution to reduce incorrect measurements due to noise. Unfortunately for us, all of those spaces cause gaps of silence which make the output sound choppy. Those aren't the White Stripes we want to listen to!
While spending more on a nicer camera may let us crank up the resolution, and extensive filtering may help reduce the audio impacts, these sort of artifacts are going to be present in depth cameras no matter what we do. This is one of the major reasons we've been thinking about Stereo Visual Odometry (SVO).
3) Noisy & Vacant Images
The R200 does pretty well with nearby objects, but when the surface is somewhat shiny, or it gets further than about 8 feet away, the depth camera starts getting less certain about where exactly things are. This comes through as grainy images, with static-like patches of varying depth which pop in and out. In the image below, you might think Morgan is standing in front of a shrub, based on the speckled appearance of the object. Although it isn't the nicest piece of furniture, this is in fact a couch, not a plant.
All that static comes across as a field of garbled sound. It isn't steady, like the audio queues for Morgan, but poppy and intermittent, just like the visual artifacts would lead you to expect. Once again, filtering can be employed to reduce this noise, at the cost of speed, but on some level, this is a fact of life for IR imaging.
In the lower right hand corner of this image you can see a rectangular hole to infinity. The image is of my workbench, which looks messy because it is, and the vacant corner is an LCD monitor. The R200, like many long range IR, structured light, and LiDAR sensors, doesn't detect clear or reflective objects. This might be the hardest obstacle for a robotic vision system to overcome, since no single sensor is very good at everything. We may find that it is necessary to integrate supplementary sensors into our final SNAP to handle clear objects like glass doors and windows.
4) Cost Prohibitive
SNAP's stated goal is to provide effective navigation and perception assistance for $500 or less. Although the R200 is a dirt cheap by my usual robotic vision standards, the $100 - $150 price tag makes it an easy target when we are trying to cut costs, and the lack of alternate sources makes it a high risk option when trying to make an assistive device available to people off the beaten path. This is theoretically another benefit to SVO, which can employ $10 - $20 cameras (although supplementary hardware will impose additional costs).