
Depth To Audio: Where the Magic Happens

A project log for SNAP: Augmented Echolocation

Sightless Navigation And Perception (SNAP) translates surroundings into sound, providing continuous binaural feedback about the environment.

Colin Pate 09/03/2017 at 00:59

We decided early on in the project that our "back end" software would be coded separately and run in a separate process from the simulator, so that in the future the back end would be easier to port from reading depth data produced by the simulator to reading depth data from a hardware prototype.


The back end has three main processing steps:

  1. Read depth data as an image
  2. Use computer vision to process the image
  3. Turn the data from the image into sound

We first had to choose how to get our depth data from the simulator (DLL, shader, and C# script) into the back end reliably and with little lag. Our solution was shared mapped memory, which is available on both Windows and Unix. Shared mapped memory is simply a block of your PC's RAM that multiple processes can write to and read from. The block is found and accessed using a unique identifier set by the software that creates it. In our case, the DLL Unity plugin creates the memory space and our back end software opens it.
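As a rough sketch of what that looks like on the back end side (the mapping name "DepthMapBuffer" and the buffer dimensions below are placeholders, not the project's actual values), opening a named shared-memory block from C++ on Windows goes something like this:

// Sketch: open the named shared-memory block created by the Unity plugin (Windows API).
// The mapping name and resolution are assumed placeholders.
#include <windows.h>
#include <cstdio>

int main() {
    const char* kMappingName = "DepthMapBuffer";          // must match the name the DLL used
    const size_t kBufferSize = 640 * 480 * sizeof(float); // one depth value per pixel

    HANDLE hMap = OpenFileMappingA(FILE_MAP_READ, FALSE, kMappingName);
    if (!hMap) {
        std::printf("Shared memory not found - is the simulator running?\n");
        return 1;
    }

    // Map the block into this process's address space and treat it as raw depth data.
    const float* depth = static_cast<const float*>(
        MapViewOfFile(hMap, FILE_MAP_READ, 0, 0, kBufferSize));
    if (!depth) {
        CloseHandle(hMap);
        return 1;
    }

    // ... pass depth[] to the processing steps described below ...

    UnmapViewOfFile(depth);
    CloseHandle(hMap);
    return 0;
}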

Software Packages

Our 3D depth data is output from Unity as a simple bitmap image, just like an uncompressed digital photo. However, instead of color, each pixel represents how far the nearest surface in that direction is from the camera. Because of this similarity to regular images, we chose the popular computer vision library OpenCV to do our image processing. To generate audio, we used the open-source spatial audio library OpenAL.
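As a non-authoritative sketch (the resolution and grid size here are illustrative, not our actual settings), the depth buffer from shared memory can be wrapped in an OpenCV matrix without copying and averaged down to a coarse grid of regions in a couple of calls:

// Sketch: view the shared depth buffer as a cv::Mat and reduce it to a small grid of
// average depths. Width, height, and grid dimensions are assumed placeholders.
#include <opencv2/opencv.hpp>

cv::Mat RegionAverages(const float* depthBuffer, int width, int height,
                       int gridCols, int gridRows) {
    // No copy here: cv::Mat simply views the shared-memory buffer as a 32-bit float image.
    cv::Mat depth(height, width, CV_32FC1, const_cast<float*>(depthBuffer));

    // INTER_AREA resizing averages the pixels that fall into each output cell,
    // which yields one mean depth value per region.
    cv::Mat regions;
    cv::resize(depth, regions, cv::Size(gridCols, gridRows), 0, 0, cv::INTER_AREA);
    return regions.clone(); // clone so the result outlives the shared buffer
}

This coarse grid of average depths is what eventually gets turned into sound.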

Methodology

There seemed to be two different ways to go about converting our 3D depth data to audio: a smart way and a dumb way. The smart way would use advanced computer vision techniques such as edge detection, plane detection, object detection, and facial recognition to process the depth data in real time, attempt to decide what was important to the user, and then use different audio signals to describe the different types of objects and landscapes around them. The dumb way was to perform no advanced computer vision at all, use the most direct possible mapping of 3D space to sound, and rely on the brain's inherent ability to restructure itself and figure out new ways to use the senses.

Computer Vision

Pros: Potentially provides more detail and pertinent information to the user

Cons: Could be dangerous if it gets confused, hard to make reliable and effective, computationally intensive

Simple Depth-To-Audio

Pros: Less computationally intensive, makes the brain do the heavy lifting, potentially faster response and lower lag

Cons: Possibly lower signal-to-noise ratio and less perceptible detail

The smart way sounds pretty sexy and much more high-tech, but none of us had a background in advanced computer vision, and the dumb way had its own upsides. For example, a recent Wired article describes how a blind woman can perceive her surroundings in incredible detail with nothing more than a low-resolution black-and-white webcam and an amazing program called the vOICe. Her brain was able to retrain itself to interpret audio as visual data with fantastic success. The vOICe uses a very simple algorithm, in which distance corresponds to volume and vertical position corresponds to pitch.


We were inspired by this success with such a simple algorithm. Before seeing the article, we had tried a few similar algorithms with limited success: we divided the depth data into a number of rectangular regions and pulsed sound at a rate corresponding to the average distance of each region. However, outputting sound simultaneously from every angle of the depth data quickly became an unintelligible mess, even at very low resolutions. The trick used in the vOICe, and in the latest version of our audio generation software, is to pan the audio from left to right at a user-configurable rate. That way, the user hears sounds from one horizontal angle at a time rather than from the whole depth image simultaneously.

Below is a pseudo-code block to demonstrate the algorithm.

Sounds[4] = SoundSources                    # one sound source per vertical band
# x is horizontal position in the depth data and in spatial sound position (L-R panning)
# y is vertical position in the depth data and the pitch of the sound
while(1):
    Regions[4][4] = GetRegions(DepthData)   # refresh the average depth of each region
    for x in range(0, 4):
        StopSounds(Sounds)
        for y in range(0, 4):
            Sounds[y].Volume = Regions[x][y].AverageDepth
            Sounds[y].Pitch = Coefficient * y
            Sounds[y].HorizontalPosition = x
        PlaySounds(Sounds)
        Sleep(0.25)                         # seconds; sets the left-to-right sweep rate

Obviously we had far more than 4 horizontal and vertical points, and the real code is in C++ (attached in the file list), so it's a lot more complicated. We also wrote another program that opens a GUI, letting the user adjust several parameters such as the rate of the left-to-right sweep, the maximum and minimum pitch, the number of horizontal steps, and more.
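For a sense of what the C++/OpenAL side of that loop boils down to, the sketch below updates one sound source for a given region. It is an illustration only: the normalization constant, pitch coefficient, and panning scheme are assumptions rather than our actual values.

// Sketch: drive one OpenAL source for region (x, y). Constants are assumed placeholders.
#include <AL/al.h>
#include <algorithm>

void UpdateSource(ALuint source, float averageDepth, int x, int y, int numColumns) {
    // Volume derived from the region's average depth, as in the pseudo-code above.
    // kMaxDepth is an assumed normalization constant.
    const float kMaxDepth = 20.0f;
    alSourcef(source, AL_GAIN, std::min(averageDepth / kMaxDepth, 1.0f));

    // Higher rows map to higher pitch (AL_PITCH is a playback-rate multiplier).
    alSourcef(source, AL_PITCH, 1.0f + 0.25f * y);

    // Pan left to right by placing the source along the X axis in front of the listener.
    float pan = (x / float(numColumns - 1)) * 2.0f - 1.0f;  // -1 (left) .. +1 (right)
    alSource3f(source, AL_POSITION, pan, 0.0f, -1.0f);

    alSourcePlay(source);
}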
