
PS Eye GCC-PHAT

This project computes angles of arrival for two PS Eye microphone arrays and plots lines representing the sound source.

This project uses the first and fourth microphones of two PlayStation Eye cameras to calculate the angle of arrival of sound waves. It uses the alsa library to capture audio, the fftw3 library to perform the Fourier transforms, and ncurses to plot to the terminal.

Generalized cross-correlation with phase transform (GCC-PHAT) is used here to determine the time delay between a signal's arrival at the first microphone in an array and its arrival at another microphone in the array.  In this case, the first and fourth microphones in the microphone array of a PlayStation Eye are used; the distance between these microphones is 6 cm.  Lines representing the incident sound wave at the two microphone arrays are plotted in the terminal, and distances in the x and y directions are reported.  The x-axis is defined as the line on which the cameras sit, with the positive direction toward the right array and zero halfway between the arrays.  The y-axis is defined such that the sound waves are travelling in the negative y direction, with its zero point coincident with the zero point on the x-axis.  During testing, the arrays were spaced from three to five feet apart.
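To make the geometry concrete, here is a minimal sketch (hypothetical names, not taken from the project sources) of how the two angles and the array spacing determine the source position.  Each array contributes a bearing line x = x_array + y*tan(theta), with theta measured from broadside (the +y direction), and the reported point is the intersection of the two lines:

#include <math.h>
#include <stdio.h>

typedef struct { double x; double y; } point_t;

/* Left array at (-d/2, 0), right array at (d/2, 0); theta_l and theta_r
   are angles of arrival in radians, positive toward +x. */
point_t intersect_bearings(double theta_l, double theta_r, double d)
{
    point_t p;
    double denom = tan(theta_l) - tan(theta_r); /* zero if the bearings are parallel */
    p.y = d / denom;
    p.x = -d / 2.0 + p.y * tan(theta_l);
    return p;
}

int main(void)
{
    /* A source one unit above the midpoint of arrays one unit apart:
       theta_l = atan(0.5), theta_r = -atan(0.5). */
    point_t p = intersect_bearings(atan(0.5), -atan(0.5), 1.0);
    printf("x = %f, y = %f\n", p.x, p.y); /* prints x = 0, y = 1 */
    return 0;
}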

At the start of the program, the user chooses which audio capture device to use for the left array and which to use for the right array.  Next, the user is asked to enter the distance between the arrays in feet.  At this point control of the terminal is given to ncurses for plotting.  Audio is sampled with a sample size of 16 bits for each of the four channels at 96,000 frames/second.  The plot is updated only when both arrays report a reasonable angle of arrival, because GCC-PHAT can give nonsensical angles.  These erroneous angles are detected by taking the sine of the angle returned by GCC-PHAT, and they are ignored when the sine value fails an "!isnan()" check.
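As a rough sketch of that check (the constants and function names here are illustrative assumptions, not the project's own code):

#include <math.h>
#include <stdio.h>

#define SPEED_OF_SOUND 343.0 /* m/s at room temperature */
#define MIC_SPACING 0.06     /* m between microphones 1 and 4 */
#define SAMPLE_RATE 96000.0  /* frames/second */

/* Convert the GCC-PHAT peak lag (in samples) to an angle of arrival.
   A physically impossible lag pushes the asin() argument outside
   [-1, 1], so asin() returns NaN and the estimate can be discarded. */
double angle_from_delay(double tau_samples)
{
    double s = tau_samples * SPEED_OF_SOUND / (SAMPLE_RATE * MIC_SPACING);
    return asin(s);
}

int main(void)
{
    double theta = angle_from_delay(8.0); /* plausible lag: about 0.50 rad */
    if (!isnan(sin(theta)))               /* the sanity check described above */
        printf("angle of arrival: %f rad\n", theta);

    double bad = angle_from_delay(40.0);  /* impossible lag: |s| > 1, so NaN */
    if (isnan(sin(bad)))
        printf("rejected nonsensical angle\n");
    return 0;
}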

The code is available as a zipped CodeLite project and as separate files in the files section.


There are print statements commented out throughout the code, and these may be uncommented for debugging.  Note that the ncurses functionality should be disabled if printing to stdout is desired for debugging.         

Thanks Chuck Wooters for the code on your GitHub showing the usage of FFTW3 for GCC-PHAT.

Thanks David E. Narváez for the code on your blog showing the usage of FFTW3 for cross correlation, and how to rotate the output array so that the 0-frequency appears in the center.

Thanks Xavier Anguera for the page showing the math of the GCC-PHAT.

Thanks Jeff Tranter for the article in Linux Journal showing the procedure for capturing audio using alsa.

Please let me know if you find errors/bugs or if you feel that I have not attributed usage of your prior art properly.  I intend no infringement, and will cite you or remove this project.


  • Proof of Concept

    Rob Grizzard, 02/08/2019 at 17:11

    This was the first step in the project.  A single microphone array was used to calculate the angle of arrival of sound waves every quarter second for five seconds.

    The ALSA programming is based on https://www.linuxjournal.com/article/6735

    Code:

    #define ALSA_PCM_NEW_HW_PARAMS_API

    #include <alsa/asoundlib.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <math.h>
    #include <complex.h>
    #include <fftw3.h>

    int main(int argc, const char* argv[])
    {
        /* FFTW setup */
        int num_samples = 250 * 48; /* 12000 frames = 0.25 second at 48000 frames/second */
        double *channel_1 = (double*) malloc(sizeof(double) * num_samples);
        memset(channel_1, 0, sizeof(double) * num_samples);
        double *channel_2 = (double*) malloc(sizeof(double) * num_samples);
        memset(channel_2, 0, sizeof(double) * num_samples);

        /* An r2c transform of n real samples yields n/2 + 1 complex bins */
        int nComplex = (num_samples / 2) + 1;
        fftw_complex *channel_1_out = (fftw_complex *) fftw_malloc(sizeof(fftw_complex) * nComplex);
        memset(channel_1_out, 0, sizeof(fftw_complex) * nComplex);
        fftw_complex *channel_2_out = (fftw_complex *) fftw_malloc(sizeof(fftw_complex) * nComplex);
        memset(channel_2_out, 0, sizeof(fftw_complex) * nComplex);

        fftw_plan pa = fftw_plan_dft_r2c_1d(num_samples, channel_1, channel_1_out, FFTW_ESTIMATE);
        fftw_plan pb = fftw_plan_dft_r2c_1d(num_samples, channel_2, channel_2_out, FFTW_ESTIMATE);

        int i = 0;
        fftw_complex *multiplication_result = (fftw_complex *) fftw_malloc(sizeof(fftw_complex) * num_samples);
        memset(multiplication_result, 0, sizeof(fftw_complex) * num_samples);

        fftw_complex *out = (fftw_complex *) fftw_malloc(sizeof(fftw_complex) * num_samples);
        memset(out, 0, sizeof(fftw_complex) * num_samples);

        /* Inverse transform of the cross-power spectrum gives the cross-correlation */
        fftw_plan px = fftw_plan_dft_1d(num_samples, multiplication_result, out, FFTW_BACKWARD, FFTW_ESTIMATE);

        /* ALSA setup */
        int rc;
        int num_channels = 4;
        snd_pcm_t *handle;
        rc = snd_pcm_open(&handle, "plughw:1,0", SND_PCM_STREAM_CAPTURE, 0);

        if (rc < 0) {
            fprintf(stderr, "unable to open pcm device: %s\n", snd_strerror(rc));
            exit(1);
        }

        snd_pcm_hw_params_t *params;
        snd_pcm_hw_params_alloca(&params);
        snd_pcm_hw_params_any(handle, params);
        /* Interleaved mode */
        snd_pcm_hw_params_set_access(handle, params, SND_PCM_ACCESS_RW_INTERLEAVED);
        /* Signed 16-bit little-endian format */
        snd_pcm_hw_params_set_format(handle, params, SND_PCM_FORMAT_S16_LE);
        /* Four channels */
        snd_pcm_hw_params_set_channels(handle, params, num_channels);
        /* 48000 frames/second sample rate */
        int dir = 0;
        unsigned int sampleRate = 48000;
        snd_pcm_hw_params_set_rate_near(handle, params, &sampleRate, &dir);
        printf("Chosen approximate set rate from snd_pcm_hw_params_set_rate_near(): %u\n", sampleRate);

        /* Set period size to 12000 frames (0.25 second at 48000 frames/second) */
        snd_pcm_uframes_t frames = 250 * 48;
        snd_pcm_hw_params_set_period_size_near(handle, params, &frames, &dir);
        printf("Chosen approximate target period size from snd_pcm_hw_params_set_period_size_near(): %lu\n", frames);

        /* Write parameters to driver */
        rc...

    Read more »
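    The log excerpt ends before the GCC-PHAT step itself.  As a rough sketch of what that step looks like with this FFTW setup (illustrative names, not the project's code; note that including complex.h before fftw3.h, as above, makes fftw_complex the native C99 complex type):

    #include <complex.h> /* before fftw3.h, so fftw_complex is C99 double complex */
    #include <math.h>
    #include <fftw3.h>

    /* Fill "cross" with the PHAT-weighted cross-power spectrum of two
       channels whose r2c FFTs are in X1 and X2 (nComplex bins each). */
    static void phat_weight(const fftw_complex *X1, const fftw_complex *X2,
                            fftw_complex *cross, int nComplex)
    {
        for (int k = 0; k < nComplex; k++) {
            double complex p = X1[k] * conj(X2[k]); /* cross-power spectrum */
            double mag = cabs(p);
            cross[k] = (mag > 0.0) ? p / mag : 0.0; /* PHAT: keep only the phase */
        }
    }

    Executing a backward plan like px on the weighted spectrum then gives a cross-correlation whose peak index (interpreted modulo num_samples for negative lags) is the delay in samples.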


  • 1
    The 5-second proof of concept is compiled with:

    gcc -Wall gcc_phat_five_sec.c -o gcc_phat_five_sec -lm -lfftw3 -lasound

    Rejoice.

  • 2
    The standalone files in the files section should be placed in the following directory structure (a possible compile line is sketched after the tree):

    root

    ----array/

            |--------/include/mic_array.h

            |--------/src/mic_array.c

    ----fftw/

            |--------/include/fftw_utils.h

            |--------/src/fftw_utils.c

    ----misc/

            |--------/include/misc_utils.h

            |--------/src/misc_utils.c

    ----main.c
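    One possible compile line for this layout (an untested sketch; the output name is arbitrary, and -lncurses assumes the terminal plotting described above):

    gcc -Wall -Iarray/include -Ifftw/include -Imisc/include \
        main.c array/src/mic_array.c fftw/src/fftw_utils.c misc/src/misc_utils.c \
        -o ps_eye_gcc_phat -lm -lfftw3 -lasound -lncurses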

  • 3
    Alternatively, the .zip in the files section contains a workspace created in CodeLite.



Discussions

chitester12 wrote 04/02/2019 at 09:13

I want to build a large microphone array, say 64 microphones.  I'm thinking of connecting 16 PlayStation Eyes to a Linux machine.  Do you think it is possible to capture all the audio streams?


clatour007 wrote 02/08/2019 at 21:59

Very good!

I was starting to research the algorithm for sound direction detection, and this is an already-made prototype.  Great!

One thing has me a bit confused though: the input is 4 channels (2x2 mics) but only 2 channels are used?  (Dropping ch-2&3 and only using ch-1&4.)  Or am I mis-reading the code?

Also, there doesn't seem to be any frequency filtering of the input sound waves.  According to what I've read, the phase differential of the waves arriving at the mic(s) depends (among other things) on the frequency of the sound wave (perhaps that's what causes the sometimes 'erratic' results).  Together with the time difference of the sampling (i.e. channels are sampled one at a time, so channel-1 is 3 time-samples ahead of channel-4), this may induce enough jitter to bias the calculations.

Thank you very much for collating and publishing this code !


Rob Grizzard wrote 02/09/2019 at 18:53

Thanks for the kind words!

I did drop channels 2 & 3.  The microphone arrays I am using only have 2 cm between adjacent microphones, so I just take the samples from microphones 1 and 4 to get 6 cm of separation.  I do this so that the time delay will be greater, and I can then achieve higher angular resolution.

I need to look into filtering the input, that is an excellent idea for future work. 

It is my understanding that during sampling all four channels are sampled sequentially, as you say.  However, the time difference from one channel to the next is much smaller than the time difference from one sample to the next.  Stated another way, I think that the time difference among channels during one sample (a "frame" as alsa uses that term) and the time difference from one sample to the next are so different as to make the difference among the channels in a frame negligible. 

Let me know if I can clarify anything more, and thanks for checking this out.  I hope it proves useful!


clatour007 wrote 02/10/2019 at 03:54

Ahhh, that makes sense.  2 cm is way too close.

Yes, the time difference may be negligible (sorry, most of my reading was focused on high-frequency sampling, where such delays would introduce very significant phase delays).

As for frequency filtering, the algorithm used works differently than what I've read so far.  It seems to work in the time domain (cross-correlating the waveforms in time and extracting the time difference between them) instead of the phase/frequency domain (cross-correlating the phase of the waveforms to extract said time).  Given that, it would appear that pre-filtering wouldn't improve it all that much.  (Still going through the maths though :-})

My project is to have 2 of these in a room (90 deg apart) to localize the source of a speaker.

(i.e. localize the speaker within the room.)  Simple usage example: within a medium-size room, I say 'turn my light' and a system would know to power on my closest light, all based on the location of the voice within the room :-)

It may also be usable as an input to a filter to remove unwanted sound (much like the beam-forming antennas in most wi-fi routers, where they 'direct' the signal via software).

Current Alexa/Google Assistant devices give only one direction of filtering (8 channels, but only resolving a single direction).  With 2 of these 90 deg apart, it should be possible to get even better 'far-field' audio filtering.

If you have published any content along those lines, pls. point me to it!

Thanks,

Carlos


Rob Grizzard wrote 02/10/2019 at 18:22

Your project sounds like a great application of these techniques, and I hope they help you accomplish your goals.

I haven't published anything in this field, and I am definitely not an expert.  I do enjoy learning the mathematics behind these algorithms though! 

The GCC-PHAT as it is used here as of today is performed in the frequency-domain.  Section 2 of this 2012 paper: ftp://ftp.esat.kuleuven.be/pub/SISTA/ida/reports/12-133.pdf has a good overview.  These authors claim that the computational complexity of the algorithm and its low angular resolution for small arrays make it less desirable than a time-domain alternative.  I chose to achieve higher resolution by increasing the sample rate because I found the frequency domain GCC-PHAT algorithm to be easy to understand and translate to code, but these authors propose "quadratic interpolation" as an alternative means.  I have not yet looked into such interpolation.

If I am correctly interpreting the mathematics of their alternative time-domain approach, then they are applying the definition of cross-correlation on slide 35 of http://www-pagines.fib.upc.es/~pds/Lect02.pdf, followed by whitening one of the inputs by an adaptive linear prediction filter (to maintain the unity gain advantage of PHAT weighting), copying that filter to the other input, and finally using a quadratic interpolation to increase the resolution.

Those authors seem to have moved on from this approach, as this paper: ftp://ftp.esat.kuleuven.be/sista/ida/reports/13-199.pdf from 2013 uses steered response power.  A 2016 paper (ftp://ftp.esat.kuleuven.ac.be/pub/stadius/ida/reports/CP_16-183.pdf) uses a similar algorithm augmented with a multi-channel Wiener filter to locate footsteps.

I also found this paper: http://www.handicams-fet.eu/uploads/files/6.pdf from 2015, where an eigenvalue decomposition is used for each array.  Their procedure was to use a multi-channel Wiener filter (either by a central processor or by a distributed means called DANSE) for de-noising, followed by subspace estimation in the frequency domain, and finally wide-band direction of arrival estimation.  The authors here cite the paper: ftp://ftp.esat.kuleuven.ac.be/sista/doclo/reports/01-30.pdf as a resource for understanding multichannel Wiener filtering. 

When I began my research at the start of this project, I came across these subspace-based methods, but my inability to follow the mathematics involved in the algorithms has kept me from attempting them.  For example, I have never applied the expected value operator to matrices.  I tried to follow the explanation given here: https://math.stackexchange.com/questions/642291/expectation-operator-on-a-matrix, but I think I need a more concrete example.  I may attempt to write a toy example this week in an effort to familiarize myself with the linear algebra concepts needed by these algorithms.

I hope these resources help you, and I would appreciate any resources you know of that show any related linear algebra procedures.

P.S. I am replying to this older message because I don't see the option to reply to your most recent comment.

