What is it?

lalelu_drums is a system (hardware + software) that can be used for live music performances in front of an audience. It consists of a camera recording a live video of the player, an AI network that estimates the body pose of the player from each video frame and algorithms that detect predefined gestures from the stream of pose coordinates and create sounds depending on the gestures.
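
The last stage of this chain, turning a stream of pose coordinates into sound triggers, can be illustrated with a minimal sketch. It is not the project's actual gesture algorithm: the class `HitDetector`, its two thresholds and the sample coordinates are hypothetical and only show the general idea of firing once per downward movement and re-arming when the hand is raised again.

```python
from dataclasses import dataclass


@dataclass
class HitDetector:
    """Hypothetical trigger: fires once when a keypoint moves below a line.

    Image rows grow downward, so "below" means a larger row value.
    """
    trigger_row: float   # row (pixels) that must be reached to fire
    rearm_row: float     # row the keypoint must rise above to re-arm
    armed: bool = True

    def update(self, row: float) -> bool:
        """Feed one pose estimation result; return True if a hit is triggered."""
        if self.armed and row >= self.trigger_row:
            self.armed = False
            return True
        if not self.armed and row <= self.rearm_row:
            self.armed = True
        return False


# A wrist moving down, up and down again (made-up row coordinates)
detector = HitDetector(trigger_row=300.0, rearm_row=250.0)
wrist_rows = [200, 260, 310, 320, 290, 240, 280, 305]
hits = [i for i, row in enumerate(wrist_rows) if detector.update(row)]
print(hits)
```

The gap between the two thresholds acts as a hysteresis, so that jitter around the trigger line does not produce repeated hits.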

Video 1: Example video showing basic drum pattern

Why?

This type of drumming allows the player to incorporate elements of dancing into the control of the drum sounds. Also, the drummer is not hidden behind the instrument. Both aspects should promote a more intense relationship and interaction between the musician and the audience.

The pose estimation yields coordinates of many different landmarks of the human body (wrists, elbows, knees, nose, eyes, ...), and I envision that these offer intriguing options for creating music from gestures.

Compared to other forms of modern electronic music control, lalelu_drums can be played with a minimal amount of tech visible to the audience (i.e. the camera in front of the player). It is therefore especially well suited to be combined with acoustic instruments in a low-tech setting.

With this kind of atmosphere in mind, and in order to foster good contact with the audience, I would like to design the system so that the player has no need to look at any display while playing. A display will probably be necessary for checking basic parameters like illumination or camera positioning, or for troubleshooting. But it should not be needed for the actual musical performance, so that it can be installed in an unobtrusive way.

An interesting application of lalelu_drums is to augment other instruments with additional percussive elements. In such a hybrid setting, the gestures need to be defined taking into account the normal way of playing the instrument.

Video 2: Acoustic cajon and egg shaker augmented with snare drum and two bells

While it is certainly possible to use the arrangement of lalelu_drums to control other types of instruments apart from percussion (e.g. Theremin-like), I chose percussion for the challenge. If it is possible to design a gesture-controlled percussion system with acceptable latency and temporal resolution, it should be straightforward to extend it to control other types of sounds.

Prior art

There are examples of gesture-controlled drums using the Kinect hardware:
https://www.youtube.com/watch?v=4gSNOuR9pLA
https://www.youtube.com/watch?v=m8EBlWDC4m0
https://www.youtube.com/watch?v=YzLKOC0ulpE
However, the pose estimation path of the Kinect has a frame rate of 30 fps, and I think that this rate is too low to allow for precise music making.
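
To put the frame rate argument into numbers, the time between two frames can be computed as a short sketch. A gesture that happens between two frames can only be detected with the next frame, so the frame period is a rough scale for the timing uncertainty caused by sampling alone. The comparison value of 100 fps is the typical rate mentioned in Log #05; the helper name `frame_period_ms` is only for illustration.

```python
def frame_period_ms(fps: float) -> float:
    """Time between two consecutive frames in milliseconds."""
    return 1000.0 / fps


for fps in (30, 100):
    print(f'{fps} fps -> {frame_period_ms(fps):.1f} ms between frames')
```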

Here is a very early example based on video processing without pose estimation:
https://www.youtube.com/watch?v=-zQ-2kb5nvs&t=9s
However, it needs a blue screen in the background, and since there is no actual pose detection, it cannot react to complex gestures.

There is a tensorflow.js implementation from 2023 of a pose-estimation-based drumming app, but it seems to target a game-like application in a web browser rather than a musical instrument for live performance:
https://www.youtube.com/watch?v=Wh8iEepF-o8&t=86s

There are various 'air drumming' devices commercially available. However, they either need markers for video tracking (Aerodrums) or they use inertial sensors, so that the drummer still has to move some kind of sticks (Pocket Drum II) or gloves (MiMU Gloves) and cannot use gestures involving elbows, legs or face.

Another interesting commercially available device is the Leap Motion Controller. It uses infrared illumination and dual-camera detection to record high-framerate (115 fps) video data of the user's hands from below. The video data is processed by a proprietary algorithm to yield coordinates of the hand and finger joints and tips. Here is a project using the device for making music:
https://www.youtube.com/watch?v=v0zMnNBM0Kg...

Log #05: Pose estimation precision
Lars Friedrich • 04/11/2024 at 19:45 • 0 comments

In order to compare and optimize different body movements, clothing, backgrounds, illumination etc. (also across different players) for the most precise pose estimation results, it is desirable to have a quality measure for the pose estimation results. However, this measure is not straightforward to obtain, since ground truth data is typically missing. In this log entry I present a procedure to obtain a quality measure for the pose estimation results without the need for ground truth data.

The concept relies on the fact that the input data to the pose estimation is always a video of continuous movements, sampled at a high frame rate (typically 100 fps). For each keypoint, it is therefore valid to apply a temporal low-pass filter (I use a Gaussian filter) to the estimated positions. The low-pass filtering increases the precision of the estimated keypoint coordinates. The difference between the raw pose estimations and the low-pass filtered data can then be regarded as a pose estimation error. This difference is a vector (x, y); its length, which I call the residual, serves as a quality measure. The larger the residual, the lower the precision of the individual pose estimation result.
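
The procedure described above can be sketched in a few lines of dependency-free Python. This is an illustration, not the code used in the project: the filter width `sigma` (in frames), the edge handling and the function names are assumptions, and the trajectory in the usage part is synthetic.

```python
import math


def gaussian_kernel(sigma: float) -> list:
    """Normalized Gaussian weights covering +/- 3 sigma (sigma in frames)."""
    radius = max(1, int(round(3 * sigma)))
    weights = [math.exp(-0.5 * (k / sigma) ** 2) for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(values: list, sigma: float) -> list:
    """Temporal Gaussian low-pass; edges are handled by repeating the end values."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    last = len(values) - 1
    return [
        sum(w * values[min(max(i + k - radius, 0), last)] for k, w in enumerate(kernel))
        for i in range(len(values))
    ]


def residuals(rows: list, cols: list, sigma: float) -> list:
    """Length of the difference vector between raw and filtered positions, per frame."""
    rows_f = smooth(rows, sigma)
    cols_f = smooth(cols, sigma)
    return [math.hypot(r - rf, c - cf) for r, rf, c, cf in zip(rows, rows_f, cols, cols_f)]


# Synthetic keypoint track: steady movement with one simulated mis-detection
rows = [100.0 + 0.5 * i for i in range(50)]
cols = [200.0] * 50
rows[25] += 8.0
res = residuals(rows, cols, sigma=2.0)
worst = max(range(len(res)), key=res.__getitem__)
print(f'largest residual {res[worst]:.1f} px at frame {worst}')
```

The choice of `sigma` is a trade-off: a wider filter suppresses more estimation noise, but it also starts to count fast, real movements as residual.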

Video 1 shows an example for the RIGHT_WRIST keypoint. In the image, blue dots indicate individual pose estimation results for the 500 frames of the 5 second recording. The red dot shows the pose estimation result for the current frame and the red cursor in the plots on the right indicates the current frame in the line plots of the row coordinate and column coordinate, respectively. The orange curves in the line plots show the low-pass filtered data. The line plot on the lower right shows the residual for each frame. As an example, a threshold at 3 pixels is shown (yellow horizontal line). All residuals above this threshold are highlighted with a yellow circle. The corresponding coordinates are also highlighted with yellow circles in the camera image.

Video 1: Tracking results for keypoint RIGHT_WRIST

It can be seen that the points with high tracking errors concentrate at a specific position, where the camera perspective is such that the RIGHT_WRIST coordinate is almost identical with the RIGHT_ELBOW coordinate.

To get an impression of the lower bound of the residual, I recorded a 5 second video of a person standing still and computed the rms residual over the full 500 frames for each keypoint. The results are shown in figure 1. It can be seen that the typical rms value of the residual is below one pixel.
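
As a small sketch of the two summary steps used in this log entry, the following code computes the rms value of a residual series and lists the frames above a threshold (3 pixels in the example of video 1). The residual values are made up for illustration and the function names are hypothetical.

```python
import math


def rms(values: list) -> float:
    """Root mean square of a series of residuals."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def frames_above(values: list, threshold: float) -> list:
    """Indices of the frames whose residual exceeds the threshold."""
    return [i for i, v in enumerate(values) if v > threshold]


residual_px = [0.4, 0.6, 0.5, 3.5, 0.7, 4.2, 0.3]   # made-up residuals in pixels
print(f'rms residual: {rms(residual_px):.2f} px')
print(f'frames above threshold: {frames_above(residual_px, threshold=3.0)}')
```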

Figure 1: Tracking results for a still person

Video 2 shows an example for the LEFT_ANKLE keypoint. This time the image is slightly zoomed, as can be seen from the axis limits. Again, the points with high residual concentrate at a certain position, this time the uppermost position of the ankle during the movement. Admittedly, the contrast between the foot and the background is very low here.

Video 2: Tracking results for keypoint LEFT_ANKLE

I think the proposed procedure is helpful for identifying situations where the pose estimation precision is lower than usual. It should be possible to provide a live display of this information to the player.

Log #04: Related work
Lars Friedrich • 04/04/2024 at 17:46 • 0 comments

In addition to the references I made in the 'Prior art' section of the main project description, I recently found two other pieces of work that I would like to mention.

The first is [Mayuresh1611]’s paper piano, which was presented in a recent hackaday.com blog post. The project uses MediaPipe's hand pose detection to create a piano experience on blank sheets of paper. I am not sure how well the implementation works for actual music making, but I must admit that the whole idea is very close to lalelu_drums.

The second is a TEDx talk given by Yago De Quay in 2012. He employs a Kinect device to detect the pose of a human dancer and creates music and video projections from the data. While I still think the frame rate of the Kinect's pose estimation (30 fps) is too low for precise music making, I find that the talk contains many nice ideas on how to use pose estimation to create an appealing live performance.

Log #03: Tonal example
Lars Friedrich • 03/26/2024 at 20:17 • 0 comments

While the main target of lalelu_drums is to control percussive instruments, I would like to show a tonal example in this log entry. I chose the German lullaby "Weißt Du wie viel Sternlein stehen?" as a song that I accompany with chords triggered by gestures.

The first verse of the song is about the multitude of stars and clouds that can be seen in the sky. I chose the triggering gestures to resemble pointing at the stars or showing the moving clouds.

Log #02: MIDI out
Lars Friedrich • 03/21/2024 at 18:22 • 0 comments
I would like to add MIDI out functionality to the backend so that I can use the drum sounds of my Roland JV-1080 sound generator. In this log entry, I explain how I set up the MIDI out and what difficulties I faced.

MIDI follows the communication protocol of a serial port with two special aspects:
- Some dedicated wiring is necessary, since the MIDI ports rely on opto-isolators.
- MIDI uses a baud rate of 31250 baud, which is not a standard serial port baud rate.

On a Raspberry Pi, it is straightforward to create a MIDI output from one of the serial ports as described in this article. Since operating a serial port from Python does not require a special driver, I wanted to use the same approach for my x86-based backend. So when I was selecting the backend mainboard, I carefully looked for one that provides an onboard serial port, and I ended up with an MSI Z97SLI.
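
To illustrate what travels over such a serial MIDI link, here is a minimal sketch that builds a note-on message and estimates its time on the wire. It is not taken from the project code. The note-on layout (status byte 0x90 plus channel, then note and velocity) and the serial framing with one start and one stop bit per byte are standard MIDI; channel index 9 and note 38 follow the General MIDI drum convention, and whether the sound generator in use is configured that way is an assumption.

```python
def note_on(channel: int, note: int, velocity: int) -> bytes:
    """Build a 3-byte MIDI note-on message (channel 0..15, note/velocity 0..127)."""
    if not 0 <= channel <= 15:
        raise ValueError('channel must be in 0..15')
    if not (0 <= note <= 127 and 0 <= velocity <= 127):
        raise ValueError('note and velocity must be in 0..127')
    return bytes([0x90 | channel, note, velocity])


def transmission_time_us(n_bytes: int, baud: int = 31250) -> float:
    """Time on the wire in microseconds: 8 data bits plus start and stop bit per byte."""
    return n_bytes * 10 / baud * 1e6


# Assumed mapping: channel index 9 (MIDI channel 10), note 38 (snare in General MIDI)
msg = note_on(channel=9, note=38, velocity=100)
print(msg.hex(' '), f'-> {transmission_time_us(len(msg)):.0f} us on the wire')
```

The resulting bytes could then be written to an opened serial port, for example with pyserial's `Serial.write`, which is one common option and not necessarily what the project uses.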

However, I had to learn that an RS-232 output (which my mainboard provides) uses different voltage levels than the logic-level UART of a Raspberry Pi. I now use a USB-powered converter (€4,60).

But still, the communication did not work, and it took me some time to figure out that even though I could set the baud rate to 31250 baud without an error message, the serial port was actually operating at the next standard baud rate, which is 38400 baud.

It turns out that the 16550A UART that my mainboard uses for the serial port does not support baud rates other than the standard ones (at least I did not manage to configure it accordingly). On the Raspberry Pi, the MIDI baud rate of 31250 baud could be configured without difficulties from the Python script opening the port.

Now I use a dedicated PCI card that provides serial ports using an AX99100 chipset (€17,95). There is a dedicated Linux driver on the manufacturer's homepage that I could compile without problems ('make'). In order to permanently add the driver, I had to:
- place the binary module ax99100.ko in /lib/modules/5.15.0-91-generic/kernel/drivers ('5.15.0-91-generic' being the kernel I use as returned by 'uname -r')
- add the name 'ax99100' to /etc/modules
- run 'sudo depmod'

With the driver comes a command line tool 'advanced_BR' that can be used to configure custom baud rates. I could successfully run MIDI out communication with the following configuration:
```
advanced_BR -d /dev/ttyF0 -b 1 -m 0 -l 250 -s 16
```
It configures a base clock of 125 MHz (-b 1), a divisor of 250 (-m 0 -l 250) and a sampling of 16 (-s 16), yielding a nominal baud rate of 125 MHz / (250 * 16) = 31250 baud.
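
The same divisor arithmetic offers a possible explanation for the behaviour of the onboard port. The sketch below assumes that the mainboard's 16550A is clocked at the classic PC value of 1.8432 MHz with 16x sampling; this clock was not verified for the MSI board and is an assumption. With an integer divisor, such a port cannot hit 31250 baud, and the closest faster rate is exactly 38400 baud.

```python
def divisor_baud(clock_hz: float, divisor: int, sampling: int = 16) -> float:
    """Baud rate produced by an integer divisor: clock / (divisor * sampling)."""
    return clock_hz / (divisor * sampling)


MIDI_BAUD = 31250

# Assumption: 16550A-style PC port clocked at 1.8432 MHz with 16x sampling
pc_clock = 1_843_200
ideal = pc_clock / (16 * MIDI_BAUD)
print(f'ideal 16550A divisor: {ideal:.4f} (not an integer)')
for divisor in (3, 4):
    baud = divisor_baud(pc_clock, divisor)
    error = (baud / MIDI_BAUD - 1) * 100
    print(f'16550A divisor {divisor}: {baud:.0f} baud ({error:+.1f} %)')

# Values from the advanced_BR call: 125 MHz base clock, divisor 250, sampling 16
print(f'AX99100: {divisor_baud(125e6, 250, 16):.0f} baud')
```

Both neighbouring rates are off by several percent or more, which is generally too much for reliable asynchronous serial communication.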

So now, after a lot of trial and error, I have a serial-port-based MIDI out for about €20 that performs very well. I still wonder how a USB-MIDI adapter would compare to my solution in terms of Linux driver hassle and latency / jitter.

Log #01: Inference speed
Lars Friedrich • 03/14/2024 at 19:14 • 0 comments
In this log entry I present some benchmarks regarding the inference speed of the MoveNet model I use for human pose estimation. The benchmarks are carried out on the backend system, comprising an i5-4690K four-core CPU and a GTX 1660 GPU with 6 GB memory.

To download the model from www.kaggle.com, search for 'movenet' in the models section, select the 'tensorflow 2' tab, select 'single-pose-thunder' from the variation drop-down menu. It offers version 4 of the pretrained model. Download the 'saved_model.pb' file and the complete 'variables' directory and place everything in a local directory.

For the benchmark I use the following code. It measures the time of the first inference, then it runs 100 inferences without measurement for warmup. The measurement is done for 2000 inferences with different random input images.

Code snippet 1: Inference benchmark
```
import numpy as np
import tensorflow as tf
from time import monotonic


def random_image():
    """Return a random image with 8-bit value range as int32 tensor (1, 256, 256, 3)."""
    image = np.random.rand(1, 256, 256, 3) * 255
    return tf.cast(image, dtype=tf.int32)


modelPath = '/home/pi/movenet_single_pose_thunder_4'
model = tf.saved_model.load(modelPath)
movenet = model.signatures['serving_default']

# the first inference is timed separately from the later ones
image = random_image()
start = monotonic()
out = movenet(image)
print(f'first inference time: {monotonic() - start:.3f} s')

# warmup runs without measurement
nWarmup = 100
for _ in range(nWarmup):
    out = movenet(image)

print('starting measurement')

nMeasure = 2000
times = []

for _ in range(nMeasure):
    # a new random input for every run, created outside the timed region
    image = random_image()
    start = monotonic()
    out = movenet(image)
    times.append(monotonic() - start)

# report the statistics in milliseconds
times_ms = np.array(times) * 1e3
print(f'mean inference time: {times_ms.mean():.3f} ms')
print(f'min inference time: {times_ms.min():.3f} ms')
print(f'max inference time: {times_ms.max():.3f} ms')
print(f'standard deviation: {times_ms.std():.3f} ms')
```
The results can be found in the first row ('plain tensorflow') of table 1. The average inference time of 23.4 ms is quite disappointing, given that I already measured 27 ms on a Raspberry Pi 4 with a Google Coral AI accelerator.

Table 1: Benchmark results

| configuration | first | mean | min | max | std |
|---|---|---|---|---|---|
| plain tensorflow | 2.03 s | 23.4 ms | 8.7 ms | 25.5 ms | 2.9 ms |
| tensor rt | 419 ms | 5.6 ms | 3.9 ms | 15.4 ms | 0.51 ms |
| tensor rt + cpu_affinity | 407 ms | 3.8 ms | 3.7 ms | 3.9 ms | 23 µs |

In order to accelerate the inference, I used NVIDIA's TensorRT inference optimization. The following code creates an optimized version of the model. With this optimized model, I obtained the values in the second row ('tensor rt') of table 1. The average inference time was reduced by approximately a factor of four. The optimization itself took 82 seconds.

Code snippet 2: TensorRT conversion
```
import numpy as np
import tensorflow as tf
from tensorflow.python.compiler.tensorrt import trt_convert as trt
from time import monotonic


start = monotonic()

inputPath = '/home/pi/movenet_single_pose_thunder_4'

conversion_params = trt.TrtConversionParams(
    precision_mode=trt.TrtPrecisionMode.FP32)
    
converter = trt.TrtGraphConverterV2(input_saved_model_dir=inputPath,
                                    conversion_params=conversion_params)
converter.convert()

def input_fn():
    inp = np.random.random_sample([1, 256, 256, 3]) * 255
    yield [tf.cast(inp, dtype=tf.int32)]

converter.build(input_fn=input_fn)

outputPath = '/home/pi/movenet_single_pose_thunder_4_tensorrt_benchmark'
converter.save(outputPath)

print(f'processing time: {monotonic() - start}')
```
I could reduce the average inference time further by adding the following lines to the beginning of the benchmarking script. They assign the python process to a specific core of the four-core CPU. With this change, I obtained the values in the third row ('tensor rt + cpu_affinity') of table 1.

Code snippet 3: CPU affinity
```
import psutil

osProcess = psutil.Process()
osProcess.cpu_affinity([2])
```
During the measurement 'tensor rt + cpu_affinity', htop shows that actually only one CPU core is used.

nvidia-smi shows that the GPU runs at half of its maximum power usage and the fan is completely off.
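
One way to read the numbers in table 1 is to compare them with the time available per camera frame. The sketch below assumes a frame rate of 100 fps (the typical rate mentioned in Log #05) and the requirement that even the slowest measured inference should finish before the next frame arrives; both are assumptions for illustration, and the helper `fits_budget` is hypothetical.

```python
FRAME_RATE = 100                   # fps, assumed camera frame rate
budget_ms = 1000 / FRAME_RATE      # time between two frames

# maximum inference times in ms, copied from table 1
max_inference_ms = {
    'plain tensorflow': 25.5,
    'tensor rt': 15.4,
    'tensor rt + cpu_affinity': 3.9,
}


def fits_budget(max_ms: float, budget: float) -> bool:
    """True if the slowest measured inference finishes within one frame period."""
    return max_ms <= budget


for name, max_ms in max_inference_ms.items():
    verdict = 'within budget' if fits_budget(max_ms, budget_ms) else 'exceeds budget'
    print(f'{name}: max {max_ms} ms vs. {budget_ms:.0f} ms -> {verdict}')
```

Under these assumptions, only the configuration with TensorRT and CPU affinity stays within the frame period for every measured inference, although the mean of the plain TensorRT run is already below it.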

The results of the accelerated configuration...