Gesture controlled percussion based on markerless human-pose estimation with an AI network from a live video.


What is it?

lalelu_drums is a system (hardware + software) that can be used for live music performances in front of an audience. It consists of a camera recording a live video of the player, an AI network that estimates the body pose of the player from each video frame and algorithms that detect predefined gestures from the stream of pose coordinates and create sounds depending on the gestures.

Video 1: Example video showing basic drum pattern


This type of drumming makes it possible to incorporate elements of dancing into the control of the drum sounds. Also, the drummer is not hidden behind the instrument. Both aspects should promote a more intense relationship and interaction between the musician and the audience.

The pose estimation yields coordinates of many different landmarks of the human body (wrists, elbows, knees, nose, eyes, ...) and I envision that there are intriguing ways to create music from gestures with these.

Compared to other forms of modern electronic music control, lalelu_drums can be played with a minimal amount of tech visible to the audience (i. e. the camera in front of the player). It is therefore especially well suited to be combined with acoustic instruments in a low-tech setting.

With this kind of atmosphere in mind, and in order to foster good contact with the audience, I would like to design the system so that the player has no need to look at any display while playing. For checking basic parameters like illumination or camera positioning, or for troubleshooting, I think a display will be necessary. But it should not be needed for the actual musical performance, so that it can be installed in an unobtrusive way.

An interesting application of lalelu_drums is to augment other instruments with additional percussive elements. In such a hybrid setting, the gestures need to be defined taking into account the normal way of playing the instrument.

Video 2: Acoustic cajon and egg shaker augmented with snare drum and two bells

While it is certainly possible to use the arrangement of lalelu_drums to control other types of instruments apart from percussion (e. g. Theremin-like), I chose percussion for the challenge. If it is possible to design a gesture controlled percussion system with acceptable latency and temporal resolution, it should be straightforward to extend it to control other types of sounds.

Prior art

There are examples of gesture controlled drums using the kinect hardware:
However, the pose estimation path of the kinect has a frame rate of 30fps and I think that this rate is too low to allow for precise music making.
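
The frame rate argument can be made concrete with a small calculation. The sketch below is only an illustration: it compares the 30fps of the kinect with the 100fps mentioned in Log #05 and treats one frame period as a rough bound on the timing uncertainty of a detected gesture, ignoring any further processing latency. The function name is hypothetical.

```python
def frame_period_ms(fps: float) -> float:
    """Time between two consecutive video frames in milliseconds."""
    return 1000.0 / fps


# A gesture can only be detected once per frame, so the frame period is a
# rough bound on how precisely a drum hit can be placed in time.
for fps in (30, 100):
    print(f"{fps} fps -> one pose estimate every {frame_period_ms(fps):.1f} ms")
```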

Here is a very early example based on video processing without pose estimation:
However, it needs a blue screen in the background, and since there is no actual pose detection, it cannot react to complex gestures.

There is a tensorflow.js implementation from 2023 of a pose-estimation-based drumming app, but it seems to target a game-like application in a web browser rather than a musical instrument for a live performance:

There are various 'air drumming' devices commercially available. However, they either need markers for video tracking (Aerodrums) or they use inertial sensors, so that the drummer still has to move some kind of sticks (Pocket Drum II) or gloves (MiMU Gloves) and cannot use gestures involving elbows, legs or face.

One other interesting commercially available device is the Leap motion controller. It uses infrared illumination and dual-camera detection to record high-framerate (115fps) video data of the user's hands from below. The video data is processed by some proprietary algorithm to yield coordinates of the hand and finger joints + tips. Here is a project using the device for making music:


  • Log #07: Combinations

    Lars Friedrich, 06/18/2024 at 17:27

    In this log entry, I use a virtual bass guitar as an example to show how multiple keypoints in combination can be used to control a single instrument.

    The movements of the RIGHT_WRIST keypoint trigger a bass guitar sound, while the position of the LEFT_WRIST keypoint defines which note is played. Additionally, as an extension of the way a real bass guitar is played, the position of the RIGHT_WRIST keypoint can override the note defined by LEFT_WRIST, so that the fundamental note of the scale is easily accessible.
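
The note selection described above could look roughly like the following sketch. It is not the project's implementation: the scale, the image height, the mapping of the LEFT_WRIST row to a note and the override rule (RIGHT_WRIST below a threshold row) are all assumptions made for illustration, as are the names `SCALE` and `select_note`.

```python
# Hypothetical MIDI note numbers of a five-note scale, fundamental first.
SCALE = [40, 43, 45, 47, 50]


def select_note(left_wrist_row: float, right_wrist_row: float,
                image_height: int = 256, override_row: float = 200.0) -> int:
    """Pick a note from the wrist positions (rows in pixels, 0 = top of image)."""
    # Assumed override rule: a low RIGHT_WRIST forces the fundamental note.
    if right_wrist_row > override_row:
        return SCALE[0]
    # Otherwise divide the image height into equal bands, one per scale note,
    # and let the LEFT_WRIST row choose the band.
    index = int(left_wrist_row / image_height * len(SCALE))
    index = min(max(index, 0), len(SCALE) - 1)
    return SCALE[index]


print(select_note(left_wrist_row=128, right_wrist_row=100))
print(select_note(left_wrist_row=128, right_wrist_row=250))
```

The trigger itself (detecting the RIGHT_WRIST movement) is left out here; the sketch only covers which note would be played once a trigger occurs.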

  • Log #06: Other keypoints

    Lars Friedrich, 05/28/2024 at 20:28

    In the demo videos so far, I almost exclusively used the LEFT_WRIST and RIGHT_WRIST keypoints. In this log entry, I present an example using other keypoints. The option to use keypoints following different parts of the body extends the possibilities of lalelu_drums beyond those of 'air drumming' solutions that rely on tracking drumsticks or similar devices (e. g. Pocket Drum II).

    Video 1: The bad touch

    In this example, bass and snare are controlled by a virtual keypoint that I call 'Svadhisthana', which is the middle position between LEFT_HIP and RIGHT_HIP.

    Four chords are triggered from the movements of LEFT_ELBOW and RIGHT_ELBOW.
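
A virtual keypoint like the one above is just the midpoint of two estimated keypoints. The following sketch assumes a keypoint array of shape (17, 3) with one row per keypoint in the order used by MoveNet (LEFT_HIP at index 11, RIGHT_HIP at index 12) and columns (row, column, score). The function name `virtual_midpoint` is hypothetical.

```python
import numpy as np

# Indices in the MoveNet keypoint order.
LEFT_HIP, RIGHT_HIP = 11, 12


def virtual_midpoint(keypoints, a: int, b: int) -> np.ndarray:
    """Return the (row, column) position halfway between keypoints a and b."""
    keypoints = np.asarray(keypoints, dtype=float)
    # Only the two coordinate columns are averaged, the score is ignored.
    return 0.5 * (keypoints[a, :2] + keypoints[b, :2])


# Example with made-up normalized coordinates.
pose = np.zeros((17, 3))
pose[LEFT_HIP] = [0.6, 0.4, 0.9]
pose[RIGHT_HIP] = [0.6, 0.6, 0.8]
print(virtual_midpoint(pose, LEFT_HIP, RIGHT_HIP))
```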

  • Log #05: Pose estimation precision

    Lars Friedrich, 04/11/2024 at 19:45

    In order to compare and optimize different body movements, clothing, backgrounds, illumination etc. (also across different players) for the most precise pose estimation results, it is desirable to have a quality measure for the pose estimation results. However, this measure is not straightforward to obtain, since ground truth data is typically missing. In this log entry I present a procedure to obtain a quality measure for the pose estimation results without the need for ground truth data.

    The concept relies on the fact that the input data to the pose estimation is always a video of continuous movements, sampled at a high frame rate (typically 100fps). For each keypoint, it is therefore valid to apply a temporal low-pass filter (I use a Gaussian filter) to the estimated positions. The low-pass filtering will increase the precision of the estimated keypoint coordinates. Then, the difference between the raw pose estimations and the low-pass filtered data can be regarded as a pose estimation error. The difference is a vector (x,y), and the length of this vector can be called the residual and serves as a quality measure. The larger the residual, the lower the precision of the individual pose estimation results.
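
The procedure can be sketched in a few lines. This is a minimal illustration, not the code used for the videos: it assumes SciPy's `gaussian_filter1d` as the Gaussian low-pass, a filter width of 3 frames (the actual width is not stated in this log) and a track given as an array of (row, column) positions per frame. The names `pose_residual` and `rms` are hypothetical.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d


def pose_residual(track, sigma_frames: float = 3.0):
    """Low-pass filter a keypoint track and return (smoothed, residual).

    track: array of shape (n_frames, 2) with (row, column) per frame.
    residual: per-frame distance between raw and smoothed position.
    """
    track = np.asarray(track, dtype=float)
    # Filter along the time axis, independently for row and column.
    smoothed = gaussian_filter1d(track, sigma=sigma_frames, axis=0, mode="nearest")
    residual = np.linalg.norm(track - smoothed, axis=1)
    return smoothed, residual


def rms(values) -> float:
    """Root mean square, e.g. of the residual over a whole recording."""
    return float(np.sqrt(np.mean(np.square(values))))


# Synthetic 5 second track at 100fps with one artificial outlier at frame 250.
t = np.arange(500) / 100.0
track = np.stack([100 + 50 * np.sin(2 * np.pi * 0.5 * t), np.full(500, 120.0)], axis=1)
track[250, 0] += 10.0
smoothed, residual = pose_residual(track)
print("frames above a 3 pixel threshold:", np.flatnonzero(residual > 3.0))
print("rms residual:", rms(residual))
```

The choice of the filter width is a trade-off: a wider filter suppresses more estimation noise, but it also starts to flatten fast, intentional movements, which would then show up as residual.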

    Video 1 shows an example for the RIGHT_WRIST keypoint. In the image, blue dots indicate individual pose estimation results for the 500 frames of the 5 second recording. The red dot shows the pose estimation result for the current frame and the red cursor in the plots on the right indicates the current frame in the line plots of the row coordinate and column coordinate, respectively. The orange curves in the line plots show the low-pass filtered data. The line plot on the lower right shows the residual for each frame. As an example, a threshold at 3 pixels is shown (yellow horizontal line). All residuals above this threshold are highlighted with a yellow circle. The corresponding coordinates are also highlighted with yellow circles in the camera image.

    Video 1: Tracking results for keypoint RIGHT_WRIST

    It can be seen that the points with high tracking errors concentrate at a specific position, where the camera perspective is such that the RIGHT_WRIST coordinate is almost identical with the RIGHT_ELBOW coordinate.

    To get an impression of the lower bound of the residual, I recorded a 5 second video of a still person and computed the rms residual for the full 500 frames for each keypoint. The results are shown in figure 1. It can be seen that the typical rms value of the residual is below one pixel.

    Figure 1: Tracking results for a still person

    Video 2 shows an example for the LEFT_ANKLE keypoint. This time the image is slightly zoomed, as can be seen from the axis limits. Again, the points with high residual concentrate at a certain position. This time it is the uppermost position of the ankle during the movement. Admittedly, the contrast between the foot and the background is very low here.

    Video 2: Tracking results for keypoint LEFT_ANKLE

    I think the proposed procedure is helpful to identify situations where the pose estimation precision is lower than usual. It should be possible to provide a live display of this information to the player.

  • Log #04: Related work

    Lars Friedrich, 04/04/2024 at 17:46

    In addition to the references I made in the 'Prior art' section in the main project description, I recently found two other pieces of work that I would like to mention.

    The first is [Mayuresh1611]’s paper piano that was presented in a recent blog post. The project uses Mediapipe's hand pose detection to create a piano experience on blank sheets of paper. I am not sure how well the implementation works for actual music making, but I must admit that the whole idea is very close to lalelu_drums.

    The second is a TEDx talk given by Yago De Quay in 2012. He employs a kinect device to detect the pose of a human dancer and creates music and video projections from the data. While I still think the framerate of the kinect's pose estimation (30fps) is too slow for precise music making, I find that the talk contains many nice ideas of how to use the pose estimation technique to create an appealing live performance.

  • Log #03: Tonal example

    Lars Friedrich, 03/26/2024 at 20:17

    While the main target of lalelu_drums is to control percussive instruments, I would like to show a tonal example in this log entry. I chose the German lullaby "Weißt Du wie viel Sternlein stehen?" as a song that I accompany with chords triggered by gestures.

    The first verse of the song is about the multitude of stars and clouds that can be seen in the sky. I chose the triggering gestures to resemble pointing at the stars or showing the moving clouds.

  • Log #02: MIDI out

    Lars Friedrich, 03/21/2024 at 18:22

    I would like to add MIDI out functionality to the backend so that I can use the drum sounds of my Roland JV-1080 sound generator. In this log entry, I explain how I set up the MIDI out and what difficulties I faced.

    MIDI follows the communication protocol of a serial port with two special aspects:

    • Some dedicated wiring is necessary, since the MIDI ports rely on opto-isolators.
    • MIDI uses a baudrate of 31250Hz, which is not a standard serial port baudrate.
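
To illustrate what travels over such a serial line, the sketch below builds standard three-byte MIDI note-on and note-off messages and estimates their transmission time at the MIDI baudrate. The helper names are hypothetical, and the example note (38 on channel index 9, the channel conventionally used for drums) is an assumption, not a setting taken from this project.

```python
MIDI_BAUDRATE = 31250  # bits per second
BITS_PER_BYTE = 10     # 1 start bit + 8 data bits + 1 stop bit


def note_on(channel: int, note: int, velocity: int) -> bytes:
    """Build a MIDI note-on message (channel 0..15, note and velocity 0..127)."""
    if not (0 <= channel <= 15 and 0 <= note <= 127 and 0 <= velocity <= 127):
        raise ValueError("channel, note or velocity out of range")
    return bytes([0x90 | channel, note, velocity])


def note_off(channel: int, note: int) -> bytes:
    """Build the matching MIDI note-off message."""
    if not (0 <= channel <= 15 and 0 <= note <= 127):
        raise ValueError("channel or note out of range")
    return bytes([0x80 | channel, note, 0])


def transmission_time_ms(n_bytes: int) -> float:
    """Time needed to send n_bytes over the MIDI serial line."""
    return n_bytes * BITS_PER_BYTE / MIDI_BAUDRATE * 1000.0


message = note_on(channel=9, note=38, velocity=100)
print(message.hex(), f"{transmission_time_ms(len(message)):.2f} ms")
```

With a serial library such as pyserial, the resulting bytes could then be written to a port opened at 31250 baud. A three-byte message occupies the line for just under one millisecond, which is small compared to the inference times discussed in Log #01.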

    On a RaspberryPi, it is straightforward to create a MIDI output from one of the serial ports as described in this article. Since operating a serial port from python does not require a special driver, I wanted to use the same approach for my x86 based backend. So when I was selecting the backend mainboard, I carefully looked for one that provides an onboard serial port and I ended up with an MSI Z97SLI.

    However, I had to learn that an RS232 output (which my mainboard provides) uses different voltage levels than the UART that a RaspberryPi has. I now use a USB-powered converter (€4,60).

    But still, the communication did not work and it took me some time to figure out that even though I could set the baudrate to 31250Hz without error message, the serial port was actually operating at the next standard baudrate which is 38400Hz.

    It turns out that the 16550A chipset that my mainboard uses for the serial port just does not support custom baudrates different from the default baudrates (at least I did not manage to configure it accordingly). On the RaspberryPi the MIDI baudrate of 31250Hz could be configured without difficulties from the python script opening the port.

    Now I use a dedicated PCI card that provides serial ports using an AX99100 chipset (€17,95). There is a dedicated linux driver on the manufacturer's homepage that I could compile without problems ('make'). In order to permanently add the driver, I had to

    • place the binary module ax99100.ko in /lib/modules/5.15.0-91-generic/kernel/drivers ('5.15.0-91-generic' being the kernel I use as returned by 'uname -r')
    • add the name 'ax99100' to /etc/modules
    • run 'sudo depmod'

    With the driver comes a command line tool 'advanced_BR' that can be used to configure custom baudrates. I could successfully run MIDI out communication with the following configuration:

    advanced_BR -d /dev/ttyF0 -b 1 -m 0 -l 250 -s 16

    It configures a base clock of 125MHz (-b 1), a divisor of 250 (-m 0 -l 250) and a sampling of 16 (-s 16), yielding a nominal baudrate of 125MHz / (250 * 16) = 31250Hz.
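
The arithmetic behind this configuration, and behind the earlier failure at 38400Hz, can be checked with a short sketch. The function names are hypothetical; the numbers are the ones given in this log entry.

```python
def uart_baudrate(base_clock_hz: float, divisor: int, sampling: int) -> float:
    """Nominal baudrate resulting from base clock, divisor and sampling."""
    return base_clock_hz / (divisor * sampling)


def relative_error(actual_hz: float, nominal_hz: float) -> float:
    """Relative deviation of an actual baudrate from the nominal one."""
    return (actual_hz - nominal_hz) / nominal_hz


midi_baudrate = uart_baudrate(125e6, divisor=250, sampling=16)
print(f"configured baudrate: {midi_baudrate:.0f} Hz")

# The onboard port silently ran at the next standard baudrate instead.
print(f"deviation at 38400 Hz: {relative_error(38400, midi_baudrate):.1%}")
```

Asynchronous serial links typically tolerate a baudrate mismatch of only a few percent, so a deviation of this size is consistent with the communication not working at all.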

    So now, after a lot of trial and error, I have a serial-port-based MIDI out for ~20€ that performs very well. I still wonder how a USB-MIDI adapter would compare to my solution in terms of linux-driver-hassle and latency / jitter.

  • Log #01: Inference speed

    Lars Friedrich, 03/14/2024 at 19:14

    In this log entry I present some benchmarks regarding the inference speed of the movenet model I use for human pose estimation. The benchmarks are carried out on the backend system, comprising an i5 4690K four-core CPU and a GTX1660 GPU with 6GB memory.

    To download the model, search for 'movenet' in the models section, select the 'tensorflow 2' tab and choose 'single-pose-thunder' from the variation drop-down menu. This offers version 4 of the pretrained model. Download the 'saved_model.pb' file and the complete 'variables' directory and place everything in a local directory.

    For the benchmark I use the following code. It measures the time of the first inference, then it runs 100 inferences without measurement for warmup. The measurement is done for 2000 inferences with different random input images.

    Code snippet 1: Inference benchmark

    import numpy as np
    import tensorflow as tf
    from time import monotonic
    modelPath = '/home/pi/movenet_single_pose_thunder_4'
    model = tf.saved_model.load(modelPath)
    movenet = model.signatures['serving_default']
    # create 8-bit range random input
    input = np.random.rand(1, 256, 256, 3) * 255
    input = tf.cast(input, dtype=tf.int32)
    start = monotonic()
    out = movenet(input)
    print(f'first inference time: {(monotonic() - start)}')
    nWarmup = 100
    for i in range(nWarmup):
        out = movenet(input)
    print('starting measurement')
    nMeasure = 2000
    times = []
    for i in range(nMeasure):
        input = np.random.rand(1, 256, 256, 3) * 255
        input = tf.cast(input, dtype=tf.int32)
        start = monotonic()
        out = movenet(input)
        times.append(monotonic() - start)
    print(f'mean inference time: {np.mean(times)}')
    print(f'min inference time: {np.min(times)}')
    print(f'max inference time: {np.max(times)}')
    print(f'standard deviation: {np.std(times)}')

    The results can be found in the first row ('plain tensorflow') of table 1. The average inference time of 23.4ms is quite disappointing, given that I could measure 27ms already on a RaspberryPi 4 with a Google Coral AI accelerator.

    Table 1: Benchmark results

    Configuration            | First inference | Mean    | Min    | Max     | Std. deviation
    plain tensorflow         | 2.03 s          | 23.4 ms | 8.7 ms | 25.5 ms | 2.9 ms
    tensor rt                | 419 ms          | 5.6 ms  | 3.9 ms | 15.4 ms | 0.51 ms
    tensor rt + cpu_affinity | 407 ms          | 3.8 ms  | 3.7 ms | 3.9 ms  | 23 µs

    In order to accelerate the inference, I used NVIDIA's TensorRT inference optimization. The following code creates an optimized version of the model. With this optimized model, I obtained the values in the second row ('tensor rt') of table 1. The average inference time was reduced approximately by a factor of four. The optimization itself took 82 seconds.

    Code snippet 2: TensorRT conversion

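    import numpy as np
    import tensorflow as tf
    from tensorflow.python.compiler.tensorrt import trt_convert as trt
    from time import monotonic
    start = monotonic()
    inputPath = '/home/pi/movenet_single_pose_thunder_4'
    # NOTE: the arguments of the next two calls were truncated in this listing;
    # library defaults are assumed here, the original settings are unknown
    conversion_params = trt.TrtConversionParams()
    converter = trt.TrtGraphConverterV2(input_saved_model_dir=inputPath,
                                        conversion_params=conversion_params)
    converter.convert()  # assumed step, missing from the truncated listing
    def input_fn():
        # representative random input in the 8-bit range, as in the benchmark
        inp = np.random.random_sample([1, 256, 256, 3]) * 255
        yield [tf.cast(inp, dtype=tf.int32)]
    converter.build(input_fn=input_fn)  # assumed step: build the engines
    outputPath = '/home/pi/movenet_single_pose_thunder_4_tensorrt_benchmark'
    converter.save(outputPath)  # assumed step: write the optimized model
    print(f'processing time: {monotonic() - start}')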

    I could reduce the average inference time further by adding the following lines to the beginning of the benchmarking script. They assign the python process to a specific core of the four-core CPU. With this change, I obtained the values in the third row ('tensor rt + cpu_affinity') of table 1.

    Code snippet 3: CPU affinity

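    import psutil
    osProcess = psutil.Process()
    # the affinity call is missing from this listing; core index 0 is an assumption
    osProcess.cpu_affinity([0])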

    During the measurement 'tensor rt + cpu_affinity', htop shows that actually only one CPU core is used.

    nvidia-smi shows that the GPU runs at half of its maximum power usage and the fan is completely off.
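
The benchmark values can be put into relation with the camera frame rate using a short calculation. The sketch below takes the mean and maximum inference times from table 1 (assuming the table columns follow the order of the print statements in code snippet 1) and compares them with the frame period at 100fps, the typical rate mentioned in Log #05. It ignores image capture and all other processing, and the names are hypothetical.

```python
# (mean, max) inference time in milliseconds per configuration, from table 1.
RESULTS_MS = {
    "plain tensorflow": (23.4, 25.5),
    "tensor rt": (5.6, 15.4),
    "tensor rt + cpu_affinity": (3.8, 3.9),
}
FRAME_PERIOD_MS = 1000.0 / 100  # 10 ms per frame at 100fps


def speedup(name: str, baseline: str = "plain tensorflow") -> float:
    """Ratio of the baseline mean inference time to that of `name`."""
    return RESULTS_MS[baseline][0] / RESULTS_MS[name][0]


def fits_frame_budget(name: str) -> bool:
    """True if even the slowest measured inference fits into one frame period."""
    return RESULTS_MS[name][1] < FRAME_PERIOD_MS


for name in RESULTS_MS:
    print(f"{name}: speedup {speedup(name):.1f}x, "
          f"worst case within one frame: {fits_frame_budget(name)}")
```

Seen this way, the benefit of pinning the process to one core lies less in the mean value than in the maximum: only in that configuration does the slowest measured inference stay below one frame period.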

    The results of the accelerated configuration...
