An overview of the system is as follows:
On the left are the vision components. These capture images of the blocks from the webcam and convert them into a set of notes to be played. OpenCV is a well-known image-processing library; here we use it through the Emgu.CV wrapper for C#/.NET.
On the right are the audio components. These convert the extracted notes into sound. The NAudio library is used for MIDI sound synthesis.
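To make the hand-off between the two halves concrete, here is a minimal sketch of the audio side consuming a list of notes. It is an illustration under stated assumptions, not the project's actual code: `ExtractNotes` is a hypothetical stub standing in for the whole vision side, the notes are assumed to be MIDI note numbers, and the velocity, channel, and note length are arbitrary choices. Only the NAudio calls (`MidiOut`, `MidiMessage.StartNote`, `MidiMessage.StopNote`) come from the library itself.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using NAudio.Midi;

static class PipelineSketch
{
    // Hypothetical stand-in for the vision components: the real system
    // would derive these from webcam images of the blocks. Here we
    // simply return a fixed C major arpeggio as MIDI note numbers.
    static IReadOnlyList<int> ExtractNotes()
    {
        return new[] { 60, 64, 67 };
    }

    static void Main()
    {
        if (MidiOut.NumberOfDevices == 0)
        {
            Console.WriteLine("No MIDI output device found.");
            return;
        }

        // Assumption: device 0 is a usable synthesizer
        // (on Windows this is typically the built-in software synth).
        using (var midiOut = new MidiOut(0))
        {
            const int channel = 1;    // assumed MIDI channel
            const int velocity = 100; // assumed note-on velocity

            foreach (int note in ExtractNotes())
            {
                midiOut.Send(MidiMessage.StartNote(note, velocity, channel).RawData);
                Thread.Sleep(400); // assumed note length in milliseconds
                midiOut.Send(MidiMessage.StopNote(note, 0, channel).RawData);
            }
        }
    }
}
```

Keeping the interface between the two sides down to a plain list of notes means the vision code can be changed or tested without touching the audio code, and vice versa.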