I was recently asked by my brother to see if I could recover the MIDI files stored on late 1990s Yamaha Disklavier 720K 3.5 inch floppy disks. These disks employ a rudimentary DRM that makes them unreadable by contemporary operating systems, where they appear as corrupted or unformatted. DOS tools exist for dealing with these floppies, but they have not been updated in years and reportedly do not work under modern DOS environments such as DOSBox.
The goal of this project is to study the structure of the Disklavier disks under Linux using an external USB floppy drive, and ultimately recover their contents. Ideally, I'd like to develop a new open source tool for extracting the MIDI files from these disks so they can be preserved before the fragile disks are no longer readable.
In other words, it is my intent to copy that floppy.
While it's not perfect, I'm seeing a high enough success rate (~90% against the disks I have) that I've decided to release the Python script I've written on GitHub. It can determine the track type and title, list the individual tracks on the disk, and can even attempt to automatically extract them into separate files. If that fails, it can try and determine the start position and length of each file so they can be manually extracted with dd.
Things have taken a somewhat unexpected turn, as it turns out the reason the disks weren't working in the first place is that the drive in the Disklavier piano was shot. This evening I was able to swap out the floppy drive for a new one (expect to see more about this...), and it's now able to read all of the disks without issue. Suddenly, the need to back up these disks seems much less urgent.
Especially considering what else was discovered tonight. It turns out that for whatever reason, simply writing the images made with ddrescue back to known good floppies doesn't work. The Disklavier says they are unformatted disks. This means one of the core goals, that is, to preserve these commercial Disklavier floppies, is in jeopardy until we can figure out how to actually put them back onto physical media.
So, that being said, what is the status of the actual software? As of right now, the Python program I've written is able to identify and extract individual tracks from all but one of the disks we've currently made images of. The problem disk may be defective or a fluke, as it has some very unusual formatting issues that don't appear on any of the others.
I plan on doing some more testing soon, and may end up grabbing a few more disks off of eBay to collect more data, but an initial release of the tool on GitHub should be coming pretty soon.
As an extension of my previous work, the script can now determine the position of each track within the disk image as well as its length. These two values can be used in conjunction with dd to successfully pull individual files out, but ultimately I'm going to add that ability into the program so no manual work is needed. I've sent a handful of manually extracted tracks to my brother for verification, and am waiting to hear the verdict.
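For reference, here is a minimal Python sketch of the same offset-and-length extraction that dd performs (roughly what `dd if=disk.img of=track.mid bs=1 skip=START count=LENGTH` would do). The function name and file paths are hypothetical and only for illustration; this is not the code from the actual script.

```python
def extract_range(image_path, out_path, start, length):
    """Copy `length` bytes, beginning at byte offset `start`, into a new file.

    Hypothetical helper: `start` and `length` are the two values the
    script reports for a track.
    """
    with open(image_path, "rb") as src:
        src.seek(start)
        data = src.read(length)

    # A short read means the requested range runs past the end of the image.
    if len(data) != length:
        raise ValueError(
            f"wanted {length} bytes at offset {start}, only got {len(data)}"
        )

    with open(out_path, "wb") as dst:
        dst.write(data)
    return len(data)
```

Reading from the image and writing to a separate output file keeps the original image untouched, which matters when the source disks may not survive another read.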
To do this, I'm searching for the start and end byte sequences of the tracks. But as is par for the course with this project, I have at least one disk here that doesn't use the same end sequences as its peers. If there's one, there's probably more, so I'm trying to come up with some kind of alternate scheme to use in the event that the primary method doesn't work.
If the number of detected file start points doesn't match the number of end points, the program can kick into Plan B. I'm not 100% sure what that is yet, but in the absolute worst case I could simply assume that the end of each track is just 1 byte lower than the start of the next track. But of course that won't work for the last track on the disk, so it's still not a complete solution.
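The start/end matching and the worst-case fallback could be sketched roughly like this. The marker byte sequences are left as parameters because they differ between disks, and the function names are hypothetical. Note one loud assumption: for the last track, this sketch simply runs to the end of the image, which will likely include trailing padding and is only a stopgap for the gap described above.

```python
def find_all(data, marker):
    """Return every offset at which `marker` occurs in `data`."""
    offsets = []
    pos = data.find(marker)
    while pos != -1:
        offsets.append(pos)
        pos = data.find(marker, pos + 1)
    return offsets


def locate_tracks(data, start_marker, end_marker):
    """Return a list of (start, length) tuples, one per detected track."""
    starts = find_all(data, start_marker)
    # Use the offset just past each end marker so the marker is included.
    ends = [p + len(end_marker) for p in find_all(data, end_marker)]

    # Primary method: every start has exactly one end that follows it.
    if len(starts) == len(ends) and all(s < e for s, e in zip(starts, ends)):
        return [(s, e - s) for s, e in zip(starts, ends)]

    # Plan B: each track ends 1 byte before the next one starts.
    # ASSUMPTION: the last track runs to the end of the image.
    boundaries = starts[1:] + [len(data)]
    return [(s, b - s) for s, b in zip(starts, boundaries)]
```

Each returned pair can be fed straight to dd, or to any routine that copies a byte range out of the image.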
I now have multiple copies of both disk formats, and have started to map out and parse the different parts of each one. This basically boiled down to finding the byte sequences that signify the start of the Table of Contents, and then stepping through the file byte-by-byte to get the appropriate info. Not only are the Smart PianoSoft and PianoSoft Plus totally different from each other, but as it turns out, not even all the Smart PianoSoft disks appear to be the same.
For example, on PianoSoft Plus disks the album name is a 64 byte long ASCII string starting at 0x2ED0. Easy, no problem. But for Smart PianoSoft disks, I have to search for the first appearance of "P.PLAYER" in the file (it's different for each disk), seek forward by 30 bytes, and then finally read the 60 byte title.
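As a rough illustration of those two lookups, here is a minimal sketch that works on the raw bytes of an image. It makes a few assumptions that are not confirmed above: that the 30 byte seek is counted from the start of the "P.PLAYER" marker, that titles are padded with spaces or NUL bytes, and that the mere presence of "P.PLAYER" is enough to tell the two disk types apart. All names are hypothetical.

```python
PLUS_TITLE_OFFSET = 0x2ED0   # fixed location on PianoSoft Plus disks
PLUS_TITLE_LENGTH = 64
SMART_MARKER = b"P.PLAYER"   # location varies per Smart PianoSoft disk
SMART_TITLE_SKIP = 30        # ASSUMPTION: counted from the marker's first byte
SMART_TITLE_LENGTH = 60


def _decode(raw):
    # ASSUMPTION: titles are ASCII padded with spaces and/or NUL bytes.
    return raw.decode("ascii", errors="replace").strip("\x00 ")


def read_album_title(image):
    """Return (disk_type, title) for the raw bytes of a disk image."""
    marker_pos = image.find(SMART_MARKER)
    if marker_pos != -1:
        start = marker_pos + SMART_TITLE_SKIP
        raw = image[start:start + SMART_TITLE_LENGTH]
        return "Smart PianoSoft", _decode(raw)

    # ASSUMPTION: no marker means the image is a PianoSoft Plus disk.
    raw = image[PLUS_TITLE_OFFSET:PLUS_TITLE_OFFSET + PLUS_TITLE_LENGTH]
    return "PianoSoft Plus", _decode(raw)
```

Typical use would be `read_album_title(open("disk.img", "rb").read())`, with the file name standing in for whatever image ddrescue produced.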
With some fiddling, and perhaps more trial and error than I'd like to admit, I now have a Python script that can reliably determine the disk type as well as print a listing of the tracks on the disk:
I'm very happy with the progress made so far, but of course the end goal is getting the actual music off these disks. For that, I should only need to search for the start and end bytes of the different file types which may be on the disks, and then copy those out to external files. I could probably get cute and even give them appropriate track titles, but for my own sanity I'll just dump them to track numbers.
Taking a close look at the disk images in hexedit, it turns out that they aren't even all the same format. As pointed out in a previous comment by @Gravis, not all the disks actually contain MIDIs to begin with. It seems like the disks labeled "Smart PianoSoft" are MIDI, and the ones that say "PianoSoft Plus" use E-Seq. So any tool I create will need to deal with (at least) these two types of disks.
To illustrate the difference, here is what I am going to assume is the "Table of Contents" (for lack of a better term) of a Smart PianoSoft disk:
Here we can see the track names and some information related to the MIDIs themselves. I believe that the "MAX020" and "FILE020" fields at the top indicate the total number of tracks on the disk, but as I only have one of these types of disks right now I'm not sure. I'm considering just biting the bullet and looking for another Smart PianoSoft disk on eBay so I have another data point.
Compare that to the ToC of a PianoSoft Plus disk:
We still have a list of track titles, but it looks like everything else has changed. Also note that it's not even in the same location on the disk.
I'm going to start working on a Python script that can first reliably identify which type of image you've given it, and from there start parsing the ToC so it can tell you the title of the disk and what tracks are on it. From there, the next step will be locating the individual files in the image and splitting them out.
So I was reminded by a reader that I don't physically need a 720KB disk just to get the boot sector from one, since I can just format an image file as if it was the real thing. I still need the physical disks to eventually get the content back into the piano, but for now I can fiddle around with the images.
The short version is, using the boot sector from the 720KB floppy worked better, but still isn't right. The garbage file names went away, but most of the files aren't listed. Copying the files that do show up just results in an I/O error, but using the fantastic fatcat tool I'm actually able to extract the files despite the wonky FAT. The output of fatcat incidentally shows there's even more hidden files lurking around:
While pulling out one of the MIDI files intact and actually being able to play it was very encouraging, I've decided to change my approach. Rather than trying to convert the disk to a normal floppy, I'm going to start studying the format of the disks and make a tool that can pull the files/info out without having to patch the image. They're clearly in there, I just need to get them out.
So if you'll excuse me, I have a date with hexedit.
Under the assumption that a missing boot sector on the disk is what is keeping it from being recognized by the OS, it occurred to me that I should be able to use dd to simply copy the boot sector (first 512 bytes) from a valid floppy disk and insert it into the already created image. This would naturally trash the File Allocation Table (FAT), but I figured one step at a time.
I imaged a known good/blank floppy, and then used dd to merge the first 512 bytes into the previously created image.
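The same splice can be sketched in Python. This is only an illustration with hypothetical function and file names, not the exact dd invocation used here; on the dd side, something along the lines of `bs=512 count=1 conv=notrunc` achieves the equivalent in-place overwrite. The sketch writes to a new file instead, so the original image is left alone.

```python
BOOT_SECTOR_SIZE = 512


def transplant_boot_sector(donor_path, target_path, out_path):
    """Write a copy of the target image with the donor's boot sector.

    Hypothetical helper: the first 512 bytes come from the donor image,
    everything after that comes from the target image.
    """
    with open(donor_path, "rb") as f:
        boot = f.read(BOOT_SECTOR_SIZE)
    if len(boot) != BOOT_SECTOR_SIZE:
        raise ValueError("donor image is smaller than one boot sector")

    with open(target_path, "rb") as f:
        image = f.read()

    patched = boot + image[BOOT_SECTOR_SIZE:]
    with open(out_path, "wb") as f:
        f.write(patched)
    return len(patched)
```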
So that....kind of worked. Putting the boot sector of a blank disk into the first 512 bytes of the Disklavier image makes it mountable, but there's clearly something weird going on. Judging by the track names on the right, it seems like some data is overflowing somewhere. I tried some file recovery tools on the image, but they complained about a boot sector mismatch.
Thinking about it now, I believe the problem (or at least one of them) is that I am using a boot sector from a 1.44 MB floppy on a 720 KB image. I've ordered some blank 720 KB disks as part of this project, so when they arrive I'll repeat this same process and see if the results change.
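One way to check this kind of mismatch is to read the geometry fields that a standard FAT boot sector carries and compare them with the size of the image. A 1.44 MB floppy is 2880 sectors of 512 bytes (1,474,560 bytes), while a 720 KB floppy is 1440 sectors (737,280 bytes). The sketch below uses the standard FAT12 field offsets; the function names are hypothetical, and it assumes the donor boot sector is a normally formatted one.

```python
import struct


def bpb_summary(boot_sector):
    """Pull a few geometry fields out of a FAT12 boot sector."""
    (bytes_per_sector,) = struct.unpack_from("<H", boot_sector, 0x0B)
    (total_sectors,) = struct.unpack_from("<H", boot_sector, 0x13)
    (sectors_per_track,) = struct.unpack_from("<H", boot_sector, 0x18)
    return {
        "bytes_per_sector": bytes_per_sector,
        "total_sectors": total_sectors,
        "sectors_per_track": sectors_per_track,
    }


def boot_sector_matches(boot_sector, image_size):
    """True if the size described by the boot sector equals the image size."""
    info = bpb_summary(boot_sector)
    return info["bytes_per_sector"] * info["total_sectors"] == image_size
```

A boot sector taken from a 1.44 MB disk would report 2880 sectors and 18 sectors per track, which does not line up with a 737,280 byte image.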
When doing any sort of data recovery or digital forensics, the first step is always to make an image of the disk and work from that. You don't want to screw up the original media, and when you're dealing with something old and slow like a 3.5 inch floppy disk, the less time you have to spend using the real hardware the better.
As old floppies are notorious for bit rot, I'll be using ddrescue to make the images. It's basically dd with added fault tolerance: it keeps going past read errors and can retry the bad areas afterwards. So far, I've been able to pull images from a couple of the disks with relatively little issue.
Not that I expected anything different, but the disk images are not mountable and no software I've tried recognizes them as anything but a block of incomprehensible data.