Why reverse engineer file formats?

Besides the fact that binary file formats are so last century, people use version control more and more to keep track of design document changes, and version control works best with text. The rationale that text files are big is very weak, as you could just zip them up and have the best of both worlds.

The real reason for the industry's insistence on binary file formats lies in the attempt to lock customers in by not allowing easy migration between different tools. If Eagle did anything right, it was to switch to an XML file format. Suddenly you could write code to do stuff to your design and automate tasks that you previously had to rely on the vendor to implement - if they thought that the concept was valid...

Which tools to use?

It is always difficult to find the right way to start, so the intention with this project is not to give you the solution (even though the code to read Altium Designer Schematics is provided), but to provide you with the tools which you can use to reverse engineer any file format.

You may think that this is very hard to do, but it doesn't necessarily have to be like that.

First you need to select the tools. You need at least a programming language, a text editor and a hex editor.

When selecting these you need to be pragmatic - particularly when selecting the programming language. You have to consider whether you need native code, whether you need platform independence, how easy it is to read binary files (in some languages it is not all that straightforward) and whether you need third-party libraries to help you...

Luckily, it is much easier to select editors. Use whatever text editor you like... if it can handle binary files, great, but don't sweat about that. For binary files you really need a decent hex editor, and one of the things it needs to support is byte-by-byte comparison. My preference is 010 Editor by SweetScape (I'm not in any way related to them, except that I'm a very happy customer), and if there ever existed an industry standard for reverse engineering file formats - this would be it! It is a commercial product, but they provide a trial version and it is well worth the money (a personal license is $50). I don't want to get into the debate here about whether tools should be free or not, but if you want the job done, the link is above...

How do you go about it?

Open the file in your text or (better) hex editor and scroll around. Try to figure out if there are any strings that can give you any hints or reference points to start exploring.
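If you would rather let code do the scrolling, the same hunt for readable strings can be sketched in a few lines of Java. This is only an illustration: the class name `StringScanner`, the method `findStrings` and the minimum length of 4 are made up for this example. It also only looks for single-byte ASCII text, so strings stored as UTF-16 will not show up.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: lists runs of printable ASCII found in a binary blob.
class StringScanner {

    /** Returns every run of at least minLength printable ASCII bytes, prefixed with its hex offset. */
    static List<String> findStrings(byte[] data, int minLength) {
        List<String> found = new ArrayList<>();
        int start = -1; // start of the current printable run, -1 when not inside one
        for (int i = 0; i <= data.length; i++) {
            // Java bytes are signed, so values above 0x7f are negative and fail this test.
            boolean printable = i < data.length && data[i] >= 0x20 && data[i] < 0x7f;
            if (printable) {
                if (start < 0) {
                    start = i;
                }
            } else if (start >= 0) {
                if (i - start >= minLength) {
                    String text = new String(data, start, i - start, StandardCharsets.US_ASCII);
                    found.add(String.format("%08x: %s", start, text));
                }
                start = -1;
            }
        }
        return found;
    }

    public static void main(String[] args) throws Exception {
        // With a path argument, scan that file; otherwise scan a tiny built-in sample.
        byte[] data = args.length > 0
                ? Files.readAllBytes(Paths.get(args[0]))
                : new byte[] {0, 1, 'H', 'e', 'l', 'l', 'o', 0, (byte) 0xff, 'W', 'o', 'r', 'l', 'd'};
        findStrings(data, 4).forEach(System.out::println);
    }
}
```

The offsets it prints give you places to jump to in the hex editor, which is usually where the real exploring happens.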

You know, you don't have to start at the top!!!

What can you expect to see? Well, anything really. Let's have a look at what our test file looks like when you open it in 010:

Lots of junk, to say the least... Well, if you scroll forward you get something more meaningful (take note of the strings you can see in the screenshot):

There's an awful lot of text here...

Let's see what we can find out about that Root Entry string we found at the beginning of the file. Googling it gives back a bunch of links mentioning the Compound Document File format. You can find a fairly detailed description of this format, which represents a filesystem within a file, and you may be tempted to reverse engineer that. That would be perfectly fine, however my intention is to get to the point of reading and understanding the Altium Designer Schematic file format, and I don't have time to deal with a format that I can find libraries to read for me... This is a container format used by Microsoft Office documents (Word, Excel, PowerPoint, etc.) and the container itself doesn't carry anything interesting for understanding schematics.
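A quick way to confirm that a file really is such a container is to compare its first eight bytes with the Compound File signature, D0 CF 11 E0 A1 B1 1A E1. The sketch below does just that; the class and method names are invented for this example and it is not part of the project's code.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper: checks for the 8-byte Compound File signature.
class CompoundFileCheck {

    static final byte[] MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    /** True when the data starts with the Compound File signature. */
    static boolean looksLikeCompoundFile(byte[] data) {
        return data.length >= MAGIC.length
                && Arrays.equals(Arrays.copyOf(data, MAGIC.length), MAGIC);
    }

    public static void main(String[] args) throws Exception {
        // With a path argument, check that file; otherwise check the signature itself.
        byte[] data = args.length > 0 ? Files.readAllBytes(Paths.get(args[0])) : MAGIC;
        System.out.println(looksLikeCompoundFile(data));
    }
}
```

Reading the whole file just to look at eight bytes is wasteful, but for a one-off check on a schematic file it keeps the sketch short.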

At this point I have decided to use Java as the programming language, as I want portable code that I can write on my Mac and run (maybe) on Linux.

The first thing I will write is a small program to unpack the Compound Document container. I decided to use Apache POIFS, as it provides a very nice API for doing so. Once I run it on my test file, I see that the container holds two files, one called Storage and the other called FileHeader. Opening them in 010 tells me that Storage is not interesting and that I should focus on FileHeader.

How do you store data in a binary file?

This depends (slightly) on the language used, but for the most part you will do something like this:
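As a minimal sketch of one such scheme, each record can be written as a 4-byte length, followed by the text and a terminating null. This layout is an assumption based on the 010 template shown further down, not a statement about how Altium's own code works. The names `RecordWriter` and `encode` are made up, and the little-endian byte order is also an assumption.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical writer for a length-prefixed, null-terminated text record.
class RecordWriter {

    /** Encodes text as: uint32 size (text length + 1), the text bytes, then a null byte. */
    static byte[] encode(String text) {
        byte[] body = text.getBytes(StandardCharsets.US_ASCII);
        // Assumption: the size field is little-endian.
        ByteBuffer buf = ByteBuffer.allocate(4 + body.length + 1).order(ByteOrder.LITTLE_ENDIAN);
        buf.putInt(body.length + 1); // the size counts the text plus the terminating null
        buf.put(body);
        buf.put((byte) 0);
        return buf.array();
    }

    public static void main(String[] args) {
        // Print the encoded bytes of a two-character record in hex.
        for (byte b : encode("AB")) {
            System.out.printf("%02x ", b);
        }
        System.out.println();
    }
}
```

Writing the length first is what lets a reader skip from record to record without understanding what is inside each one.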

The Work...

Knowing this and looking into FileHeader, I figure out the following:

Furthermore:

Let's have a look at the 010 template to do this:

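typedef struct {
    uint32 size;          // number of bytes that follow: the text plus its terminating null
    char text[size-1];    // the record text itself
    char null;            // terminating null byte
} Record;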

// Keep mapping records until the end of the file is reached
while( !FEof() ) {
    Record rec;
}

This template outputs a straight conversion of the (pseudo) binary into plain text format. Now I can write the code that can convert this format into JSON, and I get this as an output:

Not bad, eh ;)
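For reference, the record layout from the 010 template can be read in Java along these lines. This is a sketch under stated assumptions, not the project's actual reader: the names `RecordReader` and `readAll` are invented, the size field is assumed to be little-endian, and the text is decoded as ISO-8859-1 simply so that every byte maps to a character.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical reader for the uint32 size / text / null layout from the template.
class RecordReader {

    /** Splits the data into records and returns the text of each one. */
    static List<String> readAll(byte[] data) {
        // Assumption: the size field is little-endian.
        ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        List<String> records = new ArrayList<>();
        while (buf.remaining() >= 4) {
            int size = buf.getInt();
            if (size < 1 || size > buf.remaining()) {
                throw new IllegalArgumentException(
                        "Bad record size " + size + " at offset " + (buf.position() - 4));
            }
            byte[] text = new byte[size - 1];
            buf.get(text); // the record text
            buf.get();     // skip the terminating null
            records.add(new String(text, StandardCharsets.ISO_8859_1));
        }
        return records;
    }

    public static void main(String[] args) {
        // Two tiny hand-built records: "AB" and "C".
        byte[] sample = {3, 0, 0, 0, 'A', 'B', 0, 2, 0, 0, 0, 'C', 0};
        readAll(sample).forEach(System.out::println);
    }
}
```

The size check is there because a wrong guess about the layout usually shows up as an absurd length, and failing loudly at a known offset tells you where to look in the hex editor.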