Question For Experts - Disassembly and Data-Bytes?

A project log for Improbable AVR -> 8088 substitution for PC/XT

Probability this can work: 98%, working well: 50% A LOT of work, and utterly ridiculous.

Eric Hertz, 02/15/2017 at 07:23 (4 Comments)

How does a disassembler handle data bytes intermixed with machine instructions? Anyone?

This is a recurring question in my endeavors-that-aren't-endeavors to implement the 8088/86 instruction set.

(Again, I'm *not* planning to implement an emulator! But it may go that route; it's certainly been running around the back of the ol' noggin since the start of this project.)

I can understand how data-bytes intermixed in machine-instructions could be handled on architectures where instructions are always a fixed number of bytes... just disassemble those data-bytes as though they're instructions...

(they just won't be *executed*... I've seen this in MIPS disassembly, though I don't know enough to say whether MIPS has fixed-length instructions).

I can also understand how they're not *executed*... just jump around them.

But on an architecture like the x86, where instructions may vary in length anywhere from 1 to 6 instruction-bytes... I don't get how a disassembler (vs. an executer) could possibly recognize the difference between a data-byte stored in "program-memory" vs., e.g., the first instruction-byte of a multi-byte instruction. And, once it has misread one, how could the disassembler possibly be properly aligned for later *actual* instructions?
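To make the worry concrete, here is a minimal sketch of a naive "linear sweep" that decodes bytes one after another. The opcode values are real 8086 encodings, but the table is a tiny illustrative subset, and the names `LENGTHS` and `linear_sweep` are made up for this example.

```python
# Tiny subset of real 8086 opcodes: opcode -> (mnemonic, total length in bytes).
# Anything not listed is shown as a raw "db" byte in this sketch.
LENGTHS = {
    0x90: ("nop", 1),
    0xB8: ("mov ax, imm16", 3),
    0xEB: ("jmp rel8", 2),
    0xC3: ("ret", 1),
}

def linear_sweep(image, start=0):
    """Decode bytes in order from `start`, never following any jumps."""
    out = []
    pc = start
    while pc < len(image):
        name, length = LENGTHS.get(image[pc], ("db", 1))
        out.append((pc, name))
        pc += length
    return out

# offset 0: jmp +1   (skips the data byte, lands on offset 3)
# offset 2: 0xB8     (a data byte that happens to look like "mov ax, imm16")
# offset 3: nop
# offset 4: ret
image = bytes([0xEB, 0x01, 0xB8, 0x90, 0xC3])

from_start = linear_sweep(image)      # the data byte swallows the nop and ret
from_target = linear_sweep(image, 3)  # starting at the jump target decodes them
```

Sweeping from offset 0 reports a `mov` at offset 2 and never shows the real `nop` and `ret`; sweeping from offset 3 shows them. Same bytes, different answer, depending only on where decoding starts.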

Anyone? (I dunno how to search-fu this one!)

Discussions

Kilian Hekhuis wrote 05/22/2017 at 15:13 point

I do have actual experience with 8086 assembly, and with disassembler tools. For one, each executable has a specific entry point, so the disassembler knows where to start. For ROM this is a bit of a problem, but generally disassemblers allow specifying entry points, and the various PC ROMs have defined entry points (e.g. FFFF:0000). 

There are specific 8086 instructions for accessing data, especially in a sequential fashion, so if the disassembler encounters them, it may guess that's where data is. Also, it keeps track of a list of all jumps (a.k.a. branches on other architectures), so it knows at the receiving end there's code.

As for being in the dark if there's no entry point known (as Ted writes), the disassemblers I used back then could infer a lot from analyzing the bytes. E.g. sequences of bytes between 32 and 127 are bound to be text strings, sequences of 90h are NOPs (typically emitted by compilers to word-align code) etc.
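A rough sketch of those two byte-pattern heuristics follows. The function names and the minimum run lengths are assumptions picked for illustration, not what any particular disassembler does.

```python
def find_runs(image, predicate, min_len):
    """Return (offset, length) for each run of bytes satisfying `predicate`."""
    runs = []
    start = None
    for i, b in enumerate(image):
        if predicate(b):
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                runs.append((start, i - start))
            start = None
    # A run may extend to the very end of the image.
    if start is not None and len(image) - start >= min_len:
        runs.append((start, len(image) - start))
    return runs

def text_runs(image, min_len=4):
    # Printable ASCII, 32..126; min_len=4 is an arbitrary threshold.
    return find_runs(image, lambda b: 32 <= b < 127, min_len)

def nop_runs(image, min_len=2):
    # 90h is the one-byte NOP on the 8086.
    return find_runs(image, lambda b: b == 0x90, min_len)

# Four bytes of code, a string, a zero byte, NOP padding, then a ret.
image = b"\xB8\x34\x12\xC3" + b"Hello, XT" + b"\x00\x90\x90\x90\xC3"

strings = text_runs(image)
padding = nop_runs(image)
```

These are only hints: a short run of printable bytes can just as easily be real instructions, which is why a minimum length is used and why such guesses are usually combined with jump tracking.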


Eric Hertz wrote 01/22/2020 at 06:32 point

Hey, apologies for missing this years ago. Those are some great points: keeping track of jumps, word-alignment, etc. Sounds like it's a much more sophisticated process than I guessed, and I'm guessing quite a bit more so than the assembly process.

Thankya for taking the time to write that up all that time ago!


Ted Yapo wrote 02/15/2017 at 18:43 point

Speaking from a standpoint of zero actual experience on 8086 disassembler internals (but having written assembly static-analysis tools on other architectures), I'll ask you a question in response: how does the CPU handle them? :-)

If you start executing (or disassembling while tracing execution) from a given entry point, you can keep track of what's what, and recursively trace each side of all the branches until you've covered all the reachable code.  You have to mark coverage as you go to prevent infinite loops, but you can analyze all code reachable from a single starting point this way.
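The tracing idea in the paragraph above can be sketched roughly as follows. The opcode values are real 8086 encodings, but the table is a deliberately tiny subset, and `OPCODES`, `trace`, and `data_offsets` are hypothetical names for this example only.

```python
# opcode -> (mnemonic, total length, control flow after the instruction)
OPCODES = {
    0x90: ("nop", 1, "next"),
    0xB8: ("mov ax, imm16", 3, "next"),
    0x74: ("jz rel8", 2, "branch"),  # conditional: target AND fall-through are code
    0xEB: ("jmp rel8", 2, "jump"),   # unconditional: only the target is code
    0xC3: ("ret", 1, "stop"),
}

def rel8(byte):
    """Interpret a byte as a signed 8-bit displacement."""
    return byte - 256 if byte >= 128 else byte

def trace(image, entry_points):
    """Return {offset: (mnemonic, length)} for code reachable from entry_points."""
    found = {}  # doubles as the coverage map that prevents infinite loops
    pending = list(entry_points)
    while pending:
        pc = pending.pop()
        while 0 <= pc < len(image) and pc not in found:
            entry = OPCODES.get(image[pc])
            if entry is None or pc + entry[1] > len(image):
                break  # unknown or truncated in this sketch: give up on this path
            name, length, flow = entry
            found[pc] = (name, length)
            nxt = pc + length
            if flow in ("branch", "jump"):
                pending.append(nxt + rel8(image[pc + 1]))
            if flow in ("jump", "stop"):
                break
            pc = nxt
    return found

def data_offsets(image, found):
    """Offsets never covered by a traced instruction: data, or unreached code."""
    covered = set()
    for pc, (_, length) in found.items():
        covered.update(range(pc, pc + length))
    return [i for i in range(len(image)) if i not in covered]

# 0: jmp +3 | 2..4: data (first byte looks like a mov opcode) | 5: mov ax, 1234h
# 8: jz +1  | 10: nop | 11: ret
image = bytes([0xEB, 0x03, 0xB8, 0x69, 0x21, 0xB8, 0x34, 0x12, 0x74, 0x01, 0x90, 0xC3])

found = trace(image, [0])
data = data_offsets(image, found)
```

Because decoding only ever starts at the entry point or at a jump target, the three bytes the `jmp` skips are never decoded, even though the first of them looks like an opcode. The sketch only follows targets encoded directly in the instruction; a jump through a register or memory location is not handled.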

But, if you don't have a set of starting vectors, you're in the dark - a "random" collection of bytes might be interpreted differently by the CPU (and disassembler) depending on where you jump into it.  Some entry points will yield a valid disassembly, while others may yield nonsense.

Oh - but on this architecture, you might have data-driven branches you can't determine a priori.  Then, this doesn't work.  Interesting.


Eric Hertz wrote 02/15/2017 at 19:58 point

Right, function-pointers, etc...

It sounds like we're on the same wavelength; disassembly requires knowledge of the thing being disassembled... That could be user-supplied, or supplied through breaking at a known executable instruction, or whatnot. But it could break down depending on how much (and what) information is supplied to the disassembler.

So I guess the answer is that one *can't* e.g. dump a ROM/BIOS image to a file, then disassemble it in its entirety. Simulation/execution is necessary. (Plausibly some disassemblers do-so in the background). And, realistically, that makes sense... it's just as plausible that the firmware might contain a compressed image.

I guess I thought maybe there was some de-facto method that data-bytes were indicated, explicitly... a pseudo-instruction or something. But that would be highly architecture-specific and, again, doesn't solve the problem of e.g. compressed-images, etc.

-------

And I think that brings an end to this particular endeavor. Emulation isn't my goal. OTOH, it could be a tool toward my goal... I'm sure I'll be contemplating that, next.
