Thoughts on 'Automated Synthesis of Symbolic Instruction Encodings from I/O Samples'

Some interesting research by Patrice Godefroid: http://research.microsoft.com/apps/pubs/default.aspx?id=156020

The gist: I want to build a model of what each CPU instruction does in an unguided (completely automated) manner. We'll do it by having the CPU execute instructions and observing their side effects. This lets us build a "very good" model of what each instruction does.
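To make the "execute and observe" part concrete, here's a minimal sketch (mine, not the paper's) that runs a single 8-bit ADD on the real CPU and reads back the result plus the carry and overflow flags, using GCC/Clang inline asm on x86-64:

    #include <stdint.h>
    #include <stdio.h>

    /* Execute one 8-bit ADD on the actual CPU and observe its side effects:
     * the result byte plus the carry (CF) and overflow (OF) flags. */
    static void observe_add8(uint8_t a, uint8_t b)
    {
        uint8_t result = a, cf, of;
        __asm__ volatile (
            "addb %3, %0\n\t"   /* result += b, updating EFLAGS */
            "setc %1\n\t"       /* capture CF */
            "seto %2\n\t"       /* capture OF */
            : "+q"(result), "=q"(cf), "=q"(of)
            : "q"(b)
            : "cc");
        printf("ADD8(%3u, %3u) = %3u  CF=%u OF=%u\n", a, b, result, cf, of);
    }

    int main(void)
    {
        observe_add8(1, 2);       /* no carry, no overflow */
        observe_add8(200, 100);   /* unsigned carry out of 8 bits */
        observe_add8(100, 100);   /* 100+100 doesn't fit in a signed byte: OF=1 */
        return 0;
    }

Feed it enough (a, b) pairs and you have a pile of I/O samples to fit a model against.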

The problem is: how do I pick input samples that let me see everything? I.e., what if ADD is fine for integers from 0 to 1000000, but at 1000001 there's a problem? "The only way to find out is to exhaust," you say; that is, exhaustively try every possible input value (and every combination of input values).
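At 8 bits you really can exhaust: all 2^16 input pairs of a binary operation take no time at all. Here's a throwaway sketch (again mine, not the paper's) that brute-forces every 8-bit ADD and checks a naive hand-written model of the carry flag against the hardware; the same loop for a 64-bit operation would be 2^128 pairs, which is the whole problem:

    #include <stdint.h>
    #include <stdio.h>

    /* Brute-force all 2^16 input pairs of 8-bit ADD and compare a naive
     * "spec" model of the carry flag against what the CPU actually reports.
     * Feasible at 8 bits; hopeless at 32 or 64 bits. */
    int main(void)
    {
        unsigned mismatches = 0;
        for (unsigned a = 0; a < 256; a++) {
            for (unsigned b = 0; b < 256; b++) {
                uint8_t r = (uint8_t)a, cf;
                __asm__ volatile (
                    "addb %2, %0\n\t"   /* r += b, updating EFLAGS */
                    "setc %1\n\t"       /* capture CF */
                    : "+q"(r), "=q"(cf)
                    : "q"((uint8_t)b)
                    : "cc");
                uint8_t model_cf = (a + b) > 255;   /* hand-written model */
                if (cf != model_cf)
                    mismatches++;
            }
        }
        printf("model vs. hardware mismatches: %u / %u pairs\n",
               mismatches, 256u * 256u);
        return 0;
    }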

That sucks. They have an algorithm that lets them do it "smarter". I'm still deciphering it; it's in the paper.

This is interesting to me because you can consider each instruction as a "function", and when you say "oh, I have a way to reason about these functions' reactions to certain inputs", ears should perk up.
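To make the function view concrete, here's a toy (emphatically not the paper's algorithm; the paper fits parameterized symbolic templates, this is just the brute-force flavor of the same "fit a function to I/O samples" idea): treat an unknown instruction as a black-box function and see which candidate semantics are consistent with a handful of I/O samples.

    #include <stdint.h>
    #include <stdio.h>

    /* Toy illustration: identify an unknown 8-bit "instruction" from a few
     * I/O samples by testing candidate semantics against a black box. */

    typedef uint8_t (*insn_fn)(uint8_t, uint8_t);

    /* the black box we pretend not to know; stands in for "run it on the CPU" */
    static uint8_t mystery(uint8_t a, uint8_t b) { return a & ~b; }

    /* candidate semantics */
    static uint8_t f_and (uint8_t a, uint8_t b) { return a & b; }
    static uint8_t f_andn(uint8_t a, uint8_t b) { return a & ~b; }
    static uint8_t f_xor (uint8_t a, uint8_t b) { return a ^ b; }
    static uint8_t f_add (uint8_t a, uint8_t b) { return (uint8_t)(a + b); }

    int main(void)
    {
        insn_fn candidates[] = { f_and, f_andn, f_xor, f_add };
        const char *names[]  = { "a & b", "a & ~b", "a ^ b", "a + b" };
        /* a few I/O samples observed from the black box */
        uint8_t samples[][2] = { {0x0F, 0x33}, {0xFF, 0x01}, {0xA5, 0x5A}, {0x00, 0x7E} };

        for (unsigned c = 0; c < 4; c++) {
            int consistent = 1;
            for (unsigned s = 0; s < 4; s++) {
                uint8_t a = samples[s][0], b = samples[s][1];
                if (candidates[c](a, b) != mystery(a, b)) { consistent = 0; break; }
            }
            if (consistent)
                printf("candidate \"%s\" matches all samples\n", names[c]);
        }
        return 0;
    }

The interesting question, which is what the paper is about, is how to pick the samples and the candidate space so this actually scales to real instructions.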

Anyway, some results:

"We also discovered cases where observed behaviors contradict the x86 reference manual (which is unsurprising given the size and complexity of the spec). For instance, we discovered by accident while debugging our template T -ARImain that the overflow OF flag should be set to 0 after executing IMUL[8] with 65 and 254 as inputs according to the Intel spec, while the OF flag is actually set to 1 after the execution of this instruction with those inputs on an Intel XEON3.7 processor.

Moreover, we discovered, again by accident, that the semantics of instructions varies across Intel processors. For instance, on an Intel XEON3.7 or Core2 or i7 M620 processors and in accordance with the x86 spec, executing instructions ROL, SHL or SHR does not set the overflow OF flag if the count argument is not 1. However, on an Intel i7-2620M (HP EliteBook 2760p, 2.7Ghz, 8Gb RAM, 64-bit) processor, the OF flag is set to 1 even for certain cases when the count argument is greater than 1. Our template T-BSflag is actually unable to capture this behavior, which is why we detected these corner cases.

Finally, and unsurprisingly, we also discovered several errors in previous manually-written x86 instruction handlers used in the whitebox fuzzer SAGE [5]"
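The IMUL case is easy to sanity-check by hand, assuming they mean the one-operand 8-bit form (AX = AL * r/m8, signed): 254 as a signed byte is -2, so 65 * -2 = -130, which doesn't fit in a signed byte, and the hardware reporting OF=1 is consistent with that. A quick harness (mine, not theirs) to reproduce the observation:

    #include <stdint.h>
    #include <stdio.h>

    /* One-operand 8-bit IMUL: AX <- AL * r/m8 (signed). With AL=65 and the
     * other operand 254 (-2 as a signed byte), the product -130 does not fit
     * in a signed byte, so the CPU reports OF=1. */
    int main(void)
    {
        uint16_t product = 65;   /* AL holds the first operand */
        uint8_t b = 254, of;
        __asm__ volatile (
            "imulb %2\n\t"   /* AX = AL * b (signed 8-bit multiply) */
            "seto %1\n\t"    /* capture the overflow flag */
            : "+a"(product), "=q"(of)
            : "q"(b)
            : "cc");
        printf("IMUL8(65, 254): AX=0x%04x (%d signed) OF=%u\n",
               product, (int16_t)product, of);
        return 0;
    }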

We discovered something similar with the shift/rotate instructions while implementing a project of our own (I forget the exact details, but I think it had to do with the fact that rotating by more than the register width would "still work", even though that case wasn't covered in the Intel documentation). If you can put infrastructure like this together, you can discover interesting things about CPUs.
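For what it's worth, on current Intel/AMD parts the shift/rotate count gets masked (to 5 bits for 32-bit operands, 6 bits for 64-bit ones), which is probably the "still works" behavior we ran into. A little probe (a fresh sketch, not the project mentioned above) to see it:

    #include <stdint.h>
    #include <stdio.h>

    /* Rotate a 32-bit value by counts at or above the register width and see
     * what the CPU actually does: the count is masked to 5 bits, so a rotate
     * by 33 behaves like a rotate by 1, and a rotate by 32 like a rotate by 0. */
    static uint32_t rol32(uint32_t value, uint8_t count)
    {
        __asm__ volatile (
            "roll %%cl, %0\n\t"
            : "+r"(value)
            : "c"(count)
            : "cc");
        return value;
    }

    int main(void)
    {
        uint32_t x = 0x80000001u;
        printf("rol(0x%08x, 1)  = 0x%08x\n", x, rol32(x, 1));
        printf("rol(0x%08x, 33) = 0x%08x\n", x, rol32(x, 33));  /* same as count 1 */
        printf("rol(0x%08x, 32) = 0x%08x\n", x, rol32(x, 32));  /* masked to 0 */
        return 0;
    }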

And of course, if you see some malware doing something "weird seeming" with instructions, you could perhaps infer that it was trying to do something like fingerprint what CPU it was on, or flummox static analysis tools doing instruction-level emulation. From that, you could infer that whoever wrote the malware might have had enough time on their hands to come to grips with work like this paper.

The downside to being able to make that inference is that, I'm sure, if you know just that much, you'll be super quick to blame anything "weird looking" on "oh, I bet they're doing some super awesome per-CPU heuristic detection thingy" when really it's just some behavior you've never seen before. There has to be a word for that thought trap...

And of course, there is prior academic work on discovering discrepancies between the docs and the reality; these people did it too: http://www.cl.cam.ac.uk/~pes20/weakmemory/index3.html. There are some similarities between the Cambridge approach and the MS Research approach, but the Cambridge model of the CPU is built by hand (as I recall), while the thing the MS Research work brings to the table is the automated building of the model.