If we know the signature as well as the optimal register allocation of all functions, we can translate all code. But there are still several open questions:

how does the caller pass parameters?
how does the callee accept parameters?
how does the callee work with local variables?
how are the LR save and restore sequences translated?
how are creates/destroy stack frame sequences translated?
who cleans up the parameter stack?

Passing Parameters

The PPC code moves parameters to registers r3+. If it's too many parameters, they get written into the stack. On an i386, parameters are pushed onto the stack. In pass2, a call can simply be translated with some push instructions and an i386-native "call" instruction. Although it look like it at first sight, this is not as optimal as compiled i386 code, as the PPC code might move parameters to registers r3+ before - this will be translated 1:1, which is not necessary.

Accepting Parameters

Most of the time, the PPC code has all the parameters in registers already. If it's too many, some of them are on the stack. On an i386, all parameters are fetched from the stack. At the beginning of every function, all PPC-register-parameters have to be loaded into the i386 equivalents of the r3+ registers. Parameters that are on the stack on the PPC can leave on the stack; all offsets might have to be adjusted.

Local Variables

On the PPC, the most important local variables are mapped to registers. In case there is too many of them, or they need to be memory mapped (there are pointers on them) are are located on the stack. This is the same as on the i386 - PPC registers that do not fit into the i386 registers will live on the stack, and PPC stack registers as well.

All PPC register that do not fit into i386 registers will either have to be accessed using fixed stack locations (i386 way), or by accessing a an array that has 32 entries, one for every PPC register that has been originally used to hold that value (as pass 1 does it).

PPC stack locations of local variables will probably have to adjusted for the i386, i.e. all offsets of r1-relative accesses will have to be changed.

LR Sequences

If a function does not call any other functions, and does not use more local variables than fit into the registers, the C compiler typically does not emit code that creates a stack frame. In all other cases, the link register LR is saved at the beginning and restored at the using using these sequences:

save:

mfspr   r2,lr
stw     r2,0x8(r1)

restore:

lwz     r2,0x58(r1)
mtspr   lr,r2

On an i386, these sequences are not necessary, as call/ret take care of all this already.

The mfspr and mtspr instructions that have the link register as one operand can simply be ignored, they will not be translated. The stw and lwz instructions in these sequences should also be ignored. The "stw r2,8(r1)" instruction can easily be detected, as "8" and "r1" are constants, and "r2" must be the same as the first operand of the mfspr, which must have been before. But the GPR is unimportant, the instruction can be detected more easily: It is always located before the adjustment of the stack pointer (which looks like"stwu r1,-n(r1)"), which can be easily detected. Any access to 8(r1) before this adjustment means writing to the saved LR location.

The "lwz r2,0x58(r1)" can also be easily detected: Every function must be associated with the size of its stack frame, in this case 0x50. Every lwz that accesses n+8(r1), n being the size of the stack frame, reads from the saved LR location. No further checks are necessary, because an offset > 8 will always be between the creation and the deletion of the stack frame.

So, what's the algorithm?

Pass 1:

When a "stwu r1,-n(r1)" instruction is found, associate the function with the stack frame size n and the address of the stwu instruction

Pass 2:

When "mfspr rn,lr" or "mtspr lr,rn" is found, emit no i386 code
When "stw rn,8(r1)" is found and the address if the instruction > the address of the stack frame creating "stwu" found in pass 1, emit no i386 code
When "lwz rn,n+8(r1)" is found, n being the size of the stack frame, emit no i386 code

Create/Destroy Stack Frame

Stack frames are created and destroyed on the PowerPC using the following sequences:

stwu r1,-n(r1)

and

addi r1,r1,n

In pass 1, these are translated into:

mov    %esp,-n(%esp)
sub    $n,%esp

and

add    $n,%esp

On an i386, stack frame creation and deletion look like this, according to the calling conventions:

push %ebp
mov %esp, %ebp
sub $n, %esp (*)

and

mov %ebp, %esp (*)
pop %ebp

In case function has no local variables, the marked (*) instructions are omitted.

The i386 doesn't typically do esp-indexed accesses to parameters and local variables, but ebp-indexed accesses. For historic purposes, the original stack pointer is copied into the ebp (bp = base pointer) register at the beginning, after saving the original contents of ebp onto the stack.

We can decide whether to use ebp as well, and emit i386-typical sequences, or to use esp-indexed accesses. This way, the generated stack frame code would look like this:

sub $n, %esp

and

add $n, %esp

Both cases are equally easy to implement: The PPC stack frame creation/deletion instructions can be easily detected, and can be translated into either sequence. The index values of stack frame accesses might have to be changed anyway, so changing %esp to %ebp in the i386 code is also no big deal.

Parameter Stack Cleanup

In i386 C calling convention, the caller has to remove the parameters it has put on the stack. So when translating a "bl" into a sequence of pushes followed by a call, a "sub $n, %esp" but be emitted, n being the number of bytes the parameters occupied (=the number of pushes * 4).

The "blr" instruction(s) at the end(s) of every function can be translated into simple i386 "ret" instructions.

29 July 2004

When to recompile

The algorithm of a standard interpreter/recompiler combination is easy: Start interpreting. If a basic block that is about to be executed had been executed n times, recompile it. This way, only code that is executed is ever touched, and only important parts are recompiled.

If we want to recompile complete functions, this algorithm is not enough. Basic blocks within a function that, for instance, only deal with error handling may not have been analyzed, because they have never been touched.

Standard Pass 1 Algorithm

My current implementation recompiles basic block by basic block on demand, that is, only if it is absolutely necessary. This is the case for

blocks that have been interpreted more often than <threshold> times and are about to be executed again
blocks that follow a block that ends with a call

The system is designed to interpret code by default and to recompile a block after it has been interpreted a certain number of times. This behavior can of course be changed to recompile a block as soon as it is supposed to be executed for the first time. This way, the interpreter will only be used for instructions that cannot be recompiled (e.g. branches, if the destination is not yet recompiled).

When we start recompiling, we might have to recompile more blocks than one, even if this is the only block that has already reached the threshold: If a basic block ends with a "call", the first basic block of the subfunction will always either already be recompiled, or be recompiled in the next step, as it will also reach the threshold. Because the subfunction, which we can assume as being recompiled already, can return to the caller with a "blr", the return address stored in the link register by the caller must be a valid i386 code address. Therefore, the block after a call must always be recompiled.

As an optimization, blocks that follow a block that ends with a loop (branch up) should also be recompiled, so that the branch instruction at the end of the first block can be translated into an i386 instruction - as long as one of the targets of the branch isn't recompiled, the branch cannot be translated, and the monitor program must handle the branch.

Pass 2 Algorithm

Register allocation within a function can only be done if the whole function has been analyzed. The pass 1 algorithm will not make sure that functions are completely analyzed; in general, they are not. Another algorithm is necessary. The easiest solution is to do a DFS. This does not make sure that all functions are touched (function pointers), but it does make sure that all functions that are touched are completely analyzed.

Conclusion

Both algorithms are needed. If the user recompiler is configured to do only pass 1 and never pass 2, it makes no sense to analyze and recompile all code; it is better to only recompile the code that will ever be executed. If the recompiler is configured to do passes 1 and 2, it makes sense to do the complete analysis and fast recompilation during pass 1: Since the complete analysis has to done anyway for pass 2, is is faster to do the complete analysis than to decide what basic blocks are most important, recompile them, and recompile the rest later.

28 July 2004

Function Detection & Register Allocation - Algorithms

If I want to do dynamic register allocation, I need to have a lot of information - information that a compiler typically has anyway, but that a recompiler has to collect first. This includes what code belongs to what function, what are the parameters and the return values of a function, and when are registers read or written. This information is collected in these steps:

1. associate every instruction with

the ID of the function it belongs to
the set of registers that are read
the set of registers that are written to
and the set of possible successor instructions

2. do a DFS on the caller tree, on the way up, do the following on every function:

do liveness analysis on this function
reconstruct the signature
find a register allocation for this function
re-recompile the function with the new register allocation

Interpretation or the first pass of recompilation already follow the code flow. This way, every instruction is reached and the set of registers that are read, the set of registers that are written and the set of possible successors can be easily logged, as the instruction needs to be decoded or interpretation or pass 1 recompilation anyway. Also, every time a "call" instruction is found, the function ID can be associated with the called function, and the called function and all its instructions will be associated with a new ID: A DFS can easily find out at what level in the tree an instruction is. If it is already marked, ignore it. When all statistics are collected, liveness analysis and register allocation is supposed to be done. If a function calls another function, this can only be done if the signature of the called function is known, or else there would be know way to find out what registers are used and thus alive at the point of the function call. Therefore liveness analysis, register allocation and signature reconstruction has to be done on the called function first. If we do a DFS and look at every function on the way back up, there are never any dependencies on functions with an unknown signature.

So we can optimize every function we meet when traversing the tree: Liveness analysis has to be done, i.e. every instruction has to be associated with the set of registers that are alive at the time of this instruction. The registers that are alive before the first instruction are the input parameters of the function. Return parameters are more difficult to find, so we could just assume all code adheres to the calling convention and r3 is the only register that holds a return value.

18 July 2004

Graham Toal's "Static Binary Translation HOWTO"

As mentioned earlier, Graham Toal, the maintainer of a mailing list on static recompilation, published a howto on static binary translation on http://www.gtoal.com/sbt/. Its motivation is running old video games on relatively slow machines. In many small steps, it leads from a traditional interpreter to an optimizing static binary translator, creating C output that is supposed to compiled by a C compiler afterwards. Only simple optimizations within basic blocks are described, additional optimizations are supposed to be done by the C compiler.

Algorithm

The first step is to replace dispatching by opcode with dispatching by the address of the opcode: A program must be written that takes the original object code as input and outputs C code with an infinite loop containing a single big switch statement that has a case for every code address. Each case implements the instruction that is found at that address. Subsequent steps optimize the resulting code: one case runs into the next; the switch statement (dispatcher) is replaced by using gotos, updating of the program counter is removed, unnecessary labels are removed (-> basic blocks), until the resulting code looks more and more like the C transcription of the original code.

Omitting Redundant Stores

Since the output is run through a C compiler, many optimizations, like removing write operations of values that never get read afterwards, need not be done by the recompiler. Still the article describes how to do liveness analysis on basic blocks and thus remove unnecessary writes: Mark every instruction with the set of registers it defines and uses, then traverse the basic block backwards and only omit the last write to each register.

Jump Targets

One major problem of static binary translation is finding all jump targets: Jumps might be calculated or go over tables. The described solution suggests to play the video game until the end and let the interpreter log all jump targets. If a new jump target is found when running the recompiled code, it is supposed to be added to the log, and the recompilation must be done again. A good system might be able to switch to the interpreter in order not to interrupt game play.

Constant Expression Folding

The author describes another optimization, constant expression folding, which can, for example, combine several partial immediate loads on a RISC machine to a single load of an immediate. Three functions are introduced: "remember(register, value)" marks a register as holding a constant, "forget(register)" removes this mark, and "known(register, out var)" returns whether a register is holding a constant, as well as the constant itself. Every time an instruction is about to be translated that uses an immediate to modify a register, it is tested whether this register already holds a constant. If it does, a "load immediate" of the resulting value is emitted, and the register is marked to hold a constant. A "load immediate" will also mark the register this way. All other instructions and cases remove the mark of all changed registers. An immediate "load" followed by multiple immediate "or"s would then be translated into multiple load instructions, which the C compiler can optimize by removing all but the last instruction.

Verifying a CPU Core

The article further states that a new CPU core can be easily verified by adding a known-good core and always making both cores run in parallel. Most bugs can easily be found by comparing registers after every step. Comparing memory is not necessary in most cases.

Conclusions

The general approach described in the article is a lot different to mine. Still it makes sense to have a look at the ideas: The algorithm described to omit redundant stores is a simplified version of liveness analysis applied so single basic blocks; the full version is what I intend to do.

Finding out (almost) all jump targets can be easily done with a game - just play it and find all secrets. Finding all of them in a computer application would mean that all functions have to be used once. But the overall principle is: Accumulate jump targets and improve the target code. If a new target is found, the monitor needs to take care of this case, do recompilation, log statistics, then the code can be improved.

I doubt whether the described algorithm for constant expression folding is a good idea for my project. It is a quite simple algorithm, but it makes the recompiler code more complex and less readable. In addition, the algorithm would have to be extended to omit all but the last write. I wonder how much performance can be gained.

I agree that the method described for verifying a CPU core is a very good idea. Too bad it only applies to interpreters directly. For recompilers, it's not as easy: The whole basic block had to be interpreted in parallel to native execution. Bart Trzynadlowski from the Yahoo group "dynarec" also suggested using Kheperix for CPU verification.

"Binary Translation: Static, Dynamic, Retargetable?"

In "cifuentes96binary.ps", Cristina Cifuentes and Vishv Malhotra, the authors of UQBT, give an overview of existing (1996) binary translation techniques. There are some interesting facts in this paper.

Two studies, done in the 1970 and in the 1990s, revealed that 20%-25% of maintenance work on software are spent for corrective maintenance (bugfixes), 55% on perfective maintenance (new features) and 20%-25% on adaptive maintenance (porting from obsolete to new platforms and operating systems).

It is interesting that the paper defines binary translation as a technique used for porting software to a newer machine - my solution is supposed to help running software on an incompatible but equally new machine. Futhermore, the paper summarizes the following four ways to port software to a new platform, beginning with the method that leads to the highest performance:

compiling from source
recompilation
interpreter in hardware
interpreter in software

Recompilers that can translate exactly one source language into exactly one target language are referred to as first generation recompilers. Second generation solutions are retargetable, i.e. they take machine instruction specifications for any number of machines and can translate from any source into any target language.

The following quotation is very interesting: "SPARC is a big-endian RISC architecture, Intel x86 is a littleendian CISC architecture, and the PowerPC is a hybrid RISC/CISC that supports both big- and little-endian conventions." PowerPC isn't a strict RISC, okay, but doesn't the term "hybrid RISC/CISC go too far? When searching for "hybrid RISC/CISC" on the internet, most hits refer to the design of modern i386 CPUs as "hybrid RISC/CISC".

21 June 2004

Misalignment Exception

The i386 can indeed thrown an exception on a misaligned access. Koitsu gave me the following code to show it:

.align 4
.globl main
main:
	pushfl
label1:
	popl %eax
	or $0x00040000, %eax
	pushl %eax
	popfl
	movl (label1), %eax
	ret

This code sets a flag in the i386 EFLAGS register to enable misalignment exceptions. While Windows does not hand down these exceptions to the application, both Linux and FreeBSD (thanks, Axel) hand these down as SIGBUS signals. The example terminates with a "bus error" if the "or" is enabled, and terminates cleanly if the "or" is disabled. So it should be possible to trap misaligned accesses, handle them, and return to the original code.

Parameter/Return Value Passing and Register Allocation

If all code of an application is known, and the application is "closed", i.e. no function gets called from the outside, register allocation can be done on the whole application: The successor of every "call" instruction is the first instruction of the subroutine, the successors of a "ret" are the next instructions of all call instructions that lead into this subroutine. Register allocation will be optimal then, and the code will be very fast, because there is no calling convention, and the register allocator can decide at any point which register to use for parameter passing or return values - in fact the allocator doesn't even know about subroutines. Unfortunately, complexity of liveness analysis in practice is between O(n) and O(n^2) (worst case O(n4); http://www.cis.upenn.edu/~cis570/slides/lecture03.pdf).

Hardly any real-world application today is "closed". But still, this method can be applied on code that does not get called from the outside. If all functions that get called from the outside are translated separately, with defined input and output (perhaps also with i386 stack frames), the rest can be translated this way - if all code is known.

Reconstructing the Signature of a Function

All subroutines that get called from unknown places need to be analyzed so that it is known what parameters it takes and what values it returns, so that the all call instructions (inside or this executable) that call this subroutine can be associated with the correct "use" and "def" sets, i.e. it is known for liveness analysis, what registers get read and what registers get written inside this function.

The parameters of a function can be found out quite easily: All registers that are live before the first instruction are parameters (first access is a read).

Finding out the return value(s) is more difficult. All registers that are either read (parameters can be returned directly) or written (all register that are written are either return values or temporary variables) are candidates for return values. (If a parameter gets passed through the whole function without ever being touched, it is no parameter, but the function preserves the register. This is the case, for example, if a C function returns its single parameter, which translates into a single "ret" in a PowerPC.)

The following requirement is sufficient for a return value, but not necessary: If no path exists from a "ret" to an instruction where a register is written without an instruction that reads the register on the path, the register is a return value. Unfortunately, return value A could be read to calculate return value B, so this rule would not recognize A as a return value.

The code that calls a function is helpful to find out what the function returns. If the "call" instruction is annotated with "changes all registers", and liveness analysis is done on the complete caller function, all registers that are alive after the "call" are return values of the subroutine. But this method has two downsides:

Since the caller is free to ignore (and never use) any number of return values, not all return values are recognized. All callers of this function would need to be analyzed. All callers are only known if the function never gets called from the outside (i.e. from unknown code), and if that is the case, we can as well use the "closed" algorithm described above.
The caller function has to be known completely. If there is a call to an function that has not yet been analyzed just after the call of the function we are about to analyze, we do not know what registers the second function uses as parameters, so liveness analysis cannot be done. Therefore, in order to do the liveness analysis of a function, it has to be known at least what the parameters of all functions are that are called within this function.

Because of this dependency, a lot has to be known in order to find out the return values of a single function. In practice, the parameters of many other functions need to be known. Therefore it makes sense to do it like this:

Find out all code in this executable.
Find out all parameters of all functions recursively. Start with the main function. If it calls any other functions, look at them first.
Find out all return values of private functions, using the algorithm above.

There can be one problem: Recursive calls. An easy solution might be to treat the call as a branch... let's see.

Calling Conventions

If the PowerPC code adheres to the PowerPC/Darwin calling conventions, all this might be a lot easier. These conventions state which registers can be parameters, which registers can hold return values, and which registers need to be preserved during a function call. It should make sense to try out how compatible this approach is.

18 June 2004

Endianness and Alignment

There are two ways to deal with the endianness issue:

Swap all data from short and int memory accesses
Store all ints in the CPU's native format, swap all short, byte and unaligned accesses.

The latter is supposed to be faster, because most memory accesses by far are int accesses. The problem of this solution comes with the more complex byte and short accesses, as well with unaligned accesses.

If a byte is accessed at a constant location, the recompiler can change the address by XORing it with 3. Short accesses are more complicated:

a & 3 | result                     |
------+----------------------------+------------
  0   | swap(short(a+2))           |
  1   | swap(short(a))             |
  2   | swap(short(a-2))           |
  3   | byte(a-3) << 8 + byte(a+4) | (unaligned)

Again, if the address is a constant, the recompiler can generate code that adjusts the address and swaps the data. But if the address is a variable, the test has to be done at runtime.

But there is yet a case that is worse: aligned int reads need no adjustment at all, but unaligned int reads do - and it would be very inefficient to test whether an address is aligned every time before accessing memory, in fact, the whole solution would be slower than just swapping all shorts and ints. The only way around this are misalignment exceptions. They are slow, if they happen, but most times, they will not happen, and no extra code is necessary.

Can the i386 throw an exception on a misaligned access? I found nothing. If it can't I see no way to implement this idea.

10 June 2004

Relocatable Code, Position Independent Code and Position Dependent Code

On Darwin/PPC, there are two mechanisms to make code run at any location:

Position Independent Code: All addresses are calculated by the generated code at runtime using the current program counter as the base address. The following example reads the contents of a global variable into r2:
```
        bl label1
label1: mfspr r0, lr
        add r0, r0, variable-label1
        lwz r2, (r0)
```
Relocations: All addresses that are used in the code are associated with a relocation entry. The linker patches the code on load time.
```
        li r0, variable
        lwz r2, (r0)
```
If the code gets relocated, the first instruction gets patched.

GCC outputs position independent code by default, i.e. all data segment accesses are PC-relative. This can be turned off with "-mdynamic-no-pic": All data segment addresses are absolute then. In all cases, GCC keeps the code relocatable, i.e. relocation entries are created for all accesses. Apple recommends to turn this option on for all executables.

The linker can then either preserve or remove certain relocation entries. When linking executables, the linker removes them.

So .o files and libraries contain the relocation information, and can be either position-independent or fixed, depending on the GCC setting. Executables contain to relocation information but can still be either relocation-independent or fixed.

There is one special case: If executables are compiled with position-independent code turned on, this leads to an executable that has some of the PIC properties, but cannot be run at any different point in memory. The PIC setting of GCC generates PC-relative accesses of all global variables, but code references, i.e. function pointers are not generated this way. They are not position independent, but relocatable. The linker removes the relocations, so the code has the instruction for position independence in it, but running it at any other position will break it because of the function pointers.

Making Use of Relocations

A recompiler can make use of the information provided by the relocations. All function pointers can be detected, because they are attributes with a relocation entry. This way, it should be possible to find all code paths in a compiled executable, and to replace PPC function addresses with the i386 addresses everywhere.

Keeping Two Instances of a Register Apart

While conventional compilers can see every source variable as one variable, and combine several variables in one destination register, source variables of a recompiler can already be made up of several variables, because the original compiler had to map several source (C) variables to one destination (PowerPC) register. So one source variable of the recompiler, i.e. a PowerPC register, can often be decomposed into several variables. Let's call the different C variables mapped to a PowerPC register the different instances of a register.

Every instance of a register should be treated like a separate register. But how do we keep the separate instances apart? An easy way would be to traverse the liveness analysis data instruction by instruction. Every time the life of an uncoloured variable starts its life, its whole life is traversed (following all paths) and coloured. The new number of the variable is the register number combined with the colour.

If a register is read and written in the same instruction, it is not considered dead in between, but it is, if it is read in one instruction, and written to in the next one.

This algorithm can simply be plugged into the register allocator. The rest of it should work just as well with or without it. It might be a good idea to make the recompiler decide dynamically whether to turn this feature on or off at certain points of optimization.

Atomicity Problems

One PowerPC instruction might be converted into more than one i386 instruction. This can be a problem if the instruction is supposed to be atomic. Although I don't know of any examples yet, this should be investigated again later.

Register Allocation Strategies

A lot has been said about register allocation - here are the three most popular methods.

Direct Mapping

Every source register is mapped "to its own exclusive target register" (bintrans.pdf). This makes sense if the target machine has about as many or more registers than the source machine, else it is inefficient. The bintrans i386 to Alpha recompiler does it this way.

My first pass does it this way as well; two target registers are allocated dynamically with source registers on a by-instruction basis.
Register Caching

All registers are kept in memory and loaded into target registers when they are used. If all target register are allocated with source registers, the least recently used register gets written back to memory and reused for another source register. At the end of every block, the registers have to be written back. PearPC does it this way. The problem is that there is no easy way to hand loaded registers from one block to the next one without storing and reloading them. Neither bintrans nor PearPC have found a solution.

Perhaps my first pass should do it this way, as it probably generates faster code and should not slow down recompilation much.
Register Allocation

Complete functions are analyzed, liveness analysis is done and target registers are allocated only for the time an instance of a source register is alive.

My second pass is supposed to do it this way.

Possible Memory Layout Problems?

bintrans.pdf states that in general it is a problem to do dynamic recompilation of 32 bit applications for another 32 bit platform, because of memory constraints. I can't see right now why this should be a problem. The recompiler can be seen as a library, and possibly also be loaded to a dynamic location in the target address space. All extra data needed can be allocated dynamically. Sure, any application will need more physical RAM on the target machine, but it should not be critical for any application.

06 June 2004

Liveness Analysis & Register Allocation

Liveness Analysis and Register Allocation seems to be for compiler construction like scheduling algorithms are for operating systems: Few people actually implement it, but it is a major topic in the corresponding university lectures.

Liveness Analysis

The intermediate language of compilers typically handles any number of registers, that is, variables, that have to be mapped to a very limited number of registers during code generation. Also in this case, 32 general purpose registers on the PowerPC have to be mapped to 6 to 8 general purpose registers in the i386. Liveness analysis makes sense to find out the life-span of a register. Source registers that have no overlapping life-spans can be mapped to one target registers. A register is alive if its value will be used later, else it is dead. More formal: A register is alive at the time of a certain instruction if a path exists from the instruction to one that reads the register.

Let's take an iterative fib() example:

cc
                                            1111111111222222222233ltr
                                  01234567890123456789012345678901rr0
---------------------------------------------------------------------
00 fib:    cmpwi   r3,0x1       |    R                              W  
01         or      r0,r3,r3     | W  R                                 
02         ble     label1       |                                   R  
03         addic.  r3,r3,0xffff |    M                              W  
04         li      r9,0x0       |          W                           
05         li      r0,0x1       | W                                    
06         beq     label3       |                                   R  
07         mtspr   ctr,r3       |    R                             W   
08 label2: add     r2,r9,r0     | R W      R                           
09         or      r9,r0,r0     | R        W                           
10         or      r0,r2,r2     | W R                                  
11         bdnz    label2       |                                  M   
12 label3: or      r0,r2,r2     | W R                                  
13 label1: or      r3,r0,r0     | R  W                                 
14         blr                  |    R

The PowerPC code uses 4 general-purpose registers, the Count Register, and the condition register CR0. Not counting the Link Register, this is a total of six registers. According to PowerPC Calling Conventions, only r1 and r13-r31 are preserved during a function call, and r3 holds the return value. So the only register that will possibly be read after the function in this case is r3, that's why "blr" is marked to read r3.

cc
                                            1111111111222222222233ltr
                                  01234567890123456789012345678901rr0  
---------------------------------------------------------------------
00 fib:    cmpwi   r3,0x1       |    L                              L  
01         or      r0,r3,r3     | L  L                              L  
02         ble     label1       | L  L                              L  
03         addic.  r3,r3,0xffff |    L                              L  
04         li      r9,0x0       |    L     L                        L  
05         li      r0,0x1       | L  L     L                        L  
06         beq     label3       | L  L     L                        L  
07         mtspr   ctr,r3       | L  L     L                       L   
08 label2: add     r2,r9,r0     | L L      L                       L   
09         or      r9,r0,r0     | L L      L                       L   
10         or      r0,r2,r2     | L L      L                       L   
11         bdnz    label2       | L L      L                       L   
12 label3: or      r0,r2,r2     | L L                                  
13 label1: or      r3,r0,r0     | L  L                                 
14         blr                  |    L

Register Allocation

r0 overlaps with r2, r3, r9, ctr, cr0
r2 overlaps with r0, r9, ctr
r3 overlaps with r0, r9, ctr, cr0
r9 overlaps with r0, r2, r3, ctr, cr0
ctr overlaps with r0, r2, r3, r9
cr0 overlaps with r0, r3, r9.

The interference graph of the example looks like this:

                                  r0
                              **********                         
                          ****    **    ****                   
                      ****       * **       ****              
                  ****         **  * **         ****         
                r2***         *    *   *        ****r3
                 *   ****   **     *    **   ****   *  
                 *       ***       *      *** *     *  
                 *       ** ****   *  **** ***      *  
                 *      *       ******      **      *  
                 *    **       ********    *  **    *  
                 *   *      ***    *   ****     *   *  
                 * **   ****       *     *****   ** *  
                 **  ***           *    *     ***  **  
                r9**********************************ctr
                  ****             *  *                
                      ****         * *                 
                          ****     **                  
                              ***cr0

The graph has an edge between registers if the two registers interfere, that is, they overlap in time. Registers that do not interfere can be assigned a single register. This graph has all possible edges except for cr0/ctr and r2/r3. So cr0 and ctr can be assigned one target register, as can r3 and r3. This means that this algorithm can be done on the i386 with only 4 registers. The condition register might even be completely eliminated as a GPR, by replacing it with the i386 flags (see below).

LA & RA when doing Binary Translation

When compiling high level languages into assembly, source registers (variables) are rarely reused. If a register is dead, it may be alive again in the next iteration of the loop the program is in, but it will rarely be alive and hold data with a different meaning later. But this is very common when the source language already is assembly language, because, if the algorithm needs many registers, the compiler might have mapped several source (C) register to one target (PowerPC) register. So PowerPC assembly code reuses the same register for purposes. In the example above, there are holes in which r0 and r3 are dead. Register r3 in instructions 0 to 7 has nothing to do with r3 in instructions 13 to 14. So it can be seen as two different registers. If there are more registers, better allocations might be found.

Condition register & Flags optimization

As mentioned above, condition registers, flags and side effects of instructions can be optimized.

00 fib:    cmpwi   r3,0x1       | A
01         or      r0,r3,r3     | A
02         ble     label1       | A
03         addic.  r3,r3,0xffff |  B
04         li      r9,0x0       |  B
05         li      r0,0x1       |  B
06         beq     label3       |  B

The liveness of cr0 reaches only up to the next conditional branch, in either case. The two instances of cr0 are completely independent. So, in this case, cmpwi/ble and addic./beq form a team. If the instruction that modifies the flags and the instruction(s) that read the flags know about each other, both instructions can be optimized. In this case, cmpwi/ble can become cmp/jle and addic./beq can become add/jz. The flags won't be needed afterwards, so they don't have to be converted into cr0 format or written back.

Similarly, unnecessary output to the carry flag or the summary overflow flag, can be eliminated.

When we apply the cr0 optimization, and use the following register mapping:

r3, r2 -> esi
r0     -> edi
ctr    -> ecx
r9     -> ebx

then we get the following translation:

00 fib:    cmpwi   r3,0x1       | fib:    cmp esi, 1
01         or      r0,r3,r3     |         mov edi, esi
02         ble     label1       |         jle label1
03         addic.  r3,r3,0xffff |         sub esi, 1
04         li      r9,0x0       |         mov ebx, 0
05         li      r0,0x1       |         mov edi, 1
06         beq     label3       |         jz label3
07         mtspr   ctr,r3       |         mov ecx, esi
08 label2: add     r2,r9,r0     | label2: lea esi, [ebx+edi]
09         or      r9,r0,r0     |         mov ebx, edi
10         or      r0,r2,r2     |         mov edi, esi
11         bdnz    label2       |         loop label2
12 label3: or      r0,r2,r2     | label3: mov edi, esi
13 label1: or      r3,r0,r0     | label1: mov esi, edi
14         blr                  |         ret

The i386 version is just as long and just as efficient as the PowerPC version (yes, loop should perhaps be replaced with dec/jnz). The i386 calling conventions might lead to some more statements, but all this will make the function only slower by a very small percentage.

Note that output to the condition register must be explicitly turned on on the PowerPC, so there can be any number of instructions between the compare and the branch without overwriting the flags. On the i386, many instructions modify the flags, so the flags might be overwritten before the conditional branch is reached. In the example above, "xor ebx, ebx" had to be avoided in order to preserve the i386 flags. In the general case, this is not always possible. There are two solutions: The easy one is to push the flags as soon as an instruction has to be generated that modifies the flag, and pop the flags just before the branch. This can be implemented quite easily. The other solution would be to reorder the instructions, if possible. In the example, the cmp and sub instructions that modify the flags could have been put just before the branch.

03 June 2004

Dynamic Register Allocation

I earlier said that dynamic register allocation might not be necessary, because software tends to use the same registers anyway. This is only partially true. GCC starts to use registers starting with r3, but tries to avoid using registers twice to keep register dependencies low. So a calculation in 5 steps usually needs about five registers. Very simple basic blocks often use registers r3 to r10. So dynamic register allocation is probably a good idea.

During the first pass (the first unoptimized translation), information about register usage can be gathered. Every register read and write has to be logged. This includes special purpose registers. The "switch" jump table code from above

00         mfspr   r2,lr
01         bcl     20,31,label
02 label:  mfspr   r10,lr
03         mtspr   lr,r2
04         cmplwi  r3,entries
05         bgt     default
06         addis   r6,r10,0x0
07         rlwinm  r0,r3,2,0,29
08         addi    r5,r6,table-label
09         lwzx    r4,r5,r0
10         add     r3,r4,r5
11         mtspr   ctr,r3
12         bctr

would result in the following graph:

                                    c
             1111111111222222222233lt
   01234567890123456789012345678901rr total
00   W                             R    2
01   |                             W    2
02   |       W                     R    3
03   R       |                     W    3
04    R      |                          2
05    |      |                          2
06    |  W   R                          3
07 W  R  |                              3
08 |  | WR                              4
09 R  |WR                               4
10    WRR                               3
11    R                             W   2
12                                  R   1

r0 is written in instruction 7, and read in instruction 9. If r0 is register-mapped, the "W" can write to an i386, and the "R" can read from this register. If it is memory-mapped, two memory accesses are needed, so it makes sense to map r0 to an i386 register.

r2, r4, r6, r10 and ctr are very similar. They are written once, and read afterwards.

r3 is read in instructions 4 and 7, written in instruction 10 and read again in instruction 11. It certainly makes sense to map it to an i386 register.

r5 is written in instruction 8 and read in instructions 9 and 10. This is similar to r0, and mapping it to an i386 register is also a good idea.

lr gets read and written a lot and should also be register mapped.

So all registers that have been used should be register mapped. In this case, this is r0, r2, r3, r4, r5, r6, r10, lr and ctr. That is nine registers, but the i386 only has 8 registers, and 2 of them should be kept free for various purposes (multiplications/divisions, condition code conversion, ...). But not all registers have are used all the time. The total column shows that this code never uses more than 3 registers.

When to Read and when to Write Back

Let's assume that before the block is run, all PPC registers are stored in i386 memory, and at the end of the block, the PPC registers in 386 memory have to reflect all changes, that is, every basic block is completely independent of all other blocks. Optimizations are done within a block only.

If a PPC register is i386-register mapped, and the first access is a read access, it has to be read from memory before it is used for the first time. It can be read at the very beginning, not no later than just before the instruction that reads it. The earlier it gets read, the faster the execution of the block can be, because we can avoid delays because of a dependency on a register whose memory read is not completed yet. The later it gets read, the longer the register is free for another mapping.

Similarly, after the last write access, an i386-register mapped PPC register had to be written back into memory. The earliest time is just after the last write access, and the latest time is at the very end of the block. The earlier it gets written, the better the CPU can time memory accesses, and the earlier the register can be used for other mappings. So writebacks should occur just after the last write access.

Mapping Algorithm

   EEEE
   CBSD
   XXII
00 2L  
01 2L  
02 2L a
03 2L a
04  3 a
05  3 a
06  36a
07 036 
08 0365
09 0345
10  345
11 C3  
12 C   
(a = r10, L = lr, C = ctr)

That's a first-fit strategy, which might not be so bad. It is usually good if two mapping are not adjacent, because of dependencies, even if an extra register is needed. In this case, the algorithm can be translated into i386 code that uses 4 registers (plus one or two temporary registers). If we see that there are many blocks that would need more than 6 i386 registers, it will make sense to think about better strategies. Because of the limited number of target registers and the resources available when compiling offline, a brute force approach to find the optimal allocation might make sense in this case.

If the number of PPC registers that should be mapped is lower than or equal to six, we can do a 1:1 mapping. Another algorithm might be okay for both cases: Allocate six i386 registers with the most important six PPC registers, then fill the free space with the rest, using first fit. In the example, we will have the following allocation:

   EEEEEE
   CBBSSD
   XXPPII
00  2L   
01  2L   
02 a2L   
03 a2L   
04 a 3   
05 a 3   
06 a 3  6
07 0 3  6
08 0 3 56
09 0 345 
10   345 
11 C 3   
12 C     
(a = r10, L = lr, C = ctr)

Now we know when registers are allocated and when they are free. The instructions during which i386 are free can be used to prefetch PPC registers. In the above case, only two registers are read before they are written: r3 and lr. Unfortunately, lr is read in the first instruction, and in both mappings, the i386 register r3 is mapped to is used in the instruction before the first access to r3, so no further optimizations can be done. In the general case, every register that is read before is is written should be read as soon as possible. For example, if r6 was read before it is written, it could be loaded into EDI in the first instruction already.

When to Map to Registers

It does not make sense to map a PPC register to an i386 register in any case. If a register is only written once and never read, the value has to be written to memory only and there is no need to store it in a register first, in some cases it can even be slower. The following table contains all cases:

Reads	Writes	First Access	Where to map	Why
1	0		memory	The register should not be mapped. As a memory-mapped register, it is read into a temporary register as soon as the value is needed. Some i386 instructions can even use memory as the source operand so an extra read is not necessary.
0	1		memory	The register should not be mapped. As a memory-mapped register, it is written to as soon as the resulting value is calculated. Some i386 instructions can even use memory as the destination operand so an extra write is not necessary.
1	1	(same time)	memory	Read/Modify/Write. The register is neither needed before nor afterwards, so it need not be register-mapped. Since the i386 contains some RMW instructions that operate on memory directly (like inc memory), leaving it memory-mapped can even be better in some cases.
1	1	R	memory	Read/Modify/Write. The register is read one and written to later. This translates into one load and one store if it is memory-mapped. Register mapping cannot be more efficient.
>1	0		register	The register is read multiple times, so it should be read into a register.
0	>1		register	The register is written multiple times. This happens if one value is loaded into a register in one case and another one in the opposite case. Then one value will be loaded and in one case, it will be overwritten. This should now happen within a basic block, but only in bigger blocks. A register can cache the register and keep the number memory writes low.
else			register	The register is read and written multiple times (or once each, and the write came before the read). Mapping the PowerPC register to an i386 register keeps the number of memory accesses lower.

02 June 2004

Converting PowerPC "switch" Jump Tables

After I had implemented a first version of translating "switch"-style jump tables, Axel and me discussed about possible optimizations.

GCC/PowerPC "switch" Jump Tables

GCC 3.3 typically converts a switch into something like this:

        [index to table in r3]
        mfspr   r2,lr            ; save lr
        bcl     20,31,0x1f60     ; lr = label
label:  mfspr   r10,lr           ; r10 = label
        mtspr   lr,r2            ; restore lr
        cmplwi  r3,entries
        bgt     default          ; c > entries -> default
        addis   r6,r10,0x0       ; r6 = label
        rlwinm  r0,r3,2,0,29     ; r0 = c << 2
        addi    r5,r6,table-label ; r5 = table
        lwzx    r4,r5,r0         ; get entry from table
        add     r3,r4,r5         ; address of code
        mtspr   ctr,r3           ; ctr = address
        bctr                     ; jump to code
table:
        .long case0 - table
        .long case1 - table
        .long case2 - table
        [...]
case0:
        [...]
case1:
        [...]

The algorithm is quite complex compared to a simple i386 "jmp table(%eax,4)". The reason for it is that all PowerPC code is relocatable. All entries in the jump table are relative to the start of the table, and the address of the table gets calculated pc-relative, using a "bl" and a successive evaluation of the link register.

table = label + (label-table)
jump mem(table+index*4) + table

Straightforward Translation

A first translation of this basic block should be done without optimizations. The "bcl 20,31,$+4" is a special case and can be translated into a "mov address, %ebp", where address = ppc_code_in_i386_ram + $ + 4 - ppc_base, i.e. the i386 pointer to the ppc jump table that is held in RAM, so that reading the entry from the table will return the correct value. The "bctr" will return to the interpreter, with ctr pointing to the PowerPC code that is supposed to be jumped to - in i386 address format: ctr points to the i386 RAM location where the target code is stored. It can be converted back into a PowerPC address by subtracting ppc_code_in_i386_ram - ppc_base from it.

Detecting "switch" Jump Tables

Returning to the recompiler every time a jump table is encountered is very expensive. It can be moved into native i386 code in a second pass. For this, some additional data has to be collected during the first pass:

whether the basic block contains a "bcl 20,31,$+4" and ends with a "bctr" - then it's a jump using a "switch" jump table which can be optimized in a second pass
the address of "label", i.e. the address of the instruction after the "bcl 20,31,$+4"
the address of "table", i.e. the address after the "bctr"

The analysis has one problem: The test index < entries contains a conditional jump which, depending on the strategy, usually ends a basic block. If this is the case, the analysis becomes more complex: The statistics data for every block will need two additional entries: Whether it contains a "bcl 20,31,$+4", and whether it ends with a "bctr". If a block is supposed to undergo pass 2 optimization, and it contains a "bcl 20,31,$+4", it is looked up whether the "not-taken" successor ends with a "bcr". In this case, these two blocks form a "switch" jump table jump. During the second pass, the size of the jump table has to be found out first. A simple method would be to find the first int32 that is >= 256M - all valid PowerPC instructions are above or equal to this value, and the switch destination are very unlikely to be scattered over 256 MB. A cleaner solution would be to walk down the table and stop as soon as an address is reached that has been found in the table - because all code of all cases is stored directly after the table.

Emitting Optimized Code

So now we have "label", "table", "entries" and "diff" (=table-label). We need to recompile all cases first. Since we know the size of the jump table, we also know the addresses of all target blocks. The jump table can be converted into i386 format by replacing every PowerPC offset by the offset of the i386 code relative to the jump table. This jump table is of course stored within the recompiled code. We can now recompile the block again (or patch in the differences): The "mov address, %ebp" will have an "address" value of "i386_jumptable - diff", so that after "diff" gets added to it, it points to the address where the converted jump table resides in i386 memory. In addition, the "bctr" will can be translated into "mov reg.ctr, %eax; jmp *%eax". The resulting code will look somewhat like this:

        [index to table in %esi]
        mov %ebp, %ebx
        mov $x86_table - (ppc_label - ppc_table), %ebp
        mov %ebp, %ecx
        mov %ebx, %ebp
        cmp %esi, entries
        ja default
        mov %ecx, %ebx
        shl $2, %esi
        add $ppc_table-ppc_label, %ebx
        mov (%ebx+%esi), %eax
        add %ebx, %eax
        mov %eax, reg.ctr
        mov reg.ctr, %eax
        jmp *%eax
x86_table:
        .long x86_case0 - x86_table
        .long x86_case1 - x86_table
        .long x86_case2 - x86_table
        [...]
x86_case0:
        [...]
x86_case1:
        [...]

One further optimization would be to remove the "add $ppc_table-ppc_label" instruction, so that "x86_table" can be stored in the emulated link register in the beginning, which produces code that is cleaner and more readable.

27 May 2004

Converting Darwin/PPC Stack Frames into x86 Stack Frames

On Darwin/PPC (Mac OS X), stack frames look like this:

create stack frame (prolog):

mfspr   r2,lr
stw     r2,0x8(r1)
stmw    r29,0xfff4(r1)
stwu    r1,-0x50(r1)
remove stack frame (epilog):
lwz     r4,0x58(r1)
addi    r1,r1,0x50
lmw     r29,0xfff4(r1)
mtspr   lr,r4
blr

The order of the instructions can of course be permuted within some limits. In the prolog for example, the stack pointer can be decremented first, if the offsets of the stm and stmw instructions are adjusted. In the epilog, the "addi" can be at the very beginning or the very end as well. It can also be replaced by "lwz r1, (r1)" - GCC's optimization level 0 does it like this, for example. Stack frame code created by CodeWarrior doesn't look basically different. (HTE http://hte.sourceforge.net/ is a great tool for exploring executables, also in PEF format.)

Can these blocks of code be converted into something like this?

mov -12(%esp), "r29"
mov -8(%esp), "r30"
mov -4(%esp), "r31"
sub $0x4C, %esp

and

add $0x4C, %esp
mov "r29", -12(%esp)
mov "r30", -8(%esp)
mov "r31", -4(%esp)
ret

Ignoring the code that saves/restores registers, the actual create/remove code looks like this:

Method 1: Do not Save Stack Pointer

sub $0x4C, %esp
[...]
add $0x4C, %esp

The advantage of call/ret instead of a simulated link register on the i386 is performance. Although in theory a link register would increase performance, i386 processors pipes are optimized for call/ret sequences.

In this code, we just ignore the link register, and we use the i386 call/ret functions for subroutines. The stack frames are still compatible:

 ...                     ...
|LR            |        |DATA          |
|previous SP   |        |(unused word) |
|--------------|        |--------------|
|DATA          |        |return address|
|DATA          |        |DATA          |
 ...                    |DATA          |
|DATA          |         ...
|DATA          |        |DATA          |
|LR            |        |DATA          |
|previous SP   |<- SP   |(unused word) |<- SP
+--------------+        +--------------+

They occupy the same amount of stack space, and, not counting the fields for the LR register and the stack pointer, they contain the same data at almost the same position - all fields are off by one word, but this can be compensated easily by adding 4 to the offset in all sp-indexed memory accesses.

The problem with this code is that the recompiler needs to know the size of the stack frame when writing the i386 version of the epilog. If GCC has used the "addi", this is easy, but GCC sometimes uses the both the "addi" and the "lwz" method in the same executable (found in QuickTime Player, for example).

An algorithm would have to find the corresponding prolog for every epilog to find out the size of the stack frame: Every basic block that contains a prolog will have the stack size attributed to it. Basic blocks that are linked with a conditional or unconditional branch inherit the value.

There are two more solutions that just use different prolog/epilog code. Both save the contents of the stack pointer before creating the stack frame. The value can either be saved on the top of the stack, as the PowerPC code does it, or the value can be stored in a register, as it is done by compiled i386 code.

Method 2: Save Stack Pointer on Stack

mov %esp, %eax
sub $0x48, %esp
push %eax
[...]
pop %esp

The prolog can also be written like this:

mov %esp, %eax
sub $0x4C, %esp
mov %eax, (%esp)

Or like this:

mov %esp, -0x4C(%esp)
sub $0x4C, %esp

This mimics the original "stwu", but saving data in not yet allocated stack space is no good idea.

Using this method, the stack looks like this:

 ...                     ...
|LR            |        |DATA          |
|previous SP   |        |previous SP   |
|--------------|        |--------------|
|DATA          |        |return address|
|DATA          |        |DATA          |
 ...                    |DATA          |
|DATA          |         ...
|DATA          |        |DATA          |
|LR            |        |DATA          |
|previous SP   |<- SP   |previous SP   |<- SP
+--------------+        +--------------+

The stack frame has the same size. All data fields are off by one again, but, again, this can be compensated easily.

Method 3: Save Stack Pointer in a Register

push ebp
mov %esp, %ebp
sub $0x48, %esp
[...]
mov %ebp, %esp
pop %ebp

On the stack, this method looks like this:

 ...                     ...
|LR            |        |DATA          |
|previous SP   |        |DATA          |<- BP
|--------------|        |--------------|
|DATA          |        |return address|
|DATA          |        |previous BP   |
 ...                    |DATA          |
|DATA          |         DATA          |
|DATA          |         ...
|LR            |        |DATA          |
|previous SP   |<- SP   |DATA          |<- SP
+--------------+        +--------------+

This is what typical i386 stack frame code looks like. The original stack pointer is saved into the "base pointer" (BP) after the old BP has been saved on the stack. This method caches the most current old-SP in the BP, not unlike the most current return address is cached in the Link Register on the PowerPC. The main reason for this is that on the i386, the stack pointer can change within the body of a subroutine, because parameters to subroutines are passed by placing them on the TOS in-time. This way, parameters and local variables can be addressed BP-relative with constant offsets.

On the PowerPC, all stack accesses are SP-relative. It is not easily possible to convert them to BP-relative accesses on the i386, because the offset difference depends on the size of the stack frame. The algorithm presented earlier is necessary to find out for every basic blocks which subroutine it belongs to, in order to find out the offset difference for all accesses within this block. Other than shorter encoding (BP-indexed is shorter than SP-indexed), this has no advantages, unless parameter passing will also be converted from stores into the current stack frame (PowerPC) to push instructions (i386), which can only be done by analyzing complete subroutines.

This method needs all stack-relative accesses to be adjusted by 8 bytes.

As an alternative, the BP could be saved at the TOS:

mov %ebp, -0x4C[%esp]
mov %esp, %ebp
sub $0x4C, %esp
[...]
mov %ebp, %esp
mov -0x4C(%esp), %ebp

This looks like this:

 ...                     ...
|LR            |        |DATA          |
|previous SP   |        |previous BP   |<- BP
|--------------|        |--------------|
|DATA          |        |return address|
|DATA          |        |DATA          |
 ...                    |DATA          |
|DATA          |         ...
|DATA          |        |DATA          |
|LR            |        |DATA          |
|previous SP   |<- SP   |previous BP   |<- SP
+--------------+        +--------------+

All stack-based accesses have to be adjusted by 4 bytes.

Comparison

Method 1 presented would be be best in performance as all it needs is a "sub" and an "add", as it does not save the old stack pointer. Unfortunately, the algorithm to create this kind of code is more complex.

Method 2 consists of 3-4 instructions, 2 of them memory accesses.

Method 3 consists of 5 instructions, 2 of them memory accesses, and permanently occupies an i386 register. Using the algorithm that is necessary for method 1, all SP-based accesses can be converted into BP-based accesses, which have a shorter encoding.

Unless I am missing something here, method 1 looks like the clear winner. The additional algorithm isn't all that hard to implement and needs little additional resources; and the generated code is a lot faster.

Not all code might be compatible with this simplified translation, so the conventional method that simulates a link register should always be available as a runtime option.

How to do the Translation

We need to translate something like this:

mfspr   r2,lr
stw     r2,0x8(r1)
stmw    r29,0xfff4(r1)
stwu    r1,-0x50(r1)

and

lwz     r4,0x58(r1)
addi    r1,r1,0x50
lmw     r29,0xfff4(r1)
mtspr   lr,r4
blr

into this:

mov -12(%esp), "r29"
mov -8(%esp), "r30"
mov -4(%esp), "r31"
sub $0x4C, %esp

and

add $0x4C, %esp
mov "r29", -12(%esp)
mov "r30", -8(%esp)
mov "r31", -4(%esp)
ret

These are the steps necessary:

by default (signed) framesize = 0 (attribute of a basic block)
If a new subroutine is entered, framesize = 0.
If a basic block is branched to, framesize is inherited.
If an "mfspr rn, lr" is to be translated, emit no target code
If an stw is to be translated that uses stack-based addressing outside the current stackframe (offset > framesize), emit no code
For all r1-indexed stores, if "framesize" > 0, the offset will be decreased by 4.
If an "stwu" is to be translated that has r1 as both operands, set framesize = n, and emit a "sub $n, $esp", the latter with n decreased by 4.
If a "bl" is to be translated, emit a "call"
If an SP-relative load loads from above the current stackframe (offset > framesize), emit no code.
"addi r1, r1, n" gets translated literally, framesize = 0.
"lwz r1,(r1)" gets translated to "add $framesize, %esp", framesize = 0.
If "mtspr lr, rn" is to be translated, emit no code.
"blr" translates to "ret"

A futher idea would be to convert instructions that save registers on the bottom of the new stack frame, usually using stmw, into i386 push instructions. The algorithm above would have to be extended to increase "framesize" by 4 every time a value is pushed, to decrease it by 4 every time a value is popped, and to decrease the immedate value of the add and sub instructions by the value of framesize. Compiled i386 code usually uses push/pop to save/restore registers, the performance might be better using this method.

Interpretation and Recompilation

The interpreter should always execute every single instruction exactly. If calling conventions are converted into i386 style, an easy switch at runtime between recompiled and interpreted code is not possible, because the offsets of data on the stack are different. If i386 calling conventions are used, the interpreter must also act the same way (ignore link register instructions, store instruction pointer on stack when a "bl" is encountered, ...), which makes it unsuitable as a fallback mechanism.

So, if the recompiler uses i386 calls, then the interpreter can be used for less. The question is whether the interpreter is needed after all.

PearPC includes a dynamic recompiler that recompiles every basic block before execution, i.e. it never interprets. Complex instructions are converted into subroutine calls. PearPC statistics at http://pearpc.sourceforge.net/wiki/index.php?Statistics state that very little time is actually needed for recompilation:

linux + x11 + xfce + gimp: 2.5%

until darwin login screen: 0.9%

compilation of ht: 0.2%

Of what I know about his code (I haven't looked at it yet, just talked to him), his code generation is probably faster than mine by a small linear factor. But my generated code is also a lot faster, because it doesn't have to simulate the MMU. So the numbers should even be smaller in my case.

Looking at it differently: The iPhoto application consists of about 1.5 MB of code, which is about 400,000 PowerPC instructions. If one instruction needs 100 i386 clock cycles to be translated, the whole application needs 40,000,000 cycles. On a 1 GHz CPU, this is .04 seconds. If translation needs 1000 cycles pre instruction, iPhoto will still be translated in less than half a second. On my G4/800 with 640 MB of RAM, launching iPhoto after it had just been started and closed (in order to measure raw CPU time needed), it took 20 seconds until my photos appeared.

Why Dynamic Register Allocation?

How much faster is a translated program if registers are allocated dynamically in every subroutine compared to the same program with a static allocation, i.e. the same PowerPC registers map to i386 registers throughout the whole program, and certain registers are always simulated using memory?

There are two kinds of code: Code that uses at most ~7 registers, and code that uses more register. All compiled code that uses few registers always uses the same registers, typically the volatile registers r3, r4... and the nonvolatile registers r31, r30, ... If the static system uses i386 registers for these registers that are used most frequently, the static method generates code that is just as fast as code generated by a dynamic method.

If code needs more registers, the dynamic method has to map the most important PowerPC registers to i386 registers, and map the rest to memory - static allocation maps a fixed set of PowerPC registers (r3+ and r31-) to i386 registers, and maps the rest to memory. If all registers are about equally important, both approaches are equally good. If the fixed set is statistically more important, the dynamic approach is hardly much better. Only if the everything other than the fixed set is most important, the dynamic approach is really better.

All this is of course only true if "dynamic register allocation" only means finding the most important registers. A more advanced approach would be to do data flow analysis and map one or more PowerPC registers to one i386 register.

Though, I think, recompiled code may be quite fast even without doing dynamic register allocation. Sebastian Biallas from PearPC states in http://pearpc.sourceforge.net/wiki/index.php?sepp_CPU_chat that the performance problems of PearPC are caused by the software implementation of the MMU (every memory access has to be translated into a call of an algorithm that needs at least 40 cycles to execute), as well as the JITC timing/bookkeeping code and the missing FPU JITC (threaded code is used instead). The register constraints of the i386 are definitely not the main performance problem.

On the other hand, my recompiler neither has to simulate an MMU, nor to do timing. Given I will have an FPU JIT one day, all 3 problems mentioned by Sebastian will be irrelevant to my recompiler. Perhaps the i386 register constraints will be the most important problem then?

Endianness

Mac OS X is Big Endian, i386 CPUs can only do Little Endian. All ints are stored in reverse order if you compare the memory contents of both systems. A naive solution would be to swap all ints on the i386 before storing them and after reading them. This would mean two "bswap" instructions in case of a store, as bswap overwrites the original value, but a store must not modify the source register. A second bswap restores the value. In average, 1.5 additional instructions are needed for every memory access.

If the system was closed and only supported aligned int accesses to memory, nothing would have to be changed. The CPU can't find out what order the values are stored in. But as soon as the CPU can read single bytes, or data is exchanged with external sources, this matters.

So I think it will lead to better performance if all int accesses are left unchanged, and all byte accesses and transfers from/to the outside world are converted.

Byte Access Conversion

Compiled PowerPC code usually only accesses 4-byte aligned ints, so these will not be converted. The following table summarizes the differences of the byte offsets when addressing bytes within an int:

 CPU         | PowerPC | i386
-------------+---------+------
 byte number | 0123    | 0123
 order       | HhLl    | lLhH
 *2^         | 3210    | 0123

If the i386 stores all ints in its native format, byte accesses have to be adjusted to that the same byte is read that would have been read by a PowerPC. The last two bits of every address have to be taken into account: A value of 0 becomes 3, 1 becomes 2 and so on:

 PowerPC | i386
---------+------
    0    |  3
    1    |  2
    3    |  1
    0    |  0

In binary, it looks like this:

 PowerPC | i386
---------+------
    00   |  11
    01   |  10
    10   |  01
    11   |  00

So the conversion is basically an XOR with 0b11. So when doing byte accesses, all addresses have to be XORed with 3. This can be done at compile time (modify the constant address) or at runtime (modify the address register, do the access, then modify it back).

Unaligned Access Conversion

Unaligned accesses have to be converted as well, but they should be very rare. There are 3 kinds of unaligned int accesses. They can be converted into a sequence of loads and shifts.

Int16 accesses

The PowerPC architecture also allows halfword (short/int16) accesses, aligned and misaligned. These have to be converted as well.

Input/Output Conversion

All the above would be enough if the system was closed, i.e. if a raw CPU would be emulated. In my case, any program can call native libraries of the operating system. A function like fread() would return the data from a file in its natural order, which is wrong on the i386 emulating the PowerPC. All input/output functions have to swap the 4 bytes in every int, converting BE ints into LE ints, and converting a sequence of bytes into a very strange order. The same is true for the data section of executables: All data has to be prepared by the loader. (Thanks to Sebastian Biallas for this hint.)

Since there are always byte ordering problems when dealing with external data, this is no more complex than what would have to be done when storing all data in BE format.

PearPC

Pear PC emulates a PowerPC-equipped Apple Macintosh on a Windows or Linux PC. It emulates the full computer, just like Virtual PC. Although one of its aims is the same as the one of SoftPear - running Mac OS X on a PC - the method is very different. PearPC can run Mac OS X already, but at a quite low speed - still this is extremely amazing. PearPC simulates the PowerPC MMU in software, and has to track the exact state of the emulated CPU, that's why it has performance problems.

The dynamic recompiler in PearPC is quite advanced. Basic blocks are not split by conditional branches, and dynamic register allocation is done: All registers are memory mapped, and registers are cached in i386 registers when they are needed. At the end of every basic block, they are written back to memory. (I hope I got this one right.) More complex instructions and instructions that are rarely needed, as well as all floating point instructions are translated by adding calls to subroutines in assembly.

As the PearPC recompiler needs to be exact, the original PowerPC addresses are kept in the link register, which means that some control flow instructions have to return into the recompiler to evaluate the target in i386 code.

21 May 2004

How to Map PowerPC Flags

As stated earlier, the PowerPC does not have registers like Zero, Carry and Sign for storing the result of a compare. Instead, it stores the values of 8, 4 or 2 into one of the eight flags registers, meaning "<", ">" and "=", respectively. There are several ways we can map this into the recompiled x86 code. Given the following PowerPC code:

	cmpi r0, 5
[...]
	ble+ label1
	nop

Unfortunately, this cannot be easily translated into "cmp"/"jbe", because the compare instruction and the conditional jump instruction need not be in the same basic block; and further conditional jumps that depend on the result of the compare can follow in different paths of execution.

1) use intermediate x86 flags (without table)

	cmp $5, %esi
	lahf
	and $%11111011, %ah
	mov %ah, flags+0
[...]
	mov flags+0, %ah
	sahf
	jp signed
	jbe label1
	jmp label2
signed:
	jle label1
label2:
	nop

"lahf" transfers the x86 flags into the ah register. The parity flag is used to signal that the compare was an unsigned one. This value is then stored in the flags[8] array.

When the flags need to be evaluated, the value will be read from flags[8] again and copied into the x86 flags register. If the parity bit is set, it has been a signed compare, and the signed jcc instructions need to be used (label "signed").

This solution needs 3 instructions for the compare sequence (not counting the "cmp") and 4-5 instructions for the conditional jump. This makes a total of 7.5 instructions. The cascaded conditional jumps look disastrous for any CPU pipeline.

2) use intermediate x86 flags (with table)

It would make more sense to convert the x86 flags at the very beginning into a format that can be evaliated quickly afterwards:

	cmp $5, %esi
	lahf
[	mov flags_unsigned_to_signed(%ah), %ah ]
	mov %ah, flags+0
[...]
	mov flags+0, %ah
	sahf
	jle label1
	nop

x86 flags stored in flags[8] should always be in a format so that they can be correctly evaluated with the signed jcc instructions. So in case of an unsigned compare, the resulting flags have to be converted, either by copying the C flag (bit #0) into the S flag (bit #7) or by using a 256 bytes table (flags_signed_to_unsigned). In case of a signed compare (assuming signed compares are more frequent than unsigned ones), no such conversion has to be done. To evaluate the flags, the flags value has to be copied back into the flags register and a signed signed jcc is enough for the branch.

This solution needs 2 (signed compare) or 3 (unsigned compare) instructions for the comparison and 3 instructions for the conditional jumps. This makes a total of 5.5 instructions. The memory access we need for all unsigned compares (<=50% of all compares) might be a performance problem.

3) use PowerPC flags

	cmp $5, %esi
	lahf
	mov flags_unsigned_x86_to_ppc(%ah), %ah
	mov %ah, flags+0
[...]
	test $4, flags+0 // 4: ">"
	jz label1
	nop

This solution converts the x86 flags into PowePC format, using one of two 256 bytes tables, one for signed and one for unsigned flag conversion. The result is the stored in flags[8]. To evaluate it, there is no need to load it back into the x86 flags register; a "test" instruction is enough. It tests whether one or more bits are set or cleared. The combination test/jcc is directly analog to the PowerPC branch, which tests a single bit and branches in one instruction.

This solution always needs 3 instructions for the compare, and 2 instructions for the conditional jump. This makes a total of 5 instructions, but we always need one memory read.

4) use intermediate x86 flags (bad hack)

This is a combination of methods 3 and 4. We store the x86 flags in memory in case of a signed compare, and the signed-converted flags in case of an unsigned compare. The evaluation is then done using the test instruction instead of loading the flags value back into the flags register:

	cmp $5, %esi
	lahf
[	mov flags_unsigned_to_signed(%ah), %ah ]
	mov %ah, flags+0
[...]
	test $0xc0, flags+0 // Z=1 || S=1: "<="
	jnz label1
	nop

The following table contains all five cases and the code needed for them.

 meaning | x86 flags  | how to test 
---------+------------+--------------------
    >    | S=0 && Z=0 | test %11000000; jz
    <    | S=1        | test %10000000; jnz
    =    |        Z=1 | test %01000000; jnz
    >=   | S=0        | test %10000000; jz
    <=   | Z=1 || S=1 | test %11000000; jnz

Just like method 2, the flags conversion is only necessary for signed or unsigned compares, depending on which compares are less frequent.

This solution needs 2 or 3 instructions for the compare, and two for the conditional jump. This makes a total of 5 instructions. A memory read is needed in at most 50% of all cases.

Getting rid of the Table

The unsigned/signed conversion needs a memory access and is therefore slow. There are alternatives that consist of more instructions but do not need to access memory:

	and $0x7f, %ah
	test $1, %ah
	jz label1
	or $0x80, %ah
label1:

This method tests whether bit 0 is set, and if it is, bit 7 is set. Conditial jumps are bad.

	mov %ah, %al
	shl $7, %al
	and $0x80, %al
	and $0x7f, %ah
	or %al, %ah

This method shifts a copy of the value left by 7 and combines the results. No conditional jump, but 5 instructions instead of 4 (or 1).

Koitsu had the following idea:

	lea (%eax,%eax), %eax
	rcr $1, %ah

The "lea" shifts eax left by one, and the "rcr" shifts ah right again, placing the carry flag in bit 7 and restoring the position of the zero flag. Two instructions, no memory access. Perfect.

19 April 2004

Kheperix

I looked at the source code of Kheperix by Gwenole Beauchesne. Kheperix is a part of SheepShaver, and available as a separate project. Little documentation is available and even the source code doesn't make it easy to understand what exactly Kheperix does. It looks like it is a recompiler from PowerPC to x86. It is derived from QEMU, as it contains many references to it, but all frontends other than PowerPC and all backends other than x86 have been removed.

I do not know anything about the principles of this project, but since it is derived from QEMU, it might be a simple recompiler that doesn't create much more than threaded code.

Another good quotation

A lot of literature that deals with the endianness issue include the following quotation:

"It is computed that eleven Thousand Persons have, at several Times, suffered Death, rather than submit to break their Eggs at the smaller End. Many hundred large Volumes have been published upon this Controversy." - Jonathan Swift, Gulliver's Travels

It seems that the computer science terms little-endian and big-endian were actually taken from Gulliver's Travels.

Mach-O-Dump and the Interpreter

I don't have much PowerPC experience, I cannot write PowerPC assembly code, and it's not easy for me to read it. I really need to learn it. The implementation of the interpreter will help me a lot understanding how PowerPC assembly works.

"machodump" is a tool that decodes and dumps information on Mach-O files, and can load its sections into memory and disassemble the PowerPC code. It has been written by Axel Auweter and Tobias Bratfisch. As the disassembler wasn't complete, I added some instructions so that a simple "hello world" application would be correctly disassembled. On top of the disassembler, I built a very simple PowerPC interpreter, which can correctly interpret this "hello world" application.

Some PowerPC Facts

This is what I learnt from developing the interpreter so far:

The PowerPC has 32 general-purpose integer registers. r0 is special, if it is used as an operand, the value of zero is sometimes used instead of its value (depends on instruction). The other 31 registers are all alike. There is no explicit stack pointer (any register can be used, a push would be "stwu r1,-4(r1)"), but there is a link register that is not part of the 32 GP registers.

Register Usage: There is the PowerPC "EABI" standard about in what register to store the stack pointer, the function arguments and the return value. r1 is supposed to be used as the stack pointer. Parameters are passed in registers r3 to r7, all further parameters on the stack (there are exceptions when passing floating point data). r3 will also be used for the return value. r2 and r3 are temporary nonvolatile registers. r8 to r31 are volatile, i.e. they are supposed to be preserved between calls.

Signed/Unsigned Compare: On the x86, there is one instruction to compare, which sets/clears the overflow, carry, sign/negative and zero flags. Depending on the combination of the flags, conditional branch instructions can then decide whether the values were, equal, or one was above, below, above or equal or below or equal - two sets of these conditions are needed one for comparing signed and one for comparing unsigned values. So the x86 has one compare instruction and two sets of conditional jumps. The PowerPC has two compare instructions and one set of conditional jumps. Its flags directly indicate above/below and equal and are correct depending on what compare instruction has been used.

Intermediate Code - Yes or No?

Georg argues that the PowerPC, being a RISC CPU, has very few combined instructions, so it might not make sense to decompose PowerPC instructions into intermediate code. But on the other hand, there are very few PowerPC instructions that can be translated into a single x86 instruction. Three-register-instructions can only be converted into x86 two-register form if the destination and on of the source registers are the same.

PowerPC assembly looks a lot like microprogramming: An instruction is not the encoded form of some operation, but more like many independent bit fields that are responsible for independent parallel operations. A conditional branch for instance has additional bits that specify whether the CTR (count register) will be decremented and whether the branch will depend on its value.

Perhaps intermediate code makes sense after all.

Fast Instruction Decoding

The instruction decoder of the machodump disassembler currently uses switch statements for decoding PowerPC instructions. The upper 6 bits of every PowerPC instruction are the operation code (opcode). In many cases, no further decoding is necessary. But for some opcodes, there is another coded field - the disassembler uses another switch statement at this point. If the instruction supports modifying the condition register, another "if" is necessary.

A switch means either one or more "if" statements or a jump using a jump table. To decode an instruction, at least one switch, perhaps two, and perhaps even another "if" is necessary. For the interpreter and the decoder part of the recompiler, this is too slow.

One idea would be to collect statistics about the frequency of instructions and change the interpretation tree in a way so that more frequent instructions are decoded faster.

I showed the current decoder to Costis (a PowerPC/GameCube expert from California that I know from GameCube hacking), and he suggested to replace the switch statements with a table that includes a bit mask, a value and a function for every instruction. The decoder would have to search through the table. The C code of this decode would be more readable, but even slower. Then he suggested to use jump tables instead of the switch - this wouldn't make much difference, since a switch will most often be compiled into a jump table anyway. But we worked out another method: We can combine all bits that *can* encode the instruction into a ~16 bits value, i.e. the upper 6 bits will be shifted down etc. This code will also include the Rc bit, which is set if the result should modify the condition register, as the LSB.

We can simply decode the instruction by using a jump table then. This jump table will have 64 K entries if the encoded instruction is 16 bits. This should be okay. Those instructions that don't use the extended opcode field will have lots of duplicate entries in the jump table. Those who don't support the Rc bit will have at least two adjacent entries in it.

This algorithm can also be used for the disassembler. It would have strings like "stmw %rS, %d(%rA)" in its table. We would only need 2 or 3 C lines of duplicate code for the disassembler, the interpreter and the recompiler's decoder.

Mapping the Stack Pointer

The main purpose of the recompiler is supposed to be running code that was compiled by gcc for the Darwin operating system. Darwin conforms to the PowerPC EABI standard in concerns of register usage. I had a look at some GCC-compiled code, and I saw that the way stack frames were used is very weird.

00001f5c	mfspr	r0,lr		; get link register
00001f60	stmw	r30,0xfff8(r1)	; save r30 and r31 on stack
00001f64	stw	r0,0x8(r1)	; save r0 (link register) *above* stack
00001f68	stwu	r1,0xffb0(r1)	; decrease stack pointer by 0x30 and save it
00001f6c	or	r30,r1,r1	; move stack pointer to "base pointer" [...actual code in the function...]
00001f80	lwz	r1,0x0(r1)	; restore stack pointer
00001f84	lwz	r0,0x8(r1)	; get r0 (link register) from *above* stack
00001f88	mtspr	lr,r0		; restore link register
00001f8c	lmw	r30,0xfff8(r1)	; restore r30 and r31
00001f90	blr			; return

While the x86 has very efficient "push" (store with predecrement) and "pop" operations (load with postincrement) instructions, on the PowerPC, multiple values are usually stored in the stack at once, decrementing the stack pointer by a multiple of 4 afterwards. The stmw instruction stores 2 registers on [sp-8], but does not update the stack pointer. The stwu instruction later updates the stack pointer. The stw instruction even stores data above the stack (pointer), i.e. within the caller's stack region. It is a convention that, if you call a function, the third (32 bit) word on the stack (sp+8) may be overwritten. If a function is called, there must be three words on the stack- the uppermost is our own saved SP, and second can be a saved CR and the third one will be the LR saved by the called function. Apple's Book "Mach-O Runtime Architecture" (MachORuntime.pdf) has details on page 31.

What happens if an interrupt occurs and the assembly code has written data on the stack but not decreased the stack pointer yet? Nothing, because the interrupt handler does not use the user mode stack. What happens on a Unix signal? Signals can have their own signal stack, or just use the main stack. Probably the kernel's signal handler is aware of this convention, so that it is no problem.

And the most important question: Can we use the x86 "ESP" register for the PowerPC stack? Interrupts won't be a problem, as they have their own stack on the x86 as well. Signals could be a problem. This has to be found out.

How to Detect Self Modification

Yes, gcc-compiled code for Darwin is our priority, but what about self-modifying code? Sure, modified code has to be interpreted. How do we detect self-modification? There are two ways. If the modifying code uses constant values to address the code segment, the recompiler can detect it during translation. In all other cases, we can detect it at runtime: We just have to catch the segmentation fault and mark the block of code that was written to as "interpret next time". If the ordering of blocks has been optimized, a jump has to be placed at the beginning of the block that has been removed. If the block gets executed more than once, it can be recompiled again.

Given that we cannot detect self modification during translation every time (which has to be found out), the latter method will be the preferred method, as it will not slow down recompilation.

11 April 2004

Optimizing within Blocks

The idea of a block is that it is atomic, i.e. it is always executed from the first to the last execution, since all jump/branch/call instructions separate blocks. As we emulate user mode only and don't have to cope with problems like finding out the instruction pointer in the source code and the exact register contents when an interrupt occurs, we have the freedom to:

* perform out-of-order-execution

* remap registers

and possibly apply further optimization techniques. Basically all we have to do is make sure the block exits with the same register contents as the source block. We're free in everything else.

Register Mapping

A very simple approach for PowerPC to x86 register mapping would be to map the first 7 PowerPC registers to the 7 x86 "general purpose" registers (EAX, EBX, ECX, EDX, ESI, EDI, EBP - the PowerPC stack pointer would be the x86 stack pointer anyway). The other PowerPC registers would be mapped to fixed memory locations. As L1-cached memory is very fast, this might not even be a big performance problem.

But a static mapping of the registers is often not very efficient. For instance, the PowerPC architecture can use any of the 32 general purpose register for indexing and we cannot index with the memory-mapped registers on the x86. So we need to reserve a register that can be used as an index and use it whenever the originally mapped register cannot be used as an index register - which will be the case in ~81% of all cases (1 - (32-6)) in theory. Let's say we reserve EAX. Six registers left. with a direct mapping.

We might even have to reserve EDX, because EAX and EDX are used for multiplication and division, unless we want to push the contents of EDX every time we do a mul or div. As these don't occur very often, it might be more efficient to actually push and pop EDX. We'll have to find out. In the other case: Five registers left.

81% (if we have 6 registers left) sounds bad. We should always map registers that are used as indexes to non-memory (i.e. x86 registers).

Optimizing Full Functions

Mapping PowerPC registers that are used as indexes to x86 registers instead of memory is very important for performance. The problem is that init/index/increment/compare never occur in the same block. Init is always outside the loop and thus outside the block, for example. In addition, the branch-not-equal at the end of the loop ends the loop block, and the not-taken successor block needs to have the correct value of the index register... in the original index register! More complicated loops might even be split into many blocks.

In many cases, optimizing register allocation might only be possible if applied to a bunch of blocks. We could find out what code locations are the starting points of functions, and find leaf functions this way, i.e. functions that contain no calls. If these are really functions (no code that does not belong to the function branches/jumps/calls a label within it), it doesn't matter at all *how* things are done inside it, as long as the registers contain the same data at the end of the function. Basically these leaf functions are similar to blocks, and we can do additional optimizations, like remapping registers.

It is necessary to find out first what registers are changed within the function and what data does not survive the function (data flow analysis). All data that does not survive the function can be remapped to different registers.

If we have statistics about all leaf functions (what registers are considered as parameters and what registers are changed), we can optimize all functions in this way that only call leaf functions. These all these functions will be like leaf functions in turn and we can apply this method recursively.

I don't want to go into much detail at this point, but this approach is definitely worth considering.

Is a Call Supposed to Split a Block?

We could assume that a call always returns, so we can just go on translating until the next jump/branch and also translate the first block of the function, and then run the translated block until the end of the first block within the function. This would make blocks bigger and translation faster and optimization better.

But a block wouldn't be atomic any more, we can't remap register within a block, because the called function might rely on some registers.

Everything I have seen so far splits blocks on calls.

Object Oriented Languages

So far, I always only considered compiled C code. But a lot of applications have been compiled from C++ or Objective C, i.e.. they are object oriented. These languages generate additional typical structures in the assembly code. I assumed that function pointers are only used in switch statements (jump tables) and when the C code explicitly uses them, but object oriented languages might use them in other cases as well. The "Codeflow Analysis" presentation at the 19C3 said something about "vtables" that contain pointers to virtual functions - these might make translation more difficult. Let's see.

A Good Quotation

"8086 is the hunchback among processors" - halvar/Team Teso ("Codeflow Analysis" presentation)

This one is similar to the classic:

"The x86 isn't all that complex - it just doesn't make a lot of sense." - Mike Johnson, Leader of 80x86 Design at AMD, Microprocessor Report (1994)

There might be nice place to fit this in the thesis.

Data Flow Analysis

Data flow analysis is the technique to analyze the way of data from a starrting point to the point the data is overwritten. As data can be copied, there may be multiple paths.

The "Codeflow Analysis" presentation deals with data flow analysis, and it might be important for advanced optimization techniques.

Hot Paths

I read parts of the paper "Optimising Hot Paths in a Dynamic Binary Translator" that describes how UQBT/UQDBT finds and optimizes hot paths within recompiled code, i.e. code flow paths that are executed more frequently.

In order to find out a hot path, it is not only necessary to find out the most frequently used blocks (nodes), but especially the most frequently used transitions between blocks (edges). Profiling means that code that increments counters and compares them with a threshold value is inserted into the code, and sampling means that interrupt code checks whether in the range of which block the instruction pointer is. Profiling is more exact, but is more expensive (UQDBT: 5-20% overhead). Sampling can't be used in our case anyway, since we're user mode only.

I didn't read all the details in this paper, but the basic idea is that as soon as we know a hot path, we can put them next to each other to optimize the cache (instead of the methods described in "Where to Place Translated Blocks in Memory / Defragmentation" above), and flip branches so that the most frequent way is the fall-through.

Optimization Principles & Components of a Recompiler

Sometimes the basic principles are so obvious that nobody cares to write them down. Nevertheless it is important to put them in words.

Optimization only makes sense for parts that are frequently executed. Unoptimized code is expensive, and optimizations are expensive. The faster execution time of optimized code must amortize the optimization costs.

Therefore it's not only important to have good optimization code, but also to have good code that find out what to optimize and how intensively.

The main parts of the recompiler will be the block manager, the hot-block logic and the translator/optimizer.

Intermediate Language or no Intermediate Language

Compiling one language directly into another one, using a lot of specialized code that deals with the two languages sounds a lot like compiler construction in the 1950s. Systems like UQBT use description languages for the assembly languages, and the actual recompiler engine is very adaptive.

I don't think we need to be that flexible, but the dragon book is definitely worth a read.

Persistence of Statistics

Of course we will not discard out translated blocks when a program terminates, we can use the work already done for the next time the program is run. It will start off a lot faster. This implies that we also save the statistics we have made and continue making them based on the existing data, which makes a lot of sense, as running a program again is nothing else than jumping to the first instruction again.

There is a lot of code in every program that only gets executed once. If we continue with the existing statistical data, this code will be executed again, and the recompiler will translate these blocks as they have been executed more than once. There are two possibilites:

* only translate blocks that have been executed more than once per program run.

* just ignore this "problem" and translate all this code.

I'd go with the second idea. This makes the second time a program is run slower, but if a program gets executed twice, it will get executed more often, and we don't want to interpret all the initialization code every time. Perhaps there is a solution between these ideas, like, just translating parts of the non-loop code every time the application is run, or mark them as to be translated after they have been interpreted twice, and translate them "offline" while the machine is idle. I think FX!32 did something like this.

Inline Functions

Why not just paste the code of functions at the place they are called? Of course, again, statistics need to be made, as too much inlining will hurt the code cache. But we can easily count edges from a call instruction to a function.

8 April 2004

Code Flow Analysis Video

I watched parts of "19C3-392-codeflow-analyse.mp4" and the accompanying slides. The presentation has been done at the "19C3" Chaos Communication Congress 2002 in Berlin. It is about code flow and data flow analysis in order to programmatically "understand" code and patch it or extract information from it.

What they presented resembled a lot static recompilation, i.e. they split the whole code at every jump, branch or call and put the blocks into a directed graph. Then they used graph algorithms on it.

What is interesting is that in tight loops (= the code that counts), a clock consists of only about 5 (x86) instructions in average.

Do we Need Intermediate Code

I talked to Georg Acher, and he raised the very interesting question whether we really need intermediate code. Being a RISC CPU, the PowerPC should not have (many) instructions that do more than one thing. Therefore it might not be necessary to decompose PowerPC instructions. Intermediate code would be just the same as the original code, but written a bit differently. "What can you do with intermediate code that you can't do with PowerPC assembly?"

Both ARMphetamine and tARMac use intermediate code when translating from ARM to x86, though. Their point is that it is necessary to decompose some instructions. Maybe it makes more sense for the ARM, which supports conditional execution of every instruction, for example. I'll have to read their reasoning again.

I'll also have to look at PowerPC assembly closely and compare it to x86 assembly. Is translation from PPC to x86 1->1 or n->1 most times - or is it also 1->n sometimes?

Anyway, without intermediate code, adding different frontends and backends will be a lot harder. But translation speed will improve.

Scientific Resources

What I have been looking at so far were mainly hobbyist open source projects. Georg suggested to browse the scientific library and pointed me to www.citeseer.com. There are 97 scientific documents that contain the term "binary translation" - a lot to look at. He also pointed me to the "Workshop on Binary Translation" (WBT). It took place four times, from 1999 to 2002 - not in 2003. The "programm committee" sounds very interesting. It contains people from IBM, Sun, Compaq, Transmeta, Intel, Microsoft as well as some universities. It always took place on one day in the morning only, so it's not that much content. In average, there were about five presentations per year.

Still this is a lot to read.

I also found http://www.program-transformation.org/, a Wiki on program transformation. It contains information on and links to decompilation.

Ways to Acquire Knowledge

There is loads of resources on recompilation available, and it takes a long time to read and understand it all - but there is another approach to learn it: Develop parts of the knowledge yourself. This way, it can be easier to understand the advantages and disadvantages of something, and to go new ways because you are unbiased.

When we did this on encryption, we developed a primitive pseudo number generator, and when we did it on compression, we developed something similar but inferior to Huffman coding. Both of these solutions better fit into the foreword of a book on the respective topic than into the actual book contents, but it made understanding the book a lot easier. If you understood half of the book already, you can apply this method again, and perhaps develop parts of what the rest of the book says yourself - or even go new ways.

When and how much to Translate

Christian Hessmann is also working on some sort of a recompiler as a diploma thesis, also supervised by Georg. He is supposed to develop an assembly language that can be easily translated into any RISC assembly - a program (like an audio or video codec in an embedded system) written in this language can be distributed and, using a simple recompiler, can run on many devices with different CPUs.

Hessi and me did some brainstorming on when to translate blocks of code and what (how much) to translate. The idea of a "hot spot" recompiler is that it uses the interpreter first and translates code when it is executed for the second time. The tARMac paper illustrates this nicely: Natively executing an instruction takes 1 cycle. Interpreting an instruction takes 10 and recompiling it takes 50 instructions. If you use an interpreter only, the average time that one instruction takes will always be 10 cycles. If you use a dynamic recompiler that compiles everything on execution, It takes 50 cycles the first time and 2 cycles from then on. If the code gets executed twice, the average is (50+2)/2 = 26. For three times, it is (50+2+2)/3 = 18 and so on. If the recompiler interprets it on first execution and recompiles it the second time, the numbers are 10, 30 and 21 (instead of 50, 26, 18). This is worse than the second case if the instruction is executed more than once. But as 90% of the time are spent in 10% of the code, a lot of code is only executed once. The HotSpot method leads to better results on typical code.

We discussed about what to translate when code is supposed to be executed that has already been interpreted before. Sooner or later, there will be a conditional branch instruction - we could translate the path last taken, or just put a jump into our own code there and continue translation. The more code we translate in one go, the less jumps back to the recompiler are necessary, and the faster the execution is in total. But as translation is very expensive, it is probably no good idea go guess a branch. We should only translate up to the branch, the single jump is not worth it.

Another question is whether we should translate up to the conditional branch, or just up to the next jump/call/branch. Translating more than one block at a time, if we know that they will all be executed, (again) has the advantage of less jumps back to our code. I see no disadvantage.

Jumping from one Block to the Next One

Hessi and me discussed how to connect blocks? If we translated a block up to a conditional branch, we put a jump to our code at the end. Our code evaluates the branch instruction to decide which route to take, and translates the next block(s), at the end of which, again, control goes back to our code. After one of the possible successors of a block that ends with a branch has also been translated, there is no reason to jump into our code when the branch goes into the translated block. If both paths lead into translated blocks, we can put a real conditional branch instruction at the end of the block, and a jump to the alternative after it. After some time, every block will jump to another block without ever jumping back to our code.

Splitting Blocks

Having some more thoughts (and being inspired of the 19C3 presentation again), a problem came up: What happens in this case:

	instr. #1
label1:
	instr. #2
	instr. #3
	instr. #4
	beq label3
	instr. #5
	instr. #6
	instr. #7
	jmp label1
label3:

There is a block from the beginning up to the "beq". We translate it, execute it, and control returns. Then we translate instr. #5 up to the "jmp" - which jumps in the middle of the block before. We don't have the address of label1 in the translated code, as we didn't know the instruction would ever be addressed.

In some cases, the interpreter might already have logged the information. So the interpreter must log all destinations of jumps/branches/calls, and the translator must check whether there's a label within a block and then split the block.

In other cases, the information is not available when translating the block. We can never be sure, as one single line of code that almost never gets executed could jump in the middle of some block when the program had been running for a long time. (Static recompilation would find this out, though.) So we need to find out the address of "label1" somehow. There are three possibilities:

* Discard the original translation, split the block in two and translate them again.

* maintain a table of source code address -> destination code address mappings for every single instruction, so it can be looked up easily. Pro: No block splitting. Con: No optimizations possible that combine two source instructions. A lot of memory and processing time needed (either for the optimized sorted data structure, or for lookup)

* same as above, but table does not need to be complete, as they contain no data when instructions have been combined. Pro: No block splitting for most cases. All optimizations possible Con: Still slow.

I cannot imagine that the mapping table can be efficient. Translating again sounds a lot better. The first code can be placed at the same location as the original block, as it cannot be larger, so no references break. I'm looking forward to seeing how this is implemented in existing solutions. Perhaps I'm all wrong.

Where to Place Translated Blocks in Memory / Defragmentation

A new block has to be translated, where do we put the destination code? Do we just put it block after the last one? Or do we spread them in destination RAM and keep the original order?

Method 1: If we place every block after the last one, there will be many cases in which control only skips a few bytes to the next block - as many blocks that are executed one after the other are in memory in the same way. If the code contains no calls at all, the destination code will have the same layout as the source code. If it does, after a block with a call, the beginning of the subroutine will be inserted right after it, as well as the complete path through the subroutine that has been taken. The block after the caller's block follows. If codeflow takes another path through the subroutine when it is called again, it will be inserted after the block that contains the call that caused this. (No block can directly run into its successor (only valid for blocks that end with conditional branches anyway), because there needs to be some room at the end of every block for our code to patch in jumps and branches when we added translated blocks. Well, we can pad with nops...)

This might not be true for all cases, but in general, code cache should be quite thankful.

Method 2: We could also mimic the layout of the original code, by multiplying the offset of the source block in the original code with a constant factor (like 2, for instance), and use it as the offset for the destination code in the destination code buffer. If a block in the destination language can in practice never be, for instance, twice as big as the same block in the source language, this should be no problem. The advantage is that the layout stays the same and the code cache can use pretty much the same strategy as on the original processor - it just needs to be bigger.

Another advantage of using the original layout is that defragmentation will be easier. After the program has been running for some time, and new blocks don't have to be translated all the time any more, (or perhaps even earlier, regularly) it makes sense to defragment the destination code. If we use method 1, the blocks are next to each other, but they don't run into each other. By shifting them closer together, we avoid unnecessary nops and jumps. By changing the order (we could make statistics which block has been executed how often by jumping back to our code after each block and counting), we could optimize the usage of the code cache.

If we use method 2, we could also just strip the padding, and get code that looks a lot like the original code - but without the code that never got reached so far. This method can only be applied once, as if new translated blocks are added, this method doesn't work any more.

The two methods aren't that different at all - we can also put together the code in its original order if we used method 1, because we have a table that maps the beginning of every (ever used) source block into a destination block. Since the second method only works once, and (not counting possible caching differences of fragmented code) it has no advantages, it should be the best way to put one block after another and do defragmentation once in a while - either using statistics, or using the original code as a template.

06 April 2004

Important Existing Recompilers

I collected information about existing recompilers on the internet and put together information about them:

Transmeta Crusoe CPU Family

Transmeta CPUs are VLIW machines. They start off in their native mode and load a recompiler from the system ROM into reserved RAM and continue executing recompiled x86 code. The Transmeta software seems to be very efficient, but the target CPU has been optimized for this, so their software is not a very typical recompiler.

DR Emulator and Virtual PC for Mac

Eric Traut (www.traut.com) was the main engineer of Apple's DR Emulator (DR = dynamic recompiler) used in Mac OS 8 (68K to PowerPC recompiler), as well as of Virtual PC for Macintosh (x86 to PowerPC recompiler). He has written about his work in BYTE magazine, and he's the autor of Apple's DR Emulator technote.

Not much detail is known about DR Emulator and Virtual PC, as both are closed source. Patent US 5,790,825 might be worth a read, though. My own experiences with VirtualPC for Mac are that a real PC is about 4 times as fast as an emulated PC in VirtualPC, if PC and Mac have the same clock speed (whatever that has to say).

Open Source Recompilers PPC -> x86

I found four (free) projects that work on a PowerPC emulator and that are in active development: QEMU, SheepShaver, Dolphin, Dolwin.

Dolphin and Dolwin are Nintendo GameCube emulators. The GameCube contains a G3 class PowerPC, and its system software never really uses the MMU, so the PowerPC emulators contained in these projects are pretty much user mode only.

Unfortunately, Dolphin is not Open Source, and Dolwin does not seem to include a recompiler, just an interpreting emulator. Yet.

QEMU by Fabrice Bellard is a portable anything-to-anything recompiler, and it can do PowerPC to x86. It trades portability against speed: As I understood the technical documentation, GCC compiles functions that implement the behavior of assembly instruction of the source language into target code, and the recompiler puts these together at runtime. Just one step more than a compiler that replaces every instruction with a call. Benchmarks on the QEMU site show that the emulated CPU reaches about 1/10th of the machine's original speed. QEMU's principles do not seem to fit with my ideas (see below).

SheepShaver, a PowerMac emulator, includes "Kheperix" by Gwenole Beauchesne - I don't even know yet what Kheperix is, because of the lack of documentation, and because I didn't have time to look into it closely. Either it is a benchmark and regression suite for PowerPC emulators, or a PowerPC emulator or both. I found references to Fabrice Bellard in the code. I can't say much about it yet; I'll have to read some more source code.

Both QEMU and SheepShaver are GPLed.

Other Open Source Recompilers

ARMphetamine and tARMac are both ARM to x86 recompilers that have been university projects. Both are Open Source and have been written at about the same time (2000/2001) and independently. ARMphetamine runs Linux/ARM binaries in Linux/x86 and tARMac is a plug in for the "Red Squirrel" Acron RISC emulator.

As both have been university projects, there is extensive documentation available about the technical details.

Both projects are quite similar to my project, as they do RISC -> CISC recompilation and focus on user mode. They translate ARM code into intermediate code (called "phetacode" resp. "armlets") before converting it into x86 code. Both projects ran out of time and could only implement basic recompilation without much optimization. The work of these projects will hopefully simplify my own start, so that I can focus on more advanced techniques and on optimization.

There are some other Open Source recompilers out there. A long ago, I put together a list of Open Source home computer/game console emulators that contain dynamic recompilation code (all target the x86):

* UAE (Amiga), Basilisk II JIT (Mac) and Hatari (Atari): all contain the same (UAE's) 68K recompiler

* MAME (MIPS3)

* Pcsx (Playstation, MIPS3)

* Fake64 (N64, MIPS4): includes some dynarec documentation

* Mupen64 (N64, MIPS4)

* SixtyForce (N64, MIPS4): target PowerPC!

* 1964 (N64, MIPS4)

* blade64 ("WIP45"), tr64 (true reality): "TrueReality - N64/cpu.c"

* daedalus (N64, MIPS)

* project 64 (RSP compiler)

* project unreality N64

* ultra64 N64

* Ultra HLE?

* Titan (Saturn, SH2) - only binary?

* AdriPSX ILE (PSX)

* FPSE (PSX)

* PCSX2 (PSX2)

The following Open Source emulators do not contain a recompiler as of May 2003: WinFellow (Amiga, 68000), Turbo68k (68000), STonX (Atari, 68000), Generator (Genesis 68000), DGen (Genesis 68000), True Reality? (Nintendo, MIPS4), MESS various, psmac PSX, nincest 64 N64, pc64s N64, nsx2 (PSX2), mappyvm (GBA), VisualBoy Advance (GBA)

Static recompilation is very rare. "corn" is the only emulator I found that uses it. Too bad it is not Open Source and has been discontinued.

Mailing Lists

Graham Toal is maintaining a mailing list on yahoo.com about static recompilation and has also written a 20 pages introduction to this topic. I'll need to have a closer look on this. There is another mailing list on yahoo.com, on dynamic recompilation.

Motivation for RISC -> CISC Recompilers

When I read about the x86 architecture in the ARMphetamine paper, an interesting thought came to my mind: Both Intel and AMD will stick with a CISC architecture for their 64 Bit desktop chips (AMD64 architecture). This has two implications:

1. It makes sense to do research on recompilers that target CISC, as CISC CPUs will stay with us for a long time.

2. The main reason why AMD and Intel stay with CISC on the desktop is binary compatibility, which is easy to achieve if you don't switch architectures. Switching architectures on the other hand can be done if you have fast recompilers.

What Level of Recompilation?

System Emulation

Most existing recompilers are used in emulators of a whole machine. VirtualPC, UAE and all gaming console emulators do not only emulate user mode code, but also the operating system. In case of VirtualPC, this includes very complex techniques to emulate virtual memory. Other emulators might not have to support virtual memory, but all system emulators have to handle IRQs and mode switches.

User Mode Emulation

My recompiler is supposed to emulate user mode only, i.e. the system software is provided by some other means, and we only want to run real user mode applications that interface with a kernel that runs natively on the machine. This doesn't only mean that we may omit complicated parts like virtual memory support and the emulation of privileged instructions - it can make the whole approach different.

Problems when doing System Emulation

A recompiler that has to emulate a complete system needs to do the emulation very closely to the original CPU. If a scheduler/timer interrupts occurs during a calculation in user mode, the operating system wants to save the contents of all registers in memory and possibly switch to another thread. When the original thread gets switched to later again, the operating system loads the contents of all registers from memory again and continues execution at the same point.

* The recompiler must make sure that interrupts can only occur between two source assembly instructions: If one source instruction got translated into several destination assembly instructions, the emulated CPU is in a state that an original CPU can never be with this code.

* The recompiler must find out the original value of the Program Counter when the interrupt occured; the emulated one is not enough.

* The recompiler cannot use a very dynamic register allocation scheme, since it must know the contents of the complete register set at all times.

Advantages of User Mode Emulation

User mode emulation can take place at a much higher level. The recompiler must only create program in the destination language that is equivalent to the source language. It is never important which register holds what value. The recompiler can even change stack allocation (use registers instead of local variables on stack, if the source text is not very optimized, or use local variables on stack instead of registers if the destination machine runs out of registers).

This is like converting the assembly code back to C and compiling it into another assembly language. If the source assembly text has been compiled from C without any compiler optimizations, this approach can be quite simple. In practice, this means a lot of work.

My idea at the moment is to write a traditional recompiler that uses a quite low-level intermediate language, like the ones used in ARMphetamine and tARMac. It should be possible to make the intermediate language more high-level and incorporate more optimizations into this system later.

Caching and Static Recompilation

Static Recompilation is faster than dynamic recompilation. All code gets translated in the very beginning, and only once. The application looks like a native application then, and can reach almost native speed, if the recompiler is a good one. No translation and no emulation is ever necessary.

Static Recompilation has many compatibility problems. This might not be true for assembly code that has been compiled from C, though, as the use of jump tables and function pointers (these are examples of what is most problematic) always works the same way, and the recompiler can detect it.

It might be possible to extend the recompiler to be a static one. Caching all translated code is a good idea anyway, and perhaps it is possible to use the recompiler to do static recompilation for code that allows it, or at least complete the translation of a program that has already been partially translated.

How to Make the Recompiler Portable

A recompiler from PowerPC to x86 makes sense, as there are many uses to it. But it also makes sense to keep it very generic, so that it can be made to translate other assembly languages as a compile time option of the recompiler.

Implementing new source and target assembly languages would usually be done by replacing some files in the source code of the recompiler. A new CPU would be implemented by code. But it could also be implemented using tables. Any assembly language can be described using tables: Names and width of registers, instruction decoding, mapping from assembly instructions to intermediate code. These tables would be enough to describe both a source and a destination language.

The advantage of a recompiler that takes tables to understand new languages is of course that it is a lot easier to add new languages. The disadvantage is that the recompiler code might be very bloated and inefficient.

It might be a good idea to "hardcode" PowerPC and x86 in the beginning (but not to be too specific), and try to replace it with tables later, as long as the translation time doesn't grow too much.

Principles of the Recompiler

My recompiler should be

* fast. user mode. high level. caching. static.

* flexible (source and target machine can be replaced).

In this order.