Twobit Pass 2: customization

Customizing Pass 2

Most of the optimizations in Pass 2 are independent of the target architecture. Those that are sensitive to the code generator or target are controlled by parameters, customizable policies, and compiler switches:

Register set
Inlining of primitive operations
Inlining of local procedures
Lambda lifting
Interprocedural register targeting
Compiler switches

The register set

The number of general registers that Twobit allocates for arguments and temporaries is a parameter. The assembler maps Twobit's general registers onto some combination of hardware registers and memory locations.

The front end of Twobit currently has no knowledge of specialized representations or registers. In particular, Twobit boxes all numbers and uses subroutine calls for all floating point calculations.

Inlining of primitive operations

The IEEE standard for Scheme allows any procedure to be redefined. Since most operations are procedures in Scheme, this prevents the compiler from generating inline code for most operations.

A compiler switch allows the programmer to promise that none of the usual primitive procedures will be redefined, which allows Twobit to compile them specially. When this switch is true:

Pass 1 standardizes the number of arguments passed to procedures such as + and performs some constant folding.
Pass 2 performs certain local source transformations that could not otherwise be performed.
Pass 4 generates inline code for calls to primitive procedures.

A table of primitive operations is supplied as a parameter. For each primitive operation, this table specifies:

The name of the operation.
Its arity.
Its MacScheme machine assembler mnemonic or encoding.
A predicate that recognizes constant operands for which an immediate form of the operation is implemented by the assembler.

Except for the assembler, which is target-dependent, Twobit has no further knowledge of primitive data types and operations.

Inlining of local procedures

Incremental lambda lifting

It is always possible to lift all known local procedures to the outermost lambda expression. This is not always desirable, however, because the formal parameters that must be added represent extra registers that must be saved across a non-tail-recursive call. This is a target-dependent tradeoff, which should be evaluated by the interprocedural register allocator.

At each stage of incremental lambda lifting, after the flow equations have been solved but before the source code is transformed, Twobit calls POLICY:LIFT?, a predicate that determines whether lifting will be performed. POLICY:LIFT? is given three arguments to examine:

The lambda expression that contains the internal definitions to be lifted.
The enclosing lambda expression to which the internal definitions will be lifted.
The solutions to the flow equations, which list the formal parameters that will be added if lifting is performed.

Twobit lifts all or none, because all can share the same closure.

Many factors could be taken into account when deciding whether to lift a group of known local procedures. Here are just a few:

If the body of the source lambda expression contains a lambda expression for which a closure must be created anyway, then its internal definitions could share that closure without lifting.
If the destination lambda expression already contains internal definitions, then adding more will not add to the cost of creating a closure at the destination.
If lifting would add too many new formal parameters, then it should not be done.
If a procedure contains non-tail-recursive calls or a large number of local variables, then adding new local variables is likely to increase the cost of register spilling within that procedure.
If many procedures are available for lifting, then it is likely that the programmer defined them for use in a computation whose cost will dominate the cost of creating a closure.

For example, Twobit currently transforms

  (define (sieve n)
    (define (new-generator next-prime prime)
      (let ((psquare (* prime prime)))
        (lambda ()
          (define (loop t)
            (if (or (< t psquare) (not (zero? (remainder t prime))))
                t
                (loop (next-prime))))
          (loop (next-prime)))))
    ...)

into

  ((lambda ()
     (define .sieve_3
       (lambda (.n_5)
         ...))
     (define .new-generator_6
       (lambda (.next-prime_10 .prime_10)
         ((lambda (.psquare_11)
            (define .loop_13
              (lambda (.t_15)
                (if
                 ((lambda (.T93_16)
                    (if .T93_16
                        .T93_16
                        (not (zero? (remainder .t_15 .prime_10)))))
                  (< .t_15 .psquare_11))
                 .t_15
                 (.loop_13 (.next-prime_10)))))
            (lambda () (.loop_13 (.next-prime_10))))
          (* .prime_10 .prime_10))))
     ...))

Here the known local procedure .loop_13 has been lifted out of only one lambda expression. Lifting to the enclosing lambda expression would add .psquare_11 as another argument, and there would be no benefit at all: A closure must be created anyway for the body of the lambda expression that defines .loop_13, and .loop_13 can share that closure at no cost.

The policy used for incremental lambda lifting does not appear to be critical in practice. It is usually advantageous to lift all known local procedures to the outermost level, as is done by conventional lambda lifting.

Interprocedural register targeting

Since the compiler can by definition locate all calls to a known local procedure, it can reorder the procedure's formal parameters and rewrite each call to match the new order. In Twobit, the formal parameters are just names for the registers that are live at entry to the procedure, so reordering the parameters amounts to interprocedural register targeting.

Register targeting is easier than interprocedural register allocation, because register allocation involves decisions concerning which values should be kept in registers and how long they should be kept there. Twobit simply assumes that the arguments to known procedures will be kept in registers for the duration of each call, so all there is to do is to select a particular register for each argument. The goal of register targeting is to minimize register shuffling when one procedure calls another.

Twobit performs interprocedural register targeting by reordering the formal parameters that it adds to known local procedures during lambda lifting. The ordering is chosen by by a customizable policy procedure.

If the set A of formal parameters to be added to f is equal to the set of formal parameters for the lambda expression L that defines f, then Twobit adds the new parameters to the beginning of the parameter list for f, after sorting them so that, for a call to f from the body of L, these parameters will already reside in the correct registers.

If A is a proper subset of the formal parameters of L, and this is the first lifting of f, then the parameters in A are allocated to the positions they occupy in the formal parameter list of L, and the original formal parameters of f are filled in around them. If a formal parameter x of L is not an element of A but occurs as an actual parameter in some call to f, then the corresponding formal parameter of f should be placed in the position left vacant by x. These heuristics reduce register shuffling for calls between mutually recursive procedures, including loop nests.

Compiler switches

Several switches affect Twobit's optimizations:

(integrate-usual-procedures #t) tells Twobit to assume that Scheme's usual procedures will not be redefined, which allows Twobit to generate inline code for them.
(benchmark-mode #t) tells Twobit to compile the (define (f ...) ...) syntax for a top-level definition as if it were
```
        (define f
          (let ()
            (define (f ...) ...)
            f))
```
This compiler switch has no effect on the (define f ...) syntax for top-level definitions, nor does it affect the interpretation of internal definitions.
(local-optimizations #t) activates Twobit's Pass 5.

These switches are true by default.

The first two switches are taken from MacScheme. They allow a programmer to assert that preserving the precise semantics of Scheme when certain top-level procedures are dynamically redefined is less important that generating efficient code.

The benefits of benchmark-mode are explained by

Andrew W Appel. Loop headers in lambda-calculus or CPS. Princeton University CS-TR-460-94, June 1994.