Larceny Note #6: Larceny on the SPARC

Lars T Hansen / May 5, 1998

1. The SPARC assembler
1.1. Structure of the SPARC assembler
1.2. Native code generation
1.3. The SPARC native assembler
1.4. Peephole optimization

2. The Programming model
2.1. Registers
2.2. Scheme-to-Scheme calling convention
2.3. Scheme-to-Millicode calling convention

3. The Instruction Cache

0. The SPARC architecture

The SPARC architecture is a RISC architecture descended from the Berkeley RISC projects. It distinguishes itself from most other RISC architectures in its use of register windows and its support for tagged arithmetic. The primary programmer's reference for the SPARC is The SPARC Architecture Manual, published by Prentice-Hall. The most recent edition covers Version 9; Larceny uses only instructions documented in the Version 8 manual.

1. The SPARC assembler

The compiler/assembler interface is parameterized by definitions in Compiler/sparc.imp.sch, as explained in the Larceny Note on adding primitives.

1.1. Structure of the SPARC assembler

The SPARC assembler has four parts:

Assembly table
Code generator
Native assembler
Peephole optimizer

The assembly table is the target-dependent part of Twobit's pass 5 (assembly). The code for this part is in Asm/Sparc/pass5p2.sch; this file defines a table of assembler procedures which is then used by the target-independent part of pass 5 to perform assembly. Each assembler procedure in the table is called with an assembly structure and a MacScheme instruction.

The assembler procedure picks the instruction apart and calls procedures in the code generator to emit machine code. Those procedures, found largely in Asm/Sparc/gen-msi.sch and Asm/Sparc/gen-prim.sch, decide on the instructions to generate, and call procedures in the native assembler. The native assembler creates the machine code for each instruction and performs certain local optimizations on the instruction stream such as branch delay slot filling and instruction scheduling.

The peephole optimizer is invoked on the instruction stream by the target-independent part of pass5, and coalesces common patterns of MacScheme instructions into instructions specialized for the SPARC.

1.2. The Code Generator

      taddcc  %R1, %R2, %R1
      bvc,a   L1
      nop
      sub     %R1, %R2, %R1
      ... call millicode ...
  L1:

The instruction at L1 will usually be lifted into the branch delay slot. If the destination of the addition is neither R1 or R2, or if R1 and R2 are the same register, then a temporary must be used in order to recover, and three instructions are necessary (FIXME: this has been improved):

      taddcc  %R1, %R2, %TMP0
      bvc,a   L1
      mov     %TMP0, %R3
      ... call millicode ...
  L1:

Larceny never uses the trapping version of the tagged add instruction.

Conditional move

The following two instructions conditionally move the value VALUE into register R1; in this case, the condition is "less than".

      ble,a   .+8
      mov     VALUE, %R1

To generate a boolean object, the following three-instruction sequence is used:

      mov     FALSE_CONST, %R1
      ble,a   .+8
      mov     TRUE_CONST, %R1

On SPARC version 9, a conditional move instruction can be used. Larceny currently uses no SPARC-9 specific instructions.

Generate address of instruction

It is possible to compute the virtual address of any instruction with a call instruction that branches ahead two instructions:

  L0: call    .+8
      nop

after which register o7 contains the address of L0. The branch delay slot can be filled in the usual manner or it can be used to adjust the value of o7 if L0 is not at the call instruction.

It has been suggested that this kind of code may cause the processor to slow down, and an alternative way of computing the value of L0 instruction using the following code (in generated code only; a different method could be used in code that won't move):

  L0: ld     [ %REG0 + 3 ], %TMP0   ! get code vector
      sub    %TMP0, 9, %TMP0        ! 1 for tag, 8 for address of _add_
      add    %TMP0, L0, %o7

This code depends on the assembler resolving L0 as the relative address within the code vector (which is a feature). The memory reference is undesirable, but may be scheduled for earlier execution; the control dependence of the call instruction is gone.

Millicode call

See the section on millicode calling, below.

Generic comparison

FIXME.

Exception handling

Exception handling in generated code is not done all that consistently, and needs to be cleaned up.

1.3. The SPARC native assembler

Overview

The native assembler, like other assemblers, allows the rest of the assembler to deal only in symbolic machine instructions (in our case, they're more like "procedural machine instructions"). The native assembler provides one procedure for each SPARC instruction it knows how to assemble; for example, there are procedures called sparc.bl.a (branch if less with annul bit set), sparc.ornrcc (logical or with complement from register and set condition codes), and sparc.andi (logical and with immediate). The mnemonics chosen are close to the standard SPARC mnemonics; the only difference is that the assembly procedures require the specification of whether the (typically) second operand is a register or an immediate.

All the mnemonics are defined at the end of Asm/Sparc/sparcasm.sch. There are also some abstractions there; for example, on the SPARC a move is expressed as a logical or with %g0, but this low-level implementation obscures the meaning of the instruction; therefore, a sparc.move instruction is provided.

Additionally, there is a sparc.label pseudo-operation, used for definining branch targets, and a sparc.slot instruction, which can be used in the branch delay slot of annulled branches.

Branch target resolution

The target-independent part of pass5 provides an ad-hoc (and, as of v0.33, rather crufty) failure mechanism that allows the assembler to deal with span-dependent branches in a crude way. Initially, the assembler assumes that all branches are "short", that is, the target is within +/-4KB from the branch instruction. (The SPARC branch instructions have a 24-bit offset, but the jmpl instruction is limited to the +/-4KB range. Since the jmpl instruction is used when generating code for the MacScheme jump instruction, some uses of jmpl occasionally need a larger offset than 4KB.) If, during target resolution, a too-large offset is discovered, then the failure mechanism is invoked. Assembly is then retried with long offsets temporarily enabled.

Optimizations

In addition to creating the machine code for instructions and resolving branch targets and other expressions, the native assembler performs a limited amount of optimization. Currently, it fills branch delay slots using a simple algorithm: if the branch target is an instruction that can go in a delay slot, then insert the target instruction in the delay slot and increment the branch offset by 4.

A more sophisticated delay slot filling algorithm is desirable, because the above algorithm does not decrease code size -- it only makes a taken branch slightly faster. A better algorithm would work at least on basic blocks and try to move instructions across branches, into the slot following the branch, whenever possible. It would not be hard to implement this efficiently.

The native assembler could also perform instruction scheduling, but that is unlikely to pay off much until we get rid of most dynamic type checks.

1.4. Peephole optimization

The SPARC peephole optimizer (Asm/Sparc/peepopt.sch) folds sequences of MacScheme Machine instructions into faster forms that are known only to the SPARC assembler. The optimizations currently performed are: copy avoidance, boolean evaluation for control, allocation optimization, and nontail-call optimization.

The peephole optimizer is currently a simple hand-coded decision tree. This is fast (certainly fast enough), but makes maintenance harder than it could be with a more pattern-based approach. Clinger has written a generic pattern-based peephole optimizer and this should be adapted to be used with Larceny.

Furthermore, a number of the names introduced by the SPARC assembler are unintelligible for no good reason. This should be cleaned up.

Copy avoidance

This class of optimizations seeks to avoid register-to-register moves that are present in the output from Twobit because the intermediate code (MacScheme Assembly Language) is a 2-address code, but that are not necessary on a 3-address machine such as the SPARC. Most of the optimizations performed by the peeophole optimizer fall into this class.

For example, a sequence such as

      reg     1
      op2imm  +, 7
      setreg  2

that adds 7 to the value in R1 and stores it into R2 is peephole-optimized into the single instruction (FIXME: new names now)

      dresop2imm  +,1,7,2

This optimization is possible because the semantics of setreg specify that it destroys RESULT; hence, the use of RESULT can be avoided altogether in many cases.

The peephole optimizer has a table of primitives that it applies this optimization to. At the time of writing, these primitives are car, cdr, CELL-REF, +, -, eq?, and cons. Others could easily be introduced.

Boolean evaluation for control

Twobit produces code like the following when a primitive that produces a boolean result is used in the test part of a conditional expression:

      reg     1
      op1     null?
      branchf Ln

Here, null? performs a test on the object in RESULT and sets RESULT to either #t or #f. The branchf instruction then compares RESULT with #f to determine whether to branch or not. The complete sequence of machine instructions looks something like this:

      mov   %r1, %result
      cmp   %result, 10      ! 10 = empty list
      mov   2, %result       ! 2 = true
      bne,a .+8
      mov   6, %result       ! 6 = false
      cmp   %result, 6
      be    Ln
      nop

In this case, the creation of a boolean object can be avoided; instead, code can be generated that checks the condition bits directly. The above sequence of MacScheme instructions is collapsed into

      optbreg1 null?, 1, Ln

for which the generated code is the much simpler

      cmp   %r1, 10
      bne   Ln
      nop

Again, this optimization is possible because the semantics of branchf is to destroy RESULT.

The primitives for which this optimization is performed are currently null?, pair?, zero?, eq?, <, <=, >, >=, =, char=?, char<, char<=, char>, and char>=. More should be added, notably many type primitives that are used by system code.

Allocation optimization

This optimization replaces generalized allocation code by specialized code. Consider a particular call to make-vector:

      (make-vector 7 #f)

In general make-vector can take any expression for the length and initialization arguments, and the above call is compiled as

      const   #f
      setreg  1
      const   7
      op2     make-vector, 1

When the assembler translates the last instruction to native code, it doesn't know the value in RESULT, and must therefore translate it as a call to the generic make-vector millicode (actually, mem_alloci). That code allocates memory and then initializes it in a loop. For short vectors of known length it is possible to do much better, and the peephole optimizer translates the above pattern into

      const   #f
      setreg  1
      op2     make-vector:7, 1

The SPARC assembly code for the primitive now looks something like this:

      set     6, %r1                         ! 6 = false
      jmpl    [ %millicode + M_ALLOC ], %o7  ! allocate
      set     32, %result                    !   8 words
      set     0x1ca2, %g1
      st      %g1, [ %result ]               ! header
      st      %r1, [ %result+4 ]             ! initialize elt 0
      st      %r1, [ %result+8 ]
      st      %r1, [ %result+12 ]
      st      %r1, [ %result+16 ]
      st      %r1, [ %result+20 ]
      st      %r1, [ %result+24 ]
      st      %r1, [ %result+28 ]
      add     3, %result                     ! insert tag

Additionally, the allocation code can be in-lined.

The optimization is performed only for make-vector, and for all vectors of known length less than a cut-off point. The current cut-off is 10, which is not completely arbitrary: there's a trade-off between fast initialization and blow-up in code size. More sophisticated techniques are possible for longer vectors, for example specialized unrolled loops, but in those cases, generic code with an unrolled initialization loop is likely to do nearly as well.

Nontail-call optimization

Twobit generates code for a non-tail call that looks like this:

      global   the-procedure
      setrtn   Lx
      invoke   3               ! 3 arguments
      .align   4               ! ignored on the SPARC
  Lx:

Splitting the setrtn and the invoke is convenient for the compiler and reduces the size of the intermediate code instruction set. However, on the SPARC a setrtn requires some nontrivial machinery; in the above case it would look something like this:

      call     .+8
      add      %o7, k, %o7      ! k reflects the distance from 'call' to Lx
      st       %o7, [ %stkp+4 ] ! store in the stack frame

and the invoke ends with

      jmpl     %tmp0 - 1, %g0   ! jump and discard return address
      set      12, %result      ! setup argument count

But since the return point immediately follows the call in this case, the return address thrown away in the invoke is the return address we want, so we can fold the above MacScheme instructions into

      global   the-procedure
      invoke-with-setrtn  3, Lx
   Lx:

where invoke-with-setrtn will end in the code sequence

      set      12, %result      ! setup argument count
      jmpl     %tmp0 - 1, %o7   ! jump and discard return address
      st       %o7, [ %stkp+4 ] ! store return address

The same optimization applies to local non-tail calls, which use branch rather than invoke. In that case, the SPARC's call instruction can be used as a PC-relative branch that saves the return address.

As it happens, this peephole optimization speeds up local non-tail calls but slows down non-local non-tail calls, on a SPARC Ultra 1. The problem with the non-local calls is probably that a pipeline dependency or memory bottleneck is created between the jump instruction and the store instruction in its delay slot. Certainly, in the case of a branch, the target is known to the processor well before the branch instruction is executed, hence the target instruction has probably been fetched already, and the memory system is available for the store.

The optimization has therefore been disabled for non-local calls. Once the run-time system and assembler are rewritten to allow the return address to be kept in a register, the dependency will go away, and this optimization should pay off also for non-tail calls.

2. Programming model

Generated code must maintain a number of target-independent invariants. The target-dependent variants are explained in this section.

2.1. Registers

The mapping from logical to hardware registers is defined in Rts/regs.cfg.

The register GLOBALS points to a table of run-time system variables and values. The globals table is generated automatically from Rts/globals.cfg. Part of the GLOBALS table is a jump table into millicode routines (see below).

Larceny does not use the SPARC's register windows. Instead, generated code uses a set of virtual machine registers that is kept partly in processor registers, partly in the GLOBALS table. Virtual machine registers that are kept in hardware registers also have shadow locations in the GLOBALS table.

The predicate hardware-mapped?, defined in Asm/Sparc/sparcutil.sch, determines whether a virtual machine register is kept in a processor register.

The globals Scheme variable *nregs*, which is defined in Compiler/sparc.imp.sch, contains the number of general-purpose virtual registers. Parts of the run-time system knows the value of this "variable", so changing it is non-trivial.

In addition to the general-purpose register, several other values are kept in registers:

TIMER holds a software timer.
RESULT is the accumulator and the first argument to millicode routines.
ARGREG2 is used for the second argument to millicode routines.
ARGREG3 is used for the third argument to millicode routines.
TMP0, TMP1, and TMP2 are temporary registers that are not inspected by the garbage collector.
STKP points to the top of the stack cache.

2.2. Scheme-to-Scheme calling convention

FIXME. [Although, I can't think of anything out of the ordinary that goes here.]

2.3. Scheme-to-Millicode calling convention

Millicode arguments are placed in RESULT, ARGREG2 and ARGREG3, and millicode is called indirectly with a single jmpl instruction through the GLOBALS register:

      jmpl    [ %GLOBALS + offset ], %o7
      nop

where offset is the table offset for the millicode routine in question, as defined in Rts/globals.cfg.

The address passed in %o7 should be the address of the return point minus 8 (as per the normal SPARC conventions, in case you're puzzled), so if the millicode procedure should not return to the point following the nop above, then the slot must be used to adjust the return address. For example,

      jmpl    [ %GLOBALS + offset ], %o7
      add     %o7, L0 - (. - 4) - 8, %o7
      ...
L0:

Above, (.-4) is the address of the jmpl instruction, which is also the address in o7, and L0 is the address of the return point, so L0-(.-4)-8 is the quantity to add to o7 to set up a return point at L0-8.

3. The Instruction Cache

Generally, generated code does not need to worry about the processor's instruction cache, which the garbage collector will automatically flush as necessary, depending on the machine model. However, be aware that most modern SPARC implementations do have separate instruction and data caches, and that a write to a location cached in the instruction cache (for example when overwriting a code vector with a breakpoint instruction) may need to be followed by the appropriate number of IFLUSH instructions.

$Id: note6-sparc.html 426 1998-12-21 14:29:31Z lth $
larceny@ccs.neu.edu