1. The SPARC assembler
1.1. Structure of the SPARC assembler
1.2. Native code generation
1.3. The SPARC native assembler
1.4. Peephole optimization
2. The Programming model
2.1. Registers
2.2. Scheme-to-Scheme calling convention
2.3. Scheme-to-Millicode calling
convention
The assembler procedure picks the instruction apart and calls
procedures in the code generator to emit machine code. Those
procedures, found largely in Asm/Sparc/gen-msi.sch and
Asm/Sparc/gen-prim.sch, decide on the instructions to generate,
and call procedures in the native assembler. The native assembler
creates the machine code for each instruction and performs certain local
optimizations on the instruction stream such as branch delay slot
filling and instruction scheduling.
The peephole optimizer is invoked on the instruction stream by the
target-independent part of pass5, and coalesces common patterns of
MacScheme instructions into instructions specialized for the SPARC.
It has been suggested that this kind of code may cause the processor
to slow down, and an alternative way of computing the value of L0
instruction using the following code (in generated code only; a
different method could be used in code that won't move):
All the mnemonics are defined at the end of Asm/Sparc/sparcasm.sch.
There are also some abstractions there; for example, on the SPARC a move
is expressed as a logical or with %g0, but this low-level
implementation obscures the meaning of the instruction; therefore, a
sparc.move instruction is provided.
Additionally, there is a sparc.label pseudo-operation, used
for definining branch targets, and a sparc.slot instruction,
which can be used in the branch delay slot of annulled branches.
In addition to creating the machine code for instructions and
resolving branch targets and other expressions, the native assembler
performs a limited amount of optimization. Currently, it fills branch
delay slots using a simple algorithm: if the branch target is an
instruction that can go in a delay slot, then insert the target
instruction in the delay slot and increment the branch offset by 4.
A more sophisticated delay slot filling algorithm is desirable,
because the above algorithm does not decrease code size -- it only makes
a taken branch slightly faster. A better algorithm would work at least
on basic blocks and try to move instructions across branches, into the
slot following the branch, whenever possible. It would not be hard to
implement this efficiently.
The native assembler could also perform instruction scheduling, but
that is unlikely to pay off much until we get rid of most dynamic type
checks.
The peephole optimizer is currently a simple hand-coded decision tree.
This is fast (certainly fast enough), but makes maintenance harder than
it could be with a more pattern-based approach. Clinger has written a
generic pattern-based peephole optimizer and this should be adapted to
be used with Larceny.
Furthermore, a number of the names introduced by the SPARC assembler
are unintelligible for no good reason. This should be cleaned up.
For example, a sequence such as
The peephole optimizer has a table of primitives that it applies this
optimization to. At the time of writing, these primitives are
car, cdr, CELL-REF, +, -,
eq?, and cons. Others could easily be introduced.
Again, this optimization is possible because the semantics of branchf
is to destroy RESULT.
The primitives for which this optimization is performed are currently
null?, pair?, zero?, eq?,
<, <=, >, >=,
=, char=?, char<, char<=,
char>, and char>=. More should be added, notably
many type primitives that are used by system code.
The optimization is performed only for make-vector, and for all vectors
of known length less than a cut-off point. The current cut-off is 10,
which is not completely arbitrary: there's a trade-off between fast
initialization and blow-up in code size. More sophisticated techniques
are possible for longer vectors, for example specialized unrolled loops,
but in those cases, generic code with an unrolled initialization loop is
likely to do nearly as well.
The same optimization applies to local non-tail calls, which use
branch rather than invoke. In that case, the SPARC's
call instruction can be used as a PC-relative branch that saves
the return address.
As it happens, this peephole optimization speeds up local non-tail
calls but slows down non-local non-tail calls, on a SPARC Ultra 1. The
problem with the non-local calls is probably that a pipeline dependency
or memory bottleneck is created between the jump instruction and the
store instruction in its delay slot. Certainly, in the case of a
branch, the target is known to the processor well before the branch
instruction is executed, hence the target instruction has probably been
fetched already, and the memory system is available for the store.
The optimization has therefore been disabled for non-local calls. Once
the run-time system and assembler are rewritten to allow the return
address to be kept in a register, the dependency will go away, and this
optimization should pay off also for non-tail calls.
Generated code must maintain a number of target-independent invariants. The target-dependent
variants are explained in this section.
The mapping from logical to hardware registers is defined in
Rts/regs.cfg.
The register GLOBALS points to a table of run-time system variables
and values. The globals table is generated automatically from
Rts/globals.cfg. Part of the GLOBALS table is a jump table
into millicode routines (see below).
Larceny does not use the SPARC's register windows. Instead,
generated code uses a set of virtual machine registers that is kept
partly in processor registers, partly in the GLOBALS table. Virtual
machine registers that are kept in hardware registers also have shadow
locations in the GLOBALS table.
The predicate hardware-mapped?, defined in
Asm/Sparc/sparcutil.sch, determines whether a virtual machine
register is kept in a processor register.
The globals Scheme variable *nregs*, which is defined in
Compiler/sparc.imp.sch, contains the number of general-purpose
virtual registers. Parts of the run-time system knows the value of this
"variable", so changing it is non-trivial.
In addition to the general-purpose register, several other values are
kept in registers:
The address passed in %o7 should be the address of the return point
minus 8 (as per the normal SPARC conventions, in case you're puzzled), so
if the millicode procedure should not return to the point following the
nop above, then the slot must be used to adjust the return
address. For example,
0. The SPARC architecture
The SPARC architecture is a RISC architecture descended from the
Berkeley RISC projects. It distinguishes itself from most other RISC
architectures in its use of register windows and its support for tagged
arithmetic. The primary programmer's reference for the SPARC is The
SPARC Architecture Manual, published by Prentice-Hall. The most
recent edition covers Version 9; Larceny uses only instructions documented
in the Version 8 manual.
1. The SPARC assembler
The compiler/assembler interface is parameterized by definitions in
Compiler/sparc.imp.sch, as explained in the Larceny Note on adding primitives.
1.1. Structure of the SPARC assembler
The SPARC assembler has four parts:
The assembly table is the target-dependent part of Twobit's pass 5
(assembly). The code for this part is in
Asm/Sparc/pass5p2.sch; this file defines a table of assembler
procedures which is then used by the target-independent part of pass 5
to perform assembly. Each assembler procedure in the table is called
with an assembly structure and a MacScheme instruction.
1.2. The Code Generator
Overview
The file Asm/Sparc/gen-msi.sch contains procedures that
generate code for all MacScheme instructions (msi = MacScheme
Instructions) except the op1, op2, op2imm,
and op3 instructions. Those instructions are handled in
Asm/Sparc/gen-prim.sch, where one code generator procedure is
defined for each primitive.
Code generation idioms
This section lists some common idioms that you will see in disassembled
code or in millicode.
Generic addition
A tagged add of R1 and R2 into R1 or R2 can be accomplished with a
two-instruction sequence and some cleanup code when the addition fails:
taddcc %R1, %R2, %R1
bvc,a L1
nop
sub %R1, %R2, %R1
... call millicode ...
L1:
The instruction at L1 will usually be lifted into the branch delay slot.
If the destination of the addition is neither R1 or R2, or if R1 and R2 are
the same register, then a temporary must be used in order to recover,
and three instructions are necessary (FIXME: this has been improved):
taddcc %R1, %R2, %TMP0
bvc,a L1
mov %TMP0, %R3
... call millicode ...
L1:
Larceny never uses the trapping version of the tagged add instruction.
Conditional move
The following two instructions conditionally move the value VALUE into
register R1; in this case, the condition is "less than".
ble,a .+8
mov VALUE, %R1
To generate a boolean object, the following three-instruction sequence
is used:
mov FALSE_CONST, %R1
ble,a .+8
mov TRUE_CONST, %R1
On SPARC version 9, a conditional move instruction can be used. Larceny
currently uses no SPARC-9 specific instructions.
Generate address of instruction
It is possible to compute the virtual address of any instruction with a
call instruction that branches ahead two instructions:
L0: call .+8
nop
after which register o7 contains the address of L0. The branch delay
slot can be filled in the usual manner or it can be used to adjust
the value of o7 if L0 is not at the call instruction.
L0: ld [ %REG0 + 3 ], %TMP0 ! get code vector
sub %TMP0, 9, %TMP0 ! 1 for tag, 8 for address of _add_
add %TMP0, L0, %o7
This code depends on the assembler resolving L0 as the relative address
within the code vector (which is a feature). The memory reference is
undesirable, but may be scheduled for earlier execution; the control
dependence of the call instruction is gone.
Millicode call
See the section on millicode calling, below.
Generic comparison
FIXME.
Exception handling
Exception handling in generated code is not done all that consistently,
and needs to be cleaned up.
1.3. The SPARC native assembler
Overview
The native assembler, like other assemblers, allows the rest of the
assembler to deal only in symbolic machine instructions (in our case,
they're more like "procedural machine instructions"). The native
assembler provides one procedure for each SPARC instruction it knows how
to assemble; for example, there are procedures called
sparc.bl.a (branch if less with annul bit set),
sparc.ornrcc (logical or with complement from register and set
condition codes), and sparc.andi (logical and with immediate).
The mnemonics chosen are close to the standard SPARC mnemonics; the only
difference is that the assembly procedures require the specification of
whether the (typically) second operand is a register or an immediate.
Branch target resolution
The target-independent part of pass5 provides an ad-hoc (and, as of
v0.33, rather crufty) failure mechanism that allows the assembler to
deal with span-dependent branches in a crude way. Initially, the
assembler assumes that all branches are "short", that is, the target is
within +/-4KB from the branch instruction. (The SPARC branch
instructions have a 24-bit offset, but the jmpl instruction is
limited to the +/-4KB range. Since the jmpl instruction is
used when generating code for the MacScheme jump instruction,
some uses of jmpl occasionally need a larger offset than 4KB.)
If, during target resolution, a too-large offset is discovered, then the
failure mechanism is invoked. Assembly is then retried with long
offsets temporarily enabled.
Optimizations
1.4. Peephole optimization
The SPARC peephole optimizer (Asm/Sparc/peepopt.sch) folds
sequences of MacScheme Machine instructions into faster forms that are
known only to the SPARC assembler. The optimizations currently
performed are: copy avoidance, boolean evaluation for control,
allocation optimization, and nontail-call optimization.
Copy avoidance
This class of optimizations seeks to avoid register-to-register moves
that are present in the output from Twobit because the intermediate code
(MacScheme Assembly Language) is a 2-address code, but that are not
necessary on a 3-address machine such as the SPARC. Most of the
optimizations performed by the peeophole optimizer fall into this class.
reg 1
op2imm +, 7
setreg 2
that adds 7 to the value in R1 and stores it into R2 is peephole-optimized
into the single instruction (FIXME: new names now)
dresop2imm +,1,7,2
This optimization is possible because the semantics of setreg specify
that it destroys RESULT; hence, the use of RESULT can be avoided altogether in
many cases.
Boolean evaluation for control
Twobit produces code like the following when a primitive that produces a
boolean result is used in the test part of a conditional expression:
reg 1
op1 null?
branchf Ln
Here, null? performs a test on the object in RESULT and sets RESULT to
either #t or #f. The branchf instruction then compares RESULT with #f
to determine whether to branch or not. The complete sequence of machine
instructions looks something like this:
mov %r1, %result
cmp %result, 10 ! 10 = empty list
mov 2, %result ! 2 = true
bne,a .+8
mov 6, %result ! 6 = false
cmp %result, 6
be Ln
nop
In this case, the creation of a boolean object can be avoided; instead,
code can be generated that checks the condition bits directly. The above
sequence of MacScheme instructions is collapsed into
optbreg1 null?, 1, Ln
for which the generated code is the much simpler
cmp %r1, 10
bne Ln
nop
Allocation optimization
This optimization replaces generalized allocation
code by specialized code. Consider a particular call to make-vector:
(make-vector 7 #f)
In general make-vector can take any expression for the length and
initialization arguments, and the above call is compiled as
const #f
setreg 1
const 7
op2 make-vector, 1
When the assembler translates the last instruction to native code, it
doesn't know the value in RESULT, and must therefore translate it as a
call to the generic make-vector millicode (actually,
mem_alloci). That code allocates memory and then initializes
it in a loop. For short vectors of known length it is possible to do
much better, and the peephole optimizer translates the above pattern
into
const #f
setreg 1
op2 make-vector:7, 1
The SPARC assembly code for the primitive now looks something like this:
set 6, %r1 ! 6 = false
jmpl [ %millicode + M_ALLOC ], %o7 ! allocate
set 32, %result ! 8 words
set 0x1ca2, %g1
st %g1, [ %result ] ! header
st %r1, [ %result+4 ] ! initialize elt 0
st %r1, [ %result+8 ]
st %r1, [ %result+12 ]
st %r1, [ %result+16 ]
st %r1, [ %result+20 ]
st %r1, [ %result+24 ]
st %r1, [ %result+28 ]
add 3, %result ! insert tag
Additionally, the allocation code can be in-lined.
Nontail-call optimization
Twobit generates code for a non-tail call that looks like this:
global the-procedure
setrtn Lx
invoke 3 ! 3 arguments
.align 4 ! ignored on the SPARC
Lx:
Splitting the setrtn and the invoke is convenient for the
compiler and reduces the size of the intermediate code instruction set.
However, on the SPARC a setrtn requires some nontrivial machinery;
in the above case it would look something like this:
call .+8
add %o7, k, %o7 ! k reflects the distance from 'call' to Lx
st %o7, [ %stkp+4 ] ! store in the stack frame
and the invoke ends with
jmpl %tmp0 - 1, %g0 ! jump and discard return address
set 12, %result ! setup argument count
But since the return point immediately follows the call in this case,
the return address thrown away in the invoke is the return
address we want, so we can fold the above MacScheme instructions into
global the-procedure
invoke-with-setrtn 3, Lx
Lx:
where invoke-with-setrtn will end in the code sequence
set 12, %result ! setup argument count
jmpl %tmp0 - 1, %o7 ! jump and discard return address
st %o7, [ %stkp+4 ] ! store return address
2. Programming model
2.1. Registers
2.2. Scheme-to-Scheme calling convention
FIXME. [Although, I can't think of anything out of the ordinary that goes
here.]
2.3. Scheme-to-Millicode calling convention
Millicode arguments are placed in RESULT, ARGREG2 and ARGREG3, and
millicode is called indirectly with a single jmpl instruction
through the GLOBALS register:
jmpl [ %GLOBALS + offset ], %o7
nop
where offset is the table offset for the millicode routine
in question, as defined in Rts/globals.cfg.
jmpl [ %GLOBALS + offset ], %o7
add %o7, L0 - (. - 4) - 8, %o7
...
L0:
Above, (.-4) is the address of the jmpl instruction, which is
also the address in o7, and L0 is the address of the return point, so
L0-(.-4)-8 is the quantity to add to o7 to set up a return point at L0-8.
3. The Instruction Cache
Generally, generated code does not need to worry about the processor's
instruction cache, which the garbage collector will automatically flush
as necessary, depending on the machine model. However, be aware that
most modern SPARC implementations do have separate instruction and data
caches, and that a write to a location cached in the instruction cache
(for example when overwriting a code vector with a breakpoint
instruction) may need to be followed by the appropriate number of IFLUSH
instructions.
$Id: note6-sparc.html 426 1998-12-21 14:29:31Z lth $
larceny@ccs.neu.edu