1. Overview
2. Tagged Words
3. Basic Structure Layouts
4. Numbers
5. Strings
6. Procedures, code, and constants
7. Extensions
This note describes the representations used in 32-bit mode by Larceny/Sparc, Larceny/x86, and Petit Larceny. (Common Larceny's representations are rather different because the Common Language System cannot express a union of reference and value types.)
Extensions of the layouts described here are reasonable also in a 64-bit setting, but Larceny does not yet run in 64-bit mode.
Larceny bases all its data types on a few basic representations. Type tags accessible to Scheme code distinguish between actual data types which are mapped onto the basic representations.
There are six basic representations: fixnums, immediates, and pointers to pairs, vector-like structures, bytevector-like structures, and procedures. Each representation consists of a tagged word and, in the case of pointer types, heap-allocated memory.
A tagged word has a 2 or 3 bit type tag in the low order bits:
xxxx xxxx xxxx xxxx xxxx xxxx xxxx xx00 fixnum xxxx xxxx xxxx xxxx xxxx xxxx xxxx xx10 immediate pppp pppp pppp pppp pppp pppp pppp p001 pointer to pair pppp pppp pppp pppp pppp pppp pppp p011 pointer to vector struct pppp pppp pppp pppp pppp pppp pppp p101 pointer to bytevector struct pppp pppp pppp pppp pppp pppp pppp p111 pointer to procedure struct
Immediates are further distinguished by more bits in the low-order byte; they are roughly divided into constants (booleans, empty list, characters, and miscellaneous system constants) and headers:
0000 0000 0000 0000 0000 0000 0000 0010 #f 0000 0000 0000 0000 0000 0000 0000 0110 #t 0000 0000 0000 0000 0000 0000 0000 1010 empty list xxxx xxxx xxxx xxxx xxxx xxxx 0001 0110 miscellaneous 000c cccc cccc cccc cccc cccc 0010 0110 character 0sss ssss ssss ssss ssss ssss 100x xx10 reserved header 0sss ssss ssss ssss ssss ssss 101x xx10 vector-like header 0sss ssss ssss ssss ssss ssss 110x xx10 bytevector-like header 0sss ssss ssss ssss ssss ssss 1111 1110 procedure header
The bits marked xxx
in the low byte of a header are fields
for type tags: they are used by Scheme code to distinguish between
different data types mapped onto the basic structures. The type tags
can be manipulated with the typetag
and
typetag-set!
primitives.
The bits marked s
in a header contain the size of the data
structure in bytes, not including the header word, and not including
any padding. For vector-like structures and procedures, the two low
bits of the length field will be 0.
Miscellaneous constants include #!unspecified,
,
#!undefined
, and #!eof
.
Characters are represented as Unicode scalar values, multiplied by 256, with the character tag in the low-order byte.
The structure layouts are listed below. When it is stated that a pointer of a particular kind points to a particular location, the meaning is "the pointer with its tag stripped off".
A header word may under no circumstances be stored in the data area of a pair, vector-like structure, or procedure structure.
Fixnums (small exact integers) are unboxed and kept in the high 30 bits of a word, with the two low bits always 0 (figure 1).
+------------------------------+--+ | fixnum |00| +------------------------------+--+
Figure 1: fixnum.
Bignums (large exact integers) are bytevector-like with the sign in the first two bytes (0 for positive, 1 for negative), followed by a digit count (two bytes) and then base-$2^{32}$ digits in the next words, with the least significant word first; each word is stored big-endian (figure 2). While bignums cannot be 0 in user code, system code sometimes creates bignums of value 0 in an internal calculation. A bignum with value 0 is distinguished by a digitcount of 0; the sign is immaterial.
+------------------------+--------+ | length | hdrtag | +------------------------+--------+ | sign | digitcount | +---------------------------------+ | lsd | +---------------------------------+ ...
Figure 2: Bignum with 32-bit bigits.
The rationale for keeping a digit count which is different from the vector length is that while (in particular) the multiplication routine must create a vector whose length is the sum of the digit counts, some of the leading digits may be 0, and hence we can have a lower digit count without having to reallocate memory or use a temporary buffer.
Bignums can also be considered with a different logical layout: Each 32-bit digit can be interpreted as two 16-bit digits, also in big-endian fashion within the word; interpreted this way, the bignum gets a funny access pattern (figure 3). The digit count is still the number of 32-bit digits used; see above discussion for sign and bignums of value 0.
+------------------------+--------+ | length | hdrtag | +------------------------+--------+ | sign | digitcount | +---------------------------------+ | nlsd | lsd | +---------------------------------+ ...
Figure 3: Bignum with 16-bit bigits.
Ratnums (exact rationals), shown in figure 4, are vector-like, with the first word of the vector being the numerator as a Scheme object (fixnum or bignum), and the second word being the denominator (greater than 1), also a Scheme object.
+------------------------+--------+ | vectorlength | hdrtag | +------------------------+--------+ | numerator | +---------------------------------+ | denominator | +---------------------------------+
Figure 4: ratnum.
Rectnums (exact complexes), shown in figure 5, look like ratnums, except that the first word is the real part (an integer or ratnum) and the second word is the imaginary part (ditto). Both parts are exact reals, and the imaginary part is nonzero.
+------------------------+--------+ | vectorlength | hdrtag | +------------------------+--------+ | real-part | +---------------------------------+ | imag-part | +---------------------------------+
Figure 5: Rectnum.
Flonums (inexact reals) are bytevector-like. The first word is unused, and the two next words contain the double in IEEE double precision format. The rationale for the unused word is this: since everything in the system is aligned on an 8-byte boundary, one word will be wasted anyway. By putting it before the number rather than after, the number becomes 8-byte aligned, and we can use a ``load double'' instruction to load it. (Figure 6.)
+------------------------+--------+ | length | hdrtag | +------------------------+--------+ | unused | +---------------------------------+ | IEEE double precision | | | +---------------------------------+
Figure 6: Flonum.
Compnums (inexact complexes) are bytevector-like. The first word is unused (see the description of the flonum for a rationale). The two next words contain the real part. The two last words contain the imaginary part. (Figure 7.) Both parts are IEEE double precision numbers.
+------------------------+--------+ | length | hdrtag | +------------------------+--------+ | unused | +---------------------------------+ | IEEE double precision | | (real part) | +---------------------------------+ | IEEE double precision | | (imaginary part) | +---------------------------------+
Figure 8: Compnum.
In Larceny v0.93, strings are bytevector-like and contain Latin-1 characters encodings. In some future version of Larceny, strings will be bytevector-like with 4 bytes per character. Each 4-byte character is fully boxed, so Larceny's strings are not in UTF-32 format. When Unicode conversion is complete, the header tag for old-style Latin-1 strings will be recycled as the header tag for immutable Unicode strings.
As stated earlier, a procedure holds tagged words. Larceny's procedures are very simple and serve a dual role as closures and ribs in the static environment.
A procedure structure contains a bytevector pointer in slot 0 (it points to the code vector), a vector pointer in slot 1 (it points to the constant vector), a static link (procedure pointer) or #f in slot 2, and saved register values in slots 3 and up.
A code vector is a plain byte vector; there is no way to distinguish the two without context.
A constant vector is a plain vector. Data held by a constant vector is immutable with one exception: a constant vector holds pointers to all global cells which are referenced by the procedure owning the constant vector. These cells may be assigned to. In the current implementation, global cells are pairs where the first element holds the value and the second element optionally holds some documentation about the variable (currently its name).
Larceny provides one new data type. A bytevector is vector that holds bytes: exact integers in the range 0..255.