Larceny Note #2: Data Representations

Lars T Hansen / June 27, 1997 William D Clinger / March 20, 2007 (Unicode, multiple targets)

1. Overview
2. Tagged Words
3. Basic Structure Layouts
4. Numbers
5. Strings
6. Procedures, code, and constants
7. Extensions

1. Overview

This note describes the representations used in 32-bit mode by Larceny/Sparc, Larceny/x86, and Petit Larceny. (Common Larceny's representations are rather different because the Common Language System cannot express a union of reference and value types.)

Extensions of the layouts described here are reasonable also in a 64-bit setting, but Larceny does not yet run in 64-bit mode.

Larceny bases all its data types on a few basic representations. Type tags accessible to Scheme code distinguish between actual data types which are mapped onto the basic representations.

There are six basic representations: fixnums, immediates, and pointers to pairs, vector-like structures, bytevector-like structures, and procedures. Each representation consists of a tagged word and, in the case of pointer types, heap-allocated memory.

2. Tagged Words

A tagged word has a 2 or 3 bit type tag in the low order bits:

  xxxx xxxx  xxxx xxxx  xxxx xxxx  xxxx xx00   fixnum
  xxxx xxxx  xxxx xxxx  xxxx xxxx  xxxx xx10   immediate
  pppp pppp  pppp pppp  pppp pppp  pppp p001   pointer to pair
  pppp pppp  pppp pppp  pppp pppp  pppp p011   pointer to vector struct
  pppp pppp  pppp pppp  pppp pppp  pppp p101   pointer to bytevector struct
  pppp pppp  pppp pppp  pppp pppp  pppp p111   pointer to procedure struct

Immediates are further distinguished by more bits in the low-order byte; they are roughly divided into constants (booleans, empty list, characters, and miscellaneous system constants) and headers:

  0000 0000  0000 0000  0000 0000  0000 0010   #f
  0000 0000  0000 0000  0000 0000  0000 0110   #t
  0000 0000  0000 0000  0000 0000  0000 1010   empty list
  xxxx xxxx  xxxx xxxx  xxxx xxxx  0001 0110   miscellaneous
  000c cccc  cccc cccc  cccc cccc  0010 0110   character
  0sss ssss  ssss ssss  ssss ssss  100x xx10   reserved header
  0sss ssss  ssss ssss  ssss ssss  101x xx10   vector-like header
  0sss ssss  ssss ssss  ssss ssss  110x xx10   bytevector-like header
  0sss ssss  ssss ssss  ssss ssss  1111 1110   procedure header

The bits marked xxx in the low byte of a header are fields for type tags: they are used by Scheme code to distinguish between different data types mapped onto the basic structures. The type tags can be manipulated with the typetag and typetag-set! primitives.

The bits marked s in a header contain the size of the data structure in bytes, not including the header word, and not including any padding. For vector-like structures and procedures, the two low bits of the length field will be 0.

Miscellaneous constants include #!unspecified,, #!undefined, and #!eof.

Characters are represented as Unicode scalar values, multiplied by 256, with the character tag in the low-order byte.

3. Basic Structure Layouts

The structure layouts are listed below. When it is stated that a pointer of a particular kind points to a particular location, the meaning is "the pointer with its tag stripped off".

A header word may under no circumstances be stored in the data area of a pair, vector-like structure, or procedure structure.

4. Numbers

Fixnums (small exact integers) are unboxed and kept in the high 30 bits of a word, with the two low bits always 0 (figure 1).

    +------------------------------+--+
    |       fixnum                 |00|
    +------------------------------+--+

Figure 1: fixnum.

Bignums (large exact integers) are bytevector-like with the sign in the first two bytes (0 for positive, 1 for negative), followed by a digit count (two bytes) and then base-$2^{32}$ digits in the next words, with the least significant word first; each word is stored big-endian (figure 2). While bignums cannot be 0 in user code, system code sometimes creates bignums of value 0 in an internal calculation. A bignum with value 0 is distinguished by a digitcount of 0; the sign is immaterial.

    +------------------------+--------+
    |       length           | hdrtag |
    +------------------------+--------+
    | sign          |   digitcount    |
    +---------------------------------+
    |              lsd                |
    +---------------------------------+
    ...

Figure 2: Bignum with 32-bit bigits.

The rationale for keeping a digit count which is different from the vector length is that while (in particular) the multiplication routine must create a vector whose length is the sum of the digit counts, some of the leading digits may be 0, and hence we can have a lower digit count without having to reallocate memory or use a temporary buffer.

Bignums can also be considered with a different logical layout: Each 32-bit digit can be interpreted as two 16-bit digits, also in big-endian fashion within the word; interpreted this way, the bignum gets a funny access pattern (figure 3). The digit count is still the number of 32-bit digits used; see above discussion for sign and bignums of value 0.

    +------------------------+--------+
    |       length           | hdrtag |
    +------------------------+--------+
    | sign          |   digitcount    |
    +---------------------------------+
    |    nlsd       |      lsd        |
    +---------------------------------+
    ...

Figure 3: Bignum with 16-bit bigits.

Ratnums (exact rationals), shown in figure 4, are vector-like, with the first word of the vector being the numerator as a Scheme object (fixnum or bignum), and the second word being the denominator (greater than 1), also a Scheme object.

    +------------------------+--------+
    |       vectorlength     | hdrtag |
    +------------------------+--------+
    |       numerator                 |
    +---------------------------------+
    |       denominator               |
    +---------------------------------+

Figure 4: ratnum.

Rectnums (exact complexes), shown in figure 5, look like ratnums, except that the first word is the real part (an integer or ratnum) and the second word is the imaginary part (ditto). Both parts are exact reals, and the imaginary part is nonzero.

    +------------------------+--------+
    |       vectorlength     | hdrtag |
    +------------------------+--------+
    |       real-part                 |
    +---------------------------------+
    |       imag-part                 |
    +---------------------------------+

Figure 5: Rectnum.

Flonums (inexact reals) are bytevector-like. The first word is unused, and the two next words contain the double in IEEE double precision format. The rationale for the unused word is this: since everything in the system is aligned on an 8-byte boundary, one word will be wasted anyway. By putting it before the number rather than after, the number becomes 8-byte aligned, and we can use a ``load double'' instruction to load it. (Figure 6.)

    +------------------------+--------+
    |      length            | hdrtag |
    +------------------------+--------+
    |      unused                     |
    +---------------------------------+
    |      IEEE double precision      |
    |                                 |
    +---------------------------------+

Figure 6: Flonum.

Compnums (inexact complexes) are bytevector-like. The first word is unused (see the description of the flonum for a rationale). The two next words contain the real part. The two last words contain the imaginary part. (Figure 7.) Both parts are IEEE double precision numbers.

    +------------------------+--------+
    |      length            | hdrtag |
    +------------------------+--------+
    |      unused                     |
    +---------------------------------+
    |      IEEE double precision      |
    |      (real part)                |
    +---------------------------------+
    |      IEEE double precision      |
    |      (imaginary part)           |
    +---------------------------------+

Figure 8: Compnum.

5. Strings

In Larceny v0.93, strings are bytevector-like and contain Latin-1 characters encodings. In some future version of Larceny, strings will be bytevector-like with 4 bytes per character. Each 4-byte character is fully boxed, so Larceny's strings are not in UTF-32 format. When Unicode conversion is complete, the header tag for old-style Latin-1 strings will be recycled as the header tag for immutable Unicode strings.

6. Procedures, code, and constants

As stated earlier, a procedure holds tagged words. Larceny's procedures are very simple and serve a dual role as closures and ribs in the static environment.

A procedure structure contains a bytevector pointer in slot 0 (it points to the code vector), a vector pointer in slot 1 (it points to the constant vector), a static link (procedure pointer) or #f in slot 2, and saved register values in slots 3 and up.

A code vector is a plain byte vector; there is no way to distinguish the two without context.

A constant vector is a plain vector. Data held by a constant vector is immutable with one exception: a constant vector holds pointers to all global cells which are referenced by the procedure owning the constant vector. These cells may be assigned to. In the current implementation, global cells are pairs where the first element holds the value and the second element optionally holds some documentation about the variable (currently its name).

7. Extensions

Larceny provides one new data type. A bytevector is vector that holds bytes: exact integers in the range 0..255.


$Id: note2-repr.html 87 1998-11-25 14:38:41Z lth $
larceny@ccs.neu.edu