Plasma Abstract Machine

Paul Bone
<paul@plasmalang.org>
version 0.1, January 2017
Draft. Copyright © 2015-2017 Plasma Team License: CC BY-SA 4.0
Table of Contents

This document describes the behaviour of the Plasma Abstract Machine (PZ Machine). The PZ file format is described in PZ Bytecode Format. and implementations of the PZ abstract machine are discussed elsewhere (TODO)

In this document we use the textual version of the .pz files for illustrative purposes. However the textual format is never used as an interchange format and rarely used as a language so it does not need or have a specification.

Basic data types

The abstract machine supports words of varying sizes, with the symbols representing them.

  • 8bit (w8)

  • 16bit (w16)

  • 32bit (w32)

  • 64bit (w64)

  • fast word width (w)

  • a word width the same width as a pointer (wptr)

  • a pointer (ptr)

A fast word width is a width that should be the fasted word width for integers on the platform. This may take into account register size, memory usage and maybe implementation convenience.

A word with the same width as a pointer and a pointer differ only in whether the garbage collector may trace them. Which is significant in some contexts (like structures) but not in others (like instruction parameter widths).

Note
TODO: Polymorhism
Handle polymorphism for wptr/ptr. We’ll probably remove wptr and handle the pointer vs non-pointer distinction another way.

Some instructions only make sense with either signed or unsigned data, this is up to individual instructions, the PZ format and abstract machine don’t care. This way "move a 32bit word" makes sense regardless of whether the word is signed, unsigned, or something else (float, bitfield etc).

The PZ machine also supports structures and arrays, more on those later.

Registers

The PZ Machine is a stack based machine, it has a single register: the program counter (PC). The program counter "points to" the next instruction to execute.

Stacks

The basic abstract machine is a stack machine with two stacks. A return stack and an expression stack. The return stack is used to handle procedure call and return. Very little control of the return stack is available. Both basic instructions and procedures are a transformation of the top of the expression stack.

Notation

A procedure or instruction’s signature may look like:

add (w w - w)

This describes the instruction + as taking two words from the top of stack and replacing them with a word. Calling conventions for procedures work the same way. The expression stack is used for argument passing and temporary storage.

fibs (w - w)

From a callee’s perspective, there is little difference between an instruction and a call.

If an instruction is available for all word sizes it may be written as:

add (* * - *)

This is a convention only, there is no support for polymorphism. When using the textual format for PZ, you may disambiguate which instruction you need with a suffix.

eg:

add:8
add:16
add:32
add:64
add:w   (fast word width)
add     (no suffix also means fast word width)
add:ptr (pointer word width)

This works similarly for literal data. This is a byte containing the number 23.

23:8

This is only available for instructions, not calls.

Also in our notation we indicate immediate data with CamelCase, and in the case of calls and literal data, the instruction name is not provided. The instruction to use is available via context.

High level bytecode items

Each item in a bytecode file belongs in one of three types and is referred to by a 32bit ID. Each item type has its own ID-space. In other words data item 5 and procedure 5 are separate. Names are used in .pzt files but are discarded when these are compiled to .pz files. The exceptions are imported items, exported items (TODO), and in the future some names and other information may be stored for debugging.

Structs

A struct is a record type, and has a lot in common with a C struct. Each struct has a fixed number of fields and each field has a width (as above). Structs allow the bytecode interpreter to make its own data layout decisions. Which it may do differently on different platforms.

Tip
Example usage of this information
When a program is loaded and the loader reads a struct type. For that struct type it computes offsets for each of the fields, computes the total size.

Data

Data items come in three types:

  • Basic data: a single data item of a specific width.

  • Array data: a number of data items of the same width, usually packed together.

  • Structure data: a structure of data, the data item provides the struct ID and the value of each field.

TODO: Separate read-only / read-write.

Procedures

Procedures contain executable code. A procedure’s signature is a "stack transformation" it represents the top of stack values before and after a call to this procedure. This is explained above.

Procedures are made up of blocks which are used for control flow. The first block in each procedure is executed when the procedure is called. Within each procedure blocks are numbered sequentially starting at 0. Jump instructions refer to their destination by block ID.

Note that execution can never "fall through" a block, the last instruction in every block must be an unconditional control flow instruction.

Instructions

Each instruction is made from an opcode, between zero and two operand widths and optionally an immediate value.

Zero extend, Sign extend and Truncate

ze (* - *)
se (* - *)
trunc (* - *)

Zero extends, sign extends or truncates the value on the top of the stack. By truncate we mean discard the most significant bytes. While most instructions work on a single operand width, these instructions use two operand widths. For example.

ze (w16 - w32)

Note that it is not necessary (or advised) to use these instructions to convert to and from pointer data, for example to manipulate tagged pointers.

Arithmetic

add (* * - *)
sub (* * - *)
mul (* * - *)
div (* * - *)
mod (* * - *)

Integer addition, subtraction, multiplication, division and modulus.

lshift (* w8 - *)
rshift (* w8 - *)
and (* * - *)
or (* * - *)
xor (* * - *)

Bitwise operations. Note that right shift is unsigned. A signed version will be added later.

not (* - *)

Logical negation

Comparison

lt_u (* * - w)
lt_s (* * - w)
gt_u (* * - w)
gt_s (* * - w)
eq (* * - w)

Less than and greater than on unsigned and signed data. Note that the result is always fast word width. Likewise conditional instructions always take their argument in the fast word width.

Stack manipulation

Stack manipulation instructions don’t care about data width, the machine concepually places all data in the same sized slots.

drop N

Drop the top N items from the stack.

roll N

Rotate the top N items on the stack. The top N-1 items move to the left, the leftmost item becomes the rightmost. (rightmost is top-of-stack). Note that roll 2 is the same as swap.

pick N

Push the _N_th item on the stack to the top of the stack. Note that pick 1 is the same as "dup"

Procedure calls: call and ret

ProcId (-)

Call the procedure given by ProcId. Push the value of the program counter onto the return stack and load the program counter with the address of the first instruction in the first block of the procedure.

ret (-)

Pop the value off the return stack and load it into the program counter register.

TODO: Indirect calls

Jumps: jmp, cjmp

jmp BlockId (-)

Jump unconditionally to the indicated block by loading the address of the first instruction of the block into the program counter.

Note that only blocks can be the target of jump instructions, this way all jmp targets are known.

cjmp BlockId (w -)

Pop a value of the expression stack, if it is non-zero load the address of the first instruction of the given block into the program counter.

Note that this instruction always consumes the value on the stack.

TODO: indirect jumps or some mechanism for computed gotos.

Loops

TODO: Some loops may be handled differently than using blocks and jumps,

Data

Load immediate number

N (- *)

Loads the immediate value onto the expression stack. (N is any value).

Load data reference

DataId (- ptr)

Loads the address of the static data referenced by DataId. (DataId is the ID of any static data).

Load and store memory

load StructId FieldNum (ptr - * ptr)

Read the value of a field from the object at the given address. StructId and FieldNum are literal.

store StructId FieldNum (* ptr - ptr)

Store a value into the field of an object at the given address.

TODO: Make sure that we can easily handle memory barriers for GC.

TODO: Ordinary and array loads and stores.

Memory allocation

alloc StructId (- ptr)
alloc_mutable StructId (- ptr)
alloc_array ElementWidth (w - ptr)
alloc_array_mutable ElementWidth (w - ptr)

Plasma will use immutable structures more often than mutable ones, so immutable is the "normal" type. Note that this does not prevent store from being used on immutable data, but doing so would be bad. The GC among other things will use immutability information to optimise its algorithms.

Garbage collection

The Garbage Collector must be aware of which values are pointers and which are not. Above we explained how information about structures can be used to calculate this for typical heap cells.

This information must also be available for stack frames. There are multiple ways to make this available at runtime. One simple solution is at each GC save point execute code that updates a word in the stack frame containing a bitfield that specifies which stack slots contain a pointer into the heap.

We will probably require .pz programs to provide such bitfields within their instruction streams. Since we use separate expression and return stacks it will need to include information about how many of the top stack values belong to the current procedure.

Note
TODO: Polymorphism
Any polymorphic values will need their "is a pointer" bit filled in at runtime. We can generate runtime code that takes an argument and constructs the bit field. This information can be passed to the procedure by adding extra argument(s) to the procedure, which is how polymorphism transformations work in general.

Optimisations

The objects' bitfields can easily be stored together, as mark bits are already stored in a GC. To save further on memory usage objects with particular layouts can be allocated in particular heap regions. These heap regions themselves provide this information. If a heap layout stores object sizes with the object, the bitfields for most object sizes could easily be packed with the object size.

Some of the information required for stack frames is implicit within the instruction stream. Requiring it to be made explicit makes writing pzrun easier, but some of it could be omitted in a later version.

Builtin operations

See runtime/pz_builtin.c

Misc
print (ptr -)
int_to_string (w - ptr)
die ()
Pointer tagging
// Combine a pointer and a tag into a tagged pointer
make_tag (ptr ptr - ptr)

// Combine a word and a tag into a tagged word (shifting the word)
shift_make_tag (ptr ptr - ptr)

// Extract the pointer and tag from a tagged pointer
break_tag (ptr - ptr ptr)

// Extract the word and tag from a tagged word (shifting the word)
break_shift_tag (ptr - ptr ptr)

// Unshift a tagged value
unshift_value (ptr - ptr)
Deprecated
concat_string (w w - w)
free (w -)

Linking to other modules

TODO

Working with foreign code

TODO

Using PZ

A note about data

The stack cannot be used to store complex data, neither can it be intermediate in the instruction stream. Complex data (structs and arrays) must be either statically allocated or allocated on the heap. In either case the PZ machine needs to know about the structure or array being used.