- There are standard mechanisms for annotations and their resolution
- Every annotation may have its own implementation, be it a hashtable
of object-to-value associations, an array of values, or some
executable code (see the sketch at the end of this list).
- Lazy evaluation and futures, typing, user interface, and scoping are
all done through annotations.
- For example, an object's concrete type (relative to the GC mechanism)
may be determined from bits in its address, whether statically or
dynamically.
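- For illustration, a minimal C sketch of the above: each annotation
carries its own resolution routine, whatever its underlying
representation. The interface and names are illustrative assumptions,
not settled design.

    struct object;                   /* whatever the store calls an object */

    struct annotation {
        /* Resolve this annotation for a given object, whatever the
         * underlying representation (hash table, array, code...). */
        void *(*resolve)(struct annotation *self, struct object *obj);
        void *impl;                  /* representation-specific state */
    };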
LLL Difficulties
- See HLL difficulties...
Hardware independence
- At what level shall word size be decided?
- How can objects migrate back and forth
between machines with different byte sizes
without loss of information?
Garbage Collection
- Infix pointers (pointers that do not point at some globally constant
offset from the beginning of an allocation unit) greatly complicate the GC.
They should be forbidden whenever possible.
- "C" like infix pointers can still be simulated with a segment-offset
pair, with an aligned pointer and an offset inside the segment.
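- A minimal C sketch of that segment-offset simulation (the names are
illustrative):

    #include <stddef.h>

    typedef struct {
        void  *base;    /* aligned pointer to the start of the allocation unit */
        size_t offset;  /* byte offset of the "infix" position inside it */
    } infix_ptr;

    /* Dereference as a byte pointer; the GC only ever scans `base`. */
    static inline void *infix_deref(infix_ptr p)
    {
        return (char *)p.base + p.offset;
    }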
- The GC may have to accept infix pointers for code return addresses,
or else the calling convention may become grossly inefficient.
I propose that code "segments" should not cross, say, 4K or 8K boundaries,
so that finding the right code segment is just a matter of masking off the
low bits of the return address and checking the short list of segments in
the aligned block thus obtained (see the sketch below).
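- A minimal C sketch of that lookup, assuming 4K-aligned code blocks and
a hypothetical per-block segment list (names are illustrative):

    #include <stdint.h>
    #include <stddef.h>

    #define BLOCK_BITS 12                      /* code blocks are 4K */
    #define BLOCK_MASK (~(uintptr_t)((1u << BLOCK_BITS) - 1))

    struct segment {
        uintptr_t start, end;                  /* [start, end) of one code segment */
        struct segment *next;                  /* other segments in the same block */
    };

    /* Hypothetical: the list of code segments in an aligned block. */
    extern struct segment *segments_in_block(uintptr_t block);

    /* Find the code segment containing a return address. */
    struct segment *find_segment(uintptr_t return_address)
    {
        uintptr_t block = return_address & BLOCK_MASK;
        for (struct segment *s = segments_in_block(block); s; s = s->next)
            if (s->start <= return_address && return_address < s->end)
                return s;
        return NULL;                           /* not a known code address */
    }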
- Big problem: how do we efficiently
differentiate pointers from numbers, etc.?
Structural differentiation is powerful, but it may slow the GC considerably
unless descriptors are simple (for most objects, just an integer giving the
length of a pointer array), and it forbids dynamic differentiation, i.e.
mixing integers and pointers in the same array (e.g. a simple stack), etc.
That's why we'll use a simple bit pattern to differentiate integers (raw data)
from pointers (structured data), and different kinds of pointers from each
other (that's a Big Bunch Of Pages kind of GC).
- We must choose between integers having the low bit set or cleared.
Having it set (and thus having the bit cleared for pointers) may allow faster
pointer access on RISC machines, but slows down all arithmetic. Having the
bit set for pointers allows easier arithmetic, but forces the use of an
offset for all memory accesses.
-
The big question is: shall integers be stripped of their low bit, which
would simplify overflow-testing code to nothing and make the implementation
portable, but make pointer arithmetic and the mixing of true 32-bit integers
with 31-bit ones a little harder? Or shall they be stripped of their
overflow bit, which makes integer overflows generate GC-readjustable
pointers rather than providing flat modulo arithmetic, but allows easy
pointer arithmetic and mixing of 31-bit and 32-bit integers?
- We shall implement both ways, and compare actual execution time and
code space measurements!!!
(A sketch of the first variant, with the low bit set on integers, follows.)
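- A minimal C sketch of low-bit tagging, assuming the variant where
integers have the low bit set and pointers are word-aligned (low bit
clear); macro names are illustrative:

    #include <stdint.h>

    typedef uintptr_t value;

    #define IS_INT(v)      ((v) & 1)                  /* low bit set: integer   */
    #define IS_PTR(v)      (!((v) & 1))               /* low bit clear: pointer */
    #define INT_TO_VAL(i)  ((((value)(i)) << 1) | 1)  /* 31-bit payload on 32-bit words */
    #define VAL_TO_INT(v)  ((intptr_t)(v) >> 1)       /* arithmetic shift assumed */
    #define PTR_TO_VAL(p)  ((value)(p))               /* word-aligned, so tag is 0 */
    #define VAL_TO_PTR(v)  ((void *)(v))

    /* Adding two tagged integers without untagging:
     * (2a+1) + (2b+1) - 1 == 2(a+b) + 1. */
    static inline value tagged_add(value a, value b)
    {
        return a + b - 1;
    }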
- A high-level page directory is used to determine the GC type of an object
according to the page it is in. It is a multi-level hashed structure that
may evolve with the GC code, so that the type of an object can be found
quickly. Typically it will be a mix of arrays and balanced binary trees
that recognize bit patterns (a simplified sketch follows).
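- A minimal C sketch of such a directory, flattened to a two-level array
for brevity; the sizes and names are illustrative assumptions:

    #include <stdint.h>
    #include <stddef.h>

    #define PAGE_BITS 12                            /* 4K pages           */
    #define DIR_BITS  10                            /* top-level index    */
    #define TBL_BITS  (32 - PAGE_BITS - DIR_BITS)   /* second-level index */

    struct gc_page_type;                            /* per-page GC descriptor */

    static struct gc_page_type **page_dir[1 << DIR_BITS];

    static struct gc_page_type *page_type_of(uintptr_t addr)
    {
        uintptr_t page = addr >> PAGE_BITS;
        struct gc_page_type **table = page_dir[page >> TBL_BITS];
        return table ? table[page & ((1u << TBL_BITS) - 1)] : NULL;
    }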
- The GC type of an object, as determined by its address, gives us routines
to update the object during a GC, to destroy the object when it is no longer
accessed, etc.
- The GC type of a page chunk allows us to track down the beginning of
the individual objects pointed to on the page (in case infix pointers are
used); it also gives us the policy to follow when swapping out the page
(which may be copying the page to disk, sending it over the network, or
compressing it in memory for possible later actual swapping out of memory,
etc.). Such a type could be represented as a table of routines, as sketched
below.
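- A minimal C sketch of such a per-page-type table of routines; the exact
set of operations is an assumption drawn from the two previous items:

    struct gc_page_ops {
        void  (*update)(void *obj);           /* adjust pointers during a GC         */
        void  (*destroy)(void *obj);          /* finalize an unreachable object      */
        void *(*object_base)(void *infix);    /* object start from an infix pointer  */
        int   (*swap_out)(void *page);        /* to disk, network, or compressed RAM */
    };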
Persistence
- Be careful with distributed persistence: always remember previous
states until all transactions using them are finished and confirmed.
- Real-time persistent objects will have to time-allocate the
copying of their data into buffers for the synchronized persistent
store, even though this operation may be done only once in a
while.
Implementation Ideas
Implementation language
- We shall use the m4 preprocessor (or later our own HLL)
to produce assembly source files for all our different target
processors from mostly the same meta-source.
- We shall use C as though it were a mostly regular assembler,
with labels, jumps, etc.,
so the same meta-source also produces the C source files.
Hey, wasn't C called a portable assembler? ;->
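- To illustrate, a tiny (purely illustrative) example of C written in that
assembler-like style, with explicit labels and jumps and globals used as
"registers":

    #include <stdio.h>

    int acc, counter;                /* global "registers" */

    int main(void)
    {
        acc = 0; counter = 10;
    loop:                            /* an assembly-style label */
        acc += counter;
        if (--counter) goto loop;    /* conditional jump */
        printf("%d\n", acc);         /* prints 55 */
        return 0;
    }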
Modules
- Modules have install/uninstall annotation fields explaining how
to restore/resume the object
from the state log as obtained from the persistent store.
In general, this will be a call to a standard trusted
high-level module.
However, this can be low-level code in the very first
bootstrapping modules... (see the sketch at the end of this section)
- This scheme can be used for migration in general: persistence,
garbage collection, dynamic linking, etc.
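- A minimal C sketch of such install/uninstall fields on a module
descriptor; the field names and signatures are illustrative assumptions:

    struct state_log;                 /* state as read from the persistent store */

    struct module {
        const char *name;
        int  (*install)(struct state_log *log);    /* restore/resume the object   */
        void (*uninstall)(struct state_log *log);  /* checkpoint before migration */
    };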
Mixed cooperative/preemptive multithreading:
-
Preemption is necessary
for real-time response as well as for insecure tasks;
but cooperation allows far better context-switch times
(on the i386, compare saving some 5 registers to saving the whole
CPU+FPU+MMU state),
not to mention the trivial implementation of mutual exclusion
and all kinds of similar system-wide assumptions,
such as those needed by a garbage collector.
-
The Tunes way, as utsl points out,
is to allow the HLL to be preemptively multithreaded
(or rather, it will seamlessly manage concurrent objects),
while the LLL is cooperatively multithreaded.
-
Actually, if we define
cooperation by the fact that
you may rely on some software properties of the code,
and preemption by the fact that
a thread may be made to yield execution without having to poll the scheduler,
then the two techniques are not incompatible, as we'll see.
-
In traditional systems (as well as in emulation boxes for them),
all software is so completely and deeply insecure
that only uncooperative preemption is possible
if the system is not to crash too often
(examples of often-crashing cooperative, non-preemptive systems
being winblows 3.1 and MacOS).
Now, Tunes is a secure system, and we can and shall use
cooperation whenever possible (= almost always):
we shall require standard compilers to generate cooperative code
that PAUSEs every now and then.
-
Of course, real-time threads need some kind of preemption,
and so do untrusted environments (which emulation boxes are);
but as long as these cannot access the preempted thread's data,
or any resource shared by the preempted thread
(like garbage-collected memory), everything is fine.
-
Actually, threads are interrupted and signaled by more privileged threads,
but not preempted by equally-privileged threads
(they might be preempted by foreign threads, to which they can't
compare privilege).
This is why we call these interruptible cooperative threads
rather than preemptible threads.
-
Now, a common problem with cooperation is that PAUSEs are often too
irregular: having them too far apart from each other yields poor
response time, while having them too near yields too many context switches
and related overhead (even in the best case, lots of cache misses). We can
solve this problem by requiring some fine PAUSE resolution, but doing an
actual context switch only when the timer says so (by patching the PAUSE
code to a mere RET/NEXT until it's time); see the sketch below.
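-
A minimal C sketch of that cheap PAUSE; as a portable approximation it
polls a flag raised by the timer instead of actually patching the PAUSE
code, and yield() is a hypothetical scheduler entry point:

    #include <signal.h>

    volatile sig_atomic_t switch_requested;

    /* Installed on the timer interrupt: just raise a flag. */
    void timer_tick(int sig)
    {
        (void)sig;
        switch_requested = 1;
    }

    extern void yield(void);       /* hypothetical cooperative scheduler entry */

    /* Compilers are required to emit calls to this "every now and then". */
    void pause_point(void)
    {
        if (switch_requested) {    /* almost always false, so very cheap */
            switch_requested = 0;
            yield();               /* the actual context switch happens here */
        }
    }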
-
This is perfect, but may slow tight loops significantly.
One way out is to unroll tight loops enough that this cost is reduced.
If this still isn't satisfactory,
a more elaborate way involves an "it's time to switch" signal.
Before entering a tight loop, some "interrupt recovery" routine
is registered as the signal handler (and is unregistered afterwards);
if an interrupt happens while in the tight loop,
the recovery routine is called,
and it has the responsibility of putting the system
in a state that follows the usual calling conventions
(see the sketch below).
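-
A minimal C sketch of scoping a recovery routine around a tight loop;
the registry and the handler protocol are illustrative assumptions, and
the machine-specific recovery body is omitted:

    typedef void (*recovery_fn)(void);

    recovery_fn current_recovery;            /* consulted by the interrupt handler */

    static void set_recovery(recovery_fn fn) { current_recovery = fn; }

    static void double_all_recovery(void)
    {
        /* Machine-specific: steer the interrupted loop to a point where
         * the usual calling conventions hold again; body omitted. */
    }

    void double_all(double *a, long n)
    {
        set_recovery(double_all_recovery);   /* register before the loop */
        for (long i = 0; i < n; i++)
            a[i] *= 2.0;                     /* tight: no PAUSE inside   */
        set_recovery(0);                     /* unregister afterwards    */
    }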
-
Typically, the routine will continue normal execution of the loop
until some modified point which will jump back
into the remainder of the recovery routine,
which can finish the work
because it knows the exact point of execution in the loop
(which could have been at any point inside the loop).
This modified loop might be achieved through
safely self-modifying code, or safely runtime-generated code,
if the CPU architecture doesn't make that excessively costly
by, e.g., requiring a flush of a whole big instruction cache afterwards;
in case expensive cache flushing makes runtime code generation/modification
too expensive, a pre-modified copy of the code could be used,
which wastes memory
(but usually, whoever can afford bloated hardware with large instruction
caches can afford some more memory, too).
-
I realize that there might be problems in some cases,
like page faults, when the recovery routine just can't complete
until the interrupt handler returns.
This means that either the handler must live in a lower space
and must not need the GC invariants provided by interruptible code,
or the recovery routine must be able to roll back
rather than push forward execution,
or critical routines must lock their pages in memory before
entering recoverable mode,
or any combination of the above.
-
In all cases, this makes interrupt handling more complicated,
because it has to determine the recovery routine and jump into it.
But that's no problem, since interrupts are a most infrequent case
compared to normal execution.
Because of the space/time cost of recovery routines,
they shall only be used for time-critical loops and recursive code,
where inserting PAUSEs would be too costly.
Wherever a tight loop involves a significant amount of CPU work,
the added cost of such a signal handler is negligible;
when the loop doesn't involve significant CPU work,
it can be coded in a way that includes a check for context-switch time,
so recovery is just toggling a flag and jumping back to the
unmodified original code, which will do everything perfectly
(see the sketch below).
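-
A minimal C sketch of that second case: a loop without significant CPU
work polls the switch flag itself (names as in the PAUSE sketch above,
still illustrative):

    #include <signal.h>

    extern volatile sig_atomic_t switch_requested;  /* as in the PAUSE sketch */
    extern void yield(void);

    /* A loop dominated by I/O rather than CPU work checks the flag
     * itself, so its recovery routine need only set that flag. */
    void copy_stream(int (*get)(void), void (*put)(int))
    {
        int c;
        while ((c = get()) >= 0) {
            put(c);
            if (switch_requested)
                yield();
        }
    }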
-
Currently, even Linux (>2.1.8) uses a
restricted version of this technique, developed independently by Linus:
software checks for the validity of user-memory accesses from the kernel
are replaced by a hardware check with a code-recovery technique.
Of course, all the complicated code can be automatically generated
from a few high-level macros.
-
Machines with lots of registers or register banks
may have disjoint registers/banks for real-time and normal threads.
-
Note that this way of handling thread scheduling
doesn't interfere in any way with critical real-time interrupt handling,
which wouldn't wait for the scheduled thread to recover,
but would take place before the recovery routine is called.
-
Hence, there could be a hierarchy of threads,
some threads being able to preempt others,
while those other threads cooperate with each other.
A thread could be made of subthreads
that cooperate with each other or not, etc.
-
Let us give a simple argument why some kind of cooperation or other
is always needed:
there are always lots of extensions and resources
that appear and disappear dynamically,
with every thread using only a few of them.
A preempter that pretended to keep track of them all
when preemptively saving the context
would either do lots of costly useless work
(e.g. saving the state of the graphics card),
or fail to manage the resources.
The preempted thread knows better what needs to be saved or not
("drop that screen -- I'll redraw it faster than you can dump & save it,
if I count all the page misses and swapping you'll incur!").
Surely, lazy saving and shallow-binding techniques can (and should) be used
to reduce the overhead of thread switching.
But as in other domains where it is used,
lazy evaluation is known to always be less efficient
whenever the evaluation is known to be strictly required.
Shallow binding involves a permanent run-time overhead.
And none of these techniques solves the problem of the atomicity of
GC operations, etc., independently from the preempter.
A non-cooperative preempter means either that these resources cannot be
taken into account and must be handled elsewhere,
which is not solving the problem,
or that the preempter will end up doing the costly conservative saving
described above anyway.
Modules to implement
- generic memory management modules
- A generic generational garbage collector
with full support for persistence
(checkpointing and restarting of objects).
- A generic module for inter-heap (perhaps distributed)
synchronized persistent garbage collection
- A generic module for back-reference, unsynchronized,
persistent, conservative garbage collection.
- generic support for chunks of executable binary
- generic full support for GOOSE modules.
- mechanisms for the migration manager
(from the Migration subproject) to use:
swapping to disk, to a compressed RAM zone,
to another host, etc.
- low-level management of user interface hardware
- A console adapter for text mode
- A text-mode adapter for serial consoles
(perhaps just use the ncurses package)
- A screen windowing multiplexer for text-mode screens
- A generic output-synchronizing input multiplexer
- A standard combination of the previous two.
- A fast graphic library to attract game programmers
- File formats
(I guess these should go to the HLL stdlib...)
- A partition manager, that multiplexes hard disks according to
standard partitioning methods.
- Support for various existing file systems:
MS-DOS FAT FS (and WindowsNT or Linux UMSDOS extensions),
Linux EXT2 FS, etc.
- Graphic file formats:
GIF, JPEG, PCX, TIFF, etc.
- Various Audio file formats for sound samples
Note: Why GC?
Garbage collection is just NEEDED in a dynamic system
(like, for instance, a distributed multi-user multi-tasking
interactive development system).
You can't escape it, because there's NO WAY to manually or statically
determine when an arbitrary word will no longer be used.
Surely, on a non-interactive, non-development, single-threaded embedded
system, GC can be done manually and statically.
But this is just not the general case, which TUNES claims to handle.
I'd be happy if the TUNES GC could be stripped out by the compiler
when producing standalone code that does not need it;
but the "main" reflective development system just HAS TO implement it.
See the
GC FAQ
or the GC introduction text from Smalltalk/X
for more insight.
Please note that GC exists independently of the way it is implemented:
- All GC techniques exist independently of
their being done manually or automatically,
with or without compiler help,
in a language that can express the higher-order nature of
GC constructs or one that forces you to manually transcribe it
in an error-prone way, etc.
- In particular, all the manual memory management techniques
traditionally used in the particularly inexpressive
assembly, C, and C++ languages ARE some kind of GC technique,
though considerably slower, less efficient, and more error-prone,
in both development time and execution time,
than the automatic techniques that expressive languages
allow compilers to generate.
Qualities of a GC:
- The most important quality for a GC is that it be Sound,
that is, it won't free memory from objects still in use.
- Next, a GC is all the more Lively as it can
soundly survive arbitrary kinds of allocation patterns,
with higher amounts of live memory (from needed objects) being used
for longer periods.
- A sound GC that cannot guarantee liveliness is called "conservative".
- A GC where the allocatability of new objects depends only on
the types/sizes of the live objects and the new objects,
not on the history of allocations of live and dead objects,
is said to be immune to the "swiss cheese" syndrome.
Other GCs are said to be all the more subject to this syndrome
as their allocation history prevents new objects from being allocated.
In particular, all conservative GCs are subject to it.
- Finally, a GC is hard-real-time if it can guarantee a bound
on the time spent during the allocation of a new object
that is proportional to the size of the new object.
The main GC techniques:
- Freeing no memory is called the "trivial GC".
It's sound, hard-real-time, but the least lively GC
that does not waste memory on purpose.
- Building all memory-freeing operations statically into the program,
without maintaining any dynamic information about cross-references,
is called "static GC";
but you can only free what you statically know about,
so it is not maximally lively.
It can be lively only when you can allocate everything on a tree
and eventually cut all dead branches.
- Automatic compiler-generated stack allocation IS
a limited kind of such GC,
particularly efficient when using a single-threaded model
without data sharing,
but not very sound or lively in the general case.
- Counting the number of references to heap-allocated objects
IS some kind of GC, called "reference counting"
(see the sketch after this list).
It is not lively, because it won't reclaim circular references
(see circular or doubly-linked structures, which are so
essential to many efficient algorithms).
- Having back-references for all references, instead of just counting them,
IS some kind of GC technique, called "reference listing".
It is a useful technique for sound GC
in an asynchronous distributed environment;
but if not used in conjunction with other techniques,
it has the same liveliness as reference counting on local GCs.
- Tracing live objects as those reachable from a set of "roots" is
the ultimate GC technique in the general case;
used in conjunction with a moving technique (copying or compacting),
it alone can solve the swiss-cheese syndrome.
- Linear objects have a trivial implementation of reference listing:
if you ever reach such an object,
then you just used the only reference to it,
hence you don't need to consult any "backpointer"
to find that reference...
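- As promised above, a minimal C sketch of reference counting; it is
illustrative only, and it does indeed leak circular structures:

    #include <stdlib.h>

    struct obj {
        long refcount;
        struct obj *child;          /* a single outgoing reference, for brevity */
    };

    struct obj *retain(struct obj *o)
    {
        if (o) o->refcount++;
        return o;
    }

    /* Drop a reference; free the object and cascade when it hits zero. */
    void release(struct obj *o)
    {
        while (o && --o->refcount == 0) {
            struct obj *dead = o;
            o = o->child;           /* cascade into our own reference */
            free(dead);
        }
    }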
To Do on this page
- Find all LLL requirements.
- Find all implementation problems.
- Make first proposals.
- Divide into subfiles.
- Define separately semantics & representation
- Find out how multiple architectures can be made interoperable...
- Find some nice implementation for annotations...
- Separate annotation mechanisms from annotation resolving policies...
- Open a "Store" subproject about encodings and algorithms for
the dynamically and actively annotable, distributed, persistent,
garbage-collected store. Actually, distribution, persistence,
and garbage-collection could be obtained
by proper active annotations...
- Add a pointer to
Patrick Premont's
sources.
Page Maintainer:
Faré
-- rideau@clipper.ens.fr