The Tunes LLL Subproject

"Rem non spem, factum non dictum, amicus quaerit"



Contents of the LLL Subproject

Here are the current contents of this LLL project:

As you can see, the LLL subproject is still in a draft state :(
NOTE: existing code and corresponding technical comments lie in directory src/LLL of the source distribution.
Feel free to contribute!

You may also want to
look at what the Review subproject said about implementations of other systems,
go back to the main Tunes page, or
jump back to the Tunes Subproject page.


Goals for the Low Level Language

The goal of this Subproject is to define and implement a low-level language that will fulfill all the requirements to be used as a basis for the Tunes system, including self-extending into the full Tunes HLL, etc.
The LLL is not a back-end to the HLL, but a low-level core for its implementation.
Why an LLL, some ask?
Well, firstly, the LLL is meant as a somewhat portable way of implementing Tunes. It is a purely internal language that shouldn't be visible to most users, may disappear or change drastically at any new major release, and is intended to make implementation on many platforms as easy as possible. In a way, the LLL project is the basis of any implementation of Tunes.
Why put yet another intermediary between the human and the actual computations?
Sure, the LLL is another artificial layer of computerware, which brings its own amount of computing overhead. The hope is to minimize not runtime overhead (though we hope not to make it too large), or even compile-time overhead, but development-time overhead. Once Tunes runs, there will still be time to focus on compile-time and run-time speed. Moreover, "the LLL" is not to be understood as an identical abstract language on all machines (even C has different integer and pointer sizes according to the hardware), but as a general implementation model that differs according to the hardware. It is up to the HLL above to abstract away whatever needs to be. Every particular implementation may have its own version of the LLL, incompatible with those of other implementations, yet resembling them enough that all can be called "the LLL".
Is the LLL a standard virtual machine to achieve interoperability on heterogeneous networks?
Not exactly. Surely, virtual machines are an easy way to achieve interoperability, but they have serious performance drawbacks: interpretation of virtual machine code is a very slow process, which loses at least one order of magnitude in execution speed. Of course, if the VM code is properly designed, code can be compiled instead of interpreted; but then, the higher-level the code, the longer it takes to compile to efficient code; and interoperability means the VM code cannot be adapted into efficient low-level code for all possible targets, so a lot of compromises must be made. To sum up, virtual machine techniques are very useful whenever execution speed is not a major issue relative to compactness of code (with the associated speed-up and cost reduction in data transfer/storage). This is particularly the case when performance-critical primitives can be isolated in a little locally-optimized code, but it isn't enough for high-performance interoperable software. And very importantly, we refuse the idea of having one privileged virtual machine, built statically as a standard for all future work without knowledge of that work, or at best tuned for today's hardware and software, whereas future hardware and software performance constraints are most likely to change. Such a VM would be no better than the hardware-based standard that the IBM-PC under MSDOS has constituted up to now; perhaps with a simpler virtual hardware, but still a hardware-based approach. Surely, the Tunes project should allow and encourage VM techniques to be used whenever useful; some implementations might share the same VM, and the LLL might be a good basis for such a VM. But Tunes just won't embark on a universal VM as a central static concept. Also, a VM won't solve any problem but the interoperability of the software above it with the hardware below it.
Why isn't C or assembly the LLL?
It couldn't be C or assembly, because the LLL has to actively support the low-level memory model used to implement migration and annotation (which for a standalone system means persistence, GC, and weak pointers).
Don't you support assembly?
We do support assembly, as part of the LLL, and are aware that full control over the hardware is the only way to achieve near-optimal code. However, we are confident that eventually most programming should be done at a higher level, and that even low-level control could and should be done with high-level tools that guarantee that the low-level interactions have the right semantics while optimizing code. As for the cases where better code than the available optimizers produce is wanted, we feel that adding specialized optimizer tactics leads to safer, surer, more reusable and portable results that can be checked, whereas hand-written optimization leads to unsafe, short-lived code.
Aren't the LLL and the HLL subprojects partially or totally redundant?
Actually, the HLL/LLL concept is all about a harmony between complementary constraints of programming: low-level programming is meant to extract as much performance as possible from the computer hardware while still having useful semantics; high-level programming is meant to provide the most expressive and cleanly defined semantics while still being efficiently implementable. So we see these as complementary, not opposite or redundant.


LLL Requirements


LLL Semantics


Low-level objects

Annotations


LLL Difficulties

See HLL difficulties...

Hardware independence

At what level shall word size be decided?
How can objects migrate back and forth between machines with different byte sizes without loss of information?

Garbage Collection

Infix pointers (that do not point at some globally constant offset to the beginning of an allocation unit) greatly complicate the GC. They should be forbidden whenever possible.
"C"-like infix pointers can still be simulated with a segment-offset pair: an aligned pointer plus an offset inside the segment (see the sketch below).
The GC may have to accept infix pointers for code return addresses, or else the calling convention may become grossly inefficient.
I propose that code "segments" should not cross, say, 4K or 8K boundaries, so that finding the right code segment is just a matter of checking the list of segments in the page obtained by masking out the low bits of the return address.
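
As a purely illustrative sketch (none of these names come from the existing src/LLL code), the two points above could be rendered in C as follows: a "C"-like infix reference is kept as an aligned base pointer plus an offset, so the GC only ever relocates the base; and a return address is mapped back to its code segment by masking its low bits to find the 4K code page, then scanning that page's short segment list.

    #include <stdint.h>
    #include <stddef.h>

    /* A GC-friendly substitute for a C-style infix pointer: an aligned
     * pointer to the start of the allocation unit plus a byte offset,
     * so the GC only ever has to update the aligned base. */
    typedef struct {
        void   *base;               /* start of the allocation unit */
        size_t  offset;             /* position inside it           */
    } lll_ref;

    static inline void *lll_deref(lll_ref r) {
        return (uint8_t *)r.base + r.offset;
    }

    /* Return addresses: code segments never cross 4K boundaries, so the
     * containing code page is found by masking the low bits, and its short
     * segment list is scanned for the segment holding the address. */
    #define CODE_PAGE_BITS 12
    #define CODE_PAGE_MASK (~(((uintptr_t)1 << CODE_PAGE_BITS) - 1))

    struct code_segment { uintptr_t start, end; /* + GC info ... */ };
    struct code_page    { size_t count; struct code_segment *segs; };

    /* 'page' is the descriptor of the code page found at
     * (return_address & CODE_PAGE_MASK) in some page directory. */
    static struct code_segment *
    segment_of(struct code_page *page, uintptr_t return_address) {
        for (size_t i = 0; i < page->count; i++)
            if (return_address >= page->segs[i].start &&
                return_address <  page->segs[i].end)
                return &page->segs[i];
        return NULL;                /* not a code address of this page */
    }
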
Big problem: how to efficiently differentiate pointers from numbers, etc.?
Structural differentiation is powerful, but may slow the GC considerably unless descriptors are simple (for most objects, just an integer giving the length of a pointer array), and it forbids dynamic differentiation, mixing integers and pointers in an array (e.g. a simple stack), etc. That's why we'll use a simple bit pattern to differentiate integers (raw data) from pointers (structured data), and different kinds of pointers from each other (that's a "BIg Bunch Of Pages", or BIBOP, kind of GC).
We must choose between integers having the low bit set or the low bit cleared. Having it set (and thus having the bit cleared for pointers) may allow faster pointer access on RISC machines, but slows down any arithmetic. Having the bit set for pointers allows easier arithmetic, but forces the use of an offset for all memory accesses.
The big question is: shall integers be stripped of their low bit, which would simplify overflow-testing code to naught and make the implementation portable, but make it a little harder to do pointer arithmetic and to mix true integers with 31-bit ones? Or shall they be stripped of their overflow bit, which makes integer overflows generate GC-readjustable pointers rather than providing flat modulo arithmetic, but allows easy pointer arithmetic and mixing of 31-bit integers with 32-bit ones?
We shall implement both ways, and compare actual execution time and code space measurements!!!
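
To make the two alternatives concrete, here is a hedged C sketch of both tagging conventions (the macro names are made up for illustration): in scheme A, integers carry the low tag bit, so pointers are used untouched but integer arithmetic needs adjusting; in scheme B, pointers carry the low tag bit, so arithmetic on the (shifted) integers is direct but every memory access pays a constant offset.

    #include <stdint.h>

    typedef uintptr_t cell;   /* one machine word: tagged integer or pointer */

    /* Scheme A: low bit set on integers, cleared on (aligned) pointers.
     * Pointer access is direct; integer arithmetic must compensate the tag. */
    #define A_MK_INT(n)    ((((cell)(n)) << 1) | 1)
    #define A_INT_VAL(c)   ((intptr_t)(c) >> 1)
    #define A_IS_INT(c)    ((c) & 1)
    #define A_MK_PTR(p)    ((cell)(p))            /* pointer used as-is */
    #define A_PTR_VAL(c)   ((void *)(c))

    /* Scheme B: low bit set on pointers, cleared on (shifted) integers.
     * Tagged integers add/subtract directly; dereferencing pays a -1 offset. */
    #define B_MK_INT(n)    (((cell)(n)) << 1)
    #define B_INT_VAL(c)   ((intptr_t)(c) >> 1)
    #define B_IS_PTR(c)    ((c) & 1)
    #define B_MK_PTR(p)    (((cell)(p)) | 1)
    #define B_PTR_VAL(c)   ((void *)((c) & ~(cell)1))
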
A high-level page directory is used to determine the GC type of an object according to the page it is in. It is a multi-level hashed structure that may evolve with the GC code, so that the type of objects can be found quickly; typically a mix of arrays and balanced binary trees to recognize bit patterns.
The GC type of an object, as determined by its address, gives us routines to update the object during a GC, to destroy the object when it is no longer accessed, etc.
The GC type of a page chunk allows us to track down the beginning of the individual objects pointed to on the page (in case infix pointers are used); it also gives us the policy to follow when swapping out the page (which may be copying the page to disk, sending it over the network, or compressing it in memory for possible later actual swapping out of memory, etc.).
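
A hedged C sketch of such a directory follows; the names, the function-pointer fields, and the 10+10+12 split of a 32-bit address are illustrative assumptions, not taken from the existing code.

    #include <stdint.h>
    #include <stddef.h>

    /* One "GC type" per kind of page: the per-object and per-page policies
     * mentioned above, as function pointers. */
    struct gc_type {
        void  (*update)(void *obj);            /* adjust the object during a GC    */
        void  (*destroy)(void *obj);           /* finalize when no longer accessed */
        void  (*swap_out)(void *page);         /* copy/compress/send the page away */
        void *(*object_base)(void *page, void *infix); /* object start from infix  */
    };

    #define PAGE_BITS 12                       /* 4K pages                    */
    #define DIR_BITS  10                       /* 10 + 10 + 12 = 32-bit space */
    #define DIR_MASK  ((1u << DIR_BITS) - 1)

    /* Two-level directory keyed on the high bits of an address; a real
     * implementation might mix arrays with balanced trees, as said above. */
    struct gc_dir_leaf { struct gc_type *types[1u << DIR_BITS]; };
    struct gc_dir      { struct gc_dir_leaf *leaves[1u << DIR_BITS]; };

    static struct gc_type *gc_type_of(const struct gc_dir *dir, const void *addr)
    {
        uintptr_t page = (uintptr_t)addr >> PAGE_BITS;
        struct gc_dir_leaf *leaf = dir->leaves[(page >> DIR_BITS) & DIR_MASK];
        return leaf ? leaf->types[page & DIR_MASK] : NULL;
    }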

Persistence

Be careful with distributed persistence: always remember previous states until all transactions using them are finished and confirmed.
Real-time persistent objects will have to budget time for copying their data into buffers for the synchronized persistent store, even though this operation may be done only once in a while.
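
As a purely illustrative sketch (no such structure exists in the current code), the "remember previous states" rule could be represented like this: each persistent object keeps a chain of immutable snapshots, and a snapshot may only be discarded once no unconfirmed transaction still refers to it.

    #include <stddef.h>

    /* Illustrative only: one persistent object as a chain of immutable
     * snapshots.  A snapshot stays alive as long as some unconfirmed
     * transaction may still read it. */
    struct snapshot {
        void            *data;      /* immutable copy of the object's state */
        int              users;     /* unconfirmed transactions reading it  */
        struct snapshot *older;     /* previous remembered state            */
    };

    struct persistent_object {
        struct snapshot *current;   /* newest confirmed state */
    };

    /* A snapshot may be dropped only when no transaction uses it and it is
     * no longer the newest confirmed state of the object. */
    static int may_discard(const struct persistent_object *o,
                           const struct snapshot *s) {
        return s->users == 0 && s != o->current;
    }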


Implementation Ideas

Implementation language

We shall use the m4 preprocessor (or later our own HLL) to produce assembly source files for all our different target processors from mostly the same meta-source.
We shall use C as though it were a mostly regular assembler, with labels, jumps, etc., so that the same meta-source also produces the C source files. Hey, wasn't C once called a portable assembler? ;->
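
As a hypothetical illustration (this is not actual Tunes meta-source), here is what the C expansion could look like: virtual registers become globals, labels and jumps become C labels and gotos, so the same m4 macros could just as well emit real assembly for each target.

    #include <stdio.h>

    static long r0, r1;                 /* "registers" of the abstract machine */

    static void sum_to(long n)
    {
        r0 = n;
        r1 = 0;
    loop:                               /* LABEL(loop) in the meta-source */
        if (r0 == 0) goto done;         /* JZ(r0, done) */
        r1 += r0;                       /* ADD(r1, r0)  */
        r0 -= 1;                        /* DEC(r0)      */
        goto loop;                      /* JMP(loop)    */
    done:
        printf("sum = %ld\n", r1);      /* runtime call */
    }

    int main(void) { sum_to(5); return 0; }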

Modules

Modules have install/uninstall annotation fields explaining how to restore/resume the object from the state log as obtained from the persistent store. In general, this will be a call to a standard trusted high-level module. However, this can be low-level code in the very first bootstrapping modules...
This scheme can be used for migration in general: persistence, garbage collection, dynamic linking, etc.
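
As a purely illustrative sketch (the names and fields are assumptions, not drawn from the existing src/LLL code), such a module descriptor could look like this in C:

    /* Illustrative module descriptor: the install/uninstall annotations are
     * hooks telling the system how to restore or resume the module's objects
     * from the persistent state log.  In the very first bootstrapping modules
     * these hooks would be low-level code; later they would call standard
     * trusted high-level modules. */
    struct state_log;                     /* opaque handle on the persistent log */

    struct module {
        const char *name;
        int  (*install)(struct module *self, struct state_log *log);   /* restore/resume    */
        void (*uninstall)(struct module *self, struct state_log *log); /* checkpoint/detach */
    };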

Mixed cooperative/preemptive multithreading:

Preemption is necessary for real-time response as well as for insecure tasks; but cooperation allows far better context-switch times (on the i386, compare saving some 5 registers to saving the whole CPU+FPU+MMU state), not to mention trivial implementation of mutual exclusion and of all kinds of similar system-wide assumptions, such as those needed by a garbage collector.
The Tunes way, as utsl points out, is to allow the HLL to be preemptively multithreaded (or rather, it will manage seamlessly concurrent objects), while the LLL is cooperatively multithreaded.
Actually, if we define cooperation by the fact that you may rely on some software properties of the code, and preemption by the fact that a thread may lose execution without having to poll the scheduler, then the two techniques are not incompatible, as we'll see.
In traditional systems (as well as in emulation boxes for them), all software is so completely and deeply insecure that only uncooperative preemption keeps the system from crashing too often (examples of often-crashing cooperative non-preemptive systems are winblows 3.1 and MacOS). Now, Tunes is a secure system, and we can and shall use cooperation whenever possible (= almost always): we shall require standard compilers to generate cooperative code, that PAUSEs every now and then.
Of course, real-time threads need some kind of preemption, and so do untrusted environments (which emulation boxes are); but as long as these cannot access the preempted thread's data, or any resource shared by the preempted thread (like garbage-collected memory), everything is fine.
Actually, threads are interrupted and signaled by more privileged threads, but not preempted by equally privileged threads (they might be preempted by foreign threads, to which they can't compare privilege). This is why we rather speak of interruptible cooperative threads than of preemptible threads.
Now, a common problem with cooperation is that PAUSEs are often too irregular: having them too far apart from each other yields poor response time, while having them too near yields too many context switches and related overhead (even in the best case, lots of cache misses). We can solve this problem by requiring some fine PAUSE resolution, but doing an actual context switch only when the timer says so (by modifying the PAUSE code to just a RET/NEXT until it's time).
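
Here is a minimal, portable C approximation of that idea (illustrative names only; the actual proposal above patches the PAUSE code itself into a plain RET/NEXT, which the flag test below merely simulates):

    #include <signal.h>

    /* Set from the timer handler when it is time to switch; the handler
     * would be installed on the scheduling timer, e.g. via signal/sigaction. */
    static volatile sig_atomic_t switch_pending = 0;

    static void timer_tick(int sig) { (void)sig; switch_pending = 1; }

    /* Stand-in for the real cooperative context switch. */
    static void yield_to_scheduler(void) { switch_pending = 0; }

    /* The PAUSE that compiled cooperative code sprinkles around: nearly free
     * in the common case, an actual context switch only when the timer asked. */
    static inline void PAUSE(void) {
        if (switch_pending)
            yield_to_scheduler();
    }
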
This is perfect, but may slow tight loops significantly. One way out is to unroll tight loops enough that this cost is reduced. If that still isn't satisfactory, a more elaborate way involves an "it's time to switch" signal. Before entering a tight loop, some "interrupt recovery" routine is registered as the signal handler (and unregistered afterwards); if an interrupt happens while in the tight loop, the recovery routine is called, and it has the responsibility of putting the system back into a state that follows the usual calling conventions.
Typically, the routine will continue normal execution of the loop up to some modified point, which will jump back into the remainder of the recovery routine; the recovery routine can finish the work because it knows the exact point of execution in the loop (which could have been anywhere inside it). This modified loop might be achieved through safely self-modifying code, or safely runtime-generated code, if the CPU architecture doesn't make that excessively costly by e.g. requiring a flush of all of a big instruction cache afterwards; in case expensive cache flushing makes runtime code generation/modification too costly, a pre-modified copy of the code could be used, which wastes memory (but usually, whoever can afford bloated hardware with large instruction caches can afford some more memory, too).
I realize that there might be problems in some cases, like page faults, where the recovery routine just can't complete until the interrupt handler returns. This means that either the handler must live in a lower space and not need the GC invariants provided by interruptible code, or the recovery routine must be able to roll back rather than push forward execution, or critical routines must lock their pages in before entering recoverable mode, or any combination of the above.
In all cases, this makes the interrupt handling more complicated, because it has to determine the recovery routine and jump into it. But that's no problem, since interrupts are a most infrequent case compared to normal execution. Because of the space/time cost of recovery routines, they shall only be used for time-critical loops and recursive code, where inserting PAUSEs would be too costly. Wherever a tight loop involves a significant amount of CPU work, the added cost of such a signal handler is negligible; when the loop doesn't involve significant CPU work, it can be coded in a way that includes a check for context-switch time, so that recovery is just toggling a flag and jumping back to the unmodified original code, which will do everything perfectly.
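
The following C sketch illustrates only the simpler variant described at the end of the previous paragraph, where the loop keeps a cheap per-iteration check and the registered recovery routine merely toggles a flag; the self-modifying-code variant is not shown, and every name here is an illustrative assumption.

    #include <signal.h>

    static volatile sig_atomic_t please_yield = 0;
    static void (* volatile recovery)(void) = 0;   /* registered around tight loops */

    static void loop_recovery(void) { please_yield = 1; }

    /* Stand-in for the real cooperative context switch. */
    static void cooperative_yield(void) { please_yield = 0; }

    /* Timer handler: if a tight loop registered a recovery routine, defer to
     * it instead of switching immediately, so the loop reaches a safe point. */
    static void timer_tick(int sig) { (void)sig; if (recovery) recovery(); }

    long sum_array(const long *a, long n) {
        long s = 0;
        recovery = loop_recovery;                  /* register before the loop */
        for (long i = 0; i < n; i++) {
            s += a[i];
            if (please_yield)                      /* safe point: invariants hold */
                cooperative_yield();
        }
        recovery = 0;                              /* unregister afterwards */
        return s;
    }
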
Currently, even Linux (>2.1.8) uses a (restricted) version of this technique, developed independently by Linus: software checks for the validity of user-memory accesses from the kernel are replaced by hardware checks with a code-recovery technique. Of course, all the complicated code can be automatically generated from a few high-level macros.
Machines with lots of registers or register banks may have disjoint registers/banks for real-time and normal threads.
Note that this way of handling thread scheduling doesn't interfere in any way with critical real-time interrupt handling, which wouldn't wait for the scheduled thread to recover, but would take place before the recovery routine is called.
Hence, there could be a hierarchy of threads, some threads being able to preempt others, while those others cooperate with each other. A thread could be made of subthreads that cooperate with each other or not, etc.
Let us give a simple argument why some kind of cooperation or other is always needed: there are always lots of extensions and resources that appear and disappear dynamically, with every thread using only a few of them. A preempter that pretended to keep track of them all when preemptively saving the context would either do lots of costly useless work (e.g. saving the state of the graphics card), or fail to manage resources. The preempted thread knows better what needs to be saved or not ("drop that screen -- I'll redraw it faster than you can dump&save it, if I count all the page misses and swapping you'll obtain!"). Surely, lazy saving and shallow-binding techniques can (and should) be used to reduce the overhead of thread switching. But as in other domains where it is used, lazy evaluation is known to be always less efficient whenever the evaluation is known to be strictly required. Shallow binding involves a permanent run-time overhead. And all these techniques won't solve the problem of atomicity of GC operations, etc., independently of the preempter. A non-cooperative preempter means either that these resources cannot be taken into account and must be handled elsewhere, which is not solving the problem, or that the preempter must know about every such resource, which, as argued above, is costly and unmanageable.


Modules to implement


Note: Why GC?

Garbage collection is just NEEDED in a dynamic system (like, for instance, a distributed multi-user multi-tasking interactive development system). You can't escape it, because there's NO WAY to determine manually or statically when an arbitrary word will no longer be used. Surely, on a non-interactive, non-development, single-threaded embedded system, GC can be done manually and statically; but this is just not the general case that TUNES claims to handle. I'd be happy if the TUNES GC could be stripped out by the compiler when producing standalone code that does not need it; but the "main" reflective development system just HAS TO implement it.

See the GC FAQ or the GC introduction text from Smalltalk/X for more insight.

Please note that GC exists independently of the way it is implemented:

Qualities of a GC:

The main GC techniques:


To Do on this page


Back to the Tunes Home page or Subprojects page


Page Maintainer:
Faré -- rideau@clipper.ens.fr