One way of understanding the problem described in “Into the Abyss” is in terms of the domain of streams being used.

The data-push model, formulated originally by Gilles Kahn, models a stream as either a finite or an infinite sequence of data items. The infinite sequences represent ‘healthy’ streams that run without interruption – in particular, without deadlock. They represent streams produced by devices that never freeze up, that always eventually produce one more output item.

The finite streams, on the other hand, represent computations that at some point do freeze up – streams produced by devices that at some point permanently stop producing output. In Kahn’s data-push model, deadlock is indeed permanent. If (say) the index 6 component of a stream never appears, neither will the index 7, 8, 9, etc. components. Kahn asserted that his least fixed point domain semantics corresponded exactly to the data-push computation model, and he was right. (My first PhD student, Tony Faustini, gave the first formal proof of this claim in his 1982 PhD thesis.)

As we pointed out in the last Lucid post, the problem is that the Kahn domain does not validate the axioms and inference rules given in the published work on Lucid, including in particular the vital if-then-else rules:

if true then X else Y = X

if false then X else Y = Y

In terms of domains, these rules assume a domain of streams in which every stream is formally infinite, but in which some of the components may be ⊥, the ‘result’ of a nonterminating computation. We can think of the Kahn domain as the subset of this Lucid domain in which, once ⊥ appears, every component thereafter is also ⊥.
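To make the contrast concrete, here is a minimal Python sketch (the function and names are my own illustration) of an ‘intermittent’ stream: its index 1 component is ⊥, yet later components are perfectly well defined. A data-push producer could never emit the index 2 value, but a demand-driven consumer can simply skip the divergent component.

```python
def stream(i):
    """The index-i component of an intermittent stream."""
    if i == 1:
        # index 1 is 'bottom': this computation never terminates
        while True:
            pass
    return i * 10

# Demanding components 0 and 2 directly never touches index 1:
values = (stream(0), stream(2))
```

In the Kahn domain this stream simply does not exist: once one component fails to appear, so do all later ones.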

We were faced with a dilemma. One choice was to adopt the Kahn dataflow model and (therefore) along with it the Kahn stream domain, which did not obey the nice rules and axioms given in the Lucid publications.

The other choice was to find an alternate computation model corresponding to the “intermittent streams” domain. Fortunately David May (and, simultaneously, other implementors) found just such a model: tagged, demand-driven dataflow, later called *eduction*.

The demand-driven system abandons any attempt to interpret sequence index as corresponding to computation time. Instead, the i-th value of a stream is not computed until an explicit demand for it is generated. The program begins by demanding the index 0 value of the output. These demands propagate backwards through the defining expressions of the program variables and generate demands for other values at possibly different indices. The indices of demands are not necessarily in increasing order, and some values of a stream may not be demanded at all. It is not hard to see that this model supports intermittent streams. In fact it can be shown that an eductive (demand-driven) computation of a program corresponds exactly to the least fixed point of the program over the domain of intermittent streams.
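As a sketch of how demands propagate, here is a tiny hand-rolled eductive evaluator in Python (the two-variable mini-program and all names are invented for illustration, not taken from any actual Lucid implementation). Demanding `evens` at index 3 generates a demand for `nat` at index 3, which in turn demands `nat` at indices 2, 1, 0 – demands here arrive in decreasing index order, and only the values actually demanded are ever computed.

```python
def demand(var, i):
    """Return the index-i value of var by propagating demands
    backwards through its defining equation."""
    if var == "nat":                      # nat = 0 fby (nat + 1)
        return 0 if i == 0 else demand("nat", i - 1) + 1
    if var == "evens":                    # evens = 2 * nat
        return 2 * demand("nat", i)
    raise KeyError(var)

# The computation begins with a demand for a value of the output:
demand("evens", 3)  # -> 6
```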

Unfortunately the eduction model in its simplest form is not practical – there is no provision for storing the results of intermediate computations that may be needed more than once. This problem does not arise in data-push dataflow. Or rather, it does, but it becomes the responsibility of the programmer, who has to ensure that when needed multiple copies are produced and sent to the right destinations. Sometimes this can be quite a challenge.

In eduction this is handled automatically, through the use of a cache called the *warehouse*. Every time the index i value of variable J is computed, this value is stored in the warehouse with the identifying tag (i,J). If later on the index i value of J is demanded again, the interpreter first looks in the warehouse (which is an associative store) and launches a computation only if no value tagged (i,J) is found.
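In Python terms, the warehouse amounts to memoization keyed by (index, variable) tags. Here is a self-contained sketch (the mini-program and names are again invented for illustration):

```python
warehouse = {}   # associative store: (i, var) -> value

def educe(var, i):
    key = (i, var)
    if key in warehouse:           # warehouse hit: no recomputation
        return warehouse[key]
    if var == "nat":               # nat = 0 fby (nat + 1)
        value = 0 if i == 0 else educe("nat", i - 1) + 1
    elif var == "evens":           # evens = 2 * nat
        value = 2 * educe("nat", i)
    else:
        raise KeyError(var)
    warehouse[key] = value         # tag the value and store it
    return value
```

After `educe("evens", 3)` the warehouse holds values tagged (0,"nat") through (3,"nat") plus (3,"evens"), so any later demand for one of them is a lookup rather than a recomputation.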

This scheme, however, is still not practical, because the warehouse quickly fills up with data that in fact will never be accessed. We had to devise a scheme for periodically garbage collecting the warehouse. This was no easy task, since in general there is no way to predict whether or not the value tagged (i,J) will ever be needed again. We did, however, discover a heuristic, which we called the *retirement plan*, that dynamically calculated (for each variable) the number of collection cycles for which the values of the variable should be retained.
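One plausible shape for such a scheme – an assumption on my part, not the published heuristic, and with the per-variable retention counts given as fixed constants rather than calculated dynamically – is a ‘lease’ of some number of collection cycles on each stored value, renewed on access and decremented at each cycle, with expired entries evicted:

```python
retention = {"nat": 3, "evens": 1}   # cycles to retain, per variable
                                     # (illustrative constants only)
warehouse = {}                       # (i, var) -> [value, cycles_left]

def store(var, i, value):
    warehouse[(i, var)] = [value, retention[var]]

def lookup(var, i):
    entry = warehouse.get((i, var))
    if entry is None:
        return None                  # a miss: the value must be recomputed
    entry[1] = retention[var]        # an access renews the lease
    return entry[0]

def collect():
    """One collection cycle: age every entry, evict the expired ones."""
    for key in list(warehouse):
        warehouse[key][1] -= 1
        if warehouse[key][1] <= 0:
            del warehouse[key]
```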

(The existence of the warehouse and its storage management heuristic should make it clear that eduction is a computation model that differs essentially from standard imperative assignment-based computing, from Kahn dataflow, and from lazy LISP-style models like that of Haskell. Not necessarily uniformly better, but very different.)

We soon discovered that eduction brought benefits other than compliance with the official semantics, axioms and rules. Eduction opened up a new possibility not available in data-push dataflow: multidimensionality. Why not have streams where a value depends on more than one index – say, a ‘space’ coordinate as well as a time coordinate? In particular, one unexpected application of having extra dimensions was a novel dataflow approach to user-defined functions, including recursive ones.
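As a sketch of what multidimensionality looks like, here is a two-dimensional stream in Python: a value now depends on a space index s as well as a time index t, and a demand carries both. (The heat-flow-style equations are my own illustration, not an example from the Lucid literature.)

```python
def heat(s, t):
    """The value of the stream at space index s and time index t."""
    if t == 0:
        return 100.0 if s == 0 else 0.0   # initial profile at time 0
    if s == 0:
        return heat(0, t - 1)             # boundary point: unchanged
    # each point relaxes toward its left neighbour at the previous time
    return (heat(s, t - 1) + heat(s - 1, t - 1)) / 2.0
```

Eduction accommodates this with no change to the machinery: the warehouse tag simply grows from (i, J) to (s, t, J).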

These posts continue to be very interesting.

I’d like to know more about how the retirement plan worked and what the novel approach to user-defined functions was.

If I remember it correctly, the default way of handling a warehouse entry was to give it a certain reference count from the start, increment it at every access, and decrement it at every collection cycle.

What is pretty cool is that memory is used only for speeding up computation – a warehouse miss is simply recomputed. This gives great leeway for possible parallel computation.

Thanks Bill. Very interesting.

Could you explain a bit more about the difference between the evaluation strategy of Lucid and Haskell? To me the warehouse/retirement system sounds isomorphic to having a heap for thunks with garbage collection, but I am likely missing something here.

I’ll be saying more about the retirement plan in a future post.

-ww