next_inactive up previous


Another look at Reference Counting

Chetan Vaity (02329901)

27 July, 2003

Abstract:

Garbage collection using reference counting has some serious drawbacks: incapability of reclaiming cyclic objects, extra processing in adjusting the reference counts and extra space for maintaining reference counts. But it also offers several exciting advantages, which become quite appealing in a real-time environment where long pauses, characteristic of tracing collectors, cannot be tolerated. As processors become faster, the extra processing required in reference counting may not matter for some applications.

1 Introduction

Reference counting systems perform automatic memory management by keeping a count in each object, usually in a header, of how many references there are to the object. Objects to which there are no references cannot be accessed by the mutator(the actual running program); they are therefore dead and may be reclaimed. The reference count is incremented for each new reference, and is decremented if a reference is overwritten. If a reference count falls to zero, then the object is no longer required and the memory can be reclaimed.

When an object is reclaimed, its pointer fields are examined, and any objects it holds pointers to also have their reference counts decremented, since references from a garbage object do not count in determining liveness. Reclaiming one object may therefore lead to the transitive decrementing of reference counts and reclaiming many other objects.

2 Advantages

2.1 Reference counting is incremental

The reference counting collector operations (adjustment of reference counts and reclamation when counts hit zero) are both interleaved with the execution of the program, because they may occur whenever a pointer is created or destroyed. One of the important advantages of reference counting is this incremental nature of most of its operation. It can be made real time, that is, performing at most a small and bounded amount of work per unit of program execution. Clearly, the normal reference count adjustments are intrinsically incremental, never involving more that a few operations for any given operation that the program executes. Sometimes, large data structures may become free due to a ``root'' reference count becoming zero. Such transitive reclamation of whole data structures can be deferred, and also done a little at a time, by keeping a list of freed objects whose reference counts have become zero but which haven't yet been processed.

This incremental collection can satisfy ``real time'' requirements, guaranteeing that memory management operations never halt the executing program for more than a very brief period. This can support applications in which guaranteed response time is critical; incremental collection ensures that the program is allowed to perform a significant, though perhaps appreciably reduced, amount of work in any significant amount of time.

The JOSES project [JOS] was initiated to develop an optimizing Java compiler environment producing high performance native code. The real-time garbage collection used is based on reference counting.

In JOSES, garbage collection based on reference counting is adopted, since it distributes the garbage collection overhead throughout the program: each creation or destruction of a reference to a garbage collected object involves maintenance of the reference count. Redundant reference count updates are removed by a peephole optimizer. Although reference counting solves the problem of the unbounded execution time of an allocation statement, a straightforward implementation would shift the problem to the de-allocation statement due to cascading de-allocation. When an object is de-allocated, the object is scanned for child references. The reference counts of its children are decreased. If the reference count of any of the children falls to zero, that child is also de-allocated. This makes the execution time of the de-allocation statement unpredictable. This problem is tackled by not recursively de-allocating the child objects, but by adding them to to-be-freed lists which are processed during later allocation or de-allocation statements.

2.2 Finalization

In many languages, it is often necessary to perform actions on some objects after they are no longer in use and before their memory can be recycled. These actions are known as finalization or termination. A common use of finalization is to release resources when the corresponding ``proxy'' object dies. For example, an open file might be represented by a stream object. When this object has been proven dead by the collector, it is certain that the file is no longer in use by the program, and it can and should be closed before the stream is recycled. Note that finalization is not, in general, guaranteed to be prompt, and this can cause problems if it is used to manage scarce operating system resources such as file descriptors. Reference counting collectors usually detect dead objects immediately and hence finalization can be carried out without delay. This cannot be said for most of the tracing collectors.

2.3 Memory usage

The immediacy of reclamation in reference counting systems can have advantages for overall memory usage. A reference counting system may perform with little degradation when almost all of the heap space is occupied by live objects, while other collectors rely on trading more space for higher efficiency. In mark and sweep collectors, for example, the cost of collection is proportional to the size of the heap, including both live and garbage objects. In reference counting systems, on the other hand, the cost is proportional to the work done by the running program. When memory space is scarce, reference counting systems are attractive because their performance is independent of the ratio of live data to total storage.

Another aspect is that the collector's working set closely matches that of the main application, resulting in excellent locality. That is, the objects which are to be reclaimed are almost certain to be in main memory.

3 Optimizations

3.1 Hybrid strategy

One of the most important drawbacks of reference counting is its inability of collecting cyclic structures. If any objects are part of a cyclic data structure then they will always have a non-zero reference count, and hence won't be reclaimed when they are dead. Some implementations (Perl and Python) simply ignore this problem and let the cyclic data build up.

A hybrid strategy of using reference counts along with a tracing garbage collector was proposed by Deutsch and Bobrow [DB76]. Standard reference counting is deployed and reclaims objects whose reference counts fall to zero. Of course, self referential structures will not be collected. For this purpose, a tracing garbage collector is used. Most of the inaccessible objects are reclaimed by the reference counting scheme and hence the runs of the tracing collector can be less frequent. In fact, the user can be given the choice (responsibility) for choosing a convenient moment to initiate a tracing collector run.

Python (version 2 on-wards) uses such a strategy.

3.2 Deferred reference counting

Consider the code to swap the (reference-counted) values contained in two variables a and b which naively performs the following operations:

incrc(a); tmp := a;
incrc(b); decrc(a); a := b;
incrc(tmp); decrc(b); b := tmp;
decrc(tmp); (tmp goes out of scope)

After six reference count operations, the reference counts of the objects referred to by a and b (now by b and a) are unchanged. It is more efficient to recognize this and leave the reference counts unchanged.

Much of the cost of unnecessary increment/decrement operations can be optimized way by a special treatment of local variables [DB76]. Reference counts are only adjusted to reflect pointers from one heap object to another. This means that reference counts may not be accurate because pointers from the stack may be created or destroyed without being accounted for; which in turn means that objects whose count drops to zero may not be actually be reclaimable. Garbage collection can only be done when references from the stack are taken into account as well. Such objects are added to a zero count table (ZCT) instead. Periodically, the reference counts are brought up to date by scanning the stack for pointers to heap objects. Then, any objects in the ZCT whose reference counts are still zero may safely be reclaimed. The interval between these phases is generally chosen to be short enough that garbage is reclaimed often and quickly, yet still long enough that the cost of periodically updating the counts (for stack references) is not high.

Deferred reference counting has been used successfully with several languages, notably Smalltalk.

3.3 One-bit reference counts

A very small count field, perhaps only a single bit, can be used to avoid the need for a large field per object [Wil92]. Given that deferred reference counting avoids the need to continually update the count of pointers from the stack, a single bit is sufficient for most objects. The minority of objects whose reference counts are not zero or one cannot be reclaimed by the reference counting system, but are caught by a fall-back tracing collector.

4 A reference counting garbage collector for multiprocessors

A fully concurrent pure reference counting garbage collector has been implemented [BAL$^+$01] [BAR$^+$01] in the Jalapeno Java virtual machine running on shared memory multiprocessors. The authors believe that a fresh look at reference counting is justified as processor clock speeds increase while RAM becomes plentiful but not significantly faster. In this environment the locality properties of reference counting are appealing, while the purported extra processing power required is likely to be less relevant.

At the same time, Java's incorporation of garbage collection has thrust the problem into the mainstream, and large, mission critical systems are being built in Java, stressing the flexibility and scalability of the underlying garbage collection implementations. As a result, the supposed advantages of tracing collectors: simplicity and low overhead are being eroded as they are being made ever more complex in an attempt to address the real-world requirements of large and varied programs. Furthermore, the fundamental assumption behind tracing collectors, namely that it is acceptable to periodically trace all of the live objects in the heap, will not necessarily scale to the very large main memories that are becoming increasingly common.

Here, a short description of the algorithm used is presented. Initially, the synchronous collector is described followed by the concurrent version.

4.1 Synchronous version

The reference counting collector, as described here, does not handle cyclic structures (Extensions to the basic algorithm to handle these are detailed in the references). It can support multiple mutator threads running on multiple processors, but the collection itself is synchronous (also known as stop-the-world ). This collector will then be extended to handle concurrent collection.

Every object has a reference count field, which for object S is denoted RC(S). During normal operation, the collector only keeps track of writes to the heap, while ignoring writes to the stack. RC(S) may be out of date, but it is guaranteed that it will not reach zero until the object is garbage.

This guarantee is maintained as follows: increments and decrements to the heap are deferred by writing them into buffers, called Inc and Dec, respectively. When the program runs out of memory, a collection is triggered, which performs the following steps:

  1. scan the stack of each thread and increment the reference count for each pointer variable in the stack
  2. perform the increments deferred in the Inc buffer
  3. perform the decrements deferred in the Dec buffer, recursively freeing any objects whose RC drops to 0
  4. clear the Inc and Dec buffers
  5. scan the stack of each thread again and place each pointer variable into the Dec buffer.
The effect of these steps is that any pointers which are on the stack are accounted for during the current collection as increments, and during the next collection as decrements. By accounting for them during the current collection as increments, it is ensured that objects pointed to by the stack are not collected. By accounting for them during the following collection as decrements it is ensured that if field RC(P) was incremented because P was on the stack, and P is subsequently popped off the stack, the reference count field will be appropriately decremented.

4.2 Concurrent version

The concurrent version (Recycler) is a producer-consumer system: the mutators produce operations on reference counts, which are placed into buffers (INC and DEC) and periodically turned over to the collector, which runs on its own processor. The collector is single-threaded, and is the only thread in the system which is allowed to modify the reference count fields of objects.

During mutator operation, updates to the stacks are not reference-counted. Only heap updates are reference-counted, and those operations are deferred with by storing the addresses of objects whose counts must be adjusted into mutation buffers, which contain increments or decrements. Objects are allocated with a reference count of 1, and a corresponding decrement operation is immediately written into the mutation buffer; in this manner, temporary objects never stored into the heap are collected quickly.

Time is divided into epochs, which are separated by collections which comprise each processor briefly running its collector thread. Epoch boundaries are staggered; the only restriction being that all processors must participate in one collection before the next collection can begin.

Periodically, some event will trigger a collection cycle: either because a certain amount of memory has been allocated, or because a mutation buffer is full, or because a timer has expired. In normal operation, none of these triggers will cause the mutator to block; however, they will schedule the collector thread to run on the first processor.

On the first processor when the collector thread wakes up it scans the stacks of its local threads, and places the addresses of objects in the stack into a stack buffer. It then increments its local epoch number, allocates a new mutation buffer, and schedules the collector thread on the next processor to run. Finally, it dispatches to the thread that was interrupted by collection.

The collector thread performs these same operations for each processor until it reaches the last processor. The last processor actually performs the work of collection.

The last processor scans the stacks of its local threads into a stack buffer. Then it processes increments: the reference count of each object addressed in the stack buffer for the current epoch computed by each processor is incremented. Then the mutation buffer for each processor for the current epoch is scanned, and the increment operations it contains are performed.

To avoid race conditions that might cause the collector to process a decrement before the corresponding increment has been processed, not only must the increment operations be processed first, but the decrement operations must be processed one epoch behind. So the last processor scans the stack buffers of the previous epoch, and decrements the reference counts of objects that they address, and then processes the mutation buffers of the previous epoch, performing the decrement operations.

During the decrement phase, any object whose reference count drops to zero is immediately freed, and the reference counts of objects it points to are recursively decremented.

Finally, the stack and mutation buffers of the previous epoch are returned to the buffer pool, and the epoch number is incremented. The collection has finished and all processors have joined the new epoch, and now any processor can trigger the next collection phase.

4.3 Results

On a multiprocessor environment, this algorithm runs without ever blocking the mutators, and achieves maximum pauses which are about 100 time shorter than mark-and-sweep collectors without sacrificing end-to-end execution time. Detailed results on various benchmarks are available in [BAL$^+$01].

5 Conclusion

This report has surveyed various aspects of reference counting garbage collectors and also explored some variations which deal with the drawbacks of this scheme. A concurrent pure reference counting algorithm for a multiprocessor environment with excellent performance and very short pauses for collections was also seen.

Bibliography

BAL$^+$01
David F. Bacon, C. Richard Attanasio, Han Lee, V. T. Rajan, and Stephen Smith.
Java without the coffee breaks: A nonintrusive multiprocessor garbage collector.
In SIGPLAN Conference on Programming Language Design and Implementation, pages 92-103, 2001.

BAR$^+$01
David F. Bacon, Clement R. Attanasio, V.T. Rajan, Stephen E. Smith, and Han B. Lee.
A pure reference counting garbage collector, 2001.

DB76
L. P. Deutch and D. Bobrow.
An efficient incremental automatic garbage collector.
In Communications of the ACM. Association for Computing Machinery, September 1976.

JOS
The joses project.
http://joses.pds.twi.tudelft.nl/rt-gc.html.

Wil92
Paul R. Wilson.
Uniprocessor garbage collection techniques.
In Proc. Int. Workshop on Memory Management, number 637, Saint-Malo (France), 1992. Springer-Verlag.


next_inactive up previous
Chetan Vaity 2003-11-28