If carcinization happens when languages evolve to be more like Rust, then what do you call it when Rust evolves to be more like Java? Caffeination?
Over this summer, I’ve had a decent amount of time to kill. What better way to spend a beautiful hot summer than sitting inside, staring at core dumps in a supposedly memory-safe language? I built a garbage collector - in short, a piece of software that manages allocations for another, more exciting piece of software. The cool part: I made it in Rust, for Rust - a language designed to eliminate garbage collection and which provides few facilities for making it work properly. Not only that, I managed to make my garbage collector work with relatively few compromises, which I’ll describe in further detail lower down.
If you don’t like reading, don’t care about the journey to implement it, or just want to get a sales pitch for the final result, check out the source code on GitHub here. You can also download it directly from crates.io.
Background
If you’re familiar with the details of Rust and its standard library, feel free to skip this section.
The core backing behind Rust’s memory model is affine typing and the borrow checker. Values may only be bound to one identifier at a time, and borrows (a.k.a. references) may not outlive the scope binding their referent.
For example, the following code is invalid:
let x = vec![1, 2, 3];
let y = x;
println!("{x:?}"); // compile error - x has already been moved
Normally, we work around this by borrowing against a binding, such as by making
y = &x
in the example above. However, we often need to share some heap-allocated value without
knowing which binding will live the longest. The solution to this problem is shared ownership via garbage
collection.
Rust’s standard library offers two simple reference-counted garbage collectors: the single-threaded Rc and its atomically reference-counted counterpart Arc. They operate by maintaining a reference count in each heap allocation. Under most circumstances, these work great, but they can’t handle cyclic references. With a little interior mutability, it’s trivial to defeat them.
use std::{cell::OnceCell, rc::Rc};
struct Foo(OnceCell<Rc<Foo>>);
let x = Rc::new(Foo(OnceCell::new()));
x.0.set(Rc::clone(&x));
// My foo has a reference to itself. It can never be freed!
This is why people actually get paid money to build garbage collectors. If using a reference counter were all you needed, a number of people working at Oracle would be out of a job.
Battle plan
We’d like to create some Gc
data structure with a similar API to Rc
and
Arc
, which can accept nearly any data type contained within, and still manage to detect and collect
cycles.
We have a few weapons at our disposal:
- Drop: Every non-copiable data type in Rust can implement the Drop trait to ensure some code is called every time it is dropped. In our case, we can implement Drop for Gc to try to glean some knowledge about when an allocation becomes inaccessible.
- Traits: We can construct some trait (in our case, let’s call it Collectable) as a mandatory requirement to be contained in a Gc. Creating this trait has some major downsides (libraries upstream of dumpster can’t implement it), but it’s a necessary evil.
However, our formidable tools are matched by equally formidable challenges:
- Undetectable moves. When a value is moved in Rust, there is no way to detect that fact from within that value. If we had some sort of trait like OnMove which could allow for a function to be called every time a value moved, we could use it to detect when a Gc moved inside another Gc, making it unrooted, and allowing us to create a simple mark-sweep collector. At this rate, though, we would just reinvent C++’s copy constructors.
- Variable-sized data. I foolishly decided to make it so that Gc could store ?Sized types, which is more flexible for library users (enabling things like Gc<[T]>).
- Type erasure. In a typical Rust program, generics are implemented via monomorphization, and no type information is retained at runtime. This makes it harder to clean up an allocation without prior context.
With these two relatively simple tools, we have enough to build a collector that can handle just about any data type stored inside.
My approach
We need to detect whether a Gc
is reachable without actually scanning the stack. I’ll start with a
few definitions, using graph-theoretical language.
- An allocation graph is a directed graph with a designated root node r whose indegree is zero.
- A node n in an allocation graph is said to be accessible if and only if there exists a path from r to n in the graph.
It should be clear to see why these definitions are useful to us. We can imagine each node being one allocation,
pretending that all data not indirected through a Gc
is part of an imaginary allocation for the
root. Additionally, each Gc
acts as an edge connecting two allocations. If an allocation is
accessible in the graph-theoretical sense, it’s still possible for a program to reach it, and if not, it’s safe
to free that allocation. Note that the indegree of a node is precisely equal to the reference count of its
corresponding allocation.
I’ll roughly outline our approach to determining whether some node n is accessible.
- Find D, the set of all descendants of n (including n itself). Construct the subgraph whose vertex set is D, preserving all edges connecting nodes in D.
- Find the set A of nodes in D whose indegree in this subgraph is not equal to their indegree in the full allocation graph.
- For each a in A, mark a and its descendants as accessible.
- If and only if n was marked as accessible, it is accessible.
Here’s a serpentine implementation in pseudocode, if that’s what you prefer.
def is_accessible(n):
    counts = {n: n.indegree()}
    for (_, m) in n.edges():
        dfs(m, counts)
    reachable = set()
    for (a, count) in counts.items():
        if count != 0:
            mark(a, reachable)
    return n in reachable

def dfs(n, counts):
    if n not in counts.keys():
        counts[n] = n.indegree()
        for (_, m) in n.edges():
            dfs(m, counts)
    counts[n] -= 1

def mark(a, reachable):
    if a not in reachable:
        reachable.add(a)
        for (_, b) in a.edges():
            mark(b, reachable)
It should be pretty clear that is_accessible runs in time linear in the size of the subgraph reachable from n (its nodes plus its edges).
Reference counting with extra steps
Our single-threaded Gc
will behave much like an Rc
, but with some minor details
changed.
First, we define a Gc
and the allocation it points to, a GcBox
.
pub struct Gc<T: ?Sized> {
ptr: NonNull<GcBox<T>>
}
#[repr(C)]
struct GcBox<T: ?Sized> {
ref_count: Cell<usize>,
value: T
}
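Construction and cloning aren’t shown above, but a minimal sketch of what they might look like (simplified; the real crate does extra bookkeeping that comes up later in this post) is:

use std::cell::Cell;
use std::ptr::NonNull;

impl<T> Gc<T> {
    pub fn new(value: T) -> Gc<T> {
        Gc {
            // Allocate the GcBox on the heap, starting with a reference count of one.
            ptr: NonNull::from(Box::leak(Box::new(GcBox {
                ref_count: Cell::new(1),
                value,
            }))),
        }
    }
}

impl<T: ?Sized> Clone for Gc<T> {
    fn clone(&self) -> Gc<T> {
        // A new handle to the same allocation just bumps the reference count.
        let box_ref = unsafe { self.ptr.as_ref() };
        box_ref.ref_count.set(box_ref.ref_count.get() + 1);
        Gc { ptr: self.ptr }
    }
}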
We can then hook into the Drop
behavior for our Gc
to make it all work.
impl<T: ?Sized> Drop for Gc<T> {
    fn drop(&mut self) {
        let box_ref = unsafe { self.ptr.as_ref() };
        match box_ref.ref_count.get() {
            0 => (), // the allocation is already being destroyed - do nothing
            n => {
                box_ref.ref_count.set(n - 1);
                if n == 1 || !is_accessible(box_ref) {
                    box_ref.ref_count.set(0);
                    unsafe {
                        drop_in_place(addr_of_mut!(self.ptr.as_mut().value));
                        dealloc(
                            self.ptr.as_ptr().cast(),
                            Layout::for_value(self.ptr.as_ref()),
                        );
                    }
                }
            }
        }
    }
}
I, being very clever, decided to use a sentinel value of 0 for the reference count when an allocation is being cleaned up (to prevent spurious double-frees).
However, this code has a problem - how the hell are we supposed to implement
is_accessible
?
Trait hackery
Our pseudocode for is_accessible
required us to be able to access the set of edges going out from a
node. Doing so is kind of hard. If we were writing this in C, we would scan the allocation on the heap, looking
for data that looked like they could be pointers into other allocations. In Rust, though, we can be a lot more
precise by adding a small constraint to every garbage-collected type.
If the garbage collector wasn’t enough, we’re bringing the visitor pattern over from Java. We force every
garbage-collected value to implement the
Collectable
trait, which will delegate some Visitor
to each of its garbage-collected
fields.
pub struct Gc<T: Collectable + ?Sized> {
// same as before ...
}
pub trait Collectable {
fn accept<V: Visitor>(&self, visitor: &mut V);
}
pub trait Visitor {
fn visit_gc<T: Collectable + ?Sized>(&mut self, gc: &Gc<T>);
}
For example, an array might implement it by delegating to
accept
on each of its elements.
impl<T: Collectable> Collectable for [T] {
fn accept<V: Visitor>(&self, visitor: &mut V) {
self.iter().for_each(|elem| elem.accept(visitor));
}
}
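As another illustration, a hypothetical user-defined type (not one from the crate) would implement the trait by visiting every field that could transitively contain a Gc and ignoring the rest:

struct Node {
    label: String,
    edges: Vec<Gc<Node>>,
}

impl Collectable for Node {
    fn accept<V: Visitor>(&self, visitor: &mut V) {
        // `label` cannot contain a Gc, so only the edges need to be reported.
        for edge in &self.edges {
            visitor.visit_gc(edge);
        }
    }
}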
We can now write code that finds the outgoing edges from each allocation by simply applying a visitor to the
allocation. I ended up writing two visitors - one for the dfs
step and one for the
mark
step.
I would have liked to make Collectable object-safe, but that would also require Visitor to be object-safe. That, in turn, would require every Gc to be coercible to Gc<dyn Collectable>, which is only possible on nightly, making it impossible to write a stable crate.
Bulk dropping
A cursory analysis of the implementation of Drop
above shows that the call to
is_accessible
runs in linear time with respect to the total number of Gc
s existing.
You would be excused for thinking that drop
, as written, runs in linear time. However, the call to
drop_in_place
could also result in another Gc
being dropped, yielding a quadratic-time
cleanup. This is unacceptable, especially since our first call to is_accessible
actually did all
the work of determining whether the allocations are reachable from the original one - if the original allocation
being freed was inaccessible, any allocation referred to by it should also be inaccessible.
We can save our time on marking by a pretty simple method: we’ll do one more pass through the reference graph,
recording a pointer to each allocation to be destroyed and its implementation of drop
. Then, once
we’ve found every inaccessible allocation, we’ll drop them all in one quick go.
There’s one final problem, though: each Gc
owned by a dropped value could itself try to manipulate
the reference counts of its pointees, resulting in undefined behavior, or worse, slow code. However, all
problems in computer science can be solved by adding more state or more indirection. We opt to solve it with
extra state this time, creating a thread-local value called COLLECTING
.
When we do a bulk-drop, we set COLLECTING
to true
. Then, when we call
drop
on a Gc
, it checks the state of COLLECTING
. If
COLLECTING
is true, the Gc
does nothing when dropped (not even affecting its reference
count). To handle outbound edges to still-accessible allocations, we add one more traversal before the final cleanup.
Let’s take a look at what the code for that roughly looks like now.
thread_local! {
static COLLECTING: Cell<bool> = Cell::new(false);
}
impl<T: Collectable + ?Sized> Drop for Gc<T> {
    fn drop(&mut self) {
        if COLLECTING.with(Cell::get) { return; }
        let box_ref = unsafe { self.ptr.as_ref() };
        match box_ref.ref_count.get() {
            0 => (),
            1 => {
                box_ref.ref_count.set(0);
                unsafe {
                    drop_in_place(addr_of_mut!(self.ptr.as_mut().value));
                    dealloc(
                        self.ptr.as_ptr().cast(),
                        Layout::for_value(self.ptr.as_ref()),
                    );
                }
            }
            n => {
                box_ref.ref_count.set(n - 1);
                unsafe { collect(self.ptr) };
            }
        }
    }
}
collect
is the function we’ll write to check if the allocation is accessible and perform a bulk
cleanup if needed. There’s one more problem before we can implement collect
, though: our pseudocode
required us to be able to store each allocation as a key in counts
and also to access the
allocation’s neighbors. This means we have to use a little bit of type erasure to achieve our ends. I won’t go
into too much detail, but we need to be able to do two things:
- Store pointers to allocations as keys in a hash-map.
- Reconstruct a stored pointer using type information to make it possible to do work.
Due to the limitations of Rust, these are actually mutually exclusive. Because it’s undefined behavior (due to
provenance issues) to compare the raw data in pointers, we can’t erase a pointer and then use it as a key in a
hash-map. However, we can’t just get a pointer to the first byte of the allocation, because our allocation could
be ?Sized
.
As a result, my implementation actually has two different kinds of erased pointers: a thin pointer which can be used for comparisons and a fat pointer which can be used to reconstruct a pointer to any type.
struct AllocationId(NonNull<Cell<usize>>); // used as a key in hash-maps
struct ErasedPtr([usize; 2]); // used as a reconstructible pointer to the allocation
Then, all the functions which need to know the type information of the allocation (such as
dfs
and mark
) can be stored opaquely as function pointers, and when called, they can
“rehydrate” the ErasedPtr
passed in with type information.
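As a rough sketch of what I mean (the Cleanup struct and cleanup_for helper are illustrative stand-ins, not dumpster’s actual internals), a to-be-cleaned allocation can be remembered as its erased pointer plus a monomorphized function pointer:

use std::collections::HashMap;

struct Cleanup {
    ptr: ErasedPtr,
    // A type-erased entry point: `dfs::<T>` (shown below), instantiated for the
    // concrete T and stored as an ordinary function pointer.
    dfs_fn: unsafe fn(ErasedPtr, &mut HashMap<AllocationId, usize>),
}

fn cleanup_for<T: Collectable + ?Sized>(ptr: ErasedPtr) -> Cleanup {
    Cleanup {
        ptr,
        dfs_fn: dfs::<T>,
    }
}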
Below, I’ve included an example of how it can be used to implement
dfs
.
unsafe fn dfs<T: Collectable + ?Sized>(
    ptr: ErasedPtr,
    counts: &mut HashMap<AllocationId, usize>,
) {
    struct DfsVisitor<'a> {
        counts: &'a mut HashMap<AllocationId, usize>,
    }
    impl Visitor for DfsVisitor<'_> {
        fn visit_gc<U: Collectable + ?Sized>(&mut self, gc: &Gc<U>) {
            let id = AllocationId::new(gc.ptr);
            // Only record the reference count and recurse the first time we see
            // this allocation.
            if !self.counts.contains_key(&id) {
                self.counts
                    .insert(id, unsafe { gc.ptr.as_ref().ref_count.get() });
                (**gc).accept(self);
            }
            // Every edge we traverse into this allocation accounts for one reference.
            *self.counts.get_mut(&id).unwrap() -= 1;
        }
    }
    // pretend that `specify` converts the pointer back into NonNull<T>
    let specified = ptr.specify::<T>();
    specified.as_ref().accept(&mut DfsVisitor { counts });
}
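For symmetry, here’s a sketch of the mark step (illustrative, not the crate’s exact code), mirroring the pseudocode from earlier: starting from an allocation known to be accessible, flood outward and record everything reachable.

use std::collections::HashSet;

unsafe fn mark<T: Collectable + ?Sized>(
    id: AllocationId,
    ptr: ErasedPtr,
    reachable: &mut HashSet<AllocationId>,
) {
    struct MarkVisitor<'a> {
        reachable: &'a mut HashSet<AllocationId>,
    }
    impl Visitor for MarkVisitor<'_> {
        fn visit_gc<U: Collectable + ?Sized>(&mut self, gc: &Gc<U>) {
            // Only descend into allocations we haven't already marked.
            if self.reachable.insert(AllocationId::new(gc.ptr)) {
                (**gc).accept(self);
            }
        }
    }
    // Mark the starting allocation itself, then rehydrate the erased pointer
    // (the same `specify` trick as in `dfs`) and walk its outgoing edges.
    if reachable.insert(id) {
        ptr.specify::<T>().as_ref().accept(&mut MarkVisitor { reachable });
    }
}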
I won’t include the implementation of collect
here because it’s lengthy and not even the final
revision of the cleanup algorithm, but you can trust me that it works. After much gnashing of teeth and tearing
of hair, this is finally enough to achieve bulk dropping, making calling drop()
on a
Gc
a linear-time operation.
Amortizing
What’s that, you say? You need your reference operations to work in O(1) time? Why are programmers these days so picky?
Luckily, it isn’t too hard to make our garbage-collection efforts terminate in average-case O(1) time. Instead of calling collect directly, we’ll just punt the allocation, which may be inaccessible, into a set of “dirty” allocations. In my code, I call that set a dumpster. Then, once every n times an allocation is dropped, we’ll go through the whole dumpster and collect them all in O(n) time. Amortized across all allocations, that’s technically O(1), enabled by the same accounting wizardry that makes array-backed lists useful and that caused the 2008 financial crisis.
To save ourselves a little bit of work, whenever an allocation is accessed (such as by dereferencing or cloning), we can yank the allocation away from that same dirty set since it was just proven to be accessible.
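A sketch of that bookkeeping (the Cleanup record and the threshold constant are placeholders, not dumpster’s real internals) might look like this: dropping a Gc marks its allocation dirty, and touching it again marks it clean.

use std::cell::RefCell;
use std::collections::HashMap;

const COLLECT_THRESHOLD: usize = 1024; // made-up heuristic

thread_local! {
    // The "dumpster": allocations that might have become inaccessible.
    static DUMPSTER: RefCell<HashMap<AllocationId, Cleanup>> =
        RefCell::new(HashMap::new());
}

/// Called when a Gc is dropped but its allocation might still be reachable.
/// Returns true when enough garbage has piled up that a full sweep should run.
fn mark_dirty(id: AllocationId, cleanup: Cleanup) -> bool {
    DUMPSTER.with(|dumpster| {
        let mut dumpster = dumpster.borrow_mut();
        dumpster.insert(id, cleanup);
        dumpster.len() >= COLLECT_THRESHOLD
    })
}

/// Called when a Gc is dereferenced or cloned: the allocation is provably
/// accessible, so it no longer belongs in the dirty set.
fn mark_clean(id: &AllocationId) {
    DUMPSTER.with(|dumpster| {
        dumpster.borrow_mut().remove(id);
    });
}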
Are we concurrent yet?
So far, the garbage collector I’ve been outlining to you has been a single-threaded collector, which can only store thread-local garbage-collected values. I’m told that all code worth writing ends up being concurrent, so we have to do better than that. Lucky for us, this algorithm can be made to operate concurrently with little effort, assuming that only one thread collects at a time.
However, even if only one thread is collecting, we can still run into some nasty concurrency issues. Let’s imagine that we have 2 allocations, A and B, plus the imaginary “root” allocation R, where A can only be accessed from R and B can only be accessed from A.
Now, consider performing a depth-first search starting from A. First, we record in counts that A has 1 unaccounted-for reference. Then, we move on to looking at B because it’s one of A’s children.
However, between when we record A’s reference count and when we examine B, another malicious thread mutates B, adding a back-reference from B to A.
Then, the depth-first search will return to A, decrement the number of unaccounted references, and assume that all incoming references to A and B were totally accounted for (and therefore inaccessible). Then the garbage collector would happily destroy A’s and B’s allocations, leaving R with a dangling reference to the heap, potentially causing a use-after-free or other gnarly memory errors.
Resolving concurrency
The simplest solution to this concurrency issue is to just wrap everything in a big fat lock by forcing every
single operation involving a
Gc
to acquire a single global mutex, preventing any sort of concurrency bugs at all. However,
that’s ridiculously slow.
We’d like to have an approach that only needs to achieve mutual exclusion over a small subset of all allocations
if at all possible. To do so, let’s take a closer look at our concurrency problem: we can see that we only
really get erroneous results when another thread mutates the subgraph
as we interact with it. The solution is quite clever - we tag every Gc
with a special number, which
I’ll call a collecting tag.
First, we’ll add a global atomic integer called the
COLLECTING_TAG
.
static COLLECTING_TAG: AtomicUsize = AtomicUsize::new(0);
Whenever we create or clone a new Gc
, we’ll annotate it with the current value of
COLLECTING_TAG
.
// this new `Gc` type is intended to be `Sync`.
pub struct Gc<T: Collectable + Send + Sync + ?Sized> {
ptr: NonNull<GcBox<T>>,
tag: AtomicUsize,
}
impl<T: Collectable + Send + Sync> Gc<T> {
    pub fn new(x: T) -> Gc<T> {
        // (other bookkeeping hidden)
        Gc {
            ptr: /* initialization of GcBox<T> */,
            tag: AtomicUsize::new(COLLECTING_TAG.load(Ordering::Relaxed)),
        }
    }
}
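Cloning follows the same rule. A sketch (assuming the reference count becomes an AtomicUsize in the concurrent GcBox, as it does later in this post):

impl<T: Collectable + Send + Sync + ?Sized> Clone for Gc<T> {
    fn clone(&self) -> Gc<T> {
        let box_ref = unsafe { self.ptr.as_ref() };
        // Bump the (now atomic) reference count of the shared allocation.
        box_ref.ref_count.fetch_add(1, Ordering::Relaxed);
        Gc {
            ptr: self.ptr,
            // Stamp the fresh handle with the current collecting tag, so a
            // concurrent dfs can tell it was created after collection began.
            tag: AtomicUsize::new(COLLECTING_TAG.load(Ordering::Relaxed)),
        }
    }
}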
Next, at the start of the collection process, we’ll increment the value of
COLLECTING_TAG
just before we call dfs
.
fn collect() {
COLLECTING_TAG.fetch_add(1, Ordering::Relaxed);
// carry on with DFS and such
}
Lastly, whenever we encounter a Gc
during the dfs
process, we’ll check its tag. If the
tag is equal to the current value of COLLECTING_TAG
, that means the Gc
was created
after dfs
started. If so, then whatever spot we found the Gc
in must have been
accessible, and we can mark it as so. To prevent us from visiting the same edge twice (due to shenanigans),
we’ll also update the tag on every edge we visit to the value of COLLECTING_TAG
.
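Concretely, the check inside the concurrent dfs visitor might be a tiny helper like this (hypothetical, not the crate’s exact code), which stamps the edge and reports whether it already carried the current tag:

fn stamp_edge(edge_tag: &AtomicUsize) -> bool {
    let current = COLLECTING_TAG.load(Ordering::Relaxed);
    // Swap in the current tag so this edge won't be examined again, and report
    // whether it was already tagged with it - meaning the Gc was created after
    // this collection began, or has already been visited during this pass.
    edge_tag.swap(current, Ordering::Relaxed) == current
}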
Let’s return to our broken example to show how this modification fixes it. We return to our reference graph with R pointing to A and A pointing to B. Since every Gc must be annotated with a tag, let’s say they’re all tagged with 0.
Next, when dfs begins, COLLECTING_TAG is incremented to 1. As a result, the new reference from B to A created by a malicious thread must be tagged with 1.
Now, when dfs visits the edge pointing from B to A, it sees that the reference is labeled with a tag of 1, and therefore both A and B must be accessible - no accidental early deallocations here.
Moving in and out
We might hope that tagging the pointers is sufficient to prevent all concurrency bugs. However, this is sadly not enough. To show why, we’ll begin with an example.
First, let’s imagine an allocation called X, which contains two optional references to itself.
struct X {
x1: Mutex<Option<Gc<X>>>,
x2: Mutex<Option<Gc<X>>>,
}
At the start of our example, there will be only one Gc pointing from inside X to itself, and the other slot that can hold a Gc will be empty. X is accessible directly from the root allocation R, and all existing references are tagged with 0. For the sake of clarity, I’ll draw the arrow representing the first slot in X with a solid line and the second slot with a dashed line.
Now, let’s have dfs begin, starting at X.
- We increment COLLECTING_TAG to 1.
- dfs records that X has 2 unaccounted-for references at the start.
- dfs visits X’s first slot and tags the currently-existing self-reference in that slot. X now has 1 unaccounted-for reference.
- A malicious thread intervenes, first moving the Gc out of X’s first slot and into R (a local variable on its stack), then moving the original Gc which connected R to X into X’s second slot:
  fn do_evil(r_to_x: Gc<X>) {
      let z = std::mem::take(&mut *r_to_x.x1.lock().unwrap()).unwrap();
      *z.x2.lock().unwrap() = Some(r_to_x);
  }
- dfs continues, now visiting the second slot in X. Seeing another reference to X, we decrement the number of unaccounted-for references to 0.
- dfs found no ancestors of X with unaccounted-for references, so we free X. However, X is still accessible!
This problem stumped me for a decent while. It seems impossible to prevent other threads from do-si-do-ing every
Gc
pointing to an allocation in order to fool our search. Luckily, there is a way through. Since
tagging every Gc
worked out for us before, we’ll try tagging every allocation.
#[repr(C)]
struct GcBox<T: Collectable + Send + Sync + ?Sized> {
ref_count: AtomicUsize,
tag: AtomicUsize,
value: T
}
Next, every time a Gc
is dereferenced, we’ll update the tag in the allocation to the current value
of COLLECTING_TAG
.
impl<T: Collectable + Send + Sync + ?Sized> std::ops::Deref for Gc<T> {
    type Target = T;

    fn deref(&self) -> &Self::Target {
        let box_ref: &GcBox<T> = unsafe { self.ptr.as_ref() };
        // Record that this allocation was touched, in case a collection is in progress.
        box_ref.tag.store(COLLECTING_TAG.load(Ordering::Acquire), Ordering::Release);
        &box_ref.value
    }
}
Whenever dfs
observes that the tag on an allocation is equal to COLLECTING_TAG
, that
means that the allocation was accessed, and therefore accessible. In order to shuffle references around in our
above example, the malicious thread would have to dereference a Gc
, notifying the search and
preventing an early deallocation.
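The corresponding check against the allocation’s tag is just as small; a sketch (again illustrative rather than dumpster’s exact code):

fn accessed_since_collection_began<T: Collectable + Send + Sync + ?Sized>(
    box_ref: &GcBox<T>,
) -> bool {
    // The Deref impl above stamps the allocation with the current COLLECTING_TAG
    // on every dereference, so a matching tag means some thread touched this
    // allocation after collection began - proof that it is accessible.
    box_ref.tag.load(Ordering::Acquire) == COLLECTING_TAG.load(Ordering::Acquire)
}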
In fact, this is enough to prove that dfs
will never incorrectly mark an allocation as
inaccessible, even under concurrent execution. This blog post is already rather long, so I’ll only provide a
rough sketch of the proof.
Theorem: dfs will never erroneously mark an allocation as inaccessible.
Proof:
1. If dfs encounters a Gc created after dfs begins, it will know that the referent is accessible.
2. Due to (1), if dfs marked some allocation n as inaccessible, it must have only visited Gcs pointing to n which existed before dfs began.
3. Weakening the statement from (2), if dfs marked some allocation n as inaccessible, it must have only visited Gcs pointing to n which existed when n’s reference count was recorded.
4. dfs never visits a Gc while it points from the root allocation r to any other allocation.
5. dfs must visit every Gc which pointed to n at the point when n’s reference count was recorded in order to mark n as inaccessible.
6. Due to (4) and (5), if n is marked as inaccessible, any Gc which at any point connected r to n must have been moved so that it connects another node (not r) to n.
7. Due to (6), there is no lifetime 'x which can borrow against a Gc pointing from r to n and last between when dfs first visits n and when it last visits n.
8. The final Gc pointing to n to be moved before dfs last visits n must have been moved while another borrow against a Gc pointing from r to n existed. Let the lifetime of that borrow be 'y.
9. Because 'y lasts longer than the final move, 'y can be extended to last arbitrarily long, even until after dfs has finished visiting every Gc pointing to n.
10. Due to (7), 'y must not have existed before dfs first visited n. Therefore, if n is erroneously marked as inaccessible, a Gc pointing to n must have been dereferenced to generate a borrow with lifetime 'y.
11. If a Gc pointing to n was dereferenced before the last time dfs visited a Gc pointing to n, then dfs would have observed that n’s tag had changed. Therefore, if an allocation was erroneously marked as inaccessible, it must have been marked as accessible. Therefore no allocations were ever erroneously marked as inaccessible by dfs. QED.
This proof is relatively handwavy, but it should be possible for an astute reader to fill in the details. I will
leave it to such astute readers to prove the complementary result; namely, that dfs
will
never erroneously mark an allocation as accessible.
I also think that it’s really cool that the entire mechanism of the correctness of this approach is totally supported by Rust’s borrow checking semantics. It makes the whole thing feel like it fits right in.
It’s the weak ones that count
There’s one more cause for memory safety issues from concurrency. If an allocation is freed because its reference count reached zero, it could still be accessible by another thread because it was stored in the global dumpster. We can’t guarantee that we’ve obliterated all references to that allocation from every other thread’s memory, so it seems like we’re hosed.
However, there’s no problem in computer science that can’t be solved by adding more state. Taking a leaf from
Rc
’s book and adapting it liberally, we annotate each allocation with a “weak” reference count,
which is the number of times it’s referenced by a dumpster or cleanup thread. We’ll rename our original
reference count to be the strong count instead.
#[repr(C)]
struct GcBox<T: Collectable + Send + Sync + ?Sized> {
strong: AtomicUsize,
weak: AtomicUsize,
tag: AtomicUsize,
value: T,
}
Then, when dropping the last strong reference to a Gc, we first check that its weak reference count is zero before we deallocate it. If it isn’t, it’s the cleanup threads’ problem to take care of it.
Whenever an allocation is removed from a dumpster and about to be thrown away by a thread running a cleanup procedure, the thread decrements the weak count of the allocation and checks if both the weak and strong count have reached zero. If so, that thread can then safely destroy the allocation.
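Putting that together, the “last strong reference” path might look roughly like this sketch (the helper name and the exact orderings are mine, not the crate’s):

unsafe fn drop_last_strong<T: Collectable + Send + Sync + ?Sized>(ptr: NonNull<GcBox<T>>) {
    let box_ref = ptr.as_ref();
    if box_ref.weak.load(Ordering::Acquire) == 0 {
        // No dumpster or cleanup thread knows about this allocation, so this
        // thread is responsible for destroying it right away.
        let layout = Layout::for_value(box_ref);
        drop_in_place(addr_of_mut!((*ptr.as_ptr()).value));
        dealloc(ptr.as_ptr().cast(), layout);
    }
    // Otherwise, leave it alone: a cleanup thread will eventually observe that
    // both the strong and weak counts are zero and destroy the allocation itself.
}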
Deadlocked!
Imagine that we want to create a Gc<Mutex<()>>
, and then drop a
Gc
pointing to it while the Mutex
is locked.
let gc1 = Gc::new(Mutex::new(()));
let gc2 = gc1.clone();
let guard = gc1.lock().unwrap();
drop(gc2); // suppose this triggers a cleanup
drop(guard);
While traversing the reference graph, the dfs
visitor would attempt to visit the mutex, lock it,
and try to carry on. However, since the same thread that triggered the collection still holds a guard to that
mutex, the thread would deadlock disastrously. Lucky for us, we know that any allocation with a held mutex guard
must still be accessible (due to Mutex
’s excellent API), so we can be certain that if we fail to
acquire a mutex, we have immediate proof that an allocation is accessible.
To take advantage of that, we’ll make visiting a collectable type a fallible operation, and make it so that failing to acquire a lock also fails the visit.
pub trait Collectable {
fn accept<V: Visitor>(&self, visitor: &mut V) -> Result<(), ()>;
}
pub trait Visitor {
fn visit_gc<T: Collectable + ?Sized>(&mut self, gc: &Gc<T>) -> Result<(), ()>;
}
impl<T: Collectable + ?Sized> Collectable for Mutex<T> {
fn accept<V: Visitor>(&self, visitor: &mut V) -> Result<(), ()> {
// pretend poison errors don't exist
self.try_lock().map_err(|_| ())?.accept(visitor)
}
}
Sharing your dumpsters
You’ve made it this far into this (admittedly rather lengthy) post. Enjoy this hand-drawn picture by yours truly.
When implementing this garbage collector, one of the biggest bottlenecks for concurrent performance is accessing the single global dumpster every time a Gc is dropped. Guarding it with a Mutex<HashMap> removes nearly all parallelism from the whole system and makes you wonder why you paid all that money for a 16-core computer.
As a result, any sane person might lean toward a concurrent hashmap. There are many Rust implementations - I tried
chashmap
but gave up on it due to a
rare panic bug, and then I tried
dashmap
, which worked fine but was too slow as thread counts increased.
The heart of the issue is that concurrently updating any data structure is going to be slow. Even so-called “concurrent” data structures which use clever sharded locks have this issue - there’s always some contention overhead. The best workaround to concurrency slowdowns is to just not do things concurrently. We can fix this by making every thread have its own dumpster, and then having a global collection of to-be-cleaned allocations which I’ll call the garbage truck. Whenever another allocation is marked dirty, we can check if this thread’s dumpster is full according to some heuristic, and if so, transfer all the allocations from this thread’s dumpster to the garbage truck. I’ve included some pseudocode for the whole process below.
dumpster = set()              # local to this thread
garbage_truck = mutex(set())  # global to all threads

def mark_dirty(allocation):
    dumpster.add(allocation)
    if is_full(dumpster):
        garbage_truck.lock()
        for trash_bag in dumpster:
            garbage_truck.add(trash_bag)
        garbage_truck.unlock()
        dumpster = set()
If is_full
returns true
rarely enough, there will be almost no contention over the
garbage truck. This means that the concurrency overhead is vastly reduced and we get to pocket some great
performance returns.
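In Rust terms, the hand-off might look something like this sketch (the capacity threshold, the Cleanup record, and passing the garbage truck as a parameter are all illustrative choices, not dumpster’s real interface):

use std::collections::HashMap;
use std::sync::Mutex;

const DUMPSTER_CAPACITY: usize = 1024; // made-up threshold

fn is_full(dumpster: &HashMap<AllocationId, Cleanup>) -> bool {
    dumpster.len() >= DUMPSTER_CAPACITY
}

fn hand_off(
    dumpster: &mut HashMap<AllocationId, Cleanup>,
    garbage_truck: &Mutex<HashMap<AllocationId, Cleanup>>,
) {
    if is_full(dumpster) {
        // One short critical section moves the whole batch at once, so the
        // global lock is contended only rarely.
        garbage_truck.lock().unwrap().extend(dumpster.drain());
    }
}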
Okay, but how fast is it?
In short, this garbage collector runs pretty fast!
In the end, I made two different garbage collector implementations - one which is thread-local, saving us from
all the concurrency headaches, and another thread-safe one that has all the bells and whistles at the cost of
some performance. Borrowing from
once_cell
’s API, the former lives in the unsync
module of the crate and the latter in sync
.
Next, I collected all the garbage collectors I could find in order to compare them. I selected all the ones
which had a similar API to
dumpster
, were published on crates.io, and actually worked.
Name | Concurrent? | Like dumpster?
---|---|---
dumpster (unsync) | ❌ | ✅
dumpster (sync) | ✅ | ✅
Rc | ❌ | ✅
Arc | ✅ | ✅
bacon-rajan-cc | ❌ | ✅
cactusref | ❌ | ❌
elise | ✅ | ❌
gc | ❌ | ✅
gcmodule | yes, but with a different API | ✅
rcgc | ❌ | ❌
shredder | ✅ | ✅
I excluded the ones that didn’t have a dumpster
-like API because it would have been a pain to set
up a benchmarking harness to handle them. If you feel this is poor diligence, you can go benchmark them
yourself. I also included Rc
and Arc
, even though they don’t collect cycles, because
they’re a decent baseline for the minimum performance of a garbage collector. During my research, I also found
that bacon-rajan-cc
doesn’t initiate collections of its own accord, so I added a version of each of
dumpster
’s garbage collectors which doesn’t automatically trigger collections in the spirit of
competitive fairness. Those versions are labeled as “manual.” Lastly, I found that gcmodule
simply
stack-overflows and sometimes segfaults when facing extremely large collection loads, so I removed it from the
benchmark set.
As my benchmark, I decided to do the very simplest thing I could think of: have a bunch of allocations randomly add and remove references to each other in a loop, dropping and creating new allocations as well. To account for random noise, I ran 100 trials of each garbage collector doing 1 million operations. Each operation is either the creation of a reference or a deletion.
First, I ran a benchmark of all the garbage collectors, measuring pure single-threaded performance. I’ve shown my benchmark results as a set of violin plots below, relating runtime to frequency.
This plot doesn’t tell us much other than that shredder
is slow. To take a closer look, let’s
remove shredder
from the list and see what we find.
dumpster
’s unsync implementation is in a decent position, beat out only by the non-cycle-detecting
allocators and bacon-rajan-cc
. bacon-rajan-cc
has very little overhead and is quite
performant.
I think there are two main reasons why the concurrent version of the garbage collector is much slower than the
single-threaded one. First, moving allocations from a thread-local dumpster to the global garbage truck takes
some time. Second, the dfs
operation in a concurrent environment must record the current set of
children of an allocation, typically in a separate heap-allocated structure, to prevent some concurrency bugs.
It’s certainly possible to mitigate the losses from the first issue, but the second is harder to handle.
My guess as to why the concurrent version is significantly slower when using automatic collection triggering is that it’s a consequence of my testing harness. Currently, it runs all the tests in a loop, so the single-threaded benchmark could run directly after the multi-threaded one and be forced to clean up some of its garbage.
In any event, let’s move on to the results from the multi-threaded benchmark.
The only other concurrent garbage collector available for me to test against was
shredder
, and it looks like it wasn’t much of a competition.
dumpster
doesn’t scale as smoothly as I’d like with thread count, but it’s a relief that
performance does actually improve with thread count, except for the jump from one thread to two. I think being
only 3x slower than Arc is a decent victory, especially for a project that I wasn’t paid any money to do. I’m
not much of an expert at high-performance computing in concurrent situations, so trying to optimize this further
would likely just be an adventure in trying random things and seeing if they yield an improvement. If anyone out
there has clever ideas on how to optimize this approach, let me know! My email is on the “About” page.
Coda
Over the last few months, I built a garbage collector for Rust from scratch. I implemented a novel (ish) algorithm of my own design, got it to work in both single-threaded and multi-threaded contexts, and showed that it was actually pretty performant.
The implementation of this crate is now up on crates.io and available for anyone to download. It’s licensed under the GNU GPLv3.
Special thanks to Wisha, Charlie, and Shreyas for reviewing this post, fielding my bad questions, and putting up with me rambling about this for nearly two months straight.
Postscript
I have a few things that I thought were interesting along the way but didn’t quite have a good spot to include in the story above, so they’ll all be mish-mashed in here.
Ode to bad good ideas
The final, polished solution you see in this blog post is actually quite far from my original idea. In particular, almost all of my first ideas for handling concurrency didn’t work. Here’s a quick roll of them all:
- Wrap a Mutex around every single allocation’s reference count and hold them while doing dfs
- Save on space in Gcs by tagging only the lowest bit instead of adding an entire usize
- Make everything enable dynamic dispatch so we can have real Java-like objects
- Clean up allocations locally on every thread
- Avoid directly dropping allocations and instead cleverly decrement their reference count
Cool nightly tricks
Rc
and the other pointer types are somewhat “blessed” in that they can access nightly features from
the Rust compiler and still be consumed by stable projects. dumpster
has no such blessings.
I would like to implement CoerceUnsized
unconditionally for all my garbage collectors, but that’s
simply not possible on stable yet.
Likewise, pending the implementation of strict provenance, there’s actually no way for a stable crate to manipulate the address portion of a fat pointer. This means that even if I found some space optimization using lower bit tagging, I wouldn’t be able to use it because it’s gated behind a nightly feature.
Another path to Gc<dyn T>
Without the ability to coerce Gc
as an unsized type, I can’t make Collectable
object
safe due to the function signature of accept
, which forces Visitor
to also be
object-safe, forcing visit
operations to also be object-safe.
I’ve thought of a slightly different approach, though. We can make
Visitor
a single concrete type, and that makes Collectable
trivially object-safe. The
downside here is that I would then have to store all 5 of my current visitor implementations in the same
structure, and that loses a lot of my nice type safety features and adds a large runtime penalty. I’m not very happy with that idea.
Miri rules
I would probably not have been able to finish this project without
cargo miri
. If you haven’t heard of it, you should download it right now and run all of your Rust projects with it. It’s
such a fantastic tool for debugging unsafe code that I wish I could have it in other languages, too - valgrind
is a poor substitute for it when using C.
Testing and debugging
Early on, I had just a few handwritten tests. The first reference graph that I found that gave me trouble looked like this:
However, once I had simple graphs with multiple cycles down, it was hard to come up with new tests. As a result, I resorted to fuzzing, using the same approach as my benchmark - randomly generating references and then seeing if they got cleaned up properly.
Debugging the results of those fuzzed tests absolutely sucked, though. The experience of trying to find the source of an issue was awful, so I ended up having to make some (crude) tools to make the process easier. Eventually, I had settled on printing out the reference graph at choice times during the run of my tests and then visualizing them to get a good guess of what was wrong. However, when making the concurrent implementation, printing something can change the execution behavior of the program. Not only that, it’s not always obvious what the issue is. I had one bug which only occurred after several thousand operations had completed, resulting in reference graphs that looked like this:
The graph above was made by fuzzing my garbage collector, using the same approach as my benchmarks.