1. People usually make a scratch arena locals to a thread using something like __declspec(thread). What if you're on a platform or project that can't use thread-local storage? I usually see people handle this by passing around a "thread context" Care to comment on that?
2. The push and pop operator on the current arena implementation aren't atomic, which mean they aren't thread-safe. How do you go about dealing with memory management for multithreading?
3. Do you think the idea of ArenaTemp can be made more abstract and applied to other kinds of allocators? For example, I may have a generic heap allocator that keeps a free list and allows for out-of-order alloc and free allocations. I can still create an ArenaTemp for this allocator by storing the free list in the ArenaTempBegin and restoring it in the ArenaTempEnd function. The same also goes for a pool allocator, for example. I can still store the current free list and reset it later.
1. All consumer machines - phones, game consoles, PCs, laptops - have hardware-level thread-local storage support. I don't write for other systems, so I don't ever do anything else. Per-thread storage can be emulated nonetheless - that is, indeed, what "the stack" is (there is one stack per thread).
2. The arena push and pop operations are not intended to be used by multiple threads by themselves. The arena, like the stack, is a useful pattern when accessed *by a single thread at a time*, so either in a single-threaded context, or in some interlocked shared data structure. In general, threads should try to synchronize as little as possible, and so they will often be operating on *different* arenas altogether. For communication *between* threads, an arena is not really the correct "shape" in most cases - another primitive, like a ring buffer, is much more suitable. In other words, the arena implementation is not the right layer to solve for multithreading.
3. You can apply the same principle to other allocators but I don't see any point in making the code more abstract. ArenaTemp is so simple for arenas, and it's not so simple for other allocators. Given that ~95%+ of allocations occur on arenas, I would not pessimize the 95% case for code reusability in the <5% cases - I'd just make another "Temp" idea at that layer (if it was indeed useful).
Thanks for answering. Regarding multithreaded allocators, I understand that threads should communicate with each other as little as possible, meaning each thread should have its own allocators. But I don't understand why a ring buffer is a better fit here. This also leads me nicely to the underlying question I was trying to ask. You've talked a lot (in this post and some other lectures) about the idea of composing arenas and other allocators together to create several more sophisticated ones (for example, combining a free list with an arena to create a simple pool allocator). Asking about multithreaded or malloc-style heap allocators was just me trying to know more about other composed allocators that you find interesting and handy in your experience.
Ringbuffers are the industry standard for lock-free multithreading communication. Feel free to check out my MCMP Queue implimentation, which is a multi-consumer multi-producer ringbuffer.
I wonder what you think about pathological cases with the pool allocator, where memory gets tied up in the free list of one struct and becomes unavailable for other allocations.
For example, let's say that, through user error, the level editor for a game allocates and frees 200k entities (the user selected the entire level and copied it 100 times by accident, then pressed undo). Now a potentially large amount of memory gets leaked into the entity free list (and other associated free lists).
Perhaps this is a reason to use a malloc-style allocator, where freeing memory makes it available for other kinds of allocations?
First I'd say that the memory in free lists is not *leaked*, really, at least not any more than a free'd malloc allocation leaks when the allocator keeps it committed. Each element in the free list is still available for subsequent allocations, those allocations just must be of the same type. You're right that a proliferation of types will therefore lead to possibly large free-lists which are unusable for many kinds of allocations. For this reason, among others, I recommend against a proliferation of types - it keeps everything simpler. Fully explaining all the ways in which it simplifies everything is out-of-scope for this comment, but you see the argument in this case.
If you have gigantic free lists like in a case you describe, that memory being physically committed may increase the nominal memory usage of a program, but it will not have detrimental paging effects (because that memory is merely sitting around waiting to be re-allocated - it is not actively being used). Note that the simple free list (stack) works in the order you'd want - the last thing freed is the first thing to be used. So, in your example, the user first undoes the *last paste*, and lastly undoes the *first paste*. Upon more allocations, the first things pulled off the free-list will be at the *lowest address*. In this level editor, then, growing back to the 200k entities gracefully occurs. The 200k entities still sit around being committed, but I'd claim that isn't a "real" problem. If it's worrisome (for some reason), then that constraint will indeed require more complex strategies.
These more complex strategies may include, as you suggest, a more general purpose allocator (reusing freed memory for other sorts of allocations), and/or the allocator attempting to group freed allocations together and physically decommit those pages. You can imagine, for instance, the basic allocation block being 4k entities, and then always preferring already-allocated blocks. That also has pathological cases, but then again, so does `malloc`, and so does everything else - it's just a matter of what you expect and what you care to avoid.
I suppose the main questions are 1) how much memory the free lists actually take up and 2) how many worst cases you expect over the lifetime of the program.
I suspect maybe the amount of memory isn't that large even for my 200k entity example. Perhaps a few 10s of MBs, maybe 100s max for a complex game and editor, which isn't much for a modern PC, especially a dev machine.
But those numbers can add up over days of program use if they are spread out over disparate free lists. I suppose this is where minimizing the number of different types can help. I'm just not sure to what extent that is possible given sufficiently complex program requirements (but otherwise I agree that this is good advice for simplifying code).
I was also thinking about (and you already cover this to some extent) the granularity of arenas. Instead of a single permanent arena, you could have one per level. Or even one per entity to cover cases where a single entity requires a lot of dynamic allocations. But this approach introduces its own drawbacks similar to having to call free on individual allocations.
Thanks for subscribing, Jake. I'm glad you liked the article! The Code Depot is locked - I need to make you an account. Please check out the Paid Subscriber Registration post here: https://www.rfleury.com/p/paid-subscriber-registration
Regarding the per-thread scratch arena system, can you have an additional per-thread int scratchCount that increases every time you call GetScratch and decreases every time you call ReleaseScratch? Then every time you need a new scratch arena, rather than searching, you can get it from the index after the int. This will also give the maximum number of unique scratches you need in any instance.
This is not how this scratch system works. If you call GetScratch twice in a row, both for temp allocations (you don't pass first scratch as a conflict to the other), you will get same arena for both calls. And that is fine, this is how stack works. Even in case when arenas are in conflict, fixed counter would disable alternating same arenas in arbitrarily deep stack as Ryan mentioned with his example.
So I got a notification about your reply about why did I want that, but I can't find it, so maybe you deleted your reply? I will explain my idea a little more so that we are all on the same page.
The OP was talking specifically about the per-thread scratch system and was not about the stack-like arena fashion. In the article, Ryan introduced the GetScratch function that takes in an array of Arena* and returns a unique arena to the array from the thread pool (technically, it returns a TempArena that reference the unique arena). The original array of arenas can only contain a thread-local scratch arena when and only when that scratch arena was from the thread-local scratch pool by calling GetScratch. My idea was rather than pass in a not-to-conflicts-with array to the GetScratch, it just has a counter that gets increased every time it gets called and returns the element at that index from the pool. Then when the current function is done using the scratch arena, it will call the ReleaseScratch function which will decrease the counter by one so the latter functions can reuse the arena.
I don't know if by OP you meant your original reply or Ryan's post. But Ryan introduced scratch system in the context of stacks.
As for your implementation, yes that would work too, but then you'd have to dynamically increase arena pool when your memory stack grows beyond pool capacity. (Or allocate large enough pool at start by measuring your code, which could be a bit tricky if you use recursions a lot). I think Ryan's approach is more elegant exactly because of arena re-usage. You don't have to measure your code, you just have to follow simple rule Ryan outlined. As long as only a single persistent arena is present at any point in any codepath, you will not need more than two scratch arenas.
So it seems like you misunderstand me. You can still have a persistent arena, and all your functions still need to pass in an arena like the function A and function B example. I was talking specifically about the GetScratch and ReleaseScratch functions that were used inside functions A and B.
void *FunctionA(Arena *arena)
{
ArenaTemp scratch = GetScratch(); // increase the counter
// use the scratch arena
// fill result to Arena* arena
ReleaseScratch(scratch); // decrease the counter
return result;
}
void FunctionB(void)
{
ArenaTemp scratch = GetScratch(); // increase the counter
void *result = FunctionA(scratch.arena); // the scratch inside A will not be the same as this scratch
I can't find a way to turn on the code format in the reply section, so here's a link with some example code about the difference between Ryan's way and mine: https://pastebin.com/h53XsVzi
You can't get the same arena until the counter goes down. The GetScratch method only gets the scratch at the counter and increments it by one. Kind of like this: #define GetScratch(name) ArenaTemp name = scratchPool[scratchCount++] and #define ReleaseScratch(name) ArenaTempEnd(name); scratchCount--
I just have one question: how do we pull data out of the arena? And why not make arenas hash maps rather than contiguous blocks, so we can pull out the values by name, even though that would be a bit less efficient?
Push functions return pointer to the memory block that you can then write into. As for hash maps, I believe those are not on the same level of abstraction. Hash map would be classified as a cointainer. You can easily implement hash map on top of arenas. Same goes for dynamic arrays as Ryan mentioned in the article.
Right - in the same way that you don't ask the `malloc` allocator for allocations you've performed, you don't ask an arena for those either. That's up to the usage code of an arena. So you push something onto an arena with e.g. `Foo *foo = PushArrayZero(arena, Foo, 1);`. Then, it's up to *you* to use `foo` for whatever, store it somewhere, etc. - bookkeeping does not belong at the layer of the arena.
I haven’t looked at Zig’s memory allocator strategies much, but from my understanding Zig requires that allocating functions take allocators as parameters. In that sense, it’s the same. It’s different in that I’d expect Zig’s allocators to be abstract (this might not be the case), whereas arenas are a specific type of basic allocator.
The second part is important. Arenas simplify both implementation and usage code. Code that must work with a generic allocator will be more complicated than code that needs to work only with arenas.
In terms of mass deallocation of memory—i.e., deallocate all memory allocated at once—with arenas I assume others have implemented efficient ways of doing this? - With the system [kernel] allocator its straightforward but often incurs nontrivial overhead.
I like the idea of using an arena to allocate a large tree datastructure, but without granular bookkeeping I wouldn't think more complex data-structures could be efficiently implemented (like Brodal's finger trees).
PS: Enjoyed this line "like a government agency, except in this case, the garbage collector is ostensibly doing something approximating useful work—although both function by stealing valuable resources involuntarily" ^_^
Yeah, so it seems like my expectation was correct. The problem I have with that design decision (which matches that of Odin as well) is that it forces code written for an abstract allocator - for which the implementation can change - to use the *maximally generic allocator API*. With an arena, the burden of freeing individual allocations all but disappears (except when doing more complex allocators on top of an arena, e.g. a pool allocator) - so code written for an arena is much simpler than code written for a generic allocator (even if that generic allocator no-ops on all of the frees, and indeed is implemented with an arena). The generic interface remains a pessimization for the allocator usage code.
The "granular bookkeeping" you mention (for more complex data structures) is what I describe as "a more complex allocator that composes with an arena", e.g. the free-list + arena introduced in the post. So it's definitely possible, it's just that it requires *composition with an arena*. The approach of e.g. Zig and Odin is to *abstract over the allocator*, but I disagree with their approach - instead, use the simplest allocator as a building block, and add more building blocks as necessary.
Regarding the government line, glad you enjoyed it. I have to sneak in my ancap somewhere. :)
A couple of questions:
1. People usually make a scratch arena locals to a thread using something like __declspec(thread). What if you're on a platform or project that can't use thread-local storage? I usually see people handle this by passing around a "thread context" Care to comment on that?
2. The push and pop operator on the current arena implementation aren't atomic, which mean they aren't thread-safe. How do you go about dealing with memory management for multithreading?
3. Do you think the idea of ArenaTemp can be made more abstract and applied to other kinds of allocators? For example, I may have a generic heap allocator that keeps a free list and allows for out-of-order alloc and free allocations. I can still create an ArenaTemp for this allocator by storing the free list in the ArenaTempBegin and restoring it in the ArenaTempEnd function. The same also goes for a pool allocator, for example. I can still store the current free list and reset it later.
1. All consumer machines - phones, game consoles, PCs, laptops - have hardware-level thread-local storage support. I don't write for other systems, so I don't ever do anything else. Per-thread storage can be emulated nonetheless - that is, indeed, what "the stack" is (there is one stack per thread).
2. The arena push and pop operations are not intended to be used by multiple threads by themselves. The arena, like the stack, is a useful pattern when accessed *by a single thread at a time*, so either in a single-threaded context, or in some interlocked shared data structure. In general, threads should try to synchronize as little as possible, and so they will often be operating on *different* arenas altogether. For communication *between* threads, an arena is not really the correct "shape" in most cases - another primitive, like a ring buffer, is much more suitable. In other words, the arena implementation is not the right layer to solve for multithreading.
3. You can apply the same principle to other allocators but I don't see any point in making the code more abstract. ArenaTemp is so simple for arenas, and it's not so simple for other allocators. Given that ~95%+ of allocations occur on arenas, I would not pessimize the 95% case for code reusability in the <5% cases - I'd just make another "Temp" idea at that layer (if it was indeed useful).
Thanks for answering. Regarding multithreaded allocators, I understand that threads should communicate with each other as little as possible, meaning each thread should have its own allocators. But I don't understand why a ring buffer is a better fit here. This also leads me nicely to the underlying question I was trying to ask. You've talked a lot (in this post and some other lectures) about the idea of composing arenas and other allocators together to create several more sophisticated ones (for example, combining a free list with an arena to create a simple pool allocator). Asking about multithreaded or malloc-style heap allocators was just me trying to know more about other composed allocators that you find interesting and handy in your experience.
Ringbuffers are the industry standard for lock-free multithreading communication. Feel free to check out my MCMP Queue implimentation, which is a multi-consumer multi-producer ringbuffer.
https://github.com/173duprot/mcmpq.h/
Great content. Thank you for the detailed explanation.
Ah, I love this :D
Thanks for the article! Very inspiring.
I wonder what you think about pathological cases with the pool allocator, where memory gets tied up in the free list of one struct and becomes unavailable for other allocations.
For example, let's say that, through user error, the level editor for a game allocates and frees 200k entities (the user selected the entire level and copied it 100 times by accident, then pressed undo). Now a potentially large amount of memory gets leaked into the entity free list (and other associated free lists).
Perhaps this is a reason to use a malloc-style allocator, where freeing memory makes it available for other kinds of allocations?
I'm glad you enjoyed it!
First I'd say that the memory in free lists is not *leaked*, really, at least not any more than a free'd malloc allocation leaks when the allocator keeps it committed. Each element in the free list is still available for subsequent allocations, those allocations just must be of the same type. You're right that a proliferation of types will therefore lead to possibly large free-lists which are unusable for many kinds of allocations. For this reason, among others, I recommend against a proliferation of types - it keeps everything simpler. Fully explaining all the ways in which it simplifies everything is out-of-scope for this comment, but you see the argument in this case.
If you have gigantic free lists like in a case you describe, that memory being physically committed may increase the nominal memory usage of a program, but it will not have detrimental paging effects (because that memory is merely sitting around waiting to be re-allocated - it is not actively being used). Note that the simple free list (stack) works in the order you'd want - the last thing freed is the first thing to be used. So, in your example, the user first undoes the *last paste*, and lastly undoes the *first paste*. Upon more allocations, the first things pulled off the free-list will be at the *lowest address*. In this level editor, then, growing back to the 200k entities gracefully occurs. The 200k entities still sit around being committed, but I'd claim that isn't a "real" problem. If it's worrisome (for some reason), then that constraint will indeed require more complex strategies.
These more complex strategies may include, as you suggest, a more general purpose allocator (reusing freed memory for other sorts of allocations), and/or the allocator attempting to group freed allocations together and physically decommit those pages. You can imagine, for instance, the basic allocation block being 4k entities, and then always preferring already-allocated blocks. That also has pathological cases, but then again, so does `malloc`, and so does everything else - it's just a matter of what you expect and what you care to avoid.
Thanks for the reply!
I suppose the main questions are 1) how much memory the free lists actually take up and 2) how many worst cases you expect over the lifetime of the program.
I suspect maybe the amount of memory isn't that large even for my 200k entity example. Perhaps a few 10s of MBs, maybe 100s max for a complex game and editor, which isn't much for a modern PC, especially a dev machine.
But those numbers can add up over days of program use if they are spread out over disparate free lists. I suppose this is where minimizing the number of different types can help. I'm just not sure to what extent that is possible given sufficiently complex program requirements (but otherwise I agree that this is good advice for simplifying code).
I was also thinking about (and you already cover this to some extent) the granularity of arenas. Instead of a single permanent arena, you could have one per level. Or even one per entity to cover cases where a single entity requires a lot of dynamic allocations. But this approach introduces its own drawbacks similar to having to call free on individual allocations.
Hi Ryan! Loved the article so much I just subscribed!
Is the Code Depot with the example locked? Would love to see how you use Arena Allocation in a small example.
Thanks for subscribing, Jake. I'm glad you liked the article! The Code Depot is locked - I need to make you an account. Please check out the Paid Subscriber Registration post here: https://www.rfleury.com/p/paid-subscriber-registration
Just sent and thanks a ton!
Regarding the per-thread scratch arena system, can you have an additional per-thread int scratchCount that increases every time you call GetScratch and decreases every time you call ReleaseScratch? Then every time you need a new scratch arena, rather than searching, you can get it from the index after the int. This will also give the maximum number of unique scratches you need in any instance.
This is not how this scratch system works. If you call GetScratch twice in a row, both for temp allocations (you don't pass first scratch as a conflict to the other), you will get same arena for both calls. And that is fine, this is how stack works. Even in case when arenas are in conflict, fixed counter would disable alternating same arenas in arbitrarily deep stack as Ryan mentioned with his example.
So I got a notification about your reply about why did I want that, but I can't find it, so maybe you deleted your reply? I will explain my idea a little more so that we are all on the same page.
The OP was talking specifically about the per-thread scratch system and was not about the stack-like arena fashion. In the article, Ryan introduced the GetScratch function that takes in an array of Arena* and returns a unique arena to the array from the thread pool (technically, it returns a TempArena that reference the unique arena). The original array of arenas can only contain a thread-local scratch arena when and only when that scratch arena was from the thread-local scratch pool by calling GetScratch. My idea was rather than pass in a not-to-conflicts-with array to the GetScratch, it just has a counter that gets increased every time it gets called and returns the element at that index from the pool. Then when the current function is done using the scratch arena, it will call the ReleaseScratch function which will decrease the counter by one so the latter functions can reuse the arena.
I don't know if by OP you meant your original reply or Ryan's post. But Ryan introduced scratch system in the context of stacks.
As for your implementation, yes that would work too, but then you'd have to dynamically increase arena pool when your memory stack grows beyond pool capacity. (Or allocate large enough pool at start by measuring your code, which could be a bit tricky if you use recursions a lot). I think Ryan's approach is more elegant exactly because of arena re-usage. You don't have to measure your code, you just have to follow simple rule Ryan outlined. As long as only a single persistent arena is present at any point in any codepath, you will not need more than two scratch arenas.
I meant my original reply.
So it seems like you misunderstand me. You can still have a persistent arena, and all your functions still need to pass in an arena like the function A and function B example. I was talking specifically about the GetScratch and ReleaseScratch functions that were used inside functions A and B.
void *FunctionA(Arena *arena)
{
ArenaTemp scratch = GetScratch(); // increase the counter
// use the scratch arena
// fill result to Arena* arena
ReleaseScratch(scratch); // decrease the counter
return result;
}
void FunctionB(void)
{
ArenaTemp scratch = GetScratch(); // increase the counter
void *result = FunctionA(scratch.arena); // the scratch inside A will not be the same as this scratch
// use result
ReleaseScratch(scratch); // decrease the counter
}
I can't find a way to turn on the code format in the reply section, so here's a link with some example code about the difference between Ryan's way and mine: https://pastebin.com/h53XsVzi
Consider this usage code:
ArenaTemp a = GetScratch();
ArenaTemp b = GetScratch();
ReleaseScratch(a);
ArenaTemp c = GetScratch();
In your approach, `b` and `c` would be the same scratch arena.
You can't get the same arena until the counter goes down. The GetScratch method only gets the scratch at the counter and increments it by one. Kind of like this: #define GetScratch(name) ArenaTemp name = scratchPool[scratchCount++] and #define ReleaseScratch(name) ArenaTempEnd(name); scratchCount--
Fantastic post! Mind-expanding and elegant :-D.
I just have one question: how do we pull data out of the arena? And why not make arenas hash maps rather than contiguous blocks, so we can pull out the values by name, even though that would be a bit less efficient?
Thanks!
Push functions return pointer to the memory block that you can then write into. As for hash maps, I believe those are not on the same level of abstraction. Hash map would be classified as a cointainer. You can easily implement hash map on top of arenas. Same goes for dynamic arrays as Ryan mentioned in the article.
Is the idea that every time we use an arena to allocate new memory to create a variable inside of a function, that we return that variable?
Right - in the same way that you don't ask the `malloc` allocator for allocations you've performed, you don't ask an arena for those either. That's up to the usage code of an arena. So you push something onto an arena with e.g. `Foo *foo = PushArrayZero(arena, Foo, 1);`. Then, it's up to *you* to use `foo` for whatever, store it somewhere, etc. - bookkeeping does not belong at the layer of the arena.
Good to see you introduce arenas. Your APIs that take `arena` as parameter remind me of Zig, are the concepts identical?
I haven’t looked at Zig’s memory allocator strategies much, but from my understanding Zig requires that allocating functions take allocators as parameters. In that sense, it’s the same. It’s different in that I’d expect Zig’s allocators to be abstract (this might not be the case), whereas arenas are a specific type of basic allocator.
The second part is important. Arenas simplify both implementation and usage code. Code that must work with a generic allocator will be more complicated than code that needs to work only with arenas.
Just double-checked the docs… in Zig you can override the implementation with as much granularity as you like: https://ziglang.org/documentation/0.9.1/#Choosing-an-Allocator
In terms of mass deallocation of memory—i.e., deallocate all memory allocated at once—with arenas I assume others have implemented efficient ways of doing this? - With the system [kernel] allocator its straightforward but often incurs nontrivial overhead.
I like the idea of using an arena to allocate a large tree datastructure, but without granular bookkeeping I wouldn't think more complex data-structures could be efficiently implemented (like Brodal's finger trees).
PS: Enjoyed this line "like a government agency, except in this case, the garbage collector is ostensibly doing something approximating useful work—although both function by stealing valuable resources involuntarily" ^_^
Yeah, so it seems like my expectation was correct. The problem I have with that design decision (which matches that of Odin as well) is that it forces code written for an abstract allocator - for which the implementation can change - to use the *maximally generic allocator API*. With an arena, the burden of freeing individual allocations all but disappears (except when doing more complex allocators on top of an arena, e.g. a pool allocator) - so code written for an arena is much simpler than code written for a generic allocator (even if that generic allocator no-ops on all of the frees, and indeed is implemented with an arena). The generic interface remains a pessimization for the allocator usage code.
The "granular bookkeeping" you mention (for more complex data structures) is what I describe as "a more complex allocator that composes with an arena", e.g. the free-list + arena introduced in the post. So it's definitely possible, it's just that it requires *composition with an arena*. The approach of e.g. Zig and Odin is to *abstract over the allocator*, but I disagree with their approach - instead, use the simplest allocator as a building block, and add more building blocks as necessary.
Regarding the government line, glad you enjoyed it. I have to sneak in my ancap somewhere. :)