Awesome, thanks for sharing Ryan. Also there is a typo in "barrirer"
In the context of a broader game engine, how might this "CPU shader" approach be designed?
I think this may be one of those areas that, as you mention at the end of the article, may not fit 100% (and in fact requires heterogeneous systems), but I'll try to flesh out the idea as a thought exercise.
In an ideal world, it seems like each thread would simply run its own copy of the game loop. Thread 0 would probably be responsible for keeping time and gathering inputs (and maybe a barrier sync could be used if that's required), and then afterwards each thread pulls the relevant update tasks from its portion of the update queue (and propagates any tasks it'd like done at the end of its update section).
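A rough sketch of that per-thread loop, using the lane helpers from the post (the barrier and the input/time/update functions are placeholder names, not from the article):

```C
// one copy of this loop runs on every lane in the "game" lane group
for(;;)
{
  // lane 0 acts as the referee: it alone gathers input and advances the clock
  if(LaneIdx() == 0)
  {
    GatherInput();      // placeholder
    AdvanceGameClock(); // placeholder
  }
  LaneBarrierWait(); // placeholder barrier; all lanes now see the same frame state

  // each lane takes its slice of this frame's update tasks
  Rng1U64 range = LaneRange(update_task_count); // assuming a [min, max) slice per lane
  for(U64 idx = range.min; idx < range.max; idx += 1)
  {
    RunUpdateTask(idx); // placeholder
  }
  LaneBarrierWait(); // placeholder; everyone finishes the frame before the next tick
}
```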
I think one problem with this approach, though, is that components like OpenGL (or SDL, iirc) lock themselves to the main thread, and require a response from that thread to do anything. So you can't arbitrarily assign any thread to interact with those systems. That can be argued to be an issue of engineering cruft.
A more fundamental issue may lie in things like the input and timekeeping code. For game logic, it seems like it'd be best to have a referee thread that acts as the source of truth. If multiple threads independently gather input or do timekeeping, that'd become messy very quickly.
Let me know your thoughts
When external systems (like those exposed by the operating system, or GPU APIs) enforce their own timeline or thread-assignment constraints, then they need to be pulled out into their own heterogeneous timelines / thread groups. In the debugger we have this with the Windows event loop, the GPU APIs, and also the debug APIs: you need to dedicate a thread to waiting for debug events, which is not locked to the refresh rate (it must go much faster), and on Linux, only one thread can actually be attached to other threads as a debugger.
In my game project, I separate UI/rendering onto its own timeline, and keep game simulation on its own timeline. I also pull input event gathering into a separate timeline. I wrote all of that before I learned the stuff in this post, so these are just individual threads rather than thread groups / wavefronts / whatever you want to call them. But, when I keep going on that project, what I will probably do is just convert the UI/render ("user") thread into a "user" thread *group*, and same thing with the game simulation thread. Basically all that requires is going and annotating the code appropriately, and redesigning certain algorithms when needed.
When it comes to game simulation, I don't think there is a problem if you just deterministically decide on the timestamps in absolute space, given the tick index. If you do that, then all threads should just get the right answer. But if you ever did need a single source of truth, then you can always simply do that by going narrow, and then syncing the value. No problem.
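As a trivial sketch of that (the fixed 60 Hz rate is just an assumption for illustration), the timestamp can be a pure function of the tick index, so every lane computes an identical value with no synchronization:

```C
#include <stdint.h>

// every lane derives the same simulation time from the same tick index,
// so no single source-of-truth thread is needed for simulation timekeeping
static double SimTimeFromTick(uint64_t tick_idx)
{
  const double tick_dt = 1.0/60.0; // assumed fixed simulation tick rate
  return (double)tick_idx * tick_dt;
}
```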
When you have multiple timelines/thread groups like this, what's your strategy for mapping them to CPU cores? Do you just create some number of threads (presumably more than the number of cores available) and let the OS scheduler manage things, or do you limit the total number of threads created (at the expense of leaving cores idle sometime), or is there a system to switch OS threads between these thread groups over time?
I think the best strategy is highly dependent on circumstances. I could imagine it sometimes being best to oversubscribe, and let the kernel scheduler swap in the active threads. I could also imagine in other cases (e.g. if there is no benefit in increasing the cores past some point) constraining the number of threads you allocate to one timeline.
What I hoped to get across in this post is that, writing code which is multi-core by default is actually not making that decision for you; it's simply making your code more flexible for going wide at all. The number of cores you give to a timeline is a knob that you can tweak after the fact (perhaps even dynamically at runtime).
Great post, thank you for sharing. Do you intend on writing all or most of your future programs like this?
Also, I kinda wonder what would happen if, hypothetically, every program running on a computer were written to allocate NUM_CORES_ON_MACHINE threads upfront and use this multi-core-by-default approach. I would imagine that in 99% of normal home-user scenarios, where you really only care about one program (the one you are currently interacting with), it would be very beneficial. But maybe if for some reason you were running lots of individual processes at once and cared about how long it takes for all of them to complete, there might not be much difference in the overall time taken...
If a program requires only serial work (which is rarer than you'd think), then almost all of the threads will just wait at a barrier, and the kernel will just not even schedule them, so you'll actually have very little overhead. If a program *can* take advantage of going wide in this way, then it is strictly better to do so. Now, as the programmer, it's important to note how much you actually win for each core you add - in some cases (e.g. due to shared L3 cache contention) you can run into diminishing returns. In those cases, you don't need to go completely wide - it may cost a lot more energy, and a lot more work, for very little gain.
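A fully serial stretch inside otherwise lane-aware code might look something like this (the barrier helper name is a placeholder for whatever the codebase provides):

```C
// only lane 0 does the serial portion; every other lane goes straight to the
// barrier and sleeps there, so the kernel stops scheduling it
if(LaneIdx() == 0)
{
  DoSerialWork(); // placeholder
}
LaneBarrierWait(); // placeholder; effectively a no-op when LaneCount() == 1
```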
Importantly, the techniques in this post don't actually force you to go as wide as the CPU allows. What it really is about, in my view, is adding extra information to code, such that the code *can* be executed wide if needed. It is about defining the information needed for the code to execute correctly in a thread group.
In the debugger codebase, I have certain codepaths which have highly variable inputs. In some cases, depending on those inputs, I want those codepaths to go wide. But if those inputs are different, I want those codepaths to go narrow. This approach lets you decide.
Amdahl’s Law comes into play fairly rapidly here. Programs range from the “embarrassingly parallel” to the totally serial, and no generic solution is available or possible. There’s still some juice to squeeze from the fact that going from one core to n cores is a discontinuous and sharp increase in complexity, and our host’s excellent post is a contribution there.
But the reason we have so many approaches, and even dedicated hardware for some parallel tasks, is because of inherent heterogeneity in the problems which computers solve. The optimal solution will always depend on the domain.
This is kind of how MPI works right? 🤔
Hi, I enjoyed reading your article and now I want to go try it out myself!
This style of parallel programming is almost exactly the MPI/OpenMPI way of doing things. I think mentioning this is important to give credit where credit is due, and also allow readers who already know it to skip ahead.
I'm pretty confused about the "Codebase Support" section where you introduce "lanes". I really don't understand what you are trying to say because, among other reasons, there are no examples of why and how the abstraction emerges.
I've been trying to wrap my head around the concept for a while now but I just don't get it. A few questions I can think of right away that could maybe help me understand:
1. Is a lane just a way to map a stable handle to a potentially unstable real thread index so that warps/thread-groups are easier to manage and reason about?
2. If yes then how do you map lane indices to real thread indices?
3. What problem do lanes solve exactly?
4. What would I be missing out on if I don't make this "Lane" abstraction? (Code examples)
Also have you measured the performance gains from using this programming technique by simply starting the program with only 1 thread? I'm really curious how much you gain from this in the RAD debugger and if you think it was worth the complexity cost.
> This style of parallel programming is almost exactly the MPI/OpenMPI way of doing things. I think mentioning this is important to give credit where credit is due, and also allow readers who already know it to skip ahead.
I didn't learn any of this from that. I stated openly that I didn't invent it, and where I learned it from. It wasn't just MPI/OpenMPI; there are countless places where people have used similar techniques. It's just that, in other fields, people (like myself) have learned to overcomplicate the problem. I can't bloat the post with everyone who has ever done this before; the long & full history is well beyond the scope of this post.
> I really don't understand what you are trying to say because, among other reasons, there are no examples of why and how the abstraction emerges.
Well, why does MPI have "ranks"?
The reason why you want a term other than "thread" is because one thread participates in *many* groups, meaning it has a different "lane index" at various points in time. There are also potentially *many* lane groups, meaning a singular "thread index" is really not the right concept.
Imagine I take a lane group, and dedicate a particular task to a specific lane:
if(LaneIdx() == 0)
{
SomeFunction();
}
Now, imagine that SomeFunction itself was also written to take advantage of multiple lanes.
void SomeFunction(void)
{
Rng1U64 range = LaneRange(...); // uses LaneCount()
}
Can you see the problem? That `range` will be computed *as if all lanes were calling* `SomeFunction`. But only lane 0 called that function. Thus, in cases like this, you want to create a new lane group - or "wavefront" - and give each thread a new lane index. In this case, you would just want a single-threaded wavefront, where `LaneIdx()` stays `0`, but `LaneCount()` drops to `1`, and a new barrier is used (which will no-op, since there are no other lanes to wait for).
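A minimal sketch of that fix, borrowing the `LaneCtx`/`lane_ctx` helpers that show up later in this thread (the exact names and struct layout are illustrative):

```C
if(LaneIdx() == 0)
{
  // push a single-lane ("thin") context so that SomeFunction's LaneRange()
  // sees LaneIdx() == 0 and LaneCount() == 1
  U64 thin_broadcast = 0;
  LaneCtx thin_ctx = {0, 1, {0}, &thin_broadcast};
  LaneCtx restore = lane_ctx(thin_ctx);
  SomeFunction();
  lane_ctx(restore); // pop back to the wide lane context
}
```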
> Is a lane just a way to map a stable handle to a potentially unstable real thread index so that warps/thread-groups are easier to manage and reason about?
No. Like I said, "thread index" is not really appropriate or useful, because your program may contain many threads, all of which are in various groups, and a single "thread index" tells you nothing about how work *within one thread group* should be divided. That is why you want a term other than "thread". I found "lane" to capture the idea and be short & concise, so I used it.
> If yes then how do you map lane indices to real thread indices?
Again, there is no global "thread index". But when you create a thread group, that is when you assign each thread in the group a "lane index".
> What problem do lanes solve exactly?
Already answered here, and I also covered all of this in the post.
> What would I be missing out on if I don't make this "Lane" abstraction?
There is no extra "abstraction" here. It's simply a different term. Using "thread" everywhere simply blends terms together in a way that will make the meaning of various names/concepts in your code ambiguous, which makes it more difficult to reason about.
> Also have you measured the performance gains from using this programming technique by simply starting the program with only 1 thread?
Yes - I have done an enormous amount of measurement. As I said in the introduction of the post, I have been working on some problems in RADDBG which require much better usage of all cores than you'd just get out of single-core code. How much win you get depends on the bounds of your problem. For example, if you have 64 cores, you may not get a 64x improvement if all cores are contending on shared L3 cache(s) or limited file I/O queues/buses/etc. But in some cases, you will. It depends a lot on your problem. But you don't have to look very hard to find a problem where you obliterate the performance of single-core code.
> Can you see the problem? That `range` will be computed *as if all lanes were calling* `SomeFunction`. But only lane 0 called that function. Thus, in cases like this, you want to create a new lane group - or "wavefront" - and give each thread a new lane index. In this case, you would just want a single-threaded wavefront, where `LaneIdx()` stays `0`, but `LaneCount()` drops to `1`, and a new barrier is used (which will no-op, since there are no other lanes to wait for).
Ok so I understand that lanes are meant to avoid writing code like this:
```C
// let `work_data` be some buffer of `work_length` elements to do operations on
thread_local int thread_count = 8;
if (thread_id >= 3 && thread_id <= 5) {
    // This would solve the problem.
    // However, semantically, modifying the thread_id does not make sense, as now we have two threads each with ids 0, 1, and 2.
    // Our list of thread ids would look like this before restoring the ids: [0,1,2, 0,1,2, 6,7]
    //
    // Changing the thread_count doesn't make that much sense either, as it's supposed to give us the total number of global threads and not just
    // the threads currently executing this branch. To modify it we even have to change thread_count from static to thread_local.
    thread_count = 3;
    thread_id -= thread_count;
    assert(work_length % thread_count == 0);
    int work_part_length = work_length / thread_count;
    int thread_range_start = thread_id * work_part_length;
    int thread_range_end = (thread_id + 1) * work_part_length;
    for (int i = thread_range_start; i < thread_range_end; ++i) {
        // ... operate on work_data[i] ...
    }
    thread_id += thread_count; // Don't forget to restore the thread ids afterwards
    thread_count = 8;
}
```
But how exactly do you "create a lane group" in your code?
And if you have two lane groups G1 and G2, both with 3 lanes each indexed 0..2, how do you differentiate between, say, lane 1 of G1 and lane 1 of G2?
For example can you give the exact code that would help you solve this problem with lanes?
How would you make a group of lanes that has lanes/threads 1,3,4,8?
```C
if (lane_id() == 1 || lane_id() == 3 || lane_id() == 4 || lane_id() == 8) {
    SomeFunction();
}
```
Where SomeFunction calls code that was written to take advantage of lanes.
I've spent multiple hours trying to understand how you do this, and I've gone through the RAD Debugger code to try and get a better grasp of it, but I haven't been able to find code that declares a group of threads.
When lanes are created, the lane id is simply the thread index (radbin/radbin.c):
```C
Thread *threads = push_array(scratch.arena, Thread, threads_count);
RB_ThreadParams *threads_params = push_array(scratch.arena, RB_ThreadParams, threads_count);
Barrier barrier = barrier_alloc(threads_count);
U64 broadcast_val = 0;
for EachIndex(idx, threads_count)
{
  threads_params[idx].cmdline = cmdline;
  threads_params[idx].lane_ctx.lane_idx = idx;
  threads_params[idx].lane_ctx.lane_count = threads_count;
  threads_params[idx].lane_ctx.barrier = barrier;
  threads_params[idx].lane_ctx.broadcast_memory = &broadcast_val;
  threads[idx] = thread_launch(rb_thread_entry_point, &threads_params[idx]);
}
```
"Async threads" are created in a similar way in `base/base_entry_point.c` with
a different entry_point than `rb_thread_entry_point`. I don't know if those
count as a different group because they start somewhere else completely.
Also I think the file is involved in some sort of metaprogramming.
So correct me if I'm wrong, but right after each thread is started there are no lane groups or anything, and lanes are not used for anything other than lane_range() loops.
The only place where I could find a lane context change was in `artifact_cache.c`:
```C
for (;;) {
  ...
  // rjf: push thin lane ctx
  U64 thin_lane_ctx_broadcast_memory = 0;
  LaneCtx thin_lane_ctx = {0, 1, {0}, &thin_lane_ctx_broadcast_memory};
  LaneCtx lane_ctx_restore = lane_ctx(thin_lane_ctx);
  // rjf: compute val
  B32 retry = 0;
  U64 gen = r->gen;
  AC_Artifact val = r->create(r->key, r->cancel_signal, &retry, &gen);
  // rjf: restore wide lane ctx
  lane_ctx(lane_ctx_restore);
  ...
}
```
Here I suppose that the `r->create` function has code written to take advantage of multiple lanes. However, before calling it you change the lane context of every thread so that lane_id() will be 0 and lane_count() will be 1.
So if we have 4 threads, the { lane_idx, lane_count } transformation will look like this for each thread:
0: { 0, 4 } -> { 0, 1 } -> { 0, 4 }
1: { 1, 4 } -> { 0, 1 } -> { 1, 4 }
2: { 2, 4 } -> { 0, 1 } -> { 2, 4 }
3: { 3, 4 } -> { 0, 1 } -> { 3, 4 }
And each thread will have calculated the same value.
So is this the only piece of code where lanes are really used in rad, besides LaneRange() statements?
In MPI, ranks serve to index processes within an explicitly declared group or communicator; the rank is linked to a specific group/comm.
https://mpitutorial.com/tutorials/introduction-to-groups-and-communicators/
Sorry if it looks like I'm being obtuse but I'm trying my best >_<
Sorry for the awful substack formatting. It was much more readable in my editor.
Very interesting idea that made me rethink some "obvious" assumptions I have when writing multithreaded code, but I don't feel like I am paying anything extra with a job system that I am not going to pay with lanes, given that the actual data transformations are the same, and it covers more cases than lanes do.
As you mentioned, a single timeline for a program with a lot of heterogeneous tasks won't work. So you will have normal single-core logic at the bottom of the stack, and then you need to create timelines at some point based on what you actually need to do. I think at this point it's already very close to a job system's "single core" structure.
For stuff like going wide for entity update or preparing data for render, all is good, but for stuff like loading a scene or a model, jobs allow you to dynamically discover the actual work bounds without preparing anything in advance. With timelines I would have to make a mini job system in place for those cases, and I have no idea how wide my lane should be.
It also seems pretty heavy to create those timelines if they require starting new platform threads. If I am going wide over my entities and I need to make an explosion that I want to parallelise, I don't have multi-core by default, because my timeline is busy with the entity update. I need to create a separate timeline within a timeline specifically for that. You can argue that I should put the explosion into some queue and then batch-process them in a separate pass anyway, but depending on the game this can be just a lot of extra work for zero benefit. Or you can go full shader mode and make every lane go over the explosion logic in every entity, distribute all the explosion work across all threads, and then wait with all threads. If you are going to do it for every entity, that is thousands of extra syncs just for this one gameplay feature. And if you have 10 more optional things to do in the entity update, that becomes much worse.
I also don't get the lifetime argument. In the parallel-for case I can directly reference any data on the stack within my job; it's not really detached as shown in your diagram if I am going to wait for the result like in the timeline case, so I don't need to think about the general case. In my opinion, user code for the job case looks better than the timeline case, because with lanes your stack is not shared, so you need to do all those broadcasts to sync it (which are also easy to mess up). Or use a shared buffer per lane, but depending on how it's used/allocated, that can also be more annoying than referencing call-stack variables.
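For reference, the broadcast pattern being described is roughly this (sketched against the `broadcast_memory` slot from the lane context quoted earlier in the thread; the accessor and barrier names are placeholders):

```C
// lane 0 produces something (e.g. a pointer to data it allocated), publishes
// its address through the lane group's shared broadcast slot, and every other
// lane reads it back out after the barrier
if(LaneIdx() == 0)
{
  *LaneBroadcastSlot() = (U64)result_ptr; // placeholder accessor for lane_ctx.broadcast_memory
}
LaneBarrierWait();                         // placeholder barrier: publish before anyone reads
Result *shared = (Result *)*LaneBroadcastSlot();
LaneBarrierWait();                         // second barrier so the slot isn't reused too early
```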
Awesome post. Why do some of the FileRead() calls have 3 arguments and some have 4 arguments? Also you're computing LaneRange(values_count) twice in the last example for seemingly no reason?
FileRead is not an actual API I use; it is just an example. The first examples just used integers (because I did not want to introduce a range type for no reason in the post), but the latter part was specifically talking about codebase features you can add to a base layer, so at that point I introduced the range type, which I also use in the FileRead API.
My actual codebases use ranges for the FileRead equivalent API, so it'd be 3 parameters rather than 4.
Interesting!
But how is failure handled? If an exception appears in one of the lanes/threads, wouldn’t that make all other threads wait forever at the next barrier, so the program would just deadlock?
With a job system, waiting on a task allows you to find out it failed, and make a decision about it.
I don’t really see how this works here.
this is fun, but it's kinda shockingly similar to "coarray fortran"; parallel evolution is fun to see :]
Ryan, many thanks for writing this up. I remember you mentioning this on stream but I didn't quite get the gist of it so thanks for taking the time to put it into writing here, it's nice and clear now!
Great article as always. One small question: As far as I can tell, the first thread is always responsible for initializing/allocating the work, broadcasting data across lanes, and merging/outputting the result. I assume you want to do this to easily switch to single-core mode later when "the calling code can simply set thread_idx = 0, thread_count = 1, and barrier = {0}."
My question is: do you ever assign those responsibilities to a different thread? Or have cases where multiple threads do different setup tasks? In those cases, how can you switch back to single-core? Setting thread_idx to 0 means that those other code paths get disabled.
The choice of lane 0 is not necessary to work gracefully in single-core mode; you just need to make sure the lane index is actually valid, so someone gets the work. You can do this simply by modding by the lane count:
if(LaneIdx() == N % LaneCount())
// where N is whatever you want
This makes it easy to split up these setup tasks, if you know they're independent:
if(LaneIdx() == 0)
{
// setup task 0
}
if(LaneIdx() == 1%LaneCount())
{
// setup task 1
}
if(LaneIdx() == 2%LaneCount())
{
// setup task 2
}
I do this in a number of places. I actually just pulled out the `%` part to not screw it up, as `LaneFromTaskIdx`, so I can just do `if(LaneIdx() == LaneFromTaskIdx(123))`.
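That helper is just the mod pulled out into a function, i.e. something like this (a sketch of the shape, not the actual codebase definition):

```C
// maps an arbitrary task index onto a valid lane index, so the task still gets
// picked up even when the lane count is small (e.g. 1, in single-core mode)
static U64 LaneFromTaskIdx(U64 task_idx)
{
  return task_idx % LaneCount();
}
```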
Why not `if (IsLaneIdxSpecial(123))`? Do you find any other use cases for `LaneIdx()` except this and broadcasting?
That is fundamentally the same thing. I wouldn't do that, because I don't agree with putting every trivial composition of operations behind an API. I only did it for the % in `LaneFromTaskIdx` because doing that part wrong silently breaks code when you run it under different circumstances (e.g. with a different lane count), and I'd prefer not to have to catch those bugs later.
`LaneIdx()` is used in many places, like writing into a specific slot of an output array, where each lane fills out its own part. It is also implicitly used in `LaneRange(...)`.
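A sketch of that per-lane-slot pattern (the `partial_sums` array and the barrier helper are illustrative assumptions, not from the post):

```C
// each lane reduces its own slice of `values`, then writes the partial result
// into its own slot, so no locking is needed; one lane can fold the slots
// together after the barrier
Rng1U64 range = LaneRange(values_count);
U64 lane_sum = 0;
for(U64 idx = range.min; idx < range.max; idx += 1)
{
  lane_sum += values[idx];
}
partial_sums[LaneIdx()] = lane_sum; // one slot per lane
LaneBarrierWait();                  // placeholder barrier
```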
I recommend learning the Elixir language. The virtual machine BEAM has a separate scheduler for each core. You learn how to design cooperating processes; BEAM handles the details.
Great post, as always. It would be great to read more about how you would work with these primitives in a full program, and how you set up the contexts and lane-groups for use. I must confess that I don't know very much about thread-local storage... I know there are mechanisms to access thread-local keys/slots on Linux and Windows, but that's about it.
Great article. Thanks for sharing.
I'm curious as to whether you have any thoughts about how you would combine a generic job system (possibly task stealing or something) approach with your lane groups approach. I'm thinking in a scenario of threads executing various odd jobs in parallel, but then a larger batch of work which may come up occasionally and could benefit from your lane group approach. I'm thinking it might make sense to dynamically determine how many threads are free, conscript them into a lane group, and run the batch job. I guess that could have the issue of threads being occupied with odd jobs and then being unavailable for forming into a lane group. But the alternative I see is to keep some threads in reserve that can be formed into lane groups for when bigger batch jobs come up. But then you've got threads idle. Does any obvious approach occur to you?
Thanks for the great post Ryan!
The part that really made it click was the introduction of the concept of a "timeline". I always felt like the job system approach was at odds with the typical way a program wants to be written, but I never managed to disambiguate one particular thread's work from a more encompassing "timeline". This post finally gave me that clarity.
Very interesting. I like the idea of multithreaded by default. I wonder what it would be like to run SIMD and multithreaded by default and then, as you said, "go narrow" most of the time. I suppose one would constantly need to decide whether to use at most the same number of threads as logical processors in order to go all out with SIMD, or to use more threads and at most half-width SIMD instructions. Maybe it's smart enough to switch between all the different states at any point in time. Hmm, just a thought.