UI, Part 9: Keyboard and Gamepad Navigation
Techniques for supporting keyboard and gamepad navigation without overburdening builder code.
If you’ve learned about operating systems, chances are you’ve heard the term “context switch”. It’s the name given to the operation done when an operating system scheduler stores state required for some thread to execute, prepares some state for another thread to execute, and resumes. It’s an implementation detail of how one CPU core might execute work for multiple threads, without any cooperation from the threads themselves (a “thread” being an abstraction which refers to an instance of a single cohesive ‘stream’ of work).
I won’t be writing about context switches in that manner in this post, but I needed a hook—instead I’ll be writing about context switches for users as they interact with an interface.
A modern consumer desktop or laptop has two primary modes of input: the keyboard, and the mouse. The patterns in which information is encoded using these two modes are quite different.
The keyboard is useful for—unsurprisingly—textual input. In user interfaces, users regularly need to write some piece of text, and they do so using well-established designs and patterns which are considered more-or-less standard for textual input. I’ve already covered how one approach to the problem of textual input can work. A keyboard is also useful for discrete inputs with a small bandwidth—for instance, directional navigation (a subset of which is found within the problem of handling textual input).
The mouse is useful for continuous inputs with a larger bandwidth. For instance, at which value a slider should be set; or how far down on a long page the user would like to scroll; or using two dimensional coordinates to determine which item to select out of 1,000 on a screen.
The characteristics—and differences—of the mouse and keyboard explain why in a first-person shooter, for example, the keyboard is generally used to control the player’s movement with a small set of discrete inputs (forward, backward, strafe left or strafe right), and the mouse is generally used to control the player’s looking direction (particularly in games when this piece of information must be very precise, like a competitive multiplayer shooter).
A mouse is a one-handed device, and a keyboard is a two-handed device. A subset of a keyboard may be used with one hand. For this reason, a person interacting with a computer is generally operating in one of two positions. I’ll call them “position A”—both of their hands are on the keyboard—and “position B”—one hand is on the keyboard, and the other is on the mouse.
And with that, I hope the relevance of a context switch becomes clear—when someone must change from “position A” to “position B”, it causes one version of a mental context switch. It interrupts the user’s flow as they interact in one way, and forces them to adjust and begin another flow.
Context switches are worth discussing because they cost something. In the case of an operating system scheduling threads, it may be precious CPU time. In the case of a person switching from “position A” to “position B”, it may be precious user time. When user time is not valued, the cost is lost time at worst, and more commonly friction and frustration when interacting with the software in question. It may only be a fraction of a second each time, but that time adds up, and it makes an impression on the user.
I won’t ramble much about valuing user time. Many developers—obviously—don’t value user time. But I believe in well-designed and pleasing experiences, and I believe that solutions which provide such experiences (without sacrificing much else) succeed more often, in the long term, than those which don’t.
For this reason, I minimize—to the degree that I can—user friction and context switching. In this post, I’ll be focusing on the context switches incurred when a user must switch repeatedly from “position A” to “position B”. Moreover, how those context switches may be avoided by not artificially constraining the set of inputs a user interface accepts. This can be done with support for keyboard navigation and control, when appropriate.
If a higher quality design were not reason enough to support keyboard navigation, consider that it is also the first step in supporting other input methods, including gamepads and those used for accessibility.
Note: While I’ve seen the techniques I’ll describe in this post succeed in practice, this is one area of user interface programming in which I suspect there is plenty of room for improvement without much effort. My current techniques are several steps ahead of those I used a few years ago, but they are probably not the pinnacle of what one might easily achieve.
Taking A Walk
I’ll begin by specifying some constraints that I find reasonable for the problem.
I’ve already sketched a picture of builder code given the subjects covered in previous posts. Namely, that it is the set of codepaths tasked with producing a “box hierarchy”, where a “box” is just one fragmentary building block which we can compose to form the effect of widgets. Each box may have a unique key, which is required for interaction with them (labels and other non-interactable widgets do not require unique keys). Builder code simultaneously correlates user interaction with boxes in this hierarchy with codepaths which must execute in response to that interaction—“what happens when the user presses this button?”
Another way to conceptualize a “box” is this: One “box” is simply the intersection of a rectangular portion of the screen with common user interface “grammar”—such as clickability, borders, backgrounds, text, and computation produced by interactions with the user.
As I’ve written before, builder code is the largest, most complex, and most frequently changed part of a graphical application’s frontend user interface code. Given that, a solution for keyboard navigation support should ideally require as little as possible from builder code. Thus, our core code will instead need to be involved in providing some useful mechanisms for the problem.
In a large number of—but not all—cases, I’ve found this is possible. The keyboard navigation behavior in such cases can be automatically derived from a box hierarchy, with nearly no extra effort.
To demonstrate why, I’ll break down an example.
Take the above example. Assume, first, that a user can only interact with one box at a time with their keyboard. Knowing that, we can form a fairly trivial rule for building an ordering of the possibly-interactable boxes:
This is the order in which certain boxes are visited by the user’s keyboard selection when they navigate with Tab from top to bottom, given standard Tab navigation behavior.
There’s a reason that is the standard: this ordering is natural—meaning conceptually (and computationally) easy—given the structure of the box hierarchy. Why this is the case becomes obvious when looking at a visualization of a depth-first, pre-order traversal of the box hierarchy used to produce the above user interface.
My Microsoft Paint skills are only so good, so while it might be difficult to clearly see in the above image, what’s hopefully clear to understand is that the keyboard navigation walk through the tree is a sub-walk of the depth-first, pre-order walk through the tree. In other words, we can compute the former by simply filtering the latter.
This filter can be produced in a number of ways. You might decide, in your own implementation, that it ought to be explicitly encoded with a feature flag on each box. Boxes with the flag are included in the keyboard navigation walk, boxes without the flag are not.
This flag need not necessarily be directly related to this filter, though, and instead it can simply be that which I previously introduced for controlling clickability behavior: UI_BoxFlag_Clickable. The logic goes, if a box in the tree is interactable via the mouse, then it should also be interactable with the keyboard.
This works particularly well if the clickability behavior codepath is extended to support keyboard selection and—for instance—hitting the enter key, or the A button on a gamepad causing an identical effect to a click with the mouse.
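To make this concrete, here is a rough sketch of how the filtered, depth-first, pre-order walk might be computed. It is only an illustration: it assumes each UI_Box stores first, next, and parent links (null when absent) and a flags field, roughly as in previous posts, and your names and tree representation may differ.
static UI_Box *
UI_BoxPreOrderNext(UI_Box *box, UI_Box *root)
{
  if(box->first != 0) { return box->first; }           // descend into children first
  for(UI_Box *p = box; p != 0 && p != root; p = p->parent)
  {
    if(p->next != 0) { return p->next; }               // otherwise, advance to the next sibling (of an ancestor, if needed)
  }
  return 0;                                            // walk is finished
}
static UI_Box *
UI_BoxNextNavigable(UI_Box *box, UI_Box *root)
{
  for(UI_Box *b = UI_BoxPreOrderNext(box, root); b != 0; b = UI_BoxPreOrderNext(b, root))
  {
    if(b->flags & UI_BoxFlag_Clickable) { return b; }  // filter: keyboard navigation only visits clickable boxes
  }
  return 0;
}
Tab-style navigation then just replaces the current selection with UI_BoxNextNavigable(current, root), and an analogous reverse walk handles Shift + Tab.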
Directionality
The filtering rule can quickly become more complex. This gets into the weeds, so I’ll only briefly cover what some of the options are—specifics ought to be reserved for a full implementation, instead of a blog post simply covering the basic mental framework for implementing a solution.
Thus far I’ve only covered the most basic case: navigating in depth-first order, irrespective of visual layout. This is sufficient for many cases, and is a great starting point.
But in many cases, users expect an interface’s visual layout and the directionality of their inputs to influence navigation. Moving the selection down from within a horizontal row may mean that all siblings within that row are skipped, and the next vertically-arranged box within the depth-first walk is selected.
Directionality information is already encoded within the box tree for the purposes of layout, and so this information may be also used in implementing this filtering rule.
Directionality alone is not perfect, though, because nodes within the box tree are not uniformly sized. If the user presses down, it is less jarring for their new selection to be as horizontally close to their previous one as possible. This information is not explicitly encoded within the box tree, and so it may require extra complexity in the filtering rule.
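As a rough starting point, one option is to score every navigable candidate for a given directional input, and pick the candidate with the lowest score. The sketch below handles a "move down" request; it assumes (purely for illustration) that each box stores its computed on-screen rectangle in a rect field with x0/y0/x1/y1 coordinates, which may not match your layout code.
static float
UI_DownNavScore(UI_Box *current, UI_Box *candidate)
{
  float cur_center_x  = (current->rect.x0   + current->rect.x1)   * 0.5f;
  float cand_center_x = (candidate->rect.x0 + candidate->rect.x1) * 0.5f;
  float dy = candidate->rect.y0 - current->rect.y1; // vertical gap; negative means the candidate is not below
  float dx = cand_center_x - cur_center_x;          // horizontal distance between centers
  if(dx < 0) { dx = -dx; }
  if(dy < 0) { return 1e30f; }                      // reject candidates that are not below the current selection
  return dy + dx;                                   // lower is better: nearby, horizontally-close boxes win
}
Analogous scoring functions can handle up, left, and right.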
I will leave these fancier extensions to the filtering rule up to you, though—it’s a great area for experimentation.
Key Stability
So, you might then think:
Well, I guess this one’s a wrap! All you need is to store, say, the key of the box selected by the keyboard, and implement some keyboard navigation controls which mutate that state by traversing the clickable sub-tree of the box tree in depth-first fashion, possibly with fancy directionality features. To start, Tab can move forward through the pre-order walk, and Shift + Tab can move backward through it. Done!
And while that’s an excellent start—and indeed works well for many simple cases—it will soon become insufficient.
The reason why it becomes insufficient is most clear when considering the case of a windowed list, which I’ve covered before. As a quick refresher, a windowed list produces a box tree, and stores the associated state, only for visible portions of the list, given a scroll position and a visible region size. This is not merely an optimization over building the entire list (although it can help performance, of course), it is sometimes necessary—with this technique, lists can be arbitrarily large, which may be required given underlying data used to produce the list.
Note: Have you ever tried typing some_pointer,1000000000 into the Visual Studio Debugger Watch Window and hitting the expander?
The windowed list’s builder code can do this easily because of the immediate mode nature of the core’s box building API—the builder codepath runs repeatedly, and so it can act like a rendering codepath (a fancier rendering codepath which supports things like clickability, keyboard navigation, and animation).
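For a quick illustration of that shape of builder code, the windowed portion might be built something like this (scroll_pos, row_height, view_height, and item_count are made-up names standing in for whatever state your list actually stores):
// compute the visible index range from the scroll position and view size
int first_visible = (int)(scroll_pos / row_height);
int visible_count = (int)(view_height / row_height) + 2; // +2 covers partially-visible rows at both ends
for(int idx = first_visible; idx < first_visible + visible_count && idx < item_count; idx += 1)
{
  UI_Box *row_box = UI_BoxMake(...); // only visible items get boxes (and keys) this frame
}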
But in this case, you might have noticed a small problem.
If the user navigates with their keyboard to an item in this list, then scrolls away from that item, the keyboard navigation state will be storing a key which no longer corresponds to a live box. This causes problems:
In the above clip, I scroll away from my keyboard selection, then hit an arrow key to navigate again. The usual behavior for this case is to perform the navigation, and snap the new post-navigation selection into view.
But if the builder code for the above list only produces visible boxes (which, in this case, it does—I wrote it), how can it possibly know anything about the box key to which I navigate while the associated box is invisible? And furthermore, even if it knew, how would it know where the view would need to snap?
You may try to devise many solutions to this problem in order to keep a box’s key as the underlying type for keyboard selection, but these become less and less feasible when considering that the above case is indistinguishable from other cases where the desired behavior is entirely different. For example, say a box which the user has selected with their keyboard selection disappears (because they clicked a checkbox somewhere else which hides it, for instance). In that case, our user interface code shouldn’t keep their keyboard selection state—it needs to instead adapt to this situation and mutate their keyboard selection state to instead select something else that’s useful.
The user interface core code—which needs to be involved in keyboard navigation—doesn’t “know” why a box has disappeared—that is squarely decided by the builder code. All it “knows” is that a box has disappeared.
Furthermore, boxes can disappear in unpredictable ways. Imagine you’ve built a list of buttons for every entity in your game world, and you remove one of the entities. The related button will disappear without any obvious change taking place in the builder code.
Builder codepaths may require specific rules for keeping keyboard selection robust across change. If the user has selected a specific row in a hierarchical table view, but a parent of that row is subsequently collapsed, then there are a number of options. In some cases, it may be reasonable to treat the user’s keyboard selection state as a grid position—meaning there is no need to adjust the keyboard selection state; it can simply remain at the same grid cell. In other cases, it may instead be reasonable to move the selection to the first still-open ancestor of the row which has disappeared:
My conclusion is this: The information a box key is able to communicate is not sufficiently rich to reliably reconstruct keyboard selection state across time and through change. So, the solution is that the space of keyboard selection states must be different than the space of box keys. Or, a proper strategy must concede that the two spaces must be different in some cases.
At the limit, a working strategy can simply put the burden of navigation and marking up which box has the keyboard selection onto the builder code entirely. With this view, the core does not need to be involved at all, perhaps other than providing a mechanism for the builder code to specify which box has keyboard selection, and adjusting interaction codepaths accordingly (e.g. allowing the user pressing Enter, given keyboard selection, to mimic a click).
That does mean, however, that no user interface produced by builder code has support for keyboard navigation until the builder code participates. This—in my opinion—virtually guarantees that there will be a lack of complete coverage. Some interfaces will simply not have keyboard navigation implemented, and so they’ll be somewhat less usable.
My approach is to strike a balance. First, I want most interfaces—ones without many changes, without fancy stability rules, and without windowing—to “just work” with a simple filtered depth-first traversal ordering. Second, I want the complex but few-in-number interfaces—the important ones, which have special-case needs—to be able to take full control, and at the limit, simply manage the keyboard selection state themselves.
Importantly, in those more complex interfaces with more sophisticated keyboard navigation requirements, the keyboard navigation state space will obviously derive not from the box hierarchy, but from some other state controlled by the builder codepath. In the hierarchical grid example I showed above, this keying problem must be solved by the builder codepath regardless, in order to do other important tasks, like managing expansion state. So, in that case, it’s quite natural to reuse the keys for that state as the keys which identify keyboard selection.
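For illustration only (these names are invented for this post), such builder-owned selection state for a hierarchical table might look something like the following, reusing the same key type that already identifies rows for expansion state:
typedef struct TableSelection TableSelection;
struct TableSelection
{
  UI_Key row_key;    // key of the selected row's underlying data, for "follow the row" policies
  int    row_idx;    // visual row position, for "stay at the same grid cell" policies
  int    column_idx; // which cell within the row is selected
};
When the selected row disappears, the builder code can then decide, using its own data, whether to stay at the same grid cell, move to the first still-open ancestor, or clear the selection entirely.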
I’ll get into more specifics about this approach, but first, let’s take a look at another one of the problem’s wrinkles.
Levels Of Analysis
I’ve written about memory allocation before, but other resources are also limited (and thus allocated and deallocated). With respect to this post, the relevant limited resources are keyboard keys and gamepad buttons.
Within a single application, it’s common that many interfaces within that application would like to allocate keyboard keys differently, despite those interfaces coexisting. This might simply be because there are not enough keys on a keyboard (or buttons on a gamepad) for all of the interfaces which may coexist. But it also might be because specific keys or buttons have specific meanings. If there is a way to move an object on a two-dimensional plane, then the first place the user will go to control that object will be either W, A, S, and D, or the arrow keys. So, if one interface chooses some other arbitrary set of keys to control directional movement in order to avoid conflicts with other interfaces, it can cause confusion or friction.
I don’t want to ignore the simpler case—in fact, more generic approaches to this problem in simpler cases are actually higher friction when they’re unnecessary. For instance, once upon a time I wrote a simple, local-only, one-versus-one fighting game with a friend for a hackathon:
In this case, my friend and I understood our problem well, and we could simply use both W, A, S, and D, and I, J, K, and L for the two players’ directional movement. A user interface could work similarly, if it were sufficiently constrained.
But the cost of developing a simple, special-case solution grows dramatically higher than that of a generic one (which has a plan for an arbitrary number of to-be-designed, coexisting interfaces) as the number and complexity of coexisting interfaces increase, and especially if the user has the ability to control the composition of those interfaces in any way. And even if the cost were not a problem, the benefit is dramatically reduced as well, because the user’s ability to quickly understand a new interface in the application is diminished.
My more generalized approach to this problem is to ensure that each interface’s builder code has the ability to either integrate with a pre-existing keyboard navigation strategy, or be completely sandboxed in its keyboard navigation, and thus be without influence from other interfaces. Various interfaces will want different things.
For instance, consider the trivial case—a button. In this case, the button’s code has no reason to enforce a keyboard navigation strategy—it’s either selected, ultimately, or it isn’t.
But take a more complex widget—a color picker. In this case, the color picker’s builder code might have stronger opinions about what directional navigation means.
Color pickers and buttons, furthermore, may integrate into the same overarching interface. Let’s say it’s a list—in that case, directional controls used by the color picker (to move the value and saturation selection, for instance) are directly in conflict with the list’s directional controls for navigating through the list.
The color picker, in this case, simply requires a different “scope” of keyboard navigation. The user should be able to navigate through the list, and then enter into the color picker’s “scope” when they desire. When they’re done interacting with the color picker, they should be able to exit, and return to the list. When they exit, their state in navigating through the list would ideally be returned to where it was.
You might realize that, at this point, I’m just describing a stack-like structure. I’ll call it the “interface stack”.
There are other examples, as well. A user navigating through the list may click a button, which opens a context menu, one option on which opens a modal. Each level in this stack ought to have exclusive control over its allocation of keyboard keys and gamepad buttons, and each level in this stack can be popped, restoring the state which it replaced.
This stack-like structure becomes useful in defining how various interfaces can “click together”, despite having differing designs for keyboard navigation. It’s in this structure that the common case I spoke about earlier—in which simple box hierarchy filtered depth-first traversal can be used to automate keyboard navigation—may cleanly integrate with the less common case—in which the builder codepath has strong opinions about special-case rules in how user keyboard navigation works.
This is the basic structure I use in my approach, which—as I said earlier—tries to strike a balance between these two cases.
Sketching Out A Solution
To be a bit more concrete about my approach, I’ll make explicit the distinction between the interface stack and keyboard selection state.
The interface stack is the data structure which implements the “navigation scope” idea I presented in the last section. Each value in this stack is a reference to some identifier or key for one interface which has its own ideas about how keyboard navigation works. The interface referred to by the top of the stack has the exclusive right to define keyboard navigation rules.
The keyboard selection state is defined by the builder codepath for each of these interfaces. Core code may expose one mechanism for storing keyboard selection state—as is the case with the default, common case, which may use box keys for keyboard navigation state—but an interface’s builder code has the choice to use its own state also.
For the interface stack, I have found that box keys work acceptably well. While box keys are unreliable in—for instance—windowed lists, they are reliable for less leaf-like, less granular boxes in the box hierarchy. These are boxes like—for example—those for a window or a panel. Such boxes are not transient, and they usually correspond directly to an interface instantiation entity, for which the user manages state directly. Even if a box key for a transient interface (e.g. a complex interface which exists within a windowed list) is used, this remains robust—the interface will work as expected when it’s built, and nothing will happen when it isn’t. In such a case, however, the builder codepath for the windowed list may need to take special care to scroll to the selected interface when keyboard interaction occurs.
The interface stack must be explicitly mutated by builder code in response to various events. For instance, when a user hits Enter with their keyboard selection targeted at a color picker (without that color picker being present within the interface stack), then the response code would push the key of the color picker onto the interface stack. The builder code may also choose to provide universal mechanisms—hitting Escape, for example—to pop keys off the top of the interface stack. So, the resulting API is straightforward:
UI_Key UI_PushNavInterfaceKey(UI_Key interface_key);
UI_Key UI_PopNavInterfaceKey(void);
UI_Key UI_TopNavInterfaceKey(void);
The above API can be used as follows in builder codepaths:
// build interface root box
UI_Box *complex_interface_box = UI_BoxMake(...);
// get input signal & respond accordingly
UI_Signal sig = UI_SignalFromBox(complex_interface_box);
if(sig.double_clicked) // double click -> push onto interface stack
{
UI_PushNavInterfaceKey(complex_interface_box->key);
}
// build contents
UI_Parent(complex_interface_box)
{
// complex interface goes in here!
}
Top-level builder code can implement an Escape-to-exit rule:
if(OS_KeyPress(UI_Events(), UI_Window(), OS_Key_Esc, 0))
{
UI_PopNavInterfaceKey();
}
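For completeness, here is one minimal way the push/pop/top functions might be backed: a fixed-size array of keys in the core's persistent state. The capacity, the global storage, and the use of a zeroed key to mean "empty" are all arbitrary choices made for this sketch.
#define UI_NAV_INTERFACE_STACK_CAP 64
static UI_Key ui_nav_interface_stack[UI_NAV_INTERFACE_STACK_CAP];
static int ui_nav_interface_stack_count = 0;
UI_Key
UI_PushNavInterfaceKey(UI_Key interface_key)
{
  if(ui_nav_interface_stack_count < UI_NAV_INTERFACE_STACK_CAP)
  {
    ui_nav_interface_stack[ui_nav_interface_stack_count] = interface_key;
    ui_nav_interface_stack_count += 1;
  }
  return interface_key;
}
UI_Key
UI_PopNavInterfaceKey(void)
{
  UI_Key result = {0}; // zero key means "nothing was popped"
  if(ui_nav_interface_stack_count > 0)
  {
    ui_nav_interface_stack_count -= 1;
    result = ui_nav_interface_stack[ui_nav_interface_stack_count];
  }
  return result;
}
UI_Key
UI_TopNavInterfaceKey(void)
{
  UI_Key result = {0}; // zero key means "no interface has taken navigation scope"
  if(ui_nav_interface_stack_count > 0)
  {
    result = ui_nav_interface_stack[ui_nav_interface_stack_count - 1];
  }
  return result;
}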
Keyboard selection itself—be it determined by a default, generic navigation rule, or a custom, special-case solution—can simply be a feature flag which is attached to boxes: UI_BoxFlag_KeyboardFocused. At the limit, the interface builder codepath can do nothing with regards to navigation, and set this flag on widget codepaths for which it’d like the keyboard control code to execute.
To implement a custom navigation rule, with a custom keyboard selection state type, the builder codepath becomes responsible for maintaining that state, mutating it according to user input, correlating it with the UI_Boxes it produces, and attaching UI_BoxFlag_KeyboardFocused to those boxes.
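Sketched out with invented names (selected_idx, item_count, and BuildItemList are not from any real codebase, OS_Key_Up and OS_Key_Down are assumed to exist analogously to the OS_Key_Esc used earlier, and setting the flag directly on the box's flags field is just one possible way to attach it), that might look like:
static int selected_idx = 0; // builder-owned keyboard selection state: an index into the backing data
void
BuildItemList(int item_count)
{
  // custom navigation rule: arrow keys move the index, clamped to the data's bounds
  // (in a full version, only run this when this interface's key is on top of the interface stack)
  if(OS_KeyPress(UI_Events(), UI_Window(), OS_Key_Down, 0)) { selected_idx += 1; }
  if(OS_KeyPress(UI_Events(), UI_Window(), OS_Key_Up, 0))   { selected_idx -= 1; }
  if(selected_idx > item_count - 1) { selected_idx = item_count - 1; }
  if(selected_idx < 0)              { selected_idx = 0; }
  for(int idx = 0; idx < item_count; idx += 1)
  {
    UI_Box *item_box = UI_BoxMake(...); // build each item as usual
    if(idx == selected_idx)
    {
      item_box->flags |= UI_BoxFlag_KeyboardFocused; // correlate the builder's state with the box it produced
    }
  }
}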
For keyboard selection state’s default case—using box keys as the type, and implementing a generic depth-first traversal rule for navigation—the state can simply be stored per-UI_Box, and used if needed:
struct UI_Box
{
// ...
UI_Key kb_select_key;
// ...
};
If the core knows that the default, generic rule is being used (however that may be—it may be explicit, or perhaps doing otherwise requires an explicit opt-out by a builder codepath), then it can check this, and automatically attach UI_BoxFlag_KeyboardFocused to selected widgets.
Whatever navigation rule is used, it should only run when its interface’s key is on the top of the interface stack. However, note that some other navigation-related codepaths may still need to execute—visualization of the keyboard selection, for instance—even when the associated interface is not selected by the interface stack.
Many of the default behaviors users expect with keyboard navigation—for example, navigating to a box, and requiring it to be within view (if it was previously out-of-view)—may also be implemented in the core for the default navigation rule.
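Tying several of these pieces together, the core's default navigation helper for some navigation-root box might look roughly like the following. This is only a sketch of the shape of the thing: it leans on the UI_BoxNextNavigable walk sketched earlier, it assumes a UI_BoxFromKey lookup into the key-to-box cache from earlier posts and a UI_KeyMatch helper for comparing keys, and it sets the focus flag directly on the box; none of these details are guaranteed to match your implementation.
void
UI_DefaultNavForRoot(UI_Box *root)
{
  UI_Box *selected = UI_BoxFromKey(root->kb_select_key);
  // only mutate the selection when this interface owns the top of the interface stack
  if(UI_KeyMatch(UI_TopNavInterfaceKey(), root->key) &&
     OS_KeyPress(UI_Events(), UI_Window(), OS_Key_Tab, 0))
  {
    UI_Box *next = UI_BoxNextNavigable(selected != 0 ? selected : root, root);
    if(next != 0)
    {
      root->kb_select_key = next->key; // store the new selection on the navigation root
      selected = next;
      // this is also where a "snap the new selection into view" request could be issued
    }
  }
  // visualization of the selection can still run even when another interface has navigation scope
  if(selected != 0)
  {
    selected->flags |= UI_BoxFlag_KeyboardFocused;
  }
}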
Closing Thoughts
I’ve left many implementation details up to readers, mainly because I’m not sure if they’d be helpful for me to provide, or if they’d lead people astray (because they are merely an accident of how I ended up implementing my solution).
For example—how do builder codepaths mark up an interface subtree as requiring the default navigation rule? There are a number of options—my most recent implementation simply exposed the default navigation rule as a helper from the core. If I want the default navigation rule within a builder codepath, I call UI_DefaultNav. Otherwise, I don’t.
As I described earlier, there are also a number of ways that the default navigation rule may be extended to factor in important details about an interface—directionality, spatial location, size, and so on. What exactly the builder codepath provides to help, and what details the navigation rule pulls out of a generic tree in order to get as close as possible to the ideal in a variety of scenarios, are open questions—I encourage readers to try a variety of ideas out in their own implementation.
With that, I hope this provided some helpful details and mental models for tackling this problem. That’ll be it from me for now. Good luck!
If you enjoyed this post, please consider subscribing. Thanks for reading.
-Ryan