Being an engine programmer usually means being a bit of a jack of all trades. There’s always something weird going on and you have to be pretty familiar with a bunch of low level details that come in handy in unexpected ways. Recently I went down a somewhat unexpected rabbit hole where those skills came in extremely hand. In an effort to blog more and also because it seems like I was the first to run into this issue, I figured I should sit down and just write about it so future people can benefit from it.
If you are working with Vulkan, chances are that at some point you’ll run into a VK_ERROR_DEVICE_LOST error. It’s the worst kind of error, chances are that your GPU choked on some data sometime about a frame or two ago and the position where you received the error is nowhere near where the GPU actually decided to throw up its hands and give up. This is of course because GPUs and CPUs are inherently decoupled from each other, and when you submit your work from the CPU to the GPU, the GPU will start crunching your numbers while in the meantime the CPU goes on with its busy life doing other things. Now, maybe while crunching your numbers the GPU encountered a page fault or did some other computation where it just couldn’t proceed anymore afterwards. Or maybe it took too long to compute results and the kernel watchdog decided that enough was enough and restarted the device. Or something completely different! Either way, at some point this error will have propagated all the way back to your application in userspace and you’ll have no idea why it happened. And are only left to guess about what went wrong and where. Well, Nvidia has your back with the extremely verbosely named VK_NV_device_diagnostic_checkpoints extension, which is a lot like Nvidias Aftermath for DirectX, except it works on Vulkan. And because for some reason nobody on the internet seems to sing its praise, I will do so now in the form of this blog post!
Let’s say that you occasionally look at your application with NSight, VTune, or any other profiling tool of choice. Naturally you want to add debug markers into your application, but you might not necessarily ship with them or have them run at all when no profiling tool is attached. You could put them behind a command line flag, but I prefer automatic discovery: One build, when run with NSight, having all the debug markers I need to dissect a frame, and when run without NSight not doing any of that overhead. With X-Plane, there is another layer to that, and that is that specifically for NSight some features have to be disabled since they are making unsupported GL calls.
11 months ago was the last time I touched firedrake, and last weekend the urge to mess with it caught me again. So I set up WSL, installed all necessary dependencies and opened firedrake. I fired up the last compiled version I had, just to remind myself of where I had left things, and I QEMU was happy to dump my debug printf()’s via the virtual UART into stdout. All was good. Then I compiled firedrake from scratch and it stopped working, or rather, it stopped producing output via the UART. That’s strange, I thought, messed with a couple of things and also stashed all my git changes but no avail; No more output via the UART.
X-Plane’s plugin systems allows authors to load models in two ways: Asynchronously and Synchronously. Most plugins tend to use synchronous loading, but since all plugins run on the main thread there is a need and desire for some to use asynchronous loading. Up until now, the latter was broken however, and plugin authors complained about invisible models. From what I understand Pilot Edge were the first to complain, but they want to use async loading to dynamically load in models of planes that flew in via multiplayer, which makes it non-trivial to debug. Luckily, the author of the fantastic Better Pushback also ran into the issue and his plugin is completely local, so it’s no problem to stop at a debugger for any amount of time without being worried a multiplayer socket closes and the whole setup has to be recreated.