Multithreaded OpenGL
Starting in the middle of the 10.4 development cycle, a new option was added to OpenGL to allow it to run multithreaded. This has two main effects: it increases the overhead on the CPU for submitting OpenGL commands, but it also relieves the performance burden on your app for processing those commands. The multithreaded engine ends up being a net win if you spend less time submitting commands than the GPU does executing them, so it should not be enabled unless you know this is the case for your app.
One side-effect of this is a performance oddity if your OpenGL application is heavily GPU (or fillrate) bound. In this case, you spend a very small amount of time submitting commands to the GPU, while the GPU spends a very large amount of time processing the commands. In essence, you can end up sending 30+ frames to the GPU before the engine stalls and processes the frames. The most common public case of this is World of Warcraft: if you visit certain areas and spin around (looking at a wall, for example), the engine can very quickly send 32 "look at the wall" frames and then spend a long time waiting for the GPU to draw them (if those simple frames do a lot of alpha-blending, for example). What you experience in the game is what's known as "UI lag" - WoW stutters a bit as you spin around, and your keystrokes and mouse movements start to lag behind what you see on the screen because the main CPU thread is already 32 frames ahead of what you are looking at.
One of the first things I was tasked with when I started at Apple last year was solving this issue for Leopard. From a practical standpoint, there's no benefit to letting the GL's command submission get more than a frame ahead of the GPU. Before, we had a cut-off of roughly 32 frames in the command buffer, and when the 32nd frame was submitted, the main thread waited for the GPU to finish before it started accepting more commands. Starting with Leopard, this number was reduced to 1. By default, the main thread can now queue up one frame for the GPU to work on and start building a second. When the second frame is ready, if the first frame is not completed, the GL waits. This entirely avoids the issue of UI lag in WoW and has zero impact on the maximum framerate that apps can achieve - you can't run faster than the GPU can draw!
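The effect of the queue depth can be sketched with a toy timing model. This is only an illustration under simplifying assumptions (a fixed CPU cost to build each frame, a fixed GPU cost to draw it, and input sampled when the CPU starts building a frame); it is not how the GL is actually implemented, and the names simulate_queue and SimResult are made up for this example.

```c
#include <assert.h>

#define MAX_FRAMES 1024

typedef struct {
    long last_frame_latency_ms; /* from "CPU starts building" to "GPU finished" */
    long total_time_ms;         /* time at which the last frame finished drawing */
} SimResult;

static long max_long(long a, long b) { return a > b ? a : b; }

/* Hypothetical model: the CPU may run at most max_in_flight frames ahead.
 * Submitting frame i blocks until frame (i - max_in_flight) has completed. */
SimResult simulate_queue(int frames, int max_in_flight, long cpu_ms, long gpu_ms)
{
    long complete[MAX_FRAMES];  /* completion time of each frame on the GPU */
    long build_start = 0;       /* when the CPU starts building the current frame */
    SimResult r = {0, 0};

    assert(frames >= 1 && frames <= MAX_FRAMES);
    assert(max_in_flight >= 1); /* a depth of 0 is not covered by this model */

    for (int i = 0; i < frames; i++) {
        long ready = build_start + cpu_ms;  /* frame is ready to submit */
        long submit = ready;
        if (i >= max_in_flight) {
            /* Queue is full: wait for an older frame to finish first. */
            submit = max_long(ready, complete[i - max_in_flight]);
        }
        /* The GPU draws frames in order, one at a time. */
        long gpu_start = (i > 0) ? max_long(submit, complete[i - 1]) : submit;
        complete[i] = gpu_start + gpu_ms;

        r.last_frame_latency_ms = complete[i] - build_start;
        r.total_time_ms = complete[i];
        build_start = submit;               /* CPU moves on to the next frame */
    }
    return r;
}
```

With invented numbers of 1 ms of CPU work and 10 ms of GPU work per frame, simulate_queue(100, 32, 1, 10) and simulate_queue(100, 1, 1, 10) both finish the last frame at 1001 ms, so throughput is identical. The latency of the last frame, however, is 330 ms with a depth of 32 and 20 ms with a depth of 1, which is the lag described above.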
To that end, we added a new parameter that can be used with CGLSetParameter/CGLGetParameter: kCGLCPMPSwapsInFlight. You can set this parameter to indicate how many frames the CPU should queue up for the GPU to process. If you want the old 10.4 behavior, set it to 32. The default value for Leopard is 1. Setting it to zero means that the GL will stop at the end of each frame and wait for the GPU to finish before proceeding. You might think this makes it the same as disabling the multithreaded engine, but it doesn't. Because there's an increased overhead in the multithreaded engine with the producer (CPU) and consumer (GPU) threads, it's effectively slower than disabling the multithreaded engine entirely. The general thinking is that you'll never need to alter this parameter. If you do, I'd be interested in knowing why and how it helps so we can better understand your needs.
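As a minimal sketch, setting the parameter on the current context might look something like this. The helper name set_swaps_in_flight and its 0 / -1 return convention are invented for illustration; only CGLGetCurrentContext, CGLSetParameter and kCGLCPMPSwapsInFlight come from CGL. The #ifdef is there only so the file still compiles on non-Apple systems, where the helper does nothing and reports failure.

```c
#include <assert.h>
#include <stddef.h>
#ifdef __APPLE__
#include <OpenGL/OpenGL.h>
#endif

/* Hypothetical helper: ask the multithreaded engine to let the CPU queue
 * up to `frames` frames ahead of the GPU on the current CGL context.
 * Returns 0 on success, -1 if there is no current context, the call
 * fails, or the platform has no CGL. */
int set_swaps_in_flight(int frames)
{
    assert(frames >= 0);
#ifdef __APPLE__
    CGLContextObj ctx = CGLGetCurrentContext();
    if (ctx == NULL) {
        return -1;              /* no GL context is current on this thread */
    }
    GLint value = (GLint)frames;
    if (CGLSetParameter(ctx, kCGLCPMPSwapsInFlight, &value) != kCGLNoError) {
        return -1;
    }
    return 0;
#else
    (void)frames;               /* CGL is not available on this platform */
    return -1;
#endif
}
```

Calling set_swaps_in_flight(32) after making your context current would request the old 10.4 queue depth; set_swaps_in_flight(1) matches the Leopard default.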
There are other articles on developer.apple.com that enumerate best practices when dealing with the multithreaded engine. Be sure to read them to make the best use of it. Here's a hint: if you just turn on the MT engine, don't be surprised to see your framerate go down. This technote is a good place to start.
Comments
Awesome informative post
Posted by: Dave Anderson | April 11, 2008 02:28 PM
Thanks for the info and the link to the technote. This was very useful.
Posted by: Darth Ed | April 22, 2008 10:43 AM
Hi Brad - great post.
WoW actually has three modes in how it uses MT-GL, known as GLFaster 0, 1, or 2.
In mode 0 MT is off.
In mode 1, MT is on, but we call glFlush once per frame to keep the command queue down to a reasonable depth of work.
In mode 2, we dispense with the glFlush and let the queue grow as deep as the OS will allow, which is the situation where the rubber-banding lag described can happen.
The default setting for WoW on MT-GL capable systems is "1" so this behavior should not happen on Tiger unless the user has manually set GLFaster to 2 (which some do, to get a little more speed).
We tried to make this fully automatic using fences to track the depth of work in the queue, but in Tiger the act of testing a fence also drains the queue, yielding a trivial answer of "yes, there's no more work pending, and this answer took a very long time to obtain". I think the behavior of fences was also fixed in Leopard.
Posted by: Rob | April 25, 2008 08:51 AM