Couldn’t Sleep; Wrote a Thing

Unlike many programmers, I actually prefer to use one monitor these days. A lot of my colleagues have Visual Studio filling the entirety of their primary monitor and either Outlook or documentation filling the other, often with a bunch of other apps like Lync, a command prompt or two, and likely a music player.

When I moved to Microsoft HQ about 5 months ago I found that my new PC was lacking a second HDMI output and I didn’t have a DVI-VGA converter for the second twenty-something inch monitor I’d been given. Weirdly, once the part arrived a week or so later, I went to plug it in and… found that I didn’t want to.

A single monitor really cut down the distractions. I get a lot of mail – a lot of mail – and the constant flood of it in the corner of my eye was becoming a problem. (I have to add that turning off the task-bar’s “New Mail” icon in Outlook also helped a lot.)

Since I cut down to a single screen, my ability to focus has increased dramatically and that my primary monitor is easily big enough for code and documentation if the need arises. I actually prefer having the documentation right there, next to my code, rather than having to flick my eyes over to another monitor.

When I need to “dual wield”, I use the little known Windows key + Arrow keys hotkeys. If you’re unfamiliar with these, give them a try. Windows+Left and Windows+Right are by far the most useful, as they will:

  • snap the window to the left or right half of the screen,
  • flip the window onto the next monitor if you have one.

Windows+Up will maximize, whereas Windows+Down will restore or minimize depending on the window’s current state.

After about 5 months of using them, these shortcuts are now indispensable. However, they weren’t quite useful enough. A common scenario is that I just want a little more than half the screen for code, as documentation pages tend to deal well with smaller browser sizes. Or, sometimes I have a command prompt pinned to the right, but I have to fiddle about with the mouse to fill the gap because command windows won’t generally expand horizontally.

Another problem was stacking windows above each other. In an ideal world, VS would occupy 60% of the left side of my monitor, and on the remaining right side, I’d have documentation taking 70% of the top right corner and the command prompt in the bottom right 30%. With the vanilla Windows hotkeys, this is impossible with keyboard alone.

And thus, my answer to this was to write code, culminating in this: Advanced Window HotKeys. This is a tiny app that will run in the background, and using the Alt+Arrow keys (among others) will net you some really fine-grained control over your window layout. I went one step further and the tool will also help you preserve your current layout but still allow you to resize the foreground window, moving other windows out of the way. Check out the page for more information.

screenshot

I wrote this in VS2012, C++ and Win32. Win32 really is a very powerful API. This tool makes extensive use of EnumWindows and MoveWindow to expand or contract the foreground window; search for adjacent windows to snap to; modify adjacent windows to prevent overlap;and other stuff. I threw in a little Control Panel, although that’s still very primitive.

I haven’t tested this on anything other than Win8, so if anyone wants to volunteer to see how this’ll work on Win7 or below then please comment or email me. I’d really appreciate feedback, so you can do that using the email link at the bottom of the AWHK website.

Tile-Based Forward Rendering: Update

I’ve been meaning to write about this for a while now. I added a few more optimisations and extensions to the Tile-Based Forward Rendering demo I wrote back in March. I’ve inlined the updates in the actual main article, but I’ll outline the changes here for those who have already read it.

The first significant change is that I’ve switched to a Compute Shader for the list generation. I used a line-sphere test in the Compute Shader to determine the maximum and minimum depth values at each corner of each tile. I added lights to either the opaque or transparent buffers based on whether the light intersected this frustum section.

The speed increase by using the CS is easily by an order of magnitude. Beforehand I was primitive assembly bound due to the number of spheres that I had to render, whereas running a compute shader (clipped to the bounds of the light in screen space) was much more efficient. Using the line test also meant I didn’t have to grow my spheres by a tile, which previously meant I was getting conservative light addition in some cases.

The second change follows on from this. Because most of the scene is opaque geometry, I optimise this case by building a separate light buffer for opaque surfaces. This means my light linked-list generation shader generates two separate lists for opaque-affecting lights and transparent-affecting lights. Because transparent objects could have any depth less than the value in the depth buffer, all lights that pass the depth test need to be recorded. However, for opaque objects, we can reject lights that we know don’t intersect anything.

For each tile I accumulate a minimum and maximum depth value. If a light doesn’t intersect [0, MaxZ] then it’s culled completely. If it does, it gets added to the TransparentLightBuffer. If it also falls between [MinZ, MaxZ], I add it to an OpaqueLightBuffer. Obviously this breaks down if a tile has a large depth range, but it’s still a significant saving in the general case.

Finally, because we want the opaque materials to be as fast as possible, I serialise the lighted linked lists for each tile into a flat array using another CS. This has the obvious drawback of having to allocate a fixed maximum number of lights per tile, but it nets a huge speed win.

Tile-Based Forward Rendering

Update: I’ve made some inline revisions, but you can see a summary of my changes here.

There’s been a bit of buzz about AMD’s Leo Demo recently, which uses a tile-based culling technique to generate a per-tile list of lights that might affect each tile. According to the website,

this demo uses DirectCompute to cull and manage lights in a scene. The end result is a per-pixel or per-tile list of lights that forward-render based shaders use for lighting each pixel.

I thought I’d give this a go, and while my approach differs somewhat in the technicalities to AMD’s implementation, the benefits are theoretically the same. For those who haven’t discovered this algorithm yet, a quick summary of the advantages are:

  • You don’t have to do a geometry pass over your scene for each dynamic light, like in traditional forward renderers,
  • You can still use MSAA, unlike most deferred renderers,
  • Transparency can use the same lighting codepath, unlike deferred renderers,
  • You can use whatever BRDF you want without the need to squeeze data into a G-buffer and littering your lighting shader with if/elses, unlike deferred renderers, or doing another geometry pass like in light prepass renderers.
  • It potentially decreases memory footprint over a fat G-buffer.

Aras P explains in more detail, so instead I’ll document my direct findings here, as well as my thoughts on solutions to some of the problems this algorithm might cause. I assume readers have some familiarity with the algorithm.

I’ve uploaded some screenshots to my webzone.

I had no paper to work off of, so my implementation is basically made up based on the few hints from the quoted paragraph above from AMD. Unlike the Leo demo, I didn’t use DirectCompute at all.

Depth Prepass

To start with, I do a depth prepass over the scene. There’s nothing I do that’s unusual about this step; it’s just a boring old Zpass. However, then I down-sample it so that I have one conservative depth value per tile. By default my application accumulates a linked list of lights in 8×8 pixel tiles, so I have a shader that finds the maximum depth of all the pixels in that tile and writes it to a smaller depth buffer.

Conservative depth

Conservative depth

If you’re using 1×1 pixel tiles without MSAA you can skip the depth down-sample stage. If you are using MSAA you will always need to do this step even if you have 1×1 tiles, as you’ll need to resolve your MSAA depth buffer to a 1x depth buffer for the next step. Don’t forget to read each depth sample in each pixel of your MSAA depth buffer.

Ensure you’re progressively down-sampling. The reason for this is subject of a whole post in itself, but essentially it’s to reduce the number of pixels in flight at any one time.

My approach is this:

  1. Resolve my MSAA depth buffer to a full-res non-MSAA depth buffer
  2. Down-sample and transpose the output of (1) to a H x (W/T) buffer (where W and H are the full-res buffer’s dimensions, and T is the tile size)
  3. Down-sample and transpose the output of (2) to the final (W/T)x(H/T) depth buffer.

Make sure you unroll your HLSL loops and pass the tile size data in via compile-time macros.

Building the Per-Tile Linked Lists

In order to accumulate what lights might affect each tile I render a volume for each light, just like you would with a deferred renderer. Instead of shading, I merely add the light’s ID to a linked list for each tile. I use the down-sampled depth buffer from the previous step to quickly reject pixels that definitely have no lights affecting them.

Where with a deferred renderer you might use Carmack’s Reverse, in this case that won’t work. Rejecting pixels that pass the Z-test means you’re rejecting pixels that might include a transparent object. You won’t have information about transparency in your down-sampled depth buffer, so you have to be conservative here.

I still do stencil-based culling; I only discard the pixels that fail the Z-test. The stencil buffer still comes in handy in the case that the camera intersects the volume, but in all other cases the depth buffer would have probably sufficed.

The stencil mode I used is:

// Depth test parameters
dsDesc.DepthEnable = true;
dsDesc.DepthWriteMask = D3D11_DEPTH_WRITE_MASK_ZERO;
dsDesc.DepthFunc = D3D11_COMPARISON_LESS_EQUAL;

// Stencil test parameters
dsDesc.StencilEnable = true;
dsDesc.StencilReadMask = D3D11_DEFAULT_STENCIL_READ_MASK;
dsDesc.StencilWriteMask = D3D11_DEFAULT_STENCIL_WRITE_MASK;

// Stencil operations if pixel is front-facing
dsDesc.FrontFace.StencilFailOp = D3D11_STENCIL_OP_KEEP;
dsDesc.FrontFace.StencilDepthFailOp = D3D11_STENCIL_OP_DECR;
dsDesc.FrontFace.StencilPassOp = D3D11_STENCIL_OP_KEEP;
dsDesc.FrontFace.StencilFunc = D3D11_COMPARISON_ALWAYS;

// Stencil operations if pixel is back-facing
dsDesc.BackFace.StencilFailOp = D3D11_STENCIL_OP_INCR;
dsDesc.BackFace.StencilDepthFailOp = D3D11_STENCIL_OP_INCR;
dsDesc.BackFace.StencilPassOp = D3D11_STENCIL_OP_INCR;
dsDesc.BackFace.StencilFunc = D3D11_COMPARISON_ALWAYS;

In short, I always allow back faces to increment the buffer, but front faces will only decrement when their fragments fail the depth test. I use no back-face culling, so this prevents any overdraw in my light volume pixel shader.

Update: Since writing this I’ve switched to a Compute Shader for the list generation. I used a line-sphere test in the Compute Shader to determine the maximum and minimum depth values at each corner of each tile. I added lights to either the opaque or transparent buffers based on whether the light intersected this frustum section.

The speed increase by using the CS is easily by an order of magnitude. Beforehand I was primitive assembly bound due to the number of spheres that I had to render, whereas running a compute shader (clipped to the bounds of the light in screen space) was much more efficient. Using the line test also meant I didn’t have to grow my spheres by a tile, which previously meant I was getting conservative light addition in some cases.

Tiles being affected by lights

Tiles being affected by lights

On the right is a render of all pixels that could potentially be shaded by lights. You can see the light volume in the floor will only shade inside the corridor.

The linked list generation is based on AMD’s Order Independent Transparency presentation. I have two Unordered Access Views: one stores the linked list elements (the LinkBuffer), and the other stores the offset into the LinkBuffer that marks the start of the list for that tile (the HeadBuffer).

The LinkBuffer could potentially use unbounded amounts of storage because we don’t know how many lights will overlap (thus, we can’t determine how large each tile’s linked list will be), so when you allocate your LinkBuffer make sure you estimate conservatively. I give mine 16 links for each tile, so I have a buffer of (X/T)*(Y/T)*16 link elements, assuming the back buffer dimensions are (X, Y) and a tile size of T. 16 might be a bit excessive for typical game scenes though.

Because this is also going to be a big buffer, I don’t store any lighting data in it; I only store the ID of the light and the index of the Next link item. In HLSL:

struct LightLink
{
    uint LightID;
    uint Next;
};

The HeadBuffer is a constant size: one uint per tile. I clear each pixel in this buffer to -1 at the start of the linked list generation process.

I draw each light volume with no culling and depth-read, in two passes. The first pass writes to the stencil buffer, so I turn off colour writes, set the magic stencil mode and draw a sphere. The second pass is drawn with the LinkedList generation pixel shader and only executes if the stencil == 1. This pass also resets the stencil to 0 for each pixel that passes.

The volume pixel shader looks like this:

[earlydepthstencil]
void Main(float4 pos : SV_POSITION)
{
    uint x = pos.x; // X coordinate in pixels
    uint y = pos.y; // Y coordinate in pixels

    LightLink link;
    link.LightID = LightID;

    // (1) Increment the link counter and get the current link ID
    uint linkID = LinkBuffer.IncrementCounter();

    // (2)
    // Read any existing linkID in the head buffer
    // Replace it with the new link ID
    HeadBuffer.InterlockedExchange(
        (y * BufferWidth + x) * 4,
        linkID,
        link.Next);

    // Write the link into the link buffer
    LinkBuffer[linkID] = link;
}

In stage (1), we use the UAV’s IncrementCounter to give us the index of the next LightLink element we should write. We exchange this index with the value currently in the HeadBuffer (2), which becomes the new LightLink’s Next index. If the value in HeadBuffer is -1, link.Next will be -1 and that means this element is the end of the linked list.

Hopefully you can see how this works, but if not I highly recommend reading the AMD OIT presentation.

Update: A neat optimisation I’ve recently added is to generate two separate lists for opaque-affecting lights and transparent-affecting lights. Because transparent objects could have any depth less than the value in the depth buffer, all lights that pass the depth test need to be recorded. However, for opaque objects, we can reject lights that we know don’t intersect anything.

For each tile I accumulate a minimum and maximum depth value. If a light doesn’t fall inside [0, MaxZ] then it’s culled completely. If it does, it gets added to the TransparentLightBuffer. If it also falls between [MinZ, MaxZ], I add it to an OpaqueLightBuffer. Obviously this breaks down if a tile has a large depth range, but it’s still a significant saving in the general case.

Finally, because we want the opaque materials to be as fast as possible, I serialise the lighted linked lists for each tile into a flat array using another CS. This has the obvious drawback of having to allocate a fixed maximum number of lights per tile, but it nets a huge speed win.

Rendering the Scene

Once you’ve generated your linked lists, you can bind the LinkBuffer and HeadBuffer UAVs to your pixel shaders that need lighting information. You’ll also need to bind a buffer containing all your lighting data. I do this by storing my point light data in a StructuredBuffer for random access by the pixel shader:

struct PointLight
{
    float3 Position;
    float Radius;
    float3 Colour;
};

StructuredBuffer<PointLight> PointLightBuffer;
StructuredBuffer<LightLink> LinkBuffer;
Buffer<uint> HeadBuffer;

The lighting part is easy. First, index into the HeadBuffer based on what tile this pixel touches using SV_Position:

uint x = vpos.x / BufferScale.x;
uint y = vpos.y / BufferScale.y;
uint head_index = y * BufferWidth + x;

BufferScale and BufferWidth in this case are:

BufferWidth = HeadBufferWidth;
BufferScale = { BackBufferWidth / HeadBufferWidth, BackBufferHeight / HeadBufferHeight };

To iterate over the lights, read the index of the first LightLink from the HeadBuffer:

uint next = HeadBuffer[head_index];

Then iterate until you hit the end of the list:

while (next != 0xFFFFFFFF)
{
    LightLink link = LinkBuffer[next];
    PointLight pointLight = PointLightBuffer[link.LightID];
    lighting += Shade(worldPos, worldNormal, eyeVec, specPower, pointLight);

    next = link.Next;
}

Shadows

Orc

Really need some shadows...

One problem that people have flagged up is shadows: if you wanted to mimic current algorithms you’d have to bind every possible shadow map to your pixelshader, which obviously isn’t viable.

Instead I was considering rendering all the shadow maps into a virtual texture. This wouldn’t be a complicated change, and it would still be possible render all the point light shadows in one pass using a geometry shader and six different viewports. All I’d have to do is upload the UVs in the shadow map that each light uses.

People have mentioned texture arrays, but that means each shadow map would have to be the same resolution, which doesn’t sound like a great solution to me.

A Quick Note on Driver Stability

I’ve been queried about my use of InterlockedExchange in the linked list generation. It’s true that only one fragment should be written to that tile at any one time, but for some reason not doing InterlockedExchange causes my GPU to hang. I’m updating my drivers as I type this to hopefully fix this, as I can’t think of any reason a normal Load/Store wouldn’t suffice instead.

Note that the GPU can and will hang if your linked list is corrupt. It will spin through garbage link values until it sees a -1, and given that you have millions of lists, you’re unlikely to avoid a total GPU meltdown. Windows takes it down most of the time. The times that it doesn’t usually mean a BSOD is heading your way.

Overall

BRDFs

Mixing different BRDFs

The Sponza demo runs at 1080p with 8x MSAA at approximately 14ms on my Radeon HD 5700 with 213 lights.

I’d really like to see this side-by-side with a deferred renderer. I would assume this technique is slower than deferred due to the cache performance of per-pixel linked list traversal, but it does offer significant advantages. I wonder if it might actually be faster than deferred if you’re bandwidth bound on your G-buffer. Either way, it’s still a huge improvement over a multi-pass forward renderer with many dynamic lights.

P.