Tile-Based Forward Rendering

Update: I’ve made some inline revisions, but you can see a summary of my changes here.

There’s been a bit of buzz about AMD’s Leo Demo recently, which uses a tile-based culling technique to generate a per-tile list of lights that might affect each tile. According to the website,

this demo uses DirectCompute to cull and manage lights in a scene. The end result is a per-pixel or per-tile list of lights that forward-render based shaders use for lighting each pixel.

I thought I’d give this a go, and while my approach differs somewhat in the technicalities to AMD’s implementation, the benefits are theoretically the same. For those who haven’t discovered this algorithm yet, a quick summary of the advantages are:

  • You don’t have to do a geometry pass over your scene for each dynamic light, like in traditional forward renderers,
  • You can still use MSAA, unlike most deferred renderers,
  • Transparency can use the same lighting codepath, unlike deferred renderers,
  • You can use whatever BRDF you want without the need to squeeze data into a G-buffer and littering your lighting shader with if/elses, unlike deferred renderers, or doing another geometry pass like in light prepass renderers.
  • It potentially decreases memory footprint over a fat G-buffer.

Aras P explains in more detail, so instead I’ll document my direct findings here, as well as my thoughts on solutions to some of the problems this algorithm might cause. I assume readers have some familiarity with the algorithm.

I’ve uploaded some screenshots to my webzone.

I had no paper to work off of, so my implementation is basically made up based on the few hints from the quoted paragraph above from AMD. Unlike the Leo demo, I didn’t use DirectCompute at all.

Depth Prepass

To start with, I do a depth prepass over the scene. There’s nothing I do that’s unusual about this step; it’s just a boring old Zpass. However, then I down-sample it so that I have one conservative depth value per tile. By default my application accumulates a linked list of lights in 8×8 pixel tiles, so I have a shader that finds the maximum depth of all the pixels in that tile and writes it to a smaller depth buffer.

Conservative depth

Conservative depth

If you’re using 1×1 pixel tiles without MSAA you can skip the depth down-sample stage. If you are using MSAA you will always need to do this step even if you have 1×1 tiles, as you’ll need to resolve your MSAA depth buffer to a 1x depth buffer for the next step. Don’t forget to read each depth sample in each pixel of your MSAA depth buffer.

Ensure you’re progressively down-sampling. The reason for this is subject of a whole post in itself, but essentially it’s to reduce the number of pixels in flight at any one time.

My approach is this:

  1. Resolve my MSAA depth buffer to a full-res non-MSAA depth buffer
  2. Down-sample and transpose the output of (1) to a H x (W/T) buffer (where W and H are the full-res buffer’s dimensions, and T is the tile size)
  3. Down-sample and transpose the output of (2) to the final (W/T)x(H/T) depth buffer.

Make sure you unroll your HLSL loops and pass the tile size data in via compile-time macros.

Building the Per-Tile Linked Lists

In order to accumulate what lights might affect each tile I render a volume for each light, just like you would with a deferred renderer. Instead of shading, I merely add the light’s ID to a linked list for each tile. I use the down-sampled depth buffer from the previous step to quickly reject pixels that definitely have no lights affecting them.

Where with a deferred renderer you might use Carmack’s Reverse, in this case that won’t work. Rejecting pixels that pass the Z-test means you’re rejecting pixels that might include a transparent object. You won’t have information about transparency in your down-sampled depth buffer, so you have to be conservative here.

I still do stencil-based culling; I only discard the pixels that fail the Z-test. The stencil buffer still comes in handy in the case that the camera intersects the volume, but in all other cases the depth buffer would have probably sufficed.

The stencil mode I used is:

// Depth test parameters
dsDesc.DepthEnable = true;
dsDesc.DepthWriteMask = D3D11_DEPTH_WRITE_MASK_ZERO;
dsDesc.DepthFunc = D3D11_COMPARISON_LESS_EQUAL;

// Stencil test parameters
dsDesc.StencilEnable = true;
dsDesc.StencilReadMask = D3D11_DEFAULT_STENCIL_READ_MASK;
dsDesc.StencilWriteMask = D3D11_DEFAULT_STENCIL_WRITE_MASK;

// Stencil operations if pixel is front-facing
dsDesc.FrontFace.StencilFailOp = D3D11_STENCIL_OP_KEEP;
dsDesc.FrontFace.StencilDepthFailOp = D3D11_STENCIL_OP_DECR;
dsDesc.FrontFace.StencilPassOp = D3D11_STENCIL_OP_KEEP;
dsDesc.FrontFace.StencilFunc = D3D11_COMPARISON_ALWAYS;

// Stencil operations if pixel is back-facing
dsDesc.BackFace.StencilFailOp = D3D11_STENCIL_OP_INCR;
dsDesc.BackFace.StencilDepthFailOp = D3D11_STENCIL_OP_INCR;
dsDesc.BackFace.StencilPassOp = D3D11_STENCIL_OP_INCR;
dsDesc.BackFace.StencilFunc = D3D11_COMPARISON_ALWAYS;

In short, I always allow back faces to increment the buffer, but front faces will only decrement when their fragments fail the depth test. I use no back-face culling, so this prevents any overdraw in my light volume pixel shader.

Update: Since writing this I’ve switched to a Compute Shader for the list generation. I used a line-sphere test in the Compute Shader to determine the maximum and minimum depth values at each corner of each tile. I added lights to either the opaque or transparent buffers based on whether the light intersected this frustum section.

The speed increase by using the CS is easily by an order of magnitude. Beforehand I was primitive assembly bound due to the number of spheres that I had to render, whereas running a compute shader (clipped to the bounds of the light in screen space) was much more efficient. Using the line test also meant I didn’t have to grow my spheres by a tile, which previously meant I was getting conservative light addition in some cases.

Tiles being affected by lights

Tiles being affected by lights

On the right is a render of all pixels that could potentially be shaded by lights. You can see the light volume in the floor will only shade inside the corridor.

The linked list generation is based on AMD’s Order Independent Transparency presentation. I have two Unordered Access Views: one stores the linked list elements (the LinkBuffer), and the other stores the offset into the LinkBuffer that marks the start of the list for that tile (the HeadBuffer).

The LinkBuffer could potentially use unbounded amounts of storage because we don’t know how many lights will overlap (thus, we can’t determine how large each tile’s linked list will be), so when you allocate your LinkBuffer make sure you estimate conservatively. I give mine 16 links for each tile, so I have a buffer of (X/T)*(Y/T)*16 link elements, assuming the back buffer dimensions are (X, Y) and a tile size of T. 16 might be a bit excessive for typical game scenes though.

Because this is also going to be a big buffer, I don’t store any lighting data in it; I only store the ID of the light and the index of the Next link item. In HLSL:

struct LightLink
{
    uint LightID;
    uint Next;
};

The HeadBuffer is a constant size: one uint per tile. I clear each pixel in this buffer to -1 at the start of the linked list generation process.

I draw each light volume with no culling and depth-read, in two passes. The first pass writes to the stencil buffer, so I turn off colour writes, set the magic stencil mode and draw a sphere. The second pass is drawn with the LinkedList generation pixel shader and only executes if the stencil == 1. This pass also resets the stencil to 0 for each pixel that passes.

The volume pixel shader looks like this:

[earlydepthstencil]
void Main(float4 pos : SV_POSITION)
{
    uint x = pos.x; // X coordinate in pixels
    uint y = pos.y; // Y coordinate in pixels

    LightLink link;
    link.LightID = LightID;

    // (1) Increment the link counter and get the current link ID
    uint linkID = LinkBuffer.IncrementCounter();

    // (2)
    // Read any existing linkID in the head buffer
    // Replace it with the new link ID
    HeadBuffer.InterlockedExchange(
        (y * BufferWidth + x) * 4,
        linkID,
        link.Next);

    // Write the link into the link buffer
    LinkBuffer[linkID] = link;
}

In stage (1), we use the UAV’s IncrementCounter to give us the index of the next LightLink element we should write. We exchange this index with the value currently in the HeadBuffer (2), which becomes the new LightLink’s Next index. If the value in HeadBuffer is -1, link.Next will be -1 and that means this element is the end of the linked list.

Hopefully you can see how this works, but if not I highly recommend reading the AMD OIT presentation.

Update: A neat optimisation I’ve recently added is to generate two separate lists for opaque-affecting lights and transparent-affecting lights. Because transparent objects could have any depth less than the value in the depth buffer, all lights that pass the depth test need to be recorded. However, for opaque objects, we can reject lights that we know don’t intersect anything.

For each tile I accumulate a minimum and maximum depth value. If a light doesn’t fall inside [0, MaxZ] then it’s culled completely. If it does, it gets added to the TransparentLightBuffer. If it also falls between [MinZ, MaxZ], I add it to an OpaqueLightBuffer. Obviously this breaks down if a tile has a large depth range, but it’s still a significant saving in the general case.

Finally, because we want the opaque materials to be as fast as possible, I serialise the lighted linked lists for each tile into a flat array using another CS. This has the obvious drawback of having to allocate a fixed maximum number of lights per tile, but it nets a huge speed win.

Rendering the Scene

Once you’ve generated your linked lists, you can bind the LinkBuffer and HeadBuffer UAVs to your pixel shaders that need lighting information. You’ll also need to bind a buffer containing all your lighting data. I do this by storing my point light data in a StructuredBuffer for random access by the pixel shader:

struct PointLight
{
    float3 Position;
    float Radius;
    float3 Colour;
};

StructuredBuffer<PointLight> PointLightBuffer;
StructuredBuffer<LightLink> LinkBuffer;
Buffer<uint> HeadBuffer;

The lighting part is easy. First, index into the HeadBuffer based on what tile this pixel touches using SV_Position:

uint x = vpos.x / BufferScale.x;
uint y = vpos.y / BufferScale.y;
uint head_index = y * BufferWidth + x;

BufferScale and BufferWidth in this case are:

BufferWidth = HeadBufferWidth;
BufferScale = { BackBufferWidth / HeadBufferWidth, BackBufferHeight / HeadBufferHeight };

To iterate over the lights, read the index of the first LightLink from the HeadBuffer:

uint next = HeadBuffer[head_index];

Then iterate until you hit the end of the list:

while (next != 0xFFFFFFFF)
{
    LightLink link = LinkBuffer[next];
    PointLight pointLight = PointLightBuffer[link.LightID];
    lighting += Shade(worldPos, worldNormal, eyeVec, specPower, pointLight);

    next = link.Next;
}

Shadows

Orc

Really need some shadows...

One problem that people have flagged up is shadows: if you wanted to mimic current algorithms you’d have to bind every possible shadow map to your pixelshader, which obviously isn’t viable.

Instead I was considering rendering all the shadow maps into a virtual texture. This wouldn’t be a complicated change, and it would still be possible render all the point light shadows in one pass using a geometry shader and six different viewports. All I’d have to do is upload the UVs in the shadow map that each light uses.

People have mentioned texture arrays, but that means each shadow map would have to be the same resolution, which doesn’t sound like a great solution to me.

A Quick Note on Driver Stability

I’ve been queried about my use of InterlockedExchange in the linked list generation. It’s true that only one fragment should be written to that tile at any one time, but for some reason not doing InterlockedExchange causes my GPU to hang. I’m updating my drivers as I type this to hopefully fix this, as I can’t think of any reason a normal Load/Store wouldn’t suffice instead.

Note that the GPU can and will hang if your linked list is corrupt. It will spin through garbage link values until it sees a -1, and given that you have millions of lists, you’re unlikely to avoid a total GPU meltdown. Windows takes it down most of the time. The times that it doesn’t usually mean a BSOD is heading your way.

Overall

BRDFs

Mixing different BRDFs

The Sponza demo runs at 1080p with 8x MSAA at approximately 14ms on my Radeon HD 5700 with 213 lights.

I’d really like to see this side-by-side with a deferred renderer. I would assume this technique is slower than deferred due to the cache performance of per-pixel linked list traversal, but it does offer significant advantages. I wonder if it might actually be faster than deferred if you’re bandwidth bound on your G-buffer. Either way, it’s still a huge improvement over a multi-pass forward renderer with many dynamic lights.

P.

 

13 thoughts on “Tile-Based Forward Rendering

  1. Pingback: Tile-Based Forward Rendering考察記事

  2. Pingback: Lost in the Triangles » Blog Archive » Tiled Forward Shading links

  3. Thank you for the very interesting article!

    You mention avoiding doing one geomtery pass per dynamic light as an advantage of this method, I was wondering though if people really do that? I mean a more sensible approach for forward rendering is to determine the (small) number of lights that affect a surface and pass them all in the shader as constants, lighting and shading the surface in one geometric pass. Granted, the number of supported lights will be small, but in truth so is the number of lights that actually affect a surface.

    Best regards,

    Kostas

    • It depends on the engine and your lighting set up. I’ve seen both methods used in practice, but I only thought it fair to compare the algorithm to others that support arbitrary numbers of lights.

  4. Pingback: Forward框架的逆袭:解析Forward+渲染 - KlayGE游戏引擎

  5. Hi,
    Thanks for sharing all these informations with us, Tile-based forward rendering looks very exciting but I was wondering if someone ever come with a good idea to handle the different types of lights (Point lights, spots, etc…) so that the actual shading code can take different code path depending on the type of light. My first ideas were (not tested):
    - Having a light type ID OR’ed with light index?
    - Having a linked list per light type?
    I’d really like to see a sample with not only point lights :)

    • Hi benualdo,

      Sorry for the delay in responding – I’ve been working quite a lot lately!

      I’m going to be implementing spotlights very soon. My colleague Ben Woodhouse — who’s been working on his own implementation independently — has spotlights integrated already, although I’ll have to chat with him again about how he’s handling them. The last time we caught up we both thought that having linked list per light type was the most scalable method of handling multiple light types, even if it is ugly on the code-side.

      With the linked-list flattening optimisation for opaque surfaces the overhead of this should be fairly minimal, but you’ll obviously have to have the memory for the extra buffers.

  6. Pingback: Tile-Based Forward Rendering: Update | PJB::Articles

  7. Pingback: Futuremark toont trailer nieuwe 3DMark-test | Tech-nieuws

  8. i have a idea to use tile-based forward rendering in dx9, the key point is generate per-tile linked lists.

    in your article you assign (X/T)*(Y/T)*16 link elements, assuming the back buffer dimensions are (X, Y) and a tile size of T.

    first, i create a render target to save per-tile lights id list, one tile assign 16 pixel, so the render target size is ((X/T)*(Y/T))*16, if reslution is 1280 * 720, the render target size is 3600 * 16, is not very big.

    then i generate a vertex buffer to inclue all light information, according the light position and radius we can calculate which tile intersect with light volume in xy, every intesected tile add one point into the buffer, every point include the light position and radius and use transformed xy(tile xy).

    final, render this vertex buffer(use point list) to the screen, then use the light position, radius, tile min and max z, we can know if the tile intersect with the tile.

    if intersect happen, wirte the light id to tile light list buffer in the render target. every tile has 16 pixel in the render target, use other render target to record light count(n) already in the buffer, write light id to the nth pixel and write n+1 to other render target.

    so we get per-tile lights list, because calculate intersect is not accurate to the pixel, so the result is conservative, the lights list count maybe large than use dx11, but the render target size is not very large, we can use 32 pixel per tile.

    one question is every tile use a fix size buffer, not a vector in dx11 which can dynamic increase size, is max 31 lights enough?

    i don’t implement it yet, when i read i have a inspiration, is this method work?

  9. Always add some loop counter based early exit code while working with data driven loop in a compute shader or a pixel shader ! And disable it by macro when you have the finalized and safe version of the thing :)

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>