
Post a Comment On: C0DE517E

"Tiled hardware (speculations)"

5 Comments

Blogger Fabian Giesen said...

The vertex positions+indices don't generally go into a per-tile storage; it's basically a unified scratch buffer (allocated in chunks or similar). These can get fairly big so they're generally streamed to memory. Even so you can run out and then might need to do a partial flush (render everything queued so far to free up memory). These are expensive and you really want to not do that. And yes, everything you write into the bin buffers is compressed.
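As a toy illustration of the chunked scratch buffer and partial flush described above — a Python sketch where `Binner`, `CHUNK_SIZE`, the pool size, and the flush policy are all invented for illustration, not any real GPU's design:

```python
import math

CHUNK_SIZE = 4096      # bytes per chunk of the shared scratch pool (made up)
POOL_CHUNKS = 8        # chunks available before a flush is forced (made up)

class Binner:
    def __init__(self):
        self.free_chunks = POOL_CHUNKS
        self.bins = {}     # tile id -> list of per-chunk byte counts
        self.flushes = 0   # how many (expensive!) partial flushes occurred

    def _alloc_chunk(self, tile):
        if self.free_chunks == 0:
            self.partial_flush()       # render everything queued so far
        self.free_chunks -= 1
        self.bins.setdefault(tile, []).append(0)

    def append(self, tile, nbytes):
        """Append nbytes of (compressed) primitive data to a tile's bin."""
        assert nbytes <= CHUNK_SIZE
        chunks = self.bins.get(tile)
        if not chunks or chunks[-1] + nbytes > CHUNK_SIZE:
            self._alloc_chunk(tile)
        self.bins[tile][-1] += nbytes

    def partial_flush(self):
        """Drain all queued bins to free the scratch pool (costly, avoid)."""
        self.flushes += 1
        self.bins.clear()
        self.free_chunks = POOL_CHUNKS
```

The point of the sketch is the failure mode: once the ninth full chunk is requested against an 8-chunk pool, everything queued so far has to be rendered just to reclaim memory.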

The vertex shader split is a thing you can do (we did in Omatic, at least in certain cases). It helps a lot sometimes. The trade-off here is that any vertex shading you do late (per-tile) gets re-run for every tile that a vertex is referenced in; vertex shading you do up-front needs to store its results in mem (which gets big quickly!) but is only done once.
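The trade-off can be modeled very roughly in a few lines — purely illustrative Python; the function names, parameters, and unit costs are assumptions, not anything a real driver exposes:

```python
def upfront(verts, shade_cost, bytes_per_vert):
    """Shade each vertex once; store every shaded result in memory."""
    return {"shading": verts * shade_cost,
            "storage_bytes": verts * bytes_per_vert}

def per_tile(verts, shade_cost, tiles_per_vert):
    """Re-shade a vertex in every tile that references it; no bulk storage."""
    return {"shading": verts * tiles_per_vert * shade_cost,
            "storage_bytes": 0}
```

With, say, 1000 verts at 32 bytes each and an average of 2.5 tiles touched per vertex, up-front shading pays 32 KB of storage while per-tile shading pays 2.5x the shading work — which is the trade-off the comment describes.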

Vertex shading invoked that way hurts somewhat more than "regular" vertex shading since the effect of running post-cull and post-Z-test (!) is that the index sequence is more random, so the memory read patterns are worse.

Programmable blending doesn't really have that much to do with the speed of memory; the key issue is the need to schedule the final blending stage of fragment shaders in-order. This is easier to coordinate in a fixed-size tile than for a full render target; e.g. for a 32x32 pixel tile, a 16x16 quad single-bit scoreboard of "write pending for this quad" is sufficient to identify conflicts, which is quite cheap, and doesn't need to coordinate with other tiles. This part is harder in an IM renderer because you don't necessarily know which other pending warps to synchronize *with*; they could be anywhere! (In practice, there's some sort of binning anyway, which simplifies things, but I digress.)
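The scoreboard idea might be sketched like this — a toy Python model of the 16x16 single-bit "write pending for this quad" scoreboard for a 32x32 tile; the class and method names are invented for illustration:

```python
TILE = 32
QUADS = TILE // 2          # 16x16 quads of 2x2 pixels -> 256-bit scoreboard

class TileScoreboard:
    def __init__(self):
        self.pending = 0   # one bit per quad, packed into an int

    def _bit(self, qx, qy):
        return 1 << (qy * QUADS + qx)

    def try_issue(self, qx, qy):
        """Issue a fragment quad unless an older write to the same quad is
        still pending (issuing anyway would break blend ordering)."""
        b = self._bit(qx, qy)
        if self.pending & b:
            return False   # conflict: must wait for the older warp
        self.pending |= b
        return True

    def retire(self, qx, qy):
        """The older quad's blend has completed; clear its pending bit."""
        self.pending &= ~self._bit(qx, qy)
```

Because the scoreboard is scoped to one tile, 256 bits suffice and no cross-tile coordination is needed — the cheapness the comment is pointing at.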

In terms of HW cost, I don't think the specific-to-TBDR hardware ends up taking any more area than what IM renderers spend on schemes like Z and color compression or early-Z/early stencil that TBDRs don't have much need (or use) for.

Triangle count is absolutely a big issue, because vertex data can get a lot bigger than what you usually store per pixel. If you have dense meshes with tris averaging ~5 pixels, then their diameter is ~2.24 pixels, and (assuming the mesh is like a quad grid) a 32x32 tile will contain a ~14x14 grid of quads = 225 verts. If each vert has 32 bytes of attribute payload (post-shading! That's 8 scalar floats), that's 7.2k bytes per 1k pixels, so ~7 bytes (=56 bits)/pixel. That's still OK (well, with PC memory bandwidths; with mobile this is already worrisome), but if you have slightly denser meshes or more attributes, this gets ugly quick. (You can do some re-shading instead, which needs less memory for shaded verts, but more for attribute fetch, and comes with scheduling issues.)
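The back-of-envelope numbers above can be reproduced directly (the inputs are the comment's stated assumptions, not measured data):

```python
import math

tri_area = 5.0                              # avg pixels per triangle
tri_diameter = math.sqrt(tri_area)          # ~2.24 px for a quad-grid mesh
tile = 32
quads_per_side = int(tile / tri_diameter)   # ~14 quads across a 32px tile
verts = (quads_per_side + 1) ** 2           # 14x14 quad grid -> 15x15 = 225 verts
bytes_per_vert = 32                         # 8 shaded scalar floats
total_bytes = verts * bytes_per_vert        # bytes of shaded verts per tile
per_pixel = total_bytes / (tile * tile)     # ~7 bytes (= 56 bits) per pixel
```

Doubling the vertex density or the attribute payload pushes this past 14 bytes/pixel, which is where "gets ugly quick" kicks in.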

With IM, you want to compress depth, color buffers, etc. to save mem bandwidth, but those are fairly regular data structures with a fixed format, and you don't spend mem BW on shaded vertex data (you do need internal interconnects and buffers to get the shaded verts to where they're needed, and that part is gnarly as hell). With a TBDR you want compressed vertex data (both pre- and post-shading) and that's messier since it's more configurable and less regular than pixel formats are.

A final issue you don't mention is warp/wavefront size. Tilers want them a bit smaller (or else you want bigger tiles). The problem is that if you have say a 32x16 tile, there's only 512 pixels in there = 8 full GCN wavefronts if everything in that tile is one shader. More likely, if a tile is touched by 2-3 shaders, then each of them will run maybe 3 or 4 wavefronts, one of which is half-full. You get more wasted utilization from partially-filled waves and your shader cores are switching shaders a lot more often. (Which means they need to be designed so they're efficient at switching shaders every handful of waves, which they aren't necessarily right now).
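The occupancy hit can be quantified with a tiny model — illustrative Python following the comment's numbers (32x16 tile, 64-lane GCN wavefronts); the per-shader pixel split in the example is an assumption:

```python
import math

WAVE = 64                  # GCN wavefront size (lanes)
TILE_PIXELS = 32 * 16      # 512 pixels = 8 full wavefronts

def waves_and_utilization(pixel_counts):
    """pixel_counts: pixels covered by each shader within the tile.
    Each shader rounds up to whole wavefronts, so the last wave of each
    shader may be partially filled."""
    waves = sum(math.ceil(p / WAVE) for p in pixel_counts)
    util = sum(pixel_counts) / (waves * WAVE)
    return waves, util
```

One shader covering the whole tile packs perfectly (8 waves, 100% lanes used); split the same 512 pixels across three shaders and you pay extra, partially-filled waves — plus a shader switch every few waves.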

August 7, 2017 at 6:23 PM

Anonymous BartW said...

In the case of shading only vertices and detecting overdraw "perfectly", how does it handle alpha-tested geometry? Any triangle could have arbitrary holes in it... Unless there is "only" sorting of triangles (imperfect)?

August 7, 2017 at 6:57 PM

Blogger Fabian Giesen said...

Anything alpha-tested, with true blending, writing output Z, etc. doesn't get deferred and does not get perfect overdraw elimination in TBDRs.

You're *strongly* encouraged to draw all opaque geometry first and anything with alpha test/transparencies second because of this.

August 7, 2017 at 8:49 PM

Blogger Unknown said...

This is anecdotal but I'm impressed with what's possible at 60fps on mobile at 2048x1536, as long as you do all the work on each tile without round tripping to memory. For example it can deal with a lot of full screen particle overdraw -- fragment shaders do more work but no extra memory bandwidth is used. I remember previous gen consoles getting destroyed by that kind of thing at much lower resolutions.

One pet peeve is that you can read the framebuffer color in the fragment shader in GLES 2.0 (iOS) but not the depth, even though both values are right there on-chip. Maybe this is fixed in more recent APIs; I haven't checked.

August 8, 2017 at 11:05 AM

Blogger DEADC0DE said...

I'll reword the post a bit. Fabian: when I wrote "per tile storage" referring to the indices I didn't mean on-chip, but off-chip (ram) logically organized in tile bins.

August 8, 2017 at 11:15 AM
