Post a Comment On: cbloom rants

"08-25-10 - ProfileParser"

26 Comments
Blogger Brian said...

We do the exact same thing. We define big thread local arrays and have nearly identical macros.
At the end of the program we dump the arrays to a disk file and have a Java program that post processes them. We're going to generate time line plots using gnuplot so you can see load balancing problems.
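A minimal sketch of that kind of setup (all names here are hypothetical, invented for illustration, not Brian's actual code):

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>

// Hypothetical sketch of the scheme described above: a big thread-local
// event array filled by scope macros, dumped at exit, post-processed offline.
struct ProfEvent {
    const char* label;   // nullptr marks a scope exit
    uint64_t    ticks;
};

static const size_t kMaxEvents = 1 << 12;
thread_local ProfEvent g_events[kMaxEvents];
thread_local size_t    g_eventCount = 0;

inline uint64_t ReadTicks() {
    // Stand-in for a raw timestamp counter (rdtsc, mftb, ...)
    return (uint64_t)std::chrono::steady_clock::now().time_since_epoch().count();
}

struct ProfScope {
    explicit ProfScope(const char* label) {
        g_events[g_eventCount++] = { label, ReadTicks() };
    }
    ~ProfScope() {
        g_events[g_eventCount++] = { nullptr, ReadTicks() };
    }
};

// Paste __LINE__ in so two scopes in one function don't collide.
#define PROF_CAT2(a, b) a##b
#define PROF_CAT(a, b)  PROF_CAT2(a, b)
#define PROFILE_SCOPE(name) ProfScope PROF_CAT(prof_scope_, __LINE__)(name)
```

The in-game cost is just two stores per scope enter/exit; all the pairing and aggregation happens in the offline post-process.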

Our motivation is that all of the Linux profilers are awful at communicating, in any reasonable way, how much time you spend in the kernel processing a libc call.

August 25, 2010 at 11:56 AM

Blogger cbloom said...

Dealing with all the different profile builds on all the platforms is a nightmare IMO. The XbTrace stuff for PIX is probably the best of all worlds, but even that is a bit of a hassle because it's awfully slow. Generally, though, it's nice to just be able to flip on "DO_PROFILE" and get info on all my platforms, then view the info in a consistent way.

August 25, 2010 at 1:11 PM

Blogger Jeff Roberts said...

Telemetry does exactly this - it even has the gross scope macro. You also need to store the current thread id, though.

August 26, 2010 at 1:54 AM

Blogger Jeff Roberts said...

There was supposed to be a "[grin]" at the end of "gross scope macro", but the comment window parsed my action wrong...

August 26, 2010 at 1:56 AM

Anonymous Anonymous said...

It's an open question to me whether streaming the data out is better than immediately encoding into info with a custom stack like iprof does, since the long stream is going to touch a lot of memory and cost performance that way.

It certainly preserves more information though.

August 26, 2010 at 12:07 PM

Anonymous Anonymous said...

oh nevermind, you mentioned the write-combining possibility. although that doesn't really work unless you write out a whole cacheline? or something.

August 26, 2010 at 12:09 PM

Blogger cbloom said...

"oh nevermind, you mentioned the write-combining possibility. although that doesn't really work unless you write out a whole cacheline? or something."

Yeah, it will hold the partially written CL in the store buffer for a while, I'm not sure exactly what triggers it to give up and have to fetch the line.

BTW it would be nice if the PC had a mechanism like dcbz ; and even dcbz isn't exactly what you want. You want to be able to say "I'm streaming into some buffer and will write the bytes sequentially, and I don't care what the contents are in the region after my write pointer".

August 26, 2010 at 12:24 PM

Anonymous Anonymous said...

If it needs a whole cache line (if it can't communicate with memory in smaller amounts), then you're kind of fucked, because I'm pretty sure that stuff just sits in the write buffers, so it won't last very long (not long enough to finish out the cache line).

In Indie Game Jam 0, I tried to make the sprite rendering go through Nvidia's VAR extension, which required writing to write-combining memory, and if I tried to construct the data on the fly it just always failed. The only way I was able to hit the presumably optimal performance was to write to a small (1K?) buffer and then copy that buffer directly to the WC memory. (I say presumably optimal because that got me to performance parity with doing it the naive OpenGL immediate way (multiple function calls per vertex), which I suspect Nvidia had already optimized themselves to do the same thing.)

Of course machines are different now and the write-combining stuff has had time to be tweaked and improved, but I can't imagine it's more than the write buffers.

August 26, 2010 at 1:06 PM

Anonymous Anonymous said...

er, store buffers. whatever kids are calling them these days.

August 26, 2010 at 1:08 PM

Blogger cbloom said...

I'm no expert on this shit anymore, but I believe that while a partial write is slower, it still doesn't pollute your caches, so it's good in that sense.

Also you can always try having a little cache-line-sized temp buffer to accumulate the push/pops and then flush it out to the main buffer using WC on that granularity. This just steals one cache line which is still significantly lower impact than "iprof" style.
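A sketch of that staging-buffer idea (hypothetical names; the flush here is a plain memcpy standing in for the non-temporal stores, e.g. _mm_stream_si128 on x86, that would make the scheme pay off):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Push/pop records accumulate in one cache-line-sized staging buffer, and
// whole lines are flushed out to the big log. The flush is a plain memcpy
// here; the point of the scheme is that on x86 it would be streaming stores,
// so only this one "stolen" cache line ever lives in the cache.
static const size_t kLineBytes = 64;

struct LineStager {
    alignas(64) uint8_t stage[kLineBytes];  // the one stolen cache line
    size_t   used;   // bytes currently staged
    uint8_t* out;    // write cursor into the big log buffer

    explicit LineStager(uint8_t* log) : used(0), out(log) {}

    void Push(const void* rec, size_t bytes) {  // requires bytes <= kLineBytes
        if (used + bytes > kLineBytes)
            Flush();
        memcpy(stage + used, rec, bytes);
        used += bytes;
    }

    void Flush() {
        memcpy(out, stage, used);  // <- streaming stores would go here
        out += used;
        used = 0;
    }
};
```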

Also modern Intel processors have 4 or more store buffers, not 1 like the old days.

BTW I'm still a big fan of iprof style mini-profilers for when you want realtime in-game info. It's nice to be able to just pull up your HUD and get some numbers. In cblib I support both methods and you can toggle each on and off with a #define.

August 26, 2010 at 1:36 PM

Blogger ryg said...

Quoting from the IA-32 manuals on write combining (Vol 3A, System Programming Guide): "Writes to the WC memory type are not cached in the typical sense of the word cached. They are retained in an internal write combining buffer (WC buffer) that is separate from the internal L1, L2, and L3 caches and the store buffer. [..] The size and structure of the WC buffer is not architecturally defined. For the Intel Core 2 Duo [etc.] processors; the WC buffer is made up of several 64-byte WC buffers."
Plus a lot of notes about how WC memory is explicitly weakly ordered and how you have to issue an sfence if you want the data to reliably show up in memory. No details on how buffer allocation works or when they're flushed. Anyway, WC buffers are separate from regular store buffers, so data can be sitting there for quite some time even when there are a lot of other memory accesses going on (which is good for this application).

Even if you're not writing full cache lines, WC is a win in this case because the memory is never going to be read unless you actually do a read on it (Fallback for non-full-cacheline WC writebacks is multiple partial write transactions, not fetch whole cache line - apply changes - write whole cache line).

Previous processors had fewer WC buffers (sometimes only 1), and some architectures have more stringent requirements on write-combined writes (e.g. "must be strictly sequential access with no holes").

"You want to be able to say "I'm streaming into some buffer and will write the bytes sequentially, and I don't care what the contents are in the region after my write pointer"."
One thing that *ahem* certain platforms have is a mechanism called the "write gather pipe". You set the destination address once, then write whatever you want to a specific MMIO address and it will just output everything sequentially in cache line-sized transactions without ever reading memory or polluting the cache. It also packs accesses tightly, i.e. a byte write will only advance the output pointer by 1 byte and there's no penalty when writes don't end up being naturally aligned or cross cache lines. It's the awesome way to do queuing.

"I tried to make the sprite rendering go through Nvidia's VAR extension which required writing to write combining memory, and if I tried to construct the data on the fly it just always failed."
Did you check the disassembly? When accessing WC memory, it's absolutely crucial that you write the whole buffer without holes, and you really want the accesses to be all sequential (on older processors with just 1 WC buffer, non-sequential writes are as bad as holes, since they effectively amount to the same thing).

In C/C++, that means you basically have to make your writes volatile, which sucks since it's way too strong, or alternatively insert a compiler write barrier after every store, which also sucks since it's so messy.
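For illustration, the two options might look like this (ordinary memory is used here, only the store patterns are the point; the barrier macro is GCC/Clang inline-asm syntax, and MSVC has _WriteBarrier() for the same purpose):

```cpp
#include <cstdint>

// Option 1: volatile-qualified destination. Every store is emitted, in
// program order, exactly as written - stronger than needed, but hole-free.
void FillVolatile(volatile uint32_t* wc, uint32_t base) {
    for (uint32_t i = 0; i < 16; ++i)
        wc[i] = base + i;
}

// Option 2: normal stores with a compiler-only write barrier after each.
// No CPU fence is emitted - it only stops the compiler from reordering
// or coalescing the stores.
#define COMPILER_BARRIER() __asm__ __volatile__("" ::: "memory")

void FillBarrier(uint32_t* wc, uint32_t base) {
    for (uint32_t i = 0; i < 16; ++i) {
        wc[i] = base + i;
        COMPILER_BARRIER();
    }
}
```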

If an extra memcpy makes things go faster, you can practically count on the write sequence in the generated code being fucked up.

August 26, 2010 at 9:23 PM

Anonymous Anonymous said...

Since I've never figured out how to make iprof-style profiling work efficiently threaded, and write-to-a-buffer can just write out thread IDs, you're probably right that storing to a small (cache-line-sized) buffer and flushing with non-temporal stores is going to be the lowest impact.

If you then stream that out to another process, i.e. the Telemetry model, and you have free CPU cores so that process can run without interfering with the thing you're profiling, that probably is the performance sweet spot. (So long as it's designed for the level of fine-grained profiling that iprof was.)

August 27, 2010 at 1:51 PM

Anonymous Anonymous said...

aren't we conflating two different things, the write-combining memory type and the non-temporal streaming stores? i mean, I can believe they might use the same hardware mechanisms, but the rules may not be the same.

(I know the streaming, non-cache-polluting store was something I pitched to Intel when they came by Looking Glass one time, so I like to pretend I'd had some effect when it showed up many years later, but I imagine it was pretty obvious.)

ryg, I imagine I looked at the asm, but who knows now. I don't remember if I did the volatile stuff or not. I think I got the impression that it was doing too much other work on the way to generating the data--enough work that it had to spill registers, which meant it had to do other writes--and I assumed that either those writes were interfering, or I was just doing it "too slow", or something. It didn't really matter, because once I did the memcpy() and found the original naive glVertex() implementation was just as fast, I figured I wasn't leaving much performance on the floor (yes, the memcpy is overhead, but not that much), so I killed it (IIRC).

August 27, 2010 at 4:20 PM

Blogger ryg said...

"aren't we conflating two different things, the write-combining memory type and the non-temporal streaming stores? i mean, I can believe they might use the same hardware mechanisms, but the rules may not be the same."
Intel's manuals explicitly specify that a non-temporal write is treated exactly as if it went to write-combined memory. Quoting again:

"If a program specifies a non-temporal store with one of these instructions and the destination region is mapped as cacheable memory (write back [WB], write through [WT] or WC memory type), the processor will do the following:
* If the memory location being written to is present in the cache hierarchy, the data in the caches is evicted.
* The non-temporal data is written to memory with WC semantics."

August 27, 2010 at 8:16 PM

Blogger Tom Johnstone said...

I loved the lightweight profiler idea so much I decided to rip out our shitty existing profiler and implement something similar. Thread safety became an issue, though, and I didn't want to have a big-ass array for every thread (i.e. in TLS), so I implemented the basic idea like this:

static const int kBufferSize = 1024*1024;
static u32 sPerfBuffer[kBufferSize >> 2];
u32* gpCurrPerf = sPerfBuffer;

...

class cPerfCounter
{
public:
    __forceinline cPerfCounter(const char* name)
    {
        u32* curr = (u32*)AtomicAdd((volatile u32*)&gpCurrPerf, 12);
        curr[0] = (u32)name;
        curr[1] = GetCurrentThreadId();
        curr[2] = union_cast<u32>(cTimer::GetTimeMs());
    }

    __forceinline ~cPerfCounter()
    {
        u32* curr = (u32*)AtomicAdd((volatile u32*)&gpCurrPerf, 12);
        curr[0] = 0;
        curr[1] = GetCurrentThreadId();
        curr[2] = union_cast<u32>(cTimer::GetTimeMs());
    }
};

Then in the post-processing step I just loop over the single array matching up begin/end tags with matching thread ID's and names.
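That matching pass might look roughly like this (hypothetical record layout; a stack per thread pairs each end tag with the most recent unmatched begin on that thread):

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Records in the single shared buffer: (name, thread id, time), with
// name == nullptr marking an end tag.
struct PerfRecord {
    const char* name;    // nullptr = end tag
    uint32_t    tid;
    double      timeMs;
};

// Pair up begins and ends per thread, yielding (name, duration) samples.
std::vector<std::pair<std::string, double>>
MatchScopes(const std::vector<PerfRecord>& recs) {
    std::map<uint32_t, std::vector<const PerfRecord*>> stacks;
    std::vector<std::pair<std::string, double>> samples;
    for (const PerfRecord& r : recs) {
        std::vector<const PerfRecord*>& stack = stacks[r.tid];
        if (r.name) {
            stack.push_back(&r);                    // begin: push on this thread's stack
        } else if (!stack.empty()) {
            const PerfRecord* open = stack.back();  // end: pair with innermost begin
            stack.pop_back();
            samples.push_back({ open->name, r.timeMs - open->timeMs });
        }
    }
    return samples;
}
```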

August 30, 2010 at 7:04 PM

Blogger Tom Johnstone said...

The formatting got a little jacked there, but if you're wondering, the union_casts are converting bitwise from floats to u32's.

August 30, 2010 at 7:06 PM

Blogger cbloom said...

Cool. Yeah atomic add is one good way to handle threading.

The other way is to have a std::vector per thread in TLS. Then you take the time before you do the extend, and add that as a special marker that gets subtracted out in the post-process.

That way if the vector has to do a malloc to grow it doesn't show up in the timing at all. Obviously it will distort things a bit even if that timing is subtracted, but this will be lower impact on average than the atomic add.

e.g. something like:

Push()
{
    U64 t = GetTime();
    vec.extend(3);
    vec += label;
    vec += t;
    vec += GetTime();
}

No need to store a thread index because there's a vector per thread. In the post-process, the difference between the two times is subtracted from the current scope and also from all its parents.
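The subtraction step might be sketched like this (hypothetical names; the gap around vec.extend is removed from every scope that is open when it happens):

```cpp
#include <cstdint>
#include <vector>

// Each push records the time before and after vec.extend(); in the
// post-process, that gap is subtracted from the running total of the
// current scope and every parent open at the time, so a malloc inside
// extend() never shows up in anyone's numbers.
struct ExtendMarker { uint64_t before, after; };  // times bracketing vec.extend()

// openScopeCosts[i] = accumulated time for the scope at stack depth i,
// from the root (index 0) down to the innermost open scope.
void SubtractExtendCost(std::vector<uint64_t>& openScopeCosts,
                        const ExtendMarker& m) {
    uint64_t gap = m.after - m.before;
    for (uint64_t& cost : openScopeCosts)
        cost -= gap;  // current scope and all parents each lose the gap
}
```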

August 30, 2010 at 9:18 PM

Blogger Tom Johnstone said...

Yeah, I thought about doing it that way, but I really wanted to avoid multiple buckets (vectors) per thread, since I want to be able to specify exactly up front how much memory will be reserved for the profile buffer. (I'm not anti dynamic memory allocation, but I try to avoid it as much as possible, even with our custom allocators, which are designed to handle this kind of behaviour.) I guess the overhead goes away once your buffers reach their maxima, but you still have a branch in there to check the size.

I also didn't want to introduce the overhead of malloc/new to the profiler, since parent begin/end tags will still contain the overhead of child tags, even with the malloc before the start-tag/after the end-tag.

That all said, I'm probably splitting hairs over a miniscule performance loss.

August 30, 2010 at 9:33 PM

Blogger Tom Johnstone said...

Reading comprehension FTW - I misread your reply. OK, subtracting the time difference from all parent tags wouldn't pollute the timings, but I guess you have the overhead of calling QPC twice, or whatever timing function you're using. Again, I think we're splitting hairs :)

August 30, 2010 at 9:48 PM

Blogger Brian said...

Atomic add? You should definitely benchmark that to compare performance under contention. I seem to remember ~40 cycles for an rdtsc time stamp into a thread-local array, and as many as 700 cycles for a contended cache line access (depending on the pair of cores). I would expect you'd do better with a conditional branch...

August 31, 2010 at 9:21 PM

Anonymous Anonymous said...

Pre-allocate one known-size big long array, then have each thread keep sub-arrays in it. Each thread does TLS to allocate from its sub-array, and if it exhausts its sub-array, it atomic adds to allocate a new chunk from the big array and starts using that as its own subarray.

You end up with fixed-size overhead per thread (the size of one chunk) instead of the arbitrary overhead of per-thread arrays, and reduced contention compared to atomic add, and even though you've got the per chunk overhead you probably actually fit more data in the same sized big array because you don't have to write out thread IDs on every operation, just once per chunk.
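A sketch of that chunking scheme (hypothetical names), where the shared atomic is only touched once per chunk:

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// One big preallocated log, carved into fixed-size chunks. A thread only
// hits the shared atomic when its current chunk fills up, and stamps its
// thread id once per chunk instead of once per record.
static const size_t kLogWords   = 1 << 16;
static const size_t kChunkWords = 256;

static uint32_t            g_log[kLogWords];
static std::atomic<size_t> g_logCursor(0);

struct ThreadLog {          // one of these lives in TLS per thread
    uint32_t* cur;
    uint32_t* end;
    uint32_t  tid;

    explicit ThreadLog(uint32_t threadId)
        : cur(nullptr), end(nullptr), tid(threadId) {}

    void Write(uint32_t word) {
        if (cur == end)
            NewChunk();     // rare: grab a fresh chunk from the shared log
        *cur++ = word;
    }

    void NewChunk() {
        size_t base = g_logCursor.fetch_add(kChunkWords);  // the only contended op
        cur = g_log + base;
        end = cur + kChunkWords;
        *cur++ = tid;       // chunk header: thread id written once
    }
};
```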

September 1, 2010 at 11:34 AM
