
Post a Comment On: cbloom rants

"12-03-11 - Worker Thread system with reverse dependencies"

12 Comments

Blogger Martin Ecker said...

Nice post. This is exactly how our task/worker system worked in the last commercial title I worked on. I can attest to this working really well in practice.

December 3, 2011 at 1:07 PM

Blogger ryg said...

"The opposite direction link I will call "permits" (is there a standard term for this?)"
The standard term is "anti-dependency". It's awful.

A somewhat nicer and also reasonably standard pair is "input dependency" (A depends on its inputs) vs. "output dependency" (someone depends on A's outputs).

December 3, 2011 at 8:30 PM

Blogger   said...

Clever!

December 4, 2011 at 12:25 AM

Comment deleted

This comment has been removed by the author.

December 4, 2011 at 5:28 PM

Blogger Miles said...

This is very similar to what we do, except we call permits "continuations". A task only has one continuation, but it can manually decrement the ref count on other tasks during its own run() method.

That works for most cases but is a bit more manual to set up.
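A minimal sketch of the continuation scheme Miles describes (all names here are hypothetical, not a real API): each task keeps a refcount of unfinished prerequisites and at most one built-in continuation, and its body may decrement other tasks' counts by hand.

```cpp
#include <atomic>
#include <functional>

// Hypothetical sketch of the "one continuation per task" scheme.
// Task, on_prereq_done, etc. are illustrative names only.
struct Task {
    std::atomic<int> pending{0};      // unfinished prerequisites
    Task* continuation = nullptr;     // the single task this one permits
    std::function<void()> body;       // may manually decrement other tasks
    bool ran = false;

    // Called once per finished prerequisite; runs when the count hits zero.
    void on_prereq_done() {
        if (pending.fetch_sub(1) == 1)
            run();
    }
    void run() {
        ran = true;
        if (body) body();
        if (continuation)             // the built-in "permit"
            continuation->on_prereq_done();
    }
};
```

Chaining A before B then amounts to setting `b.pending = 1` and `a.continuation = &b`; any additional dependents have to be decremented manually from A's body, which is the "more manual to set up" part.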

December 4, 2011 at 5:59 PM

Blogger Brian said...

We also do exactly this (we have a runtime library that our compiler-generated C code calls into).

One way of avoiding the lock on the permits list is to maintain a lock-free list of dependence objects in each task. When a task retires (finishes), it sets a finished flag in its record, goes through this list, does an atomic XCHG on each dependence item to set a flag, and if the XCHG succeeds decrements the dependence count of the dependent task.

When a task adds a dependence record to the task it depends on, it adds the record to the lock free list, does an MFENCE, checks if the task with the list has finished, and if so tries an atomic XCHG on the dependence record (and if it succeeds doesn't count the dependence in its local count).
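Reduced to its essentials, the handshake Brian describes might look like the sketch below. All names are made up, a plain vector stands in for the real lock-free list, and `exchange` plays the role of the XCHG: whichever side flips the record's flag from 0 to 1 is the one that accounts for it, so each dependence is settled exactly once.

```cpp
#include <atomic>
#include <vector>

// Hypothetical sketch of the retire / add-dependence handshake.
// A plain vector stands in for the real lock-free list of records.
struct Task;
struct DepRecord {
    std::atomic<int> claimed{0};   // whichever side exchanges 0->1 owns it
    Task* dependent = nullptr;
};
struct Task {
    std::atomic<bool> finished{false};
    std::atomic<int>  dep_count{0};
    std::vector<DepRecord*> records;   // really a lock-free list
};

// Retiring side: set finished, then try to claim each record; winning
// the exchange obliges us to decrement the dependent's count.
void retire(Task& t) {
    t.finished.store(true, std::memory_order_seq_cst);  // the fence point
    for (DepRecord* r : t.records)
        if (r->claimed.exchange(1) == 0)
            r->dependent->dep_count.fetch_sub(1);
}

// Adding side: count the dependence, publish the record, then check
// finished; if the prerequisite already retired, try to claim the record
// ourselves and undo our count. Either way the record is settled once.
void add_dependence(Task& prereq, DepRecord& rec, Task& dependent) {
    rec.dependent = &dependent;
    dependent.dep_count.fetch_add(1);
    prereq.records.push_back(&rec);    // real system: lock-free push
    if (prereq.finished.load(std::memory_order_seq_cst))
        if (rec.claimed.exchange(1) == 0)
            dependent.dep_count.fetch_sub(1);
}
```

Both orderings converge: if the record is added first, the retirer wins the exchange and decrements; if the prerequisite retires first, the adder sees `finished`, wins the exchange itself, and cancels its own count.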

One additional trick we do is to use a large bias in the dependence count of the task we are setting up, keep track of the dependences the task has, and then decrement the bias minus the actual count from its count. It avoids a bunch of more expensive atomic decrements...
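The bias trick might be sketched as follows (`BIAS` and all names are invented for illustration): the count starts at a large bias so it cannot hit zero while dependences are still being wired up, the setup code counts dependences with plain local arithmetic, and a single atomic subtraction settles the books at the end.

```cpp
#include <atomic>

// Hypothetical sketch of the biased dependence count. While the task is
// being set up, prerequisites may already be finishing and decrementing
// dep_count; the bias keeps it from reaching zero prematurely.
constexpr int BIAS = 1 << 20;

struct Task {
    std::atomic<int> dep_count{BIAS};
};

// Setup side: dependences are tallied in a plain local integer (no atomic
// RMW per dependence), then one atomic op removes the unused bias.
// Returns the settled count; zero would mean the task is already runnable.
int finish_setup(Task& t, int actual_deps) {
    int unused = BIAS - actual_deps;
    return t.dep_count.fetch_sub(unused) - unused;
}
```

The one atomic `fetch_sub` at the end replaces what would otherwise be an atomic increment per dependence during setup.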

Debugging all of this and getting it right was a major, major pain...

December 4, 2011 at 10:59 PM

Blogger cbloom said...

@ Brian - that stuff is much easier to get right with Relacy.

My problem is that objects can get deleted, so I basically have to hold a lock on them to prevent deletion. I started to make the "permits list" lockfree and realized it was pointless because I need a lock anyway to ensure lifetime.

I've got a nice way to make the fast path (no dependencies) lock free, so I'm happy with that.

December 4, 2011 at 11:08 PM

Blogger Brian said...

BTW, it isn't clear to me that making it lock-free is a performance win. The real places we suffered (under Linux), as the amount of work per task became small, were (1) the memory allocator, since you generate a whole bunch of task records, and (2) cache line contention on the work queue as tasks become ready (in practice one thread makes most tasks ready in our system).

We bought into lock-free reference counting for freeing objects to avoid the deleting issue you describe. And we don't actually free these types of objects, but simply add them to a queue to be recycled (see system allocator performance issues)...

December 4, 2011 at 11:31 PM

Blogger cbloom said...

On point (2) -

multiplexing the queue is the standard solution to reduce contention in the high-load case. In the low-load scenario (queues nearly empty) you have some amount of contention that you cannot avoid, but in the high-load scenario (# items >> # workers) you can make contention as small as possible (one cache line handoff is inevitable).
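One way to picture the multiplexing (a toy sketch, with mutexes where a real system would use lock-free queues; all names are made up): the producer spreads pushes round-robin across several shards, and each worker drains its "home" shard before probing the others, so under high load most operations touch distinct cache lines.

```cpp
#include <atomic>
#include <deque>
#include <mutex>
#include <optional>
#include <vector>

// Toy sketch of a multiplexed work queue. Mutexes keep the illustration
// short; the point is the sharding, not the locking discipline.
struct MultiQueue {
    struct Shard { std::mutex m; std::deque<int> q; };
    std::vector<Shard> shards;
    std::atomic<unsigned> push_cursor{0};

    explicit MultiQueue(size_t n) : shards(n) {}

    // Producer: round-robin pushes across shards to spread contention.
    void push(int item) {
        Shard& s = shards[push_cursor.fetch_add(1) % shards.size()];
        std::lock_guard<std::mutex> g(s.m);
        s.q.push_back(item);
    }

    // Worker: try the home shard first, then probe the others in order.
    std::optional<int> pop(size_t home) {
        for (size_t i = 0; i < shards.size(); ++i) {
            Shard& s = shards[(home + i) % shards.size()];
            std::lock_guard<std::mutex> g(s.m);
            if (!s.q.empty()) {
                int v = s.q.front();
                s.q.pop_front();
                return v;
            }
        }
        return std::nullopt;
    }
};
```

Giving each worker a distinct `home` shard is what keeps the common case off shared cache lines; the probing fallback is what preserves the low-load behavior (work is still found even when most shards are empty).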

"We bought into lock-free reference counting for freeing objects"

I haven't wanted to impose things like that, I want the system to have a minimum of requirements to use correctly, but I may regret this later. I also considered building a RW lock into my handle system but decided against it for the moment.

December 5, 2011 at 11:35 AM

Blogger Brian said...

The problem we have is that we don't control the structure of the tasks (we're compiling arbitrary code to use our library). Every core actually has its own work queue. But if one core is producing most of the work at a high enough rate (and a bunch of worker cores are completing the tasks fast enough), the cache line contention from the work stealers can be problematic... This was the problem we hit when we had a single core spinning off a new task every 700 clock cycles on an 8-core Nehalem, or every 1600 clock cycles on a 24-core Opteron. We've thought about trying to batch items to amortize the cache line miss on the insert operation, but ensuring timely delivery of tasks would become a pain...

December 5, 2011 at 2:53 PM

Blogger bouliiii said...

Hey :-)
This is exactly what I implemented with yaTS. You have a "start" dependency that prevents a task from running and an "end" dependency that delays the task's end.
Ben

December 5, 2011 at 10:28 PM

Blogger Jesse James Lactin said...

Sounds really good. Have you considered implementing this with fibers instead of threads? The first problem I see is that you pretty much have to roll your own synchronization primitives. To my knowledge, under Windows, stock mutexes and semaphores will put the whole thread to sleep, which isn't what you want.

July 13, 2017 at 2:38 PM
