
Post a Comment On: cbloom rants

"03-11-11 - Worklets , IO , and Coroutines"

6 Comments -

Blogger castano said...

I had a brief experience using Windows fibers to work around limitations of the CUDA asynchronous API.

Each CUDA stream defines a sequence of asynchronous operations that need to be completed in order, so that in theory multiple streams could run in parallel. Ideally you would like to express the operations of each stream sequentially:

// stream 0:
cudaMemcpyAsync(..., 0)
cudaConfigureCall(..., 0), cudaLaunch(...)
cudaMemcpyAsync(..., 0)

// stream 1:
cudaMemcpyAsync(..., 1)
cudaConfigureCall(..., 1), cudaLaunch(...)
cudaMemcpyAsync(..., 1)

but in practice, the streams are only executed in parallel if the CUDA calls are interleaved, something like:

cudaMemcpyAsync(..., 0)
cudaConfigureCall(..., 0), cudaLaunch(...)
cudaMemcpyAsync(..., 1)
cudaConfigureCall(..., 1), cudaLaunch(...)
cudaMemcpyAsync(..., 0)
cudaMemcpyAsync(..., 1)

So you have to schedule your code manually, which is fragile, since the durations of the kernels and the copies are not necessarily known.

Using fibers I could define each stream sequentially in the run method of a fiber. Each asynchronous call would yield to a custom scheduler that decided which fiber to run next.

It was trivial to get working. The only problem was that it was confusing to debug: you would be stepping through the code in one stream, and as you stepped over an asynchronous call it would jump to a completely different fiber. Other than that it seemed fairly robust.

This was just a toy example, so maybe in practice there are other problems.

March 12, 2011 at 12:19 AM
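
The scheduling idea in castano's comment can be sketched without any GPU or fiber API. The following is only an illustration under stated assumptions: `Op`, `StreamProgram`, `make_stream` and `run_interleaved` are hypothetical names, the "asynchronous calls" just append to a log, and a plain round-robin loop stands in for the custom scheduler. Real fibers would let each stream be an ordinary function instead of a list of steps.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for one asynchronous call on a stream
// (a copy or a kernel launch). It only records the issue order.
using Op = std::function<void(std::vector<std::string>& log)>;

// One "stream", written sequentially like the per-stream listing above.
struct StreamProgram {
    std::vector<Op> ops;
    std::size_t next = 0;  // index of the next op to issue
    bool done() const { return next >= ops.size(); }
};

// Build the copy / launch / copy sequence for a given stream id.
StreamProgram make_stream(int id) {
    const std::string s = std::to_string(id);
    StreamProgram p;
    p.ops.push_back([s](std::vector<std::string>& log) { log.push_back("copy_in:" + s); });
    p.ops.push_back([s](std::vector<std::string>& log) { log.push_back("launch:" + s); });
    p.ops.push_back([s](std::vector<std::string>& log) { log.push_back("copy_out:" + s); });
    return p;
}

// Round-robin scheduler: issue one op from each unfinished stream per pass.
// A real scheduler could instead choose based on which stream is idle.
std::vector<std::string> run_interleaved(std::vector<StreamProgram>& streams) {
    std::vector<std::string> log;
    bool progressed = true;
    while (progressed) {
        progressed = false;
        for (StreamProgram& p : streams) {
            if (!p.done()) {
                p.ops[p.next++](log);  // "yield" back to the scheduler after each call
                progressed = true;
            }
        }
    }
    return log;
}
```

Each stream is still described in order, and the interleaving is decided in one place rather than by hand-ordering the calls. The round-robin order produced here is just one possible schedule, not the specific one shown in the comment.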

Blogger jfb said...

I wonder if you could use macros, plus the C-style rule that variables must be declared at the beginning of a block, to keep a counter of the local-variable stack space needed. A declaration-end macro would then malloc that space and fix up all the pointers to be relative to the malloc location. That would effectively be like member variables, but mostly automated.

March 12, 2011 at 10:32 AM

Blogger cbloom said...

@jfb -

There's no need to fix up pointers; you can just memcpy in and out of your stack, since it will only work for basic C types anyway.

I'll do a followup post with code.

I think the member variable method is pretty much 100% preferable though.

March 12, 2011 at 11:24 AM

Blogger jfb said...

Yeah, copying works too. That's how Mono Continuations does it (disappointingly, it isn't very reliable with their garbage collector or soft debugger -- a great feature to 'embrace and extend' MS.NET, but they really need to make it _work reliably_...).

I was thinking you could avoid copying by having the macro resolve to (essentially)

int& a = *(int*)(state->data + 0);
int& b = etc...

with each macro adding the sizeof() of the variable being declared to some local int 'offset', in order to generate these lines.

March 12, 2011 at 4:16 PM
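
One way jfb's macro idea might look is sketched below. Everything here is an assumption made for illustration: `State`, `CO_LOCAL`, `co_offset` and `step_sum` are hypothetical names, the state block is sized by hand, and the sketch only handles trivially copyable types whose sizes keep the offsets suitably aligned (two `int`s here).

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <vector>

// Hypothetical per-coroutine state block that holds the "locals".
struct State {
    std::vector<unsigned char> data;
    int resume_point = 0;
    explicit State(std::size_t bytes) : data(bytes, 0) {}
};

// Hypothetical macro: binds NAME as a reference into the state block at the
// running offset, then advances the offset by the size of the type.
#define CO_LOCAL(TYPE, NAME) \
    static_assert(std::is_trivially_copyable<TYPE>::value, "plain data only"); \
    TYPE& NAME = *reinterpret_cast<TYPE*>(state.data.data() + co_offset); \
    co_offset += sizeof(TYPE)

// Sums 1..3 one step per call. 'i' and 'sum' look like locals but live in
// 'state', so they keep their values between calls without any copying.
// Returns true while there is more work to do; out_sum is the sum so far.
bool step_sum(State& state, int& out_sum) {
    std::size_t co_offset = 0;
    CO_LOCAL(int, i);
    CO_LOCAL(int, sum);

    if (state.resume_point == 0) {  // first entry: initialise the "locals"
        i = 1;
        sum = 0;
        state.resume_point = 1;
    }
    if (i <= 3) {
        sum += i;
        ++i;
        out_sum = sum;
        return true;  // yield after one step
    }
    out_sum = sum;
    return false;     // finished
}
```

The references are rebuilt on every entry, so nothing has to be copied in or out, but the restriction to plain data and the hand-managed layout are the kinds of costs the replies below object to.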

Blogger cbloom said...

Yeah, but the member variable method is just 100% superior. It lets you use classes, doesn't abuse the language, and is much less bug-prone.

March 12, 2011 at 6:56 PM
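
For comparison, here is a minimal sketch of what the "member variable method" could look like. It is an assumption about the general shape, not cbloom's code: `SumTask` is a hypothetical class, and the switch-based resume is a well-known C/C++ trick chosen here only to keep the example short.

```cpp
#include <cassert>

// Hypothetical coroutine-like object: anything that must survive a yield
// is a member variable instead of a stack local.
class SumTask {
public:
    explicit SumTask(int n) : n_(n) {}

    // Runs until the next yield point.
    // Returns true if it yielded, false once the task has finished.
    bool resume() {
        switch (state_) {
        case 0:
            for (i_ = 1; i_ <= n_; ++i_) {
                sum_ += i_;
                state_ = 1;
                return true;  // yield
        case 1:;              // resume lands here, inside the loop body
            }
            state_ = 2;
            return false;
        default:
            return false;     // already finished
        }
    }

    int sum() const { return sum_; }

private:
    int state_ = 0;  // where to continue on the next resume()
    int n_;
    int i_ = 0;
    int sum_ = 0;
};
```

Because the state is ordinary members, it can hold class types with constructors and destructors, and no macros or offset arithmetic are involved.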

Blogger Jesse James Lactin said...

Coroutines/Fibers are what you want in this situation. Stack switching is the only 100% robust-by-default means of getting this done, um, elegantly (hate how programmers use that word). By elegantly, I mean it works with pointers pointing to variables on the stack without any fiddling. Stack copying just isn't robust, with or without pointer fixups. Macros for doing offset calculations are a code smell IMO.

June 13, 2018 at 7:32 PM
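
The point about pointers to stack variables can be illustrated with a small stack-switching sketch. This one assumes a POSIX system that provides `<ucontext.h>` (for example Linux with glibc); on Windows the fiber API mentioned in the first comment would play the same role. `fiber_body`, `run_fiber_demo` and the globals are hypothetical names.

```cpp
#include <cassert>
#include <ucontext.h>
#include <vector>

namespace {
ucontext_t g_main_ctx;   // context of the caller
ucontext_t g_fiber_ctx;  // context of the fiber
int g_result = 0;

void fiber_body() {
    int local = 1;
    int* p = &local;                         // pointer into the fiber's own stack
    swapcontext(&g_fiber_ctx, &g_main_ctx);  // yield back to the caller
    *p += 41;                                // pointer is still valid after resuming
    g_result = local;
}
}  // namespace

int run_fiber_demo() {
    std::vector<char> stack(64 * 1024);      // the fiber's private stack

    getcontext(&g_fiber_ctx);
    g_fiber_ctx.uc_stack.ss_sp = stack.data();
    g_fiber_ctx.uc_stack.ss_size = stack.size();
    g_fiber_ctx.uc_link = &g_main_ctx;       // where to go when fiber_body returns
    makecontext(&g_fiber_ctx, fiber_body, 0);

    g_result = 0;
    swapcontext(&g_main_ctx, &g_fiber_ctx);  // run the fiber until its first yield
    swapcontext(&g_main_ctx, &g_fiber_ctx);  // resume it; it finishes and returns here
    return g_result;
}
```

Since the fiber's stack is never moved or copied, `p` needs no fix-up across the yield, which is the robustness argument made in the comment above.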
