1. neutrino

    neutrino low mass particle Uber Employee

    Messages:
    3,123
    Likes Received:
    2,687
    You are correct. This is only one of the many reasons though.

    Bottom line, this is a sophisticated subject that most people simply aren't qualified to comment on. If someone has shipped a game using these techniques, tells me that it's awesome, and can explain the techniques they used to ship, I'm all ears. Otherwise you are just talking out of your ***.

    Of course even if it was a good idea in the general case it still doesn't make sense for our game for a number of reasons I've already enumerated earlier.
  2. bobucles

    bobucles Post Master General

    Messages:
    3,388
    Likes Received:
    558
    A good CPU is like a sports car. It handles well, corners well, and goes fast. It can even do some grocery shopping at the end of the day.

    CUDA is like a rocket car. It goes super sanic fast, but look out! Anything that isn't a straight line on a desert plain is going to crash and burn.
  3. neutrino

    neutrino low mass particle Uber Employee

    Messages:
    3,123
    Likes Received:
    2,687
    It's like that episode of MythBusters with the rocket that crushes the car.
  4. Pawz

    Pawz Active Member

    Messages:
    951
    Likes Received:
    161
    You made me look it up.

    So CUDA is like.. First stage.. SECOND STAGE OMG SO FAST..... KAAAABLOOOOM. Little bits of your game everywhere.

    Got it. :D
  5. dyslexington

    dyslexington New Member

    Messages:
    5
    Likes Received:
    0
    While it is obviously true that the order in which units are processed makes the outcome nondeterministic, is it not also true that the processing order is irrelevant when the two units are too far apart from each other to be in communication?

    Take for example a simulated planet that is divided into 3 zones:
    1) 30 degrees latitude north to 90 north
    2) 30 north to 30 south
    3) 30 south to 90 south

    Assuming that no units are affected by entities that are farther apart than the vertical extent of zone 2 for the duration of a simulation step, it would be possible to simulate zones 1 and 3 concurrently and then zone 2 afterwards, and remain deterministic.
    I understand that it is irrelevant to PA, I am just curious.
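    In rough C++, the two-phase update described above might look like the following minimal sketch (the Unit/Zone types and the per-zone simulate routine are illustrative placeholders, not anything from PA's engine):

        #include <functional>
        #include <thread>
        #include <vector>

        struct Unit { double lat, lon; };
        struct Zone { std::vector<Unit> units; };

        // Sequential update of one zone; placeholder for the real per-unit logic.
        void simulate(Zone&) { /* ... */ }

        // Zones 1 and 3 cannot interact within one tick, so they may run
        // concurrently; zone 2 runs afterwards and sees their finished results.
        void tick(Zone& north, Zone& equator, Zone& south) {
            std::thread a(simulate, std::ref(north));
            std::thread b(simulate, std::ref(south));
            a.join();
            b.join();          // phase 1: the two polar zones in parallel
            simulate(equator); // phase 2: the belt between them
        }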
  6. antillie

    antillie Member

    Messages:
    813
    Likes Received:
    7
    Real servers don't have video cards. They have these. ;)
  7. b0073d

    b0073d New Member

    Messages:
    28
    Likes Received:
    0
  8. exterminans

    exterminans Post Master General

    Messages:
    1,881
    Likes Received:
    986
    The solution to this problem is actually a different one. Instead of trying to find a way to respect the dependency graph (which may not be loop-free, so forget about that, it might not be possible at all), you can simply keep a copy of the previous state and do all calculations against that.

    This comes with several new problems of its own, among them that you suddenly need TWICE the memory, which is a bad thing again, but it makes the problem scalable.
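    A minimal sketch of that double-buffered scheme, with a made-up UnitState and a toy integration step standing in for the real per-unit logic:

        #include <cstddef>
        #include <utility>
        #include <vector>

        struct UnitState { float x, y, vx, vy; };

        // Two full copies of the simulation state: every read comes from
        // prevState, every write goes to nextState, so all threads see the
        // same consistent frame no matter in which order units are processed.
        std::vector<UnitState> prevState, nextState;

        void step() {
            nextState.resize(prevState.size());
            // This loop is trivially parallel (thread pool, OpenMP, ...):
            // no unit writes to memory that another unit reads this frame.
            for (std::size_t i = 0; i < prevState.size(); ++i) {
                UnitState s = prevState[i];
                s.x += s.vx;   // toy integration step
                s.y += s.vy;
                nextState[i] = s;
            }
            std::swap(prevState, nextState); // the price: twice the memory
        }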

    That's why you drop the additional copy and instead place semaphores on objects which are currently being modified, so even though you have no consistent reads across the full simulation frame, reads are still consistent while a single entity is processed. This WILL break the simulation for really fast-moving entities (units which move more than half their diameter per frame, like fast projectiles), but it still works quite fine for everything below that threshold.
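    Such per-entity locking could be sketched like this, assuming one mutex per entity and that two distinct entities are locked together (illustrative names, not PA code):

        #include <mutex>

        struct Entity {
            std::mutex lock;   // held while this entity is read or modified
            float x = 0, y = 0;
        };

        // Worker threads may process entities in any order. Locking both
        // entities at once (deadlock-free via std::scoped_lock; assumes
        // self and target are distinct) guarantees that no entity is ever
        // observed half-updated, even though the frame as a whole is not
        // a consistent snapshot.
        void process(Entity& self, Entity& target) {
            std::scoped_lock guard(self.lock, target.lock);
            // ... update self using target's current fields ...
        }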

    This doesn't address the original problem, however: you still can't do that on the GPU. The data is too inhomogeneous, the algorithms are too complex, and you would need to transport all the data back and forth between CPU and GPU.

    Things like CUDA-accelerated PhysX (I'm talking about extreme situations with several thousand particles!) and the like only work because internally the data is very uniform and most of it is held persistently in video memory; there is no need to ever transfer anything back to main memory or to process it on the CPU. It's also actually a rather simple calculation (in terms of lines of code), and the result can be written straight to another region of video RAM where it is then used for actual rendering.
    And yet a ridiculous amount of effort was put into developing PhysX, while a feature-wise identical CPU-only version could be developed in a fraction of the time, since you don't need to bother with GPU-optimized data structures, data transfer, or all the limitations that the CUDA / OpenCL languages put upon you (like the inability to do recursion with variable depth!).
  9. antillie

    antillie Member

    Messages:
    813
    Likes Received:
    7
  10. cwarner7264

    cwarner7264 Moderator Alumni

    Messages:
    4,460
    Likes Received:
    5,390
    I love Uber. So delightfully frank.
  11. dyslexington

    dyslexington New Member

    Messages:
    5
    Likes Received:
    0
    I don't think keeping all state twice is a practical solution.

    There can certainly be loops in the dependencies, but with a single-threaded design these loops are resolved by simply choosing an arbitrary one of the dependent units to be simulated before the others. As long as each client in a synchronous simulation chooses the same order, the outcome is the same.

    I'd rather say that it easily breaks determinism, regardless of the speed of the entity, because you do not enforce a specific order in which to process dependent units.
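    The shared ordering rule can be as simple as every client sorting by the same stable unit id before each tick, as in this hypothetical sketch:

        #include <algorithm>
        #include <vector>

        struct Unit { unsigned id; };

        // Every client sorts by the same stable key, so dependency loops
        // are broken identically everywhere and the simulation stays
        // deterministic across machines.
        void simulateTick(std::vector<Unit*>& units) {
            std::sort(units.begin(), units.end(),
                      [](const Unit* a, const Unit* b) { return a->id < b->id; });
            for (Unit* u : units) {
                // ... process *u sequentially ...
            }
        }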
  12. exterminans

    exterminans Post Master General

    Messages:
    1,881
    Likes Received:
    986
    Forget about determinism; that's not the topic of this thread. Synchronized game states (which need to be deterministic) vs. client/server (which doesn't) was a different thread. This thread was all about parallelism and GPGPU computing.

    Oh, and keeping the full state of the simulation twice actually IS a practical solution if you need real determinism, especially when you can't traverse the dependency graph in the correct order without significant overhead. It's also the only way to achieve determinism across multiple runs of the simulation when you don't have full control over the order in which entities are stored and processed. That saves you a lot of hassle, since it also removes the restriction of using only stable sorting algorithms and the like. That limitation is otherwise always in place, even when running single-threaded.

    Last but not least: you don't need determinism anyway. What you need are reasonable results, not precise ones.
    A single-threaded, fully deterministic, mathematically correct general-purpose solution surely gives you a precise and reasonable result, but you can also get an imprecise yet still reasonable result from a non-deterministic, parallelized algorithm that calculates an approximation using simplified rules and additional restrictions which are fulfilled by the given input.
  13. antillie

    antillie Member

    Messages:
    813
    Likes Received:
    7
    It's kind of amusing to watch people debate CUDA when real servers don't have video cards, and home PCs running the PA client will be unable to use their video cards for CUDA because they will be busy rendering PA's graphics instead.

    I mean, how many gaming PCs out there have Tesla cards in them? Really. Looking at the main uses for GPGPU, I don't see anything related to games. Unless we need to start doing hyper-accurate weather or physics simulations, I just don't see what CUDA can really do for PA. I guarantee that systems built with Tesla cards in mind were not built to be game servers.

    Besides, CUDA is an Nvidia-only thing. What about all the people who have AMD cards?
  14. ours99

    ours99 New Member

    Messages:
    7
    Likes Received:
    0
    I find it funny that people expect their GPUs to do more work when most games today leave multi-core CPUs snoring.

    Sure, it's about the server part, but still.
  15. dyslexington

    dyslexington New Member

    Messages:
    5
    Likes Received:
    0
    Well, that is what I am talking about. If you have a dependency graph consisting of 3 subgraphs and there are no dependencies between subgraph 1 and subgraph 3, then you can process both in parallel. In this first step, the dependencies of both 1 and 3 on 2 are satisfied from the old state of subgraph 2, and only when 1 and 3 are finished is subgraph 2 processed, with its dependencies satisfied from the new state of subgraphs 1 and 3.
    Since the operations on each subgraph are sequential, you get precise results within each subgraph, and since the order in which the subgraphs are processed is defined, you will always depend on either old state or new state, never a mix.
    Still, you can process in parallel those subgraphs that have no dependencies between them.
    In an RTS simulation there should be a lot of entities that are independent, because what happens on one side of a planet cannot affect something happening on the opposite side in a timespan as short as a simulation tick.
  16. exterminans

    exterminans Post Master General

    Messages:
    1,881
    Likes Received:
    986
    Pairwise unrelated. But that does not mean that there aren't any indirect dependencies. You can't tell that easily which units are related and which are not.
  17. dyslexington

    dyslexington New Member

    Messages:
    5
    Likes Received:
    0
    No entity in subgraph 1 has a direct dependency on any entity in subgraph 3, and vice versa (i.e., no tank at the north pole will attempt to shoot at a tank at the south pole and thus probe its position).
    There are also no indirect dependencies that take effect within a single simulation tick. If a tank at the north pole shoots at a tank at the equator, and at the very same time a tank at the south pole shoots at the very same tank, then both projectiles would have to travel through the equatorial region for at least one tick. The thread that processes that region would then roll the dice and decide which projectile hits, independent of whether the north-pole tank was simulated before the south-pole tank or the other way around, because the simulation of the equatorial region always waits for both north and south to finish before adding their projectiles to its simulation state.
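    That order-independence can be made concrete by buffering incoming projectiles and merging them by a stable key once both neighbouring zones have finished, as in this hypothetical sketch (the types and keys are made up):

        #include <algorithm>
        #include <vector>

        struct Projectile { unsigned id; /* trajectory, owner, ... */ };

        // Projectiles crossing into the equatorial zone are queued here by
        // the polar threads (synchronization omitted for brevity).
        std::vector<Projectile> incoming;

        // Called only after BOTH neighbours have finished their phase.
        // Sorting by a stable id makes the result identical regardless of
        // which pole was simulated first.
        void integrateHandoffs(std::vector<Projectile>& zoneState) {
            std::sort(incoming.begin(), incoming.end(),
                      [](const Projectile& a, const Projectile& b) { return a.id < b.id; });
            zoneState.insert(zoneState.end(), incoming.begin(), incoming.end());
            incoming.clear();
        }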
  18. exterminans

    exterminans Post Master General

    Messages:
    1,881
    Likes Received:
    986
    And what do you think your graph will look like when units form a belt around the planet and each unit has at least two other units in each direction that it needs to consider?

    There are no subgraphs left which could be handled independently. Generating the graph and finding subgraphs is also quite costly; that's nothing you would do on the fly. Actually, you use such graphs ONLY while conceptualizing the software; you would never try to work with them in the final implementation.
  19. dyslexington

    dyslexington New Member

    Messages:
    5
    Likes Received:
    0
    If they form a belt and thus are all at the same latitude, then they would all be simulated by a single thread. No performance gained, but still deterministic.
    But this configuration would be improbable, especially in the cases where you need parallelism most: on big planets.

    You can just divide the planet into equal-sized horizontal slices. With 3 slices, the simulation time would be the time used in the most populated slice plus the time used in the second most populated slice, as opposed to the sum of the time used in all slices; with an equal distribution of units, roughly 66%. The bigger the planet, the more slices you can fit in. So if you have 5 horizontal slices, and slices 1, 3 and 5 are mutually independent, as are slices 2 and 4, then the time with an equal distribution would be 2 slice-times instead of 5, or 40%.
    I.e., threads A, B and C first simulate slices 1, 3 and 5 in parallel, then threads D and E simulate slices 2 and 4 in parallel.
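    Generalized to any number of slices, that odd/even two-phase schedule might look like the following C++ sketch (assuming, as above, that units never interact across more than one slice boundary per tick; the Slice type is a placeholder):

        #include <cstddef>
        #include <functional>
        #include <thread>
        #include <vector>

        struct Slice { /* units within one latitude band */ };

        // Sequential update of one slice; placeholder for the real logic.
        void simulate(Slice&) {}

        // Phase 1 runs slices 1,3,5,... in parallel, phase 2 runs 2,4,...;
        // slices within a phase share no border, so they cannot interact.
        // With 5 equally loaded slices this takes 2 slice-times instead of 5.
        void tick(std::vector<Slice>& slices) {
            for (std::size_t parity = 0; parity < 2; ++parity) {
                std::vector<std::thread> pool;
                for (std::size_t i = parity; i < slices.size(); i += 2)
                    pool.emplace_back(simulate, std::ref(slices[i]));
                for (std::thread& t : pool) t.join();
            }
        }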
