1. neutrino

    neutrino low mass particle Uber Employee

    Messages:
    3,123
    Likes Received:
    2,687
    Honestly I bigtime dispute that this is "easy". It's only easy in the theoretical sense of doing pathfinding. Doing the actual pathfinding for an actual game is a different ball of wax. It would be a research project that would have to start with some of the known approaches, but I'm positive it wouldn't be a good use of time. Honestly I get why people are excited about moving stuff to the GPU, but most people who actually make games are going to be much more skeptical about using this stuff in real-world cases. I really can't emphasize enough how fantasy-land a lot of this thinking is when it comes to actually shipping a game.
  2. cola_colin

    cola_colin Moderator Alumni

    Messages:
    12,074
    Likes Received:
    16,221
    Now I am questioning my own understanding of multithreading and determinism. I know that it can be quite hard to write a bulletproof multithreaded program, but doing away with determinism pretty much sounds like the whole game will be totally random. Which cannot really be the case. If you put 2 armies with 50 units on a planet and let them storm towards each other, they should always act exactly the same or the game gets pretty weird, doesn't it?

    How exactly are you even splitting up the sim across multiple threads anyway?
    Putting each planet on its own thread is simple, but apart from that it gets pretty complicated. I'd really love to get some insight into this.
  3. neutrino

    neutrino low mass particle Uber Employee

    Messages:
    3,123
    Likes Received:
    2,687
    They should do exactly the same thing, but the outcome will always be slightly different. There is being the same, and then there is every single bit being the same. The results will always be similar, but the exact bit output won't always be the same.

    That's beyond the scope of a forum post I'm afraid. Planets are specifically isolated from each other but that's only the start of it. Mostly through a task system and being careful about global and mutable state.

    It is complicated. Basically each task that needs to be run gets put into a list which is then spread out to the worker threads across processors.
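
    A minimal sketch of that kind of task system (hypothetical names, not PA's actual code): the tasks for one tick get pushed into a queue, and one worker thread per processor pulls from it until the queue is drained.

    ```cpp
    #include <algorithm>
    #include <condition_variable>
    #include <functional>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <vector>

    // Hypothetical minimal task system: tasks are pushed into a shared queue
    // and one worker thread per processor pulls from it.
    class TaskQueue {
    public:
        void push(std::function<void()> task) {
            {
                std::lock_guard<std::mutex> lock(mutex_);
                tasks_.push(std::move(task));
            }
            cv_.notify_one();
        }

        // Workers call this; returns false once the queue is closed and drained.
        bool pop(std::function<void()>& task) {
            std::unique_lock<std::mutex> lock(mutex_);
            cv_.wait(lock, [&] { return closed_ || !tasks_.empty(); });
            if (tasks_.empty()) return false;
            task = std::move(tasks_.front());
            tasks_.pop();
            return true;
        }

        void close() {
            {
                std::lock_guard<std::mutex> lock(mutex_);
                closed_ = true;
            }
            cv_.notify_all();
        }

    private:
        std::queue<std::function<void()>> tasks_;
        std::mutex mutex_;
        std::condition_variable cv_;
        bool closed_ = false;
    };

    int main() {
        TaskQueue queue;
        std::vector<std::thread> workers;
        unsigned n = std::max(1u, std::thread::hardware_concurrency());
        for (unsigned i = 0; i < n; ++i) {
            workers.emplace_back([&] {
                std::function<void()> task;
                while (queue.pop(task)) task();  // run tasks until the queue closes
            });
        }

        for (int unit = 0; unit < 1000; ++unit)
            queue.push([unit] { /* update unit `unit` for this tick */ });

        queue.close();
        for (auto& w : workers) w.join();
    }
    ```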
  4. cola_colin

    cola_colin Moderator Alumni

    Messages:
    12,074
    Likes Received:
    16,221
    I see. Makes sense.

    I guessed you would say something along those lines. Feel free to write a blog post about it some day in the future. I would read it with joy. ;)
  5. dmii

    dmii Member

    Messages:
    138
    Likes Received:
    1
    I once read a blog post about multi-threaded game engines which describes a method that, as I see it, should be able to stay deterministic. On a macroscopic level it basically behaves like a single-threaded game loop with the subsystems running sequentially (which is why I think determinism should be possible). The multiple threads simply all work through the tasks of the currently running subsystem via batch processing. Basically, if the game is deterministic when running on a single thread, it should still be deterministic when doing it this way.

    It sounds like you are already doing it this way, but it can't hurt to post the link. :)
    (because it explains this much better than I can and is also an interesting read for everyone interested in the subject):
    http://seanmiddleditch.com/journal/2011/04/multi-threaded-game-engines/
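
    As a toy illustration of the pattern I mean (completely made-up names, and only deterministic if every task writes just its own data): the subsystems run one after another at the macro level, and the work inside each subsystem is batched across threads, with an implicit barrier before the next subsystem starts.

    ```cpp
    #include <algorithm>
    #include <execution>
    #include <vector>

    struct Unit { float x = 0, vx = 1; float damage = 0; };

    int main() {
        std::vector<Unit> units(10000);

        // One sim tick: subsystems run sequentially at the macro level, but
        // each one fans its per-unit work out across all cores. std::for_each
        // returns only when every element is done, which acts as the barrier
        // between subsystems.
        std::for_each(std::execution::par, units.begin(), units.end(),
                      [](Unit& u) { u.x += u.vx; });          // movement subsystem

        std::for_each(std::execution::par, units.begin(), units.end(),
                      [](Unit& u) { u.damage *= 0.99f; });    // damage-decay subsystem

        // The result doesn't depend on which thread ran which unit, because
        // every task touches only its own Unit -- that property would have to
        // hold everywhere for this style to stay deterministic.
    }
    ```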
  6. neutrino

    neutrino low mass particle Uber Employee

    Messages:
    3,123
    Likes Received:
    2,687
    Yes, jobs and batches. Basically you fire up one thread per processor and then feed it from a queue.

    There is nothing in there about determinism. Determinism isn't something that just happens or is the natural order of things; you have to work incredibly hard to make things consistently deterministic, especially in the case where you are trying to keep multiple clients in sync. It never happens by accident, and doing it using a job queue system would be really, really hard. Like so hard that I wouldn't want to tackle it.

    Just getting determinism running consistently on a single-threaded sim is a ridiculous amount of ongoing, never-ending engineering effort. Every single thing you do has to be looked at through that lens if you want it to work. Adding multithreading to that is not very practical.

    Chucking determinism and sync is going to allow us to do a lot of interesting things with less effort.
  7. dmii

    dmii Member

    Messages:
    138
    Likes Received:
    1
    Somehow I keep missing simple and obvious cases of things not being that easy, such as single-threaded determinism also being hard.
  8. exterminans

    exterminans Post Master General

    Messages:
    1,881
    Likes Received:
    986
    Well, it is possible (and I have done it a few times on "simple" algorithms), but in general creating deterministic behavior across multiple nodes (it doesn't matter whether we are talking about distributed computing, replication or just local multithreading) is about 10-20 times harder than going for a non-deterministic approximation.

    You have to handle redundant data, add semaphores, and assign workload to the correct nodes in the right order, which is bad enough if you only want to make it run AT ALL without running into deadlocks because you misplaced a semaphore or the like.

    How does this work?
    You start with a sequential model of the iterative algorithm (multiple levels of iterations preferred!), take a good guess at where to cut the sequence into dynamically sized shares (best is to try to parallelize both the inner and outer iterations), and then you start modeling EVERY dependency between subsequent shares with semaphores; the more detailed your model is, the better. The algorithm is now partly parallelized and yet still deterministic.
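
    A toy example of that dependency-modelling step, assuming C++20 semaphores (not my actual workshare code): workshare B depends on data produced by workshare A on another thread, so a semaphore enforces exactly the sequential ordering for that one dependency while everything else stays free to overlap.

    ```cpp
    #include <semaphore>
    #include <thread>
    #include <vector>

    std::vector<double> data(1000, 1.0);
    std::binary_semaphore share_a_done{0};  // models the single dependency A -> B

    void thread1() {
        // Workshare A: writes the first half of the data.
        for (int i = 0; i < 500; ++i) data[i] *= 2.0;
        share_a_done.release();             // B may now read what A produced
        // Further, independent work on this thread can keep running in parallel with B.
    }

    void thread2() {
        // Workshare B: depends on A's output, so it must wait for the semaphore.
        share_a_done.acquire();
        for (int i = 500; i < 1000; ++i) data[i] = data[i - 500] + 1.0;
    }

    int main() {
        std::thread t1(thread1), t2(thread2);
        t1.join();
        t2.join();
        // Because the only cross-thread dependency is enforced by the semaphore,
        // the result is bit-identical to running A then B sequentially.
    }
    ```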

    The algorithm is most likely far from optimal at this point, however, because you have too many semaphores, which prevent parallel execution. Now you need to assign the workshares to distinct threads, with the goal that subsequent workshares on the same thread share as much data as possible, so you can work on thread-local memory. By removing data from the global state and moving it into thread-local storage, you can replace some of the semaphores (the ones which prevented you from modifying data because a different thread would still need to read it; if both threads have their own copy, they don't need to wait for each other) with messages which now manage the exchange of data between the thread-local storages. Some of the dependencies are now only valid on thread-local data and therefore don't need to be modeled by semaphores any more. There is one drawback with thread-local storage, though: you need to distribute data to the threads before they can start with the actual computation, and you need to collect the data from the individual threads after they have finished.

    The final step is to optimize the synchronization points (semaphores and messages) so you gain maximum flexibility between the threads. You must keep in mind that synchronization is never real-time; there is latency even when the threads run on cores of the same CPU. You therefore must reorder your code in such a way that no semaphore and no message is required before it is freed / sent. If you fail to do so, threads will often wait for other threads, degrading the speedup. A good way to visualize this is to model your algorithm with graphviz or the like, whereby every workshare is represented by a group of nodes (every node is a synchronization point within the workshare) and every semaphore, message or dependency on local data by a transition, and have the graph laid out from top to bottom. Graphviz will try to make the graph as broad as possible; if graphviz manages to arrange two nodes horizontally, that means those workshares run in parallel. If you have only vertical lines in the graph, everything is fine. The height of a line shows you how much time you have for synchronization before the data / resource is actually required. If you have purely horizontal lines, that is bad, because it means a resource is required the moment, or even before, it is available, which means threads will block each other.


    The attachment shows such an analysis I made of a parallel, deterministic implementation of the Gauß-Seidel algorithm (an algorithm for approximately solving the large systems of equations that arise from differential equations), built on the MPI library. Shares working on the same local data are connected with red lines; other semaphores and messages are represented by black lines. The algorithm has two nested loops; I only looked at 3 iterations of the outer loop and broke the inner loop into 3 shares for the analysis, but the implementation also ran on a 64-core cluster over the network with a perfect 6300% speedup, because (as you can see) every communication had enough flexibility, given that the individual shares were big enough.
    You can also observe yet another phenomenon in that example: you often have some type of startup phase when trying to parallelize an algorithm; in this case it takes n iterations of the outer loop until all n threads can start with their work. Therefore this implementation works best if many iterations occur, as the impact of the startup phase becomes negligible.

    As you can see, it's a hell of a lot of work to try to parallelize an algorithm in a way that is both deterministic and efficient, and while it is perfectly possible, it requires 10-20 times more effort than writing the actual algorithm and should only be considered for critical tasks or if you plan to go MASSIVE (>16 cores) and with distributed memory (that means you no longer have shared memory because of hardware limitations).

    For comparison, I also implemented the Gauß-Seidel algorithm with OpenMP (shared memory only, no messaging), ignored almost all of the semaphores I had implemented in the MPI version, and let it run until it reached the same precision as the "clean", deterministic version. Well, both versions took the same number of iterations until they terminated, so even though the result wasn't the same, it didn't affect the complexity of the algorithm, and they both consumed the same amount of time.
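
    For a feel of what the quick OpenMP version looks like, here is a sketch of a single relaxation sweep (not my actual code, just the general shape): the pragma splits the rows across threads, and whether a neighbouring row has already been updated this sweep depends on scheduling, which is exactly where the non-determinism comes from even though convergence is unaffected.

    ```cpp
    #include <vector>

    // One relaxation sweep over an n x n grid stored row-major in `u`.
    // Compile with -fopenmp. Each thread updates a block of rows; whether the
    // neighbouring row it reads holds the old or the already-updated value
    // depends on scheduling, so intermediate values differ from run to run.
    void sweep(std::vector<double>& u, const std::vector<double>& f, int n, double h2) {
        #pragma omp parallel for
        for (int i = 1; i < n - 1; ++i) {
            for (int j = 1; j < n - 1; ++j) {
                u[i * n + j] = 0.25 * (u[(i - 1) * n + j] + u[(i + 1) * n + j] +
                                       u[i * n + j - 1] + u[i * n + j + 1] +
                                       h2 * f[i * n + j]);
            }
        }
    }
    ```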

    Where they DID differ, though, was the amount of work I had to throw into each version:
    Writing the single-threaded algorithm took me about half an hour.
    Adding OpenMP annotations so it became non-deterministic but parallelized, with dynamic partitioning of workshares, took me less than 10 minutes.
    Performing the analysis of dependencies, distributing thread-local data, adding deadlock-free startup and shutdown logic, optimizing synchronization points, and adding MPI commands took me a total of over 6 hours with all the documentation required.

    And what did it get me in the end?
    It isn't faster than the non-deterministic version, it took ages to parallelize, and the resulting code is barely understandable for outsiders, given that it looks nothing like the original (simple!!!) algorithm.
    Well, I CAN run it on a supercomputer of any size now (it can calculate matrices where the variables alone consume several hundred gigabytes of data, so they won't even fit in the RAM of a single machine!), which isn't possible with OpenMP, but who is going to call Oak Ridge National Laboratory and tell them "We want to host a TA match for 10,000 players on your Titan"?

    Attached Files:

  9. cola_colin

    cola_colin Moderator Alumni

    Messages:
    12,074
    Likes Received:
    16,221
    Interesting write-up. I think I really understand the meaning of "non-deterministic" from this now:
    Gauß-Seidel never returns a perfect result; it is meant to return an approximation within certain bounds. So the difference between a deterministic and a non-deterministic version is that the deterministic version always gives the exact same result (which will still be just an approximation), while the non-deterministic version will still give a result within a certain accuracy, but it won't be exactly the same for every execution of the algorithm.

    PA's simulation will be the same: it is just an approximation with a certain accuracy. The results will differ within that accuracy, which makes creating a synchronous simulation on multiple clients impossible.

    Am I correct?
  10. neutrino

    neutrino low mass particle Uber Employee

    Messages:
    3,123
    Likes Received:
    2,687
    That's roughly correct. When you get into threading it's not just about accuracy, it's also about the order in which things complete. For example, imagine 2 units on different threads probing each other's positions. If one unit has updated because it was on another thread, then the unit probing its position in parallel may get a different result (imagine a range check to fire a weapon, for example). Butterfly effect from there.
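
    A stripped-down illustration of that kind of race (made-up toy code, not PA's): whether unit B ends up in range of unit A depends on whether A's movement task has already run on its thread when B's range check reads A's position.

    ```cpp
    #include <atomic>
    #include <cmath>
    #include <thread>

    // Toy example only: two "units" updated by different threads in the same tick.
    struct Unit { std::atomic<float> x{0.0f}; };

    int main() {
        Unit a, b;
        b.x = 30.5f;                  // weapon range is 30.0 in this example
        bool b_fires = false;

        // Thread 1: unit A's movement task runs and moves it closer to B.
        std::thread move_a([&] { a.x = 1.0f; });

        // Thread 2: unit B's weapon task probes A's position in the same tick.
        std::thread check_b([&] {
            float dist = std::fabs(b.x.load() - a.x.load());
            b_fires = (dist <= 30.0f);
        });

        move_a.join();
        check_b.join();
        // If A's move ran first, dist is 29.5 and B fires; if B's check ran
        // first, dist is 30.5 and B holds fire. Which one happens depends on
        // scheduling -- and everything downstream diverges from there.
    }
    ```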

    Also keep in mind that Gauß-Seidel is just a simple example of one algorithm that is already fairly parallel. Game code tends not to be very easy to make parallel and is many times more complex than just this one algorithm.
  11. calmesepai

    calmesepai Member

    Messages:
    180
    Likes Received:
    21
    Wow, interesting thread, and it is not about the original topic :lol:
    (presses the placebo +1 button )
  12. theodoreroosevelt

    theodoreroosevelt New Member

    Messages:
    1
    Likes Received:
    0
    If elements of the simulation can be formulated as a system of linear equations or linear PDEs, then it would be easy to use OpenCL/CUDA to take workload off the CPU. Surely much of the simulation boils down to some system of equations.

    There are implementations of tools like LAPACK and BLAS for graphics cards and CPUs - they all have the same interface, it just runs on different hardware for each.

    MAGMA (an implementation of LAPACK/BLAS for OpenCL AND CUDA, if I remember right)
    http://icl.cs.utk.edu/magma/

    CPU implementations for AMD and Intel CPUs are ACML and MKL respectively
    http://developer.amd.com/tools/cpu-deve ... rary-acml/
    http://software.intel.com/en-us/intel-mkl
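
    To illustrate the "same interface, different backend" point: a CBLAS call like the one below is written identically whether you link against MKL, OpenBLAS or any other CBLAS-conforming build; the GPU libraries keep the routine names but add their own device and handle setup, so they aren't quite drop-in (a generic sketch, not tied to any particular vendor).

    ```cpp
    #include <cblas.h>
    #include <vector>

    int main() {
        const int n = 512;
        std::vector<double> A(n * n, 1.0), B(n * n, 2.0), C(n * n, 0.0);

        // C = 1.0 * A * B + 0.0 * C -- the same call works against any
        // CBLAS-conforming library; only the link line changes.
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n,
                    1.0, A.data(), n,
                    B.data(), n,
                    0.0, C.data(), n);
    }
    ```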

    Libraries like these are used for scientific computing and these implementations are pretty heavily optimized I think. Surely game developers utilize them?
  13. asgo

    asgo Member

    Messages:
    457
    Likes Received:
    21
    My impression is that many people make a few assumptions that don't necessarily hold.
    From the associations/assumptions "modern games use fast GPUs" and "many modern HPC systems use fast GPUs", people arrive at the conclusion that a game server can likely be accelerated efficiently by GPUs (including the parallelization and optimization strategies used in HPC).

    Correct me if I'm wrong, but my guess would be that if you want to compare a game server to another real-world example, it would probably be more of a match for a database server than an HPC application in terms of requirements and offered services:
    - high interactivity
    - fast response times
    - computation complexity as a sum of a diverse set of algorithms
    - relatively low computation-per-data rate (at least compared to efficient CUDA applications)
  14. Raevn

    Raevn Moderator Alumni

    Messages:
    4,226
    Likes Received:
    4,324
    If CUDA was truly useful in gaming, it would be used in games. It's not, so that says it all, really.
  15. neutrino

    neutrino low mass particle Uber Employee

    Messages:
    3,123
    Likes Received:
    2,687
    No. Total mismatch.
  16. neutrino

    neutrino low mass particle Uber Employee

    Messages:
    3,123
    Likes Received:
    2,687
    To be fair some games use it. Generally just not for game logic. In almost all cases people use it either for fluff physics (i.e. physics that doesn't affect gameplay) or simply to accelerate rendering in some way.

    The fact that everyone thinks it's the greatest thing ever is major kudos to the nvidia marketing department.

    BTW, just so you guys know, we aren't exactly inexperienced when it comes to working with massively parallel systems. I've been involved with gaming since before GPUs were invented. When we first started Uber, one of our first projects was writing a raytracer on a massively parallel processor called Larrabee. See my blog www.mavorsrants.com for some stuff on that.

    Game programmers tend to be very pragmatic about stuff like this, and I actually think it's kind of funny that people are trying so hard to get us to do something that's just such a mismatch for what we need. For the 100th time, it doesn't make any sense whatsoever for our servers to use GPU acceleration. I've gone through the reasons ad nauseam.
  17. thepilot

    thepilot Well-Known Member

    Messages:
    744
    Likes Received:
    347
    In offline rendering, parallelism is really easy to achieve.
    Constructing the scene can be dispatched across multiple CPUs without much trouble, and the acceleration structure is more of a mystery to me, but the actual render is pretty simple to thread: the image is split into buckets, and each bucket is rendered by a separate thread, with a little overlap of X pixels so the AA can be done in the same thread afterward (the AA needs the X pixels next to the current one to be anti-aliased), so one thread doesn't need to wait for another before starting another bucket.
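
    Roughly like this in code, as I understand it (a toy sketch, not any particular renderer): buckets are handed out through a shared counter, each thread grabs the next one and shades it with an apron of extra pixels so the AA filter has its neighbours, and no thread ever waits on another.

    ```cpp
    #include <algorithm>
    #include <atomic>
    #include <thread>
    #include <utility>
    #include <vector>

    constexpr int kImageW = 1920, kImageH = 1080;
    constexpr int kBucket = 64;   // bucket size in pixels
    constexpr int kApron = 2;     // extra border so AA can be filtered per bucket

    void render_bucket(int x0, int y0, int x1, int y1) {
        // Shade pixels in [x0 - kApron, x1 + kApron) x [y0 - kApron, y1 + kApron),
        // clamped to the image, then filter AA using only that local region.
    }

    int main() {
        std::vector<std::pair<int, int>> buckets;
        for (int y = 0; y < kImageH; y += kBucket)
            for (int x = 0; x < kImageW; x += kBucket)
                buckets.emplace_back(x, y);

        std::atomic<size_t> next{0};
        std::vector<std::thread> threads;
        unsigned n = std::max(1u, std::thread::hardware_concurrency());
        for (unsigned i = 0; i < n; ++i) {
            threads.emplace_back([&] {
                // Each thread pulls the next unrendered bucket; no thread waits on another.
                for (size_t b; (b = next.fetch_add(1)) < buckets.size(); ) {
                    auto [x, y] = buckets[b];
                    render_bucket(x, y, std::min(x + kBucket, kImageW),
                                  std::min(y + kBucket, kImageH));
                }
            });
        }
        for (auto& t : threads) t.join();
    }
    ```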

    Is it the same logic for a real-time raytracer?

    I've seen and tested a lot of renderers using the GPU - not realtime, just using the GPU for computation - but I'm not really amazed. The GFX memory is way too small to fit a scene in it, and the results are not precise enough to be reliable.
    According to the people at Next Limit (Maxwell Render), GPUs have big approximations for floating-point numbers, and that's why they stick strictly to CPU computations.

    Rendering is my field of work, so I was really interested by your article, particularly when you say that raytracing is interesting for artist time.

    We currently have the same twist in offline rendering. The CPUs are getting bigger and bigger, while artist time is incompressible.
    Using rasterisation techniques is kind of obsolete, as setting up all the pre-passes, shadow map tweaks, ... takes a huge amount of artist time, and the industry is taking a really fast turn toward full raytracing techniques.

    It's interesting to note that you are reaching the same conclusion for realtime too.

    But isn't the current architecture of GPUs a showstopper for this kind of thing?
    Last edited: March 13, 2013
  18. bobucles

    bobucles Post Master General

    Messages:
    3,388
    Likes Received:
    558
    Indeed. CUDA has the biggest numbers ever, and I just know that big numbers mean I'll have a big game!
  19. neutrino

    neutrino low mass particle Uber Employee

    Messages:
    3,123
    Likes Received:
    2,687
    Pretty much.

    Modern GPUs are mostly IEEE compliant.

    I wasn't writing this on a GPU; it was on what they now call Xeon Phi.
  20. menchfrest

    menchfrest Active Member

    Messages:
    476
    Likes Received:
    55
    I think a part of the disconnect is also that people don't understand the difference between parallelizable and stupidly parallelizable.

    My attempt to help:
    If huge things can be made fully independent then yes, GPUs are great: no pixel on your screen cares what any other pixel is or does, so running them all at the same time over many cores is a win. But with something like a physics-based RTS game, units will interact with each other through flowfields, shooting each other, repairs, etc., and while these things can be parallelized, the level of complexity in the interactions reaches a point where more complex instructions, more RAM, and faster clock speeds will win out over more cores.
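
    In code, the contrast looks something like this (a made-up fragment, just to show the shape of the problem): the pixel function needs nothing from any other pixel, while the unit update has to read other units' state, which is where all the ordering and synchronization pain discussed above comes from.

    ```cpp
    #include <vector>

    struct Unit { float x, y; };

    // "Stupidly parallel": every pixel depends only on its own inputs,
    // so any number of threads (or GPU cores) can run these independently.
    float shade_pixel(int x, int y) { return float(x ^ y) / 255.0f; }

    // Not so simple: each unit reads *other* units' state, so the answer
    // depends on whether those units have already been updated this tick.
    void fire_if_in_range(Unit& me, const std::vector<Unit>& others) {
        for (const Unit& other : others) {
            if (&other == &me) continue;
            float dx = other.x - me.x, dy = other.y - me.y;
            if (dx * dx + dy * dy < 30.0f * 30.0f) { /* shoot */ break; }
        }
    }
    ```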

    Warning: this is a Mechanical Engineer's take on this, so I could be totally wrong.
