1. neutrino

    neutrino low mass particle Uber Employee

    Messages:
    3,123
    Likes Received:
    2,687
    You are correct. This is only one of the many reasons though.

    Bottom line, this is a sophisticated subject that most people simply aren't qualified to comment on. If someone has shipped a game using these techniques, tells me that it's awesome, and can explain the techniques they used to ship, I'm all ears. Otherwise you are just talking out of your ***.

    Of course even if it was a good idea in the general case it still doesn't make sense for our game for a number of reasons I've already enumerated earlier.
  2. bobucles

    bobucles Post Master General

    Messages:
    3,388
    Likes Received:
    558
    A good CPU is like a sports car. It handles well, corners well, and goes fast. It can even do some grocery shopping at the end of the day.

    CUDA is like a rocket car. It goes super sanic fast, but look out! Anything that isn't a straight line on a desert plain is going to crash and burn.
  3. neutrino

    neutrino low mass particle Uber Employee

    Messages:
    3,123
    Likes Received:
    2,687
    It's like that episode of MythBusters with the rocket that crushes the car.
  4. Pawz

    Pawz Active Member

    Messages:
    951
    Likes Received:
    161
    You made me look it up.

    So CUDA is like.. First stage.. SECOND STAGE OMG SO FAST..... KAAAABLOOOOM. Little bits of your game everywhere.

    Got it. :D
  5. dyslexington

    dyslexington New Member

    Messages:
    5
    Likes Received:
    0
    While it is obviously true that the order in which units are processed makes the outcome nondeterministic, is it not also true that the processing order is irrelevant when the two units are too far apart from each other to be in communication?

    Take for example a simulated planet that is divided into 3 zones:
    1) 30 degrees latitude north to 90 north
    2) 30 north to 30 south
    3) 30 south to 90 south

    Assuming that no units are affected by entities that are farther apart than the vertical extent of zone 2 for the duration of a simulation step, it would be possible to simulate zones 1 and 3 concurrently and then zone 2 afterwards, and remain deterministic.
    I understand that it is irrelevant to PA, I am just curious.
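    In rough C++, the two-phase update described above might look like the following minimal sketch (the Unit/Zone types and the per-zone simulate routine are illustrative placeholders, not anything from PA's engine):

        #include <functional>
        #include <thread>
        #include <vector>

        struct Unit { double lat, lon; };
        struct Zone { std::vector<Unit> units; };

        // Sequential update of one zone; placeholder for the real per-unit logic.
        void simulate(Zone&) { /* ... */ }

        // Zones 1 and 3 cannot interact within one tick, so they may run
        // concurrently; zone 2 runs afterwards and sees their finished results.
        void tick(Zone& north, Zone& equator, Zone& south) {
            std::thread a(simulate, std::ref(north));
            std::thread b(simulate, std::ref(south));
            a.join();
            b.join();          // phase 1: the two polar zones in parallel
            simulate(equator); // phase 2: the belt between them
        }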
  6. antillie

    antillie Member

    Messages:
    813
    Likes Received:
    7
    Real servers don't have video cards. They have these. ;)
  7. b0073d

    b0073d New Member

    Messages:
    28
    Likes Received:
    0
  8. exterminans

    exterminans Post Master General

    Messages:
    1,881
    Likes Received:
    986
    The solution to this problem is actually a different one. Instead of trying to find a way to respect the dependency graph (which may not be loop-free, so forget about that, it might not be possible at all), you can simply keep a copy of the previous state and do all calculations against that.

    This comes with several new problems of its own, among them that you suddenly need TWICE the memory, which is a bad thing again, but it makes the problem scalable.
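    A minimal sketch of that double-buffered scheme, with a made-up UnitState and a toy integration step standing in for the real per-unit logic:

        #include <cstddef>
        #include <utility>
        #include <vector>

        struct UnitState { float x, y, vx, vy; };

        // Two full copies of the simulation state: every read comes from
        // prevState, every write goes to nextState, so all threads see the
        // same consistent frame no matter in which order units are processed.
        std::vector<UnitState> prevState, nextState;

        void step() {
            nextState.resize(prevState.size());
            // This loop is trivially parallel (thread pool, OpenMP, ...):
            // no unit writes to memory that another unit reads this frame.
            for (std::size_t i = 0; i < prevState.size(); ++i) {
                UnitState s = prevState[i];
                s.x += s.vx;   // toy integration step
                s.y += s.vy;
                nextState[i] = s;
            }
            std::swap(prevState, nextState); // the price: twice the memory
        }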

    That's why you drop the additional copy and instead place semaphores on objects which are currently being modified, so even though you have no consistent reads across the full simulation frame, reads are still consistent while a single entity is processed. This WILL break the simulation for really fast-moving entities (units which move more than half their diameter per frame, like fast projectiles), but it still works quite fine for everything below that threshold.
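    Such per-entity locking could be sketched like this, assuming one mutex per entity and that two distinct entities are locked together (illustrative names, not PA code):

        #include <mutex>

        struct Entity {
            std::mutex lock;   // held while this entity is read or modified
            float x = 0, y = 0;
        };

        // Worker threads may process entities in any order. Locking both
        // entities at once (deadlock-free via std::scoped_lock; assumes
        // self and target are distinct) guarantees that no entity is ever
        // observed half-updated, even though the frame as a whole is not
        // a consistent snapshot.
        void process(Entity& self, Entity& target) {
            std::scoped_lock guard(self.lock, target.lock);
            // ... update self using target's current fields ...
        }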

    This doesn't address the original problem, however: you still can't do that on the GPU. The data is too inhomogeneous, the algorithms are too complex, and you would need to transport all the data back and forth between CPU and GPU.

    Things like CUDA-accelerated PhysX (I'm talking about extreme situations with several thousand particles!) and the like only work because internally the data is very uniform and most of it is held persistently in video memory; there is no need to ever transfer anything back to main memory or to process it on the CPU. It's also actually a rather simple calculation (in terms of lines of code), and the result can be written straight to another region of video RAM where it is then used for actual rendering.
    And yet a ridiculous amount of effort was put into developing PhysX, while a feature-wise identical CPU-only version could be developed in a fraction of the time, since you don't need to bother with GPU-optimized data structures, data transfer, or all the limitations that the CUDA / OpenCL languages put upon you (like the inability to do recursion with variable depth!).
  9. antillie

    antillie Member

    Messages:
    813
    Likes Received:
    7
  10. cwarner7264

    cwarner7264 Moderator Alumni

    Messages:
    4,460
    Likes Received:
    5,390
    I love Uber. So delightfully frank.
  11. dyslexington

    dyslexington New Member

    Messages:
    5
    Likes Received:
    0
    I don't think keeping all state twice is a practical solution.

    There can certainly be loops in the dependencies, but with a single-threaded design these loops are resolved by simply choosing an arbitrary one of the dependent units to be simulated before the others. As long as each client in a synchronous simulation chooses the same order, the outcome is the same.

    I'd rather say that it easily breaks determinism, regardless of the speed of the entity, because you do not enforce a specific order in which to process dependent units.
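    The shared ordering rule can be as simple as every client sorting by the same stable unit id before each tick, as in this hypothetical sketch:

        #include <algorithm>
        #include <vector>

        struct Unit { unsigned id; };

        // Every client sorts by the same stable key, so dependency loops
        // are broken identically everywhere and the simulation stays
        // deterministic across machines.
        void simulateTick(std::vector<Unit*>& units) {
            std::sort(units.begin(), units.end(),
                      [](const Unit* a, const Unit* b) { return a->id < b->id; });
            for (Unit* u : units) {
                // ... process *u sequentially ...
            }
        }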
  12. exterminans

    exterminans Post Master General

    Messages:
    1,881
    Likes Received:
    986
    Forget about determinism; that's not the topic of this thread. Synchronized game states (which need to be deterministic) vs. client/server (which doesn't) was a different thread. This thread was all about parallelism and GPGPU computing.

    Oh, and keeping the full state of the simulation twice actually IS a practical solution if you need real determinism, especially when you can't traverse the dependency graph in the correct order without significant overhead. It's also the only way to achieve determinism across multiple runs of the simulation when you don't have full control over the order in which entities are stored and processed. That saves you a lot of hassle, since it also removes the restriction of using only stable sorting algorithms and the like. That limitation is otherwise always in place, even when running single-threaded.

    Last but not least: you don't need determinism anyway. What you need are reasonable results, not precise ones.
    A single-threaded, fully deterministic, mathematically correct general-purpose solution surely gives you a precise and reasonable result, but you can also get an imprecise yet still reasonable result from a non-deterministic, parallelized algorithm that calculates an approximation using simplified rules and additional restrictions which are fulfilled by the given input.
  13. antillie

    antillie Member

    Messages:
    813
    Likes Received:
    7
    It's kind of amusing to watch people debate CUDA when real servers don't have video cards, and home PCs running the PA client will be unable to use their video cards for CUDA because they will be busy rendering PA's graphics instead.

    I mean, how many gaming PCs out there have Tesla cards in them? Really. Looking at the main uses for GPGPU, I don't see anything related to games. Unless we need to start doing hyper-accurate weather or physics simulations, I just don't see what CUDA can really do for PA. I guarantee that systems built with Tesla cards in mind were not built to be game servers.

    Besides, CUDA is an Nvidia-only thing. What about all the people who have AMD cards?
  14. ours99

    ours99 New Member

    Messages:
    7
    Likes Received:
    0
    I find it funny that people expect their GPUs to do more work when most games today leave multi-core CPUs snoring.

    Sure, it's about the server part, but still.
  15. dyslexington

    dyslexington New Member

    Messages:
    5
    Likes Received:
    0
    Well, that is what I am talking about. If you have a dependency graph consisting of 3 subgraphs and there are no dependencies between subgraph 1 and subgraph 3, then you can process both in parallel. In this first step, the dependencies of both 1 and 3 on 2 are satisfied from the old state of subgraph 2, and only when 1 and 3 are finished is subgraph 2 processed, with its dependencies satisfied from the new state of subgraphs 1 and 3.
    Since the operations on each subgraph are sequential, you get precise results within each subgraph, and since the order in which the subgraphs are processed is defined, you will always depend on either old state or new state, never a mix.
    Still, you can process in parallel those subgraphs that have no dependencies between them.
    In an RTS simulation there should be a lot of entities that are independent, because what happens on one side of a planet cannot affect something happening on the opposite side in a timespan as short as a simulation tick.
  16. exterminans

    exterminans Post Master General

    Messages:
    1,881
    Likes Received:
    986
    Pairwise unrelated. But that does not mean that there aren't any indirect dependencies. You can't tell that easily which units are related and which are not.
  17. dyslexington

    dyslexington New Member

    Messages:
    5
    Likes Received:
    0
    No entity in subgraph 1 has a direct dependency on any entity in subgraph 3, and vice versa (i.e., no tank at the north pole will attempt to shoot at a tank at the south pole and thus probe its position).
    There are also no indirect dependencies that take effect within a single simulation tick. If a tank at the north pole shoots at a tank at the equator, and at the very same time a tank at the south pole shoots at the very same tank, then both projectiles would have to travel through the equatorial region for at least one tick. The thread that processes that region would then roll the dice and decide which projectile hits, independent of whether the north-pole tank was simulated before the south-pole tank or the other way around, because the simulation of the equatorial region always waits for both north and south to finish before adding their projectiles to its simulation state.
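    That order-independence can be made concrete by buffering incoming projectiles and merging them by a stable key once both neighbouring zones have finished, as in this hypothetical sketch (the types and keys are made up):

        #include <algorithm>
        #include <vector>

        struct Projectile { unsigned id; /* trajectory, owner, ... */ };

        // Projectiles crossing into the equatorial zone are queued here by
        // the polar threads (synchronization omitted for brevity).
        std::vector<Projectile> incoming;

        // Called only after BOTH neighbours have finished their phase.
        // Sorting by a stable id makes the result identical regardless of
        // which pole was simulated first.
        void integrateHandoffs(std::vector<Projectile>& zoneState) {
            std::sort(incoming.begin(), incoming.end(),
                      [](const Projectile& a, const Projectile& b) { return a.id < b.id; });
            zoneState.insert(zoneState.end(), incoming.begin(), incoming.end());
            incoming.clear();
        }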
  18. exterminans

    exterminans Post Master General

    Messages:
    1,881
    Likes Received:
    986
    And what do you think your graph will look like when units form a belt around the planet and each unit has at least two other units in each direction that it needs to consider?

    There are no subgraphs left which could be handled independently. Generating the graph and finding subgraphs is also quite costly; that's nothing you would do on the fly. Actually, you use such graphs ONLY while conceptualizing the software; you would never try to work with them in the final implementation.
  19. dyslexington

    dyslexington New Member

    Messages:
    5
    Likes Received:
    0
    If they form a belt and thus are all at the same latitude, then they would all be simulated by a single thread. No performance gained, but still deterministic.
    But this configuration would be improbable, especially in the cases where you need parallelism most: on big planets.

    You can just divide the planet into equal-sized horizontal slices. With 3 slices, the simulation time would be the time used in the most populated slice plus the time used in the second most populated slice, as opposed to the sum of the time used in all slices; with an equal distribution of units, roughly 66%. The bigger the planet, the more slices you can fit in. So if you have 5 horizontal slices, and slices 1, 3 and 5 are mutually independent, as are slices 2 and 4, then the time with an equal distribution would be 2 slice-times instead of 5, or 40%.
    I.e., threads A, B and C first simulate slices 1, 3 and 5 in parallel, then threads D and E simulate slices 2 and 4 in parallel.
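    Generalized to any number of slices, that odd/even two-phase schedule might look like the following C++ sketch (assuming, as above, that units never interact across more than one slice boundary per tick; the Slice type is a placeholder):

        #include <cstddef>
        #include <functional>
        #include <thread>
        #include <vector>

        struct Slice { /* units within one latitude band */ };

        // Sequential update of one slice; placeholder for the real logic.
        void simulate(Slice&) {}

        // Phase 1 runs slices 1,3,5,... in parallel, phase 2 runs 2,4,...;
        // slices within a phase share no border, so they cannot interact.
        // With 5 equally loaded slices this takes 2 slice-times instead of 5.
        void tick(std::vector<Slice>& slices) {
            for (std::size_t parity = 0; parity < 2; ++parity) {
                std::vector<std::thread> pool;
                for (std::size_t i = parity; i < slices.size(); i += 2)
                    pool.emplace_back(simulate, std::ref(slices[i]));
                for (std::thread& t : pool) t.join();
            }
        }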
