What do the developers think about using OpenCL or CUDA for the simulation?

Discussion in 'Planetary Annihilation General Discussion' started by jvickers, September 28, 2014.

  1. jvickers

    jvickers Member

    Messages:
    56
    Likes Received:
    32
    (bit of background for readers of the forum)

    Both OpenCL and CUDA allow GPUs to run code that's not specific to graphics - they can carry out many more calculations in parallel than a CPU can. Bitcoin went through a phase where its most efficient mining was done on GPUs; now it's best done using ASICs (which is at least an interesting idea for how to run the PA simulation extremely quickly).

    OpenCL is a technology that allows GPUs to run code that's a bit like C. It's reportedly hard to program in, but there are innovations (to do with compiling higher level code to OpenCL) which look like they will make it easier in the longer term.

    CUDA is a technology that's specific to NVIDIA cards. It's more like C++ and reportedly easier to program in.
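
    For a rough flavour, here is a toy CUDA kernel (my own example, nothing to do with PA's code) that adds two arrays. The body is essentially plain C/C++; the __global__ qualifier and the thread indices are the GPU-specific parts.

        // Toy CUDA kernel: each GPU thread adds one pair of elements.
        __global__ void add(const float* a, const float* b, float* c, int n)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;  // which element this thread owns
            if (i < n)
                c[i] = a[i] + b[i];
        }

        // A host-side launch would look something like:
        //   add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);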

    Both have various advantages and disadvantages.
    cdrkf likes this.
  2. squishypon3

    squishypon3 Post Master General

    Messages:
    7,971
    Likes Received:
    4,356
    "What do the developers think about recreating the engine in a whole other programming format?"

    Sorry, couldn't resist.

    Anyway, is OpenCL compatible with Linux/Mac?
    Geers likes this.
  3. cola_colin

    cola_colin Moderator Alumni

    Messages:
    12,074
    Likes Received:
    16,221
    There have been threads about this before. I don't remember the exact reasons given, but the short answer was: it won't work well, as too many of the things PA does simply don't fit within the fairly heavy limitations that OpenCL and similar technologies impose.

    From my own technical understanding, apart from the high cost of moving data to the GPU and back, it won't give a good speedup, because the high processing power that GPUs have is the result of extreme parallelisation. They are aimed at rendering images made up of millions of pixels. Pixels tend to be independent of each other, so a lot of slow processing units can work on a lot of pixels in parallel, resulting in fast image generation.

    It's already hard to make PA scale well across 4-8 CPU cores (and probably more in the future). But GPUs have hundreds (thousands?) of really dumb, slow processing units - dumb in the sense that they don't support many of the operations you'd do on a CPU, so many things become much more complicated. If you can't split your workload across them, you'll get nothing out of it. Basically, GPUs are good at doing a million really simple things in parallel; that's why stuff like brute-forcing hash values is so fast on them.
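
    To make that concrete, here is a rough CUDA-style sketch (my own toy, with a made-up hash, not Bitcoin's) of why brute-forcing hash values maps so well onto a GPU: every candidate gets its own thread and no thread needs to talk to any other.

        // Toy stand-in hash; a real miner would implement SHA-256 here.
        __device__ unsigned int toy_hash(unsigned int x)
        {
            x ^= x >> 16;
            x *= 0x45d9f3bu;
            x ^= x >> 16;
            return x;
        }

        // One thread per candidate value; each thread's work is fully independent.
        __global__ void brute_force(unsigned int target, unsigned int* found)
        {
            unsigned int candidate = blockIdx.x * blockDim.x + threadIdx.x;
            if (toy_hash(candidate) == target)
                *found = candidate;   // last writer wins, which is fine for a toy
        }

        // Millions of candidates per launch, e.g.:
        //   brute_force<<<16384, 256>>>(target, d_found);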

    I also doubt an ASIC would be feasible for a computer game like PA. I might be wrong (somebody please correct me in that case), but an ASIC is nothing more than a hardware implementation of whatever it is supposed to do fast. So a Bitcoin mining ASIC is a hardware implementation of the hashing algorithm that Bitcoin uses; it can do exactly that and nothing else. Implementing the hashing algorithm in hardware is probably not that hard. Implementing the whole "algorithm" of simulating PA in hardware is much harder - it's just so much more complex.
    xankar, Remy561 and squishypon3 like this.
  4. SXX

    SXX Post Master General

    Messages:
    6,896
    Likes Received:
    1,812
  5. cdrkf

    cdrkf Post Master General

    Messages:
    5,721
    Likes Received:
    4,793
    I think I've read that the devs are actively looking at moving more work onto the GPU. It's unlikely that the sim is suitable, as others have said, although things like the projectile paths are currently CPU bound and *could* be moved over to the GPU...

    Edit: The real question, though, is at what point you shift the balance and just make the game GPU bound. I actually think PA (client side) uses hardware quite well at the moment: it scales from my dual-core laptop with a GT 420M GPU to my 8-core desktop with a GTX 560 quite happily, and putting much more on the GPU might actually *hurt* lower-end machines.
  6. jvickers

    jvickers Member

    Messages:
    56
    Likes Received:
    32
    Yes, that is the gist of the question. With CUDA, or some of the tooling around OpenCL such as SPIR (http://www.khronos.org/spir), the code would not need to be as different from the current C++ (please correct me if I'm wrong that it's C++) as it would be if it were written by hand in OpenCL.

    OpenCL is compatible with both Linux and Mac. Over the last few years Apple has made sure OpenCL is well supported in OS X (that's my impression - I don't know that much about OS X or OpenCL), and they have been using OpenCL to make the OS run more efficiently, though I couldn't tell you exactly where.
  7. cola_colin

    cola_colin Moderator Alumni

    Messages:
    12,074
    Likes Received:
    16,221
    I am pretty sure that was all about the client, as the client is the only thing that currently even has a graphics card.
    jvickers and SXX like this.
  8. SXX

    SXX Post Master General

    Messages:
    6,896
    Likes Received:
    1,812
    Things like projectile simulation can't be moved to GPUs. The simulation likely can't even use more than one core and remain efficient: even the overhead of passing data between CPU cores is likely to kill actual sim performance, and offloading something to the GPU is thousands of times slower than that.

    The only things OpenCL might be suitable for on the server side are physics (flocking) and pathfinding (and likely only in a very limited way).
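
    As a rough illustration (my own naive sketch, not anything from Uber), flocking is the kind of workload that does fit: each boid can be updated by its own GPU thread from shared, read-only state.

        struct Boid { float x, y, vx, vy; };

        // Naive O(n^2) flocking step: every thread reads the shared, read-only
        // input array and writes only its own boid, so there are no data races.
        __global__ void flock_step(const Boid* in, Boid* out, int n, float dt)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i >= n) return;

            float cx = 0.0f, cy = 0.0f;                 // centre of the whole flock
            for (int j = 0; j < n; ++j) { cx += in[j].x; cy += in[j].y; }
            cx /= n; cy /= n;

            Boid b = in[i];
            b.vx += 0.01f * (cx - b.x);                 // steer weakly toward the centre
            b.vy += 0.01f * (cy - b.y);
            b.x  += b.vx * dt;
            b.y  += b.vy * dt;
            out[i] = b;                                 // each thread writes only its own slot
        }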
    tatsujb and cdrkf like this.
  9. SXX

    SXX Post Master General

    Messages:
    6,896
    Likes Received:
    1,812
    Yeah, the developers want to use compute shaders in the future, as well as a number of other modern GL extensions.
  10. cdrkf

    cdrkf Post Master General

    Messages:
    5,721
    Likes Received:
    4,793
    Ah ok - re projectiles, I thought that although *each projectile* could only use one core, given the number of them (you can have thousands at once) maybe that would have made sense for the GPU. Perhaps not, though.

    Hi yeah I was thinking more client side...
  11. zihuatanejo

    zihuatanejo Well-Known Member

    Messages:
    798
    Likes Received:
    577
    Videogames do not lend themselves well to massive parallelisation.
  12. SXX

    SXX Post Master General

    Messages:
    6,896
    Likes Received:
    1,812
    The problem is that when you have real projectile simulation with collisions, you have to keep tons and tons of game-state data in memory, for example for a given planet. There is no way all that state can be loaded onto the GPU fast enough, and if the simulation doesn't take all the required details into account there is no point in simulating projectiles at all - better to just fake them like tons of MMO games do, where every "attack" is just a visual effect and the server simply returns a "hit / miss / dmg" result for each action.
    squishypon3, tatsujb and cdrkf like this.
  13. jvickers

    jvickers Member

    Messages:
    56
    Likes Received:
    32
    First off, I don't know all the details about CPU-GPU communication, but when I see someone say something technological can't be done, I'm sceptical. If the game state can be transferred over the internet quickly enough, it would be surprising if it could not also be transferred over the PCIe bus connecting the CPU to the GPU. The rated bandwidths there for some relatively high-end cards are in the hundreds of GB/s (such as http://www.tomshardware.com/reviews/amd-radeon-r9-285-tonga,3925.html). 'There is no way all this state can be loaded into GPU fast enough' seems like the kind of statement that could be proven wrong by someone actually doing it. It may not be at all easy, though, with synchronization and timing issues getting in the way of a smooth programming experience. But just from considering the bandwidth numbers, the amount of data that would need to be transferred seems to be well below the capacity of the link between the CPU and the GPU (the PCIe bus).

    AMD (and others) have been doing some work on making a unified memory space for the GPU and CPU components of their APU (Accelerated Processing Unit) system. The architecture is known as Heterogeneous System Architecture (HSA). AMD explains it here: http://developer.amd.com/resources/...hat-is-heterogeneous-system-architecture-hsa/. The HSA foundation, which was founded by AMD and others, has a site here: http://www.hsafoundation.com/. An example of an AMD APU is the AMD A10-7850K, which retails for $180.
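
    To give a flavour of the shared-address-space idea, here is a minimal sketch using CUDA's managed memory (CUDA only because that's the API I can sketch from memory; HSA aims for a similar effect in hardware on APUs). The CPU and GPU touch the same pointer with no explicit copies.

        #include <cstdio>

        __global__ void double_all(float* data, int n)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) data[i] *= 2.0f;
        }

        int main()
        {
            const int n = 1 << 20;
            float* data = nullptr;
            cudaMallocManaged(&data, n * sizeof(float));    // one allocation, visible to CPU and GPU
            for (int i = 0; i < n; ++i) data[i] = 1.0f;     // CPU writes...

            double_all<<<(n + 255) / 256, 256>>>(data, n);  // ...GPU updates in place...
            cudaDeviceSynchronize();

            printf("data[0] = %f\n", data[0]);              // ...CPU reads, no cudaMemcpy needed
            cudaFree(data);
            return 0;
        }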

    I am very impressed with the theory and goals behind some of AMD's latest chips.

    (I am not an employee of, contractor for, or shareholder of AMD)
  14. tylerseacrest

    tylerseacrest Member

    Messages:
    56
    Likes Received:
    19
    As a student currently studying programming, I know a little bit about this and I tend to get a little excited, so I went a little overboard with this. Basically, the gist of it is that you could get perf increases with OpenCL/CUDA or HSA, but only on very expensive, specialized hardware.

    HSA is a very well-thought-out technique that, when mature, will fully allow shared CPU/GPU simulation. However, HSA is still brand new. The hardware that supports it isn't powerful enough to make a whole lot of difference in PA, and it will be several years before it becomes a viable technology, mainly because we will need to redesign how we build our motherboards to support it.

    OpenCL is fully capable of supporting a physics engine. The open-source Bullet physics library is switching to a full OpenCL pipeline in Bullet 3.x, coming next year, with stunning results. However, OpenCL is extremely difficult to code for. Bullet 3.x has been in development for several years now and is only just approaching a usable state. It isn't just a new language; it's an entirely new and very restrictive coding paradigm. Likely the entire physics engine would have to be recoded from the ground up to support it.

    And the gain isn't guaranteed. If a GPU is only doing one thing, it's crazy fast. But the way PA is designed, you have AI, navigation, physics, and the communication between them to worry about, and switching between these tasks would slow the GPU down dramatically. Higher-end GPUs could probably get a speed increase, but how much of one is a difficult question.

    Finally, we have memory. Most graphics cards come with 1-4 GB of onboard memory these days. While that's great in most cases, PA works best with at least 8 GB. Not all of that is related to simulation, but a good portion of it is. There just isn't enough memory on a normal graphics card to handle all those units and planets. You can transfer between system RAM and the GPU's onboard memory, but that is 'slow': about 15.4 GB/s for a PCI-E 3.0 card running at x16. 15.4 GB/s may seem fast, especially when PA sends all that unit data over the internet. However, the data PA sends over the internet is only small bits and pieces: unit X, with acceleration Y, is currently at position Z1 and will arrive at position Z2 at time T. For simulation you have to send every location, rotation, and velocity, plus the previous location, rotation, and velocity, for every single frame at 10 FPS, in addition to any planet data the GPU might need. In fact, due to the limited amount of RAM on the GPU, you would have to do multiple data updates per frame, breaking each planet into sections. Once again, this is probably doable, but would not give much, if any, perf increase.
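
    As a rough back-of-the-envelope sketch (the object count and per-object size are made-up numbers, just to show the shape of the problem), here is what one full state upload per sim frame would cost over that 15.4 GB/s link:

        #include <cstdio>

        int main()
        {
            const double units          = 1.0e6;    // assumed number of objects
            const double bytes_per_unit = 500.0;    // assumed state per object
            const double pcie_bytes_s   = 15.4e9;   // ~PCI-E 3.0 x16
            const double sim_hz         = 10.0;     // sim frames per second

            const double bytes_per_frame   = units * bytes_per_unit;
            const double seconds_per_frame = bytes_per_frame / pcie_bytes_s;

            printf("%.0f MB per frame, %.1f ms per transfer, %.0f%% of a %.0f ms frame budget\n",
                   bytes_per_frame / 1e6,
                   seconds_per_frame * 1e3,
                   100.0 * seconds_per_frame * sim_hz,
                   1000.0 / sim_hz);
            return 0;
        }

    Under those assumptions, roughly a third of each 100 ms sim frame goes on moving data alone, before the GPU has done any actual work.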

    TL;DR: You're right - you could probably convert PA to OpenCL. However, mainly due to memory constraints, you probably couldn't get a perf increase without using pro-grade cards that have lots (8 to 16 GB) of memory. While this would be very awesome, it's not useful for 99% of the PA crowd.

    PS: The stats you linked to refer to the speed at which the GPU can access its onboard memory and have no relation to how fast the GPU can communicate with the rest of the system. That depends on the PCI-E bus, which currently has a maximum speed of around 15.4 GB/s.
    jvickers likes this.
  15. jvickers

    jvickers Member

    Messages:
    56
    Likes Received:
    32
    Thanks for the very informative reply.

    I just did a quick sum:

    1000000 * 500 * 10 / 1024 / 1024 / 1024 = 4.66

    1,000,000 objects with 500 bytes of data each, being sent 10 times a second, is about 4.66 GB of data per second. It seems possible.

    I also think that at this stage of Uber's development of the PA code, if OpenCL were to be used, offloading some of the processing, rather than the whole simulation, could work well. Pathfinding is one thing that could be done in OpenCL in a modular way; updating position vectors based on movements would be another. The tasks that run well on the GPU could be run there, leaving the CPU under less pressure.
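
    For the position-update case, something like this minimal sketch is the kind of thing I have in mind (the data layout is assumed for illustration; it's not Uber's actual structures):

        struct UnitState { float px, py, pz; float vx, vy, vz; };

        // One thread per unit; units don't interact here, so the work splits cleanly.
        __global__ void integrate_positions(UnitState* units, int n, float dt)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i >= n) return;
            units[i].px += units[i].vx * dt;   // simple Euler step per unit
            units[i].py += units[i].vy * dt;
            units[i].pz += units[i].vz * dt;
        }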

    I'd be interested to know how much space would be needed to represent the game state. I have a hunch it could fit in a few GB of RAM (or less), but I don't really know. Depending on how much data is retained on the GPU, less would need to be sent each cycle. If I were to say there were 1,000,000 objects with 500 bytes of data each, that would be 0.47 GB. Perhaps a PA developer could provide better figures to use as an estimate. I'd prefer to base estimates on the amount of data that needs to be held (and can be derived theoretically) rather than on how much RAM the current system actually uses.

    I've just considered the amount of data needed to hold a planet: if it's the whole planet, stored much as it gets displayed, then I can envisage it being far too big (maybe using up more than all of that 1,000,000-object budget I arbitrarily set). But a much smaller approximation of the planets could be held, one which functionally says what is where but doesn't have enough detail to render something visually appealing.
  16. SXX

    SXX Post Master General

    Messages:
    6,896
    Likes Received:
    1,812
    It can be done, but the overhead would be so high that there is no reason to do it at all. The problem isn't just bandwidth but also latency, which is too high. Maybe with HSA, and much more powerful GPUs on the same die as the CPU, we'll see something like that become possible, but for now it's just not there.

    The problem is that APU CPU cores are slow as hell compared to high-end Intel CPUs and AMD FX CPUs, and integrated graphics are still a lot slower than any $200 GPU. Simulation in an RTS game is an extremely CPU-intensive task, and trust me, you don't want to see games hosted by some AMD APU owner.
  17. cola_colin

    cola_colin Moderator Alumni

    Messages:
    12,074
    Likes Received:
    16,221
  18. ace63

    ace63 Post Master General

    Messages:
    1,067
    Likes Received:
    826
    Keep in mind that video only shows cubes. Simulating anything other than cubes or spheres is a gazillion times more complicated.
  19. cdrkf

    cdrkf Post Master General

    Messages:
    5,721
    Likes Received:
    4,793
    Kaveri (the latest AMD APU) is equipped with AMD's latest core design. Per core it is faster than FX, though being an APU it only has 2 modules. That said, I wouldn't discount it: it sits between an i3 and an i5 in most CPU tasks (those that support multithreading, at least). The onboard GPU is powerful, and the few HSA tests available show that Kaveri can outrun an i7 if the application supports HSA (even with the i7 using its onboard GPU plus CPU cores). There is a lot of potential there, but as usual AMD are a bit early. It will take a bigger player (Intel) supporting the HSA spec before it takes off...
  20. cola_colin

    cola_colin Moderator Alumni

    Messages:
    12,074
    Likes Received:
    16,221
    Bullet 3.x seems to support way more than just plain cubes, though.
