Multi-GPU compatibility to avoid SUPCOM problems

Discussion in 'Planetary Annihilation General Discussion' started by sovietsandwhich, April 8, 2013.

  1. sylvesterink

    sylvesterink Active Member

    Messages:
    907
    Likes Received:
    41
    It's an informative post, BulletMagnet, but FPGAs have several other limitations that make it unlikely they'll gain favor over current CPU implementations. The main issues are that they run significantly slower and are quite a bit larger than their fixed-architecture counterparts, so when it comes to the consumer market, fixed-architecture will be dominant for a good while, even if FPGA prices go down. (And that's somewhat unlikely too, seeing as it takes more to build an FPGA than a fixed-architecture chip.)
  2. BulletMagnet

    BulletMagnet Post Master General

    Messages:
    3,263
    Likes Received:
    591
    All true, but you don't need fast clock speeds with an FPGA.

    We're all after high clock speeds in regular CPUs because 1) everything is largely done sequentially and 2) a lot of complicated things need to be done over several clock cycles. My second point is largely an extension of the first.

    There's often a limit on how parallelizable something is (even for FPGAs), but you can get away with a whole lot more. I don't think CPUs do division operations in a single cycle because devoting that much silicon to that operation is wasteful compared to paying the penalty of doing it over a few cycles.

    Not every program uses division. Of those that do, only some do it often enough to warrant thinking about performance. Someone designing a CPU has to choose one of these: no division operation at all, a small but slow one, or a large but fast one. An FPGA isn't limited by that; programs (for lack of a better word) that don't divide don't allocate any silicon to it. Ones that can tolerate a slow divide can use a small divider circuit, and ones that need to do it quickly can have quick ones.


    Another bit of history: early processors didn't have a single-cycle multiplication operation, so multiplications were slow. When you had a trivial y = 2 * x, you would actually write y = x + x because adding was faster. The other option the programmer had was y = x << 1, which just bumps all the ones and zeros one place to the left. That exploits how binary numbers are represented, so it's technically a hack. A very fast, efficient, and well-known hack. But a hack nonetheless.

    People who made compilers quickly taught them to detect when a multiplication was guaranteed to be by 2 (or 4, 8, 16, etc.) and silently replace the written code with the faster shift. Eventually Moore's Law made it viable to put in a larger multiplication circuit... at least in desktop/server processors. Dinky little microcontrollers used in industrial processes still lean heavily on these compiler tricks.
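    To make that concrete, here's a minimal sketch in C of the trick (values made up for illustration; any modern compiler does this substitution for you):

    Code:
        #include <stdio.h>

        int main(void) {
            int x = 13;

            int written = 8 * x;  /* what the programmer writes */
            int emitted = x << 3; /* what the compiler emits: 8 = 2^3, so shift left by 3 */

            printf("%d %d\n", written, emitted); /* both print 104 */
            return 0;
        }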

    Division is worse; it's a nasty motherfucker of a thing to ask a computer to do. Will Moore's Law let us have single-cycle division without doubling the cost of new CPUs? I don't know. Maybe. It's entirely possible that it's already happened.
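    If you're curious why it's so nasty, here's a toy C version of the textbook "restoring division" approach (illustrative only, not how any particular CPU implements it): it needs one compare-and-subtract step per bit of the result, so 32 steps for 32-bit numbers, where an add takes one.

    Code:
        #include <stdint.h>
        #include <stdio.h>

        /* Toy restoring division; assumes divisor != 0. */
        uint32_t slow_divide(uint32_t dividend, uint32_t divisor) {
            uint32_t quotient  = 0;
            uint64_t remainder = 0;
            for (int bit = 31; bit >= 0; bit--) {  /* one iteration per result bit */
                remainder = (remainder << 1) | ((dividend >> bit) & 1);
                if (remainder >= divisor) {        /* compare-and-subtract each step */
                    remainder -= divisor;
                    quotient |= (uint32_t)1 << bit;
                }
            }
            return quotient;
        }

        int main(void) {
            printf("%u\n", slow_divide(1000, 7)); /* prints 142 */
            return 0;
        }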


    The CPU has to be able to add, multiply, divide... everything, the moment you buy it. It's capable of things you often don't need. Things you'll never use. An FPGA doesn't. Suppose you had two equally sized wafers of silicon, one with a bog-standard CPU on it, and the other an FPGA. When your CPU isn't doing certain operations, the transistors for those unused operations are wasted space. Wasted space that you still paid money for. The FPGA isn't limited in that way; it can use the spare space to do more operations at the same time.

    This comes back to the wrist-watch example. It has two tiny blocks of memory, one to remember the time, one to remember when the alarm is set. It's got two little adder circuits, one to +1 the time and one to +1 the alarm when you set it. Finally it's got a circuit that compares the two blocks of memory and beeps when they match.

    Every Core i5 from Intel has those, as well as about 20MB more memory, about 80 more adders, plus multipliers, comparators, and a whole host of things I can't recall. But using a Core i5 as a wrist-watch is hilariously wasteful.
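    For fun, a toy C model of that watch logic (purely illustrative; a real watch is a handful of logic gates, not software):

    Code:
        #include <stdio.h>

        int main(void) {
            int time_mins  = 0; /* memory block 1: the current time */
            int alarm_mins = 3; /* memory block 2: the alarm setting (set via its own +1 adder) */

            for (int tick = 0; tick < 5; tick++) {
                time_mins = (time_mins + 1) % (24 * 60); /* adder 1: +1 the time each tick */
                if (time_mins == alarm_mins)             /* comparator: beep on a match */
                    printf("BEEP at minute %d\n", time_mins);
            }
            return 0;
        }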


    Another reason FPGAs are expensive is that manufacturers bundle piles and piles of IP in. It's basically DRM-loaded software for FPGAs. Buying any high-end FPGA has always meant buying a pile of software along with it. Intel and AMD don't bundle extra programs that can only be used on that one product; if they did, expect the prices of their products to rise at the same time.
  3. bobucles

    bobucles Post Master General

    Messages:
    3,388
    Likes Received:
    558
    Any half decent FPGA comes with dedicated algorithmic hardware, just like a CPU does. It's not an amorphous blob.
  4. drsinistar

    drsinistar Member

    Messages:
    218
    Likes Received:
    0
    If I may ask: an FPGA is basically a stem cell for a computer, but it can change its state at any given time based on the program you have?
  5. menchfrest

    menchfrest Active Member

    Messages:
    476
    Likes Received:
    55
    Another reason FPGAs are not commonplace is that they're hard to program well (or so I have been told). You basically need to hire someone who specializes in programming them, and given the limited number of such experts, more widespread adoption is tricky at best.

    This is changing though, or so is my impression.

    Although, given that I'm taking this from a known unreliable source, I could be totally wrong.
  6. BulletMagnet

    BulletMagnet Post Master General

    Messages:
    3,263
    Likes Received:
    591
    If you're willing to pay for dedicated multipliers and memory that you're not necessarily going to use, then strips of dedicated circuits are fine to have. The benefit is that a dedicated multiplier is faster; the downside is that you've got it even when you don't need it.

    It's just pros and cons. Signal processing, no matter what you're doing, always uses heaps and heaps of multiply-add operations, so it's worth the cost of having them. If I were running a database on an FPGA, I'd much rather have dedicated comparators than dedicated multipliers.
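    To show what I mean by heaps of multiply-adds, here's the inner loop of a FIR filter in C (a made-up 4-tap moving average; real filters just have more taps). Every output sample costs one multiply-add per tap, which is exactly why dedicated multiply-add blocks earn their keep:

    Code:
        #include <stdio.h>

        #define TAPS 4

        double fir_sample(const double coeff[TAPS], const double history[TAPS]) {
            double acc = 0.0;
            for (int i = 0; i < TAPS; i++)
                acc += coeff[i] * history[i]; /* the multiply-add, over and over */
            return acc;
        }

        int main(void) {
            double coeff[TAPS]   = {0.25, 0.25, 0.25, 0.25}; /* simple moving average */
            double history[TAPS] = {1.0, 2.0, 3.0, 4.0};     /* last four input samples */
            printf("%f\n", fir_sample(coeff, history));      /* prints 2.500000 */
            return 0;
        }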

    It'd be more accurate to say stem cells for digital logic. Computers still need power supplies and regulators. Even FPGAs need those (in my experience, FPGAs are pickier about power than regular microcontrollers).

    I think you're putting the cart before the horse.

    They're currently harder to program than writing regular software. That is certainly true.

    But there's more demand for regular software than for FPGA software, so there are more expert software programmers than expert hardware programmers. I don't think someone who can program an FPGA is any more deserving of being called an expert than any regular programmer. I've had to program them at university, and I'm definitely not an expert.
  7. cola_colin

    cola_colin Moderator Alumni

    Messages:
    12,074
    Likes Received:
    16,221
    I have never heard of those.
    So if I were to write a program for an FPGA, I would write normal code in whatever language and run a compiler, and that compiler would then modify the structure of the FPGA to execute the program I want to run? Sounds good in theory, but how would you possibly do any kind of multitasking on it? I would guess that switching the structure of an FPGA is rather slow.
  8. BulletMagnet

    BulletMagnet Post Master General

    Messages:
    3,263
    Likes Received:
    591
    Assuming you have enough room to fit both in at the same time, you literally run both at the same time.

    If you didn't have enough room, then you'd need to swap code and data back and forth. We do that with CPUs. When you look at the technical details, it's messy as hell. But it's still done dozens if not hundreds of times every second in computers around the world, and done fast enough that we rarely ever notice. Ever wondered how your old single-core computer could run multiple programs at once?
  9. cola_colin

    cola_colin Moderator Alumni

    Messages:
    12,074
    Likes Received:
    16,221
    I know how that works and I know switching the thread context is rather expensive to do and should be considered when writing a multithreaded application.
    An FPGA would have to reconfigure itself completely. I have no idea of how that works, but my guess is that it is too slow to do it dozens or hundreds of times per second.
  10. bobucles

    bobucles Post Master General

    Messages:
    3,388
    Likes Received:
    558
    FPGAs use dedicated multiplier and memory hardware, because the actual FPGA is horrible at the task.

    But they aren't. FPGAs are more like an emulator than any complete CPU. They use an incredible amount of silicon because they are designed to build logic circuits in any arbitrary combination. This means huge heaps of redundant hardware at every step along the way.

    I saw a class project to build a 24-instruction RISC CPU on an FPGA. It was $150 of hardware that could emulate a CPU at 50MHz. So yeah. It's a tool for building arbitrary hardware, but it will always be inferior to the dedicated thing.
  11. sylvesterink

    sylvesterink Active Member

    Messages:
    907
    Likes Received:
    41
    This right here.
    You also have to remember that all the custom circuitry paths in an FPGA add to the overall latency of the hardware. For those who aren't EEs, the primary limitation for almost any modern hardware is how fast a signal propagates. To put it simply, if you feed a number into a black box of circuitry, the result doesn't come out the other end immediately, because each component adds a tiny delay to the calculation. This is what limits clock speeds: the clock period can't be shorter than that propagation delay, otherwise the timing of the system desynchronizes and calculation events happen out of the order they should.
    (I don't know if you got the chance to hook up your FPGA to an oscilloscope and see the delay for yourself when you worked with them, BulletMagnet, but if you did, you'd have seen exactly what I refer to.)
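    To put rough (made-up) numbers on it: if the longest path through a block of logic takes 5 ns for the signal to settle, the clock period has to be at least 5 ns, which caps the clock rate at 1 / 5 ns = 200 MHz. Halve that path delay and you can double the clock. Every extra routing switch an FPGA inserts along a path eats into that budget.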

    An FPGA is a whole mass of this kind of circuitry, and even the unused parts (assumed to be straight wires) still add a slight bit of delay. This is why production components use fixed-architecture hardware: all those unused components are cut out, allowing better signal propagation and lower latency.

    Furthermore, signal propagation is the same reason multiplication and division operations are so challenging to implement. The basic algorithms themselves aren't too complex, but implementing them in hardware tends to take a lot of component circuitry, again adding to the delay. (Note that this is especially critical in modern chip design, which tends to implement a lot of features, such as pipelining and partial cycles, that rely heavily on synchronization.)

    As for having these components when they aren't needed, most modern chip implementations bypass the slower ones, such as addition and multiplication, when they aren't in use, avoiding most of the added signal delay. (This is, in fact, why those types of operations may take extra cycles when they are used. However, I haven't looked into recent chip techniques, so I might be a bit out of date on this.)

    Oh, and for those curious about how to program an FPGA, the most basic method is to use software to lay out the actual circuitry for the board to emulate. However, more experienced developers use hardware description languages like Verilog to specify the configuration. Either way, the result is "compiled" down into instructions for which paths to enable on the FPGA, producing a circuit that behaves in the desired manner.

    [EDIT]
    I forgot to mention that signal propagation plays a big role in limiting multi-core development too. The cores have to be synchronized to a certain degree, no matter how multi-core friendly you make your code. The more cores you have, the more hardware you need to coordinate them, again adding to the signal latency. As a result, adding more cores gives diminishing returns, even if the code is ideally multi-core friendly. There's a certain point at which having more cores reduces performance because of all the added hardware.
    In fact, last time I checked, for general-purpose computing, 8 cores seemed to be the sweet spot, and beyond 16 cores performance decreases. (I know there are huge 64-core implementations, but these tend to be used in servers, an environment where the extra hardware isn't as critical, as those cores don't need to coordinate as much.)
    Technology moves on, so perhaps these thresholds have risen by now, but the limits will still persist.
  12. apocatequil

    apocatequil Member

    Messages:
    109
    Likes Received:
    9
    On a derail of a derail: the mention of a million units with 64 cores finally makes sense. lol

    However, this back and forth is fascinating. It feels like I want to add to this conversation, but the points made about synchronization and latency in the FPGA's ability to change make me realize that even though I'm still on BulletMagnet's side, I've got nothing valuable to say. Other than maybe a token comment that BulletMagnet is talking about things that require and imply advancements in technology, and in the thought process of computing in general.

    Perhaps the possibility of simulating multiple cores at once, all on a single FPGA, could take some of the drag and delay out of that dance of synchronizing multiple cores? Probably not... But the potential of FPGAs enhancing computer strength is definitely there, just hard to get at.
  13. sylvesterink

    sylvesterink Active Member

    Messages:
    907
    Likes Received:
    41
    There are no sides, really. We all want sweet new tech, and BulletMagnet is correct in that specialized hardware implementations are faster and more efficient than using generalized hardware to run specialized software. But the reality is that with the current level of technology, there are too many technical disadvantages involved with going the FPGA route. (Of course, as with all computer technology, this may change in the near future.)
  14. apocatequil

    apocatequil Member

    Messages:
    109
    Likes Received:
    9
    Ahh, indeed, there are no sides. lol I often have that problem. But yeah, really cool conversation, and I think it's come to its end, unless someone has some awesome info to push forward about how technology may be changing to open or close that possibility.
