xkcd is the funniest thing I know
https://cloudplatform.googleblog.co...-machine-learning-tasks-with-custom-chip.html Now I am wondering what specifically those chips do. I thought graphics cards were already pretty good at machine learning number crunching.
Neural nets are fascinating. Is it just a matter of building a big enough net with a huge number of layers to get something that can do all sorts of different tasks without being specifically retrained?
It's not that simple; research has struggled for years (and still does) to find ways to effectively train larger and especially deeper nets. A network can't do anything it has not been trained for, although the Atari-playing nets, for example, could in some cases learn to play different (but somewhat similar) Atari games with the same net. It can also be possible to train a single network for two different tasks.

The obvious benefit of very deep nets with 100k+ neurons is shown in the video: each layer adds some abstraction to the problem. However, training those nets is a problem. Learning works because the whole neural net is really just one big formula that takes a very high-dimensional vector as input and produces a smaller vector as output. In between there can be tens or even hundreds of thousands of free parameters that define what the network does. To change them so that the network does what it is supposed to do, you put in an example input, look at the output, calculate how wrong it was, and then take the gradient of the function that is the whole network and simply descend that gradient. It's like walking downhill to find the lowest point in a 200k-dimensional valley.

The problem is that the deeper your network, the less noticeable the gradient is in the earlier layers. The gradient usually grows exponentially weaker as it is moved backwards through the network (in some rare situations it may instead explode, which is just as problematic). This is bad for two reasons:

- The early layers take forever to learn.
- The later layers learn quickly, but based on the still nearly untrained (randomly behaving) early layers. So they learn to interpret the random chaos that the still really "stupid" first layers produce, and later the early layers end up learning how to produce even better chaos for the later layers.

This means that naively adding more layers to a "vanilla" network will make it less capable of learning, not more. You need to come up with pretty creative ways to set up your network, and lots of trickery, to combat this. The bad part: so far it seems that if your learning is based on gradient descent, you'll get this kind of problem, and at the same time gradient descent still seems to be by far the best way to train a network.

The reason networks like the one in the video can still exist is mainly that people cheated by copying what the brain seems to do in the visual cortex: connect the early layers of the network not to the whole image, but instead let the earliest layer, for example, only look at small 5x5 regions. That drastically reduces the number of free parameters that need to be tuned, and it forces that layer to only learn useful things that can come out of such a small region, so it will end up learning things like edge detection. If you let it train long enough, that is; a network like the one shown in the video can easily take a week or two on a high-end graphics card to train.
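To make the vanishing-gradient point a bit more concrete, here's a rough NumPy sketch (the depth, width and initialisation are arbitrary picks of mine, not anything from the video): it stacks a bunch of sigmoid layers, pushes a made-up gradient back through them and prints how its size shrinks layer by layer.

```python
import numpy as np

# Rough sketch of the vanishing-gradient effect. Depth, width and the
# initialisation are arbitrary choices for illustration.

rng = np.random.default_rng(0)
n_layers = 20
width = 64

# One random weight matrix per layer (roughly Xavier-scaled).
weights = [rng.normal(0.0, 1.0, size=(width, width)) / np.sqrt(width)
           for _ in range(n_layers)]

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Forward pass, keeping every layer's output around for the backward pass.
x = rng.normal(size=width)
activations = [x]
for W in weights:
    x = sigmoid(W @ x)
    activations.append(x)

# Backward pass: start with a made-up gradient at the output and apply the
# chain rule layer by layer, walking back towards the input.
grad = np.ones(width)
for i, (W, a) in enumerate(zip(reversed(weights), reversed(activations[1:]))):
    grad = W.T @ (grad * a * (1.0 - a))   # gradient of sigmoid(W @ x) w.r.t. x
    print(f"layer {n_layers - i:2d}: |grad| = {np.linalg.norm(grad):.3e}")
```

By the time the gradient reaches the first layer its norm has typically dropped by several orders of magnitude, which is exactly why those early layers barely move during training.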
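And for the 5x5-region trick, a tiny back-of-the-envelope comparison (the image size and number of feature maps are just numbers I picked, not from the video):

```python
# Weight count of a fully connected first layer vs. one that only looks at
# 5x5 regions with shared filters. Image size and feature map count are
# assumed values for illustration.

h, w = 224, 224      # assumed greyscale input image
hidden = 64          # assumed number of units / feature maps

fully_connected = (h * w) * hidden   # every pixel wired to every unit
conv_5x5 = (5 * 5) * hidden          # one shared 5x5 filter per feature map

print(f"fully connected first layer: {fully_connected:,} weights")  # 3,211,264
print(f"5x5 conv first layer:        {conv_5x5:,} weights")         # 1,600
```

That weight sharing is what makes the first layer cheap enough to train and nudges it towards generic local features like edge detectors.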
https://shattered.it/ Running SHA-1 on two different files and getting the same result is quite a weird feeling.
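If you want that weird feeling yourself, a quick check with Python's hashlib (assuming you've downloaded the two colliding PDFs from the site; the file names below are the ones the project publishes, adjust if yours differ):

```python
import hashlib

# Compare SHA-1 and SHA-256 digests of the two colliding PDFs.
def digests(path):
    data = open(path, "rb").read()
    return hashlib.sha1(data).hexdigest(), hashlib.sha256(data).hexdigest()

sha1_a, sha256_a = digests("shattered-1.pdf")
sha1_b, sha256_b = digests("shattered-2.pdf")

print("SHA-1 equal:  ", sha1_a == sha1_b)      # True  - the collision
print("SHA-256 equal:", sha256_a == sha256_b)  # False - different files
```

The SHA-1 digests come out identical while SHA-256 still tells the files apart, which is exactly the unsettling part.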