Predictive Coding Approximates Backprop along Arbitrary Computation Graphs (Paper Explained)

  • Published 9 Jun 2024
  • #ai #biology #neuroscience
    Backpropagation is the workhorse of modern deep learning and a core component of most frameworks, but it has long been known that it is not biologically plausible, driving a divide between neuroscience and machine learning. This paper shows that Predictive Coding, a much more biologically plausible algorithm, can approximate Backpropagation for any computation graph, which they verify experimentally by building and training CNNs and LSTMs using Predictive Coding. This suggests that the brain and deep neural networks could be much more similar than previously believed.
    OUTLINE:
    0:00 - Intro & Overview
    3:00 - Backpropagation & Biology
    7:40 - Experimental Results
    8:40 - Predictive Coding
    29:00 - Pseudocode
    32:10 - Predictive Coding approximates Backprop
    35:00 - Hebbian Updates
    36:35 - Code Walkthrough
    46:30 - Conclusion & Comments
    Paper: arxiv.org/abs/2006.04182
    Code: github.com/BerenMillidge/Pred...
    Abstract:
    Backpropagation of error (backprop) is a powerful algorithm for training machine learning architectures through end-to-end differentiation. However, backprop is often criticised for lacking biological plausibility. Recently, it has been shown that backprop in multilayer-perceptrons (MLPs) can be approximated using predictive coding, a biologically-plausible process theory of cortical computation which relies only on local and Hebbian updates. The power of backprop, however, lies not in its instantiation in MLPs, but rather in the concept of automatic differentiation which allows for the optimisation of any differentiable program expressed as a computation graph. Here, we demonstrate that predictive coding converges asymptotically (and in practice rapidly) to exact backprop gradients on arbitrary computation graphs using only local learning rules. We apply this result to develop a straightforward strategy to translate core machine learning architectures into their predictive coding equivalents. We construct predictive coding CNNs, RNNs, and the more complex LSTMs, which include a non-layer-like branching internal graph structure and multiplicative interactions. Our models perform equivalently to backprop on challenging machine learning benchmarks, while utilising only local and (mostly) Hebbian plasticity. Our method raises the potential that standard machine learning algorithms could in principle be directly implemented in neural circuitry, and may also contribute to the development of completely distributed neuromorphic architectures.
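    To make the abstract's "local and Hebbian" claim concrete, here is a minimal NumPy sketch of the layerwise scheme it describes (a toy two-layer network with arbitrary sizes, a tanh nonlinearity and hand-picked learning rates; an illustration, not the authors' implementation): the output value node is clamped to the target, the hidden value node is relaxed until the prediction errors settle, and each weight then updates from quantities local to its own layer.
```python
import numpy as np

def tanh(x):
    return np.tanh(x)

def dtanh(x):
    return 1.0 - np.tanh(x) ** 2

# Toy two-layer network: x -> h = tanh(W1 @ x) -> y_hat = W2 @ h
rng = np.random.default_rng(0)
W1 = 0.1 * rng.normal(size=(16, 8))
W2 = 0.1 * rng.normal(size=(4, 16))
x, y = rng.normal(size=8), rng.normal(size=4)

# Forward pass: initialise value nodes at their feedforward predictions,
# then clamp the output node to the target.
v1 = tanh(W1 @ x)
v2 = y.copy()

# Inference phase: relax the hidden value node. Each step uses only the
# error at this layer and the error one layer above -- nothing global.
lr_v, lr_w = 0.1, 0.01
for _ in range(100):
    e1 = v1 - tanh(W1 @ x)   # prediction error at the hidden layer
    e2 = v2 - W2 @ v1        # prediction error at the output layer
    v1 += lr_v * (-e1 + W2.T @ e2)

# Learning phase: each weight update is local and Hebbian-like, using only
# its own layer's error and its own layer's (pre-synaptic) input.
e1 = v1 - tanh(W1 @ x)
e2 = v2 - W2 @ v1
W2 += lr_w * np.outer(e2, v1)
W1 += lr_w * np.outer(e1 * dtanh(W1 @ x), x)
```
    At the equilibrium of the relaxation, the errors play the role of the backprop deltas, which is the sense in which the abstract says predictive coding converges to exact backprop gradients.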
    Authors: Beren Millidge, Alexander Tschantz, Christopher L. Buckley
    Links:
    YouTube: / yannickilcher
    Twitter: / ykilcher
    Discord: / discord
    BitChute: www.bitchute.com/channel/yann...
    Minds: www.minds.com/ykilcher
    Parler: parler.com/profile/YannicKilcher
    LinkedIn: / yannic-kilcher-488534136
    If you want to support me, the best thing to do is to share out the content :)
    If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
    SubscribeStar: www.subscribestar.com/yannick...
    Patreon: / yannickilcher
    Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
    Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
    Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
    Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n
  • Science & Technology

COMMENTS • 84

  •  3 years ago +7

    Thank you for analyzing this awesome paper Yannic, much appreciated.

  • @sebastianmestre8971
    @sebastianmestre8971 3 years ago +5

    If I understand correctly, you first do a forward pass to make some guesses, then you do a backward pass to find better guesses, then you do a parallel pass to improve weights. (though you can kick off the weight refinement on a separate thread as soon as you find the improved guess)
    The cool thing is that we can refine weights on multiple layers at once, instead of going one at a time, even if there are a few sequential steps before that.
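    (To illustrate the parallelism this comment describes: once the relaxed values and errors are available, the per-layer weight updates are independent of each other, so they can run concurrently. A hypothetical sketch; `weights`, `errors` and `inputs` are illustrative placeholders, not the paper's code.)
```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def update_layer(W, error, inp, lr=0.01):
    # A layer's update needs only its own error and its own input,
    # so every layer can be updated at the same time.
    W += lr * np.outer(error, inp)
    return W

# Illustrative placeholders for whatever the inference phase produced per layer.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(8, 8)) for _ in range(4)]
errors = [rng.normal(size=8) for _ in range(4)]
inputs = [rng.normal(size=8) for _ in range(4)]

with ThreadPoolExecutor() as pool:
    list(pool.map(update_layer, weights, errors, inputs))
```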

    • @1234dbk
      @1234dbk 9 months ago +2

      This might be a silly question, but if you are doing a backwards pass to refine your guesses, doesn't that still fail to solve the main issue this was created for in the first place -- the lack of bidirectionality in biological circuits (for example, an RNN of synapses along neural pathways)? To generalize: if the graph of nodes is purely one-directional, how would information about the error be sent backwards after calculating it?

  • @rockapedra1130
    @rockapedra1130 3 years ago +4

    This was super helpful! Thanks! I love this channel !!

    • @cedricvillani8502
      @cedricvillani8502 3 years ago +1

      Which part exactly was helpful?

    • @rockapedra1130
      @rockapedra1130 3 years ago +1

      @@cedricvillani8502 well ...all of it! He goes from the abstract and motivation to describing the general idea with simplified drawings to analyzing each equation to commenting on figures to dissecting the code and finally to his considered opinion about the whole thing.
      For me, this level of comprehension would take weeks (at least). Plus there are tons of papers out there and he filters and reviews "what's hot" papers, another huge time saver!
      This channel is awesome!!!

  • @gruffdavies
    @gruffdavies 3 years ago

    This could be a gamechanger. Thanks for the analysis!

  • @leylakhenissi6641
    @leylakhenissi6641 3 years ago +6

    Thank you for the paper presentation, it's really well done and provides a useful overview of the topic and the paper. May I kindly ask that in the future you refrain from poking fun at other people's code though? It may keep others, especially in scientific computing, from making their code open/public, which would be a shame for everyone. Cheers.

  • @JamesAwokeKnowing
    @JamesAwokeKnowing 3 years ago +19

    I think the big deal (other than plausibility) only makes sense in the context of hardware. With this scheme you can build local hardware neurons which only compute locally. In software it seems like a "backward pass" because a central processor goes around computing for all the neurons. Instead, imagine a CUDA core per neuron which never needs to load memory from anywhere except the other cores it's physically connected to.

    • @fimbulvntr
      @fimbulvntr 3 years ago +4

      Also, again thinking about hardware, this would enable "dynamic scaling" of a network, where you simply throw more neurons into the mix (since they're all clones and independent). I.e. imagine a GPU where you can bolt on extra CUDA cores, ad infinitum.
      The current model needs to know the entire topology before it can work (maybe I am wrong and misunderstood, I am a layman).

    • @eelcohoogendoorn8044
      @eelcohoogendoorn8044 3 years ago +1

      Exactly; where this becomes relevant is with hardware that is explicitly simplified to take advantage of this compute structure that presumably does not need any global connections.

    • @ssssssstssssssss
      @ssssssstssssssss 3 years ago +2

      I am doubtful about the "plausibility" argument, but the realization of such a learning mechanism in hardware seems to me a very powerful argument. I imagine we could get analog processors to carry out this learning algorithm incredibly fast.

    • @23kl104
      @23kl104 3 years ago +3

      Can't you just as well make the same case for backpropagation?
      Imagine a bunch of backprop neurons only receiving information from their neighboring nodes (last hidden state for forward pass / gradient of next node for backward pass).

  • @ssssssstssssssss
    @ssssssstssssssss 3 years ago +7

    Interesting paper. But this still does not seem biologically plausible to me, which they stated as the purpose. Not to mention, from what I see, so-called predictive coding is a variant of backpropagation (implementing dynamic programming), so saying it approximates backpropagation is misleading. They should qualify the title as "Predictive Coding Approximates Backpropagation with Gradient Descent".

  • @JTMoustache
    @JTMoustache 3 years ago +31

    Predictive coding is a red herring; it is really a dynamic-programming version of variational gradient descent.

    • @skdx1000
      @skdx1000 3 years ago +5

      Yeah, it seems analogous to using a Taylor series to approximate a function, where in this case the error term corresponds to the nth-derivative differential multiplier and the function is the evaluation of the original LSTM cell.

    • @jordyvanlandeghem3457
      @jordyvanlandeghem3457 3 years ago +1

      @@skdx1000 oomph what resources should I check to understand this reply? :)

    • @skdx1000
      @skdx1000 3 years ago +4

      @@jordyvanlandeghem3457 This link explains what a Taylor series is: brilliant.org/wiki/taylor-series/. From there you can check the derivation formula against the techniques used in the paper Yannic explains, and compare how the error-term technique in this paper corresponds to how a Taylor series approximates error using the derivative.

    • @AbeDillon
      @AbeDillon 3 years ago +6

      I don't see anything wrong with giving "a dynamic programming version of variational gradient descent" a shorter name, like "predictive coding". What makes it a red herring?

    • @peterfireflylund
      @peterfireflylund 3 years ago +1

      @@jordyvanlandeghem3457 Take a look at 3Blue1Brown. He has a series of videos that explain Taylor series intuitively. In order to REALLY understand them, you need to understand calculus and do lots of homework exercises, of course. But maybe the videos are enough for you? Or maybe just the Brilliant link was enough? Up to you :)

  • @woolfel
    @woolfel 3 years ago +1

    This paper makes me ask this question. After you've trained a base model, could the local errors reduce the need to backprop during re-training? If that's possible, would it actually reduce the cost of retraining base models?

  • @v.gedace1519
    @v.gedace1519 3 years ago +1

    WOW! The paper is great. Your explanations are greater!

  • @subarashii1368
    @subarashii1368 3 years ago +5

    I feel it just keeps the input/target fixed, then back-propagates one layer per iteration. In real life, you don't keep the input fixed until your brain forms an equilibrium.

    • @Yash-vm4uk
      @Yash-vm4uk 3 years ago +2

      It is still using back-propagation, which he said is not possible in the brain, just done by looping, so how is this biologically possible?

  • @raunaquepatra3966
    @raunaquepatra3966 3 years ago +4

    If in the inner loop (where they update the guesses with 100 iterations or so) we only run it once, and instead of updating the predictions in small steps we just add the whole error, doesn't it then become normal backprop? 🤨
    Please correct me if I am wrong.
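    (The intuition in this question lines up fairly closely with the paper's fixed-point argument. Sketched below under a squared-error loss and in generic layerwise notation, not the paper's exact symbols: if the predictions are held at their feedforward values and the full error is swept down once, the error recursion is exactly the backprop delta recursion; the many small relaxation steps are what let every layer update simultaneously from local information instead.)
```latex
% Prediction \hat{v}_l = f(W_l v_{l-1}) held at its feedforward value,
% output clamped to the target y, squared-error loss L:
e_L = y - \hat{v}_L, \qquad
e_l = W_{l+1}^{\top}\!\left( e_{l+1} \odot f'(W_{l+1} v_l) \right)
    = -\,\frac{\partial L}{\partial v_l}
% The weight update (e_l \odot f'(W_l v_{l-1}))\, v_{l-1}^{\top} is then the
% negative gradient, i.e. exactly a gradient-descent / backprop step.
```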

  • @probbob947
    @probbob947 3 years ago +2

    The structure of the update rule resembles a graph Laplacian.

  • @lucidraisin
    @lucidraisin 3 years ago +2

    Thank you!!

    • @lucidraisin
      @lucidraisin 3 years ago +2

      Nobody could have explained it as well as you did!

    • @kimanthony1667
      @kimanthony1667 3 years ago +2

      Next project ==> lucidrains/predictive-coding-backprop-pytorch

  • @herp_derpingson
    @herp_derpingson 3 years ago +11

    21:30 I wonder how skip connections would look in this system.
    34:20 I wonder whether we should run it to convergence, or whether that would cause instability as it overfits to the batch.
    I am not sold on this. We are still sending information backward. How is this biologically feasible?

    • @linminhtoo
      @linminhtoo 3 years ago +3

      Looks like it happens through the local 'feedback' connections between neurons.
      So the main difference from backprop is that the gradient doesn't need to be computed exactly, in one pass, all the way from the loss value back to the very first neurons that received the input. We can just do it locally and it approximates backprop (which makes sense, since the errors are being sent backwards anyway).

    • @herp_derpingson
      @herp_derpingson 3 years ago +1

      @@linminhtoo Regardless of whether it is done in one pass or multiple, bidirectional propagation is not feasible.

    • @ibrax1
      @ibrax1 3 years ago +7

      @@herp_derpingson Biological neurons do have local feedback dendrites.

    • @wunkewldewd
      @wunkewldewd 3 years ago +5

      I was confused by this too! I have two qualms: A) it seems like this still requires sending info backwards like you said, so I don't see how it solves the problem... and B) backprop could be considered local IMO: even though the gradient at some much earlier layer is dL/dw_1 or whatever, the chain rule decomposition has the effect of breaking it down into a local gradient, da/dw_1, times the error signal from later in the network (the dL/dy * dy/dw * ... etc).
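      (Spelling out qualm B in generic notation, not the paper's: the chain rule does factor each gradient into a piece local to the layer and an error signal from above; the difference lies in how that error signal is obtained.)
```latex
% Qualm (B): the chain rule factors the gradient into a local piece
% times an error signal arriving from later in the network.
\frac{\partial L}{\partial w_1}
  = \underbrace{\frac{\partial a_1}{\partial w_1}}_{\text{local to layer 1}}
    \cdot
    \underbrace{\frac{\partial L}{\partial y}\,
                \frac{\partial y}{\partial a_n} \cdots
                \frac{\partial a_2}{\partial a_1}}_{\delta_1 \;=\; \text{error from above}}
% Backprop obtains \delta_1 with an explicit, sequential backward sweep;
% predictive coding lets each layer's error node relax to (approximately)
% the same \delta_1 using only exchanges with adjacent layers.
```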

    • @danielbrennan5942
      @danielbrennan5942 3 years ago +2

      Long-term potentiation and long-term depression (loosely) follow Hebbian learning rules. If this algorithm also follows those Hebbian rules, it should be biologically plausible.

  • @TheIvanIvanich
    @TheIvanIvanich 3 years ago +11

    More papers about predictive coding!

  • @Zantorc
    @Zantorc 3 years ago +7

    This is interesting, but I'm not sure it's applicable. The brain doesn't use point neurons, nor can it be replicated using them. You'll be lucky to get 2 bits of accuracy out of most neurons. Beyond sensory-motor inputs, the idea that a neuron could output a value which could be compared to some other value is a non-starter. Most connections in the brain are feedback, not feed-forward. The more you know about the brain, the less like the idealised NN it seems.

    • @semjuel3077
      @semjuel3077 3 years ago +1

      @Zantorc Could you explain what you mean by "Most connections are feedback, not feed-forward"?

    • @Zantorc
      @Zantorc 3 years ago +1

      @@semjuel3077 ua-cam.com/video/iccd86vOz3w/v-deo.html
      Explains it quite well.

  • @charleshong1196
    @charleshong1196 3 years ago +2

    I just don't get it. What's the difference? It still needs to backpropagate... the temporal and spatial dependencies have not changed...

    • @YannicKilcher
      @YannicKilcher  3 years ago

      the algorithm is biologically plausible

  • @dm_grant
    @dm_grant 3 years ago +3

    Neurons are not bidirectional. Exactly!

  • @quebono100
    @quebono100 3 years ago +13

    Nice Paper :) tanh-k you

  • @DavidSaintloth
    @DavidSaintloth 3 years ago +3

    This looks a lot like the mechanism I described as salience modulation in a 2013 post:
    sent2null.blogspot.com/2013/11/salience-theory-of-dynamic-cognition.html?m=1
    The back-propagation happens through a salience-driven remapping of stored information in any given sensory dimension, with inference happening continuously as data maps into the networks.
    Tangent: there is some evidence that real neurons do have feedback sub-signals along the firing path, which would make this paper more biologically similar than you asserted.

  • @ayesaac
    @ayesaac 1 year ago

    How is this not just recursion-based backpropagation?
    Predictive coding, as I understand it, is a neuron making guesses about the _input_ it will get, not the output of the next neuron. Then the neuron adjusts its model to better predict its own inputs. That's what makes it local; its learning is based on its input; it doesn't need to know anything else.

  • @kascesar
    @kascesar 3 years ago +1

    Which program did you use to read the papers?

  • @v.gedace1519
    @v.gedace1519 3 years ago +1

    I am pretty sure that the linearity of the decomposition is the issue. Meaning dL/dh2 * dh2/dw2 -> ...h3... -> ...h4....
    Nature does it differently:
    dL/dh2 * dh2/dw2 -> F[h3](L, h3w'3 ... h0w'0) -> F[h2](L, h2w'2 ... h0w'0), where the w'... are weights, aka "feedback connections". Hard to explain using text only, but you get the idea ;-)

    • @23kl104
      @23kl104 3 years ago +3

      no lol, I don't

  • @Yash-vm4uk
    @Yash-vm4uk 3 years ago

    It is still using back-propagation, which he said is not possible in the brain, just done by looping, so how is this biologically possible?

    • @SianaGearz
      @SianaGearz 3 years ago

      Back-propagation as defined is a global mechanism that makes use of the computer implementation of neural networks. However, in the brain there can be no explicit metadata describing the connections, and no direct connections spanning all the way across the brain!
      Two-way communication for the purpose of reinforcement does occur biologically, but it is local, spanning just each pair of adjacent neurons. There are many mysteries regarding the function of biological neural tissue.
      So this paper presents a mechanism which it shows to be identical in result to back-propagation, but which is local only, not global, and appears biologically plausible. It brings us one step closer to understanding the function of biological neural tissue.

  • @amitkumarsingh406
    @amitkumarsingh406 3 years ago +4

    How about the papers in dark mode?

  • @keeperofthelight9681
    @keeperofthelight9681 1 year ago

    It doesn’t for reinforcement learning, though.

  • @diegofcm6201
    @diegofcm6201 3 years ago

    Like Jeff Hawkins says: neurons CANNOT be assigned numerical precision whatsoever. So even if there weren’t any backwards pass, the mere fact that it assumes that much stability in input/output is flawed from the POV of biological plausibility.

    • @diegofcm6201
      @diegofcm6201 3 years ago

      It’s much more likely that it’s something more discrete, with Hebbian learning happening through information sent by neurotransmitters

    • @Hukkinen
      @Hukkinen 3 years ago

      How can neurons not be approximated by numerical representations? I'd say this is just a trade-off between realism and abstraction in the model. Why am I wrong here?

    • @diegofcm6201
      @diegofcm6201 3 years ago

      @@Hukkinen
      TL;DR: It's naive to pick out just a single part of biological neural networks (local update rules) and try to tie it into an artificial one, expecting similar or better performance, without considering most of the other computational aspects of the real thing.
      The idea is that neuronal connections in the actual brain are maintained by STDP (spike-timing-dependent plasticity), a rule that depends not so much on the actual voltage as on behaviour in the long term (potentiation or depression). There are no static weights; they're a dynamical property, evolving over time.
      There are also lots of other things we are neglecting, like the fact that memories are in the connections (somehow) and computation is done in the time domain (tied to the latency of the input neuron before spiking occurs in the outputs, and, just a "small" detail, in bio-neural networks the output neuron can spike before the inputs).

  • @minecraftermad
    @minecraftermad 3 years ago

    I hope I can understand this, because those graphs sure didn't look promising.

  • @hoaxuan7074
    @hoaxuan7074 3 years ago +1

    Well, almost anything will train a neural net, and there is no point in being too clever about it. A dot product is a statistical summary measure and a filter. It will respond to the statistics of the neurons of the prior layers. No neuron can be so exceptional, because its output will be shared by many forward dot products. And any realistic optimisation algorithm will be able to search only a small space of statistical solutions. Is that a bad thing? You exclude many brittle, overfitted solutions.

    • @hoaxuan7074
      @hoaxuan7074 3 years ago

      I guess one way to test that is to delete a weight, or one neuron, and see how badly it affects the net.
      Do you only ever get a small statistical effect, or does such an action sometimes dramatically impact the net?
      Evolutionary algorithms like Continuous Gray Code Optimization can actually train large nets, and can have low network-bandwidth requirements relative to BP for federated learning. Each compute device is given the full network model and part of the training set. The same short, sparse list of mutations to make to the model is sent to each device, and it returns the cost for its part of the training set. The costs are summed, and if there is an improvement an accept-mutations message is sent to each device, else a reject message (see the sketch below).
      Anyway, there is some related chat at 'discourse numenta' under sparse numenta nets.
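      (A generic sketch of the accept/reject loop described in the comment above, run here on a single machine for illustration; it is not the actual Continuous Gray Code Optimization algorithm, and the model, data shards and cost function are toy placeholders. The point is that only a short, sparse mutation list and a scalar cost would travel over the network each round.)
```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(size=1000)                              # full model, replicated on every device
shards = [rng.normal(size=(64, 1000)) for _ in range(4)]     # toy data shards, one per device

def shard_cost(w, shard):
    # Stand-in cost: mean squared output of a linear "net" on this shard.
    return float(np.mean((shard @ w) ** 2))

best_cost = sum(shard_cost(weights, s) for s in shards)
for step in range(200):
    idx = rng.choice(weights.size, size=8, replace=False)    # sparse mutation list
    delta = rng.normal(scale=0.01, size=idx.size)
    weights[idx] += delta                                    # apply tentative mutation on every device
    cost = sum(shard_cost(weights, s) for s in shards)       # each device reports its shard's cost
    if cost < best_cost:
        best_cost = cost                                     # broadcast "accept": keep the mutation
    else:
        weights[idx] -= delta                                # broadcast "reject": roll it back
```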

  • @blacklistnr1
    @blacklistnr1 3 years ago +8

    I'd like to say that I appreciate how you handled discussing this paper. Perhaps this is my biased, incomplete view, but damn, some research is just an overly pompous explanation of a really basic idea that makes you facepalm: "Is that it?". I imagine these guys chuckling with pipes:
    - What should we research next?
    - Well I'd love to do something useful, but all the money seems to go to A.I. these days.. *scratches beard*
    - Oh... these primal monkeys, will they ever understand the beauty of exploring math?
    - I truly don't know, but let's give them what they want: deep networks and backprop.
    - Hasn't that been done like 10000 times already?
    - No no no, we don't do backprop, we break the chain with local variables and call it predictive coding
    - You're mad! *loud laugh* So you want to do 100 LOCAL iterations to propagate what could be done in one pass?
    - You wouldn't say it like that.. use flashy words: neuromorphic, LSTM, etc.
    - Neuromorphic Machine Learning? Isn't that like what we've been calling what we're doing since the 1970s? Have a little dignity, at least call it "Hebbian plasticity"
    - *drinks the whole glass and slams it on the table* Fine with me. Let's get this over with.

    • @gruffdavies
      @gruffdavies 3 years ago +2

      The paper's purpose was to address "biological plausibility" so "Hebbian plasticity" is perfectly appropriate.

  • @albertwang5974
    @albertwang5974 3 years ago +2

    The brain does backpropagation by generating connections between the activating cells and the confirmed result.

  • @yasurikressh8325
    @yasurikressh8325 3 years ago +1

    Doesn’t look hideous to me. If it can be mapped, then it is a beauteous model.

  • @Prince-sf5en
    @Prince-sf5en 3 years ago +9

    Can't believe I'm first here

    • @bassr3hab
      @bassr3hab 3 years ago +2

      haha same here

    • @herp_derpingson
      @herp_derpingson 3 years ago +2

      Can't believe it's not butter

    • @andreassyren329
      @andreassyren329 3 years ago +1

      Oh I had no idea this just premiered.

    • @notgabby604
      @notgabby604 3 years ago +1

      Naw, it's trans-fat margarine, which certainly was a terrible thing.

    • @quebono100
      @quebono100 3 years ago +1

      @@bassr3hab same here on your post xD (recursion?)

  • @Rizhiy13
    @Rizhiy13 3 years ago +2

    Not very convincing so far; the distribution of errors doesn't seem to offer any advantages compared to backprop.

    • @AirmailMRCOOL
      @AirmailMRCOOL 3 years ago +3

      "Advantages" aren't really what they were looking for. They were looking for a biologically possible training method. Your brain doesn't use backprop, so they're just theory shooting what it does use.

  • @444haluk
      @444haluk 3 years ago +1

    This is the smartest thing I have ever heard. I have always hated backprop because at each step it assumes it finds the temporarily perfect solution. This approach fixes that monstrosity.

    • @23kl104
      @23kl104 3 years ago +3

      No, it doesn't. It finds the locally steepest direction.

  • @quAdxify
    @quAdxify 2 years ago

    This is a bit difficult to understand. I think it just needs a bit more theoretical context for all the people not familiar with variational inference. For interested viewers, here is an excellent review (including predictive coding and VI) by the authors of the discussed paper (I believe): arxiv.org/pdf/2107.12979.pdf