Note: Videos of Lectures 28 and 29 are not available because those were in-class lab sessions that were not recorded.
We love Gilbert Strang and MIT OCW
Thank you for posting this. I was wondering and you answered my question. Thanks again for posting these.
My guess is that Lectures 28 and 29 reveal the real essence of neural network. Just kidding!
ok....I was asked this question, and now I found the answer here.
It may be worth noting that instead of partial derivatives one can work with derivatives as the linear transformations they really are, and that looking at networks in a more structured manner makes clear how the basic ideas of BPP apply to much more general cases. Several steps are involved.
1.- More general processing units.
Any continuously differentiable function of inputs and weights will do; these inputs and weights can belong, beyond Euclidean spaces, to any Hilbert space. Derivatives are linear transformations and the derivative of a neural processing unit is the direct sum of its partial derivatives with respect to the inputs and with respect to the weights; this is a linear transformation expressed as the sum of its restrictions to a pair of complementary subspaces.
2.- More general layers (any number of units).
Single-unit layers can create a bottleneck that renders the whole network useless. Putting several units together in a single layer is equivalent to taking their product (as functions, in the sense of set theory). The layer is then a function of the inputs and of the weights of all of its units. The derivative of a layer is the product of the derivatives of the units; this is a product of linear transformations.
3.- Networks with any number of layers.
A network is the composition (as functions, in the set-theoretic sense) of its layers. By the chain rule, the derivative of the network is the composition of the derivatives of the layers; this is a composition of linear transformations (a small numerical sketch follows the list below).
4.- Quadratic error of a function.
...
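To make this concrete, here is a minimal finite-dimensional sketch in plain NumPy (the tanh units, the layer sizes, and all names below are my own illustrative assumptions, not part of the comment above): each layer is a differentiable function of its input, its derivative at a point is a linear transformation (a Jacobian matrix), and the derivative of the composed network with respect to the input is the composition of those linear maps.

```python
import numpy as np

def layer(x, W):                       # one layer of tanh units: x -> tanh(W x)
    return np.tanh(W @ x)

def d_layer_dx(x, W):                  # its derivative wrt the input: diag(1 - tanh^2(Wx)) @ W
    s = 1.0 - np.tanh(W @ x) ** 2
    return s[:, None] * W

rng = np.random.default_rng(0)
x  = rng.normal(size=4)                # input
W1 = rng.normal(size=(3, 4))           # weights of layer 1
W2 = rng.normal(size=(2, 3))           # weights of layer 2

h = layer(x, W1)
J_chain = d_layer_dx(h, W2) @ d_layer_dx(x, W1)   # chain rule: composition of the layer derivatives

# Finite-difference Jacobian of the composed network, for comparison
eps = 1e-6
J_fd = np.zeros((2, 4))
for j in range(4):
    e = np.zeros(4); e[j] = eps
    J_fd[:, j] = (layer(layer(x + e, W1), W2) - layer(layer(x - e, W1), W2)) / (2 * eps)

print(np.max(np.abs(J_chain - J_fd)))  # tiny: the two Jacobians agree
```

The same bookkeeping, carried out for the derivatives with respect to the weights as well, is what backpropagation organizes.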
---
Since this comment is becoming too long I will stop here. The point is that a very general viewpoint clarifies many aspects of BPP.
If you are interested in the full story and have some familiarity with Hilbert spaces please google for papers dealing with backpropagation in Hilbert spaces. A related article with matrix formulas for backpropagation on semilinear networks is also available.
For a glimpse of a completely new deep learning algorithm that is orders of magnitude more efficient, controllable, and faster than BPP, search on this platform for a video about deep learning without backpropagation; its description contains links to demo software.
The new algorithm is based on the following very general and powerful result (google it): Polyhedrons and perceptrons are functionally equivalent.
For the elementary conceptual basis of NNs see the article Neural Network Formalism.
Daniel Crespin
King Gilbert Strang 🙏
He is such an inspiration
Professor Strang this is another great lecture on Backpropagation and finding partial derivatives from a computational graph. Calculus is still an important piece of all upper level mathematics.
lol "what function has that derivative! okay. it's a freshman course"
love this guy
Well the decent problem is at 2187 to me. We all are here to learn
It looks like 137174 is gonna swing you left.
I found this lecture quite confusing. His last example, computing (AB)C vs A(BC), is the key idea. The blog article he talks about misses the main computational point, I think, because it focuses on the composition of only 2 functions, and with just two functions there is little computational advantage of forward vs. backward.
In case it might help others, here is how I think about it: Say everything is linear but the variables have different dimensions:
F(x)=F3(F2(F1(x)))
start with variable x (input layer)
h1=F1(x) = Cx (hidden layer 1)
h2=F2(h1)=B*h1 (hidden layer 2)
y=F3(h2)=A*h2 (output layer)
Then F(x)=A*B*C*x and jacobian(F)=A*B*C
Forward differentiation first computes derivatives of h2 wrt x by multiplying jacobians (JF2)*(JF1). Then finds derivative of y wrt x by multiplying by JF3 --> so this is A*(B*C)
Backward differentiation first computes derivatives of y wrt h1 by multiplying jacobians (JF3)*(JF2). Then finds derivative of y wrt x by multiplying by JF1 --> so this is (A*B)*C
In a neural network, you might have a very large N-dimensional input x (the pixels of an image), hidden layers h1 and h2 with a moderate number of dimensions n, and a scalar output y (the probability the network thinks the image is a cat).
So forward differentiation computes A*(B*C) --> cost is 1*n*N + n^2*N
backward differentiation computes (A*B)*C --> cost is 1*n^2 + 1*n*N
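A rough NumPy illustration of this cost argument (the sizes N and n below are made up for illustration; real networks will differ):

```python
import time
import numpy as np

N, n = 20_000, 200                       # big input dimension, modest hidden dimension
A = np.random.randn(1, n)                # Jacobian of the output layer   (1 x n)
B = np.random.randn(n, n)                # Jacobian of hidden layer 2     (n x n)
C = np.random.randn(n, N)                # Jacobian of hidden layer 1     (n x N)

t0 = time.perf_counter()
forward = A @ (B @ C)                    # forward mode: ~n^2*N + n*N multiplications
t1 = time.perf_counter()
backward = (A @ B) @ C                   # backward mode: ~n^2 + n*N multiplications
t2 = time.perf_counter()

print("same Jacobian:", np.allclose(forward, backward))
print(f"A@(B@C): {t1 - t0:.4f} s,  (A@B)@C: {t2 - t1:.4f} s")
```

Both orders give the same 1 x N Jacobian, but the backward order never forms the n x N product B@C, which is why backpropagation wins when the output is low-dimensional.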
Usually this guy is great when teaching linear algebra, but I think he got thrown into this course for a couple of days and was unprepared. I mean, that discussion went off track quickly.
linear algebra and learning from data 👍👍👍
Thankfully the derivative of x^3(x+3y) wrt x evaluated at x=2, y=3 really does work out to 140. So the method was working. His confusion about whether he needed derivatives wrt y was probably because he would need them going forward (after doing wrt x, do wrt y), but wrt y comes for free doing backpropagation.
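A quick check of that arithmetic (a minimal sketch; I am assuming the function was F(x, y) = x^3(x + 3y), which is the version for which the value 140 comes out — with x^2 the answer at (2, 3) would be 48):

```python
x, y = 2.0, 3.0
F = lambda x, y: x**3 * (x + 3*y)

analytic = 3*x**2 * (x + 3*y) + x**3                  # product rule: 3*4*11 + 8 = 140
numeric  = (F(x + 1e-6, y) - F(x - 1e-6, y)) / 2e-6   # central difference, also ~140
print(analytic, numeric)
```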
But if the number of input nodes in the NN is lower than the number of outputs, then forward mode becomes the cheaper one, like backprop run from the other side; I mean, if you compute the Jacobians and use them in successive computations in the right order, is that so?
Thank you
May I get access to the books mentioned and used in this course?
All notes are available in the textbook for this course, "Linear Algebra and Learning from Data".
@arvindpal-rb7sb Thank you so much for your reply, I appreciate it!
I can't find how to derive the loss function with respect to weight 1 (or 2, 3, etc.), i.e., how the math is done behind the derivative part of the gradient descent formula. Everyone is showing the result or a smaller reshaped formula, but I would need the steps in between. An example where we backprop a single perceptron (with 1 or 2 weights and an L2 loss function) would do it. Please, someone give me a link or a hand. Thanks!
I do understand the essence of the chain rule, but I want to know dL/dw = dL/dPred * dPred/dw (dL/dw is the change in the L2 loss with respect to weight 1; dPred/dw is the derivative of the neuron's output (mx+b) with respect to the weight).
I don't get why the result is 2(target-pred) * input1.
Is that because L2 is a square function and the derivative of x^2 (or error^2) is 2x, so 2*error --> 2(target-pred)? But then why do the other parts of the formula disappear?
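Here is a minimal sketch of the single-perceptron case asked about above (the names w, b, x, target and the loss L = (target - pred)^2 are my own assumptions, not from any particular tutorial). The other parts do not disappear: the input is exactly dPred/dw, and the inner derivative of (target - pred) contributes a factor of -1, which many texts fold into the sign of the update rule.

```python
w, b = 0.7, -0.2            # one weight and one bias
x, target = 1.5, 1.0        # one input and its desired output

pred      = w * x + b                  # neuron output (no activation, for simplicity)
dL_dpred  = -2.0 * (target - pred)     # d/dpred of (target - pred)^2
dpred_dw  = x                          # d/dw of w*x + b  -> this is where input1 comes from
dL_dw     = dL_dpred * dpred_dw        # chain rule: -2*(target - pred)*x = 2*(pred - target)*x

# Numerical check with a finite difference
eps = 1e-6
L = lambda w_: (target - (w_ * x + b)) ** 2
print(dL_dw, (L(w + eps) - L(w - eps)) / (2 * eps))   # the two numbers should match
```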
The lectures on backpropagation are OK, but you need the textbook they reference.
I always thought mathematicians would write very properly.
It may be worth noting that instead of partial derivatives one can work with derivatives as the linear transformations they really are.
Also, looking at the networks in a more structured manner makes clear that the basic ideas of BPP apply to very general types of neural networks. Several steps are involved.
1.- More general processing units.
Any continuously differentiable function of inputs and weights will do; these inputs and weights can belong, beyond Euclidean spaces, to any Hilbert space. Derivatives are linear transformations and the derivative of a neural processing unit is the direct sum of its partial derivatives with respect to the inputs and with respect to the weights. This is a linear transformation expressed as the sum of its restrictions to a pair of complementary linear subspaces.
2.- More general layers (any number of units).
Single-unit layers can create a bottleneck that renders the whole network useless. Putting several units together in a single layer is equivalent to taking their product (as functions, in the sense of set theory). The layer is then a function of the inputs and of the weights of all of its units. The derivative of a layer is the product of the derivatives of the units; this is a product of linear transformations.
3.- Networks with any number of layers.
A network is the composition (as functions, and in the set theoretical sense) of its layers. By the chain rule the derivative of the network is the composition of the derivatives of the layers; this is a composition of linear transformations.
4.- Quadratic error of a function.
...
---
With the additional text below this is going to be excessively long, so I will stop the itemized comments here.
The point is that a sufficiently general, precise and manageable foundation for NNs clarifies many aspects of BPP.
If you are interested in the full story and have some familiarity with Hilbert spaces please google for our paper dealing with Backpropagation in Hilbert spaces. A related article with matrix formulas for backpropagation on semilinear networks is also available.
We have developed a completely new deep learning algorithm called Neural Network Builder (NNB) which is orders of magnitude more efficient, controllable, precise and faster than BPP.
The NNB algorithm assumes the following guiding principle:
The neural networks that recognize given data, that is, the “solution networks”, should depend only on the training data vectors.
Optionally the solution network may also depend on parameters that specify the distances of the training vectors to the decision boundaries, as chosen by the user and up to the theoretically possible maximum. The parameters specify the width of chosen strips that enclose decision boundaries, from which strips the data vectors must stay away.
When using traditional BPP the solution network depends, besides the training vectors, on guessing a more or less arbitrary initial network architecture and initial weights. Such is not the case with the NNB algorithm.
With the NNB algorithm the network architecture and the initial (same as the final) weights of the solution network depend only on the data vectors and on the decision parameters. No modification of weights, whether incremental or otherwise, needs to be done.
For a glimpse into the NNB algorithm, search on this platform for our video:
NNB Deep Learning Without Backpropagation.
In the description of the video you will find links to free demo software.
The new algorithm is based on the following very general and powerful result (google it): Polyhedrons and Perceptrons Are Functionally Equivalent.
For the conceptual basis of general NNs, see our article Neural Network Formalism.
Regards,
Daniel Crespin
No doubt he is a great professor, but this example for backpropagation was so disappointing!
Agree, it's easy to understand backpropagation from a textbook. His other lectures are amazing, especially the ones on PCA and SVD.
final call of steve jobs
awesome man
Man, I love Professor Strang, but his brain was not working very well during this lecture.
Such a mess, I don't get it.