Very nice work, especially the sinusoidal activation. I'd like to point out that Candès covered this rigorously in 1997 in "Harmonic Analysis of Neural Networks", which treats periodic activation functions ("admissible neural activation functions"). Strangely enough, that paper is not even cited by the authors.
How does this compare to just taking the Fourier (or discrete cosine) transform of the signal?
The application of this kind of representation to 3D rendering is fascinating. Could it be that in the future modelers will give up on the polygons+textures model and represent the whole scene with a neural network instead?
It requires huge processing time at the inference stage. Take SDFs as an example: you'll need to query the network a huge number of times to find out where the surface is (it also depends on the discretization resolution). I think it's currently only good for offline use.
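To illustrate the cost, here's a rough sketch of naive sphere tracing against a learned SDF. The `siren_sdf` model (mapping 3D points to signed distances), the step count and the tolerance are all hypothetical, not from the paper:

```python
import torch

def sphere_trace(siren_sdf, origins, dirs, n_steps=64, eps=1e-3):
    """Naive sphere tracing against a neural SDF: one full network
    query per step for every ray, so cost grows quickly with resolution."""
    t = torch.zeros(origins.shape[0], 1, device=origins.device)
    for _ in range(n_steps):
        points = origins + t * dirs          # current sample along each ray
        with torch.no_grad():
            dist = siren_sdf(points)         # one forward pass per step
        t = t + dist                         # march by the predicted distance
        if (dist.abs() < eps).all():         # all rays converged to the surface
            break
    return origins + t * dirs
```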
@@kwea123 I expressed myself poorly. I was thinking of rendering, not modeling.
More precisely, I was thinking of the problem of rendering scenes with very high poly counts, for instance where a very long draw distance is required. Currently modelers have to use levels of detail (LOD), but that technique has limitations.
@@luciengrondin5802 That is addressed by another approach; take a look at UE5.
@@luciengrondin5802 I hope the network does not have to learn to give the pixel values for a coordinate, but could also learn to give coordinates and pixel values for an index. The real issue would be the compression stage, which requires training a network of appropriate size on the scene to be "compressed".
The music part was outstanding. Audio waveforms are just stacked sine waves, as opposed to images or text, where the input may not be so related to the sine function. So it just feels right to use sine activations (with the required tweaks to make them work) instead of ReLUs. But I'm going to be careful with this: even though I have some experience in ML, I haven't ever touched anything other than ReLUs, sigmoids, tanh, and straight-up linear activations.
You can approximate _everything_ with stacked sine waves. All modern video and image compression algorithms are based on that.
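As a toy illustration of that idea, here's a sketch using SciPy's DCT on a synthetic image rather than a real photo; the 8x8 cutoff is arbitrary:

```python
import numpy as np
from scipy.fft import dctn, idctn

# Toy "image": a smooth 2D signal built from two cosines.
x = np.linspace(0, 1, 64)
img = np.outer(np.cos(4 * np.pi * x), np.cos(2 * np.pi * x))

# DCT-based compression in the JPEG spirit: keep only the
# lowest-frequency 8x8 block of coefficients, zero the rest.
coeffs = dctn(img, norm="ortho")
kept = np.zeros_like(coeffs)
kept[:8, :8] = coeffs[:8, :8]
approx = idctn(kept, norm="ortho")

# Most of the energy sits in the low-frequency block, so the
# reconstruction from a handful of cosine terms is already close.
print("relative error:", np.linalg.norm(img - approx) / np.linalg.norm(img))
```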
@@Oktokolo Let me rephrase that, then. Audio waveforms can be approximated by a relatively SMALL number of stacked sine waves, so it feels natural to use them in NNs. Everything can be approximated by an infinite number of sine waves, but sometimes it doesn't make sense to do it.
@@TileBitan It obviously makes sense for images, as that is what the best compression algorithms use. It should also be possible to encode text reasonably well, even though the resulting set of weights is probably larger than the text itself when you're not encoding the input of a huge language model...
@@Oktokolo I don't understand. Sounds are waves of varying amplitude with frequencies inside the hearing range. Images nowadays can be 100M pixels with three channels of only 256 levels each in the BEST case, and the relationships between pixels can be really close to nothing. The case is completely different, and the text case doesn't really have much to do with a wave. They might use FFTs for images, but you have to agree with me: for the same error, you need way, way fewer terms for sound than for images.
@@TileBitan It doesn't matter whether it looks like it has anything to do with a wave, or whether adjacent values look like they are in any relation to each other.
Treating data as signals and then encoding the signal as stacked waves just works surprisingly well.
It might not work well for truly random bit noise. But most data interesting to humans seems to exhibit a surprisingly low entropy and can be compressed using stacked sines.
The first layer, sin(Wx + b), could be thought of as a vector of waves with frequencies w_i and phase offsets b_i. After the second linear layer, we have a vector of trigonometric series which look like Fourier expansions, except that the frequencies and phase offsets can be anything. Although the next nonlinearity might do something new, we can already represent any function with the first one and a half layers. What advantages does this approach offer over representing and evaluating functions as a Fourier series?
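To make that reading concrete, here's a minimal PyTorch-style sketch of a sine layer. The w0 = 30 scaling and the uniform initialization bounds follow my reading of the paper, so treat the exact values as assumptions:

```python
import math
import torch
import torch.nn as nn

class SineLayer(nn.Module):
    """y = sin(w0 * (Wx + b)): each output unit is a wave whose
    frequency comes from a row of W and whose phase comes from b."""
    def __init__(self, in_f, out_f, w0=30.0, is_first=False):
        super().__init__()
        self.w0 = w0
        self.linear = nn.Linear(in_f, out_f)
        with torch.no_grad():
            bound = 1.0 / in_f if is_first else math.sqrt(6.0 / in_f) / w0
            self.linear.weight.uniform_(-bound, bound)

    def forward(self, x):
        return torch.sin(self.w0 * self.linear(x))

# A sine layer followed by a linear readout already gives a sum of sinusoids
# with learned frequencies and phases (a "free-form" Fourier series).
model = nn.Sequential(SineLayer(2, 256, is_first=True), nn.Linear(256, 3))
coords = torch.rand(1024, 2) * 2 - 1   # pixel coordinates in [-1, 1]
rgb = model(coords)                    # predicted values at those coordinates
```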
Because you can learn the representation of lots of different signals via gradient descent?
@@_vlek_ I think this is it, indeed. Efficient Fourier transform algorithms only work with regularly sampled signals and, if I'm not mistaken, only in low dimensions. This machine learning approach can work with any kind of signal, I think.
Fourier series are linear
@@isodoubIet The Fourier transform is linear. The Fourier series is not. I assume you're implying that the neural net is fundamentally more expressive by being nonlinear, but the Fourier series is also nonlinear.
@@convolvr Eh, no. If you have a smooth periodic signal, it's still expressible as a linear combination of Fourier components, so yes, this is fundamentally more expressive.
The arXiv version has an incorrect reference. The paper states "or positional encoding strategies proposed in concurrent work [5]", and the video mentions a paper from 2020, but reference [5] in your current arXiv version is C. Ballester, M. Bertalmio, V. Caselles, G. Sapiro, and J. Verdera, "Filling-in by joint interpolation of vector fields and gray levels," IEEE Trans. on Image Processing, 10(8):1200-1211, 2001. I believe this should reference what you list as [35].
Yes, "NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis" uses positional encoding. And they recently published a paper that uses the Fourier transform.
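For reference, NeRF's positional encoding just passes each coordinate through fixed sinusoids at exponentially spaced frequencies. A small sketch (the number of frequency bands is only an example):

```python
import math
import torch

def positional_encoding(x, num_bands=10):
    """NeRF-style gamma(p): concatenate sin(2^k * pi * p) and cos(2^k * pi * p)
    for k = 0..num_bands-1, applied elementwise to each coordinate."""
    freqs = (2.0 ** torch.arange(num_bands)) * math.pi       # 2^k * pi
    angles = x[..., None] * freqs                            # (..., dims, bands)
    enc = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)
    return enc.flatten(start_dim=-2)                         # (..., dims * 2 * bands)

coords = torch.rand(4, 3)                  # a few 3D sample points
print(positional_encoding(coords).shape)   # torch.Size([4, 60])
```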
Goodbye ReLU, you had a good run! I feel I have to watch this a few more times to get a good idea of what's going on 😄 but it looks like a breakthrough!
Awesome work!
Can you please share your code? The link on the project page is not working
Yeah, I was wondering why people weren't using sines and cosines. I watched a video where the guy explained that a neural network with L layers and N nodes per layer, using ReLU activations, can perfectly match a function with up to N to the power L bends (turning points) in its curve (assuming the network has a single scalar output). I guess that is why ReLU failed on the audio: there are a lot of turning points in audio data. So technically, SIREN's performance can be matched by a large enough ReLU network, and I'm looking at SIREN as an optimization of the usual ReLU networks. I'm glad I saw this; I will look into it further. I suspect sinusoidal activations will be useful in domains with some sort of repetition, since ReLUs act more like threshold switches.
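A quick way to see the piecewise-linear behaviour that argument relies on, with a small untrained ReLU net on a 1D input (purely illustrative; the tolerance is arbitrary):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny ReLU MLP with a 1D input and output (untrained, purely illustrative).
net = nn.Sequential(nn.Linear(1, 8), nn.ReLU(),
                    nn.Linear(8, 8), nn.ReLU(),
                    nn.Linear(8, 1))

x = torch.linspace(-3, 3, 10_000).unsqueeze(1)
with torch.no_grad():
    y = net(x).squeeze(1)

# A ReLU net is piecewise linear: its slope only changes at a finite
# number of breakpoints. Count how often the finite-difference slope jumps.
slope = torch.diff(y) / torch.diff(x.squeeze(1))
breaks = (torch.diff(slope).abs() > 1e-4).sum().item()
print("approximate number of linear pieces:", breaks + 1)
```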
Love this! Thank you!!
lol at tanh - but very cool general-purpose work; I can imagine this being a good exploratory topic/bonus project for intro signal processing courses.
How does it compare to using a sawtooth wave in place of the sine wave?
What if you were to use exp(i x) = cos(x) + i sin(x) as the activation function? That seems potentially more elegant.
What would it mean for an activation to have a complex output? Or 2 outputs?
@@rainerzufall1868 Twice as many outputs - just doubling the features. You can do a similar thing with ReLU, where you threshold at maximum zero and at minimum zero and split into two parts; I'm not sure it's a whole lot better than just one, though...
@@rainerzufall1868 Does the activation function necessarily have to be real? I don't think so. I think using a complex exponential could help make the calculations and the implementation clearer. It could have a computational overhead, though.
@@luciengrondin5802 I don't think it would simplify things. If you model it as the activation having two outputs, it would need some re-implementation, and if you instead use one complex output and complex multiplication, the libraries are not optimized for this at all, so the computational hit would be big, I think.
Also, cosine and sine are the same except for a constant offset in the input, which we could learn through the bias, so I don't think adding cosine would add much value. On the flip side, the derivative of sine is cosine and vice versa (with a minus sign), so we can just reuse the output of one in the derivative computation of the other.
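If one did want both, a minimal sketch of the "two outputs" idea discussed above, concatenating sin and cos the way CReLU concatenates relu(x) and relu(-x); purely illustrative, not from the paper:

```python
import torch
import torch.nn as nn

class SinCosActivation(nn.Module):
    """Returns [sin(x), cos(x)] concatenated along the feature axis,
    doubling the width, analogous to CReLU's [relu(x), relu(-x)]."""
    def forward(self, x):
        return torch.cat([torch.sin(x), torch.cos(x)], dim=-1)

# Example: a layer of width 128 feeds 256 features to the next layer.
layer = nn.Sequential(nn.Linear(2, 128), SinCosActivation(), nn.Linear(256, 3))
out = layer(torch.rand(16, 2))
print(out.shape)  # torch.Size([16, 3])
```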
Wow, I just watched Yannic Kilcher's video on this work, and this is fascinating... I bet this work is going to change many things in ML. Please share the code!
That’s amazing!
No link to the paper in the video description?
And project page at vsitzmann.github.io/siren/
Super cool!
Code available?
Awesome
Is it like a new JPEG?
Hi, could I download this video and upload it to bilibili.com, where Chinese students and researchers can access it freely?
An implementation is already available at: github.com/titu1994/tf_SIREN
Did anyone try using this in transformers?
Post a Colab?