Artificial Neural Networks: Activation Functions and Optimization Algorithms

  • Published 16 Oct 2024

COMMENTS • 17

  • @DerinWilson-cb8jx
    @DerinWilson-cb8jx 3 years ago +2

    Sir, can you tell us about some cases where Leaky ReLU would be preferred over ReLU?

    • @EvolutionaryIntelligence
      @EvolutionaryIntelligence  3 years ago +2

      Check out this paper for a detailed analysis of one specific problem, image classification: arxiv.org/pdf/1505.00853.pdf
      However, neither of these two methods is inherently superior to the other. Generally, it's good to use the simple ReLU first and then try the leaky version only if the simple version does not work.
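
      For illustration, a minimal NumPy sketch (not from the video) of the two functions, which differ only in how they treat negative inputs:

        import numpy as np

        def relu(x):
            # Zero for negative inputs, identity for positive inputs
            return np.maximum(0.0, x)

        def leaky_relu(x, alpha=0.01):
            # Same as ReLU for positive inputs; a small slope alpha for negative
            # inputs keeps the gradient non-zero so neurons cannot "die"
            return np.where(x > 0, x, alpha * x)

        x = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])
        print(relu(x))        # [0. 0. 0. 1. 3.]
        print(leaky_relu(x))  # [-0.02 -0.005 0. 1. 3.]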

  • @DesarajuHarshaVardhan
    @DesarajuHarshaVardhan 3 years ago +2

    With ReLU, which is linear, how can there be learning when its derivative is constant?

    • @EvolutionaryIntelligence
      @EvolutionaryIntelligence  3 years ago +1

      The derivative has different values in the two regions (zero for negative inputs, one for positive inputs), which makes ReLU piecewise linear rather than perfectly linear.
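
      As a small illustration (assuming the standard definition of ReLU), the gradient is piecewise constant, not constant everywhere:

        import numpy as np

        def relu_grad(x):
            # Derivative of max(0, x): 0 for x < 0 and 1 for x > 0
            # (undefined at exactly 0; implementations conventionally pick 0 or 1)
            return (x > 0).astype(float)

        x = np.array([-2.0, -0.1, 0.5, 4.0])
        print(relu_grad(x))  # [0. 0. 1. 1.]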

  • @NamanJoshi-pi4bn
    @NamanJoshi-pi4bn 3 years ago +1

    In SGD, if we take the cost function with respect to a smaller set of random data values (~33%), then how do we know that the global minimum of the new cost function and that of the old cost function would coincide? The new function may have a different global minimum.

    • @EvolutionaryIntelligence
      @EvolutionaryIntelligence  3 years ago

      Firstly, even if we take only a small fraction of the dataset in each iteration of SGD, we still cover all the data points over the course of each epoch. Secondly, even in conventional gradient descent the cost function is non-convex and finding the global minimum is practically impossible. So, in ANN optimisation, the goal is not to search for the global minimum but to find a local minimum that gives good enough accuracy without overfitting.
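
      A rough sketch (plain NumPy; grad_fn is a placeholder for whatever gradient computation the model uses) of how mini-batch SGD still visits every sample once per epoch:

        import numpy as np

        def sgd(X, y, w, grad_fn, lr=0.01, batch_size=32, epochs=10):
            n = len(X)
            for _ in range(epochs):
                idx = np.random.permutation(n)      # shuffle once per epoch
                for start in range(0, n, batch_size):
                    batch = idx[start:start + batch_size]
                    # gradient estimated on a small random batch, not the full dataset
                    w = w - lr * grad_fn(X[batch], y[batch], w)
            return w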

  • @YashAgrawal-cs6tj
    @YashAgrawal-cs6tj 3 years ago +1

    Sir, how can a momentum term be defined to increase the effect of stochastic gradient descent? If the momentum term is too high, that might again result in larger jumps, so can't we just adjust that momentum term rather than using Nesterov SGD, or is Nesterov SGD preferred over the plain momentum term?

    • @EvolutionaryIntelligence
      @EvolutionaryIntelligence  3 years ago +1

      Yes, you are right! The momentum term and the learning rate are adjusted dynamically in the advanced ANN optimizers available nowadays in TensorFlow.
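
      For example (assuming the Keras API in TensorFlow 2.x), plain and Nesterov momentum are just arguments of the same SGD optimizer, and adaptive optimizers such as Adam tune the effective step per parameter:

        import tensorflow as tf

        # Plain momentum vs. Nesterov momentum: one flag on the same optimizer
        sgd_momentum = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)
        sgd_nesterov = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=True)

        # Adam adapts the step size per parameter from running gradient statistics
        adam = tf.keras.optimizers.Adam(learning_rate=0.001)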

  • @VaibhavSingh-iz4bf
    @VaibhavSingh-iz4bf 3 years ago +1

    Can't the non-convex cost function be converted to a convex cost function by suitable transformations?
    (If not always in practice, at least in theory?)
    Also, as far as I understand, the stochastic part increases the randomness by making bigger jumps on the graph. Why can't we do a similar thing with normal gradient descent but with a large learning parameter?

    • @EvolutionaryIntelligence
      @EvolutionaryIntelligence  3 years ago +1

      Finding convex approximations of non-convex functions is an important problem in the field of optimisation, but it has to be done for each given function and there is no general algorithm that works for all non-convex functions (or for all ANN architectures). Also, this process does not always give the optimal solution to the original non-convex problem.
      SGD does not create randomness through bigger jumps; that can easily be done with conventional gradient descent as well, just by adding a momentum term. SGD adds randomness by taking a smaller random batch of the given data for computing the cost function and its gradients, which changes not only the step size but also the direction of the step.
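
      A small sketch (NumPy, with a linear-regression loss used only as a stand-in) showing that a mini-batch gradient generally differs from the full-batch gradient in direction, not just magnitude:

        import numpy as np

        rng = np.random.default_rng(0)
        X = rng.normal(size=(1000, 2))
        y = X @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=1000)
        w = np.zeros(2)

        def grad(Xb, yb, w):
            # Gradient of the mean squared error of a linear model
            return 2 * Xb.T @ (Xb @ w - yb) / len(Xb)

        batch = rng.choice(1000, size=32, replace=False)
        print(grad(X, y, w))                # full-batch gradient
        print(grad(X[batch], y[batch], w))  # mini-batch gradient: a different vector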

  • @AayushMishra-pt4yx
    @AayushMishra-pt4yx 3 years ago +1

    Are there advantages in using the tanh activation function instead of sigmoid for binary classification, since tanh varies over (-1, 1) and also has a higher gradient compared to sigmoid?

    • @EvolutionaryIntelligence
      @EvolutionaryIntelligence  3 years ago +1

      The tanh function does have some advantages since it has higher gradients and is also symmetric (its output is centred around zero), but these differences do not matter much since we nowadays mostly use ReLU for the hidden layers. For the output layer, sigmoid has become the preferred choice since it has an easy interpretation in terms of probabilities. (A short numerical comparison of the two gradients follows this thread.)

    • @AayushMishra-pt4yx
      @AayushMishra-pt4yx 3 years ago

      @EvolutionaryIntelligence Got it, sir. Thank you!
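
      For reference, a small NumPy comparison of the two gradients (sigmoid'(x) = sigmoid(x)(1 - sigmoid(x)) peaks at 0.25, while tanh'(x) = 1 - tanh(x)^2 peaks at 1):

        import numpy as np

        def sigmoid(x):
            return 1.0 / (1.0 + np.exp(-x))

        x = np.linspace(-3, 3, 301)
        sig_grad = sigmoid(x) * (1 - sigmoid(x))   # maximum 0.25 at x = 0
        tanh_grad = 1 - np.tanh(x) ** 2            # maximum 1.0 at x = 0
        print(sig_grad.max(), tanh_grad.max())     # 0.25 1.0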

  • @siddharthsethi7773
    @siddharthsethi7773 3 years ago

    Sir, in the discussion of activation functions we saw that the sigmoid function suffers from the vanishing gradient problem and hence is used in the output layer. But ReLU could suffer from exploding gradients, since from the graph its value keeps increasing with x, yet it is still said to be the best choice and is used in the hidden layers. Can you please clarify how ReLU is better?

    • @EvolutionaryIntelligence
      @EvolutionaryIntelligence  3 years ago

      The value of ReLU increases but its gradient remains constant (since it is a straight line for positive inputs). So the gradient of ReLU is either zero or one, and it does not suffer from the problem of exploding gradients. Still, since the value of ReLU keeps increasing, it may lead to problems if the ANN weights are not initialised properly. So the key is to initialise the ANN weights properly and avoid extreme values.
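
      For instance (assuming the Keras API), ReLU layers are usually paired with He initialization, which scales the initial weights to keep activation magnitudes roughly stable across layers:

        import tensorflow as tf

        # He (variance-scaling) initialization is the common pairing for ReLU:
        # it keeps the variance of activations roughly constant from layer to layer,
        # so values do not blow up or shrink to zero early in training
        layer = tf.keras.layers.Dense(128, activation="relu",
                                      kernel_initializer="he_normal")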

  • @Parth-vn5xj
    @Parth-vn5xj 3 years ago

    Sir, at 13:30 the cost function is plotted against all the weights; how is that possible?

    • @EvolutionaryIntelligence
      @EvolutionaryIntelligence  3 years ago +1

      This is just for illustration on a 2D screen surface. The actual graph is multi-dimensional and a lot more complicated.