Naive Bayes in Python - Machine Learning From Scratch 05 - Python Tutorial

  • Published 31 Jan 2025

COMMENTS •

  • @patloeber
    @patloeber  4 years ago +15

    There is a slight fix in the fit method that must be applied if class labels do not start at 0:
    for idx, c in enumerate(self._classes)
    instead of
    for c in self._classes
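
The fix above can be sketched as follows. This is a minimal illustration, assuming a fit method shaped like the one in the tutorial; the toy data and the labels 3 and 7 are made up for the example. Indexing the parameter arrays by position (idx) instead of by label value (c) keeps the rows in range even when the labels do not start at 0.

```python
import numpy as np


class NaiveBayes:
    """Reduced sketch: only the fit step, not the full tutorial class."""

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self._classes = np.unique(y)
        n_classes = len(self._classes)

        # One row of parameters per class, addressed by position, not by label.
        self._mean = np.zeros((n_classes, n_features), dtype=np.float64)
        self._var = np.zeros((n_classes, n_features), dtype=np.float64)
        self._priors = np.zeros(n_classes, dtype=np.float64)

        for idx, c in enumerate(self._classes):
            X_c = X[y == c]  # rows whose label equals c
            self._mean[idx, :] = X_c.mean(axis=0)
            self._var[idx, :] = X_c.var(axis=0)
            self._priors[idx] = X_c.shape[0] / float(n_samples)
        return self


# Hypothetical toy data: labels are 3 and 7, so using c as a row index would fail.
X = np.array([[1.0, 2.0], [1.2, 1.8], [5.0, 6.0], [5.2, 6.4]])
y = np.array([3, 3, 7, 7])
model = NaiveBayes().fit(X, y)
```

With `for c in self._classes`, the line `self._mean[c, :]` would try to write to row 3 of a 2-row array.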

    • @AliHussain-kb3ew
      @AliHussain-kb3ew 4 years ago +3

      How do I solve this problem? What should I do?
      for idx, c in enumerate(self._classes):
          X_c = X[y==c]
          self._mean[idx, :] = X_c.mean(axis=0)
          self._var[idx, :] = X_c.var(axis=0)
          self._priors[idx] = X_c.shape[0] / float(n_samples)
      I get this error: boolean index did not match indexed array along dimension 1; dimension is 5 but corresponding boolean dimension is 1

    • @alitaangel8650
      @alitaangel8650 4 years ago

      @AliHussain-kb3ew The code above works fine for me; maybe something is wrong with your input data?
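
One input shape that produces an error of this kind is a label array stored as a column, shape (n, 1), instead of a flat array of shape (n,). Whether that was the cause in the comment above is an assumption; the sketch below only reproduces the situation with made-up data and shows that flattening y with ravel() makes the mask work.

```python
import numpy as np

X = np.arange(20, dtype=np.float64).reshape(4, 5)  # 4 samples, 5 features
y_column = np.array([[0], [1], [0], [1]])          # labels stored as shape (4, 1)

try:
    X[y_column == 0]      # 2-D mask of shape (4, 1) against X of shape (4, 5)
    mask_failed = False
except IndexError:
    mask_failed = True    # NumPy rejects the mismatched second dimension

y_flat = y_column.ravel()  # shape (4,)
X_c = X[y_flat == 0]       # selects the rows whose label is 0
```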

    • @Dhanush-zj7mf
      @Dhanush-zj7mf 4 years ago +1

      I was stuck for 2 days and also posted a question on Stack Overflow. I should have read the comments first.

    • @robinsonnadar5457
      @robinsonnadar5457 3 years ago

      @AliHussain-kb3ew I am stuck with the same error too :(

    • @umarmughal5922
      @umarmughal5922 3 years ago

      @Python Engineer could you please explain how to apply Laplace smoothing to this?

  • @kougamishinya6566
    @kougamishinya6566 3 years ago +2

    I love the way you explain what each line is doing and relate it back to the formulae, that's super helpful thank you!

  • @mattgoodman2687
    @mattgoodman2687 5 years ago +4

    Thank you for this. I had no clue how to conceptually grasp Naive Bayes, but after watching your video I understand it very well.

    • @patloeber
      @patloeber  5 years ago +1

      I’m glad it is helpful :)

  • @heidycespedes9220
    @heidycespedes9220 2 years ago

    Awesome explanation! It helped me to understand the concept and work on my project. Thanks a lot!

  • @dinarakhaydarova4898
    @dinarakhaydarova4898 2 years ago

    Exactly what I needed! Thank you bunches!

  • @vanshikajain8353
    @vanshikajain8353 3 years ago +1

    In the second function, predict, there is a misplaced x under the for loop; replace it with c in the class conditional, otherwise you get a ValueError.

    • @chandank5266
      @chandank5266 2 years ago

      Yeah! Actually I got confused at that point, but now it's clear. Thanks for confirming :)

  • @andreaq.y1770
    @andreaq.y1770 5 years ago +4

    Very good tutorial!!! Hope you will post more algorithm implementations.

    • @patloeber
      @patloeber  5 years ago +1

      Thank you! Yes more videos are coming soon :)

  • @tkaczoro
    @tkaczoro 11 months ago

    Looks like, for the same reason you removed P(X) from the formula for y, you can also remove the prior term P(y). You will get the same result in the calculation of accuracy.
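
Whether the prior can be dropped depends on the class balance. P(X) is the same for every class, so it never changes the argmax; P(y) differs per class, so it only drops out when the priors are equal. A toy sketch with made-up numbers (not the tutorial's dataset) where the prediction changes once the prior is removed:

```python
import numpy as np


def log_gaussian(x, mean, var):
    """Log of the Gaussian density, written with the variance."""
    return -0.5 * np.log(2 * np.pi * var) - (x - mean) ** 2 / (2 * var)


# Hypothetical one-feature problem with two classes and imbalanced priors.
means = np.array([0.0, 1.0])
variances = np.array([1.0, 1.0])
priors = np.array([0.9, 0.1])

x = 0.6  # slightly closer to the mean of class 1
log_likelihood = log_gaussian(x, means, variances)

with_prior = int(np.argmax(log_likelihood + np.log(priors)))
without_prior = int(np.argmax(log_likelihood))
```

Here the likelihood alone favours class 1, while the posterior favours class 0. On a balanced dataset the two predictions coincide, which would explain seeing the same accuracy.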

  • @amauryribeiro1860
    @amauryribeiro1860 4 years ago +2

    just... thank you !! for your help! ^^

  • @changsinlee4634
    @changsinlee4634 3 years ago

    A great tutorial and implementation. Just one correction on the implementation.
    _pdf is implemented differently than the formula. It should be:
    numerator = np.exp(- (x-mean)**2 / (2 * var**2))
    denominator = np.sqrt(2 * np.pi * var**2)
    The implemented code is missing the squared part.
    numerator = np.exp(- (x-mean)**2 / (2 * var))
    denominator = np.sqrt(2 * np.pi * var)

    • @patloeber
      @patloeber  3 years ago

      Thanks for the feedback, but you are wrong; you may have confused standard deviation and variance. In most formulas (and in this video) it is written with the squared standard deviation, which is equal to the variance (so no square when using the variance directly) :)

    • @changsinlee4634
      @changsinlee4634 3 years ago

      @patloeber Thanks for the quick reply. Ah, yes, I see it. In that case, it should be std**2. You get different values based on whether you use var or std**2. I was comparing the results with those of a standard library (from scipy.stats import norm) and that's when I discovered the differences.

    • @patloeber
      @patloeber  3 years ago

      @changsinlee4634 Oh, this is interesting. Thanks for noticing this! I would expect std**2 and var to be exactly the same except for rounding errors.
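
The point can be checked numerically. The sketch below assumes a _pdf shaped like the one quoted in this thread (written here as a free function, gaussian_pdf, for illustration) and compares it with Python's statistics.NormalDist, which is parameterised by the standard deviation sigma. The numbers are made up.

```python
import numpy as np
from statistics import NormalDist


def gaussian_pdf(x, mean, var):
    """Gaussian density written with the variance, as in the quoted code."""
    numerator = np.exp(-((x - mean) ** 2) / (2 * var))
    denominator = np.sqrt(2 * np.pi * var)
    return numerator / denominator


x, mean, std = 1.5, 1.0, 2.0
var = std ** 2  # variance is the squared standard deviation

from_var = gaussian_pdf(x, mean, var)              # variance used directly
reference = NormalDist(mu=mean, sigma=std).pdf(x)  # reference takes the std
squared_var = gaussian_pdf(x, mean, var ** 2)      # the proposed "var**2" variant
```

from_var agrees with the reference, while squared_var does not. A possible source of the mismatch reported above: scipy.stats.norm also takes the standard deviation (its scale argument), so passing a variance there gives different values.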

  • @akshaygoel2184
    @akshaygoel2184 2 years ago +2

    Amazing implementation!
    Small question/point - for the PDF shouldn't the numerator var have a square term? i.e. (2 * var**2)?

    • @BlackHeart-AI
      @BlackHeart-AI 1 year ago

      f(x) = (1 / (σ * sqrt(2π))) * e^(-((x-μ)^2) / (2σ^2))
      In statistics, σ (the Greek letter sigma) represents the standard deviation of a population. The standard deviation is a measure of the spread or dispersion of a set of data around its mean.
      Standard deviation is closely related to the variance, which is equal to the square of the standard deviation, and is denoted by σ^2.
      Just σ^2 == variance

  • @kidspast7294
    @kidspast7294 2 years ago

    Great tutorial thanks!

  • @godwingeorgethekkanath
    @godwingeorgethekkanath 4 years ago

    Great tutorial😍
    It was useful for me.

    • @patloeber
      @patloeber  4 years ago +1

      thanks, glad you like it!

  • @ГарикКубич
    @ГарикКубич 4 years ago +1

    Thank you so much friend, very helpful

  • @Fresh290PL
    @Fresh290PL 2 years ago +1

    Great video, thanks! Just one thing: how can we avoid the zero-frequency problem in this implementation?

  • @matthewcallinankeenan2034
    @matthewcallinankeenan2034 4 years ago +2

    @PythonEngineer I'm using this on a large dataset with 8 columns and ~16000 rows. It's saying "IndexError: index 10000 is out of bounds for axis 0 with size 210". Do you know how I can fix this?

  • @boooringlearning
    @boooringlearning 3 years ago

    great video!

  • @T4l0nITA
    @T4l0nITA 4 years ago

    Really good explanation.

  • @robertrey7002
    @robertrey7002 2 years ago

    Hey man that was a great tutorial! I would just like to ask however, is there a way to know when you should use the Naive Bayes classifier?

    • @no_guarantees
      @no_guarantees 2 years ago

      Simplest application would be a binary classifier (0/1) or (no/yes) such as spam classification. You could experiment with NB where you would typically use logistic regression to build your intuition.

  • @debatradas9268
    @debatradas9268 3 years ago

    thank you

  • @posadzd7343
    @posadzd7343 3 years ago +1

    Good video, learnt a lot. Could you please implement a Bayes classifier based on Parzen window density estimation?

  • @jossyrayonieram5231
    @jossyrayonieram5231 2 years ago

    Hi. What do you mean by "classes" here? You mention classes "0" and "1", but I'm still not sure what you meant or why they are called "classes".

  • @OnlineGreg
    @OnlineGreg 2 years ago

    hey, thanks a lot for this series. One question: why do you often put an underscore _ in front of a function or a variable?

    • @derilraju2106
      @derilraju2106 2 years ago

      It's a general way to describe private methods which need not be called in the main function

  • @prateekarora4549
    @prateekarora4549 4 years ago

    very good tutorial !

  • @matthewcallinankeenan2034
    @matthewcallinankeenan2034 4 years ago +1

    What do we change about this program if the class isn't just True/False, e.g. self._classes isn't just [0,1]?

    • @patloeber
      @patloeber  4 years ago

      It works for multiple classes; however, you have to change the for loop like this: for idx, c in enumerate(self._classes):
      I have already applied this fix in my GitHub repo.

  • @abhisheksuryavanshi979
    @abhisheksuryavanshi979 2 years ago

    No init function inside the NaiveBayes class?

  • @joydeepkr.devnath193
    @joydeepkr.devnath193 4 years ago

    Hi, great video btw. One question at 4:43, where you define P(x_i|y) with the Gaussian formula: the Gaussian pdf is a density, so to get probabilities we need to integrate. Do we approximate this integral as the area of a rectangle with height = pdf and width = some delta? Since the Bayes formula contains a ratio of probabilities, the delta in the numerator cancels the delta in the denominator, which is why we don't include the delta term in our formula. Is this what you are doing?

    • @patloeber
      @patloeber  4 years ago +1

      This is a very good question! I hope this helps: stats.stackexchange.com/questions/26624/pdfs-and-probability-in-naive-bayes-classification

    • @joydeepkr.devnath193
      @joydeepkr.devnath193 4 years ago

      @patloeber Yes, this link was helpful. Thanks!

    • @patloeber
      @patloeber  4 years ago

      @joydeepkr.devnath193 sure :)

  • @ramazanburakguler5842
    @ramazanburakguler5842 1 year ago

    In terms of regularization, what can be done?

  • @anjaliacharya9506
    @anjaliacharya9506 4 years ago +1

    I tried to use this on the WBCD dataset, but I get a UFuncTypeError in the line "numerator = np.exp(- (x-mean)**2 / (2 * var))". Could you help me with this?

    • @anjaliacharya9506
      @anjaliacharya9506 4 years ago

      I have used label encoder to change 'diagnosis' target column to integer type but the error persists in the same line I mentioned. UFuncTypeError: ufunc 'subtract' did not contain a loop with signature matching types (dtype('

    • @jonn6897
      @jonn6897 4 years ago

      I have the same error with another dataset, looking forward to any help!

    • @anjaliacharya9506
      @anjaliacharya9506 4 years ago +2

      @jonn6897 I converted all feature columns (everything except the target) to a NumPy array before the probability calculation, and then it works. In my case it is the WBCD dataset.
      y = wbcd_data.diagnosis
      X = wbcd_data.drop('diagnosis', axis=1)
      X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
      # convert all feature columns (everything except the target) to NumPy arrays to calculate probabilities
      X_train = np.array(X_train)
      X_test = np.array(X_test)

    • @patloeber
      @patloeber  4 years ago +2

      Try casting your X to dtype=np.float64 before calling fit(), and yes, of course it must be a NumPy array.
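
The advice above can be illustrated with a small, made-up array. The sketch assumes the features arrived as strings (for example from a CSV read without type conversion), which is one way to trigger a ufunc 'subtract' type error; the actual dtype in the reports above is cut off, so this is only a plausible reproduction.

```python
import numpy as np

# Hypothetical features read as text, so the array has a string dtype.
X_raw = np.array([["1.5", "2.0"], ["3.5", "4.0"]])
mean = np.array([2.5, 3.0])

try:
    X_raw[0] - mean             # strings minus floats: no matching ufunc loop
    subtraction_failed = False
except TypeError:               # UFuncTypeError is a subclass of TypeError
    subtraction_failed = True

X = X_raw.astype(np.float64)    # cast once, before calling fit()
diff = X[0] - mean              # now an ordinary float subtraction
```

If astype itself raises a ValueError, some cell is not numeric at all (a stray header, an empty string, a category name) and needs cleaning or encoding first.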

  • @ozysjahputera7669
    @ozysjahputera7669 3 years ago

    The pdf implemented here is only for univariate gaussian, correct? Multivariate would have involved covariance matrix inverse, and determinant.
    Never mind. You assume all features are independent of each other.

  • @Lanipops
    @Lanipops 5 years ago +1

    Tried to run this but I keep getting this error:
    ~/anaconda3/envs/XXXXXX6/aima-python-master/naivebayes.py in fit(self, X, y)
    15 for c in self._classes:
    16 X_c = X[y==c]
    ---> 17 self._mean[c, :] = X_c.mean(axis=0)
    18 self._var[c, :] = X_c.var(axis=0)
    19 self._priors[c] = X_c.shape[0] / float(n_samples)
    IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

    • @omkarpatil4386
      @omkarpatil4386 4 years ago

      Make your labels binary or encode the labels.

  • @AliHaider-hg7lj
    @AliHaider-hg7lj 4 years ago +1

    How can we train a model on our own data? I mean, if we have a CSV file, how can we use it with this model?

    • @patloeber
      @patloeber  4 years ago

      Load the data with pandas, or just manually with open(filename), and convert each line to your X and y vectors. Then create training and testing data and train your model.

    • @patloeber
      @patloeber  4 years ago

      I'm actually planning to release a short video in the next 1-2 days on how to load your own datasets from csv

    • @AliHaider-hg7lj
      @AliHaider-hg7lj 4 years ago

      @patloeber Perfect & Thanks:)

    • @T4l0nITA
      @T4l0nITA 4 years ago +3

      data = pandas.read_csv("file_name.csv")
      X = data.iloc[samples, features].values
      y = data.iloc[samples, y_column].values
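
A self-contained variant of the loading step, for illustration only. It swaps in NumPy's genfromtxt instead of pandas so that it runs without extra dependencies, reads an in-memory CSV instead of a file, and assumes a layout that is not from the original thread: a header row, numeric feature columns, and the label in the last column.

```python
import io

import numpy as np

# Hypothetical CSV content; in practice this would be a path to your own file.
csv_text = """f1,f2,label
1.0,2.0,0
1.5,1.8,0
5.0,8.0,1
6.0,9.0,1
"""

# skip_header=1 drops the column names; every remaining cell is parsed as float.
data = np.genfromtxt(io.StringIO(csv_text), delimiter=",", skip_header=1)

X = data[:, :-1].astype(np.float64)  # all columns except the last are features
y = data[:, -1].astype(int)          # last column holds the class labels
```

With pandas the same split is read_csv followed by iloc, as in the comment above; the important part is ending up with a float array X and an integer array y.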

  • @_Shrivi_
    @_Shrivi_ 4 years ago

    Hi, very good explanation. Can I use this code to train data for sentiment analysis as well?

  • @samii8104
    @samii8104 3 years ago

    I'm trying to run the algorithm on a dataset where the first half of y_train is 0 and the second half is 1.
    The problem is that when I try to get the prediction for the first half of y_train, I get a division-by-zero error.
    Is there any way that using Laplace smoothing in the code could help me?

  • @BlueSkyGoldSun
    @BlueSkyGoldSun 2 years ago

    Any book you recommend to learn ML in native Python?

  • @bryanchambers1964
    @bryanchambers1964 4 years ago

    Hey there, I like your videos and you explain well, but I am confused about something. There is a step in your code where you have:
    for c in self.classes:
        X_c = X[c==y]
    I understand the first line (for c in self.classes:), but I have no idea why you have X_c = X[c==y].
    If my c values are, for example, [1, 4, 8], then X_c = X[1==1] just gives me X_c with an extra dimension. For example, if X is a 3x4 matrix, X_c is the same matrix except it has dimension 1x3x4. Am I just dumb or overthinking this detail?

    • @patloeber
      @patloeber  4 years ago

      Note that y is an array as well, not just a number, and the length of y has to be the same as the first dimension of X! So X[1==y] gives you all rows of X where y is 1. Please also note that my code has a slight bug. It should be this (compare with my code on GitHub):
      for idx, c in enumerate(self._classes):
          X_c = X[y==c]
          self._mean[idx, :] = X_c.mean(axis=0)
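
The difference between the two cases in this exchange can be seen on a small, made-up array: comparing the whole label array with a class value gives one boolean per row, while comparing two scalars gives a single True, which NumPy treats as a request for a new leading axis.

```python
import numpy as np

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]])
y = np.array([1, 4, 1, 8])  # one label per row of X

mask = (y == 1)             # element-wise comparison: one boolean per sample
X_c = X[mask]               # keeps only the rows where the mask is True

scalar_case = X[1 == 1]     # 1 == 1 is just True: adds an axis, selects nothing
```

So X[c==y] only behaves as a row filter when y is an array with the same length as the first dimension of X.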

    • @bryanchambers1964
      @bryanchambers1964 4 years ago +1

      @patloeber Thanks, yeah, I kind of realized this after a while. So this will extract the rows of X that have the class y=1. Makes sense.

  • @FoodieTechVoyager
    @FoodieTechVoyager 3 years ago

    Hi, I am new to machine learning. It would be very helpful if you could provide the dataset too, or share a tutorial on how to create it.

    • @patloeber
      @patloeber  3 years ago

      thanks for the suggestion

  • @shehanjanidu2334
    @shehanjanidu2334 3 years ago

    I was using my own csv file as my dataset but it gives ufunc 'subtract' did not contain a loop with signature matching types (dtype('

  • @srikaramanaganti1285
    @srikaramanaganti1285 3 years ago

    Can you model the class-conditional probability using a multinomial distribution?

  • @tanziahkhanam6451
    @tanziahkhanam6451 3 years ago

    I got very low accuracy on my own dataset, only 0.3. What is the reason? I also got a warning: RuntimeWarning: divide by zero encountered in true_divide numerator = np.exp(- (x - mean) ** 2 / (2 * var))

    • @bong-techie
      @bong-techie 3 years ago

      How did you fix it? I'm facing the same problem now, please help.

  • @viperz301
    @viperz301 4 years ago

    Hi! What do you mean by the self that you pass into every function? Is it the data frame?

    • @patloeber
      @patloeber  4 years ago +1

      This is an essential concept of object-oriented programming and of using classes in Python. self represents the instance of the class. Through the self parameter we can access the attributes and methods of that instance. It binds the attributes with the given arguments.

    • @jossyrayonieram5231
      @jossyrayonieram5231 2 years ago

      @patloeber Out of all the things Python does for you automatically, they stopped with "self". >_

  • @abhisheksuryavanshi979
    @abhisheksuryavanshi979 2 years ago

    Can anyone please tell me why we are adding the prior + class_conditional variables?
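
A likely reason, assuming (as in the tutorial's formulas) that both variables hold logarithms: the posterior is proportional to the prior times the product of the per-feature likelihoods, and the log of a product is the sum of the logs. Working in log space also avoids numerical underflow when many small likelihoods are multiplied. A toy sketch with made-up values:

```python
import numpy as np

prior = 0.3
# Hypothetical per-feature likelihoods, deliberately tiny.
likelihoods = np.array([1e-120, 2e-130, 5e-110])

# Direct product: the true value (about 3e-360) is below the float64 range.
product = prior * np.prod(likelihoods)

# Log space: log(prior * l1 * l2 * l3) = log(prior) + sum(log(l_i)).
log_prior = np.log(prior)
class_conditional = np.sum(np.log(likelihoods))
log_posterior = log_prior + class_conditional
```

product collapses to 0.0 for every class in such a case, so the classes cannot be compared, while log_posterior stays finite and preserves the ordering.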

  • @MuhammadAli-pf4ww
    @MuhammadAli-pf4ww 3 years ago

    Can anyone explain what X_c = X[c==y] is doing? I'm a little confused

  • @nafesafirdous3670
    @nafesafirdous3670 4 years ago

    If I have my own dataset which is not present in sklearn datasets, how can I do the classification?
    Please help!

    • @patloeber
      @patloeber  4 years ago +1

      You need to load the dataset (probably from a CSV file) and set up your X and y NumPy arrays.

    • @nafesafirdous3670
      @nafesafirdous3670 4 years ago

      @patloeber Helpful
      Thanks

  • @prithviamin6847
    @prithviamin6847 4 years ago

    hi
    i'm getting this error:
    UFuncTypeError: ufunc 'subtract' did not contain a loop with signature matching types (dtype('

    • @patloeber
      @patloeber  4 years ago

      Try converting your data to np.float64. And check that all your data is valid; you probably have NaN for some data points...

    • @AliHussain-kb3ew
      @AliHussain-kb3ew 4 years ago

      Hi, I face the same problem. Did you get it working?
      If you corrected the code, please tell me what to do.

    • @AliHussain-kb3ew
      @AliHussain-kb3ew 4 years ago

      Hi

  • @seyeeet8063
    @seyeeet8063 4 years ago

    So NB does not have any update rule like gradient descent?

    • @patloeber
      @patloeber  4 years ago +1

      No, you just have to precompute the priors, means, and variances, and then apply the formula using Bayes' theorem.

  • @kritamdangol5349
    @kritamdangol5349 4 years ago

    I got this error when running the code. Please give me a solution for this.
    line 54, in
    predicted_values=(model.predict(Features_test))
    line 20, in predict
    y_pred=[self._predict(x) for x in X]
    , in
    y_pred=[self._predict(x) for x in X]
    line 29, in _predict
    line 40, in _pdf
    numerator=np.exp(-(x-mean)**2/(2*var))
    numpy.core._exceptions.UFuncTypeError: ufunc 'subtract' did not contain a loop with signature matching types (dtype('

    • @patloeber
      @patloeber  4 years ago +1

      Probably your datatype or the shape of your vector is not correct. Try casting to np.float32.

    • @kritamdangol5349
      @kritamdangol5349 4 years ago

      @patloeber Thank you!

  • @amitupadhyay6511
    @amitupadhyay6511 4 years ago

    What if the values in the _pdf matrix are inf?

    • @patloeber
      @patloeber  3 years ago

      Then you have a problem ;) Yeah, you should add some error checking and maybe clip the allowed range in the calculation.

  • @marcosraphael3390
    @marcosraphael3390 4 years ago

    Is this an unlabeled classifier?

    • @patloeber
      @patloeber  4 years ago +1

      No, it is supervised learning

  • @nobody2937
    @nobody2937 3 years ago

    Also, make sure var is NOT 0 ...
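
One common way to guard against a zero variance is to add a small constant to every variance before evaluating the density. This is a sketch, not the tutorial's code: the function name gaussian_pdf, the eps parameter, and its value 1e-9 are assumptions chosen for illustration (scikit-learn's GaussianNB exposes a related var_smoothing option). Laplace smoothing in the strict sense applies to count-based variants such as multinomial Naive Bayes; for the Gaussian variant, variance smoothing plays the analogous role.

```python
import numpy as np


def gaussian_pdf(x, mean, var, eps=1e-9):
    """Gaussian density with a small constant added to the variance (assumed fix)."""
    var = var + eps  # keeps the denominator away from zero
    numerator = np.exp(-((x - mean) ** 2) / (2 * var))
    denominator = np.sqrt(2 * np.pi * var)
    return numerator / denominator


# Hypothetical class subset whose second feature is constant, so its variance is 0.
X_c = np.array([[1.0, 0.0], [3.0, 0.0]])
mean = X_c.mean(axis=0)
var = X_c.var(axis=0)
x = np.array([2.0, 0.0])

# Without smoothing: 0/0 inside the exponent produces NaN for the constant feature.
with np.errstate(divide="ignore", invalid="ignore"):
    unsmoothed = np.exp(-((x - mean) ** 2) / (2 * var)) / np.sqrt(2 * np.pi * var)

smoothed = gaussian_pdf(x, mean, var)
```

The smoothed density for a constant feature is very large rather than undefined, so taking its log stays finite; dropping constant features before fitting is another option.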

  • @AliHussain-kb3ew
    @AliHussain-kb3ew 4 years ago

    How do I use this code in Anaconda Python?

    • @patloeber
      @patloeber  4 years ago

      I have a tutorial for Anaconda setup

  • @madsmith1352
    @madsmith1352 1 year ago

    Gauss... rhymes with house.

  • @Lanipops
    @Lanipops 5 years ago

    need to make the naive bayes file allow 2d array

    • @patloeber
      @patloeber  5 years ago

      Try to cast y to int before fitting the data: y = y.astype(int)

  • @tsotnegams
    @tsotnegams 4 years ago

    In the pdf method you wrote (2*var); it should be (2*var**2) because of the squared variance in the formula. Great tutorial otherwise.

    • @patloeber
      @patloeber  4 years ago +2

      No. The formula shows the squared standard deviation, which is equal to the variance (small sigma is always used in statistics for standard deviation). I probably should have pointed this out better. Thanks for watching :)

    • @tsotnegams
      @tsotnegams 4 years ago +1

      @patloeber You are right, thanks for the reply.

    • @patloeber
      @patloeber  4 years ago +1

      No problem :) you can always reach out when you have questions or find different errors

  • @AliHussain-kb3ew
    @AliHussain-kb3ew 4 years ago

    I tried to run this code in Anaconda on another iris dataset, but I ran into a problem.

  • @reellezahl
    @reellezahl 2 years ago

    You need either a better microphone or to better adjust your sound settings. Your volume levels keep crashing and it's very grating on the ear.

  • @redhwanalgabri7281
    @redhwanalgabri7281 3 years ago

    ('Naive Bayes classification accuracy', 0)

  • @ragaistanto6722
    @ragaistanto6722 4 years ago

    Thank you. For everyone else, I also have a video tutorial on coding Naive Bayes in Python 3; feel free to check it out, it might suit you.
    ua-cam.com/video/m0HVDfe0k90/v-deo.html