For Ball tree, you would also need to scan through every point in region 7, since region 7 is a leaf node. For example, this would matter if a point (7,6) existed.
Thank you for your comment. I appreciate your engagement and your point about the Ball tree algorithm. Yes, if additional points existed they would have to be considered during the search process.
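For anyone curious, here is a minimal sketch of what that leaf scan could look like in Python. The leaf contents and the query point are hypothetical examples, not taken from the video:
###############
# Brute-force scan of a leaf node
###############
import math

def scan_leaf(query, leaf_points):
    # Check every point stored in the leaf and keep the closest one
    best_point, best_dist = None, float("inf")
    for p in leaf_points:
        d = math.dist(query, p)  # Euclidean distance
        if d < best_dist:
            best_point, best_dist = p, d
    return best_point, best_dist

# Hypothetical leaf (region 7) that also contains the extra point (7, 6)
region_7 = [(8, 7), (7, 6)]
print(scan_leaf((7.5, 6.5), region_7))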
Clearly Explained !!!
Thank you for the awesome video. I was able to use this to actually speed up computing distances between elements in a very large dataset (6418). Would you know why computing all distances between all the data points is faster than brute force, since it's still finding distances between each pair just like brute force?
Your channel is very underrated. Should have so many more subscribers for such quality content!
Sir, how are KD trees or Ball trees used in DBSCAN clustering?
Please write out the conditions you use when you split the trees. Additionally, I would suggest using letters like A, B, C, or C1, C2, C3 to designate circle names. Writing numbers while designating circles can confuse first-time learners about their actual meaning. One more thing, I would like to suggest elaborating a bit more on finding the centroid. While we know that the median is the middle value in a set of data points, it would be helpful to explain how to find the centroid since we will be applying all these concepts to a dataset. Overall, the video is short and nice.
Thank you for your feedback. Hoping below helps.
Ball tree circle condition:
- Draw a circle centred at the centroid, with a radius equal to the distance from the centroid to the farthest point in that half (a small radius sketch follows the centroid code below).
# How to calculate centroid of points: (1, 2), (3, 4), (5, 6), (7, 8)?
x_coord = (1+3+5+7)/4 = 4.0
y_coord = (2+4+6+8)/4 = 5.0
Thus, the centroid would be: (4.0, 5.0)
Try in Python:
###############
# Calculate centroid
###############
def calculate_centroid(points):
    x_sum = 0
    y_sum = 0
    num_points = len(points)
    for point in points:
        x_sum += point[0]
        y_sum += point[1]
    centroid_x = x_sum / num_points
    centroid_y = y_sum / num_points
    return centroid_x, centroid_y

# Example
points = [(1, 2), (3, 4), (5, 6), (7, 8)]
centroid = calculate_centroid(points)
print("Centroid:", centroid)

Output of the code would be:
Centroid: (4.0, 5.0)
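And here is a rough sketch of the circle (ball) condition mentioned above: the radius is simply the largest centroid-to-point distance in that half. It reuses calculate_centroid from the snippet above, with the same example points:
###############
# Ball radius: farthest point from the centroid
###############
import math

def ball_radius(points):
    centroid = calculate_centroid(points)  # function defined above
    # Radius = distance from the centroid to the farthest point in this half
    return max(math.dist(centroid, p) for p in points)

points = [(1, 2), (3, 4), (5, 6), (7, 8)]
print("Radius:", ball_radius(points))  # about 4.24, the distance to (7, 8)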
Sir, which one is best? You should explain how we can choose one of them based on the scenario.
Thank you. According to the scikit-learn docs:
Note:
N = number of samples;
D = number of features (i.e. dimensionality)
> Brute Force:
- fast computation
- scales as O[DN²]
- for small datasets (N < 30 or so)
> K-D Tree
- address computational inefficiencies of brute-force
- these structures attempt to reduce the required number of distance calculations by efficiently encoding aggregate distance information for the sample
- scales as O[DN log(N)]
- very fast for low-dimensional (D < 20) neighbors searches
- it becomes inefficient as D grows very large
> Ball Tree
- To address the inefficiencies of KD Trees in higher dimensions
- scales as O[D log(N)]
Link: scikit-learn.org/stable/modules/neighbors.html#nearest-neighbor-algorithms
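If it helps, here is a small sketch of how you could compare the three yourself in scikit-learn by switching the algorithm parameter. The dataset size and feature count below are only illustrative:
###############
# Compare brute force, KD tree and Ball tree query time
###############
import time
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.random.rand(5000, 3)  # N = 5000 samples, D = 3 features (example values)

for algo in ["brute", "kd_tree", "ball_tree"]:
    nn = NearestNeighbors(n_neighbors=5, algorithm=algo).fit(X)
    start = time.perf_counter()
    distances, indices = nn.kneighbors(X)
    print(algo, ":", round(time.perf_counter() - start, 3), "s")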
Very well explained. ❤️
Thank you. Glad it was helpful.
Great explanation!
Sir, how do we find the centroid after we split the data using the median?
@@learndataa ((x1 + x2 + x3 + ... + xn)/n, (y1 + y2 + y3 + ... + yn)/n)
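To make that concrete, here is a rough sketch that splits at the median of one axis (splitting on x is just an assumption for the example) and then computes the centroid of each half:
###############
# Median split, then centroid of each half
###############
def split_by_median(points, axis=0):
    # Sort on one axis and split the sorted list in half at the median
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return pts[:mid], pts[mid:]

def centroid(points):
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

points = [(1, 2), (3, 4), (5, 6), (7, 8)]
left, right = split_by_median(points, axis=0)
print("Left centroid:", centroid(left))    # (2.0, 3.0)
print("Right centroid:", centroid(right))  # (6.0, 7.0)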
Hi, very good video! I have a doubt! In Ball Tree, when calculating the centroid and then the farthest 2 points, how do we come up with this farthest point? Is it sort of brute force?
@@learndataa Yeah, I mean you'd have to compute the distances at every iteration for that set, and then the set becomes divided? But what about the iterations after that? I think for the first run-through, while you assemble the tree, it will take the same time as a for-loop. But after you have the tree, inserting and comparing every new input will only be as simple as finding the nearest set point and computing the few distances inside the set/node/leaf. You're technically on the test data, not on the 'training data', better called the knowledge base or corpus. I may be wrong though, let me know if this makes sense or if someone really knows the correct answer? Seems really interesting!
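For what it's worth, here is a minimal sketch of the brute-force farthest-point step being discussed, which is just one linear pass over the points in the node (example data only):
###############
# Farthest point from the centroid (linear scan)
###############
import math

def farthest_point(points, centroid):
    # One pass: keep the point with the maximum distance to the centroid
    return max(points, key=lambda p: math.dist(p, centroid))

points = [(1, 2), (3, 4), (5, 6), (7, 8)]
print(farthest_point(points, (4.0, 5.0)))  # (1, 2): tied with (7, 8), the first one wins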
Great, thanks
Thank you.