Probabl
Joined 6 Feb 2024
This is the official Probabl YouTube channel where we feature humans teaching and learning about machine learning, data science, and open source. More often than not, we will discuss scikit-learn, as well as a plethora of other tools and libraries to help data scientists, data engineers and data owners extract the most value out of their data.
Playing with the classification report
In this video we will play around with a confusion matrix widget that will help us understand how the numbers in the classification report in scikit-learn are created. The classification report is a great utility, but it can help to remind oneself of what the numbers really mean.
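A minimal sketch of the report in question, using toy labels that are not from the video:

```python
# The classification report is derived from the counts in the confusion matrix.
from sklearn.metrics import classification_report, confusion_matrix

y_true = [0, 0, 1, 1, 1, 2, 2, 2, 2, 0]
y_pred = [0, 1, 1, 1, 0, 2, 2, 1, 2, 0]

print(confusion_matrix(y_true, y_pred))      # raw counts per (true, predicted) class
print(classification_report(y_true, y_pred)) # precision, recall, F1 and support per class
```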
Scikit-learn documentation:
scikit-learn.org/1.5/modules/generated/sklearn.metrics.classification_report.html
Appendix with notebooks:
github.com/probabl-ai/youtube-appendix/tree/main/16-metrics
Website: probabl.ai/
LinkedIn: www.linkedin.com/company/probabl
Twitter: x.com/probabl_ai
Bluesky: bsky.app/profile/probabl.bsky.social
Discord: discord.probabl.ai
We also host a podcast called Sample Space, which you can find on your favourite podcast player. All the links can be found here:
rss.com/podcasts/sample-space/
#probabl
Views: 388
Videos
Introducing the EstimatorReport
Views: 788 · 9 hours ago
Skore version 0.6 introduces a new EstimatorReport, which can be seen as a wrapper around an estimator that automatically detects all the relevant metrics and charts. The goal of the project is to be a useful sidekick for scikit-learn and you can expect more utilities to be released soon. Links: skore v0.6 documentation: skore.probabl.ai/0.6/index.html skore GitHub repository: github.com/probab...
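As a rough illustration of the idea only (plain scikit-learn; this is not the skore 0.6 API, for which the linked documentation is the reference):

```python
# Sketch of the "estimator report" idea: wrap a fitted estimator and gather
# the obvious metrics for it in one place. Toy data, illustrative only.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)
pred = clf.predict(X_test)
proba = clf.predict_proba(X_test)[:, 1]

report = {
    "accuracy": accuracy_score(y_test, pred),
    "f1": f1_score(y_test, pred),
    "roc_auc": roc_auc_score(y_test, proba),
}
print(report)
```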
Time for some (extreme) distillation with Thomas van Dongen - founder of the Minish Lab
Views: 626 · 1 day ago
Word embeddings might feel like they are a little bit out of fashion. After all, we have attention mechanisms and transformer models now, right? Well, it turns out that if you apply distillation the right way you can actually get highly performant word embeddings out. It's a technique featured by the model2vec project from the Minish Lab and in this episode we talk to the founder to learn more ...
Dumb models can be very smart
Views: 1.1K · 14 days ago
Dummy models are models that really just make a prediction without learning any major patterns from your dataset. But what makes them useful is that they can be compared to other models. If your trained system cannot outperform a dummy model then you've got a signal to dive deeper. 00:00 Just metrics 04:11 Toward dummy models 07:35 Regression as well Website: probabl.ai/ LinkedIn: www.linkedin....
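A hedged sketch of that comparison on made-up data:

```python
# Compare a trained model against a dummy baseline that ignores the features.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

baseline = DummyClassifier(strategy="most_frequent")  # always predicts the majority class
model = LogisticRegression(max_iter=1000)

print("dummy   :", cross_val_score(baseline, X, y, scoring="accuracy").mean())
print("trained :", cross_val_score(model, X, y, scoring="accuracy").mean())
# If the trained model does not beat the dummy, that is a signal to dig deeper.
```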
What the official scikit-learn certification looks like
Views: 596 · 1 month ago
We got a lot of questions about our certification program. Some of these questions weren't so much about the material, but more about the medium and interface of the actual exam. That's why we made this small recording. It shows what you can expect as we go through a few questions in a mock exam. 00:00 Intro and setup 01:56 Starting the exam 03:19 First questions 05:41 Programming exercise 09:47 Final ...
When precision equals recall
Views: 905 · 1 month ago
Precision can actually be equal to recall. For balanced datasets it can even be pretty common! But understanding when this happens may also help you understand both metrics a bit more. 00:00 Introduction 00:32 Experiment setup 03:45 Code 06:45 Why? 09:30 Math proof via sympy Appendix with notebooks: github.com/probabl-ai/youtube-appendix/tree/main/16-metrics Website: probabl.ai/ LinkedIn: www.l...
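For a quick intuition: precision equals recall whenever the number of false positives equals the number of false negatives, as in this made-up example:

```python
# Toy labels with one false negative and one false positive,
# so precision and recall coincide.
from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]  # TP=3, FN=1, FP=1

print(precision_score(y_true, y_pred))  # 3 / (3 + 1) = 0.75
print(recall_score(y_true, y_pred))     # 3 / (3 + 1) = 0.75
```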
Precision, recall and F1-score
Views: 667 · 1 month ago
Metrics are important. If you are careless with them you will have a bad time comparing algorithms. That's why we will dive deeper into metrics in the next few videos. To get things started, let's dive into precision, recall and the F1 score. These metrics are common, but they are also intimately related. 00:00 Introduction 00:30 Example 04:46 Shape of F1 score 06:37 Code Appendix with notebook...
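A tiny sketch of how the three numbers relate, using toy counts rather than the video's example:

```python
# F1 is the harmonic mean of precision and recall.
tp, fp, fn = 30, 10, 20

precision = tp / (tp + fp)   # 0.75
recall = tp / (tp + fn)      # 0.60
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, f1)  # 0.75 0.6 0.666...
```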
Imbalanced-learn: regrets and onwards - with Guillaume Lemaitre, core-maintainer
Views: 829 · 1 month ago
Imbalanced learn is one of the most popular scikit-learn projects out there. It has support for resampling techniques which historically have always been used for imbalanced classification use-cases. However, now that we are a few years down the line, it may be time to start rethinking the library. As it turns out, other techniques may be preferable. We talk to the maintainer, Guillaume Lemaitr...
Why the MinHashEncoder is great for boosted trees
Views: 1K · 2 months ago
Why the MinHashEncoder is great for boosted trees
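A hedged sketch of the pairing in question; it assumes the skrub library's MinHashEncoder and uses made-up data, so check the skrub documentation for the exact parameters:

```python
# High-cardinality string column -> MinHashEncoder -> gradient-boosted trees.
import pandas as pd
from skrub import MinHashEncoder
from sklearn.compose import make_column_transformer
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.pipeline import make_pipeline

titles = ["data scientist", "senior data engineer", "ml engineer", "data analyst"]
df = pd.DataFrame({"job_title": titles * 25})   # 100 toy rows
y = [100, 120, 110, 90] * 25                    # toy target

encode = make_column_transformer(
    (MinHashEncoder(n_components=30), ["job_title"]),
)
model = make_pipeline(encode, HistGradientBoostingRegressor())
model.fit(df, y)
```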
You want to be in control of your own Copilot with Ty Dunn - founder of Continue.dev
Views: 308 · 2 months ago
You want to be in control of your own Copilot with Ty Dunn - founder of Continue.dev
What it is like to maintain the scikit-learn docs with David Arturo Amor Quiroz, docs maintainer
Views: 390 · 2 months ago
What it is like to maintain the scikit-learn docs with David Arturo Amor Quiroz, docs maintainer
Sqlite can totally do embeddings now with Alex Garcia, creator of sqlite-vec
Views: 1.3K · 3 months ago
Sqlite can totally do embeddings now with Alex Garcia, creator of sqlite-vec
How to rethink the notebook with Akshay Agrawal, co-creator of Marimo
Views: 1.1K · 3 months ago
How to rethink the notebook with Akshay Agrawal, co-creator of Marimo
Feature engineering for overlapping categories
Views: 806 · 4 months ago
Feature engineering for overlapping categories
You're always (always!) dealing with many (many!) tables - with Madelon Hulsebos
Views: 900 · 4 months ago
You're always (always!) dealing with many (many!) tables - with Madelon Hulsebos
How Narwhals has many end users ... that never use it directly. - Marco Gorelli
Views: 633 · 5 months ago
How Narwhals has many end users ... that never use it directly. - Marco Gorelli
More flexible models via sample weights
Views: 861 · 5 months ago
More flexible models via sample weights
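A small sketch of the idea on synthetic data, with an entirely made-up weighting rule:

```python
# Steer a model with sample weights: pretend rows are ordered in time and
# give newer rows more influence than older ones.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)

age = np.arange(len(y))[::-1]          # 0 = newest row, 199 = oldest row
sample_weight = np.exp(-age / 50.0)    # exponentially decaying importance

model = LogisticRegression(max_iter=1000)
model.fit(X, y, sample_weight=sample_weight)
```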
Why ridge regression typically beats linear regression
Views: 1.5K · 5 months ago
Why ridge regression typically beats linear regression
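A hedged sketch of the comparison on synthetic, correlated features (not the video's exact setup):

```python
# Ridge vs. plain least squares on noisy, correlated features with few samples.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=100, n_features=80, noise=10.0,
                       effective_rank=10, random_state=0)

print("OLS  :", cross_val_score(LinearRegression(), X, y).mean())
print("Ridge:", cross_val_score(Ridge(alpha=1.0), X, y).mean())
# With many correlated features and little data, the small L2 penalty
# usually gives a better cross-validated R^2 than plain least squares.
```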
Understanding how the KernelDensityEstimator works
Views: 835 · 6 months ago
Understanding how the KernelDensityEstimator works
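A minimal sketch using scikit-learn's KernelDensity class (the estimator discussed above), on made-up 1D data:

```python
# Fit a kernel density estimate on a bimodal 1D sample, then evaluate it on a grid.
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 0.5, 200), rng.normal(3, 1.0, 200)])

kde = KernelDensity(kernel="gaussian", bandwidth=0.4)
kde.fit(x.reshape(-1, 1))

grid = np.linspace(-5, 7, 200).reshape(-1, 1)
density = np.exp(kde.score_samples(grid))   # score_samples returns the log-density
```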
Pragmatic data science checklists with Peter Bull
Views: 978 · 6 months ago
Pragmatic data science checklists with Peter Bull
Don't worry too much about missing data
Views: 1K · 6 months ago
Don't worry too much about missing data
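A short sketch of two low-stress options on toy data:

```python
# Handle missing values by imputing explicitly, or by using an estimator
# that accepts NaN values natively.
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

X = np.array([[1.0, np.nan], [2.0, 0.5], [np.nan, 1.5], [3.0, 2.0]] * 25)
y = np.array([0, 1, 0, 1] * 25)

# Option 1: impute, then model.
pipe = make_pipeline(SimpleImputer(strategy="median"), LogisticRegression())
pipe.fit(X, y)

# Option 2: HistGradientBoosting* handles NaN values out of the box.
HistGradientBoostingClassifier().fit(X, y)
```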
This is an excellent addition.
I think their website used to say something like "airflow was built for 2015...". It seems quite right 😉 This is certainly much better than airflow and I hope it will be here for some time. What are work pools, automations and blocks? Thank you
Insightful, as always! By the way, what do you use to make annotations on screen?
Screenbrush.
What is that notebook platform you are using?
This notebook uses Marimo; you can find a livestream on that tool, as well as a podcast with one of the creators, on our YT channel.
Great information
Thanks!
What a clear explanation, for example, of the slider that works as an input field and the chart as an output, where user input is used to update or generate something on the page. This approach really helps me understand the pattern.
This looks very nice
this is so cool
Many of us come from the AWS background, modal is 👽no more. Will have to certainly check this out. Thanks as always. 🙏
Not sure what you mean with "modal is 👽no more", could you elaborate?
@@probabl_ai Sure Modal is alien no more.
very cool use of Modal!!
We like to think so, yeah :)
Amazing!
Thank you! Cheers!
Amazing stuff
Thanks!
Thank you for sharing
My pleasure
Perfect timing for me!!! Thanks
Enjoy!
Amazing as always
Thank you! Cheers!
Thank you Vincent! This is very helpful for understanding purposes. Please do more of these type of presentations.
Will do!
Thanks Vincent for this amazing video
Thank you for sharing 🙂
Thanks for the video. Love your videos Vincent.
Thanks for watching everyone!
Your channel is amazing. I love your content, especially because it provides so much clarity compared to what we learn in uni.
@@mberoakoko24 Happy to hear it!
love your videos Vincent, keep them coming <3 <3
Also, I wanted to do the internship; I won't say I was the perfect candidate, but I would have applied. Now it is gone :') :')
Another clear and fun video to start the year… happy new year ! 🎉
Ah yes, how wonderful... we now get yet another pricey certification which lasts just a couple of years. Rejoice everyone ✨ Maybe, instead of trying to capitalize on a certification, it would've been better to begin by creating a comprehensive resource to learn scikit-learn. Something a bit more in-depth and hands-on than the existing Inria MOOC or your random videos (which, by the way, are great but clearly not suited to help newcomers to get started with scikit-learn)
Totally agree - scikit-learn is NOT easy to learn. A GOOD resource to learn the most important parts would be fantastic. And tell us what REALLY are the vital bits to understand... it's big...
I agree. It's not about the price but the annoying thing of certification expiry. This is the reason this certification would never be popular and eventually a newer/better package would replace scikit-learn. This is a missed opportunity.
We appreciate the feedback but would also like to clarify a few things here.
1. The reason there is an expiry is that best practices change over time. The library is evolving (2 years represent 4 releases), and so is knowledge. We want to make sure you're up to date on good practices. Just to give one example: a few years ago we might've recommended imbalanced-learn for imbalanced classification use-cases, but we have since dropped this recommendation because calibration tends to be a better avenue for those kinds of problems. Our recent podcast dives into this topic for those who are interested. It is because of this "techniques have to be re-evaluated" phenomenon that we also set an expiry date on the certification. If the recommendations change over time, the certification should also get an update. Whenever somebody wants to renew their certification they can do so at a reduced price and with a shorter exam. We're still working out the details of this, but we don't plan on charging the same full amount for a certification renewal/update.
2. We are working on a good resource for the certification exam, as we understand that it can be frustrating to prepare for an exam without having a clear guide on what is expected. For now we have the official scikit-learn MOOC (a free resource that we invested in), as well as the details on the certification site that describe the expectations. We are working on adding material that is more focussed on the certification. Note that the video description now contains some extra links to these aforementioned resources (adding them in the comments usually triggers the YT anti-spam mechanism).
3. We hope that people can appreciate that we do our best to make this certification accessible and affordable, but that we also hope to develop a stream of income that helps fund the scikit-learn project. There are many other courses and certification providers out there that easily charge ten times our listed price while they contribute *nothing* to the maintenance of the project.
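For readers wondering what "calibration instead of resampling" can look like in practice, here is a rough, illustrative sketch (not certification material; the 0.2 threshold is arbitrary):

```python
# Keep all the data, calibrate the probabilities, then pick a decision
# threshold that matches the actual cost of each error type.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

clf = CalibratedClassifierCV(LogisticRegression(max_iter=1000), method="isotonic")
clf.fit(X_train, y_train)

proba = clf.predict_proba(X_test)[:, 1]
pred = (proba >= 0.2).astype(int)   # threshold chosen for the use-case, not 0.5
```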
Could we access pandas and numpy's doc as scikit learn heavily relies on it ?
Technically, if you have access to a Jupyter environment you will always have access to the docstring via help(thing). Dunno about the rest of the docs though.
Alas, in principle folks only have access to the scikit-learn doc, not pandas. That said, we try to avoid doing really elaborate pandas/numpy stuff. And, as mentioned below, you still have access to docstrings from Jupyter.
Nice illustration of the relationship between these metrics, and happy to discover SymPy, which I did not know 😊
Oh it is a super cool project!
What’s your setup for drawing on screen?
A cheap drawing tablet and an app called screenbrush.
What do you think about purposefully varying the random seed to verify your model's sensitivity to randomness in formal experiments? I've been discussing this a lot with my colleagues recently, and I have been doing this type of analysis especially with neural network experiments. Some people advised me not to do this, so as to avoid the temptation of hand-picking the "best" random state... however, other people have been saying that a random seed is as much of a hyperparameter as any other, so it would be fine to hand-pick it...
You might enjoy this TIL on the topic: koaning.io/til/optimal-seeds/ That said, let's think for a moment what the `random_seed` is meant for. The reason that it exists is to allow for repeatability of the experiment. When we set the seed, we hope to ensure that future runs give us the same result. That's a useful mechanism, but notice how it is not a mechanism that is meant as a control lever of an algorithm. That's why I find there is something dirty about finding an "optimal" algorithm by changing the seed value. It's meant to control consistency, not model performance. There is a middle path though: use randomized search. Adding a random seed here is "free" in the sense that it does not blow up the search space, but it might allow you to measure the effect of randomness in hindsight. Does this help?
@@probabl_ai Makes sense to me! Thanks.
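A sketch of that "middle path" with RandomizedSearchCV, treating the seed as one more sampled parameter (illustrative ranges only):

```python
# Include random_state in a randomized search so its effect can be inspected
# afterwards, without hand-picking a lucky seed.
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=500, random_state=0)

search = RandomizedSearchCV(
    RandomForestClassifier(),
    param_distributions={
        "max_depth": randint(2, 10),
        "n_estimators": randint(50, 300),
        "random_state": randint(0, 10_000),   # measured, not optimised for
    },
    n_iter=20,
    random_state=0,
)
search.fit(X, y)
# search.cv_results_ now shows how much of the spread is explained by the seed.
```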
How do we study for this cert?
You can follow the official Scikit-learn MOOC. The Associate certification level is based on it.
Thank You!
We have a special LAUNCH20 discount code - valid until end of December... now is the perfect time to schedule 🚀
Super episode! A great reflection on how the lessons learned throughout the development of imbalanced-learn are in fact the library's greatest treasures.
Finally, found someone who knows his stuff and more importantly knows how to teach. Thanks for sharing this much depth and breadth of knowledge.
This exposes how data with periodic components can cause resonances in analytical processes that likewise make use of periodic components. Randomness and complexity can appear the same on the surface, but if something appearing random was constructed by combining multiple relatively-prime roots (for example), those roots can then stick out in analysis. Using a modulo component like that can be a good way to do that.. ultimately this kinda edges into the territory covered by Fourier Transforms, in a way. Cool stuff!
This is Pure gold!
Happy to hear it!
I ran into this recently - ran 400 models with a hyperparameter search, and then discarded the top 2 (by validation %) because they were super lucky, and failed to do anything special with the holdout test set ... ultimately i settled on the 4th "best" model out of 400, its parameters were "nice" in a particular way.
Out of curiosity, did you look at the hyperparameters and did these also show regions of better performance?
Would nested cross validation help mitigate the effects of the optimizer's curse? Maybe I'm not understanding the material well - but this also reads like an issue of overfitting to the validation set.
It's more that you are battling the fact that luck may work against you.
Thought this was about DnD. Came for the dragon, stayed for the interesting science stuff.
How so? Is the optimisers curse a DnD thing?
@@probabl_ai I am not aware of one, but the name certainly sounds like it could be referring to a D&D optimizer. As someone with a lot of science and D&D content showing up in my feed, I honestly half thought it was related too. Haha.
For your split-hash bloom vectorizer, I don't understand how it won't get the same collisions again. If your original hash h has a collision, then taking sliding windows of that hash to make multiple hashes will result in the same collisions, no?
The original hash has *a lot* of entropy, so the odds of getting a collision there are very small. But we reduce that entropy by a lot when we introduce a set size for the vocab. In the example here I take 10_000, which is a tiny set of values compared to the original possible values that the hash could give. The concern isn't so much that the original integer from the hash can collide, rather that the (hash_int % vocab) value might. When you look at it this way, do you still have the same concern? Now that I think more of it, you are right to be critical of what I proposed here. On reflection, I think that we can (and should) do much better than a sliding window because this still introduces a relationship between (window a) and (window a + 1). Instead it might be better to just calculate a big hash and to chop it up into non-overlapping segments.
@@probabl_ai I think I have more holes in my knowledge than I realised, I'll learn more about hashes
@@cunningham.s_law No worries, feel free to keep asking questions because it is totally possible that I have made a mistake.
MMH3 hash seems to only generate 16 bytes (128 bits). Isn't sliding window kinda limited in that case?
@@Mayur7Garg as mentioned before, the sliding window is indeed suboptimal. But the number of bits could still work for a bloom vectoriser. You need a few thousands of buckets, not a million.
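A small sketch of the "one big hash, chopped into non-overlapping segments" idea from this thread, using hashlib instead of mmh3; the bucket and segment counts are arbitrary:

```python
# Derive several bucket indices for one token from a single large hash,
# using non-overlapping byte segments rather than a sliding window.
import hashlib

def bloom_indices(token: str, n_hashes: int = 4, n_buckets: int = 10_000):
    digest = hashlib.sha256(token.encode("utf-8")).digest()  # 32 bytes of hash
    segment = len(digest) // n_hashes                         # 8 bytes per index
    return [
        int.from_bytes(digest[i * segment:(i + 1) * segment], "big") % n_buckets
        for i in range(n_hashes)
    ]

print(bloom_indices("hello"))   # four independent bucket indices in [0, 10_000)
```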
Or someone might say that you fool yourself since you look at the trained models and assume that the best model is a part of that set when your optimization problem most likely is non-convex.
I've even seen this happen on convex models actually. Granted, when this happened it was related to the `max_iter` variable being too low, so it wasn't converging properly. Bit of an edge case, but the devil is in the details.
@ 4:50 "And this phenomenon actually has a name" I was 100% certain you were going to say null hypothesis significance testing, because that's what it's called
It's not a wrong perspective, but the phenomenon being described is the counterintuitive situation where adding more hyper-parameters may make the "best performance" statistic less reliable. Hypothesis testing also tends to be a bit tricky in the domain of hyper-parameters, mainly around the question of what underlying distribution/test you can assume.
love these, Sample Space is my favourite podcast out there
Quality video
Hey @vincent, in your WASM demo you effectively generated all the data points in the notebook itself. How would one go about accessing the data from a source? Would it be possible to include data and send it while generating the WASM link? Or is there something else? Appreciate your inputs. Thank you. Have a great day.
If the datasource is open, say on Github, then you should be able to fetch it as you would normally in Python. However, if you need private keys then it is a different story. I would not do anything with private data here because everything you share with Marimo now is public on the internet.
@@probabl_aiNoted. Thank you.
is it possible to get the gradients of the hyperparameters?
Typically the hyperparameters do not have "gradients". To use Pytorch as an analogy, the weights of a neural network might be differentiable, but the amount of dropout isn't. Not to mention that a lot of algorithms in scikit-learn aren't based on gradient algorithms.
It seems like running a bog-standard factor analysis after the tests would reveal this. It's basically what you are doing in your visualizer, except it can run on thousands of parameters more than you can visualize, and it feels more formal than "ey, this graph looks like it has a correlation".
Statistical tests can certainly be a good idea for sure, but confirming the results with a visual can be a nice insurance policy either way.
Factor analysis finds linear relationships, which is good, but there are important nonlinear relationships between hyperparameters, especially for complex models and/or datasets (learning rate vs batch size for neural networks is one common example of this).
@@joshuaspeckman7075 Good point!
It would have hurt if you didn't choose 42, so genuinely Thank You!
I'd be pretty suspicious if the result of my random tests looked like a bell curve over some random hyper parameters. I'd start to think that probably any hyper-parameter would do basically the same thing down to some natural variability in the score. I guess we can do hypothesis testing to determine whether our results are significant.
A uniform distribution might also be suspicious. But the main lesson is that you can indeed apply a healthy amount of doubt when looking at these stats. They can give you an optimistic version of reality and it can be very tempting to get fooled by good numbers.
I know it's not practical, but this is one of the reasons I get really crosswise with introducing the concept of scoring to any decision makers higher up the pay ladder. It's SOOOOOO easy for these things to become THE measure of a model instead of A measure. Although that problem goes the other way as well. I've seen a professor fit a battery of models and just pick the highest score... which sort of defeats the purpose of the statistician. Idk how much value is lost in just withholding scores from anyone not trained up in the stats behind them.
Goodhart's law is indeed something to look out for. When a metric becomes a target, it is not a great metric anymore.
Can this be used to define sort of reliability for the model? For a given RF with fixed hyper params, calculate the scores for various random states. Then use the standard deviation to depict the spread in score only due to randomness. The idea being that for a given model and data, a lower spread means that the model is able to model the data similarly in all instances irrespective of any randomness in it. If the spread is high, the model might be too sensitive, or the data needs more engineering.
I might be a bit careful there. By using a random seed you will sample from a distribution that is defined by the hyperparams. Change the hyperparams and you will have another distribution, and I don't know if you can claim anything general about the difference between these two distributions upfront.
@probabl_ai As I said fixed hyper params. I am only updating the random seed. The idea is to establish some sort of likelihood of the model's score distribution just due to chance. So instead of saying that the model score is 0.7 for some specific random seed value, I can say something like model score is 0.7±0.1 where the latter is the std of the scores or that the model scores over 0.6 for 95% of random seed values.
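A minimal sketch of that spread-over-seeds measurement (synthetic data, fixed hyperparameters, only the seed varies):

```python
# Same data, same hyperparameters; only random_state changes between runs.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

scores = [
    cross_val_score(RandomForestClassifier(n_estimators=100, random_state=seed),
                    X, y).mean()
    for seed in range(20)
]
print(f"{np.mean(scores):.3f} ± {np.std(scores):.3f}")  # spread due to randomness alone
```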
Is a random state similar to an initial starting point? So with a bad random state you end up in a local minimum?
I must say the first half left me with an uneasy taste, seeing vectors with "apples and oranges" features being thrown at a distance-based method like KNN, but I was delighted to see you come out strong on the other side and address the issue in the second half. The Ridge trick is pretty slick. 👍 Great video, great channel!
In order to make a point sometimes you first have to show how not to do it, that way it is easier to motivate the alternative route. Happy to hear you like the content!
Thanks for the optuna exploration. Do you have the link to the notebook?
d0h! My bad! Just added a link to the notebook in the shownotes of this video.
This is a common problem I think everyone experiences at some point, and understanding the model as well as having metrics that cover a wide variety of edge cases both seem to resolve this quite well. There are also plenty of strategies to circumvent the issue, such as the cross-validation you showcased, but more "stratified" approaches also exist, such as genetic algorithms or particle swarm optimization. My issue, however, is how to deal with this when you have a limited amount of compute on hand and wish to obtain a good result without having to spend a lot of time testing until you isolate the good hyperparameters from the more noisy ones. Obviously I don't expect a one-size-fits-all solution, but I'd love to hear what solutions or workarounds people use, especially nowadays when models are getting bigger and bigger.
There is certainly a balance there yeah. Not everyone has unlimited compute. The simplest answer is to just try and remain pragmatic. Make sure you invest a little bit in visuals and don't trust your numbers blindly. Really think about what metric matters and really try to think about what problem needs to be solved in reality.
For deep models, initializing weights sampled from a relatively small-variance Gaussian distribution has been shown to give faster convergence. Andrej Karpathy doesn't touch on it in his making-GPT video, but if you go to the GitHub code you can see the change. Also, adding a weight-size penalty to the loss can encourage the model to come up with more general parameters, but this can be very delayed (grokking). I have seen several gradient and double-descent methods that basically "pick up the signal" early, though. Remember, for nontrivial tasks and good architecture this is more icing on the cake.
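A tiny numpy sketch of the fan-in scaling being referenced (illustrative constants, not the GPT code mentioned above):

```python
# Weights drawn from a Gaussian whose variance shrinks with fan-in keep
# activation magnitudes roughly stable; unit-variance init blows them up.
import numpy as np

rng = np.random.default_rng(0)
fan_in, fan_out = 512, 256

x = rng.normal(size=(32, fan_in))                       # a batch of unit-scale inputs
w_naive = rng.normal(0, 1.0, size=(fan_in, fan_out))    # unit-variance init
w_scaled = rng.normal(0, 1.0 / np.sqrt(fan_in), size=(fan_in, fan_out))

print((x @ w_naive).std())    # roughly sqrt(fan_in) ~ 22: activations explode
print((x @ w_scaled).std())   # roughly 1: activations stay near the input scale
```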