Probabl
Joined 6 Feb 2024
This is the official Probabl YouTube channel where we feature humans teaching and learning about machine learning, data science, and open source. More often than not, we will discuss scikit-learn, as well as a plethora of other tools and libraries to help data scientists, data engineers and data owners extract the most value out of their data.
Playing with the classification report
In this video we will play around with a confusion matrix widget that will help us understand how the numbers in the classification report in scikit-learn are created. The classification report is a great utility, but it can help to remind oneself of what the numbers really mean.
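A minimal sketch of the report in question, using toy labels that are not from the video:

```python
# The classification report is derived from the counts in the confusion matrix.
from sklearn.metrics import classification_report, confusion_matrix

y_true = [0, 0, 1, 1, 1, 2, 2, 2, 2, 0]
y_pred = [0, 1, 1, 1, 0, 2, 2, 1, 2, 0]

print(confusion_matrix(y_true, y_pred))      # raw counts per (true, predicted) class
print(classification_report(y_true, y_pred)) # precision, recall, F1 and support per class
```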
Scikit-learn documentation:
scikit-learn.org/1.5/modules/generated/sklearn.metrics.classification_report.html
Appendix with notebooks:
github.com/probabl-ai/youtube-appendix/tree/main/16-metrics
Website: probabl.ai/
LinkedIn: www.linkedin.com/company/probabl
Twitter: x.com/probabl_ai
Bluesky: bsky.app/profile/probabl.bsky.social
Discord: discord.probabl.ai
We also host a podcast called Sample Space, which you can find on your favourite podcast player. All the links can be found here:
rss.com/podcasts/sample-space/
#probabl
Views: 388
Videos
Introducing the EstimatorReport
Views: 788 · 9 hours ago
Skore version 0.6 introduces a new EstimatorReport, which can be seen as a wrapper around an estimator that automatically detects all the relevant metrics and charts. The goal of the project is to be a useful sidekick for scikit-learn and you can expect more utilities to be released soon. Links: skore v0.6 documentation: skore.probabl.ai/0.6/index.html skore GitHub repository: github.com/probab...
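As a rough illustration of the idea only (plain scikit-learn; this is not the skore 0.6 API, for which the linked documentation is the reference):

```python
# Sketch of the "estimator report" idea: wrap a fitted estimator and gather
# the obvious metrics for it in one place. Toy data, illustrative only.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)
pred = clf.predict(X_test)
proba = clf.predict_proba(X_test)[:, 1]

report = {
    "accuracy": accuracy_score(y_test, pred),
    "f1": f1_score(y_test, pred),
    "roc_auc": roc_auc_score(y_test, proba),
}
print(report)
```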
Time for some (extreme) distillation with Thomas van Dongen - founder of the Minish Lab
Views: 626 · 1 day ago
Word embeddings might feel like they are a little bit out of fashion. After all, we have attention mechanisms and transformer models now, right? Well, it turns out that if you apply distillation the right way you can actually get highly performant word embeddings out. It's a technique featured by the model2vec project from the Minish Lab and in this episode we talk to the founder to learn more ...
Dumb models can be very smart
Views: 1.1K · 14 days ago
Dummy models are models that really just make a prediction without learning any major patterns from your dataset. But what makes them useful is that they can be compared to other models. If your trained system cannot outperform a dummy model then you've got a signal to dive deeper. 00:00 Just metrics 04:11 Toward dummy models 07:35 Regression as well Website: probabl.ai/ LinkedIn: www.linkedin....
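A hedged sketch of that comparison on made-up data:

```python
# Compare a trained model against a dummy baseline that ignores the features.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

baseline = DummyClassifier(strategy="most_frequent")  # always predicts the majority class
model = LogisticRegression(max_iter=1000)

print("dummy   :", cross_val_score(baseline, X, y, scoring="accuracy").mean())
print("trained :", cross_val_score(model, X, y, scoring="accuracy").mean())
# If the trained model does not beat the dummy, that is a signal to dig deeper.
```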
What the official scikit-learn certification looks like
Views: 596 · 1 month ago
We got a lot of questions about our certification program. Some of these questions weren't so much about the material, but more about the medium and interface of the actual exam. That's why we made this small recording. It shows what you can expect as we go through a few questions in a mock exam. 00:00 Intro and setup 01:56 Starting the exam 03:19 First questions 05:41 Programming exercise 09:47 Final ...
When precision equals recall
Views: 905 · 1 month ago
Precision can actually be equal to recall. For balanced datasets it can even be pretty common! But understanding when this happens may also help you understand both metrics a bit more. 00:00 Introduction 00:32 Experiment setup 03:45 Code 06:45 Why? 09:30 Math proof via sympy Appendix with notebooks: github.com/probabl-ai/youtube-appendix/tree/main/16-metrics Website: probabl.ai/ LinkedIn: www.l...
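For a quick intuition: precision equals recall whenever the number of false positives equals the number of false negatives, as in this made-up example:

```python
# Toy labels with one false negative and one false positive,
# so precision and recall coincide.
from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]  # TP=3, FN=1, FP=1

print(precision_score(y_true, y_pred))  # 3 / (3 + 1) = 0.75
print(recall_score(y_true, y_pred))     # 3 / (3 + 1) = 0.75
```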
Precision, recall and F1-score
Views: 667 · 1 month ago
Metrics are important. If you are careless with them you will have a bad time comparing algorithms. That's why we will dive deeper into metrics in the next few videos. To get things started, let's dive into precision, recall and the F1 score. These metrics are common, but they are also intimately related. 00:00 Introduction 00:30 Example 04:46 Shape of F1 score 06:37 Code Appendix with notebook...
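A tiny sketch of how the three numbers relate, using toy counts rather than the video's example:

```python
# F1 is the harmonic mean of precision and recall.
tp, fp, fn = 30, 10, 20

precision = tp / (tp + fp)   # 0.75
recall = tp / (tp + fn)      # 0.60
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, f1)  # 0.75 0.6 0.666...
```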
Imbalanced-learn: regrets and onwards - with Guillaume Lemaitre, core-maintainer
Views: 829 · 1 month ago
Imbalanced learn is one of the most popular scikit-learn projects out there. It has support for resampling techniques which historically have always been used for imbalanced classification use-cases. However, now that we are a few years down the line, it may be time to start rethinking the library. As it turns out, other techniques may be preferable. We talk to the maintainer, Guillaume Lemaitr...
Why the MinHashEncoder is great for boosted trees
Views: 1K · 2 months ago
Why the MinHashEncoder is great for boosted trees
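A hedged sketch of the pairing in question; it assumes the skrub library's MinHashEncoder and uses made-up data, so check the skrub documentation for the exact parameters:

```python
# High-cardinality string column -> MinHashEncoder -> gradient-boosted trees.
import pandas as pd
from skrub import MinHashEncoder
from sklearn.compose import make_column_transformer
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.pipeline import make_pipeline

titles = ["data scientist", "senior data engineer", "ml engineer", "data analyst"]
df = pd.DataFrame({"job_title": titles * 25})   # 100 toy rows
y = [100, 120, 110, 90] * 25                    # toy target

encode = make_column_transformer(
    (MinHashEncoder(n_components=30), ["job_title"]),
)
model = make_pipeline(encode, HistGradientBoostingRegressor())
model.fit(df, y)
```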
You want to be in control of your own Copilot with Ty Dunn - founder of Continue.dev
Views: 308 · 2 months ago
You want to be in control of your own Copilot with Ty Dunn - founder of Continue.dev
What it is like to maintain the scikit-learn docs with David Arturo Amor Quiroz, docs maintainer
Views: 390 · 2 months ago
What it is like to maintain the scikit-learn docs with David Arturo Amor Quiroz, docs maintainer
Sqlite can totally do embeddings now with Alex Garcia, creator of sqlite-vec
Views: 1.3K · 3 months ago
Sqlite can totally do embeddings now with Alex Garcia, creator of sqlite-vec
How to rethink the notebook with Akshay Agrawal, co-creator of Marimo
Views: 1.1K · 3 months ago
How to rethink the notebook with Akshay Agrawal, co-creator of Marimo
Feature engineering for overlapping categories
Views: 806 · 4 months ago
Feature engineering for overlapping categories
You're always (always!) dealing with many (many!) tables - with Madelon Hulsebos
Views: 900 · 4 months ago
You're always (always!) dealing with many (many!) tables - with Madelon Hulsebos
How Narwhals has many end users ... that never use it directly. - Marco Gorelli
Views: 633 · 5 months ago
How Narwhals has many end users ... that never use it directly. - Marco Gorelli
More flexible models via sample weights
Views: 861 · 5 months ago
More flexible models via sample weights
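A small sketch of the idea on synthetic data, with an entirely made-up weighting rule:

```python
# Steer a model with sample weights: pretend rows are ordered in time and
# give newer rows more influence than older ones.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)

age = np.arange(len(y))[::-1]          # 0 = newest row, 199 = oldest row
sample_weight = np.exp(-age / 50.0)    # exponentially decaying importance

model = LogisticRegression(max_iter=1000)
model.fit(X, y, sample_weight=sample_weight)
```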
Why ridge regression typically beats linear regression
Views: 1.5K · 5 months ago
Why ridge regression typically beats linear regression
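A hedged sketch of the comparison on synthetic, correlated features (not the video's exact setup):

```python
# Ridge vs. plain least squares on noisy, correlated features with few samples.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=100, n_features=80, noise=10.0,
                       effective_rank=10, random_state=0)

print("OLS  :", cross_val_score(LinearRegression(), X, y).mean())
print("Ridge:", cross_val_score(Ridge(alpha=1.0), X, y).mean())
# With many correlated features and little data, the small L2 penalty
# usually gives a better cross-validated R^2 than plain least squares.
```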
Understanding how the KernelDensityEstimator works
Views: 835 · 6 months ago
Understanding how the KernelDensityEstimator works
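A minimal sketch using scikit-learn's KernelDensity class (the estimator discussed above), on made-up 1D data:

```python
# Fit a kernel density estimate on a bimodal 1D sample, then evaluate it on a grid.
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 0.5, 200), rng.normal(3, 1.0, 200)])

kde = KernelDensity(kernel="gaussian", bandwidth=0.4)
kde.fit(x.reshape(-1, 1))

grid = np.linspace(-5, 7, 200).reshape(-1, 1)
density = np.exp(kde.score_samples(grid))   # score_samples returns the log-density
```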
Pragmatic data science checklists with Peter Bull
Views: 978 · 6 months ago
Pragmatic data science checklists with Peter Bull
Don't worry too much about missing data
Views: 1K · 6 months ago
Don't worry too much about missing data
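A short sketch of two low-stress options on toy data:

```python
# Handle missing values by imputing explicitly, or by using an estimator
# that accepts NaN values natively.
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

X = np.array([[1.0, np.nan], [2.0, 0.5], [np.nan, 1.5], [3.0, 2.0]] * 25)
y = np.array([0, 1, 0, 1] * 25)

# Option 1: impute, then model.
pipe = make_pipeline(SimpleImputer(strategy="median"), LogisticRegression())
pipe.fit(X, y)

# Option 2: HistGradientBoosting* handles NaN values out of the box.
HistGradientBoostingClassifier().fit(X, y)
```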
This is an excellent addition.
I think their website used to say something like "airflow was built for 2015...". It seems quite right 😉 This is certainly much better than airflow and I hope it will be here for some time. What are work pools, automations and blocks? Thank you
Insightful, as always! By the way, what do you use to make annotations on screen?
Screenbrush.
What is that notebook platform you are using?
This notebook uses Marimo; you can find a livestream on that tool, as well as a podcast with one of the creators, on our YT channel.
Great information
Thanks!
What a clear explanation, for example, of the slider that works as an input field and the chart as an output, where user input is used to update or generate something on the page. This approach really helps me understand the pattern.
This looks very nice
this is so cool
Many of us come from the AWS background, modal is 👽no more. Will have to certainly check this out. Thanks as always. 🙏
Not sure what you mean with "modal is 👽no more", could you elaborate?
@@probabl_ai Sure Modal is alien no more.
very cool use of Modal!!
We like to think so, yeah :)
Amazing!
Thank you! Cheers!
Amazing stuff
Thanks!
Thank you for sharing
My pleasure
Perfect timing for me!!! Thanks
Enjoy!
Amazing as always
Thank you! Cheers!
Thank you Vincent! This is very helpful for understanding purposes. Please do more of these type of presentations.
Will do!
Thanks Vincent for this amazing video
Thank you for sharing 🙂
Thanks for the video. Love your videos Vincent.
Thanks for watching everyone!
Your channel is amazing. I love your content, especially because it provides so much clarity compared to what we learn in uni.
@@mberoakoko24 Happy to hear it!
love your videos Vincent, keep them coming <3 <3
Also, I wanted to do the internship; I won't say I was the perfect candidate, but I would have applied. Now it is gone :') :')
Another clear and fun video to start the year… happy new year ! 🎉
Ah yes, how wonderful... we now get yet another pricey certification which lasts just a couple of years. Rejoice everyone ✨ Maybe, instead of trying to capitalize on a certification, it would've been better to begin by creating a comprehensive resource to learn scikit-learn. Something a bit more in-depth and hands-on than the existing Inria MOOC or your random videos (which, by the way, are great but clearly not suited to help newcomers to get started with scikit-learn)
Totally agree - scikit-learn is NOT easy to learn. A GOOD resource to learn the most important parts would be fantastic. And tell us what REALLY are the vital bits to understand... it's big...
I agree. It's not about the price but the annoying thing of certification expiry. This is the reason this certification would never be popular and eventually a newer/better package would replace scikit-learn. This is a missed opportunity.
We appreciate the feedback but would also like to clarify a few things here.
1. The reason there is an expiry is that best practices change over time. The library is evolving (2 years represent 4 releases), and so is knowledge. We want to make sure you're up to date on good practices. Just to give one example: a few years ago we might've recommended imbalanced-learn for imbalanced classification use-cases, but we have since dropped this recommendation because calibration tends to be a better avenue for those kinds of problems. Our recent podcast dives into this topic for those who are interested. It is because of this "techniques have to be re-evaluated" phenomenon that we also set an expiry date on the certification. If the recommendations change over time, the certification should also get an update. Whenever somebody wants to renew their certification they can do so at a reduced price and with a shorter exam. We're still working out the details of this, but we don't plan on charging the same full amount for a certification renewal/update.
2. We are working on a good resource for the certification exam, as we understand that it can be frustrating to prepare for an exam without having a clear guide on what is expected. For now we have the official scikit-learn MOOC (a free resource that we invested in), as well as the details on the certification site that describe the expectations. We are working on adding material that is more focussed on the certification. Note that the video description now contains some extra links to these aforementioned resources (adding them in the comments usually triggers the YT anti-spam mechanism).
3. We hope that people can appreciate that we do our best to make this certification accessible and affordable, but that we also hope to develop a stream of income that helps fund the scikit-learn project. There are many other courses and certification providers out there that easily charge ten times our listed price while they contribute *nothing* to the maintenance of the project.
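For readers wondering what "calibration instead of resampling" can look like in practice, here is a rough, illustrative sketch (not certification material; the 0.2 threshold is arbitrary):

```python
# Keep all the data, calibrate the probabilities, then pick a decision
# threshold that matches the actual cost of each error type.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

clf = CalibratedClassifierCV(LogisticRegression(max_iter=1000), method="isotonic")
clf.fit(X_train, y_train)

proba = clf.predict_proba(X_test)[:, 1]
pred = (proba >= 0.2).astype(int)   # threshold chosen for the use-case, not 0.5
```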
Could we access pandas and numpy's doc as scikit learn heavily relies on it ?
Technically, if you have access to a Jupyter environment you will always have access to the docstring via help(thing). Dunno about the rest of the docs though.
Alas, in principle folks only have access to the scikit-learn doc, not pandas. That said, we try to avoid doing really elaborate pandas/numpy stuff. And, as mentioned below, you still have access to docstrings from Jupyter.
Nice illustration of the relationship between these metrics, and happy to discover SymPy, which I did not know 😊
Oh it is a super cool project!
What’s your setup for drawing on screen?
A cheap drawing tablet and an app called screenbrush.
What do you think about purposefully varying the random seed to verify your model's sensitivity to randomness in formal experiments? I've been discussing this a lot with my colleagues recently, and I have been doing this type of analysis especially with neural network experiments. Some people advised me not to do this, so as to avoid the temptation of hand-picking the "best" random state... however, other people have been saying that a random seed is as much of a hyperparameter as any other, so it would be fine to hand-pick it...
You might enjoy this TIL on the topic: koaning.io/til/optimal-seeds/ That said, let's think for a moment what the `random_seed` is meant for. The reason that it exists is to allow for repeatability of the experiment. When we set the seed, we hope to ensure that future runs give us the same result. That's a useful mechanism, but notice how it is not a mechanism that is meant as a control lever of an algorithm. That's why I find there is something dirty about finding an "optimal" algorithm by changing the seed value. It's meant to control consistency, not model performance. There is a middle path though: use randomized search. Adding a random seed here is "free" in the sense that it does not blow up the search space, but it might allow you to measure the effect of randomness in hindsight. Does this help?
@@probabl_ai Makes sense to me! Thanks.
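A sketch of that "middle path" with RandomizedSearchCV, treating the seed as one more sampled parameter (illustrative ranges only):

```python
# Include random_state in a randomized search so its effect can be inspected
# afterwards, without hand-picking a lucky seed.
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=500, random_state=0)

search = RandomizedSearchCV(
    RandomForestClassifier(),
    param_distributions={
        "max_depth": randint(2, 10),
        "n_estimators": randint(50, 300),
        "random_state": randint(0, 10_000),   # measured, not optimised for
    },
    n_iter=20,
    random_state=0,
)
search.fit(X, y)
# search.cv_results_ now shows how much of the spread is explained by the seed.
```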
How do we study for this cert?
You can follow the official Scikit-learn MOOC. The Associate certification level is based on it.
Thank You!
We have a special LAUNCH20 discount code - valid until end of December... now is the perfect time to schedule 🚀
Super episode! A great reflection on how the lessons learned throughout the development of imbalanced-learn are in fact the library's greatest treasures.
Finally, found someone who knows his stuff and more importantly knows how to teach. Thanks for sharing this much depth and breadth of knowledge.
This exposes how data with periodic components can cause resonances in analytical processes that likewise make use of periodic components. Randomness and complexity can appear the same on the surface, but if something appearing random was constructed by combining multiple relatively-prime roots (for example), those roots can then stick out in analysis. Using a modulo component like that can be a good way to do that.. ultimately this kinda edges into the territory covered by Fourier Transforms, in a way. Cool stuff!
This is Pure gold!
Happy to hear it!
I ran into this recently - ran 400 models with a hyperparameter search, and then discarded the top 2 (by validation %) because they were super lucky, and failed to do anything special with the holdout test set ... ultimately i settled on the 4th "best" model out of 400, its parameters were "nice" in a particular way.
Out of curiosity, did you look at the hyperparameters and did these also show regions of better performance?
Would nested cross validation help mitigate the effects of the optimizer's curse? Maybe I'm not understanding the material well - but this also reads like an issue of overfitting to the validation set.
It's more that you are battling the fact that luck may work against you.
Thought this was about DnD. Came for the dragon, stayed for the interesting science stuff.
How so? Is the optimisers curse a DnD thing?
@@probabl_ai I am not aware of one, but the name certainly sounds like it could be referring to a D&D optimizer. As someone with a lot of science and D&D content showing up in my feed, I honestly half thought it was related too. Haha.
For your split-hash bloom vectorizer, I don't understand how it won't get the same collisions again. If your original hash h has a collision, then taking sliding windows of that hash to make multiple hashes will result in the same collisions, no?
The original hash has *a lot* of entropy, so the odds of getting a collision there are very small. But we reduce that entropy by a lot when we introduce a set size for the vocab. In the example here I take 10_000, which is a tiny set of values compared to the original possible values that the hash could give. The concern isn't so much that the original integer from the hash can collide, rather that the (hash_int % vocab) value might. When you look at it this way, do you still have the same concern? Now that I think more of it, you are right to be critical of what I proposed here. On reflection, I think that we can (and should) do much better than a sliding window because this still introduces a relationship between (window a) and (window a + 1). Instead it might be better to just calculate a big hash and to chop it up into non-overlapping segments.
@@probabl_ai I think I have more holes in my knowledge than I realised, I'll learn more about hashes
@@cunningham.s_law No worries, feel free to keep asking questions because it is totally possible that I have made a mistake.
MMH3 hash seems to only generate 16 bytes (128 bits). Isn't sliding window kinda limited in that case?
@@Mayur7Garg as mentioned before, the sliding window is indeed suboptimal. But the number of bits could still work for a bloom vectoriser. You need a few thousands of buckets, not a million.
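A small sketch of the "one big hash, chopped into non-overlapping segments" idea from this thread, using hashlib instead of mmh3; the bucket and segment counts are arbitrary:

```python
# Derive several bucket indices for one token from a single large hash,
# using non-overlapping byte segments rather than a sliding window.
import hashlib

def bloom_indices(token: str, n_hashes: int = 4, n_buckets: int = 10_000):
    digest = hashlib.sha256(token.encode("utf-8")).digest()  # 32 bytes of hash
    segment = len(digest) // n_hashes                         # 8 bytes per index
    return [
        int.from_bytes(digest[i * segment:(i + 1) * segment], "big") % n_buckets
        for i in range(n_hashes)
    ]

print(bloom_indices("hello"))   # four independent bucket indices in [0, 10_000)
```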
Or someone might say that you fool yourself since you look at the trained models and assume that the best model is a part of that set when your optimization problem most likely is non-convex.
I've even seen this happen on convex models actually. Granted, when this happened it was related to the `max_iter` variable being too low, so it wasn't converging properly. Bit of an edge case, but the devil is in the details.
@ 4:50 "And this phenomenon actually has a name" I was 100% certain you were going to say null hypothesis significance testing, because that's what it's called
It's not a wrong perspective, but the phenomenon being described is the counterintuitive situation where adding more hyper-parameters may make the "best performance" statistic less reliable. Hypothesis testing also tends to be a bit tricky in the domain of hyper-parameters, mainly around the question of what underlying distribution/test you can assume.
love these, Sample Space is my favourite podcast out there
Quality video
Hey @vincent, in your WASM demo you effectively generated all the data points in the notebook itself. How would one go about accessing the data from a source? Would it be possible to include data and send it while generating the WASM link? Or is there something else? Appreciate your inputs. Thank you. Have a great day.
If the datasource is open, say on Github, then you should be able to fetch it as you would normally in Python. However, if you need private keys then it is a different story. I would not do anything with private data here because everything you share with Marimo now is public on the internet.
@@probabl_aiNoted. Thank you.
is it possible to get the gradients of the hyperparameters?
Typically the hyperparameters do not have "gradients". To use Pytorch as an analogy, the weights of a neural network might be differentiable, but the amount of dropout isn't. Not to mention that a lot of algorithms in scikit-learn aren't based on gradient algorithms.
It seems like running a bog-standard factor analysis after the tests would reveal this. It's basically what you are doing in your visualizer, except it can run on thousands of parameters more than you can visualize, and it feels more formal than "ey, this graph looks like it has a correlation".
Statistical tests can certainly be a good idea for sure, but confirming the results with a visual can be a nice insurance policy either way.
Factor analysis finds linear relationships, which is good, but there are important nonlinear relationships between hyperparameters, especially for complex models and/or datasets (learning rate vs batch size for neural networks is one common example of this).
@@joshuaspeckman7075 Good point!
It would have hurt if you didn't choose 42, so genuinely Thank You!
I'd be pretty suspicious if the result of my random tests looked like a bell curve over some random hyper parameters. I'd start to think that probably any hyper-parameter would do basically the same thing down to some natural variability in the score. I guess we can do hypothesis testing to determine whether our results are significant.
A uniform distribution might also be suspicious. But the main lesson is that you can indeed apply a healthy amount of doubt when looking at these stats. They can give you an optimistic version of reality and it can be very tempting to get fooled by good numbers.
I know it's not practical, but this is one of the reasons I get really crosswise with introducing the concept of scoring to any decision makers higher up the pay ladder. It's SOOOOOO easy for these things to become THE measure of a model instead of A measure. Although that problem goes the other way as well. I've seen a professor fit a battery of models and just pick the highest score... which sort of defeats the purpose of the statistician. Idk how much value is lost in just withholding scores from anyone not trained up in the stats behind them.
Goodhart's law is indeed something to look out for. When a metric becomes a target, it is not a great metric anymore.
Can this be used to define sort of reliability for the model? For a given RF with fixed hyper params, calculate the scores for various random states. Then use the standard deviation to depict the spread in score only due to randomness. The idea being that for a given model and data, a lower spread means that the model is able to model the data similarly in all instances irrespective of any randomness in it. If the spread is high, the model might be too sensitive, or the data needs more engineering.
I might be a bit careful there. By using a random seed you will sample from a distribution that is defined by the hyperparams. Change the hyperparams and you will have another distribution, and I don't know if you can claim anything general about the difference between these two distributions upfront.
@probabl_ai As I said fixed hyper params. I am only updating the random seed. The idea is to establish some sort of likelihood of the model's score distribution just due to chance. So instead of saying that the model score is 0.7 for some specific random seed value, I can say something like model score is 0.7±0.1 where the latter is the std of the scores or that the model scores over 0.6 for 95% of random seed values.
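A minimal sketch of that spread-over-seeds measurement (synthetic data, fixed hyperparameters, only the seed varies):

```python
# Same data, same hyperparameters; only random_state changes between runs.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

scores = [
    cross_val_score(RandomForestClassifier(n_estimators=100, random_state=seed),
                    X, y).mean()
    for seed in range(20)
]
print(f"{np.mean(scores):.3f} ± {np.std(scores):.3f}")  # spread due to randomness alone
```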
Is a random state similar to an initial starting point? So with a bad random state you end up in a local minimum?
I must say the first half left me with an uneasy taste, seeing vectors with "apples and oranges" features being thrown at a distance-based method like KNN, but I was delighted to see you come out strong on the other side and address the issue in the second half. The Ridge trick is pretty slick. 👍 Great video, great channel!
In order to make a point sometimes you first have to show how not to do it, that way it is easier to motivate the alternative route. Happy to hear you like the content!
Thanks for the optuna exploration. Do you have the link to the notebook?
d0h! My bad! Just added a link to the notebook in the shownotes of this video.
This is a common problem I think everyone experiences at some point, and understanding the model as well as having metrics that cover a wide variety of edge cases both seem to resolve this quite well. There are also plenty of strategies to circumvent the issue, such as the cross-validation you showcased, but more "stratified" approaches also exist, such as genetic algorithms or particle swarm optimization. My issue, however, is how to deal with this when you have a limited amount of compute on hand and wish to obtain a good result without having to spend a lot of time testing until you isolate the good hyperparameters from the more noisy ones. Obviously I don't expect a one-size-fits-all solution, but I'd love to hear what solutions or workarounds people use, especially nowadays when models are getting bigger and bigger.
There is certainly a balance there yeah. Not everyone has unlimited compute. The simplest answer is to just try and remain pragmatic. Make sure you invest a little bit in visuals and don't trust your numbers blindly. Really think about what metric matters and really try to think about what problem needs to be solved in reality.
For deep models, initializing weights sampled from a relatively small-variance Gaussian distribution has been shown to give faster convergence. Andrej Karpathy doesn't touch on it in his making-GPT video, but if you go to the GitHub code you can see the change. Also, adding a weight-size penalty to the loss can encourage the model to come up with more general parameters, but this can be very delayed (grokking). I have seen several gradient and double-descent methods that basically "pick up the signal" early, though. Remember, for nontrivial tasks and good architecture this is more icing on the cake.
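A tiny numpy sketch of the fan-in scaling being referenced (illustrative constants, not the GPT code mentioned above):

```python
# Weights drawn from a Gaussian whose variance shrinks with fan-in keep
# activation magnitudes roughly stable; unit-variance init blows them up.
import numpy as np

rng = np.random.default_rng(0)
fan_in, fan_out = 512, 256

x = rng.normal(size=(32, fan_in))                       # a batch of unit-scale inputs
w_naive = rng.normal(0, 1.0, size=(fan_in, fan_out))    # unit-variance init
w_scaled = rng.normal(0, 1.0 / np.sqrt(fan_in), size=(fan_in, fan_out))

print((x @ w_naive).std())    # roughly sqrt(fan_in) ~ 22: activations explode
print((x @ w_scaled).std())   # roughly 1: activations stay near the input scale
```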