#037
- Published 2 Jun 2024
- Connor Tann is a physicist and senior data scientist working for a multinational energy company, where he co-founded and leads a data science team. He holds a first-class degree in experimental and theoretical physics from Cambridge University and a master's in particle astrophysics. He specializes in the application of machine learning models and Bayesian methods. Today we explore the history, practical utility, and unique capabilities of Bayesian methods. We also discuss the computational difficulties inherent in Bayesian methods, along with modern methods for approximate solutions such as Markov chain Monte Carlo. Finally, we discuss how Bayesian optimization in the context of AutoML may one day put data scientists like Connor out of work.
Panel: Dr. Keith Duggar, Alex Stenlake, Dr. Tim Scarfe
00:00:00 Duggar's philosophical ramblings on Bayesianism
00:05:10 Introduction
00:07:30 small datasets and prior scientific knowledge
00:10:37 Bayesian methods are probability theory
00:14:00 Bayesian methods demand hard computations
00:15:46 uncertainty can matter more than estimators
00:19:29 updating or combining knowledge is a key feature
00:25:39 Frequency or Reasonable Expectation as the Primary Concept
00:30:02 Gambling and coin flips
00:37:32 Rev. Thomas Bayes's pool table
00:40:37 ignorance priors are beautiful yet hard
00:43:49 connections between common distributions
00:49:13 A curious Universe, Benford's Law
00:55:17 choosing priors, a tale of two factories
01:02:19 integration, the computational Achilles heel
01:05:25 Bayesian social context in the ML community
01:10:24 frequentist methods as a first approximation
01:13:13 driven to Bayesian methods by small sample size
01:18:46 Bayesian optimization with automl, a job killer?
01:25:28 different approaches to hyper-parameter optimization
01:30:18 advice for aspiring Bayesians
01:33:59 who would Connor interview next?
Connor Tann: / connor-tann-a92906a1
/ connossor
Pod version: anchor.fm/machinelearningstre...
this podcast concept is so nice! Damn.
ASAP Ferg for a ML podcast intro. Genius move
Ride with the mob, Alhamdulillah
I finished the episode. I thought you focused a bit too much on the philosophical advantages of Bayes. The more practical subjects you touched on were discussed in a way that is a bit disconnected from practice nowadays (such as conjugacy, flat priors, or certain versions of MCMC). And a lot of today's practices are not fully compatible with the supposed philosophical advantages (the idea that you should check your model and iterate on it goes against the pristine idea that "prior + likelihood = all that you can know about a data set").
Bring on Andrew Gelman (you talked briefly about him). He is very interesting to talk to. Very philosophical and practical at the same time.
Hiphop intro song for applied bayes discussion episode, first thing, "let's talk about socrates and eikos" LOVE IT
This is what I need about bayesian especially for my master. Thank you ❣️
Regarding hyperparameter optimization, there is a BOHB algo that really nicely combines Bayesian optimization with successive halving from bandit methods. Great episode.
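A minimal sketch of the successive-halving half of that combination (the objective and budgets here are made up for illustration; real BOHB proposes configurations from a Bayesian density model rather than taking a fixed candidate list):

```python
def successive_halving(configs, evaluate, budget=1, eta=2):
    """Keep the best 1/eta of the candidates each round, multiplying the
    per-candidate budget by eta, until a single configuration remains."""
    while len(configs) > 1:
        scored = [(evaluate(c, budget), c) for c in configs]
        scored.sort(reverse=True)                       # higher score = better
        configs = [c for _, c in scored[: max(1, len(configs) // eta)]]
        budget *= eta
    return configs[0]

# Toy objective: a hypothetical learning-rate search whose score improves
# with budget and peaks at lr = 0.1.
def evaluate(lr, budget):
    return -abs(lr - 0.1) + 0.01 * budget

best = successive_halving([0.001, 0.01, 0.1, 0.5], evaluate)
```

The bandit flavor comes from spending little budget on many candidates early and most of the budget on the few survivors.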
Wow this episode is so rich. That's such useful content. I'm not a Data Scientist/ML Engineer, but even for me it helps a lot.
Loving the host-specific intro edit. Looking forward to the episode as always.
Cheers Daven!! Duggar is a pro!
@@MachineLearningStreetTalk Umm ... I'm working on it ;-) and will improve. Keeping up with Tim is a seriously tall order. As for DavenH, thank you very much!
Badass intro music!
Seriously so much to think about with this one goddamn what a good podcast
Very interesting! Great work! 😍
Second channel that I clicked the bell Icon on :)
Thank you for your support, Teymur! That certainly puts us in a tiny elite group and one step closer to the sub big leagues ;-)
damn this is amazing, enjoyed the ride.
Damn! This ❤
In the chapter on Gambling and coin flips: I had the same problem some time ago and needed a tool to determine whether two phenomena come from the same distribution, where the only parameter was the probability of heads. So I had to derive what you mentioned for calculating the true probability of heads, with the complex integration and the confidence measures. And I always wondered why this is never discussed. Is there any book that covers this kind of analysis?
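For the single-parameter coin case the commenter describes, one setup worth noting is the conjugate one: with a Beta(a, b) prior on the heads probability, the posterior after h heads and t tails is Beta(a + h, b + t), so the integration has a closed form. A hedged sketch (the counts below are made up):

```python
def posterior_mean(h, t, a=1.0, b=1.0):
    """Posterior mean of the heads probability p under a Beta(a, b) prior
    (uniform by default), after observing h heads and t tails.
    Posterior is Beta(a + h, b + t) by conjugacy."""
    return (a + h) / (a + b + h + t)

# Hypothetical data: 7 heads, 3 tails, uniform prior.
print(posterior_mean(7, 3))  # (1 + 7) / (2 + 10) = 0.666...
```

Comparing two coins (the "same distribution?" question) no longer has this tidy closed form in general, which is where the complex integration the comment mentions comes back in.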
After Zeroth!! So technically second 😛😛🤘
Is it possible that Bayesian methods could be applied to NLU problems?
As NLU problems are not truly solvable by statistical methods like deep NNs.
Another great street talk, and as evidence I have a half page of new notes on the Bayes page in my ML wiki. Thanks so much.
So not coming from a math or stats background, I didn't follow where these infinite, unsolvable integrals come from. "To conduct marginalization" -- correct me if I'm mistaken but I believe this means the summing over certain dimensions of a multivariate probability density function, arriving at a marginal distribution, so as to make predictions when you ONLY have the marginal variables. With discrete variables this summation seems almost trivial. So is it only a computational problem when performing this operation analytically, trying to find a closed-form solution?
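For the discrete case the question describes, marginalization really is just a sum over an axis; a toy joint distribution (values made up) makes that concrete:

```python
# Joint pmf P(X, Y) over X in {0, 1} and Y in {0, 1, 2}; rows index X.
joint = [[0.10, 0.20, 0.10],
         [0.15, 0.25, 0.20]]

# Marginal P(X): sum out Y along each row.
p_x = [sum(row) for row in joint]        # [0.40, 0.60]

# Marginal P(Y): sum out X down each column.
p_y = [sum(col) for col in zip(*joint)]  # [0.25, 0.45, 0.30]
```

The trouble starts when the variables are continuous and high-dimensional: the sums become integrals with no closed form, so even writing down the marginal analytically can be impossible.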
DavenH that's great that you were able to pull more notes for your wiki! Regarding unsolvable integrals, if one of us literally said "infinite" we were probably exaggerating with poetic license for effect. There probably are scenarios where one might want to integrate over infinite dimensionality, but we can understand the difficulty without going to that extreme ;-)
Where do these integrals arise? You are correct that they arise from marginalization to eliminate nuisance parameters, to compare models by integrating out all free parameters, to normalize distributions, to reduce dimensionality, to generate statistical summaries, and other purposes.
Why are they hard? Well, first off, naive integration, say by grid sampling, grows in complexity exponentially as K^N, where K is the number of samples per dimension and N is the number of dimensions. This is simply a consequence of the search volume growing exponentially. With complex functions, K is often quite large in practice, hundreds or even thousands of samples, which makes for a nasty exponential growth, and definitely not a schoolbook 2^N growth, which is gentle by contrast.
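To put illustrative numbers on that K^N growth (K and N below are arbitrary choices, not from the episode):

```python
# Naive grid integration: K samples per dimension, N dimensions.
K, N = 100, 10
grid_points = K ** N      # 10^20 evaluations -- hopeless even at modest N
schoolbook = 2 ** N       # 1024 -- the "gentle" 2^N growth by contrast
print(grid_points, schoolbook)
```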
Of course we try to be much smarter than simple grid sampling and deploy a variety of tricks to reduce that required sample count in aggregate. Unfortunately, those tricks can rapidly break down in higher dimensions. To get an intuition for why the tricks break down, think of sampling f(x) as gaining information about function f in a small neighborhood around point x. A good concept of a small neighborhood around x would be the volume within a certain distance, r, from x. In other words, we learn information about f in a ball of radius r centered at x. Now the smallest domain of x that contains that ball is obviously a hypercube with sides of length 2r.
So what portion of that domain is covered by the ball? Let's calculate. For one dimension a "ball" is just a line segment of radius r with "volume" (length) 2r, and likewise the volume of a 1-d hypercube is also 2r. So a single sample covers 100% of the local domain around x. In two dimensions the volume of the ball (a disk) is pi*r^2 while the volume of the hypercube is 4r^2, so we cover 3.14159/4 or 79% of the local domain. In 3-d it's (4/3)pi*r^3 / 8r^3, or 52%. You can guess where this is going; the higher the dimension, the less a single sample tells us, even compared to the small local hypercube around the sample point! In fact, given that the volume of an n-ball of radius r is pi^(n/2)/Gamma[n/2+1]*r^n and the volume of an n-cube of side 2r is (2r)^n, the ratio n-ball/n-cube approaches zero *very* fast. At just 10 dimensions each point is covering less than 1% of the volume.
To recap, as the dimensionality increases our search volume grows exponentially and each sample tells us super-exponentially less. Ouch! In that sense one can roughly say that integration becomes "infinitely" hard since we are rapidly approaching a divide by zero number of samples ;-)
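The ball-to-cube ratio above is easy to verify numerically; this is a direct transcription of the formulas in the comment, nothing new assumed:

```python
from math import pi, gamma

def ball_to_cube_ratio(n, r=1.0):
    """Volume of an n-ball of radius r divided by the volume of the
    enclosing n-cube of side 2r."""
    ball = pi ** (n / 2) / gamma(n / 2 + 1) * r ** n
    cube = (2 * r) ** n
    return ball / cube

for n in (1, 2, 3, 10):
    print(n, round(ball_to_cube_ratio(n), 4))
# 1 -> 1.0, 2 -> 0.7854 (pi/4), 3 -> 0.5236, 10 -> 0.0025 (under 1%)
```

Note the ratio is independent of r, which is why the argument holds at any neighborhood scale.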
Cheers!
@@nomenec Keith, thanks a million for taking the time to spell this out so unambiguously. I get it, the curse of dimensionality subsumes the problem rapidly. I had no idea that such Bayesian methods were used on problems with hundreds of dimensions; but that's a sampling bias on my end. Textbooks and blogs only ever seem to use at most 3 easily visualized dimensions. Cheers!
Integrations? :D Are you sure the exactness of the integral is necessitated by the uncertainty?
Zeroth
no mention of kolmogorov in the intro lul
Kolmogorov is for frequentists ;-). Bayesians rely on a more general concept of *conditional* probability first axiomatized by Richard Cox (1,2,3,4). Cox's axioms allow one to prove conditional probability theory as a theoretical result. In a Kolmogorov framework one must instead assume/introduce conditional probability as another postulate/definition (5). Here are Cox's original postulates for those interested (though the paper is a great read even for just the first section):
Let b|a denote some measure of the reasonable credibility of the proposition b when the proposition a is known to be true, let . denote proposition conjunction, and let ~ denote proposition negation.
I) c.b|a = F(c|b.a,b|a) where F is some function of two variables
II) ~b|a = S(b|a) where S is some function of a single variable
From those he derives a generalized family of rules of conditional probability where if we assume, by convention, that certainty is represented by the real number 1 and impossibility by 0 (and maybe another convention as well such as the lowest order polynomial function) then we get the conventional laws of conditional probability.
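To make that concrete, here is the conventional instantiation of Cox's functional forms, F as multiplication and S(x) = 1 - x, checked on arbitrary made-up credibility values (this is one member of the generalized family, not a derivation of it):

```python
# Conventional probability as one instance of Cox's functional forms:
#   I)  c.b|a = F(c|b.a, b|a)  with F(x, y) = x * y  (product rule)
#   II) ~b|a  = S(b|a)         with S(x)    = 1 - x  (negation rule)
F = lambda x, y: x * y
S = lambda x: 1 - x

b_given_a = 0.6                          # b|a   (arbitrary illustrative value)
c_given_ba = 0.5                         # c|b.a (arbitrary illustrative value)
cb_given_a = F(c_given_ba, b_given_a)    # c.b|a = 0.3
not_b_given_a = S(b_given_a)             # ~b|a  = 0.4

# Certainty maps to 1 and impossibility to 0, per the convention above.
assert F(1.0, 0.7) == 0.7   # conjoining with a certainty changes nothing
assert S(1.0) == 0.0        # the negation of certainty is impossibility
```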
(1) jimbeck.caltech.edu/summerlectures/references/ProbabilityFrequencyReasonableExpectation.pdf
(2) stats.stackexchange.com/questions/126056/do-bayesians-accept-kolmogorovs-axioms
(3) en.wikipedia.org/wiki/Cox%27s_theorem
(4) en.wikipedia.org/wiki/Probability_axioms
(5) en.wikipedia.org/wiki/Conditional_probability