Don't leave your data science career to chance. Sign up for Exponent's Data Science interview course today: bit.ly/3IO5o2I
There are other motivations for squared-error loss. For one, it is the maximum-likelihood estimator under a model that assumes independent Gaussian errors. For another, squared-error loss has an exact algebraic solution.
That is not true. The existence of an exact closed-form solution depends on the hypothesis/prediction function, not just the cost function. E.g. MSE for linear regression has a closed-form solution, whereas MSE for logistic regression gives a non-convex objective with no closed form.
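For anyone who wants the maximum-likelihood argument spelled out, here is a minimal sketch, assuming i.i.d. Gaussian errors with fixed variance sigma^2 (f_theta below is just shorthand for whatever prediction function is being fit):

\ell(\theta) = \sum_{i=1}^{n} \log\!\left[ \frac{1}{\sqrt{2\pi\sigma^{2}}} \exp\!\left( -\frac{(y_i - f_\theta(x_i))^{2}}{2\sigma^{2}} \right) \right] = -\frac{n}{2}\log(2\pi\sigma^{2}) - \frac{1}{2\sigma^{2}} \sum_{i=1}^{n} \bigl(y_i - f_\theta(x_i)\bigr)^{2}

Maximizing \ell(\theta) over \theta is therefore the same as minimizing \sum_i (y_i - f_\theta(x_i))^2, i.e. the squared-error loss; the variance only rescales the objective.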
*Squared* loss is used because quadratic functions are differentiable everywhere and the absolute value function is not (at zero), not because it penalizes outliers more. Penalizing outliers heavily is usually considered a downside of OLS, not a benefit or a "reason" for it.
Also, did he just say that linear regression parameters are estimated using gradient descent?
Agreed. MSE penalizes outliers much more than MAE; the former is much more sensitive to outliers. I believe the nuance here is that MSE penalizes large errors (which is very different from saying it penalizes outliers).
@edwardyudolevich3352 The parameters of linear regression can be estimated in several ways; one of them is gradient descent. Gradient descent is nice because it is very general and can be used to estimate the parameters of many other ML algorithms.
@edwardyudolevich3352 Typical silly computer science answer -- they never actually learn math; CS degrees gave up on that 30+ years ago.
@edwardyudolevich3352 Whether penalizing outliers is good or bad depends on the application. It's not true that it's always good or always bad; it is just one feature of MSE that differentiates it from MAE.
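To make the MSE-vs-MAE sensitivity concrete, here is a small Python sketch (the toy data and variable names are made up): the squared-error fit gets dragged toward a single outlier, while the absolute-error fit barely moves.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 30)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, size=x.size)
y[-1] += 30.0                                  # one large outlier

# Least-squares (MSE) fit: pulled toward the outlier.
m_mse, b_mse = np.polyfit(x, y, deg=1)

# Least-absolute-deviations (MAE) fit: far less affected by it.
def mae(params):
    m, b = params
    return np.abs(y - (m * x + b)).mean()

m_mae, b_mae = minimize(mae, x0=[0.0, 0.0], method="Nelder-Mead").x

print(f"MSE fit: slope={m_mse:.2f}, intercept={b_mse:.2f}")
print(f"MAE fit: slope={m_mae:.2f}, intercept={b_mae:.2f}")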
Some more detail as to why we square the residuals: the squaring function is smooth and differentiable, which allows us to use optimization methods such as gradient descent to find the best-fitting line. Another reason is that it leads to a convex loss surface, and convex surfaces have a single global minimum and no local minima, which simplifies things a lot. Great video!
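As a rough illustration of that point, here is a bare-bones gradient descent on the MSE surface for a one-feature linear model (toy data; the learning rate and iteration count are arbitrary choices):

import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 5, size=100)
y = 3.0 * x - 2.0 + rng.normal(0, 0.3, size=x.size)

m, b = 0.0, 0.0          # initial slope and intercept
lr = 0.02                # learning rate
for _ in range(5000):
    err = (m * x + b) - y
    # Gradients of the mean squared error with respect to m and b.
    grad_m = 2.0 * np.mean(err * x)
    grad_b = 2.0 * np.mean(err)
    m -= lr * grad_m
    b -= lr * grad_b

print(f"estimated slope={m:.2f}, intercept={b:.2f}")   # roughly 3 and -2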
At 16:10, it is said that independence of a collection of random variables is equivalent to pairwise independence, but pairwise independence generally does not imply joint independence. Joint independence implies pairwise independence, but the converse is not true.
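The standard counterexample, in case it helps: let X and Y be independent fair coin flips taking values in {0, 1}, and let Z = X XOR Y. Then any two of X, Y, Z are independent, but the three are not jointly independent, since Z is completely determined by X and Y; for instance P(X=1, Y=1, Z=1) = 0, while P(X=1) P(Y=1) P(Z=1) = 1/8.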
Wow, this is amazing. It doesn't seem like an interview at all, more like two colleagues having an intellectual conversation on a topic I don't know shit about lol!
I would not call Y a "label", especially because it is continuous (a real-valued Y). Otherwise it gets a bit confusing with logistic regression.
Wow, awesome, we want some more like this.
A real question: what's the use, even for an ML engineer, of knowing in such mathematical detail how the algorithms work? As long as you know the intuition, the assumptions, and in which cases you can or cannot use them. You'll never have to build these algorithms from scratch; there will always be a library around. This question is of course not valid if you are a researcher working on developing new AI algorithms.
An SDE need not really understand OS, DBMS, etc., because it doesn't help in 99% of day-to-day use cases, but in that 1% of cases where shit hits the fan, a 10x developer who understands the minute intricacies of the whole tech stack is absolutely essential to save the day. I think the same principle applies here.
this is not detail, this is a gross overview
Background Clarification: I am an MS CS candidate preparing for such interviews
The main reason to know the math is that ML is VERY VERY DIFFERENT from development.
I have 2 years of development experience at a reputable bank, and I can tell you that the steps in a normal development process change about 10 to 20% in most cases. If you know how to create one type of report, you know how to create most reports. If you know how to solve certain "patterns" of problems, you know how to solve most of the problems belonging to them. I am not only talking about Leetcode but also about many applications. Unless I need special optimizations in performance/cost etc., I can use the same API or library irrespective of the use case. That is the very reason they exist.
Coming to ML and stats: you only know sample data points. You never know if the underlying distribution assumptions were true. Let me give an example: if the residuals in the video example are normal, we get MSE loss, but if they were assumed to be uniformly distributed, then we would have to find the smallest parallelogram containing all the points (forget gradient descent, this is convex hulls from DP and CLRS!). If they were exponentially distributed (earthquakes etc.) or binomially distributed (maybe student scores), again a different MLE algorithm would be needed, and different MAP algorithms and regularization too. The fact that outliers screw up linear regression is essentially because they break/weaken the normal-error assumption (the normal distribution dies down at the tails).
Besides this, imagine someone told you, "Oh! This data has some time-series features, like the price of a good or the likes on a video in the previous month." Then bang! You broke the i.i.d. assumption. Now correct for it.
Finally, if this wasn't enough: if you have few data points you can use the closed-form equation; with too many, L-BFGS (faster than gradient descent); and if you have too many features and the Hessians are too large, then gradient descent is the only savior. (Oh, I forgot: did you know a solution might not even exist? Try linear regression with two columns having the same or very close values; see the sketch after this comment.)
Now remember, all of this is needed after you already have enough features to make the relationship between y and x linear.
The main problem is that libraries like sklearn don't even address these issues (statsmodels does for many cases, but not all).
Even after this, you need to test for multicollinearity, otherwise you won't know which features aren't telling you anything extra. Test the p-values for the coefficients and make sure they are normally distributed.
For many CS students, MSE loss and linear regression are like a hammer, and we have been treating everything as if it were a nail! Bang! Bang! Bang! Causing the Zillow crash and other issues.
Did you ever see something like this while making your second website?? At least I never did😢😢
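A quick Python sketch of the near-duplicate-column problem mentioned above (toy data; everything here is made up for illustration): X^T X becomes ill-conditioned, the naive normal equation is numerically unreliable, and you need a pseudoinverse-based solver or regularization instead. The VIF check at the end is one way a multicollinearity test looks in statsmodels.

import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor  # only for the VIF check

rng = np.random.default_rng(2)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=1e-8, size=n)        # nearly identical column
X = np.column_stack([np.ones(n), x1, x2])
y = 1.0 + 2.0 * x1 + rng.normal(scale=0.1, size=n)

XtX = X.T @ X
print(f"condition number of X^T X: {np.linalg.cond(XtX):.2e}")   # astronomically large

# Naive normal equation: numerically unreliable here.
beta_naive = np.linalg.solve(XtX, X.T @ y)

# Pseudoinverse-based least squares: returns the minimum-norm solution.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# Ridge (L2) regularization: makes the problem well-posed again.
lam = 1e-3
beta_ridge = np.linalg.solve(XtX + lam * np.eye(X.shape[1]), X.T @ y)

print("naive :", np.round(beta_naive, 2))
print("lstsq :", np.round(beta_lstsq, 2))
print("ridge :", np.round(beta_ridge, 2))

# Huge variance inflation factors flag the collinear pair of features.
print("VIFs:", [variance_inflation_factor(X, i) for i in range(1, X.shape[1])])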
This is one undergrad course depth lmao, what are you talking about
Very good~ Thank you!!
Really good job!
Only question is: how close is this type of mock interview to the real one?
Hey allenliu107! While there might be variations for different companies and seniorities, this is generally accurate for a data science / machine learning technical question interview. For this type of interview, you'll need to know your algorithms and models well because it'll be a deep dive rather than a touch-and-go on the concepts e.g. "How do you select the value of 'k' in the k-means algorithm?"
Andrew Ng
Nice!
Fantastic interview, learned so much! More of this please.
Glad you enjoyed it!
FWIW the closed-form solution is m = (X^T X)^{-1} X^T y, but you're close.
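A tiny numpy sketch of that normal equation on made-up data, checked against the library least-squares solver:

import numpy as np

rng = np.random.default_rng(3)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])   # intercept column + 2 features
y = X @ np.array([1.0, 2.0, -3.0]) + rng.normal(scale=0.1, size=50)

# Normal equation: m = (X^T X)^{-1} X^T y
m_closed = np.linalg.inv(X.T @ X) @ (X.T @ y)

# Same answer from numpy's least-squares solver.
m_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(m_closed, m_lstsq))   # True (for a well-conditioned X)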
nice