This is awesome! I was not able to hold on to my papers.
It's interesting that nobody thought of accuracy as a function of both skill and difficulty before.
So, to sum it up: "Better models will struggle less on harder test sets."
I'd call this statement "the difficulty bias". I think this work does not prove that overfitting never occurs on ImageNet. But it does show that the difficulty bias is a stronger effect than the overfitting bias. So if overfitting to the ImageNet test set does occur, it's probably not a particularly strong effect.
I agree this work doesn't *prove* overfitting doesn't happen, but this work plus a few other related works imply adaptive overfitting isn't a *huge* issue in ML.
1. papers.nips.cc/paper/9117-a-meta-analysis-of-overfitting-in-machine-learning
2. papers.nips.cc/paper/9190-model-similarity-mitigates-test-set-overuse
True, I just find it's generally not what anyone would have expected.
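To make the "accuracy as a function of skill and difficulty" framing above concrete, here is a minimal toy sketch (not the paper's fitted model; the skill and difficulty numbers are made up) using a simple probit-style formulation. Under this kind of model, the better model loses fewer raw accuracy points when the test set gets harder, which is the "difficulty bias" reading.

```python
from scipy.stats import norm

def expected_accuracy(skill, difficulty):
    # Probit-style toy model: probability of a correct top-1 prediction
    # rises with model skill and falls with test-set difficulty.
    return norm.cdf(skill - difficulty)

# Hypothetical skill/difficulty values, purely illustrative.
skills = {"weaker model": 0.5, "stronger model": 1.5}
easy, hard = 0.0, 0.4   # original test set vs. a harder new test set

for name, s in skills.items():
    drop = expected_accuracy(s, easy) - expected_accuracy(s, hard)
    print(f"{name}: accuracy drop on the harder set = {drop:.3f}")
# The stronger model shows the smaller drop, i.e. it "struggles less".
```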
Strange plots in Fig. 1 @ 5:00: Why did they not use the same axis scaling for new and original accuracy? The X/Y ranges are so similar that a non-skewed projection would have been no problem at all.
This was simply done for aesthetic reasons. Using the same axis scaling produces a lot of white space.
@Vaishaal Surely, can't have that in a 72-page document!
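For anyone who wants to see the trade-off, here is a quick matplotlib sketch (the accuracy values are placeholders, not numbers from the paper) comparing equal axis limits with independent ones for this kind of original-vs-new accuracy scatter plot.

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder accuracies, roughly in the range of such plots; not from the paper.
orig_acc = np.array([63, 68, 72, 76, 79, 83])   # original test accuracy (%)
new_acc  = np.array([51, 56, 61, 65, 69, 74])   # new test set accuracy (%)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Left: shared axis limits -- directly comparable, but leaves white space
# because the new accuracies sit uniformly below the originals.
ax1.scatter(orig_acc, new_acc)
ax1.plot([40, 90], [40, 90], linestyle="--")     # y = x reference line
ax1.set_xlim(40, 90); ax1.set_ylim(40, 90)
ax1.set_xlabel("original accuracy (%)"); ax1.set_ylabel("new accuracy (%)")
ax1.set_title("equal axis limits")

# Right: independent limits, as in the figure -- tighter, but the slope of
# the relationship is visually distorted.
ax2.scatter(orig_acc, new_acc)
ax2.set_xlabel("original accuracy (%)"); ax2.set_ylabel("new accuracy (%)")
ax2.set_title("independent axis limits")

plt.tight_layout()
plt.show()
```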
Wow, thanks for putting it so succinctly, saved me so much time.
Shouldn't test sets v1 and v2 be indistinguishable?
I wonder if a third set produced by 50/50 randomly selecting instances from each set would fall half-way between the 2 linear relations.
Yes, but that is obvious. Split the data points in the top-1 error into two sums (one per dataset) and you see that you are just averaging the two error rates!
Yes this is exactly what would happen.
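To make the averaging argument concrete, a tiny sketch (with hypothetical per-image results and made-up error rates) showing that a 50/50 mix of the two test sets just gives the mean of the two top-1 error rates:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-image correctness for one model on each test set
# (True = correct top-1 prediction); the error rates are made up.
v1 = rng.random(10_000) < 0.76   # ~24% top-1 error on the original set
v2 = rng.random(10_000) < 0.64   # ~36% top-1 error on the new set

# Build a mixed set by drawing the same number of images from each set.
n = 5_000
mixed = np.concatenate([v1[:n], v2[:n]])

err = lambda correct: 1.0 - correct.mean()
print(err(v1), err(v2), err(mixed))
# The mixed error equals (err_v1 + err_v2) / 2 up to sampling noise, because
# the top-1 error splits into two equally weighted sums, one per source set.
```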
So can one say that transfer learning is here to stay, or is overfitting to the ImageNet dataset still a possibility?
Probably we're still not overfitting
Hi, loved this content. But at least half of the base architecture resembles Tacotron 2. Could you please make a detailed video on the Tacotron 2 architecture? Thanks in advance.
The super-holdout seems like a good idea, if it isn't too costly. I hope people start doing that.
Great summary. Wouldn't an easy and revealing experiment here be training a binary classifier to discriminate between the old and new test sets?
They are doing this in the paper appendix, they reach about 53% accuracy or so.
@YannicKilcher That's just guessing, at that point.
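For reference, that appendix experiment is essentially a linear probe over the pooled images. Here is a hedged sketch with scikit-learn, assuming precomputed image features; the file names and the use of features from a pretrained network are my assumptions, not the paper's exact setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical precomputed features (e.g. from a pretrained CNN) for images
# from the original test set and the new v2 test set. Paths are placeholders.
feats_v1 = np.load("features_imagenet_v1.npy")   # shape (n1, d)
feats_v2 = np.load("features_imagenet_v2.npy")   # shape (n2, d)

X = np.concatenate([feats_v1, feats_v2])
y = np.concatenate([np.zeros(len(feats_v1)), np.ones(len(feats_v2))])

# Train a linear classifier to tell the two test sets apart; accuracy near
# 50% means they are hard to distinguish (the paper reports roughly 53%).
clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, X, y, cv=5)
print("discriminator accuracy:", scores.mean())
```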
Is there anything stopping cheating researchers from training on the (original) test set itself, to get more clout for models that perform well? I mean, even a new test set like this would not reveal such cheating, if the underlying model is at least decent, because the cheater basically had a bigger dataset to work with, which should lead to better generalization to the V2 test set.
Awesome.. Thank you!!!
They should've "calibrated it" (by throwing away images) on some older models to make sure the scores match FIRST, and only THEN done their comparison! Not an expert, but I can't see why this wasn't done.
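A naive version of the suggested calibration could look like the sketch below. This is only an illustration of the comment's idea, not anything the paper does: the reference model, its per-image results, and the accuracy numbers are all hypothetical.

```python
import numpy as np

def calibrate_by_dropping(correct_v2: np.ndarray, target_acc: float) -> np.ndarray:
    """Throw away images the reference model gets wrong on the new test set
    until its accuracy on the remaining images matches its original-test
    accuracy. Returns a boolean mask of images to keep."""
    keep = np.ones(len(correct_v2), dtype=bool)
    wrong_idx = np.flatnonzero(~correct_v2)
    np.random.shuffle(wrong_idx)
    for i in wrong_idx:
        if correct_v2[keep].mean() >= target_acc:
            break
        keep[i] = False          # drop one image the reference model missed
    return keep

# Hypothetical usage: a reference model with 76% original-test accuracy and
# made-up per-image correctness on the new set (~64% accuracy).
correct_v2 = np.random.default_rng(0).random(10_000) < 0.64
mask = calibrate_by_dropping(correct_v2, target_acc=0.76)
print("kept", mask.sum(), "images; calibrated accuracy:", correct_v2[mask].mean())
```

One obvious caveat with this scheme: it only removes images the chosen reference model fails on, so the calibrated set ends up tilted toward that model's strengths.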