@@Blutania It's standard for videos to be uploaded to YouTube some time before they go live to everyone, so the uploader and, not infrequently, also patrons, channel members, or other privileged people who get given a link to the still private video can comment on it before it's published.
It's likely they used that method, but a less mathematical way of doing it is permissible. For one, you're going to be sending spies and spy planes to bases, storage yards and depots, and since a lot of these things are big and out in the open, and since you can only form as many tank units as you have tanks for, you likely have a decent count of the number of units they have at a given time. If you know the enemy's serial numbering system, then the rise in the serial numbers on captured equipment over time will tell you their production rate: if last month the highest serial numbers were in the low 1500s but now they are in the upper 1700s, it doesn't take a math PhD to figure out you're looking at about 270 tanks. Also, since the serial numbers encode number, date and location, they tell you something more important: the lag in their logistical system. If you know how long it takes the enemy to make and move stuff, you can predict movements and actions to some degree.
I've just simulated this operation 10,000 times, for a number of tanks less than 100 and a number of guesses between 10 and 50. Using the maximum value as the total number of tanks gives an average error of 4, while using the maximum value + average gap leaves an average error of 2.6. That method is roughly 1.5 times as precise!! Amazing!!!
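For anyone who wants to repeat that kind of experiment, here is a minimal sketch of such a simulation. The exact setup of the comment above is not known, so the ranges (50 to 99 tanks, 10 to 50 observations), the seed, and the function name `simulate` are all assumptions made for illustration, and the averages it prints will not necessarily match the 4 and 2.6 quoted.

```python
import random


def simulate(trials=10_000, seed=0):
    """Return the average absolute error of two estimators of the tank count.

    Assumed setup (not taken from the original comment): the true count is
    drawn from 50..99 and the number of observed serials from 10..50.
    """
    rng = random.Random(seed)
    err_max = 0.0
    err_gap = 0.0
    for _ in range(trials):
        n_tanks = rng.randint(50, 99)
        k = rng.randint(10, 50)
        # Observe k distinct serial numbers out of 1..n_tanks.
        sample = rng.sample(range(1, n_tanks + 1), k)
        m = max(sample)
        est_max = m                  # naive guess: the largest serial seen
        est_gap = m + (m - k) / k    # maximum plus average gap
        err_max += abs(est_max - n_tanks)
        err_gap += abs(est_gap - n_tanks)
    return err_max / trials, err_gap / trials


if __name__ == "__main__":
    avg_max, avg_gap = simulate()
    print(f"max only:          {avg_max:.2f}")
    print(f"max + average gap: {avg_gap:.2f}")
```

The serials are drawn without replacement because a given tank can only be captured once; drawing with replacement would change the behaviour of both estimators.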
@@vitriolicAmaranth this is exactly where my mind went. I feel like it should almost be equivalent to mean gaps, but I probably just haven't thought hard enough
I know it's irrelevant, but there's the old joke about letting three sheep loose in a field, but first labelling them "1" "2" and "4" so the person rounding them up spends ages looking for the 3rd.
It would also make sense in this case since the Germans wanted to give the appearance that they were building more tanks than they actually were. As such they could have skipped a couple of numbers in their serials. But I guess it would create too much chaos for the German mind to handle. xD
MI5 officer Peter Wright wrote in his book Spycatcher that MI5 bugged the Soviet embassy in Ottawa. MI5 marked all the listening cables with numbers from 1 upward, but in case the Soviets found these cables, MI5 omitted some numbers, hoping that the Soviets would almost have to tear down the embassy in order to find the missing ones. (But the trick did not work, since the Soviets had a spy within MI5 informing them how many cables there were and what numbers they had. So the Soviets never searched for the omitted numbers.)
@@SwedishNeo "New orders from Berlin: We are to skip a few serial numbers when imprinting parts, so our tank production looks bigger than it is..." - "But zat will bring dizorder to mein numbers!"
For several of my clients we incremented the serial number by some prime, rather than one, in order to obfuscate the output somewhat. It also gave us some degree of parity checking on serial numbers later. Silly, really, but fun.
I would use a hash function. A secret number placed after the normal serial number, and then hash it and then use it as official serial number. Then every unit has its own official serial number, you have the secret and you can look it up, what the real serial number was and nobody is able to guess any different valid number. Even if he knows every number (except the secret) and your algorithm to create them.
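A rough sketch of that idea follows. Instead of literally appending the secret to the serial and hashing the result, it uses an HMAC, which is the standard keyed-hash construction in Python's library; that is a swapped-in technique, not what the comment describes word for word. The key, the 12-character truncation, and the names `public_serial` and `build_lookup` are hypothetical choices for illustration.

```python
import hashlib
import hmac

SECRET = b"factory-secret"  # hypothetical key; a real one would be random and private


def public_serial(internal: int, secret: bytes = SECRET, digits: int = 12) -> str:
    """Derive an opaque public serial from the sequential internal number."""
    mac = hmac.new(secret, str(internal).encode(), hashlib.sha256).hexdigest()
    # Truncated for readability; a shorter code means a higher collision risk.
    return mac[:digits].upper()


def build_lookup(count: int, secret: bytes = SECRET) -> dict:
    """Map every issued public serial back to its internal number."""
    return {public_serial(i, secret): i for i in range(1, count + 1)}


if __name__ == "__main__":
    lookup = build_lookup(1000)
    code = public_serial(42)
    print(code, "->", lookup[code])
```

The manufacturer keeps the lookup table (or just the key and the internal counter), while an outsider who sees many public serials learns nothing about how many units exist.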
@@AndreyCizov You would need to see a few machines bought at the same time. You cannot trust that all numbers will be used. Often there is a gap when an updated version is introduced. And if the prime is changed at that time, have fun reverse engineering the prime... There will be enough people who are able to spot it or even reverse engineer it, but that number shouldn't be that great (it depends on the batch size, the amount sold to individual customers, and the price: the more expensive the product, the greater the reward for getting something free under warranty).
@@IDNeon357 lol what? First of all he said they deciphered the coding they were using, but also how would you know this? And why do you say it like it's established fact when it clearly isn't?
My first instinct was to use the Central Limit Theorem to assume that the sample mean would approximately equal the population mean. Since we know the distribution is uniform and the population mean of a population of size n is (n+1)/2, twice our sample mean minus one should approximate the population size. Here our sample mean was 17, so this method estimates the population size as 2(17) - 1 = 33.
1:22 "after the war, the allies can go into those tank making factories" I like how knowing if our math was right is more important than having won the war.
The Cold War immediately took over international politics following WWII. Once the Allies realize the math was correct, it changes how they both conduct espionage and counteract it.
As my late Father would have said, "Not cessinarily!". It is also true that more of the lower numbered tanks would have been destroyed or broken down and replaced and no longer in service.
So the number line does not reflect a set of equally likely observations. Some of the serial numbers that are not yet observed are less likely to be observed than others. I think I am understanding this right, the not yet observed numbers between the maximum and minimum have a higher average probability of being observed than that of the numbers outside the bounds. And if the biases don't cancel each other out, the prediction is skewed. I'm sure this is a well known probability thing I'm just working this out
It’s a fairly obvious observation, I think. The mathematics being shown assumes that all objects appear at once, so no temporal complications. Presumably the mathematicians engaged in this work factor in the production dates where they were known. Another confounding problem is repair or rebuild. For example, Russia is taking old tanks and rebuilding them into modern configurations. So these tanks are not entirely produced from new - but serial numbering is going to be a mix of old and new, dependent upon components. (It’s very complicated in this case, because there are multiple variants and changes between foreign and domestic components. We know how many thermal sights Russia bought from Thales in France, but don’t know how many domestic equivalents are being produced, as an example. So if you get a Thales serial number it’s somewhat useful, but domestic ones require some time to aggregate the data. If you can’t capture enough data then it’s not going to work, but then there are other more obvious reasons for why the information isn’t necessarily helpful in this case.) ‘Spys’ typically rely upon stuff like observing rail shipments. This can be gamed (which Russia has a long history of doing, because they aren’t fools) to feed false information to your opponents, however. Serial numbers are much more solid, provided you can make sense of the systems being used. These are kept very secret, unsurprisingly.
Man, it's 11pm local time, I've been awake since 4am, my week was a rollercoaster, I'm mad about my job, I'm dealing with a woman who is getting on my nerves, my bank account is zeroed, I'm tired and pissed... But for some reason, his enthusiasm telling this story made me happy instantaneously. Thank you for this, God bless you and your beloved ones. Got a subscription.
Bro, in the future you will stumble upon your comment and you'll remember where you are at now in your life. You've made it this far, you'll keep going
A company I worked for made computers & peripherals and used 64 bit random serial numbers. They had multiple manufacturing sites, and calculated that the odds of selecting two identical numbers was smaller than human bookkeeping and errors trying to coordinate multiple product lines.
So, like YouTube assigning video IDs, they decided that it was faster and more accurate to just check for duplicates, because the probability of the same number being assigned twice in the time it takes to check if it has already been used is extremely small.
@@ragnkja Even checking for duplicates would be unnecessary if cryptographic hashsums are used. The odds of getting randomly occurring collisions with them are so low that on average it would take much longer than the lifetime of the universe.
@@SaHaRaSquad Yah, treat a huge range of numbers as a domain, split it into segments and assign a segment to each factory, i.e. a 64-bit number where the top 3 or 4 bits are specific to each factory. Increment the value at each factory independently of each other per product, and assign a hash of that value as the product serial number 👍
Yep, if you use 64-bit numbers the probability of a single collision in the numbers starts to rise only after about 4 billion devices are manufactured. And even then: so what? Almost all numbers are unique.
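The "about 4 billion" figure is the usual birthday bound, since 2^32 is the square root of 2^64. A small sketch of the standard approximation, p ≈ 1 - exp(-n(n-1)/(2·2^bits)), is below; the function name `collision_probability` is made up for this example, and the formula assumes the serials are drawn uniformly and independently.

```python
import math


def collision_probability(n_items: int, bits: int = 64) -> float:
    """Approximate chance that at least two of n_items random IDs collide."""
    space = 2.0 ** bits
    exponent = n_items * (n_items - 1) / (2.0 * space)
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for very small x.
    return -math.expm1(-exponent)


if __name__ == "__main__":
    for n in (10**6, 10**9, 2**32):
        print(f"{n:>13,d} devices: p = {collision_probability(n):.3g}")
```

Under this approximation, a million devices give a collision chance far below one in a million, and the chance only reaches roughly 39% at 2^32 devices.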
My mind went to the same place. As the sample size increases, the average should approach the median number. I wonder if the methods in the video offer a meaningful improvement over simply doubling the observed average.
I did the same thing except I subtracted 1 to estimate 33. My reasoning is that if we have N tanks, all with equal probabilities, the expected average of the distribution is 1/N * (sum of 1 to N) = (N+1)/2 . Estimating this with the sample average μ, you get N = 2μ-1, which is why I subtracted the one.
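As a quick check of that arithmetic, here is the "twice the sample mean minus one" estimate applied to the two draws from the video. The function name `mean_estimate` is just a label for this sketch, and the third sample is invented to show a known weakness.

```python
def mean_estimate(serials):
    """Estimate N from the sample mean, using E[mean] = (N + 1) / 2."""
    mean = sum(serials) / len(serials)
    return 2 * mean - 1


if __name__ == "__main__":
    print(mean_estimate([1, 15, 16, 23, 30]))   # first draw from the video
    print(mean_estimate([3, 10, 15, 18, 24]))   # second draw from the video
    # Weakness: the estimate can fall below a serial that was actually seen.
    print(mean_estimate([1, 2, 3, 30]))
```

The last line shows why this estimator is usually not preferred: it can return a value smaller than the largest observed serial, which we already know is impossible.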
@@williamnathanael412 If you'd typed that into google instead of the YouTube comments, you'd have an answer immediately. But now, you have a sarcastic response 7 minutes later instead.
@@Nick-the-fox Dude, what is an "anti-gamer channel"? Is it just one that reports on game devs being overworked at fromsoft or that was anti-gamergate ten years ago?
There's a thing called "format-preserving encryption" which can be used to make sequential numbers look random. The nice thing about it is that the encrypted number is in the same domain as the plain number (i.e. if the original numbers range from 0 to, say, 1 million, the encrypted numbers will also be in that range), so the attacker doesn't know they are encrypted and thinks it's just a plain sequential number. I've used that to protect against brute-forcing IDs on a system, while keeping the IDs short enough to be encoded as a barcode.
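To illustrate the idea (this kind of scheme is usually called format-preserving encryption), here is a toy sketch: a small Feistel network over just enough bits to cover the domain, with "cycle-walking" to push any out-of-range result back into range. It is not the commenter's actual system and not a vetted cipher such as the standardized FF1 mode; the key, round count, and function names are all assumptions for demonstration only.

```python
import hashlib
import hmac


def _round_value(key: bytes, rnd: int, value: int, half_bits: int) -> int:
    """Keyed pseudo-random round function, truncated to half_bits bits."""
    msg = bytes([rnd]) + value.to_bytes(8, "big")
    digest = hmac.new(key, msg, hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big") & ((1 << half_bits) - 1)


def _feistel(key: bytes, x: int, half_bits: int, rounds: int = 4) -> int:
    """A permutation of the integers 0 .. 2**(2*half_bits) - 1."""
    mask = (1 << half_bits) - 1
    left, right = x >> half_bits, x & mask
    for rnd in range(rounds):
        left, right = right, left ^ _round_value(key, rnd, right, half_bits)
    return (left << half_bits) | right


def encrypt_id(key: bytes, x: int, domain: int) -> int:
    """Map x in [0, domain) to another value in [0, domain), one-to-one."""
    if not 0 <= x < domain:
        raise ValueError("x must lie in [0, domain)")
    half_bits = max(1, ((domain - 1).bit_length() + 1) // 2)
    y = _feistel(key, x, half_bits)
    while y >= domain:  # cycle-walking: re-encrypt until back in range
        y = _feistel(key, y, half_bits)
    return y


if __name__ == "__main__":
    key = b"demo-key"  # hypothetical key
    print([encrypt_id(key, i, 1_000_000) for i in range(5)])
```

Because the Feistel network is a permutation and cycle-walking only follows that permutation until it lands back in range, every sequential ID maps to a distinct ID in the same range.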
And if you know where they start. For example, if the serial number was a date, this just wouldn't work (even though the numbers are sequential, they are not consecutive)
Yeah, real world probably has a lot of further problems. Like what if one month all new tanks go to front X, one month they all go to front Y, and your information and rate of capture/observation is different, for example ... ? But then, you might also have some rough indications from observation planes or train schedules or something that might help correlate some gaps in your data. Of course, there might also be decoys and whatnot ... well, I'm sure a lot can be done there.
I heard this story ages ago, but never understood how it worked. That "flipping the number line around" line makes so much sense; so simple once the trick's revealed. Lovely!
In arms production it's fairly common for factories to assign serial number ranges to particular products in advance, so the serial number ranges having gaps within them is relatively normal. It's also normal for them to start production at something like 10,000 if they expect to make in the tens of thousands of that particular item; that way all the items are serialized, but they also maintain the same number of digits in their serial number for uniformity without using a bunch of leading zeros. Overrunning that serial range usually results in a letter prefix or suffix being added.
By what metric is that better than using leading zeros? Or, why the aversion to leading zeros? (Also, why not just use GUIDs? Fixed size, convey identity but no other information, never going to run out.)
@@halfsourlizard9319 As for the GUIDs, because the use of serial numbers for arms predates the invention of GUIDs by 100+ years. So, in other words, tradition: why change when you already have a perfectly workable scheme?
@@halfsourlizard9319 When creating records, people will often omit leading zeros when recording numbers, possibly out of laziness, possibly by convention. Forcing the leading digit to be a non-zero digit prevents this deletion from happening. Why care about leading zeros? The zeros still have meaning. For instance, the number of digits present can be helpful in indicating that a number in a record is a serial number specifically. Further, whenever number codes get concatenated, it's important not to omit digits or this will change the shape of the number code, e.g. if the serial number were a concatenation of year-month-number. Granted, concatenated codes should be dash separated or similar. But if we can't trust the clerk to put the leading zeros on the number, why would I trust the clerk to bother writing dashes between numbers?
I used this method with serial numbers of accordions made in the late 30s by Hohner, a German company. Now I have a spreadsheet named "The German Accordion Problem" with more than 150 rows.
@@dragoncurveenthusiast Unfortunately they didn't mark the month in the serial number, but fortunately they didn't restart every month either. That means I could estimate the total number of accordions with serial numbers between 1934 and 1940 to around 860000.
@@JtotheAKOB I'm relatively certain. Out of the 150, I have ~20 serial numbers for which I also know the actual production date. If you plot the numbers vs the dates, you get a lovely almost linear (R^2=0.995) graph. The only way I can think of to get this relationship while preventing accurate estimates, would be to randomly skip numbers with a constant probability.
My immediate thought was that the average of a random subset should be the same as the average of the whole, so the number of tanks should be twice the mean of the picked ones 2*(1+15+16+23+30)/5=34 for the first pick and 2*(3+10+15+18+24)/5=28 for the second. My guess is that they used multiple estimate methods and weighted the results depending on inherent uncertainties/errors of the methods
@@Mayur7Garg The average is roughly half of the total since you have both low and high numbers. Average tries to arrive at the middle point of the number set when all the numbers are unique and in series. (1+15+16+23+30)/5=17, which we know is too little since we have the number 30 in the series.
@@yurie2388 Basically it stems from the fact that the median and the mean would be identical for such a series. So if you know the mean, then you can use it like a median to assume that the final number is at twice the distance. But in that case, using the median in the first step directly is more appropriate. Also, one issue that I have with all these solutions including the one in the video is that they do not seem to work if the serial numbers do not start from 1 but from let us say 100.
Cool. When I was studying population biology we were given a task to work out the number of taxis in a city and we used the capture, mark, recapture method, using the taxi number, rather than marking anything. So, just noting the numbers in a given time (capture and 'mark') and then noting the numbers in a given period, which was later (recapture v not seen before). There are all sorts of sample to population complexities and improvements to the estimate with longer observations (but issues with recounts if the obs period is too long). Also, an improvement if a third count period is used. I wonder if there are any seminal capture, mark recapture examples that Numberphile might comment on and re-create on brown paper?
11:57 - the other problem would be the oldest tanks (i.e. built pre-1939) were either destroyed, removed from service or rebuilt into something else (like AA or AT platform) by the end of war
I love the "German Tank Problem." There's a great video on YouTube showing this method of counting the Commodore 1571 Disk Drives. Using this technique for "other real-world problems" is a fun exercise.
It's the same question we posed as kids: "How do you count a herd of sheeps?" - "You count the legs and divide the number by 4." At the time we found that funny.
(I apologize if this has been asked already!) Is there any benefit or issue (other than reducing observation size) in setting aside the highest and lowest observations to prevent the concern that you selected the highest and lowest, and how that affects the average gap? If you set aside the highest (30) and lowest (1) in your first go-around, you would just have 15, 16, 23. Gaps are 14, 0, and 6. Average gap = 6 2/3. If you rounded to a whole number, you'd get an estimated amount of 30. In the second example, you had 3, 10, 15, 18, and 24. Eliminating the highest and lowest leaves 10, 15, and 18. Avg gap = 5. Estimated max of 23 (18+5). Just having 3 used observations (3 gaps) creates instability due to the low sample size, but it seems to alleviate the concern that you may have observed the max without knowing it.
Someone read Cornelius Ryan's The Last Battle and his interview with Chuikov. I just noticed someone mentioned Downfall too. The book is the source material.
@@robinsparrow1618 This is a common phenomenon I've seen over the years. Someone will watch the video and, after watching a funny part, they click pause, then copy the timestamp, forgetting that this timestamp is after the clip.
I think there may be a simpler and more accurate way to do the estimation. My first thought gave estimates of 34 and 28 for the two trials, beating Brady's estimates of 35 and 27.8 both times compared to the actual number 30. Assuming "everything is equal and random" (i.e., a uniform distribution), just take the average of the tank numbers and double it. This also balances all the potential gaps.
That is certainly a valid method as well, also unbiased (meaning on average you will be spot on). However, the "maximum plus average gap" method is more efficient, i.e., it has a lower mean squared error: the squared difference to the actual N will on average be smaller than using your method. And that is what you want from an estimator!
Hey everyone, I just started learning how to use Octave and just for kicks I made a program to do this very estimation. (Thanks for sharing Numberphile, this is really neat stuff!) Here's the *script* if anyone wants to fiddle around with it: (I've added a percent error so you can see just how remarkably accurate this estimation is!)
actualNumberOfTanks = ceil(rand*1000);
disp(["actual number of tanks: ", num2str(actualNumberOfTanks)]);
% never pick more tanks than actually exist
numberOfPicks = min(ceil(rand*100), actualNumberOfTanks);
disp(["number of picks: ", num2str(numberOfPicks)]);
% draw distinct serial numbers (sampling without replacement)
tankNumberPicks = randperm(actualNumberOfTanks, numberOfPicks);
disp("tanks randomly selected: ");
disp(tankNumberPicks);
% estimate = maximum + average gap
estimatedNumberOfTanks = max(tankNumberPicks) + (max(tankNumberPicks) - numberOfPicks) / numberOfPicks;
disp(["Estimated number of tanks: ", num2str(estimatedNumberOfTanks)]);
percentError = round(((estimatedNumberOfTanks - actualNumberOfTanks)/actualNumberOfTanks)*100);
disp(["percent error: ", num2str(percentError)]); %;D
Wondering about the statement about N=max yielding the highest probability for a given distribution. Wouldn't you have to go for the *expected value* of N? Assuming an equal distribution for N>=max, you'd end up with N_expected = 30.6 and 38.6 for the seen examples. In the limiting case for high N, N_expected simply seems N/(pulls-2) larger than max.
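Those two numbers can be reproduced numerically. The sketch below assumes a flat prior over every N at or above the observed maximum and uses the fact that, for k distinct draws, the likelihood of a given N is proportional to 1/C(N, k). The infinite sum is truncated at an arbitrary cutoff, and the names `posterior_mean` and `n_max` are made up for this example.

```python
from math import comb


def posterior_mean(m: int, k: int, n_max: int = 20_000) -> float:
    """Expected tank count given maximum serial m among k distinct draws.

    Assumes a flat prior on N >= m; the sum over N is cut off at n_max.
    """
    if k < 3:
        raise ValueError("the mean only converges for k >= 3")
    total_weight = 0.0
    weighted_sum = 0.0
    for n in range(m, n_max + 1):
        weight = 1.0 / comb(n, k)   # likelihood of N = n, up to a constant
        total_weight += weight
        weighted_sum += n * weight
    return weighted_sum / total_weight


if __name__ == "__main__":
    print(round(posterior_mean(30, 5), 2))  # first draw, maximum 30
    print(round(posterior_mean(24, 5), 2))  # second draw, maximum 24
```

Under these assumptions the sum has the closed form (m - 1)(k - 1)/(k - 2), which gives 116/3 for a maximum of 30 and 92/3 for a maximum of 24 with five draws, in line with the 38.6 and 30.6 quoted above.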
I keep coming back to Brady's question at 11:41 -- if in my distribution, lower numbers are more likely, would there be an easy correction for that?
You would have to come up with a formula for how much more likely the lower numbers are and calculate some kind of upward bias out of that. I don't think the answer could be considered "easy", no.
For the purposes of war estimations, I suspect you'd find the opposite true, particularly as time goes on - the data would become skewed towards newer serials for everything, as older models were destroyed or made inoperable. Probably best to hash military serial numbers at manufacture time though, regardless.
When calculating the average gap, you should just count the difference between tanks (e.g., between 15 and 1, count that as 14). That way, what you get is the actual (asymptotically) unbiased estimator of the number of tanks. If we do it your way, we get: MAX + (MAX - k)/k = MAX * (1+1/k) - 1, which when we let k->infinity, MAX->Actual_value and therefore we get the Actual_value - 1. It can also be proven that the estimator is biased for any k, outputting smaller values than the real one. If we don't add that -k to the formula (that is, we count the gaps as just the difference), the estimator we get is MAX + MAX/k = MAX * (1+1/k), which is the actual unbiased estimator we should use in this case. One may call this the adjusted Maximum Likelihood Estimator (MLE). As you said in the video, the MLE is just the MAX (the number most likely to be right), but it is biased. What we did with this trick was, as you explained, correct it. A more standardized way to compute this correction would have been to calculate the Expected Value of the MLE we got, to then apply the necessary multiplicative correction. That is, if it is necessary at all (the MLE might as well be unbiased itself). This is one of the most used methods for estimating stuff out in the real world (when we are able to get an MLE).
I agree, the definition of the distance looked a bit fishy to me. But as k infinity when k -> infinity. So I think, you might need a more sophisticated reason to make your point.
Have you ever done a video on "Hyper Log Log"? We use it in massive data systems for efficiently estimating the number of unique values. It is very interesting, and freakily accurate.
Somewhere deep within my brain I'm pleased with this video because Dr Grime always reminds me of the young folk who went to ww2 saying they were adults when they weren't and this video is about tanks
it would also make sense to calculate the average of the samples and multiply it by 2 as the average of consecutive numbers starting at 1 would be about n/2 and the average of the samples would also approach the same value.
Close. The average of the observations is an estimator of the mean of the serial numbers in the bag. You got that much right. But the average serial number is (n+1)/2. So you have to double it and then subtract one.
The reason for those extra steps is that you usually have padding around the serial numbers, like just start counting at 1500 because the 15 means something else, and the last two digits are sequential. Which they did touch on, but not a lot.
This is the method I thought of too, but it turns out that the numbers you find other than the max aren't relevant. However many tanks there are, finding 1, 2, 3, 4, and 30 is the same as finding 26, 27, 28, 29, and 30 (as long as the serial numbers start at 1).
@@halbronk7133 "the numbers you find other than the max aren't relevant": This isn't precisely true. They are relevant in the sense that they produce a valid estimate for the maximum. The problem is that it ignores relevant information that we know about the problem (that the numbers are sequential without gaps). And usually when you don't use some piece of information to derive your answer then it is possible to do better.
This episode was great. If you come across more war history examples, please post them. My son loves war history and was fascinated by this. This helps to understand why math is important.
This is actually true (at least according to Richard Marcinko's autobiography). Now presently there are well over six SEAL Teams (8?) but when Marcinko created a specialist SEAL unit in 1980 there were only two other ones, and "Seal Team 6" was a deliberate attempt at deceiving the Soviets.
I always find it fascinating how these equations can be derived after rigorous application of a simple general concept, like at the beginning of the video you can feel that the frequency of smaller numbers (hence more smaller gaps) would affect the estimate but the quantifying part takes time to visualize in its precise form
For electronics with network cards, companies are assigned ranges of MAC addresses as they are supposed to be universally unique. The range could allow one to estimate the number of devices they sell.
The operative words here are "supposed to be". And nobody says they have to be assigned sequentially. Each organizationally unique identifier (OUI) can create 16 million unique MAC addresses. And you can have more than one OUI.
I got a solution to this problem in a maths challenge about eight years ago. My approach used conditional probabilities, and the expected number of tanks in the bag would be N = MAX(k-1)/(k-2). For five observations (k = 5), this gives N = MAX(4/3). This is higher than the average gap approach, which gives N = MAX(6/5) - 1.
I was thinking about taking the average and doubling it. The idea being that the average would be approximately in the middle of the true number, so double the average would be close to the true number.
I think you can improve this estimate by subtracting 1 at the end, since the average of the numbers 1 up to and including N is (N+1)/2 rather than N/2. Denoting the sample average by X, your idea is that X should be approximately equal to (N+1)/2, which would imply that N is approximately equal to 2X-1. I'm actually curious to see how this performs (in general) compared to the method presented in the video.
@@akshaj7011 That's true, but the average gap wouldn't work either if they take into account the gap from 0 to the first element. If the starting point were unknown, I would probably use the standard deviation in the same manner.
Brady's final questions show amazing insight. My favorite anecdote involves SEAL Team Six. There was not a 5, they just used the number to make people think there were more teams. I don't know if this story is true, but I like it and it shows that you have to know the parameters of the numbers instead of assuming a sequence starting with 1.
This feels like one of those widely usable maths that I won't be able to find an application for anytime soon... then when the time comes, I'll remember there's a solution but not what it is 😅 Bookmarking it now for that future occasion, haha
OK, what about this: As the sample size increases, the average of the sample will approach the average of the population, so let's estimate the average like that. For a uniform distribution starting at zero the maximum is simply two times the average, but in this example the minimum is one, so we'll just subtract one from our average. Using this method I get 32 and 28 tanks, respectively.
Although these specific estimates have less error than the ones presented by James, on average his method will be better, at least for larger samples. I did some simulations, and for small samples, say three, it's pretty close, but James' method has a lot more bias.
@aksela6912 Funny thing is, no matter the prior you use, the posterior probability of N (the total number of tank) is just the prior truncated starting from M (the maximum of the observed serial numbers). In other words, a Bayesian answer, no matter the prior, should only depend on M (not even on the number of samples).
@@cryme5 For a uniform distribution the variance of the sample median will be greater than the variance of the sample mean, and as mean and median should be the same it will be better to use the one with less variance. I have to reiterate though, sample mean times two is a poor estimator, even if it feels more intuitive, and it feels like you're utilising the collected data better.
@@aksela6912 James's method is unbiased. If you observe n tanks and the maximum value you observe is m, then the minimum variance unbiased estimator is m + m/n - 1. Your estimator of twice the sample mean minus one is also unbiased, but its variance is higher. And it doesn't use the important information of the sample maximum, which means the estimate might actually give a value we _know_ is too small.
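Both claims can be checked exactly for a small case by enumerating every possible sample, with no simulation noise. The sketch below uses a hypothetical bag of 10 tanks and samples of 3; the helper names are invented for this example, and exact fractions are used so the comparison is not blurred by rounding.

```python
from fractions import Fraction
from itertools import combinations


def max_based(sample):
    """m + m/k - 1, with m the sample maximum and k the sample size."""
    m, k = max(sample), len(sample)
    return Fraction(m) + Fraction(m, k) - 1


def mean_based(sample):
    """Twice the sample mean, minus one."""
    return 2 * Fraction(sum(sample), len(sample)) - 1


def exact_moments(estimator, n_tanks, k):
    """Exact mean and variance over all equally likely k-subsets of 1..n_tanks."""
    values = [estimator(s) for s in combinations(range(1, n_tanks + 1), k)]
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, variance


if __name__ == "__main__":
    for name, est in (("max-based", max_based), ("mean-based", mean_based)):
        mean, var = exact_moments(est, n_tanks=10, k=3)
        print(f"{name}: mean = {mean}, variance = {float(var):.3f}")
```

For this small case both estimators average out to exactly 10, and the max-based one has the smaller variance, which matches the comment above.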
Would make a nice graph plotting your best guess of total tanks, pulling one tank at a time. Any time you get a new biggest number the plot would jump up, and when you get smaller numbers it will slowly descend as your average gap gets smaller. It would jerk up and down, approaching the actual total number.
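Here is a small sketch that produces the numbers for such a graph: it pulls every tank out of the bag in a random order and records the "maximum plus average gap" estimate after each pull. The bag size of 200, the seed, and the name `running_estimates` are arbitrary choices for illustration; the resulting list can be handed to any plotting tool.

```python
import random


def running_estimates(n_tanks: int, seed: int = 0):
    """Estimate after each pull when drawing all tanks in random order."""
    rng = random.Random(seed)
    order = rng.sample(range(1, n_tanks + 1), n_tanks)  # a random permutation
    estimates = []
    largest = 0
    for k, serial in enumerate(order, start=1):
        largest = max(largest, serial)
        # maximum plus average gap: m + (m - k) / k
        estimates.append(largest + (largest - k) / k)
    return estimates


if __name__ == "__main__":
    series = running_estimates(200)
    for k in (1, 5, 10, 25, 50, 100, 200):
        print(f"after {k:3d} pulls: estimate = {series[k - 1]:.1f}")
```

Once every tank has been pulled, the maximum equals the pull count and the estimate lands exactly on the true total, so the curve is guaranteed to end at the right value.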
Agreed, that would be nice to look at, though you'd want a larger set than 30. It should start out as a line jumping up and down but rapidly smoothing out. After it calmed down a bit you could probably do a bit of "eyeball extrapolation" to get a more accurate estimate than the last prediction.
I appreciate the 500% more description. A lot of people muddle that up and would say 600% more. They skip past the fact that 6 times as many is 600% OF or 500% MORE. I think my estimation was a little different technique. If we take the arithmetic mean of all the tanks we come up with a number that is half the total. So by taking the mean of the numbers on tracks pulled out of the bag, we can double it.
That was brilliant! your initial picks are exactly why it was so hard for me to grasp probability at school until I realized it is about multiple events and doesn’t work that great for a single event
At first, I was perplexed about the method of estimating monthly production with just serial numbers, but I am glad they explained they had a way to decode the month and factory of the tank as well. I assumed some of these numbers must have been intentionally hidden or misleading.
No, they were just contracted to different manufacturers and sub-models (Ausführung) and were assigned specific number ranges. The gearboxes, or rather specifically the engines with the geartrain attached, were often shared between different models; the Panzer V Panther and Panzer VI Tiger, for example, shared the same engine platform, which differed only in minor details and power. In the later stages of the war it was not uncommon to use what was in stock or to repair tanks with parts from different models.
I have another method of solving this: average the numbers of the 5 random tanks we picked; that average will be close to the average of all the tank numbers. So avg of 5 = 17 = n(n+1)/(2n) = (n+1)/2 (the average of all numbers on the tanks, where n = total number of tanks), and we get n = 33 as the total number of tanks.
If you take the simpler formula of twice the average value of the tanks, it actually gives better prediction in this case (34 and 28, if I can still perform additions)
This video is brilliant! I knew about the story and always thought there was some really complicated math behind the scientists' work. Nicely explained, thanks! :)
The Enigma machines were for coded communications, I think. I think he meant that the serial numbers were coded, and it isn't uncommon for different companies and factories to have different ways of doing things.
We were not given a key piece of info at the beginning - what year did this occur in? German AFV (including tanks, but also tank destroyer, etc) production varied widely across WW2. For example, 3600 in 1941 and 19000 in 1944.
Agreed. If you observe 1000 tanks, is it better (on average of course) to treat that as one big observation or to split it into 10 smaller observations ? I feel like that would be useful info to have.
Yes, probably. The probability of encountering uncommonly extreme gaps becomes smaller and smaller the more samples you take. When you make as many observations as there are tanks, all leeway in gaps has vanished and your accuracy has reached 100%.
Why not just take the average of the numbers on the tanks? I mean (sum of the numbers divided by the number of tanks, times two): ((1+15+16+23+30)/5)*2=34, which is closer to 30 than his calculated 35; the second try is ((3+10+15+18+24)/5)*2=28, which is again a bit closer to the actual 30 than his 27.8. Am I just being lucky, or is it a better way to do it?
When i was doing stats at university the lecturer had us fill in a questionnaire on day one to give us some nice data to do analysis on (birthdays and such). It was all nice data except that there wasn't a single left hander in the class. Not one. There ought to have been about ten but there was zero. Credit to the lecturer, he rolled with it. His attitude was, "these things happen - we don't fudge our data". It was actually a great class.
Would be cool to see the mathematical derivation of calculating the expected value of the tanks using an infinite sum of the probability at the beginning
When I saw 23, I noticed that it was one of The Numbers. Then I saw 15, which also was one. Then, 16 followed, and it's also one of them. So, it started with half of The Numbers, although in a scrambled order
Before watching the video I had another method which I think works pretty well too. Because you're pulling random samples, the samples can be assumed to be distributed somewhat randomly along the whole range. So if you take the average of the numbers you have found, that can be assumed to be approximately the middle of the range. Multiply that by 2 and you should get close to the max of the range.
This raises 2 questions: 1) how are you able to leave an unresolved Rubik's cube loose on your shelf; 2) why are your tanks pointing their turrets backwards?
Actually... The cube has a different purpose. If you watch carefully. the cube has changed configuration 4 times through the video. Most of the time by 1 or 2 moves. I think it's encoded messages to pass military information (about British tanks, obviously) to the enemies without looking suspicious.
When tanks are being transported, whether by rail or on ships, they are almost always oriented with the turrets reversed to make more of them fit in a certain space. Not having the cannon barrel pointing forward usually means more of them can be crammed into a space.
I think the “failed” demo was perfect since you had to explain not only how it works, but also where the formula fails. Reminded me of school. The teacher would teach the easiest way to understand something, but then on a test it would be the hardest example/use of that formula. School failed, numberphile succeeded.
I came up with another way to estimate the number of tanks. Not sure which method is superior. First, I determined the general formula for the average of a set of numbers counting sequentially from 1 to N. I rearranged it to solve for N. Then you take your sample, determine the average and estimate N. Using the same samples as in the video I got N=33 and N=27. The average answer was 30!! I derived this as follows, writing A(x) for the average of 1 to x:
A(1) = 1, A(2) = 1.5, A(3) = 2, A(4) = 2.5
A(x) = x - ((x-1)*0.5) = x - (0.5x - 0.5) = 0.5x + 0.5
So Avg = 0.5x + 0.5, and x = 2(Avg - 0.5) = 2Avg - 1
(23, 15, 16, 1, 30): Avg = 17, x = 33
(3, 10, 15, 18, 24): Avg = 14, x = 27
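For anyone who wants to check the arithmetic in this thread, here is a minimal Python sketch that applies both estimators to the two samples drawn in the video. The function names are made up for illustration; the formulas are the "2 × average − 1" rule from the comments and the "maximum plus average gap" rule from the video.

```python
from statistics import mean

def mean_estimate(serials):
    """Estimate the total as 2 * (sample mean) - 1."""
    return 2 * mean(serials) - 1

def max_gap_estimate(serials):
    """Estimate the total as max + average gap, i.e. m + (m - k) / k."""
    m, k = max(serials), len(serials)
    return m + (m - k) / k

# The two samples drawn in the video (the bag actually held 30 tanks).
for sample in ([23, 15, 16, 1, 30], [3, 10, 15, 18, 24]):
    print(sample, mean_estimate(sample), max_gap_estimate(sample))
```

On these two draws the mean-based rule gives 33 and 27, and the max-plus-gap rule gives 35 and 27.8, matching the numbers quoted in the comments.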
The video: 38 minutes ago
The comment: 1 day ago
*_time travel confirmed?_*
@@Blutania It's standard for videos to be uploaded to UA-cam some time before they go live to everyone, so the uploader and, not infrequently, also patrons, channel members, or other privileged people who get given a link to the still private video can comment on it before it's published.
@@Blutaniawas private yesterday
Send this video to Ukraine 🇺🇦
It's likely they used that method, but a less mathematical way of doing it is permissible. For one, you're going to be sending spies and spy planes to bases, storage yards and depots, and since a lot of these things are big and out in the open, and since you can only form as many tank units as you have tanks for, you likely have a decent count of the number of units they have at a given time. If you know the enemy's serial numbering system, then the rise in the serial numbers over time on captured equipment will tell you their production rate: if last month the highest serial numbers were in the low 1500s but now they are in the upper 1700s, it doesn't take a math PhD to figure out you're looking at about 270 tanks. Also, since the serial numbers tell you number, date and location, they tell you something more important: the lag in their logistical system. If you know how long it takes the enemy to make and move stuff, you can predict movements and actions to some degree.
You didn't just pull out the first and last, but also the middle tanks 15&16!
And 23. Iluminati!!!
there's a mathematician!
Thats’s Numberwang!
SPOILER ALERT!11
the luckiest draw at the unluckiest time!
I adore the fact that you left the initial pull in the video, because that is the truth in probabilities. I appreciate your videos!
True randomness is clumpy. That's why music streaming services often don't use true randomness-you'll get too much serendipity that feels unshuffled.
6:33 I love the way the turrets are pointing at their actual positions in the number line :)
Oh, I didn't notice it )
And how the treads are in motion on the tanks. Editor going above and beyond. Bravo!
4:47
It's such a little detail for nerds. Love it as well
@@miketothe2ndpwr I don’t think it’s exclusive for nerds. It’s for anyone who pays attention at details to appreciate.
I've just simulated 10,000 runs of this operation with fewer than 100 tanks and between 10 and 50 samples. Using the maximum value alone as the total number of tanks gives an average error of 4, while using the maximum value + average gap gives an average error of 2.6. That cuts the error by about 35%! Amazing!!!
What about other methods, like mean value * 2?
@@vitriolicAmaranth this is exactly where my mind went. I feel like it should almost be equivalent to mean gaps, but I probably just haven't thought hard enough
@@vitriolicAmaranthseem to work as well!
I know it's irrelevant, but there's the old joke about letting three sheep loose in a field, but first labelling them "1" "2" and "4" so the person rounding them up spends ages looking for the 3rd.
I read about this prank in the book Show Me How or More Show Me How.
it's vaguely relevant!
It would also make sense in this case since the Germans wanted to make the appearance that they were building more tanks than they actually were. As such they could have skipped a couple of number in their serial. But I guess it would create to much chaos for the German mind to handle. xD
MI5 officer Peter Wright wrote in his book Spycatcher that MI5 bugged the Soviet embassy in Ottawa. MI5 marked every listening cable with a number, from 1 upwards. But in case the Soviets found these cables, MI5 omitted some numbers, hoping that the Soviets would almost have to tear down the embassy in order to find the missing ones. (But the trick did not work, since the Soviets had a spy within MI5 informing them how many cables there were and which numbers they had. So the Soviets never searched for the omitted numbers.)
@@SwedishNeo "New orders from Berlin: We are to skip a few serial numbers when imprinting parts, so our tank production looks bigger than it is..."
- "But zat will bring dizorder to mein numbers!"
For several of my clients we incremented the serial number by some prime, rather than by one, in order to obfuscate the output somewhat. It also gave us some degree of parity checking on serial numbers later. Silly, really, but fun.
I would use a hash function. A secret number placed after the normal serial number, and then hash it and then use it as official serial number. Then every unit has its own official serial number, you have the secret and you can look it up, what the real serial number was and nobody is able to guess any different valid number. Even if he knows every number (except the secret) and your algorithm to create them.
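A rough Python sketch of that idea, using a keyed hash (HMAC-SHA256 from the standard library) rather than a plain hash with the secret appended. The key, the 12-digit length and the function name are all assumptions made up for illustration, and the lookup table stands in for "you can look up what the real serial number was".

```python
import hashlib
import hmac

# Hypothetical secret; in practice it would be kept private by the manufacturer.
SECRET = b"replace-with-a-real-secret"

def public_serial(internal_serial: int, digits: int = 12) -> str:
    """Derive a fixed-length decimal serial from the internal counter."""
    mac = hmac.new(SECRET, str(internal_serial).encode(), hashlib.sha256).digest()
    # Take 64 bits of the MAC and reduce them to the requested number of digits.
    return str(int.from_bytes(mac[:8], "big") % 10**digits).zfill(digits)

# The manufacturer keeps a private table mapping public serials back to counters.
lookup = {public_serial(i): i for i in range(1, 1001)}

print(public_serial(42), "->", lookup[public_serial(42)])
```

Truncating to 12 digits makes collisions possible in principle, so a real scheme would still need a duplicate check when a new serial is issued.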
@@accountxabcdef
I would use a uuid and convert it to numbers via a bespoke translation - just have a check to avoid the rare collision
isn't it quite easy to figure out that all numbers are incremented by a prime number?
@@AndreyCizov
You would need to see a few machines bought at the same time. You can not trust, that there will be all numbers used. Often there is a gap when an updated version is used. And when at that time the prime is changed, have fun to reverse engineer the prime...
There will be enough people who are able to spot it or even reverse engineer it, but that number shouldn't be that great (it depends on the batch size, the amount sold to individual customers, and the price: the more expensive it is, the more reward there is in getting something free under warranty).
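To make the "is it easy to figure out?" question concrete, here is a small Python sketch. It assumes a hypothetical scheme where unit i gets the serial START + i × STEP, with START and STEP invented for the example; an observer who sees a handful of serials can take the greatest common divisor of their differences.

```python
from functools import reduce
from math import gcd

def guess_step(observed_serials):
    """Return the GCD of the gaps between observed serials.

    The result is always a multiple of the true step, and equals it
    once the observed unit indices have coprime differences.
    """
    ordered = sorted(observed_serials)
    diffs = [b - a for a, b in zip(ordered, ordered[1:])]
    return reduce(gcd, diffs, 0)  # 0 if fewer than two serials were seen

# Hypothetical numbering scheme, purely for illustration.
START, STEP = 100_003, 7919
observed = [START + i * STEP for i in (3, 8, 20, 41, 42)]

print(guess_step(observed))
```

With two consecutive units in the sample (41 and 42 here) the step falls out immediately. Changing the prime between batches, as suggested above, would defeat this simple version.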
At 6:36 you were right on! The gap below your minimum observation WAS equal to the gap above the maximum observation and the true number of tanks!
Amazing accuracy!
The tank serial numbers were all encrypted by both allies and axis powers making this story entirely false.
@@IDNeon357 He addresses that in the video. He said the encryption was cracked.
@@IDNeon357 lol what? First of all he said they deciphered the coding they were using but also how would you know this? and why do you say it like its established fact when it clearly isnt?
yea Im surprised he didnt really point that out :D
My first instinct was to use the Central Limit Theorem to assume that the sample mean would approximately equal the population mean. Since we know the distribution is uniform and the population mean of a population of size n is (n+1)/2, twice our sample mean minus one should approximate the population size.
Here our sample mean was 17, so this method estimates the population size as 2(17) - 1 = 33.
That's what I did too!
I love how british "they have a bit of a spy" is
It's not just a British thing. Sometimes I have myself a bit of a spy as well.
Personally, I have a bit of a lookey-loo
Sounds like a Karl Pilkington story
I'm glad I had a bit of a spy before making this exact same comment.
A bit of a stickybeak
I did this exact maths problem at high school in 1991, what a real blast from the past! Thank you!!!
1:22 "after the war, the allies can go into those tank making factories"
I like how knowing if our math was right is more important than having won the war.
Of course!
The Cold War immediately took over international politics following WWII. Once the Allies realize the math was correct, it changes how they both conduct espionage and counteract it.
Always preparing for the next one.
@@michaelwright2986Or, as they say, "The generals are always fully prepared for the previous war".
To the mathematicians, checking their math was the motivation to win the war
His enthusiasm is so contagious and it's so cool! The formula is surprisingly simple!
Pointing out that lower numbers are more likely is such a good observation. Brady keeps highlighting his genius video after video.
I don't know about genius but he does ask some excellent questions.
As my late Father would have said, "Not cessinarily!". It is also true that more of the lower numbered tanks would have been destroyed or broken down and replaced and no longer in service.
Exactly. Another type of survivorship bias.
So the number line does not reflect a set of equally likely observations. Some of the serial numbers that are not yet observed are less likely to be observed than others.
I think I am understanding this right, the not yet observed numbers between the maximum and minimum have a higher average probability of being observed than that of the numbers outside the bounds. And if the biases don't cancel each other out, the prediction is skewed. I'm sure this is a well known probability thing I'm just working this out
It’s a fairly obvious observation, I think. The mathematics being shown assumes that all objects appear at once, so no temporal complications. Presumably the mathematicians engaged in this work factor in the production dates where they were known.
Another confounding problem is repair or rebuild. For example, Russia is taking old tanks and rebuilding them into modern configurations. So these tanks are not entirely produced from new - but serial numbering is going to be a mix of old and new, dependent upon components. (It’s very complicated in this case, because there are multiple variants and changes between foreign and domestic components. We know how many thermal sights Russia bought from Thales in France, but don’t know how many domestic equivalents are being produced, as an example. So if you get a Thales serial number it’s somewhat useful, but domestic ones require some time to aggregate the data. If you can’t capture enough data then it’s not going to work, but then there are other more obvious reasons for why the information isn’t necessarily helpful in this case.)
‘Spys’ typically rely upon stuff like observing rail shipments. This can be gamed (which Russia has a long history of doing, because they aren’t fools) to feed false information to your opponents, however. Serial numbers are much more solid, provided you can make sense of the systems being used. These are kept very secret, unsurprisingly.
Alternative title: Local British mathematician gets blindsided by sheer stupid luck
Watching James Grime explain mathematics is such a joy.
All my homies love James Grime
he's just that fun mixture of adorable, approachable, nerdy, and just proficient in his job
Due to Siivagunner I have this mental image of him approaching menacingly to tell me about *e*.
But I agree, he is a joy to watch.
yeah, he always seems to have so much fun doing it
Making Math awesome.
Man, it's 11pm local time, I'm awake since 4am, my week was a rollercoaster, I'm mad about my job, I'm dealing with a woman that is getting in my nerves, my bank account is zeroed, I'm tired and pissed...
But for some reason, his enthusiasm telling this story made me happy instantaneously.
Thank you for this, God bless you and your beloved ones. Got a subscription.
Keep it up! Better times are ahead pal
Bro, in the future you will stumble upon your comment and you'll remember where you are at now in your life. You've made it this far, you'll keep going
You need money brother? Any way i can help?
@@sirllamaiii9708such a kind Man U are bless you sir
Hope you're doing better now. And even if you're not, it's all gonna be alright ;)
A company I worked for made computers & peripherals and used 64 bit random serial numbers. They had multiple manufacturing sites, and calculated that the odds of selecting two identical numbers was smaller than human bookkeeping and errors trying to coordinate multiple product lines.
So, like UA-cam assigning video IDs, they decided that it was faster and more accurate to just check for duplicates, because the probability of the same number being assigned twice in the time it takes to check if it has already been used is extremely small.
@@ragnkja Even checking for duplicates would be unnecessary if cryptographic hashsums are used. The odds of getting randomly occurring collisions with them are so low that on average it would take much longer than the lifetime of the universe.
Lol, that's awesome. I love and hate it.
@@SaHaRaSquadYah, treat a huge range of numbers as a domain, split it into segments and assign a segment to each factory, ie. 64 bit number where the top 3 or 4 bits are specific to each factory, increment the value at each factory independently of eachother per product, assign a hash of that value as the product serial number 👍
Yep, if you use 64-bit numbers the probability of a single collision starts to rise only after about 4 billion devices are manufactured. And even then: so what? Almost all numbers are unique.
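The "about 4 billion" figure is the usual birthday bound, roughly the square root of 2^64. A short Python sketch, using the standard approximation p ≈ 1 − exp(−n(n−1) / (2 · 2^bits)) and assuming the serials are drawn uniformly at random:

```python
import math

def collision_probability(n_devices: int, bits: int = 64) -> float:
    """Approximate chance that at least two of n random serials coincide."""
    space = 2 ** bits
    # expm1 keeps precision when the probability is tiny.
    return -math.expm1(-n_devices * (n_devices - 1) / (2 * space))

for n in (10**6, 10**8, 2**32, 10**10):
    print(f"{n:>14,} devices: {collision_probability(n):.3g}")
```

At 2^32 (about 4.3 billion) devices the chance of at least one collision is around 39%, while at a million devices it is on the order of 10^-8.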
I thought I was smart with my calculation of (1+15+16+23+30)/5x2 = 34 but this guy pulls out a giant sheet of paper and introduces probabilities.
My mind went to the same place. As the sample size increases, the average should approach the median number. I wonder if the methods in the video offer a meaningful improvement over simply doubling the observed average.
The x2 part didn’t make sense to me hmm 🤔
@@_..-.._..-.._ The first part of the equation calculates the average which is 17. To calculate the maximum you'd need to do x2 to get 34.
Same here
I did the same thing except I subtracted 1 to estimate 33. My reasoning is that if we have N tanks, all with equal probabilities, the expected average of the distribution is 1/N * (sum of 1 to N) = (N+1)/2 . Estimating this with the sample average μ, you get N = 2μ-1, which is why I subtracted the one.
No war thunder sponsor? Missed opportunity
THis is targeting a different audience
It's like a opera gx sponsor on a non gamer channel
What is war thunder
@@williamnathanael412 If you'd typed that into google instead of the UA-cam comments, you'd have an answer immediately. But now, you have a sarcastic response 7 minutes later instead.
Right here I have a bag of german tanks! Do you know where you can also find German tanks? WAR THUNDER!!!!
@@Nick-the-foxDude what is an "anti-gamer channel"? Is it just one that reports on game devs being overworked at fromsoft or that was anti-gamergate ten years ago?
There's a thing called "format-preserving encryption" which can be used to make sequential numbers look random. The nice thing about it is that the encrypted number is in the same domain as the plain number (i.e. if the original numbers range from 0 to, say, 1 million, the encrypted numbers will also be in that range), so the attacker doesn't know they are encrypted and thinks it's just a plain sequential number. I've used that to protect against brute-forcing IDs on a system, while keeping the IDs short enough to be encoded as a barcode.
This only works if the serial numbers are sequential. Knowing this, the US named the third SEAL team "SEAL Team 6" to confuse Soviet intelligence.
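As a toy illustration of how an ID can be scrambled while staying in the same range, here is a Python sketch built from a small Feistel network plus "cycle walking" (re-encrypting until the result lands back inside the domain). This is not the commenter's system and not a vetted cipher; real deployments would use a standardized mode such as NIST's FF1. The key, round count and function names are assumptions for the example.

```python
import hashlib

def _round(key: bytes, r: int, value: int, half_bits: int) -> int:
    """Keyed round function: hash the round number and one half of the block."""
    digest = hashlib.sha256(key + bytes([r]) + value.to_bytes(8, "big")).digest()
    return int.from_bytes(digest[:8], "big") & ((1 << half_bits) - 1)

def _feistel(x: int, key: bytes, half_bits: int, rounds: int, decrypt: bool) -> int:
    """Permute the integers below 2**(2 * half_bits)."""
    mask = (1 << half_bits) - 1
    left, right = x >> half_bits, x & mask
    order = range(rounds - 1, -1, -1) if decrypt else range(rounds)
    for r in order:
        if decrypt:
            left, right = right ^ _round(key, r, left, half_bits), left
        else:
            left, right = right, left ^ _round(key, r, right, half_bits)
    return (left << half_bits) | right

def _walk(n: int, key: bytes, domain: int, rounds: int, decrypt: bool) -> int:
    if not 0 <= n < domain:
        raise ValueError("value outside domain")
    half_bits = (max(domain - 1, 1).bit_length() + 1) // 2
    x = _feistel(n, key, half_bits, rounds, decrypt)
    while x >= domain:  # cycle walking: keep going until back in range
        x = _feistel(x, key, half_bits, rounds, decrypt)
    return x

def encrypt_id(n: int, key: bytes, domain: int = 1_000_000, rounds: int = 4) -> int:
    return _walk(n, key, domain, rounds, decrypt=False)

def decrypt_id(n: int, key: bytes, domain: int = 1_000_000, rounds: int = 4) -> int:
    return _walk(n, key, domain, rounds, decrypt=True)

key = b"demo key, not a real secret"
for serial in (1, 2, 3):
    scrambled = encrypt_id(serial, key)
    print(serial, "->", scrambled, "->", decrypt_id(scrambled, key))
```

Because the construction is a permutation of the domain, consecutive serials map to distinct values in the same range, and the key holder can always recover the original counter.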
And if you know where they start. For example, if the serial number was a date, this just wouldn't work (even though the numbers are sequential, they are not consecutive)
Dates would still reveal some info about how many tanks there are
He mentions that there's an encoding on top of this
Yeah, real world probably has a lot of further problems. Like what if one month all new tanks go to front X, one month they all go to front Y, and your information and rate of capture/observation is different, for example ... ?
But then, you might also have some rough indications from observation planes or train schedules or something that might help correlate some gaps in your data. Of course, there might also be decoys and whatnot ... well, I'm sure a lot can be done there.
Serial numbers are by definition subset of a series. You need to know the series.
I heard this story ages ago, but never understood how it worked. That "flipping the number line around" line makes so much sense; so simple once the trick's revealed. Lovely!
In arms production it's fairly common for factories to assign serial number ranges to particular products in advance, so the serial number ranges having gaps within them is relatively normal. It's also normal for them to start production at something like 10,000 if they expect to make in the tens of thousands of that particular item. That way all the items are serialized, but they also maintain the same number of digits in their serial numbers for uniformity without using a bunch of leading zeros. Overrunning that serial range usually results in a letter prefix or suffix being added.
By what metric is that better than using leading zeros? Or, why the aversion to leading zeros? (Also, why not just use GUIDs? Fixed size, convey identity but no other information, never going to run out.)
@@halfsourlizard9319As for the GUIDs, because the use of serial numbers for arms predates the invention of GUIDs by 100+ years. So, in other words, tradition - why change when you already have a perfectly workable scheme.
@@halfsourlizard9319 When creating records, people will often omit leading zeros when recording numbers, possibly out of laziness, possibly by convention. Forcing the leading digit to be a non-zero digit prevents this deletion from happening.
Why care about leading zeros? The zeros still have meaning. For instance, the number of digits present can be helpful in indicating that a number in a record is a serial number specifically. Further, whenever number codes get concatenated it's important not to omit digits or this will change the shape of the number code, e.g. if the serial number were a concatenation of year-month-number. Granted, concatenated codes should be dash separated or similar. But if we can't trust the clerk to put the leading zeros on the number, why would I trust the clerk to bother writing dashes between numbers?
It's normal now... but it wasn't normal then
@@AmiiboDoctor It was more normal then, actually. Sequential serialization is fairly rare.
I used this method with serial numbers of accordions made in the late 30s by Hohner, a German company. Now I have a spreadsheet named "The German Accordion Problem" with more than 150 rows.
Cool!
So, how many did they produce per month?
@@dragoncurveenthusiast
Unfortunately they didn't mark the month in the serial number, but fortunately they didn't restart every month either.
That means I could estimate the total number of accordions with serial numbers between 1934 and 1940 to around 860000.
@@eshed Thats a lot of accordions
@@eshed you sure, they did not encode them, so their counter Accordion producers can not estimate the amount of accordions? :P
@@JtotheAKOB I'm relatively certain.
Out of the 150, I have ~20 serial numbers for which I also know the actual production date. If you plot the numbers vs the dates, you get a lovely almost linear (R^2=0.995) graph. The only way I can think of to get this relationship while preventing accurate estimates, would be to randomly skip numbers with a constant probability.
My immediate thought was that the average of a random subset should be the same as the average of the whole, so the number of tanks should be twice the mean of the picked ones 2*(1+15+16+23+30)/5=34 for the first pick and 2*(3+10+15+18+24)/5=28 for the second. My guess is that they used multiple estimate methods and weighted the results depending on inherent uncertainties/errors of the methods
Yeahs that exactly what I was thinking! Although, I suppose that it might be more susceptible to outliers then the average distance method...
Same argument applies to the median. You can also get 95% confidence intervals for the mean and the median.
Why twice?
@@Mayur7Garg The average is roughly half of the total since you have both low and high numbers. Average tries to arrive at the middle point of the number set when all the numbers are unique and in series.
(1+15+16+23+30)/5=17, which we know is too little since we have the number 30 in the series.
@@yurie2388 Basically it stems from the fact that the median and the mean would be identical for such a series. So if you know the mean, then you can use it like a median to assume that the final number is at twice the distance. But in that case, using the median in the first step directly is more appropriate. Also, one issue that I have with all these solutions including the one in the video is that they do not seem to work if the serial numbers do not start from 1 but from let us say 100.
Cool. When I was studying population biology we were given a task to work out the number of taxis in a city and we used the capture, mark, recapture method, using the taxi number, rather than marking anything. So, just noting the numbers in a given time (capture and 'mark') and then noting the numbers in a given period, which was later (recapture v not seen before). There are all sorts of sample to population complexities and improvements to the estimate with longer observations (but issues with recounts if the obs period is too long). Also, an improvement if a third count period is used.
I wonder if there are any seminal capture, mark recapture examples that Numberphile might comment on and re-create on brown paper?
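For reference, the simplest form of the capture, mark, recapture estimate is the Lincoln-Petersen estimator: if n1 distinct taxis are seen in the first period, n2 in the second, and r of them in both, the population is estimated as n1 × n2 / r. A minimal Python sketch, with made-up taxi numbers:

```python
def lincoln_petersen(first_pass, second_pass):
    """Estimate population size from two sets of observed ID numbers."""
    marked = set(first_pass)          # "captured and marked" in period one
    second = set(second_pass)         # everything seen in period two
    recaptured = len(marked & second) # IDs seen in both periods
    if recaptured == 0:
        raise ValueError("no recaptures, so the estimate is undefined")
    return len(marked) * len(second) / recaptured

# Hypothetical observations of taxi licence numbers.
period_one = [12, 45, 78, 101, 150, 203, 260, 311]
period_two = [45, 90, 101, 133, 260, 402, 415, 470, 488, 499]

print(lincoln_petersen(period_one, period_two))
```

Unlike the tank method, this does not need the numbers to be sequential at all; it only needs each taxi to be recognisable the second time round.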
His reaction to tank 30 immediately raised suspicion and I would have said "yeah, that's 30 tanks in the bag."
Yep "Tank 30, oh, hmm, interesting..."
🤣
300th like
30 got the dinks
I'm on the spectrum and I still cannot see it
We're using similar techniques with serial numbers to investigate production numbers for relatively rare camera models from the early 1970s.
Tanks for sharing
You know destroyers for bases, get ready for
Came here to say this.
I would like to extend my tanks to Ukraine 🇺🇦
Thanks*
@@talananiyiyaya8912 r/woosh
11:57 - the other problem would be the oldest tanks (i.e. built pre-1939) were either destroyed, removed from service or rebuilt into something else (like AA or AT platform) by the end of war
I love the "German Tank Problem." There's a great video on UA-cam showing this method of counting the Commodore 1571 Disk Drives. Using this technique for "other real-world problems" is a fun exercise.
Upvote for Commodore 1571
Wow, not even from 8-bit guy!
What's the video called? 😊
found it: "How Many Commodore 1581 Disk Drives? The German Tank Problem"
I've done nothing but fail math all my life yet I find this video interesting enough to take notes and watch twice.
Spies be like "tank you very much" but the mathematicians be like "tanks but no tanks"
Tanks for the quantities
That joke blows
HAHHA
That joke tracks.
It's the same question we posed as kids: "How do you count a herd of sheeps?" - "You count the legs and divide the number by 4." At the time we found that funny.
Excellent. I did laugh at the #1 and #30 thing. Always like Dr Grimes in these videos, I could listen to him just tell me interesting stuff all day.
It's just the one Grime actually
He was probably talking about him and his brother, together, the Dr. Grimes. His brother is a gynecologist.
(I apologize if this has been asked already!)
Is there any benefit or issue (other than reducing observation size) in setting aside the highest and lowest observations to prevent the concern that you selected the highest and lowest, and how that affects the average gap? If you set aside the highest (30) and lowest (1) in your first go-around, you would just have 15, 16, 23. Gaps are 14, 0, and 6. Average gap = 6 2/3. If you rounded to a whole number, you'd get an estimated amount of 30.
In the second example, you had 3, 10, 15, 18, and 24. Eliminating the highest and lowest, remains 10, 15, and 18. Avg gap = 5. Estimated max of 23 (18+5).
Just having 3 used observations (4 gaps) creates for instability due to low sample, but seems to alleviate concerns that you don't know if you observed the max without knowing.
James: There are 30 German tanks in the bag.
Chuikov: We were aware of that.
50th like + first reply
Krebs: That seems unlikely.
(Downfall movie reference if you don't get it.)
Someone read Cornelius Ryan's The Last Battle and his interview with Chuikov.
I just noticed someone mentioned Downfall too. The book is the source material.
James is so good! Always a great video when he's in. Also: he always looks happy, even when picking bad samples.
The disgust in Dr. Grime's voice at 2:24 when he says "I'm NOT going to let you feel the weight of the bag! [Are you daft?]"
And rightfully so. Who gets to heft a German tank factory during a war?
the time code you put is after the moment you're talking about
2:16
@@robinsparrow1618this is a common phenomenon I've seen over the years. Someone will watch the video, after watching a funny part, they click pause, then copy the timestamp, forgetting that this time stamp is after the clip
@@PeterNjeimMaybe OP edits it in. Let‘s hope for the best :)
I think there may be a simpler and more accurate way to do the estimation. My first thought gave estimates of 34 and 28 for the two trials, beating Brady's estimates of 35 and 27.8 both times compared to the actual number 30. Assuming "everything is equal and random" (i.e., a uniform distribution), just take the average of the tank numbers and double it. This also balances all the potential gaps.
That is certainly a valid method as well, also unbiased (meaning on average you will be spot on). However, the "maximum plus average gap" method is more efficient, i.e., it has a lower mean squared error: the squared difference to the actual N will on average be smaller than using your method. And that is what you want from an estimator!
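A quick way to see the efficiency claim is to simulate it. The Python sketch below assumes 100 tanks and 5 observations drawn without replacement (both numbers are arbitrary choices for the experiment) and compares the average estimate and the mean squared error of the two methods.

```python
import random

def simulate(n_tanks=100, k=5, trials=20000, seed=0):
    """Return {method: (average estimate, mean squared error)}."""
    rng = random.Random(seed)
    total = {"max_gap": 0.0, "twice_mean": 0.0}
    sq_err = {"max_gap": 0.0, "twice_mean": 0.0}
    for _ in range(trials):
        # Draw k distinct serial numbers from 1..n_tanks.
        sample = rng.sample(range(1, n_tanks + 1), k)
        m = max(sample)
        estimates = {
            "max_gap": m + (m - k) / k,             # maximum plus average gap
            "twice_mean": 2 * sum(sample) / k - 1,  # double the mean, minus one
        }
        for name, est in estimates.items():
            total[name] += est
            sq_err[name] += (est - n_tanks) ** 2
    return {name: (total[name] / trials, sq_err[name] / trials) for name in total}

results = simulate()
for name, (avg, mse) in results.items():
    print(f"{name}: mean estimate {avg:.2f}, mean squared error {mse:.1f}")
```

Both averages should come out close to 100, with the max-plus-gap method showing a noticeably smaller mean squared error, which is the sense in which it is "more efficient".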
I just tell the german book keeper that I think his records are sloppy, and he shows me all of his work to prove me wrong.
That was in a movie, wasn't it?
This would never work! .... not unless he plays war thunder...
Hey everyone, I just started learning how to use octave and just for kicks I made a program to do this very estimation. (Thanks for sharing Numberphile, this is really neat stuff!) Here's the *script* if anyone wants to fiddle around with it: (I've added a percent error so you can see just how remarkably accurate this estimation is!)
% Simulate the German tank problem: draw serial numbers at random
% (without replacement) and estimate the total from the sample.
actualNumberOfTanks = randi(1000);
disp(["actual number of tanks: ", num2str(actualNumberOfTanks)]);
% Never draw more tanks than actually exist.
numberOfPicks = min(randi(100), actualNumberOfTanks);
disp(["number of picks: ", num2str(numberOfPicks)]);
% randperm(n, k) returns k distinct serial numbers from 1..n.
tankNumberPicks = randperm(actualNumberOfTanks, numberOfPicks);
disp("tanks randomly selected: ");
disp(tankNumberPicks);
% Estimate = maximum + average gap, where average gap = (max - k) / k.
maxSerial = max(tankNumberPicks);
estimatedNumberOfTanks = maxSerial + (maxSerial - numberOfPicks) / numberOfPicks;
disp(["Estimated number of tanks: ", num2str(estimatedNumberOfTanks)]);
percentError = round(((estimatedNumberOfTanks - actualNumberOfTanks) / actualNumberOfTanks) * 100);
disp(["percent error: ", num2str(percentError)]);
%;D
4 8 15 16 23 42... he literally started drawing half the LOST numbers i was on the edge of my seat
Using the formula from this video, the Lost tank bag has 48 tanks.
4 8 15 16 23 42
Haha me too I literally just finished watching an episode
I was looking for this comment
@@GunNNife Too bad it's not 108
Wondering about the statement about N=max yielding the highest probability for a given distribution. Wouldn't you have to go for the *expected value* of N?
Assuming a uniform distribution for N>=max, you'd end up with N_expected = 30.6 and 38.6 for the two examples seen. In the limiting case of high N, N_expected simply seems to be N/(pulls-2) larger than max.
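Those two figures can be reproduced numerically. The Python sketch below assumes a flat prior over every N ≥ max, so the posterior weight of N is proportional to 1 / C(N, k), and sums it up to a large cutoff (the cutoff is an arbitrary choice). It then compares the result with the closed form (max − 1)(k − 1)/(k − 2), which holds for k > 2.

```python
from math import comb

def posterior_mean(m: int, k: int, upper: int = 20_000) -> float:
    """Expected N given sample maximum m from k draws, flat prior on N >= m."""
    weight_sum = 0.0
    weighted_n = 0.0
    for n in range(m, upper):
        w = 1 / comb(n, k)  # likelihood of seeing maximum m, up to a constant
        weight_sum += w
        weighted_n += n * w
    return weighted_n / weight_sum

for m in (30, 24):  # the two maxima seen in the video, with k = 5 draws each
    closed_form = (m - 1) * (5 - 1) / (5 - 2)
    print(m, round(posterior_mean(m, 5), 3), round(closed_form, 3))
```

The closed form can be rewritten as (max − 1) + (max − 1)/(k − 2), which is where the "roughly max/(pulls − 2) above the maximum" observation comes from.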
I keep coming back to the Brady's question at 11:41 -- if in my distribution, lower numbers are more likely, would there be an easy correction for that?
You would have to come up with a formula for how much more likely the lower numbers are and calculate some kind of upward bias out of that. I don't think the answer could be considered "easy", no.
For the purposes of war estimations, I suspect you'd find the opposite true, particularly as time goes on - the data would become skewed towards newer serials for everything, as older models were destroyed or made inoperable.
Probably best to hash military serial numbers at manufacture time though, regardless.
When calculating the average gap, you should just count the difference between tanks (e.g., between 15 and 1, count that as 14). That way, what you get is the actual (asymptotically) unbiased estimator of the number of tanks. If we do it your way, we get: MAX + (MAX - k)/k = MAX * (1+1/k) - 1, which when we let k->infinity, MAX->Actual_value and therefore we get the Actual_value - 1. It can be also proven that the estimator is biased for any k, outputting smaller values than the real one.
If we don't add that -k to the formula (that is, we count the gaps as just the difference), the estimator we get is MAX + MAX/k = MAX * (1+1/k), which is the actual un-biased estimator we should use in this case. One may call this the adjusted Maximum Likelihood Estimator (MLE). As you said in the video, the MLE is just the MAX (the number most likely to be right), but it is biased. What we did with this trick was, as you explained, correct it.
A more standardized way to compute this correction would have been to calculate the Expected Value of the MLE we got, to then apply the necessary multiplicative correction. That is, if it is necessary at all (MLE might as well be unbiased itself). This is one of the most used methods for estimating stuff out in the real world (when we are able to get an MLE).
I agree, the definition of the distance looked a bit fishy to me. But as k infinity when k -> infinity. So I think, you might need a more sophisticated reason to make your point.
Have you ever done a video on "Hyper Log Log"? We use it in massive data systems for efficiently estimating the number of unique values. It is very interesting, and freakily accurate.
I've used that in BigQuery. APPROX_COUNT_DISTINCT is great for figuring out new data.
1:51 “Is that a German tank or?” *every tank enthusiast goes oof*
Somewhere deep within my brain I'm pleased with this video because Dr Grime always reminds me of the young folk who went to ww2 saying they were adults when they weren't and this video is about tanks
If I would have had a teacher like this when I was young, I'd likely have become a mathematician! Great video, thank you!
it would also make sense to calculate the average of the samples and multiply it by 2 as the average of consecutive numbers starting at 1 would be about n/2 and the average of the samples would also approach the same value.
Close. The average of the observations is an estimator of the mean of the serial numbers in the bag. You got that much right. But the average serial number is (n+1)/2. So you have to double it and then subtract one.
Was my first idea, too. They did basically the same with extra steps :)
The reason for those extra steps is that you usually have padding around the serial numbers, like just start counting at 1500 because the 15 means something else, and the last two digits are sequential. Which they did touch on, but not a lot.
This is the method I thought of too, but it turns out that the numbers you find other than the max aren't relevant. However many tanks there are, finding 1, 2, 3, 4, and 30 is the same as finding 26, 27, 28, 29, and 30 (as long as the serial numbers start at 1).
@@halbronk7133 "the numbers you find other than the max aren't relevant": This isn't precisely true. They are relevant in the sense that they produce a valid estimate for the maximum. The problem is that it ignores relevant information that we know about the problem (that the numbers are sequential without gaps). And usually when you don't use some piece of information to derive your answer then it is possible to do better.
This episode was great. If you come across more war history examples, please post them. My son loves war history and was fascinated by this. This helps to understand why math is important.
Seal Team 6 is named that to imply that there are at least 5 other Seal Teams. At least this is the common rumor.
This is actually true (at least according to Richard Marcinko's autobiography). Now presently there are well over six SEAL Teams (8?) but when Marcinko created a specialist SEAL unit in 1980 there were only two other ones, and "Seal Team 6" was a deliberate attempt at deceiving the Soviets.
I always find it fascinating how these equations can be derived after rigorous application of a simple general concept, like at the beginning of the video you can feel that the frequency of smaller numbers (hence more smaller gaps) would affect the estimate but the quantifying part takes time to visualize in its precise form
For electronics with network cards, companies are assigned ranges of MAC addresses as they are supposed to be universally unique. The range could allow one to estimate the number of devices they sell.
Life, including the Y-T algorithm, is strange indeed
The operative words here are "supposed to be". And nobody says they have to be assigned sequentially. Each organizationally unique identifier (OUI) can create 16 million unique MAC addresses. And you can have more than one OUI.
Murphy's original utterance: "I swear, if there is a wrong way to do something, we will find it"
Sod: Murphy was an optimist.
Apple serial numbers were sequential until about 5 years ago. They even contained information about which factory produced the item and when.
They still should, trackability is vital information.
I got a solution to this problem in a maths challenge about eight years ago. My approach used conditional probabilities and the expected number of tanks in the bag would be:
N = MAX(k -1)/(k-2)
For five observations (k = 5), this gives N = MAX(4/3). This is higher than the average gap approach, which gives N = MAX(6/5) - 1
I was thinking about taking the average and doubling it. The idea being that the average would be approximately in the middle of the true number, so double the average would be close to the true number.
That's what I did, lol.
Same here. That method gives very similar estimates in these examples.
I think you can improve this estimate by subtracting 1 at the end, since the average of the numbers 1 up to and including N is (N+1)/2 rather than N/2. Denoting the sample average by X, your idea is that X should be approximately equal to (N+1)/2, which would imply that N is approximately equal to 2X-1.
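For anyone who wants to see the numbers side by side, here is a small sketch that applies the 2X-1 idea and the max-plus-average-gap idea to the two samples drawn in the video. The function names are made up for illustration.

```python
def twice_mean_minus_one(sample):
    """Estimate N as 2 * (sample average) - 1."""
    return 2 * sum(sample) / len(sample) - 1


def max_plus_average_gap(sample):
    """Estimate N as m + (m - k)/k, i.e. the maximum plus the average gap."""
    m, k = max(sample), len(sample)
    return m + (m - k) / k


for sample in ([1, 15, 16, 23, 30], [3, 10, 15, 18, 24]):
    print(sample, twice_mean_minus_one(sample), max_plus_average_gap(sample))
```

Both samples came from a bag of 30, so this only shows how the two estimates differ on these particular draws, not which one is better on average.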
I'm actually curious to see how this performs (in general) compared to the method presented in the video.
That wouldn't work if the serial numbers didn't start from 1
@@akshaj7011 That's true, but the average gap wouldn't work either if they take into account the gap from 0 to the first element.
If the starting point would be unknown, i would probably use standard deviation in the same manner.
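A minimal sketch of what a standard-deviation approach could look like, assuming the serials are one unbroken run of N consecutive integers with an unknown start. Such a run has variance (N^2 - 1)/12, so N is roughly sqrt(12 * variance + 1). The function name and the 1500 offset are hypothetical, and the sample variance is only a rough stand-in for the true one.

```python
import math
import statistics


def estimate_from_spread(sample):
    """Estimate how many consecutive serials exist, ignoring where they start.

    Assumption: the serials form a run of N consecutive integers, whose
    variance is (N**2 - 1) / 12, so N is roughly sqrt(12 * variance + 1).
    """
    sample_variance = statistics.variance(sample)  # uses n - 1 in the denominator
    return math.sqrt(12 * sample_variance + 1)


video_sample = [1, 15, 16, 23, 30]
shifted = [x + 1500 for x in video_sample]  # hypothetical padding offset
print(estimate_from_spread(video_sample))
print(estimate_from_spread(shifted))
```

Because the variance does not change when every serial is shifted by the same amount, both calls give the same estimate, which is the point of using the spread instead of the mean.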
Brady's final questions show amazing insight. My favorite anecdote involves SEAL Team Six. There was not a 5, they just used the number to make people think there were more teams. I don't know if this story is true, but I like it and it shows that you have to know the parameters of the numbers instead of assuming a sequence starting with 1.
This feels like one of those widely usable maths that I won't be able to find an application for anytime soon... then when the time comes, I'll remember there's a solution but not what it is 😅 Bookmarking it now for that future occasion, haha
Ukrainians might find it useful
i missed this guy a lot, i remember binge watching his entire channel when i was in high school, brings me back
OK, what about this: As the sample size increases, the average of the sample will approach the average of the population, so let's estimate the average like that. For a uniform distribution starting at zero the maximum is simply two times the average, but in this example the minimum is one, so we'll just subtract one from our average. Using this method I get 32 and 28 tanks, respectively.
Or double the median. It would have been 32 and 30. Not sure which is usually closer, I feel like you need a Bayesian analysis with a prior.
Although these specific estimates have less error than the ones presented by James, on average his method will be better, at least for larger samples. I did some simulations, and for small samples, say three, it's pretty close, but James' method has a lot more bias.
@aksela6912 Funny thing is, no matter the prior you use, the posterior probability of N (the total number of tank) is just the prior truncated starting from M (the maximum of the observed serial numbers). In other words, a Bayesian answer, no matter the prior, should only depend on M (not even on the number of samples).
@@cryme5 For a uniform distribution the variance of the sample median will be greater than the variance of the sample mean, and as mean and median should be the same it will be better to use the one with less variance. I have to reiterate though, sample mean times two is a poor estimator, even if it feels more intuitive, and it feels like you're utilising the collected data better.
@@aksela6912 James's method is unbiased. If you observe n tanks and the maximum value you observe is m, then the minimum variance unbiased estimator is m + m/n - 1. Your estimator of twice the sample mean minus one is also unbiased, but its variance is higher. And it doesn't use the important information of the sample maximum, which means the estimate might actually give a value we _know_ is too small.
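The bias and variance claims can be checked by brute force for a small case. The sketch below assumes serials 1..N and every set of k distinct tanks being equally likely, then averages each estimator over all such sets. The function names are illustrative, and floating point is used, so the means are exact only up to rounding.

```python
from itertools import combinations


def max_plus_gap(sample):
    """m + m/k - 1, where m is the largest serial seen and k the sample size."""
    m, k = max(sample), len(sample)
    return m + m / k - 1


def twice_mean_minus_one(sample):
    """Twice the sample average, minus one."""
    return 2 * sum(sample) / len(sample) - 1


def mean_and_variance(n_total, k, estimator):
    """Average an estimator over every equally likely k-subset of 1..n_total."""
    values = [estimator(s) for s in combinations(range(1, n_total + 1), k)]
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, variance


# 30 tanks and 5 pulls, as in the video: 142,506 possible samples.
for estimator in (max_plus_gap, twice_mean_minus_one):
    mean, variance = mean_and_variance(30, 5, estimator)
    print(estimator.__name__, round(mean, 6), round(variance, 3))
```

If both means come out at the true total while the variances differ, that matches the statement that both estimators are unbiased but the one built on the maximum is tighter.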
IIUC, this is also a problem where frequentist and bayesian techniques arrive at different answers. I'd love to see an explanation of that.
Tanks for sharing!!!
Every part of this video - from finding out the numbers to objections raised - was brilliant. I love this video.
Would make a nice graph plotting your best guess of total tanks, pulling one tank at a time. Any time you get a new biggest number the plot would jump up, and when you get smaller numbers it will slowly descend as your average gap gets smaller. It would jerk up and down, approaching the actual total number.
agreed that would be nice to look at, though you'd want a larger set than 30. should start out as a line jumping up and down but rapidly smoothing out. After it calmed down a bit you could probably do a bit of "eyeball extrapolation" to get a more accurate estimate than the last prediction.
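Something like that plot can be sketched in a few lines. This assumes serials 1..N pulled one at a time without putting any back, and uses the max + max/k - 1 estimate after each pull. The function name, the seed, and the text-only "plot" are illustrative choices.

```python
import random


def running_estimates(n_total, seed=0):
    """Pull every serial 1..n_total in random order; re-estimate after each pull."""
    rng = random.Random(seed)
    serials = list(range(1, n_total + 1))
    rng.shuffle(serials)
    estimates = []
    largest = 0
    for k, serial in enumerate(serials, start=1):
        largest = max(largest, serial)
        estimates.append(largest + largest / k - 1)
    return estimates


for pull, estimate in enumerate(running_estimates(30), start=1):
    # Crude text plot: one '#' per estimated tank, capped to keep lines short.
    print(f"{pull:2d} {estimate:6.1f} " + "#" * min(int(estimate), 70))
```

Changing `n_total` to something larger gives the longer, smoother curve suggested above, and the last estimate always equals the true total once every tank has been pulled.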
I appreciate the 500% more description. A lot of people muddle that up and would say 600% more. They skip past the fact that 6 times as many is 600% OF or 500% MORE.
I think my estimation was a little different technique. If we take the arithmetic mean of all the tanks we come up with a number that is half the total. So by taking the mean of the numbers on tracks pulled out of the bag, we can double it.
First Enigma, now these tanks. Sometimes it feels as though James is gearing up for a time travel mission.
Obviously not...
@@talananiyiyaya8912 nice try, MI6
He is winding down from one. He went there, helped Britain win, and came back.
@@sandekv He's slowly revealing that to us.
That was brilliant! your initial picks are exactly why it was so hard for me to grasp probability at school until I realized it is about multiple events and doesn’t work that great for a single event
At first, I was perplexed about the method of estimating monthly production with just serial numbers, but I am glad they explained they had a way to decode the month and factory of the tank as well. I assumed some of these numbers must have been intentionally hidden or misleading.
no, they were just contracted to different manufacturers and sub-models (Ausführung) and were assigned specific number ranges
the gearboxes, or rather specific the engines with the geartrain attached were often shared between different models, like the Panzer V Panther and Panzer VI Tiger shared the same engine platform, and only was different in minor details and power
in the later stages of the war it was not uncommon to use what was in stock or repair tanks with parts from different models
Information and mathematics once again showing their overwhelming and seemingly timeless relevance. 🙂
I really like Brady's talent for asking "good questions"
I have another method of solving this: average the numbers of the 5 random tanks we picked, and that average will be close to the average of all the tank numbers. So avg of 5 = 17 = n(n+1)/(2n) (the average of all numbers on the tanks, where n = total number of tanks), and we get n = 33 total tanks.
6:24 scary camera pan
Awesome concept of a video, I love how you explain each part of the puzzle slowly, and the graphs helped a lot. I'm subscribing right now
If you take the simpler formula of twice the average value of the tanks, it actually gives better prediction in this case (34 and 28, if I can still perform additions)
This video is brilliant! I knew about the story and always thought there was some really complicated math behind the scientists' work. Nicely explained, thanks! :)
12:17 "But we broke that code, okay? That's another story." Well, now we've got to hear it! Enigma, or something else?
The Enigma machines were for coded communications, I think. I think he meant that the serial numbers were coded, which isn't uncommon, since different companies and factories have different ways of doing things
We were not given a key piece of info at the beginning - what year did this occur in? German AFV (including tanks, but also tank destroyer, etc) production varied widely across WW2. For example, 3600 in 1941 and 19000 in 1944.
13:30
@@techheck3358 But it was not given at the beginning. When we were asked to guess which group was more accurate.
Question: is the formula getting more precise for more observation or for a bigger Numbers of tanks ?
Agreed. If you observe 1000 tanks, is it better (on average of course) to treat that as one big observation or to split it into 10 smaller observations ? I feel like that would be useful info to have.
Yes, probably. The probability of encountering uncommonly extreme gaps becomes smaller and smaller the more samples you take. When you make as many observations as there are tanks, all leeway in gaps has vanished and your accuracy has reached 100%.
It should be. Law of large numbers, the statistical noise averages out.
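To put numbers on that, here is a sketch using the standard variance formula for the max + max/k - 1 estimate when k of N serials are seen without repeats: (N - k)(N + 1) / (k (k + 2)). The function name is illustrative.

```python
from fractions import Fraction


def estimator_variance(n_total, k):
    """Variance of m + m/k - 1 when k of n_total serials are seen (no repeats)."""
    return Fraction((n_total - k) * (n_total + 1), k * (k + 2))


# 30 tanks, as in the video, with a growing number of observations.
for k in (1, 2, 5, 10, 20, 30):
    variance = estimator_variance(30, k)
    print(k, float(variance), float(variance) ** 0.5)
```

The spread shrinks as k grows and reaches zero at k = N, which is the "accuracy has reached 100%" case. For a fixed k, a bigger N makes the absolute spread larger, so it is more observations, not more tanks, that sharpens the estimate.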
Why not just take the average of the numbers on the tanks? I mean (sum of the numbers divided by the number of tanks, times two): ((1+15+16+23+30)/5)*2 = 34, which is closer to 30 than his calculated 35; the second try is ((3+10+15+18+24)/5)*2 = 28, which is again a bit closer to the actual 30 than his 27.8. Am I just being lucky, or is it a better way to do it?
UA-cam recommended me a short video by Hannah Fry about this very thing just this morning: I don't recall how old the video was but life is strange!
Thank you Brady for making these videos, every time I watch it motivates me to do my job better as an engineer/computer scientist
Love ur passion, professor
When i was doing stats at university the lecturer had us fill in a questionnaire on day one to give us some nice data to do analysis on (birthdays and such). It was all nice data except that there wasn't a single left hander in the class. Not one. There ought to have been about ten but there was zero. Credit to the lecturer, he rolled with it. His attitude was, "these things happen - we don't fudge our data". It was actually a great class.
Would be cool to see the mathematical derivation of calculating the expected value of the tanks using an infinite sum of the probability at the beginning
It's not an infinite sum when N is finite. It has N elements
@@Last_Resort991 to properly calculate the expected value, it would be an infinite sum from the max number seen to infinity
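One way the infinite-sum derivation could go, offered as a sketch and not as what the video does. It assumes a flat prior on N, k serials drawn without replacement with largest observed serial m, and k > 2 so that the sums converge.

```latex
% Assumptions (not from the video): flat prior on N, k draws without
% replacement, largest observed serial m, and k > 2.
\begin{align}
  P(N \mid m, k) &\propto \binom{N}{k}^{-1}, \qquad N \ge m \\
  \sum_{N=m}^{\infty} \binom{N}{k}^{-1}
    &= \frac{k}{k-1}\,\binom{m-1}{k-1}^{-1} \\
  \frac{N}{\binom{N}{k}} &= \frac{k}{\binom{N-1}{k-1}} \\
  E[N \mid m, k]
    &= \frac{\sum_{N=m}^{\infty} N \binom{N}{k}^{-1}}
            {\sum_{N=m}^{\infty} \binom{N}{k}^{-1}}
     = \frac{(m-1)(k-1)}{k-2}
\end{align}
```

With m = 30 and k = 5 this gives 29 * 4/3, a bit under 39, noticeably above the max itself.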
When I saw 23, I noticed that it was one of The Numbers. Then I saw 15, which also was one. Then, 16 followed, and it's also one of them. So, it started with half of The Numbers, although in a scrambled order
I literally saw the Hannah Fry video about this yesterday and kind of assumed that this would be a Hannah Fry numberphile video.
Before watching the video i had another method which i think works pretty well too.
Because you're pulling random samples, the samples can be assumed to be distributed somewhat randomly along the whole range. So if you take the average of the numbers you have found, that can be assumed to be approximately the middle of the range. Multiply that by 2 and you should get close to the max of the range.
Nothing better than James talking WW2
This raises 2 questions: 1) how are you able to leave an unresolved Rubik's cube loose on your shelf; 2) why are your tanks pointing their turrets backwards?
Actually... The cube has a different purpose. If you watch carefully. the cube has changed configuration 4 times through the video. Most of the time by 1 or 2 moves.
I think it's encoded messages to pass military information (about British tanks, obviously) to the enemies without looking suspicious.
When tanks are being transported, whether by rail or on ships, they are almost always oriented with the turrets reversed to make more of them fit in a certain space. Not having the cannon barrel pointing forward usually means more of them can be crammed into a space.
“I will do one”
Lo and behold, one he proceeded to do
Another great and entertaining video. Thank you. I would love to read the paper about why the spies were so wrong.
Me too. I thought maybe they were being fed false information?
The way we communicate with others and with ourselves ultimately determines the quality of our lives.
I think the “failed” demo was perfect since you had to explain not only how it works, but also where the formula fails.
Reminded me of school. The teacher would teach the easiest way to understand something, but then on a test it would be the hardest example/use of that formula. School failed, numberphile succeeded.
This guy is the absolute best and has been for years!
10:24 speedrun
I saw this comment at 10:23, dang..
I came up with another way to estimate the number of tanks. Not sure which method is superior. First, I determined the general formula for the average from a set of numbers counting sequentially from 1 to N. I rearranged it to solve for N. Then you take your sample, determine the average and estimate N. Using the same samples as in the video I got N=33 and N=27. The average answer was 30!!
I derived this as follows (N(x) is the average of the numbers 1 to x):
N(1) = 1
N(2) = 1.5
N(3) = 2
N(4) = 2.5
N(x) = x - ((x - 1) * 0.5)
     = x - (0.5x - 0.5)
     = 0.5x + 0.5
Avg = 0.5x + 0.5
x = 2(Avg - 0.5)
  = 2Avg - 1
Sample: 23, 15, 16, 1, 30
Avg = 17
x = 33
Sample: 3, 10, 15, 18, 24
Avg = 14
x = 27