A Mathematician's Guide to the World Cup
- Published 16 Nov 2022
- In 2010 Paul the Octopus 'correctly' predicted results at that year's World Cup. However, these days the experts are the analysts who trawl through the reams of data about players and teams. And where there is data there is mathematics. And, particularly, mathematical models.
Joshua Bull is a mathematical modeller. He was also the winner of the 2020 Fantasy Football competition from over eight million entrants. So when it came to the Oxford Mathematics 2022 World Cup predictor, Josh fitted the bill perfectly. Homing in on the data, applying his modelling skills, and adding a pinch of the assumptions that inform modelling (disclaimer: he is an Ipswich Town fan), Josh has come up with the answers - or rather, likely outcomes. See what you think.
PS: some people have commented that Josh has got the last 16 wrong because those combinations cannot come out of the groups. However, he explicitly says in the video that this is an overall prediction, not a specific one. For a specific one and more forecasts and thoughts, please go to our social media pages (see www.maths.ox.ac.uk/ for links).
I think the most efficient way to test this is using the exact model for last world cups and see if it works
I think they have done it.
@@manaharchowdhury2402 and what were the results from those past cups?
According to them the focus is data since 2018 of all international matches as well as xG
Yeah that’s a backtest of the model.
Thing is though - just because an outcome is most likely doesn’t necessarily mean that it will eventuate. So you really need to backtest it over a long enough period of time for Central Limit Theorem to apply on the normally distributed noise he spoke of.
You should also avoid overfitting
Great video, my highly scientific analysis is similar- it's called "The Gabriel Martinelli factor". It works like this- you analyze the team and figure out whether it has Gabriel Martinelli on the squad roster. If it does, it means you're going to win the World Cup.
Arsenal fans been down bad so long this is how you fanboy your good players 😂
Arsenal winning the World Cup 👏 replace Martinelli with Depay
Martinelli did nothing in the first game haha, sorry
@@advancewarstournamentseries he had less than 10 minutes on the pitch and still made some good plays... what are you talking about?
Stupidity model
There is a HUGE mistake
Each group should provide 2 teams to the round of 16
In your model Argentina group only one team
Brazil group 3 teams
This means you have to redo the round of 16 and beyond
But really
Great work
What do you mean? Argentina and Mexico + Brazil and Switzerland.
*Edit* - I think there is an updated table floating on linkedin where there is belgium and brazil in finals
@@norf8 Yes, Josh makes the same point towards the end of the video - this is just a general guide. You can find his final model here: twitter.com/UniofOxford/status/1593564445715881984
@@OxfordMathematics Man where is Poland or Mexico here??? you missed that?
@@8YvY Mexico is there on the right.. poland doesn't advance from group stage
Yes, and there is another issue: some teams from the same group face each other as early as the quarter-finals when they should be on opposite sides of the bracket, for instance France and Denmark
I really appreciate the way you built assumptions and improved them when something unlikely showed up.
The model i believe can be more deterministic
Deterministic is not subject to gradation. It either is or isn’t. 🤓
Cheers,
S. Cooper
The next step is finding teams with best betting payoff to prediction and layer several bets to maximize expected payout. Cool stuff. Thanks for posting.
Mathematician: Look at all this data and predictions we can draw from it, isn't it pretty.
The internet: Cool lets use it to make bets
@@bengoacher4455 might as well make some money from it. Otherwise, what's the point 🤷♂️
That's why I love YouTube. Just random videos like this make my day. Great work by the way, fantastic analysis
This is a really insightful and entertaining video -- a rare combination when it comes to explaining probabilistic models!
First of all xG-(xG allowed) would have been more useful than straight xG in capturing defensive capabilities of teams. But frankly, international team xG is just not a large or uniform enough (given significant split between friendlies and competitive matches) data set for this to work as discussed. The overvaluing of Belgium and undervaluing of England are two examples of this
using current player market value would almost definitely be more predictive than weighting straight xG
That said, it’s a fun video and teaches the iterative process of this kind of modeling very well, so A+ for maths communication
but if anyone is here looking for a betting edge, lol I guarantee the odds makers have thought about this more and modeled it more thoroughly, so maybe don’t
Agree with your diagnosis but don't agree with player market value being a more valuable measure. Firstly, English players are priced higher on average at equal skill level due to the EPL homegrown rule amongst other things. Secondly, value says nothing about how a team will perform together on a pitch, hence why a midfield of Scholes Beckham Gerrard Lampard never won anything. Thirdly, value says nothing about a fella wearing a waistcoat putting all the best attacking players on the bench in favour of 12 defenders. 😄
@@jpa_fasty3997 rather than market value I would suggest something like the median wage of a 16 player rotation or so 🤔
@@smooth_operator65 Doesn’t work either.. Because English players are also severely overpaid for the same reasons as JPA mention.
@@frederikbrandt424 yes sure, then again this would be easier to controll for than in case of market value
Use them all! xG, player value, scores, away/home, weather etc. And throw it into a nice regression model (maybe xgboost) and validate out of sample and out of time.
N.b. It still won’t beat the bookies.
Thank you for putting this together and for sharing it, great intuitive explanations and graphs :)
Mathematical modelling in a clear and fun way! Very good
14:45 that chart aged well
Joshua Bull knows a lot about mathematical research but has zero knowledge of football. Greetings from ARGENTINA
As a long-suffering Wolverhampton Wanderers fan, the first few slides of your excellent presentation got me thinking about our recent 4-0 capitulation at home to Leicester City. According to the statistics I found online, the xG for that game was Wolves 1.62 - 0.99 Leicester, despite the real scoreline!
So judging by the distribution model given @ 5:36 there appears to have been a 1-2% chance of Leicester scoring four goals that day, and indeed around 0.7% [35% x 2%] chance of the game ending 4-0. When your luck's out...!
Great video and great job Joshua putting all this together!
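The figures in the Wolves comment above can be checked against a plain Poisson model. This is only a minimal sketch, assuming each team's goals are independent and Poisson distributed with a mean equal to that match's xG; the chart at 5:36 in the video may be built differently, so the numbers need not match it. Under this assumption Leicester scoring four comes out at roughly 1.5%, in line with the 1-2% estimate, while Wolves scoring none comes out at roughly 20% rather than 35%.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k goals when the expected number of goals is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


# xG values quoted in the comment (Wolves 1.62 - 0.99 Leicester)
XG_WOLVES = 1.62
XG_LEICESTER = 0.99

p_leicester_four = poisson_pmf(4, XG_LEICESTER)
p_wolves_zero = poisson_pmf(0, XG_WOLVES)
# Assumption: the two goal counts are independent, so probabilities multiply
p_four_nil = p_leicester_four * p_wolves_zero

print(f"P(Leicester score 4) = {p_leicester_four:.2%}")
print(f"P(Wolves score 0)    = {p_wolves_zero:.2%}")
print(f"P(0-4 scoreline)     = {p_four_nil:.2%}")
```

Either way the conclusion of the comment stands: a 0-4 home defeat with those xG values is a rare event.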
I really enjoyed your video despite being highly sceptical from the start. To get an accurate prediction there are several other factors that must be included and could realistically be modelled:
1) The individual player ratings and their performance as a group (e.g. Liverpool had a terrible defence up until the signing of V. van Dijk. The Liverpool defence was improved again by Alisson, this made them into a league-winning team). Individual ratings can be crudely taken from transfermarkt via their market value, which is a reflection of both the team they play for and the league in which they play as well as their positions. The individual ratings could then be further modified by performance stats - goals, assists, key passes, chances created, tackles per 90, interceptions, clean sheets, etc.
The key point is one player being added or removed can radically change the performance of a collective. Take the star player and give them a "star rating" (weighting) in each area of the pitch (GK, Def, Def Mid, Att Mid, Creator/Winger and Striker/goal scorer). Obvious examples in this area include Kane for England, Kane having scored proportionally more goals for England than any other player for any other major nation. Take away Kane and England's xG and win % is going to fall off a cliff, similar to how England somewhat collapsed when Rooney got injured in previous major tournaments. If Thiago Silva is taken out of Brazil, VvD taken out of Netherlands, Mbappe taken out of France, Messi from Argentina, KDB from Belgium, etc, these are going to create disproportionately large movements in their respective tournament chances. (see form below)
2) The continuity and form of the players and the collective - e.g. Liverpool after van Dijk got injured and after his return became defensively brittle and conceded the first goal in 9 out of 11 premier league matches. If the star players are out of form or have been recently injured, these factors will weigh heavily on team performance. We have numerous examples from history but again Rooney and Ronaldo (Brazil) have had disproportionate influences on their team's chances of success. This may be particularly useful when measuring defensive prowess as defenders require continuity and a run of games in order to develop an understanding and water-tight defence. Changing either the GK or the key centre back can weaken defensive continuity/form significantly.
3) The leadership and mentality of the players. This could be measured by how often they contribute to a team gaining points from a losing or drawing position and how often they lose points from a winning/drawing position. In this respect Liverpool, Man City and the Man Utd teams under Fergie could be used as benchmarks for the positive (coming from behind, scoring late goals). The Arsenal, Barcelona, Netherlands, Brazil and Argentina teams of various eras could be used as negative examples of a collective of excellent individual players who lack leadership, leading to them dropping points from winning positions, having a number of aberrant results (losing against weak teams) and performing particularly badly in matches against the strongest opposition and main rivals (e.g. Arsenal getting bullied by Chelsea or Barcelona losing to Real Madrid and rivals in the Champions League, Netherlands, Brazil and Argentina collapsing in World Cups)
4) The ability of the manager/coaches. This could be measured in a number of ways including player development under their leadership, the consistency of their form, performance against weaker and stronger teams, points gained/dropped from losing/drawing/winning positions late in games, performance against rivals and performance in important matches. All of these criteria and more have been legislated for (often poorly) in games such as Championship/Football Manager. There are also ratings for managers from transfermarkt and fourfourtwo amongst others.
5) A measure of the UEFA and FIFA coefficients is a must - it's virtually certain the winner of the World Cup will come from the top 11 currently ranked teams (Germany are historically low at 11th). In the last 70 years there have only been 7 WC winners and one or more of those 7 teams have made the final in every WC except for the notable exception of the Netherlands (never winners) and of course Croatia who unexpectedly made the final in 2018. If we are using the past to predict the future then there are only 8 possible WC winners - Brazil, Argentina, Germany, Spain, England, Italy, Netherlands and France. That immediately reduces your odds of a random choice winner from 1/32 to 1/8. The UEFA coefficients could be incorporated by assigning a player in each national team the coefficient of their team and league. This would apply even to Argentina and Brazil as the majority/ALL of their major players play in European competitions or did so in the past
6) The odds given by the major bookmakers in each country - self explanatory but the bookies rarely get it wrong
7) There are numerous other fine-tuning factors such as the performance of star players vs star players of other teams - e.g. does Messi have VvD on toast, does Fernando Torres perform well against Nemanja Vidic. If a star player has long-shooting prowess is this negated by a star GK? E.g. is KDB more of a danger when shooting against Pickford (a smaller GK) or against Neuer (a much larger GK)? Which systems temporally perform better against other systems? Does 4-4-2 perform better against 3-5-2, 4-5-1, 3-4-3, etc?
8) New rules for this WC in particular the added time. There's never been so much added time in football matches in history. A game used to average around 94 minutes but in this WC they are averaging 106 minutes. This will highly favour teams with greater fitness as the majority of goals are scored late in games when players are tired and make mistakes. Fitness and the strength of the bench will be more important at this WC than any football tournament in history.
9) The weather/climate is a major factor with certain teams better adapted to playing in hot/arid/humid/cold/wet conditions. The WC winner and finalist statistics heavily favour nations with similar climatic conditions to the host WC nation - i.e. when the WC is played in South America/North America or Asia, then South American teams win. When the WC is played in Europe then European teams win. There have only been two exceptions since 1930 when Germany won the WC in Brazil (2014) and when Brazil won the WC in Sweden (1958), although Brazil did have by far the best team including Pele, Garrincha and several other exceptional players in 1958
Based on the above I would pick Spain based on what I've seen so far, although Brazil are favourites for a reason.
Great job. It was very fun to watch.
Loving this ! I think what this mostly proves is that England supporters tend to overestimate the chances of their team ! Be honest, they got the lucky side of the table in both last tournaments, that's why I think the model before you made the tweaks is actually closer to reality than the later ones. Same with my country Belgium btw and that proved to be correct as they went out in the group stage :D
currently getting my butt kicked by an intro to probability class and you mentioning Poisson/normally distributed and seeing how mean/variance showed up in graphs was actually really refreshing and gave me some hope that it isn't just theory
Frrrr really cool to see it in practice even if this is just a quick simulation
Well it's still just theory since you're only trying to predict the future, there's no guarantee it will turn out like that, but yeah, the results will most likely be along those lines, so it has practical use in the end (though I would argue anyone that follows soccer a little bit could predict the results just as well, if not better)
Everyone and everything may now be working against this model, including VAR decisions and match officials. Injuries, red cards, weather and so many other unforeseen factors may play a big part too in who will eventually win the World Cup. I wonder if biases or noise were introduced to account for all these, but we are keen to see the model performance. Great effort 👌 Joshua and team University of Oxford.
Those noisy elements are assumed to follow a normal distribution, I.e. over a long enough sample of matches they average out to zero.
That’s how the simulations account for it.
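The "averages out to zero" point in the reply above can be illustrated with a few lines of simulation. This is a sketch under the stated assumption that the noise is normal with mean zero; the standard deviation of 0.5 and the seed are arbitrary illustrative values, not taken from the video's model.

```python
import random
import statistics


def mean_noise(n_matches: int, noise_sd: float = 0.5, seed: int = 1) -> float:
    """Average of n_matches independent normal noise terms with mean zero."""
    rng = random.Random(seed)
    return statistics.fmean(rng.gauss(0.0, noise_sd) for _ in range(n_matches))


# The average shrinks towards zero roughly like noise_sd / sqrt(n_matches)
for n in (10, 1_000, 100_000):
    print(n, round(mean_noise(n), 4))
```

For a single match the noise can still be large, which is why an individual upset says little about whether the model is wrong.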
@@Grimeyhoob Yeah kind of agree with you. Thanks 😊.
Spot on!
@@randellberry6846 Hahahhaha
bruv the offside calls in qatar v. ecuador today would have been called at maybe half the frequency in a normal match, in my opinion.
I appreciate your effort in adjusting parameters but this simulation is still biased as it didn't consider many other parameters such as the team combinations for the knock out stage up to the final, the dates of those matches, and many other parameters. From my prediction Portugal should get to the final and face one of Brazil, Netherlands or Argentina. My last pick for the final would still be Portugal given the fact they would have one more day to rest and considering they would have faced weaker opponents up to the final compared to those other 3 teams I mentioned above.
Awesome! Is there a chance to get hold of the model? I'm quite curious about its technical parts. Please let me know.
This aged poorly
Fascinating, thanks for sharing.
Im sure it would be hopelessly complicated, but for this world cup the form of individual players in the last few months is going to be very important so it would be cool to factor that in
Is it possible to look at a previous set of games as a fixed output and then put in bunch of data sets (possession, tackles in opponents 3rd, etc) and have the model vary their weighting to get the fixed outcome? That could be interesting to see how the model would define the significance of each stat and could provide a relatively accurate model?
Is it possible to incorporate a Bayesian Bradley-Terry model ?
Good start! We also now need to introduce a new term to factor the effects of VAR.
Nice video. How about comparing how well this model agrees with results from previous tournaments
Amazing video!!!! Congrats
The residuals are nice in the xG and Rating difference. Not a lot of outliers. Any chance the p-values that were calculated could be shared?
You'll be re-iterating the model at the group stage ? Squad composition /injuries may affect your model too.
Where do you get this type of data? Is it straight xG data, or did you find data sets of shots and locations and whether or not they went in?
Hi Josh, great video! Simple yet coherent model for sure. As a Brazilian, I'll take those odds!
Did you go so far as to test it against the previous world cups? I wonder how the actual results fare against the expected outcomes! Cheers, mate!
Josh's focus is data since 2018 of all international matches as well as xG.
@@OxfordMathematics May be interesting to see if there’s any mean reverting behaviours over a longer time horizon with teams. E.g. how we see Brazil and Germany eventually come back on top after any fallow period. Some kind of behavioural element and pedigree.
I was coming here to ask that
@@Grimeyhoob Yeah, that could be really interesting, the cyclical behaviour on performance of each team
Saudi Arabia: hold my kebab
if you used your model on past tournaments, how often were you correct?
Almost 50% prediction has become wrong of R16 😂😂😂.... Nothing to say for the rest
I love that despite how complex this model is, when you look at your final model, intuition gets you the exact same outcome.
no
argentina suck and england are the best
Shows that our brain is more complex than any model and 'intuition' is one of the greatest calculations we make without noticing
@@lucasng4712 I mean if you know football obviously. And by final model I mean the one they put on twitter.
@@JohnM-ch4to no
Great job. Well done!
What I’d like to see when the World Cup ends is how likely the actual iteration of the tournament was to happen, if that makes sense. Maybe not so precise as to the exact number of goals scored in every match but in terms of which team won against which other team. I’m interested because there have been pretty unexpected results this World Cup!
I want to know if you are planning to open source the prediction model? Thanks
was r the primary thing you used to make the graphs from the data?
Great analysis, and it at least identified one of the two final contestants. It was surprising to see Brazil losing so early, but what can you do when penalty kicks define the outcome?
Not be a sore loser?
Are you going to update the forecast after group matches?
Great explanation and analysis. I found an error in the final tournament outcome prediction, there are 3 teams from group G and only one team from group C advancing to the play-offs. Wonder if this would change a lot in the ladder
It’s because it’s not a prediction of matches, it’s a prediction of likelihood of getting to certain stage of the tournament. His model predicted 3 countries of a group having more chances of going through to the round of 16 than the second of the other group. He definitely should add that restriction. That results only tells us group G is more competitive than group C.
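The effect described in the reply above (three teams from one group among the most likely last-16 sides) is easy to reproduce. The following is a toy sketch, not the video's model: the team labels, strengths and noise level are all made up for illustration, and "finishing in the top two on a noisy performance score" stands in for a real group stage.

```python
import random
from collections import Counter

# Hypothetical strengths (made up): group G has three strong teams,
# group C has one strong team and three evenly matched weaker ones.
GROUPS = {
    "G": {"G1": 2.0, "G2": 1.7, "G3": 1.6, "G4": 0.0},
    "C": {"C1": 2.0, "C2": 0.8, "C3": 0.8, "C4": 0.8},
}


def simulate_advancement(n_sims: int, noise_sd: float = 0.5, seed: int = 0) -> dict:
    """Estimate each team's probability of finishing in its group's top two."""
    rng = random.Random(seed)
    advanced = Counter()
    for _ in range(n_sims):
        for teams in GROUPS.values():
            # Noisy performance score for each team in this simulated group
            performance = {t: s + rng.gauss(0.0, noise_sd) for t, s in teams.items()}
            top_two = sorted(performance, key=performance.get, reverse=True)[:2]
            advanced.update(top_two)
    return {t: advanced[t] / n_sims for teams in GROUPS.values() for t in teams}


probs = simulate_advancement(20_000)
# Ranking by marginal probability ignores the "two per group" constraint
most_likely_four = sorted(probs, key=probs.get, reverse=True)[:4]
print(probs)
print(most_likely_four)
```

Every single simulated tournament sends exactly two teams through from each group, yet ranking teams by their individual probabilities picks three from the stronger group. Enforcing the constraint means reporting one consistent bracket rather than per-team probabilities.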
Josh does acknowledge this. A more precise prediction is here: twitter.com/OxUniMaths/status/1593933134256553989
loved this! thank you
I am guessing there is a similar method to the one the Oakland Athletics used in the movie Moneyball, which supposedly is based on real life. Anyhow, I think football results (or any other sports) can be predicted in the short term with a fairly high percentage, but for a long tournament like the World Cup it is almost impossible to predict a winner, especially when you have knockout games, which have a lot of different factors that influence the score, e.g. luck, a bad refereeing decision, a red card, fatigue, injury, players' emotions, etc. I don't think any of these can be predicted. Anyway, thank you for this, I learned something new and I appreciate it.
Yeah what they did was basically heavily use the xG equivalent stat to build a team. Their idea was “Why spend $20M on 1 guy who has 1.5xG when you can get 3 players who have 1.5xG combined and cost $15M total and cover more positions” In baseball that stat is Slugging rate for batters and ERA for pitchers
(20:13) You have three teams coming out of Group G (Brazil, Switzerland, and Serbia) and only one coming out of Group C (Argentina).
I'm 3 minutes into the video and can tell already how irrelevant and embarrassing the results were going to be. Thank you for showing me there's no point watching more.
Maybe stick with it and let it unfold. It is about how models need changing but by how much is the challenge. take care
Don't think FIFA are going to have 3 teams qualify from one group and only 1 out of another though...unless England finish 3rd 😂
@@OxfordMathematics I get what you mean. Everything else in the knockout stage looks similar to what I predicted on my office pool.
I was just pointing out that something went wrong in the group stage.
@@EB-zn4hs Yes. thanks and understood. We just don't want people reading the comments to think Josh is making a mistake. He explicitly says it is a general prediction later in the video. Enjoy the games.
I'm brazilian and I liked your simulation very much :)
I'm cheering for your simulation!
hehehe
Hello, from a footballing point of view, it is interesting to see england so low down, despite having not terrible results in the 2018-2022 period. I feel this may be as they are a more defensive team, so won't score a high xG and "xG against" should be considered in the model - perhaps by plotting "xG for" team A vs "xG against" team B over the dataset and finding a correcting factor, similar to 14:13
Fun video and good insight into how to develop models from the ground up!
They have recently had some pretty terrible results, notably in the nations league where they got relegated.
How does xG realistically take into account goalkeeper positioning and quality for each shot? Does it consider that? Or how other players block the goalkeepers vision, distract him etc.
how did you find the data on international mean xG over a period of time. I can only find tournament based xG
Please update this throughout the tournament maybe once or twice maybe after group stage
i can now say that your major is not math, its art
This was such an excellent presentation. Finally, I understand what xG really means and how betting websites calculate their winning odds.
Great analysis, I loved it. I wanted to suggest that there is a multiplier that affects how much a match result should be taken into account. For example if it is a World Cup game we take the actual xG from the match result, but if it is a Nations League game it will be xG times 0.8 and if it is a friendly game then maybe xG times 0.4, because teams tend to not play at their maximum effort when the game is not as important. In that way, if England had beaten Portugal in the Nations League it would show up in the final model as more important than if they beat Portugal again in a friendly, for example.
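One way to read the suggestion above is as an importance-weighted average of a team's xG, rather than literally shrinking each xG value. A minimal sketch follows; the weights 1.0, 0.8 and 0.4 are the ones proposed in the comment, while the match list is invented purely to show the arithmetic.

```python
# Importance weights as proposed in the comment
IMPORTANCE = {"world_cup": 1.0, "nations_league": 0.8, "friendly": 0.4}


def weighted_mean_xg(matches: list) -> float:
    """Average xG where each match counts in proportion to its importance."""
    total_weight = sum(IMPORTANCE[kind] for kind, _ in matches)
    return sum(IMPORTANCE[kind] * xg for kind, xg in matches) / total_weight


# Hypothetical matches: (competition type, xG in that match)
matches = [("world_cup", 1.0), ("nations_league", 2.0), ("friendly", 3.0)]

plain = sum(xg for _, xg in matches) / len(matches)
weighted = weighted_mean_xg(matches)
print(f"plain mean xG    = {plain:.3f}")
print(f"weighted mean xG = {weighted:.3f}")
```

Here the big friendly performance is discounted, so the weighted figure sits closer to the World Cup match than the plain mean does.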
Yes, these are important issues and we hope it makes people aware of how models, which want to home in on the important things, have a lot to choose from. Watch our social media for updates and analysis after the round of 16.
Another thing missing is accounting for the fact that some teams have players who exceed the xG. A Kylian Mbappe will score more often from a given location on the field than might Jonathan David (of my country's team, Canada). Teams with better defenders and keepers will lower the success rate of shots that on average have a high xG.
I'm assuming that stats exist that compares a player's shot success with the average from a given location relative to the goal.
I think adding an expected goals conceded as well as xG would be better, some teams will be focused on not conceding instead of scoring. Very cool video tho dude.
‘Ill remain unbiased here’ While wearing an Ipswich shirt 😂😂
COYB
Hi, I hope this is a problem with the flags in the picture and not with the model.... in the graphics at 19'38'' we have just one team of the group C (Argentina) advancing from the first phase and playing against Switzerland - and we have three teams from group G (Brazil, Serbia and Switzerland) advancing...
Josh acknowledges this later in the video. This is not a specific game-by-game prediction. That is here, for the last 16: twitter.com/OxUniMaths/status/1593933134256553989
this was so entertaining, making me love maths
Which statistical package are you using?
Holding up pretty well so far
Great idea to make more people interested on math, we love the world cup so we do who can predict the results.
Nice, but if you follow 538, they saw that draws are ~10% more likely than expected by a poisson model.
Also I'd consider a blend between xG and actual goals and do some testing on if the model is well calibrated.
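The draw point above can be made concrete by computing the draw probability that an independent-Poisson model implies and then nudging it. This is a rough sketch: the expected-goals values are hypothetical, and multiplying the draw probability by 1.10 and renormalising is just one simple way to express "about 10% more draws", not a description of how 538 or the video's model handles it.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k goals when the expected number of goals is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def outcome_probs(lam_a: float, lam_b: float, draw_boost: float = 1.0,
                  max_goals: int = 15) -> tuple:
    """Return (P(A wins), P(draw), P(B wins)) for independent Poisson scores."""
    win_a = draw = win_b = 0.0
    for i in range(max_goals + 1):
        for j in range(max_goals + 1):
            p = poisson_pmf(i, lam_a) * poisson_pmf(j, lam_b)
            if i > j:
                win_a += p
            elif i == j:
                draw += p
            else:
                win_b += p
    draw *= draw_boost  # assumed adjustment, see the note above
    total = win_a + draw + win_b  # renormalise (also absorbs the truncation)
    return win_a / total, draw / total, win_b / total


# Hypothetical expected goals for the two teams
plain = outcome_probs(1.4, 1.1)
boosted = outcome_probs(1.4, 1.1, draw_boost=1.10)
print("plain  :", [round(p, 3) for p in plain])
print("boosted:", [round(p, 3) for p in boosted])
```

Checking calibration, as the comment suggests, would then mean comparing predicted draw rates like these against the observed share of draws in held-out matches.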
Can we download the model document?
After all that brilliant analysis I mostly enjoyed the fudge FCH factor
question, is predicting 9/16 teams to the round of 16 a good result? Also, one of your finalists was already knocked out. Although, you have two quarter-final games predicted correctly.
including the conversion rates relative to xG for different countries would also improve the prediction - for example harry kane scores about 90% of penalties, which are 0.75 xG. but another player may score below 75% of the time
Great job for explaining that in simple way. I found some inaccuracy in model. You predicted that from Group G three teams advanced to the next round. Brazil, Switzerland and Serbia. Just one too much. Except that it seems very likely :)
Josh acknowledges this. Fuller prediction: twitter.com/OxUniMaths/status/1593933134256553989
Your Model Has an obvious error. You have 3 teams that qualify from brazil group(serbia brazil and swiss) but only argentina from argentina group. This cannot happen as per the rules of the game.
Yes, Josh says later in the video that he is not being match specific in this video. Go to our social media for the precise prediction.
this video aged really well
Actually, this was a brilliant presentation! I don't see how the results could be further improved, apart from very complex calculations for each player.
Any chance you could share some of your data and code on github ?
Is there a detailed write-up for this model?
20:04 England are playing Belgium in the Last 16 in your example, but Group A and Group B teams are set to play each other in the Last 16 (so England will play Netherlands, Senegal, Ecuador or Qatar). You might need to reconsider the structure of the knockout bracket as part of your prediction.
He says that these don't represent match results but rather which placings the teams reach
Yes, Josh says later in the video that he is not being match specific in this video. Go to our social media for the precise prediction.
Very interesting! I bet the betting brokers use something like this.
But few remarks:
1. Group “Argentina - Mexico” - only 1 team qualified to round of 16 as per your model
2. Group “Brazil - Cameroon” - 3 teams qualified
3. The match-ups of round of 16 do not match (e.g. Brazil should have played against Uruguay in round of 16 as per your model)
Josh says this in the video - this is the chance of each team going to each stage, not the specific group results. Take care
Questions: Why use ELO to adjust xG instead of xG to adjust ELO, and then use ELO as your primary predictive variable without Poisson sampling? Or use both in multivariate calculation? If you’re using xG to predict scores and outcomes and weighting more recent results, why not update xG during each round of the simulation? I would put a heavy weight on those results since they not only take tournament momentum into account, those results would be based on current rosters. Also, I think shots on goal is a better stat than xG. Lots of shots on goal lead to rebounds or continuations that result in goals.
If I understand correctly, for each game you model a stochastic XG (based on ELO ratings) and then stochastic goals scored, based on modelled XG. So you have two levels of stochastic-ness. Is there any analysis that this gives better predictions than simply estimating stochastic goals directly from ELO?
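The "two levels of stochastic-ness" described above can be sketched as follows. All numbers here (baseline xG of 1.5, slope of 0.002 goals per rating point, noise standard deviation of 0.5) are hypothetical placeholders and not the parameters used in the video; the Poisson sampler is Knuth's simple multiplication method.

```python
import math
import random
import statistics


def sample_poisson(lam: float, rng: random.Random) -> int:
    """Draw one Poisson(lam) value using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def expected_xg(rating_diff: float, base_xg: float = 1.5, slope: float = 0.002) -> float:
    """Hypothetical linear link from rating difference to mean xG."""
    return base_xg + slope * rating_diff


def two_level_goals(rating_diff: float, rng: random.Random, noise_sd: float = 0.5) -> int:
    """Level 1: noisy match xG. Level 2: Poisson goals given that xG."""
    xg = expected_xg(rating_diff) + rng.gauss(0.0, noise_sd)
    return sample_poisson(max(xg, 0.05), rng)  # keep the Poisson mean positive


def direct_goals(rating_diff: float, rng: random.Random) -> int:
    """Single level: Poisson goals straight from the rating-based mean."""
    return sample_poisson(max(expected_xg(rating_diff), 0.05), rng)


rng = random.Random(42)
two_level = [two_level_goals(0.0, rng) for _ in range(50_000)]
direct = [direct_goals(0.0, rng) for _ in range(50_000)]

for name, goals in (("two-level", two_level), ("direct", direct)):
    print(name, round(statistics.fmean(goals), 3), round(statistics.pvariance(goals), 3))
```

The two versions have nearly the same mean, but the two-level one has a larger variance (by the law of total variance, the Poisson variance plus the variance of the noisy xG). Whether that extra spread gives better predictions is exactly the empirical question raised in the comment, and this sketch does not answer it.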
I had similar question. Why not use xG to adjust ELO and then use ELO as the primary independent variable? Or better yet, multivariate regression using both? xG seems quite flawed for many reasons but one not mentioned is that many goals in soccer are scored on rebounds. I don’t see how xG takes that into account. There are lots of other stats that could be useful: # of shots on goal, time of possession, # of corner kicks. Also why not have your model update xG during the tournament? This would not only account for momentum, but would help mitigate the problem of your dataset being largely built on outdated team rosters.
Hi Josh, thanks so much for this video! I really enjoyed it. I would be very grateful if you could shed some light on how the xG values were adjusted given the rating difference? I understand conceptually what is being achieved by doing this, but I'm struggling to visualise the computational steps taken from xG to adjusted xG.
He won't reply to you if you support Norwich.
Josh may put out his full model if he has time. Keep an eye on his social media: @JoshuaABull on Twitter
Could you try to make a model with xG + xGA, please? It could be interesting to watch if something will change.
Is there somewhere I can see the results and number of goals for all matches?
Which program is it based on? Can you share the coding?
Hey! Where can I find the code?!
Can I have the slides of this presentation? I have a presentation coming up this week and my topic is the same. Please send me the ppt.
Great stuff. Can you say why you assume a team's games will be in something like a Poisson distribution rather than a normal distribution? Or is the Fish distribution just a version of the normal distribution that allows for discrete variables? (As you can tell I know very little about statistics).
Source: Physics degree.
"All models are wrong, but some are useful" -George Box.
Theoretical answer:
Poisson distributions are good for measuring "counting statistics" - it counts the number of events that happen in a time frame - Physics uses it to count photons entering detectors.
I tried to write a better explanation from memory talking about combining Bernoulli trials. Roughly, the model pretends "either a goal happens in this minute, or it doesn't" again and again, combining results after 90 minutes.
The Poisson distribution is the limit of the Binomial distribution when you make sensible assumptions for this football problem.
Poisson is the screwdriver, this problem looks like a screw. Watching the goal line and counting how many goals go in, this is the right tool for the job.
'Mutual Information' on yt has a bunch of really good videos about how different distributions relate to help choose the right tool for the job.
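A minimal sketch of the "Bernoulli trials per time slice" idea above, using only the Python standard library. The rate of 1.4 goals per team per match is an illustrative assumption, not a number from the video, and `max_gap` is a hypothetical helper name. It compares the Binomial probabilities (one possible goal per slice) with the Poisson probabilities as the match is cut into finer slices.

```python
import math

def binomial_pmf(k, n, p):
    """P(exactly k goals) when each of n time slices independently has a goal with probability p."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """P(exactly k goals) under a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

LAM = 1.4  # assumed average goals per team per match (illustrative only)

def max_gap(n, lam=LAM, max_goals=10):
    """Largest difference between Binomial(n, lam/n) and Poisson(lam) over 0..max_goals goals."""
    p = lam / n  # keep the expected number of goals fixed at lam
    return max(abs(binomial_pmf(k, n, p) - poisson_pmf(k, lam))
               for k in range(max_goals + 1))

if __name__ == "__main__":
    # 90 one-minute slices, 5400 one-second slices, then finer still
    for n in (90, 5400, 324000):
        print(n, max_gap(n))
```

The gap shrinks as the slices get finer, which is the sense in which the Poisson is the limit of the Binomial here.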
Experimental answer:
Go to your Google account and open a Colab notebook.
Go to Google dataset search
datasetsearch.research.google.com/
and grab international goals scored per game.
Take the international goals scored per game data and split it 80%/20%.
For the 80% training set, plot the graph of goals scored per game in Python & matplotlib, fit the parameter for a Poisson to the data, plot the graph using Python's statsmodels library, and take a Kolmogorov-Smirnov test to measure 'goodness of fit'.
We have the goodness of fit for the Poisson distribution to 80% of our data.
Now put this block of code inside a big for loop that measures goodness of fit with the other 50 distributions in the statsmodels library.
Choose the distribution from the 50 with the highest goodness of fit, and then test this holds up on the remaining 20% data held back from earlier.
I reckon if you actually did this, there would probably be models with better goodness of fit, but they would be less parsimonious for a 20 minute maths communication lecture on youtube :)
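A rough sketch of the recipe above, under loud assumptions: there is no real dataset here, so the goals are synthetic placeholder data drawn from a Poisson with an assumed mean of 1.4, and everything uses the Python standard library rather than matplotlib or statsmodels. It also swaps the Kolmogorov-Smirnov test for a Pearson chi-square statistic, because the standard K-S test assumes a continuous distribution and goal counts are discrete. Only the Poisson is fitted; the loop over other distributions is left out.

```python
import math
import random
from collections import Counter

def poisson_pmf(k, lam):
    return math.exp(-lam) * lam**k / math.factorial(k)

def sample_poisson(rng, lam):
    """Draw one Poisson(lam) value with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

def fit_poisson(goals):
    """Maximum-likelihood estimate of the Poisson mean: the sample average."""
    return sum(goals) / len(goals)

def chi_square_stat(goals, lam, max_bin=5):
    """Pearson chi-square statistic with bins 0..max_bin-1 and 'max_bin or more'."""
    n = len(goals)
    counts = Counter(min(g, max_bin) for g in goals)
    probs = [poisson_pmf(k, lam) for k in range(max_bin)]
    probs.append(1.0 - sum(probs))  # tail bin
    return sum((counts.get(k, 0) - n * p) ** 2 / (n * p)
               for k, p in enumerate(probs))

rng = random.Random(0)
# PLACEHOLDER DATA: replace with real goals-per-team-per-game counts.
goals = [sample_poisson(rng, 1.4) for _ in range(2000)]
rng.shuffle(goals)

split = int(0.8 * len(goals))            # 80% / 20% split
train, held_out = goals[:split], goals[split:]

lam_hat = fit_poisson(train)             # fit on the 80% only
print("fitted mean:", lam_hat)
print("chi-square on training set:", chi_square_stat(train, lam_hat))
print("chi-square on held-out set:", chi_square_stat(held_out, lam_hat))
```

A smaller statistic means a closer fit. Comparing it across candidate distributions on the held-out 20% is the model-selection step described above.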
@@CoombesJD Thanks! Very helpful. I also just realized that what I said above was nonsense because obviously lots of discrete variables are normally distributed.
what a great mathematical tool the FCH factor is, incoming Fields medal predicted
Do you need to retrain the model after Argentina's first game against Saudi Arabia? How does the prediction change?
Brazilian banker here. I just LOVED the concept of the FCH constant. But you have to adjust it to the FPAEOTT (Football prefers anything else other than tea), add the CITB (Cachaca is the BEST) and NWR (Neymar will Rock). I measured it. The funny thing is that my numbers ended in a 123,667% chance of the yellow jersey winning.
KKKKKKKKKK
@@00vulture Oops. The cachaça just took a Kale on the head. Congrats Croatia. This WC is killing any math. Just like football should be. Now I am for the Gin (the Dutch are the real inventors of Gin, for those who do not know)
Joshua, do you update this model with each match played?
He doesn't (though he could). It is early days.
thanks for the vid, i feel like this model could be more robust by implementing more parameters instead of focusing on only xG
One more factor that should come into play is that matches in the Euros and the World Cup should be weighted more than friendlies because certain teams do better in those tournaments than in friendlies
This prediction cannot resist the curse of cats. 😂 Cats win.
Do betting companies use the same kind of models?
Would it not be useful to also factor in a side's probability of conceding alongside their probability of scoring?
Quarterfinals, Netherlands vs Argentina, on the other side Brazil vs Spain. So according to your calculations. Semifinal Argentina - Brazil. This means that the other finalist must come from France and Belgium. But here the margins are fairer.
Very interesting, and huge respect to Josh for winning fantasy football, no mean feat from 8 million players. I think the model is largely accurate, but we must also take some very important things into account. It doesn't really matter what's happened in the last 4 years; all that matters is the results in the next 4 weeks. In a World Cup momentum is key, and a good start is essential. Anything can happen on the day, and any of the top teams could win or lose on penalties. And what about injuries to key players, red cards, VAR disallowed goals, offsides, etc.? For example, if Kevin De Bruyne gets injured then Belgium aren't getting to the final. But absolutely fascinating nonetheless.
But every team is just as likely to get a key injury/red card. Doesn't affect the odds too much.
That’s a solid point: this comes back to how the model is very sensitive to the latest and most recent form.
So the model can give very different outputs once it has ingested the results of just two more matches.
It depends on what you mean by momentum. Argentina was stronger with Maradona, less before and after. That is a real effect and is covered by the 4-year range and the time-based weighting he's talking about.
The other momentum some talk about, like winning streaks or NBA players with hot hands, does not exist. There are studies that show that these streaks just happen. Plain statistics explains them.
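To illustrate how streaks "just happen", here is a small simulation sketch. The numbers are assumptions chosen for illustration (40 games, a 50% chance of winning each one, every game independent of the last), and the function names are hypothetical. It shows that long runs of wins turn up with no momentum effect built in at all.

```python
import random

def longest_streak(results):
    """Length of the longest run of consecutive wins (True values)."""
    best = current = 0
    for won in results:
        current = current + 1 if won else 0
        best = max(best, current)
    return best

def average_longest_streak(n_games=40, p_win=0.5, n_runs=10000, seed=0):
    """Average longest winning streak over many simulated runs of independent games."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_runs):
        results = [rng.random() < p_win for _ in range(n_games)]
        total += longest_streak(results)
    return total / n_runs

if __name__ == "__main__":
    print("average longest winning streak:", average_longest_streak())
```

With these assumed numbers a team that is a pure coin flip still typically strings together several wins in a row at some point, so a streak on its own is weak evidence of a hot hand.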
@19:40 MINOR ERROR. Pause the frame at 19:40. Winners and Runners-up of Group A and B play each other in the 'Round of 16', but this simulation here does not follow that pattern. For example, the simulation shows Ecuador playing Brazil in the 'R of 16' (but Ecuador is in Group A; Brazil is in Group G; and so these two should not meet in the 'R of 16'). The simulation takes Iran & England from Group B, and makes them play Denmark & Belgium (Group D) respectively in the 'R of 16'. So, essentially the program has a minor error in how the group stage feeds into the knockout stage, but this is critical since the computer calculations are based on the wrong matches. A minor fix, and we might see a different result. Who knows!! Cheers
Josh acknowledges this later in the video. This is not a specific game-by-game prediction. The specific prediction for the last 16 is here: twitter.com/OxUniMaths/status/1593933134256553989
@@OxfordMathematics got it. all good then. cheers.
Wasn't expecting paul dano explaining to me how brazil is mathematically going to win the world cup
Hi Joshua, thanks for the knowledge shared. It was really an eye opening one.
I don't know if there's an error in the knockout stage that you modelled. Teams in the same group are not supposed to meet each other until the Final. I see the image where you have some teams in the same group meeting each other in the Quarter Finals. Some groups have up to 3 teams represented in the knockout stages.
Other than that, it was a fun video and that shows the power of mathematics used in real world representation
Yes. Josh acknowledges this in the video. Full last 16 here: twitter.com/OxUniMaths/status/1593933134256553989
I wonder if there’s been a last minute recalculation after the Argentina Saudi Arabia result
Just seeing an expert explain his decision making when creating an AI model is worth every single minute. Thanks for the fun explanation. I want to know why the xG gets a Poisson distribution; I did not quite understand why it was not a normal distribution. Thanks!
The number of goals is a non-negative integer, and so Poisson (rather than normal)
The normal distribution is a continuous probability distribution. It does not make much sense to predict that a team will score 3.17 goals I guess xD Draws would also be impossible in that way. But no idea why the Poisson distribution would be the most obvious discrete distribution here.
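On the point about draws: with a discrete distribution, the chance of both teams landing on exactly the same score is a real, non-zero number. A minimal sketch, assuming each team's goals follow an independent Poisson distribution; the means 1.6 and 1.1 are made-up illustrative values and `outcome_probs` is a hypothetical helper, not the model from the video.

```python
import math

def poisson_pmf(k, lam):
    return math.exp(-lam) * lam**k / math.factorial(k)

def outcome_probs(lam_a, lam_b, max_goals=15):
    """Return (P(A wins), P(draw), P(B wins)) for independent Poisson scores.

    Scorelines above max_goals per team are ignored; for typical football
    means their total probability is negligible.
    """
    win = draw = loss = 0.0
    for a in range(max_goals + 1):
        for b in range(max_goals + 1):
            p = poisson_pmf(a, lam_a) * poisson_pmf(b, lam_b)
            if a > b:
                win += p
            elif a == b:
                draw += p
            else:
                loss += p
    return win, draw, loss

if __name__ == "__main__":
    # Assumed expected goals for team A and team B (illustrative only)
    print(outcome_probs(1.6, 1.1))
```

Under a continuous distribution such as the normal, the probability of two scores being exactly equal would be zero, which is the problem with draws mentioned above.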