Monday, June 30, 2014

Average Runs Scored Given Vegas Over/Under Odds


When you look at the Over/Under, often referred to as the "Run Total", for a major league baseball game at a Sports Book, you will see the run total given as a number like "7 runs" with juice looking something like -120/+100, with the -120 being the payout odds on the over and the +100 being the payout odds on the under. Juice of -120/+100 tells you that the Sports Book thinks the game is a little more likely to go over than under. In fact, the Sports Book is telling you there is a 52.38% chance that the game goes over and a 47.62% chance that it goes under. Here is my algorithm and calculator showing you how to convert from Sports Book odds (example: -120/+100) to a percentage.
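As a rough illustration of the conversion, here is a minimal Python sketch. It assumes the two sides are averaged into a single no-juice line before converting, which is one method that reproduces the 52.38%/47.62% figure above; the actual calculator may work differently, and the function names are made up for this example.

```python
def favorite_margin(line: int) -> int:
    """Cents by which a moneyline is favored: -120 -> 20, +100 -> 0, +115 -> -15."""
    return -line - 100 if line < 0 else 100 - line


def over_probability(over_line: int, under_line: int) -> float:
    """Assumed method: average both sides into one no-juice line, then convert."""
    # d > 0 means the over is the favorite at a fair line of -(100 + d);
    # d < 0 means the over is the underdog at a fair line of +(100 - d).
    d = (favorite_margin(over_line) - favorite_margin(under_line)) / 2
    if d >= 0:
        return (100 + d) / (200 + d)
    return 100 / (200 - d)


print(round(100 * over_probability(-120, 100), 2))   # 52.38
print(round(100 * over_probability(115, -125), 2))   # 45.45
print(round(100 * over_probability(-110, -110), 2))  # 50.0
```

The under chance is simply one minus the over chance, since the juice has been averaged out.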

How about a game where the Sports Book thinks there is a 50/50 chance of the game going over or under (-110/-110)? If the run total was "7 runs" on such a game, how many runs would you expect to be scored if the game were played thousands of times? You might think the answer would be 7, but it is not. Seven runs would be the median, or the run total where you would have the same number of overs as unders. But what about the mean, or the average number of runs scored per game? Run totals are skewed, such that the most likely final score for almost any game with a 7 run over/under is the home team winning by a score of 3-2 (5 total runs), so we see a mean that is different from the median. How do you calculate the mean?

It's not easy to calculate. The best way is to look at the empirical data: look at games, track the run total, over juice and under juice, and see what the average number of runs scored is for games with the same values for each of the three parameters. Quickly, the problem you run into is sample size. There are just not enough games out there (each team plays only 162 per year), so this won't work very well. The solution is to create a larger sample, and the way I did this was to use my simulator to create games averaging from 5.5 up to 11.5 runs in steps of 0.05 runs per game. For example, I created two teams that averaged 5.5 runs per game when playing each other a million times. I then adjusted the two teams to create an outcome that averaged 5.55 runs per game, and so on, all the while recording the percentage of the time each game went over or under the nearest run total.

For example, one game I created and simulated one million times averaged 5.9902 runs per game. The run total closest to 50% on the over/under for this game was 5-1/2 runs: the game went over 48.64% of the time and under 51.36% of the time.
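The bookkeeping for one simulated matchup might look like the Python sketch below. The independent Poisson draws are only a toy stand-in for a real game simulator, so the percentages it prints will not match the ones quoted above; the function names, the seed and the 20,000-game sample are all assumptions for illustration. Only half-run totals are considered, to sidestep pushes.

```python
import math
import random


def poisson(rng: random.Random, lam: float) -> int:
    """Knuth's method for one Poisson draw; a toy stand-in for a game simulator."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def summarize(totals: list[int]) -> tuple[float, float, float, float]:
    """Return (mean runs, half-run total nearest 50/50, over %, under %)."""
    n = len(totals)
    mean = sum(totals) / n
    best_line, best_over = None, None
    for line in [x + 0.5 for x in range(30)]:
        over = 100 * sum(t > line for t in totals) / n
        if best_over is None or abs(over - 50) < abs(best_over - 50):
            best_line, best_over = line, over
    return mean, best_line, best_over, 100 - best_over


rng = random.Random(2014)
# Two toy offenses averaging 3.0 runs each, so about 6.0 runs per game combined.
games = [poisson(rng, 3.0) + poisson(rng, 3.0) for _ in range(20_000)]
mean, line, over, under = summarize(games)
print(f"mean={mean:.4f} line={line} over={over:.2f}% under={under:.2f}%")
```

Repeating this while nudging the two offenses up in small steps yields one (over %, mean runs) pair per simulated matchup for each run total.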

Once I have enough samples at each over/under, I can get a best fit equation (y = mx + b) for each run total, given that I know the chances that the game goes over and under. My simulator tells me those chances directly, and for a real game the Sports Book's over/under odds tell me the same thing. So once the equations are built from the simulator's data, I can apply them to the Sports Book odds after converting the odds and juice into over and under chances.
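A best fit line of this kind can be computed with ordinary least squares, as in the minimal Python sketch below. The sample points are made-up placeholders, not simulator output, and `fit_line` is a hypothetical helper name. Here y is taken to be the average runs minus the run total, which is what reproduces the 7.59 and 7.99 figures in the worked example further down.

```python
def fit_line(xs: list[float], ys: list[float]) -> tuple[float, float]:
    """Ordinary least squares fit of y = m*x + b; returns (m, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b


# Placeholder data for a single run total (NOT real simulator results):
# x = percent chance of the over, y = average runs minus the run total.
over_pct = [46.0, 48.0, 50.0, 52.0, 54.0]
excess_runs = [0.14, 0.31, 0.49, 0.66, 0.84]

slope, offset = fit_line(over_pct, excess_runs)
print(f"y = {slope:.6f}x + {offset:.6f}")
```

One such fit is needed per run total, since each total gets its own slope and offset.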

So below is the table that shows you the equation for each Run Total. In the equation, "x" is the percent chance (e.g. 51.92) that the game goes "over", and "y" is the number of runs to add to the Run Total to get the average.

Let's take the June 30th game between the Indians and Dodgers as an example. The Vegas Odds on the "Run Total" are 7-1/2 +115/-125, which translates to an over chance of 45.45% and an under chance of 54.55%.

The equation for a game with a Run Total of 7-1/2 is: y = (0.087176)(45.45) - 3.87196653 = 0.09

Here y is the number of runs by which the average exceeds the Run Total, so the average number of runs for this game (given that the Vegas Odds are true odds) is 7.5 + 0.09 = 7.59 runs

An interesting side note: if you have a Run Total of 7-1/2 runs with Vegas giving a 50/50 chance of the over and the under hitting, the average comes out to 7.99 runs, almost half a run above the posted total.

Steps
1. Get Vegas Run Total
2. Use the table below to determine your slope(m) and offset(b)
3. Use Vegas odds on the Run Total to determine percent chance the game goes over(x)
4. Calculate y = mx + b and add y to the Run Total to get the average runs scored per game


Equation To Calculate Average Runs Scored per Game

Run Total    Slope (m)    Offset (b)
5.5          0.080769     -3.441105344
6            0.075334     -3.300033236
6.5          0.085624     -3.943390956
7            0.076109     -3.405558745
7.5          0.087176     -3.87196653
8            0.082359     -3.752582135
8.5          0.088259     -4.16674672
9            0.079836     -3.677004771
9.5          0.094069     -4.284308333
10           0.084646     -3.923914961
10.5         0.091977     -4.398507575
11           0.0816       -3.78090425
11.5         0.095217     -4.329096995
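The steps and table above can be sketched as a small Python lookup. The names `COEFFICIENTS` and `average_runs` are made up for illustration. y = mx + b is treated as the number of runs added to the Run Total, which is what reproduces the 7.59 and 7.99 figures from the 7-1/2 example.

```python
# Slope (m) and offset (b) per Run Total, copied from the table above.
COEFFICIENTS = {
    5.5: (0.080769, -3.441105344),
    6.0: (0.075334, -3.300033236),
    6.5: (0.085624, -3.943390956),
    7.0: (0.076109, -3.405558745),
    7.5: (0.087176, -3.87196653),
    8.0: (0.082359, -3.752582135),
    8.5: (0.088259, -4.16674672),
    9.0: (0.079836, -3.677004771),
    9.5: (0.094069, -4.284308333),
    10.0: (0.084646, -3.923914961),
    10.5: (0.091977, -4.398507575),
    11.0: (0.0816, -3.78090425),
    11.5: (0.095217, -4.329096995),
}


def average_runs(run_total: float, over_pct: float) -> float:
    """Average runs per game for a Run Total and an over chance in percent."""
    m, b = COEFFICIENTS[run_total]
    # y = m*x + b is the amount by which the average exceeds the Run Total.
    return run_total + m * over_pct + b


print(round(average_runs(7.5, 45.45), 2))  # 7.59
print(round(average_runs(7.5, 50.0), 2))   # 7.99
print(round(average_runs(5.5, 48.64), 2))  # 5.99
```

The last line uses the simulated 5-1/2 game from earlier in the post (48.64% over), and the result lines up with its simulated average of 5.9902 runs.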

2 comments:

MP said...

Thanks for this. Just curious if you could post calculated values based on empirical data, for comparison to your model. I ask because there are some unusual jumps in your coefficients. E.g., for a game with an 8.5 total, your model says the median will be 97.1% of the average. But for a game with a 9.5 total, the median is only 95.6% of the average. One would expect the opposite -- that the median/average value would get closer to 100% for higher totals -- right?

Xeifrank said...

Great question. I'm not sure I know the answer but let me take a stab at it.

First off, the whole-numbered medians have their median/average value growing closer to 100% for higher totals. So we see what you would expect there.

But the fractional median totals do not. Or so it appears. If you break the fractional medians down by odd (5.5, 7.5, 9.5, 11.5) and even (6.5, 8.5, 10.5), you will see the trend that you are looking for without the skips.

So it may have something to do with the distribution of game scores. Now why that might be the case would take some further studying.

As far as using the "real" empirical data from major league games and Vegas odds, you will run into small sample size issues. For example, there are only 207 games so far this year with a run total of 8.5 and 44 with a run total of 9.5, and that isn't even taking into account the different odds that each of those games has. You could use previous seasons too, and that would help a little bit, but it is still a tiny sample size compared to my one million games. But I agree, it would be a good exercise to do this with the actual empirical data, but I am afraid of all the noise you will get, especially on the less popular run totals.