The Crystal Egg is a system built in R and Python that scrapes rugby data from the web, puts it into a workable format, and calculates team and player scores, which are then used to model and predict the outcomes of games and to simulate sets of games. This page is an attempt to summarise the processes and principles involved in the software, which is inspired by FiveThirtyEight / ESPN’s predictive modelling for soccer.
I won’t be sharing the actual code just yet – I’d like to in principle, but at the moment it is very messy, uncommented, probably inefficient, and subject to change. As I say though, I don’t have objections in principle, so it’s something I might consider for the future.
The first stage in the process is actually gathering the data. This is scraped from ESPN match reports such as this one, using the Python module Beautiful Soup. The aim is to build two datasets covering all games – one dataframe where each row represents one team’s game, and another where each row represents an individual player’s performance within a game.
For some games, player data is either not available or not easily scraped due to the vagaries of the ESPN website – I estimate this applies to about 10% of games (higher for older seasons, lower for newer, which is reassuring). For these games, only team data is used. For the team data, factors recorded include the score, the number of tries for each side, the home and away teams, and the lineup for each team. For individual players, the list is longer, and includes all the usual stats (number of tries, assists, metres run, tackles made, etc.) as well as minutes on the pitch (calculated from the match timeline using a parsing function) and more.
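The scraping step can be sketched roughly as follows. The HTML fragment and class names here are hypothetical (the real ESPN markup is different and changes over time), but the idea of turning one match report into one row per team is the same:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment of a match report; the real ESPN markup differs.
HTML = """
<div class="match">
  <span class="home-team">Leinster</span>
  <span class="away-team">Zebre</span>
  <span class="score">10 - 3</span>
</div>
"""

def parse_match(html):
    """Extract one team-level row per side from a report fragment."""
    soup = BeautifulSoup(html, "html.parser")
    home = soup.find("span", class_="home-team").get_text(strip=True)
    away = soup.find("span", class_="away-team").get_text(strip=True)
    home_pts, away_pts = (int(part) for part in
                          soup.find("span", class_="score").get_text().split("-"))
    return [
        {"team": home, "opponent": away, "home": True,
         "points_for": home_pts, "points_against": away_pts},
        {"team": away, "opponent": home, "home": False,
         "points_for": away_pts, "points_against": home_pts},
    ]

rows = parse_match(HTML)
```

The player-level dataframe is built the same way, just with one row per player rather than per team.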
Before this season, ESPN didn’t collect data on games from France’s Top 14 (or at least didn’t publish it), so this season is the first for which I’ll have any data – for that reason I won’t be attempting to predict Top 14 games this season!
After the initial bulk download, individual games can easily be appended to each of these datasets. At this stage, the two datasets are ready for the next element: calculating team scores.
The aim of this section is to develop a meaningful score for each team and player that can be used when modelling the results of games. The first stage is to look at the outcome of a game and put it in context. For example, Leinster beating Zebre 10-3 is a victory for Leinster, but not a very convincing one considering the quality of the team and the weakness of the opposition.
The solution is a formula that compares how well a team does against its own existing offensive score (how many points it normally scores) and its opponent’s defensive score, to calculate its new offensive score. Similarly, how many points a team concedes can be adjusted by how many points it normally concedes and how many points its opponents usually score.
Overall, this process gives an adjusted score for the game that puts in context how well the team has done in a given game. In the example above, the adjusted score would be more like a 5-6 loss for Leinster than an actual victory.
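The exact formula isn’t published here, but the idea can be sketched as below. The `LEAGUE_AVG` value and the scaling rule are my own illustrative assumptions, not the system’s actual maths:

```python
LEAGUE_AVG = 22.0  # illustrative league-average points per game (assumption)

def adjust_score(points_scored, opp_def_rating, league_avg=LEAGUE_AVG):
    """Scale points scored by opponent defensive strength.

    opp_def_rating: points the opponent typically concedes per game.
    Points scored against a leaky defence (rating > league_avg) are
    marked down; the same points against a mean defence are marked up.
    """
    return points_scored * (league_avg / opp_def_rating)

# Leinster's 10 points against a very leaky Zebre defence (say 40 conceded
# per game) deflate; Zebre's 3 against a mean Leinster defence (say 11) inflate.
leinster_adj = adjust_score(10, 40)  # 10 * 22/40 = 5.5
zebre_adj = adjust_score(3, 11)      # 3 * 22/11 = 6.0
```

With these made-up ratings, the 10-3 win comes out as roughly the 5-6 adjusted loss described above.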
For each game, then, an adjusted score is given. These scores are combined using a time-weighting, such that more recent games count for more, to give an average adjusted points scored and an average adjusted points conceded for each team at each point in time.
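A minimal sketch of such a time-weighting, assuming a simple exponential decay (the half-life value is illustrative, not the system’s actual parameter):

```python
def time_weighted_average(scores, half_life=10.0):
    """Exponentially decayed mean of per-game adjusted scores.

    scores: adjusted scores in chronological order (most recent last).
    half_life: games after which a result's weight halves (assumption).
    """
    n = len(scores)
    weights = [0.5 ** ((n - 1 - i) / half_life) for i in range(n)]
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)
```

Because the weights decay with age, a team’s rating drifts towards its recent form rather than its season-long average.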
A similar process is done with players. For each game, each player gets an individual offensive and defensive score, effectively representing how many points they made for or cost their team in terms of the adjusted score. This is done using their actions on the field, divided up as follows:
Defensive points can be won by turnovers (worth 0.5 points), tackles (0.25 points), and winning defensive lineouts (0.25 points), or lost by missing tackles (-0.25 points), conceding penalties (-1 point), receiving a yellow card (-2.5 points) or a red card (-10 points).
Offensive points can be won by tries (2.5 points), assists (1 point), kicks (0.6 points for a penalty kick and 0.4 for a conversion) or lineout steals (0.5 points).
Finally, all remaining points that have been scored above the average for the league, and all remaining points that have been conceded below the average for the league, are doled out as offensive and defensive points respectively to each player based on how long they were on the pitch. This process can actually mean that a player receives a negative rating for a quiet game, even if his teammates have a stormer and win the game by a mile.
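The per-event values above can be encoded directly; the residual-sharing step relies on the fact that a team fields 15 players for 80 minutes, so total player-minutes per game is always 1200. The stat key names here are my own, not the system’s:

```python
# Point values as listed above; dict keys are hypothetical stat names.
DEF_POINTS = {
    "turnovers": 0.5, "tackles": 0.25, "defensive_lineouts_won": 0.25,
    "missed_tackles": -0.25, "penalties_conceded": -1.0,
    "yellow_cards": -2.5, "red_cards": -10.0,
}
OFF_POINTS = {
    "tries": 2.5, "assists": 1.0, "penalty_kicks": 0.6,
    "conversions": 0.4, "lineout_steals": 0.5,
}

def event_points(stats, table):
    """Sum the point values for each counted event in a player's stat line."""
    return sum(value * stats.get(event, 0) for event, value in table.items())

def residual_share(residual, minutes):
    """Dole out the team's remaining above/below-average points by time played.

    15 players x 80 minutes = 1200 player-minutes per team per game,
    regardless of substitutions. A negative residual is how a quiet
    player can end up with a negative rating even in a big win.
    """
    return residual * minutes / (15 * 80)

stats = {"tries": 1, "assists": 2, "tackles": 8, "missed_tackles": 2}
offence = event_points(stats, OFF_POINTS)   # 2.5 + 2*1.0 = 4.5
defence = event_points(stats, DEF_POINTS)   # 8*0.25 - 2*0.25 = 1.5
```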
These scores are combined in a similar way to team scores, weighted by how long ago a game was and how long the player was on the pitch, to give a score for each player at each point in time.
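Combining the two weightings might look like this sketch, again with an illustrative half-life, and crediting a full 80-minute game as weight 1:

```python
def player_rating(game_scores, minutes_played, half_life=10.0):
    """Weighted mean of per-game player scores.

    Each game's weight decays with age (most recent game last) and
    scales with minutes played; both parameters are assumptions.
    """
    n = len(game_scores)
    weights = [(0.5 ** ((n - 1 - i) / half_life)) * (minutes / 80.0)
               for i, minutes in enumerate(minutes_played)]
    total = sum(weights)
    return sum(w * s for w, s in zip(weights, game_scores)) / total if total else 0.0
```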
(As an aside at this point, player ratings are very much at a beta stage – currently out-halves still tend to be rated highest, despite the low weighting given to kicks, with lineout specialist locks also probably rating too high)
The output at this stage of the process is a set of scores for each team and player at the current moment, but also a “trainer” file which lists each game in the database, the offensive and defensive rating of each team before the game, and the average offensive and defensive score of the players in each team before the game (where available). This is matched with the ultimate result of the game. This file is invaluable for the next stage of the process, the actual prediction…
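The trainer file might look something like the sketch below; the column names and values are my own guesses at the layout described above, not the file’s actual schema:

```python
import csv
import io

# Hypothetical columns: pre-game team ratings, pre-game average player
# ratings (blank where lineup data was unavailable), and the result.
FIELDS = ["home_off", "home_def", "away_off", "away_def",
          "home_player_avg", "away_player_avg", "result"]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=FIELDS)
writer.writeheader()
writer.writerow({"home_off": 28.4, "home_def": 14.1,
                 "away_off": 12.7, "away_def": 30.2,
                 "home_player_avg": 1.8, "away_player_avg": "",
                 "result": "home_win"})
trainer_csv = buf.getvalue()
```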
The final stage is actually the most straightforward. The trainer file is fed into R, which uses the nnet and caret packages to calculate a multinomial logistic regression model for predicting the outcome of games, based on the teams’ offensive and defensive ratings, and where available, the average ratings of the players.
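The actual model is fitted in R with nnet and caret; an equivalent sketch in Python with scikit-learn, run on fabricated toy data since the real trainer file isn’t shown, looks like this:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy stand-in for the trainer file: four team-rating features and a
# three-way outcome (0 = away win, 1 = draw, 2 = home win).
rng = np.random.default_rng(0)
X = rng.normal(loc=20.0, scale=5.0, size=(300, 4))
margin = X[:, 0] - X[:, 2]  # fabricated signal: home attack minus away attack
y = np.where(margin > 2.0, 2, np.where(margin < -2.0, 0, 1))

# The lbfgs solver fits a true multinomial model, analogous to
# nnet::multinom in R.
model = LogisticRegression(max_iter=1000).fit(X, y)
probs = model.predict_proba(X[:1])[0]  # [P(away win), P(draw), P(home win)]
```

In the real system the probabilities come from the teams’ current ratings and the lineups’ average player ratings rather than random features, but the fit-then-predict-probabilities shape is the same.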
This model currently predicts 70-75% of past games correctly, depending on whether lineup information is available. This is significantly better than the “no information” strategy of simply betting on a home win, which would get about 60% of games right. Interestingly, the model currently has a slight bias towards home teams – there certainly feels like room to improve this a little.
Finally, given two teams’ current scores, and the average scores of the lineups, the model can give probabilities for a range of outcomes – not just win-loss-draw, but also “Bonus point home win, with losing bonus point for away” and so on. However, for the latter type of prediction, accuracy naturally drops a little bit.
The model is very much in the early stages, and in need of a good tweak. Player ratings in particular are problematic, as while better players generally tend to come out near the top (and the bottom of the list is populated mostly by Italians…), there are some types of player where their contribution is harder to quantify – props especially seem to fall into this category.