Predicting the Melbourne Cup Winner (edition 3)...


It's that time of year again…

The spring carnival is here to deliver the race that stops a nation in a year that at times has seen the entire world come to somewhat of a halt.

Despite the madness that has been 2020, the White Box team has still dedicated some time to building yet another Melbourne Cup case study, following the 2018 and 2019 builds and testing some new variables as a data science exercise.

This year the model build was aided by our new Melbourne-based intern, Rohail, who was tasked with collecting and organising the data from both of the previous years' dashboards and from the recent form of all the 2020 runners, which became the base of our model.

Why do we keep making these models & visualisations?

Before going into the model, we'd like to quickly mention why exercises like these matter for a data science and analytics team. The complexity of trying to build a model that picks a winner in an event like this encourages a creative and collaborative approach to the entire process, from data exploration and auditing through to analysis and visualisation. It lets us combine old tricks with new tools, share knowledge with each other, and ultimately have a bit of fun with the data we're using - something we believe is integral to any operation leveraging its data to improve business outcomes and achieve its goals.

Similar to last year, we’ve built our model to predict a “place”: a horse that finishes either 1st, 2nd or 3rd.
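As a rough illustration of that framing (the column names here are hypothetical, not our actual schema), the target is just a binary flag per runner:

```python
import pandas as pd

# Hypothetical form data: one row per horse per race.
form = pd.DataFrame({
    "horse": ["Horse A", "Horse B", "Horse C", "Horse D"],
    "finish_position": [1, 3, 4, 9],
})

# A "place" is a finish of 1st, 2nd or 3rd, which turns the
# prediction problem into binary classification.
form["place"] = (form["finish_position"] <= 3).astype(int)
print(form)
```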

Here are the final results from our modelling:

[Image: final results from our modelling]

The top 3 horses are the key picks, but we’ve lowered the threshold to show the top 5.


As you can see, we’ve built a selection of models to get different viewpoints, and used each model’s output accuracy to help weight the scoring.
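A minimal sketch of what that kind of accuracy-weighted blend looks like (the model names, scores and probabilities below are placeholders, not our actual outputs):

```python
import numpy as np

# Held-out scores for each model (placeholder values), used as
# weights when blending each model's "place" probabilities.
model_scores = {"logistic": 0.71, "random_forest": 0.72, "chaid": 0.68}

# Predicted place probability per runner, one array per model.
place_probs = {
    "logistic":      np.array([0.55, 0.30, 0.20]),
    "random_forest": np.array([0.60, 0.25, 0.15]),
    "chaid":         np.array([0.40, 0.45, 0.10]),
}

# Normalise the scores into blending weights.
weights = np.array([model_scores[m] for m in place_probs])
weights = weights / weights.sum()

# Accuracy-weighted average of the models' predictions.
blended = sum(w * place_probs[m] for w, m in zip(weights, place_probs))
print(blended)
```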

In terms of the variables used, we built all sorts of interesting features, but the last 5 race positions (normalised to the number of runners) were key, with total prize money and horse weight also proving significant.

The CHAID model incorporated other variables such as race form trend, i.e. how each horse has performed relative to its earlier races.
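Roughly, the normalised position and form trend features could be derived like this (the column names and the slope-based trend definition are illustrative assumptions on our part; prize money and horse weight would just be passed through as raw columns):

```python
import numpy as np
import pandas as pd

# Hypothetical race history for one horse, most recent race last.
history = pd.DataFrame({
    "finish_position": [6, 4, 3, 2, 1],
    "field_size":      [12, 10, 14, 16, 18],
})

# Normalise each finish by the field size, so a 6th of 12 runners
# and a 3rd of 14 runners are comparable across races.
norm_pos = history["finish_position"] / history["field_size"]

# The last 5 normalised positions as individual features.
features = {f"pos_{i + 1}": round(p, 3) for i, p in enumerate(norm_pos)}

# A crude form trend: the slope of the normalised positions over
# recent races (a negative slope means improving form).
features["form_trend"] = np.polyfit(np.arange(len(norm_pos)), norm_pos, 1)[0]
print(features)
```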

The Random Forest had a similar accuracy and similar feature importances to the logistic regression model.

Our models were scoring around 0.70 on the F-score statistic.
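For the curious, here’s a self-contained scikit-learn sketch of that comparison (synthetic data stands in for our form features, so the numbers won’t match ours):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Stand-in data: in practice X holds the form features described
# above and y is the binary "place" flag.
X, y = make_classification(n_samples=500, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

rf = RandomForestClassifier(random_state=42).fit(X_train, y_train)
lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# F-score on the held-out set for each model.
for name, model in [("random forest", rf), ("logistic regression", lr)]:
    print(name, "F-score:", round(f1_score(y_test, model.predict(X_test)), 2))

# Rough feature importance comparison: tree importances vs the
# magnitude of the logistic coefficients.
print("RF importances:", rf.feature_importances_.round(2))
print("LR |coef|:     ", np.abs(lr.coef_[0]).round(2))
```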

[Image: F-scores for each model]

In summary…

As in previous years, trying to predict the outcome of a race with limited data is always going to be challenging, so take everything here with a pinch of salt!

Let’s see what happens at 3pm…



Post 3pm:

Another year, another dose of lessons to learn!

Our predictions were not worth their weight in gold. From a quick post-race analysis of “what went wrong”, using multiple models may have been a downfall.

The CHAID model skewed our predictions, while the Random Forest actually performed much better than the rest of the models.

The other major factor was how we modelled horse form. We looked at the last 10 races for each horse and modelled the form of the races leading up to the Melbourne Cup, which for most horses were different races, so their recent form wasn’t directly comparable.

[Image: Melbourne Cup 2020 results]

Next year?

Our new strategy is to build a more comprehensive dataset of form data for the horses in the lead-up to a race they all contest. This will be a more time-consuming data-gathering exercise, but one worth the test.

As always, our advice is not to gamble!

If you have any questions, please don’t hesitate to reach out - we’re here to support you!