Representing races geometrically for a GNN

Hi guys, this is my first GNN project and I am hoping someone could kindly share some feedback on how I have chosen to frame my problem for a GNN.

Essentially, I am trying to predict horse races. I have 100k historical races, where each race can have up to N runners. I have framed this as a graph classification problem. The objective is to predict which horse is going to win a race, identified by that horse’s starting gate. The output vector is of size N, for up to N runners in the race, or N starting gates. For now, I am ignoring trainers, owners and jockeys. As such, I have defined the following entities:

Nodes

  • Horse (static info on a horse incl. colour, breed etc.)
  • Race (info about a race such as weather, year, month, day of week, distance, track etc.)

Edges

  • (Horse, Run, Race) (info about a horse running in a race PRIOR to the race occurring. Info such as age at race start, starting gate etc.)
  • (Horse, Run Outcome, Race) (info about a horse run in a race AFTER the race occurring. Superset of Run with info such as finishing time, finishing position etc.)
  • (Race, Precedes, Race) (edge to encode race chronology; its feature is the number of days by which race i precedes race i + 1.)
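
To make the schema above concrete, here is a minimal plain-Python sketch of a heterogeneous graph container holding the two node types and three edge types, with every edge keyed as (source type, relation, destination type). The `HeteroGraph` class, the IDs and all feature values are made up for illustration; a real project would more likely use a GNN library's own heterogeneous graph object.

```python
from dataclasses import dataclass, field


@dataclass
class HeteroGraph:
    # node_type -> {node_id: feature list}
    nodes: dict = field(default_factory=dict)
    # (src_type, relation, dst_type) -> list of (src_id, dst_id, feature list)
    edges: dict = field(default_factory=dict)

    def add_node(self, node_type, node_id, features):
        self.nodes.setdefault(node_type, {})[node_id] = features

    def add_edge(self, src_type, relation, dst_type, src_id, dst_id, features):
        key = (src_type, relation, dst_type)
        self.edges.setdefault(key, []).append((src_id, dst_id, features))


g = HeteroGraph()

# Nodes: static horse info and race info (placeholder numeric encodings).
g.add_node("horse", "h1", [0.0, 1.0])
g.add_node("horse", "h2", [1.0, 0.0])
g.add_node("race", "r1", [1200.0, 3.0])  # historical race
g.add_node("race", "r2", [1600.0, 5.0])  # race to predict

# Run Outcome edges: post-race info, used only for historical races.
g.add_edge("horse", "run_outcome", "race", "h1", "r1", [4.0, 2.0, 71.3, 1.0])
g.add_edge("horse", "run_outcome", "race", "h2", "r1", [5.0, 1.0, 72.0, 2.0])

# Run edges: pre-race info only, used for the race being predicted.
g.add_edge("horse", "run", "race", "h1", "r2", [4.0, 3.0])
g.add_edge("horse", "run", "race", "h2", "r2", [5.0, 2.0])

# Precedes edge: r1 took place 14 days before r2.
g.add_edge("race", "precedes", "race", "r1", "r2", [14.0])
```

Keeping Run and Run Outcome as separate edge types means the pre-race and post-race feature vectors can have different widths.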

For a given race that I am trying to predict, I first get all the runners in that race via the incoming Run edges. I then get every historical race for each runner and, for each of those races, the corresponding competitor horse nodes, to build a full picture of each horse’s history. For connections between horses and historical races, I use Run Outcome edges. This way, the interactions between the runners and the race I am predicting do not include future information (relative to said race).
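
The extraction procedure described above could be sketched roughly as follows, in plain Python. The `runs` list, the `race_dates` mapping and the `build_subgraph` helper are all hypothetical; the strict date cutoff is my assumption about how future information is kept out.

```python
from datetime import date

# Hypothetical run records as (horse_id, race_id) pairs.
runs = [
    ("h1", "r1"), ("h2", "r1"), ("h3", "r1"),
    ("h1", "r2"), ("h4", "r2"),
    ("h1", "r3"), ("h2", "r3"),
    ("h2", "r4"), ("h5", "r4"),
]
race_dates = {
    "r1": date(2023, 1, 7),
    "r2": date(2023, 2, 4),
    "r3": date(2023, 3, 4),  # the race being predicted
    "r4": date(2023, 4, 1),  # happens after the target, must be excluded
}


def build_subgraph(target_race, runs, race_dates):
    cutoff = race_dates[target_race]
    # 1. Runners in the target race (via Run edges).
    runners = {h for h, r in runs if r == target_race}
    # 2. Races any runner took part in strictly before the target race.
    history = {r for h, r in runs if h in runners and race_dates[r] < cutoff}
    # 3. Every horse that ran in those races (the runners plus their rivals).
    horses = {h for h, r in runs if r in history}
    # Run edges carry pre-race info only; Run Outcome edges carry results.
    run_edges = [(h, target_race) for h in sorted(runners)]
    outcome_edges = sorted((h, r) for h, r in runs if r in history)
    return runners, history, horses, run_edges, outcome_edges


runners, history, horses, run_edges, outcome_edges = build_subgraph(
    "r3", runs, race_dates
)
```

Note that the target race never appears in `outcome_edges`, and `r4` is dropped because it is dated after the target.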

I am not sure if this is the best representation of this data universe. Perhaps I shouldn’t sequentially connect races by the Precedes edge and instead, connect them all to the current race being predicted. I am also unsure if framing this as a graph classification problem is correct. Currently, all features are encoded as a matrix features attribute on each node/edge rather than flattened into individual attributes.

I apologise in advance if this is too vague or my ignorance is too salient - but I’d really appreciate some thoughts on this. Thank you.

REFERENCE IMAGE: yellow node is the current prediction race (still of type Race but highlighted to aid visualisation), blue nodes are horses, orange nodes are races, orange edges are Precedes, green edges are Runs, pink edges are Run Outcomes. In this graph, only 4 horses have historical races (1 each.)

Interesting problem. I have a question though.

So do you think the ranking of a horse is dictated by the starting gate? This seems counter-intuitive. I guess the horse’s own attributes should matter more. Starting gate sounds like an irrelevant factor to me at best.

Assuming the horses’ performances are independent of each other (meaning a horse’s ranking wouldn’t be affected by other horses), I would probably model this as a regression problem and use an MLP or XGBoost rather than a GNN, on the assumption that a horse’s finishing time depends only on the features of the horse itself. If your horse features are not good predictors but each horse runs in multiple races, one option is to treat each horse’s performances as a time series and use time series models.

You might also want to check out works such as GNNRank, which recovers a global ranking from pairwise comparisons. There’s a series of works preceding it as well.

Perhaps I need to reframe this. Basically, a horse in a race can be identified by its starting gate, so I am trying to predict which starting gate (i.e. horse) will win. In other words, for each of the N gates: will the horse running from that gate win? The output vector is of size N, for N starting gates and up to N runners in that race, and each element represents the probability of that gate number (i.e. horse) winning. There will always be N starting gates in a race, but some gates may be empty in a given race.
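
One way to turn N per-gate scores into win probabilities while handling empty gates is a masked softmax. This is only a sketch of a common technique, not something stated in the thread; `masked_softmax`, `scores` and `occupied` are illustrative names and the numbers are made up.

```python
import math


def masked_softmax(scores, occupied):
    """Softmax over occupied gates only; empty gates get probability 0."""
    live = [s for s, m in zip(scores, occupied) if m]
    if not live:
        raise ValueError("at least one gate must be occupied")
    top = max(live)  # subtract the max for numerical stability
    exps = [math.exp(s - top) if m else 0.0 for s, m in zip(scores, occupied)]
    total = sum(exps)
    return [e / total for e in exps]


# N = 4 gates, gate 3 (index 2) is empty in this race.
scores = [1.2, 0.3, -0.5, 2.0]
occupied = [True, True, False, True]
probs = masked_softmax(scores, occupied)
```

The probabilities sum to 1 over the occupied gates, so the model is never rewarded or penalised for a gate that has no runner.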

This is surprisingly very much not the case. Let me add some context. I have tried a suite of other approaches (with reasonable success), including XGBoost on historical aggregates to predict the winner among all runners, regressing to finishing time for each horse independently of the other runners and, most recently, an LSTM approach. In the LSTM approach, I had N LSTM heads that shared the same weights; each head took the sequence of all past performances for the corresponding horse. I then concatenated the outputs of these heads and fed them into a multiclass MLP. This approach gave me the best accuracy.

However, the LSTM approach was missing a key factor. Consider a sequence of past performances for a horse. Each element of this sequence is a vector containing (starting gate, finishing time, weight etc.) However, each historical performance vector was independent of competitors. With a graph representation, I am hoping to capture competitor data for historical races. For example, take Historical Race A and Historical Race B that both have an identical finishing time. The LSTM approach would not consider that Historical Race A had much stronger competitors than Historical Race B and therefore, the horse’s performance in Historical Race A > performance in Historical Race B.
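
As a toy illustration of the kind of signal I am hoping the graph picks up, the sketch below computes a "field strength" for each historical run as the mean strength of the other runners in that race. This is roughly what one round of mean aggregation from horse nodes through a race node would expose. The `strength` values, the horse and race IDs and the `field_strength` helper are all made up.

```python
# Hypothetical scalar strength per horse (e.g. a past win rate).
strength = {"hA": 0.10, "s1": 0.60, "s2": 0.50, "w1": 0.05, "w2": 0.15}

# race_id -> horses that ran in it; hA ran the same time in both races.
fields = {
    "race_a": ["hA", "s1", "s2"],  # strong rivals
    "race_b": ["hA", "w1", "w2"],  # weak rivals
}


def field_strength(horse, race, fields, strength):
    """Mean strength of every other runner in the given race."""
    rivals = [h for h in fields[race] if h != horse]
    if not rivals:
        return 0.0
    return sum(strength[r] for r in rivals) / len(rivals)


fs_a = field_strength("hA", "race_a", fields, strength)
fs_b = field_strength("hA", "race_b", fields, strength)
```

With identical finishing times, the higher field strength in `race_a` is what distinguishes the two performances; the per-horse LSTM sequences had no access to it.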

To give an example of competitor interactions: if a race has a small prize pool and Horse B is the strongest competitor, the jockey riding Horse A will deliberately run Horse A more slowly and not try to eke out a very unlikely win. Rather, he’d prefer to come top 3 at a more comfortable pace so the horse is in condition to run the following week. Another example: horses tend to run along the inner circumference of the track, so a fast vs slow horse in the inner gate will impact the other runners who tend towards that line. You also have all sorts of interactions between jockeys, trainers, owners etc., but I want to model those later and keep things simpler for the moment. I hope this brings some clarity. Cheers
