How far back does historic_runners_insights go?

Hey there, new here. Also relatively new to SQL (but not data science) so I might be doing something wrong.

It looks like historic_runners_insights only goes back to 2011? I'm currently looking at Nathaniel's form:

SELECT * FROM historic_runners_insights
WHERE runner_id = 1501842;

And it only goes back to 2011 (his 3 year old season). There is also one NULL row.

Am I doing something wrong? The other beta tables contain his complete form. I tried running the updater too, and his 2010 form is still missing.

If historic_runners_insights is indeed missing data pre-2011, then are there any plans to append the older data?

asked Jun 11, 2021 in Smartform by micjrc Plater (170 points)

1 Answer

The runners_insights tables - historic and daily - do indeed start from Jan 1st 2011.  We'll add some notes to the documentation to make this clearer.

There are a number of reasons for this. The principal ones are:

1.  The insights tables are not the primary source for the raw data. They are derived or engineered features built on top of the raw data, so they should not be treated as equivalent to the beta or original runners and races tables. All the source data tables do indeed begin from 2003.

2. Since all the engineered features rely on aggregating previous form, it is useful to include a lag between the start date of the raw data and the beginning of the derived features.

3. Racing changed in a number of ways between pre and post 2011, e.g. changes in stalls numbering on right handed courses, which means the data is more consistent and useful for modelling post 2011.

4.  The insights tables exist to make it easier to start on modelling / system building / machine learning tasks without building every feature from scratch. That requires enough history for training and testing, with consistent data and fully populated derived features. 10 years' history is adequate for this, and going back before 2011 arguably means we are no longer making useful comparisons, given the way racing changes over time. (Users can of course create their own features from the raw data for any time period needed.)

5.  We plan to add more derived features in future, where a lag before the start of the insights tables will be even more important, e.g. trainer strike rate over the last 5 years. So 2011 remains a useful starting point for ensuring the data is populated consistently for modelling purposes.
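
To make the lag argument in points 2 and 5 concrete, here is a minimal Python sketch. It is only an illustration, not how the insights tables are built: the function name is hypothetical, and the exact start day of the raw data (taken here as 1 January 2003) is an assumption, since only the year is given above.

```python
from datetime import date

def first_full_window_date(raw_start: date, lookback_years: int) -> date:
    """Earliest date on which a feature looking back `lookback_years`
    has a complete window of raw data behind it."""
    target_year = raw_start.year + lookback_years
    try:
        return raw_start.replace(year=target_year)
    except ValueError:
        # raw_start is 29 Feb and the target year is not a leap year
        return raw_start.replace(year=target_year, day=28)

# Assumption: raw data starts on 1 Jan 2003 (only the year is stated above).
raw_start = date(2003, 1, 1)
print(first_full_window_date(raw_start, 5))
```

Any row dated before the returned date would have its 5-year feature computed from a shorter window than later rows, which is the kind of inconsistency a lag avoids.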

Hope this helps!
answered Jun 12, 2021 by colin Frankel (19,320 points)
Note that we have just extended historic_runners_insights to begin from 2008-03-01, the same date that the daily_runners table begins.

We have also added a new field, normalized_stall.  For right-handed courses, this converts stall numbers from before the numbering was reversed in 2011 to the system used post 2011.  For example, the highest stall at Sandown before 2011 becomes the lowest stall (1).  Additionally, empty stalls are compressed so that the numbering indicates the ranking of horses in stalls, not just the stall numbering.  For example, if stalls 1, 3 and 5 are occupied but the horses drawn in stalls 2 and 4 are non-runners, then the stall numbering becomes 1, 2 and 3.
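
The two steps described for normalized_stall (reversal, then compression of empty stalls) can be sketched in Python as below. This is an illustration under assumptions, not the actual Smartform implementation: the function name and the `reverse` flag are hypothetical, and deciding which races need reversing (right-handed courses before the 2011 change) is left to the caller.

```python
def normalize_stalls(occupied_stalls, reverse=False):
    """Map each occupied stall number to its rank among the occupied stalls.

    reverse=True flips the order first, so the highest occupied stall
    becomes 1 (hypothetical flag for races under the old numbering).
    """
    ordered = sorted(set(occupied_stalls), reverse=reverse)
    # Ranking the sorted stalls from 1 upwards removes the gaps
    # left by non-runners.
    return {stall: rank for rank, stall in enumerate(ordered, start=1)}

# Stalls 2 and 4 are empty, so 1, 3, 5 become 1, 2, 3.
print(normalize_stalls([1, 3, 5]))
# Same field under the old numbering: stall 5 becomes 1.
print(normalize_stalls([1, 3, 5], reverse=True))
```

Sorting in descending order and then ranking gives the same result as reversing the numbering first and compressing afterwards, so a single pass covers both steps.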

Download and install historic_runners_insights and daily_runners_insights to benefit from the changes and automated updates.
...