Fellow researchers - anyone interested in these topics ?

–1 vote
I've been researching the historic database for a few weeks now and would like to find out if there are members out there who have met with the same areas of interest or difficulty as I have and would like to discuss their or my findings.

I'll outline 3 topics :

1) Data quality.  Some of the data, especially from 2003 to early 2007, is incomplete or incorrect.  So far I have corrected the allweather flag and the race age range among others and am in the process of reconstructing the form figures ( hairy sql here ! ).  GIGO still applies !

2) Technical. ( I'm using SQLworkbench - quite a good tool ).  So far I'm managing without a PL/SQL equivalent but have bumped up against the MySQL restriction of not allowing reference to an update table in the subquery ( ridiculous ! ).  I have found and used a ( pretty sordid ) solution to this if ayone is interested.

3) Ideas for profit.  Naturally I've tested all the conventional queries which, predictably, make a loss. I've one or two new ideas, I think, but too few.  ( For instance, does the size of the gap between the favourite's price and the next make a difference? - no, it doesn't ).

Talk to you soon ?
asked May 31, 2015 by delphi Plater (130 points)
edited May 31, 2015 by delphi

1 Answer

0 votes
Hi - some answers/ observations:

1.  Yes - we supply data "as is" from the feeds we have contracted for license purposes, and are aware of the issues especially pre-2007 (though should point out that these issues tend to be things like flags for all weather rather than anything fundamental such as result, going etc).   We are in the process of correcting these on a permanent basis, though are talking to source data providers to do so rather than fixing what we think is the case (the truth is that all racing data has some errors).  Having said that if there are particular issues you are aware of that you would like to see addressed please do use the http://answers.betwise.net/feedback address.

2.  Typically users are doing some scripting (Perl, Python, R etc) to bash the data into the shape that they want as well as using SQL alone.  SQLWorkbench, Heidi and (our favourite) SequelPro are all good tools on the SQL front though.

3.  Though some conventional queries on the raw data can make a profit, there are very few or they throw up few contenders.  The key to using the raw data to look for profitable angles is to start looking at creating derived variables from the raw data, then use these derived variables in combination with the conventional data.

For example, try creating rolling trainer strike rates (overall, by course, by distance and by race type) or sire statistics (ditto on categories) within a separate table that can be queried in combination with the raw data.  Also, profit from some of the unique data elements that are in Smartform to do this - such as betting movements and betting ranking.

Hope this helps
answered May 31, 2015 by colin Frankel (19,280 points)
edited May 31, 2015 by colin
Thank you, Colin, for your very interesting reply.  

I particularly like your approach of new tables of derived variables which hadn't really occurred to me yet, although I have added my own variables in order to perhaps 'improve' the existing data. For instance, in my view the actual age range of the horses running in a race is more valid than the stipulated range for the race so I have added that.

I think you are right in that some corrections are a matter of judgement, whereas some are straightforward.  An example of the latter would be the incorrect foaling dates and ages which can sometimes be deduced.

Off to have a look at SequelPro now, dredging up more of my old Oracle knowledge ....

Cheers
Colin's completely right that all racing data contains some errors. You can only ever trap and filter out so much of it, and at some level you have to work around it.

For example, I tend to use a horse's 2nd best speed rating, rather than its absolute best speed rating, when I need a speed figure to use in analysis. This tends to be a more stable and useful number and gets rid of the 'outlier' problem where a horse has an unrepresentatively good top number that's a product of either a data or computation error.

I actually make very, very few corrections to the data itself and these are invariably for analytical ease rather than because the data is factually 'wrong'. Examples:
- renaming the meeting "Royal Ascot" to plain "Ascot" so analysis doesn't treat it as a separate venue
- changing the trainer ID of Richard Hannon senior to match that of Richard Hannon junior so that the stable can be analysed consistently across multiple seasons
...