Why do some sires have two sire IDs in Smartform

0 votes
Some sires seem erroneously to have two sire_id numbers within Smartform.

For example: "Big Bad Bob" uses sire_ids 254221 and 2238007, Raven's Pass has 679073 and 1879307.

The pattern seems to be that one sire_id has the majority of coverage with only a small handful of references using the other sire_id. I have not seen any instances with more than two sire_ids used for the same sire.
asked Oct 30, 2018 in Smartform by hyoung Novice (410 points)

1 Answer

0 votes
This is due to some inconsistencies in the source data we licensed in labelling IDs over the past 15 years.  We have overcome this in the paste by creating "canonical IDs" based on "collapsing" the names of horses - removing all whitespace, punctuation and converting to lowercase.  Future versions of Smartform will likely see us using internal IDs that remove references to the source IDs, such that we handle the inconsistencies internally.
answered Oct 31, 2018 by colin Frankel (19,280 points)
The problem with the canonical ID approach is you also need a country code to handle situations such as Shamardal (IRE) and Shamardal (USA) both of which have been sires. Any idea about a time line for the "future versions" of Smartform mentioned ? That will help me to determine the amount of effort to invest in my own work-around.
This is available in the bred field, and you can also use foaling date and sire name to disambiguate, although for sires this data can be quite old and therefore availability varies.

In any case, the date for new release will be Q1 2019.
Great to hear about the upcoming release. I hope you'll look at the AW inconsistencies, especially at Lingfield on days when they run a mix of turf and AW which really seems to throw a spanner in the works. I have a script that pretty much sorts out the main inconsistencies with the exception of the Lingfield mixed track meetings. I hope you'll find this useful - it took some pretty detailed analysis and checking against an independent data source ... whoops I just noticed this is outdated because of reference to Great Leighs rather than the now Chelmsford City. Anyway you can still see from this what needs fixing up ...

USE smartform;
SET SQL_SAFE_UPDATES = 0;
# Correct flat races incorrectly tagged as all weather (count=24 on 17 July 2016)
UPDATE historic_races
    SET race_type = "Flat", race_type_id = 12, all_weather = 0
    WHERE race_type_id = 15
    AND going NOT IN ("Slow", "Standard", "Standard to Fast", "Standard to Slow")
    AND course NOT IN ("Dundalk", "Kempton", "Lingfield", "Southwell", "Wolverhampton");
# Correct all weather races incorrectly tagged with flat going (count=412 on 17 July 2016)
UPDATE historic_races
    SET going = "Standard", all_weather = 1
    WHERE race_type_id = 15
    AND going NOT IN ("Slow", "Standard", "Standard to Fast", "Standard to Slow");
# Correct all weather races (excluding Lingfield) incorrectly tagged as flat (count=180 om 17 July 2016)
UPDATE historic_races
    SET race_type = "All Weather Flat", race_type_id = 15, all_weather = 1
    WHERE race_type_id = 12
    AND going IN ("Slow", "Standard", "Standard to Fast", "Standard to Slow")
    AND course IN ("Dundalk", "Great_Leighs", "Kempton", "Laytown", "Newcastle", "Southwell", "Wolverhampton");
# Correct Lingfield races that are incorrectly tagged.  Note Lingfield often holds meetings with a combination of flat and all weather flat races
# Argh cannot do these without correct data on the individual races.
# There are 458 Lingfirld races with race_type_id = 12 and going = Standard.
# Some of these should be Flat, some should be All Weather Flat. Original confusion looks to have been mixed meetings.
# I looked at the all_weather flag - sometimes right sometimes wrong seemingly with no reliable pattern.
SET SQL_SAFE_UPDATES = 1;
Try running this query then scroll through the results:

select sire_name, sire_id, count(*) as race_count from historic_runners group by sire_id order by sire_name, race_count

You will see some sires with more than just two sire_ids. I also noticed other random issues Such as "Refuse To Bend" and "Refuse To  Bend" (two spaces) being treated as separate sires.

I found that most instances had the vast majority of historic_runners entries using one sire_id with only a few runners using the alternatives, for a specific sire_name. So I started working on a script to correct the database based on this assumption. Then I found Rock of Gibraltar has 7 sire ids, with counts of 1, 2, 13, 1043, 8408, 6 and 1. The low counts are ok and obviously correctable. But the 1043 count threw me as unexpected and I am less sure about updating all sire_ids for the same sire_name to be the majority instance, the concern being that I'm unsure which sire_id would be used in the future. Maybe the best thing is to wait for the Jan 2019 release and see how the land lies then. I've been using the data as is for several years, so a few weeks more won't hurt.
Thanks for this.  Ultimately the race type at Lingfield is incorrectly reported in our feeds from time to time which makes it difficult to cater for without inferring from going.  Sometimes this is obvious but sometimes not.  And yes mixed meetings are Lingfield are the worst!  We're aiming to correct for this where possible and will update historic data accordingly.
...