The runners_insights tables - historic and daily - do indeed start from Jan 1st 2011. We'll add some notes to the documentation to make this clearer.
There are a number of reasons for this. The principal ones are:
1. The insights tables are not the primary source for the raw data. They contain derived (engineered) features built on top of the raw data, so they should not be treated as equivalent to the beta or original runners and races tables. All the source data tables do indeed begin from 2003.
2. Since all the engineered features rely on aggregating previous form, it is useful to include a lag between the start date of the raw data and the beginning of the derived features.
3. A number of things changed in racing between the pre-2011 and post-2011 periods, e.g. stalls numbering on right-handed courses, so data from 2011 onwards is more consistent and useful for modelling.
4. The purpose of the insights tables is to give an easier start with modelling / system building / machine learning tasks, without having to build all the features from scratch. That requires enough history for training and testing, while keeping the data consistent and the derived features fully populated. 10 years' history is adequate for this, and arguably going back before 2011 means we are no longer making useful comparisons, given the way racing changes over time. (Users can of course create their own features using any of the raw data for any time period needed.)
5. We plan to add more derived features in future where a lag prior to the start of the insights tables will be even more important, e.g. trainer strike rate over the last 5 years, so 2011 remains a useful starting point for ensuring the data is populated consistently for modelling purposes.
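To illustrate points 2 and 5, here is a rough sketch in Python of why a look-back feature needs a lag after the start of the raw data. It is not how the insights tables are actually built. The sample rows, the tuple layout, the exact raw start date of 1 Jan 2003, and the function names trailing_strike_rate and window_fully_covered are all made up for the example.

```python
from datetime import date, timedelta

# Hypothetical raw runs: (race_date, trainer, won). Not the real table schema.
RAW_RUNS = [
    (date(2003, 3, 1), "Trainer A", True),
    (date(2005, 6, 10), "Trainer A", False),
    (date(2009, 8, 20), "Trainer A", True),
    (date(2010, 9, 5), "Trainer A", False),
    (date(2011, 1, 15), "Trainer A", True),
]

RAW_START = date(2003, 1, 1)      # assumption: raw data starts on 1 Jan 2003
WINDOW = timedelta(days=5 * 365)  # approximate 5-year look-back


def trailing_strike_rate(runs, trainer, as_of, window=WINDOW):
    """Win rate over the trainer's runs strictly before as_of, within the window.

    Returns None when there are no earlier runs in the window.
    """
    start = as_of - window
    prior = [won for d, t, won in runs if t == trainer and start <= d < as_of]
    if not prior:
        return None
    return sum(prior) / len(prior)


def window_fully_covered(as_of, raw_start=RAW_START, window=WINDOW):
    """True if the whole look-back window falls inside the raw data."""
    return as_of - window >= raw_start


# A race in 2005 only has about two years of raw history behind it,
# so a 5-year feature would be based on a truncated window.
early = date(2005, 6, 10)
# From 2011 onwards the full 5-year window is available.
later = date(2011, 1, 15)

early_covered = window_fully_covered(early)
later_covered = window_fully_covered(later)
later_rate = trailing_strike_rate(RAW_RUNS, "Trainer A", later)
```

With these made-up rows, the feature for the 2011 race is computed from a complete 5-year window, while the same feature for the 2005 race would quietly be based on less history. Starting the derived tables some years after the raw data avoids mixing the two cases.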
Hope this helps!