
The results of the competition are now official and the winners have been determined, but the game is still on. Moreover, some solutions have been published, which gives more possibilities to improve the first, very basic solution from the previous post.
There are many things to do with the original script and many ideas to implement, essentially:
- trying different models (other than the OLS method; in published solutions I’ve seen XGBoost, Huber regression, weighted linear regression, LSTM, RNN, etc.)
- feature engineering (which breaks down into three categories: “by ID”, “by product”, “by date”)
- separating volatilities data from returns data and treating them separately
- ensembling / stacking
The different models I’ve tried include weighted linear regression (as in this baseline solution: https://github.com/FrancoisPierre/CFM/blob/master/Starting_kit.ipynb), XGBoost and Huber regression (both mentioned in this solution: http://datachallenge.cfm.fr/t/proposed-solution/105).
For the last two, I am surely not tuning them correctly, as my results are far from those mentioned in the article, and even quite far from weighted linear regression, for that matter.
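To make the difference between these linear models concrete, here is a minimal sketch on synthetic data (not the competition data). It assumes only NumPy; the function names `weighted_least_squares` and `huber_irls`, the synthetic coefficients and the injected outliers are all made up for illustration. The Huber fit is a hand-rolled iteratively reweighted least squares loop used as a stand-in, not the implementation from the published solutions (libraries such as scikit-learn ship their own `HuberRegressor`).

```python
import numpy as np

def weighted_least_squares(X, y, w):
    """Minimise sum_i w_i * (y_i - b0 - x_i @ b)^2 and return (b0, b)."""
    A = np.column_stack([np.ones(len(y)), X])  # prepend an intercept column
    sw = np.sqrt(w)                            # scaling rows by sqrt(w) turns WLS into OLS
    beta, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return beta[0], beta[1:]

def huber_irls(X, y, delta=1.35, n_iter=50):
    """Huber-style robust fit via iteratively reweighted least squares (illustrative)."""
    w = np.ones(len(y))
    for _ in range(n_iter):
        b0, b = weighted_least_squares(X, y, w)
        r = y - (b0 + X @ b)
        # Robust residual scale from the median absolute deviation.
        scale = np.median(np.abs(r - np.median(r))) / 0.6745 + 1e-12
        u = np.abs(r) / scale
        # Weight 1 for small residuals, delta / u for large ones.
        w = delta / np.maximum(u, delta)
    return b0, b

# Synthetic data: a linear signal plus noise, with 5% of targets shifted upwards.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
true_coef = np.array([0.5, -0.2, 0.1])        # hypothetical coefficients
y = X @ true_coef + 0.1 * rng.normal(size=500)
y[:25] += 5.0                                  # injected outliers

# Plain OLS is the special case where every weight equals 1.
ols_b0, ols_b = weighted_least_squares(X, y, np.ones(len(y)))
hub_b0, hub_b = huber_irls(X, y)

print("OLS   intercept, coefs:", ols_b0, ols_b)
print("Huber intercept, coefs:", hub_b0, hub_b)
```

Any non-negative weight vector can be passed to `weighted_least_squares`; which weights make sense for the competition data depends on the scheme used in the baseline notebook, which is not reproduced here.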
For now the returns are completely dropped from the prediction (with them, the prediction for this model is way worse than with volatilities alone).
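Dropping the returns amounts to selecting only the volatility columns before fitting. A minimal sketch, assuming the columns can be told apart by a name prefix; the `vol_` / `ret_` names and the helper `select_columns` are hypothetical and do not come from the actual dataset:

```python
import numpy as np

# Hypothetical column names; the real dataset may label its columns differently.
columns = ["vol_0", "vol_1", "ret_0", "ret_1"]
data = np.arange(12, dtype=float).reshape(3, 4)  # 3 toy rows, 4 columns

def select_columns(data, columns, prefix):
    """Return the sub-matrix (and names) of columns whose name starts with prefix."""
    idx = [i for i, name in enumerate(columns) if name.startswith(prefix)]
    return data[:, idx], [columns[i] for i in idx]

# Keep only volatility features; return features are left out of the model input.
X_vol, vol_names = select_columns(data, columns, "vol_")
print(vol_names, X_vol.shape)
```

Keeping the selection in one helper also makes it easy to build the return-only matrix later if the two groups end up being modelled separately.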
Finally, the most important part is the generation of new features based on the “ID” dimension. The “basic” ones are: