Time Series Importance based Selection and FeatuRe Extraction on basis of Scalable Hypothesis tests.
The algorithm combines feature generation with the [tsfresh] library and feature selection based on feature importance.
On real stock exchange data, this approach improved prediction accuracy by 12% (measured as a reduction in mean RMSE) and made the model 24% faster.
In the first step, the algorithm determines which features of a single time series are useful.
To do this, it generates a large number of statistical features with the tsfresh library.
It then filters them using statistical hypothesis tests and feature importance values.
All values are calculated using a block cross-validation scheme.
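
The text does not spell out the block cross-validation scheme. The following is a minimal sketch of one common reading, assuming time-ordered samples are split into contiguous, unshuffled blocks and each block serves once as the validation fold. `block_cv_splits` and its parameters are hypothetical names, not taken from the repository.

```python
from typing import Iterator, List, Tuple


def block_cv_splits(
    n_samples: int, n_blocks: int
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, valid_idx) pairs; each validation fold is one
    contiguous block of the time-ordered samples (no shuffling)."""
    if not 2 <= n_blocks <= n_samples:
        raise ValueError("need 2 <= n_blocks <= n_samples")
    # Spread the remainder over the first blocks so sizes differ by at most one.
    base, extra = divmod(n_samples, n_blocks)
    start = 0
    for b in range(n_blocks):
        stop = start + base + (1 if b < extra else 0)
        valid_idx = list(range(start, stop))
        # Everything outside the current block is used for training.
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, valid_idx
        start = stop


splits = list(block_cv_splits(n_samples=10, n_blocks=3))
for train_idx, valid_idx in splits:
    print("valid:", valid_idx, "train:", train_idx)
```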
In the second step, the algorithm reuses the set of features selected in the first step. The same features are calculated for all correlated and available time series (other currencies on the exchange). These new features then go through the same two stages of selection: statistical tests, followed by selection based on importance values.
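
The reuse of the selected feature set on related series can be sketched in plain Python. This is an illustration under stated assumptions, not the repository's code: `SELECTED_FEATURES`, `sliding_windows` and `build_feature_table` are hypothetical names, the three calculators are simple stand-ins named after tsfresh feature calculators, and the sliding-window layout is an assumption. The `series__feature` column naming mirrors tsfresh's convention. In tsfresh itself, `tsfresh.feature_extraction.settings.from_columns` can derive extraction settings from the names of already selected columns, which serves a similar purpose. The columns produced here would still need to pass the two selection stages described above.

```python
from statistics import mean, pstdev
from typing import Callable, Dict, List

# Hypothetical result of step one: only these calculators survived selection.
SELECTED_FEATURES: Dict[str, Callable[[List[float]], float]] = {
    "mean": mean,
    "standard_deviation": pstdev,
    "abs_energy": lambda xs: sum(x * x for x in xs),
}


def sliding_windows(series: List[float], size: int) -> List[List[float]]:
    """Return all contiguous windows of the given size, in time order."""
    return [series[i:i + size] for i in range(len(series) - size + 1)]


def build_feature_table(
    related: Dict[str, List[float]], size: int
) -> Dict[str, List[float]]:
    """Compute only the selected features for every related series."""
    table: Dict[str, List[float]] = {}
    for name, series in related.items():
        for feat_name, fn in SELECTED_FEATURES.items():
            table[f"{name}__{feat_name}"] = [
                fn(w) for w in sliding_windows(series, size)
            ]
    return table


# Toy data standing in for other currencies on the exchange.
related_series = {"EUR": [1.0, 2.0, 3.0, 4.0], "GBP": [2.0, 2.0, 2.0, 2.0]}
table = build_feature_table(related_series, size=2)
for column, values in table.items():
    print(column, values)
```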
Compared with using only the target currency's data, this gives a 24% speedup and a 12% increase in accuracy:

| | Time (s) | RMSE (mean) |
|---|---|---|
| only target table | 1.3 | 0.118 |
| with the features of other tables | 10.04 | 0.096 |
| with selected features of other tables | 1.0 | 0.104 |
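
The relative changes can be recomputed from the rounded values in the table. This is only an arithmetic check, and `relative_reduction` is a hypothetical helper. With the rounded timings the time saving comes out near 23%, so the quoted 24% presumably reflects unrounded measurements.

```python
# Values copied from the table above: (time in seconds, mean RMSE).
results = {
    "only target table": (1.3, 0.118),
    "with the features of other tables": (10.04, 0.096),
    "with selected features of other tables": (1.0, 0.104),
}


def relative_reduction(baseline: float, value: float) -> float:
    """Percentage by which `value` is lower than `baseline`."""
    return (baseline - value) / baseline * 100.0


base_time, base_rmse = results["only target table"]
sel_time, sel_rmse = results["with selected features of other tables"]

time_saving_pct = relative_reduction(base_time, sel_time)
rmse_reduction_pct = relative_reduction(base_rmse, sel_rmse)

print(f"time saved: {time_saving_pct:.1f}%")
print(f"RMSE reduced: {rmse_reduction_pct:.1f}%")
```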

The [dataset] contains on the order of several hundred million records.
To reproduce my results, extract it into the data/raw folder and use the .ipynb from /notebooks.
