High-frequency trading (HFT) has transformed the landscape of financial markets. By leveraging sophisticated algorithms and cutting-edge technology, HFT allows traders to execute thousands of transactions within fractions of a second. This rapid trading strategy exploits minute price discrepancies, thereby providing liquidity and efficiency to the markets. The speed and volume of HFT necessitate a deep understanding of the market microstructure, particularly the limit order book (LOB).
In this context, the paper "Benchmark Dataset for Mid-Price Forecasting of Limit Order Book Data with Machine Learning Methods" makes a notable contribution. It introduces the first publicly available benchmark dataset specifically designed for mid-price forecasting in high-frequency limit order markets. This dataset encompasses approximately 4,000,000 time series samples collected from five stocks traded on the NASDAQ Nordic stock market over ten consecutive trading days.
In this article, I address the issues that bothered me the most while working with this dataset. The first is how the training and testing data are structured.
Anchored Forward Cross-Validation Protocol
When dealing with high-frequency trading (HFT) data, ensuring robust evaluation of predictive models is crucial. The authors of the paper implemented a specific technique called the "anchored forward cross-validation protocol" to achieve this. Let's break down this concept in more detail.
Cross-Validation in Machine Learning
In general, cross-validation is a method for assessing how well the results of a statistical analysis generalize to an independent dataset. It is mainly used to detect overfitting and to estimate how the model will perform on unseen data. Traditional k-fold cross-validation partitions the data into k subsets, trains the model on k-1 of them, and tests it on the remaining one. This process is repeated k times, with each subset used exactly once as the test set.
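As a minimal sketch of the k-fold partitioning described above, the following uses plain Python and contiguous, unshuffled folds. The function name `k_fold_indices` and the toy sizes are illustrative assumptions, not something taken from the paper.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test_idx = list(range(start, stop))
        # Everything outside the test fold is used for training.
        train_idx = [i for i in range(n_samples) if i < start or i >= stop]
        yield train_idx, test_idx
        start = stop


# Toy example: 10 samples, 5 folds (hypothetical sizes).
for train_idx, test_idx in k_fold_indices(n_samples=10, k=5):
    print("train:", train_idx, "test:", test_idx)
```

Note that every fold except the last trains on samples that come after the test fold, which is harmless for independent data but not for time series.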
However, time-series data, such as high-frequency trading data, present unique challenges:
- Temporal Dependency: Observations in time-series data are not independent. The order of data points matters because past values influence future values.
- Data Leakage: Using future data to predict past events (even inadvertently) can lead to overly optimistic performance estimates.
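To make the leakage point concrete, here is a small sketch that counts how many training samples lie in the future of the test set. It assumes the samples are indexed in time order; the helper name `future_leak_count` and the index values are hypothetical.

```python
def future_leak_count(train_idx, test_idx):
    """Count training samples that occur after the earliest test sample.

    Assumes indices are ordered in time (a larger index means a later sample).
    """
    first_test = min(test_idx)
    return sum(1 for i in train_idx if i > first_test)


# A k-fold style split on 10 time-ordered samples: test on samples 2-3,
# train on everything else, including samples 4-9 from the "future".
kfold_train = [0, 1, 4, 5, 6, 7, 8, 9]
kfold_test = [2, 3]
print(future_leak_count(kfold_train, kfold_test))    # 6

# A forward split: train only on samples that precede the test window.
forward_train = [0, 1]
forward_test = [2, 3]
print(future_leak_count(forward_train, forward_test))  # 0
```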
Anchored forward cross-validation is a method designed specifically for time-series data to ensure that the model evaluation is realistic and unbiased. Here’s how it works:
- Sequential Data Splitting: The data is split into training and test sets based on time. Unlike traditional cross-validation, where data points are shuffled, this method respects the temporal order.
- Anchoring the Training Set: The training set is anchored, meaning it always starts from the beginning of the time period and grows as more recent data points are added.
- Rolling Window Testing: The test set is a fixed-size window that rolls forward in time. After each evaluation, the window moves forward by a fixed period (e.g., one day), and the training set is expanded to include the most recent data up to that point.
Example Workflow
Consider a dataset spanning 10 days. Here’s a simplified workflow of the anchored forward cross-validation:
Fold 1: Train on data from Day 1, test on Day 2.
Fold 2: Train on data from Day 1 to Day 2, test on Day 3.
Fold 3: Train on data from Day 1 to Day 3, test on Day 4.
...
Fold 9: Train on data from Day 1 to Day 9, test on Day 10.
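The workflow above can be sketched in a few lines of Python. This is only an illustration of the expanding-window idea, not the authors' code: the function name `anchored_forward_splits`, the `day_labels` input, and the toy size of three samples per day are all assumptions made for the example.

```python
def anchored_forward_splits(day_labels):
    """Yield (train_idx, test_idx, test_day) with an expanding training window.

    day_labels[i] is the trading day (1, 2, ...) of sample i.
    Samples are assumed to be sorted in time.
    """
    days = sorted(set(day_labels))
    # The first day can only be used for training, so testing starts on day 2.
    for test_day in days[1:]:
        train_idx = [i for i, d in enumerate(day_labels) if d < test_day]
        test_idx = [i for i, d in enumerate(day_labels) if d == test_day]
        yield train_idx, test_idx, test_day


# Toy data: 10 days with 3 samples each (hypothetical sizes).
day_labels = [day for day in range(1, 11) for _ in range(3)]

for fold, (train_idx, test_idx, test_day) in enumerate(
    anchored_forward_splits(day_labels), start=1
):
    print(
        f"Fold {fold}: train on days 1-{test_day - 1} "
        f"({len(train_idx)} samples), test on day {test_day} "
        f"({len(test_idx)} samples)"
    )
```

With ten days this produces nine train/test pairs, and in every pair all training samples precede all test samples.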
In our dataset, timestamps are expressed in milliseconds since January 1, 1970, and are shifted by three hours relative to Eastern European Time. This means the trading day in the data spans from 7:00 to 15:25. The ITCH feed prices are recorded to four decimal places. However, in our data, the decimal point is removed by multiplying the price by 10,000, so a price of 26.2000 is stored as 262000. The currency is the euro for the Helsinki exchange. The tick size, which is the smallest permissible gap between ask and bid prices, is set at one cent. Additionally, the quantities of orders are constrained to integers greater than one.
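A small sketch of how these raw values could be decoded is shown below. The helper names, the sample values, and in particular the direction of the three-hour shift (adding three hours to the data clock to get exchange time) are my assumptions for illustration; check them against the raw files before relying on them.

```python
from datetime import datetime, timedelta

PRICE_SCALE = 10_000                  # prices are stored as price * 10,000
EPOCH = datetime(1970, 1, 1)          # timestamps count milliseconds from here
EXCHANGE_SHIFT = timedelta(hours=3)   # assumed direction of the three-hour shift


def decode_price(raw_price):
    """Convert a stored integer price back to euros."""
    return raw_price / PRICE_SCALE


def data_clock(raw_ms):
    """Interpret a raw timestamp as the clock used inside the dataset."""
    return EPOCH + timedelta(milliseconds=raw_ms)


def exchange_clock(raw_ms):
    """Assumed exchange-local time: data clock plus three hours."""
    return data_clock(raw_ms) + EXCHANGE_SHIFT


# Hypothetical raw values, built here so no epoch arithmetic is done by hand.
raw_ms = int((datetime(2010, 6, 1, 7, 0) - EPOCH).total_seconds() * 1000)
raw_price = 262000

print(decode_price(raw_price))   # 26.2
print(data_clock(raw_ms))        # 2010-06-01 07:00:00
print(exchange_clock(raw_ms))    # 2010-06-01 10:00:00
```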
I have constructed a snapshot of the limit order book at a specific point in time, which is shown in the picture below.
In the figure below, we see how the prices at the different ask and bid levels evolve over time.
In the next part of the blog, we will dive into advanced visualization methods for limit order books and work with machine learning as well.