DEV Community

DolphinDB
DolphinDB

Posted on

Introduction to backtesting strategy: Historical data replay in DolphinDB

In DolphinDB, we can import historical data into a stream table in chronological order as “real-time data” so that the same script can be used both for backtesting and real-time trading. Regarding streaming in DolphinDB please refer to DolphinDB Streaming Tutorial .

This article introduces functions replay and replayDS and then demonstrates the process of data replaying.

Image description

1. Functions

replay

replay(inputTables, outputTables, [dateColumn], [timeColumn], [replayRate], [absoluteRate=true], [parallelLevel=1])
Enter fullscreen mode Exit fullscreen mode

Function replay injects data from specified tables or data sources into stream tables.

  • ‘inputTables’ is a table or a tuple. Each element of the tuple is an unpartitioned table or a data source generated by function replayDS.
  • ‘outputTables’ is a table or a tuple of tables, or a string or a string vector. The number of elements of outputTables must be the same as the number of elements of inputTables. If it is a vector, it is a list of the names of the shared stream tables where the replayed data of the corresponding tables of inputTables are saved. If it is a tuple, each element is a shared stream table where the replayed data of the corresponding table in inputTables are saved. The schema of each table in outputTables must be identical as the schema of the corresponding table in inputTables.

  • ‘dateColumn’ and ‘timeColumn’ are strings indicating the date column and time column in inputTables. If neither is specified, the first column of the table is chosen as ‘dateColumn’. If there is a ‘dateColumn’, it must be one of the partitioning columns. If only ‘timeColumn’ is specified, it must be one of the partitioning columns. If information about date and time comes from the same column (e.g., DATETIME, TIMESTAMP), use the same column for both ‘dateColumn’ and ‘timeColumn’. Data are replayed in batches determined by the smallest unit of time in ‘timeColumn’ or ‘dateColumn’ if ‘timeColumn’ is not specified. For examples, if the smallest unit of time in ‘timeColumn’ is second then all data in the same second are replayed in the same batch; if ‘timeColumn’ is not specified, then all data in the same day are replayed in the same batch.

  • ‘replayRate’ is a nonnegative integer indicating the number of rows to be replayed per second. If it is not specified, it means data are replayed at the maximum speed.

  • ‘replayRate’ is an integer.

  • ‘absoluteRate’ is a Boolean value. The default value is true.

Regarding ‘replayRate’ and ‘absoluteRate’:

(1) If ‘replayRate’ is a positive integer and absoluteRate=true, replay at the speed of ‘replayRate’ rows per second.

(2) If ‘replayRate’ is a positive integer and absoluteRate=false, replay at ‘replayRate’ times the original speed of the data. For example, if the difference between the maximum and the minimum values of ‘dateColumn’ or ‘timeColumn’ is n seconds, then it takes n/replayRate seconds to finish the replay.

(3) If ‘replayRate’ is unspecified or negative, replay at the maximum speed.

  • ‘parallelLevel’ is a positive integer. When the size of individual partitions in the data sources is too large relative to memory size, we need to use function replayDS to further divide individual partitions into smaller data sources. 'parallelLevel' indicates the number of threads loading data into memory from these smaller data sources simultaneously. The default value is 1. If 'inputTables' is a table or a tuple of tables, the effective 'parallelLevel' is always 1.

replayDS

replayDS(sqlObj, [dateColumn], [timeColumn], [timeRepartitionSchema])
Enter fullscreen mode Exit fullscreen mode

Function replayDS generates a group of data sources to be used as the inputs of function replay. It splits a SQL query into multiple subqueries based on 'timeRepartitionSchema' with 'timeColumn' within each 'dateColumn' partition.

  • ‘sqlObj’ is a table or metacode with SQL statements (such as ) indicating the data to be replayed. The table object of “select from” must use a DATE type column as one of the partitioning columns.
  • ‘dateColumn’ and ‘timeColumn’ are strings indicating the date column and time column. If neither is specified, the first column of the table is chosen as ‘dateColumn’. If there is a ‘dateColumn’, it must be one of the partitioning columns. If only ‘timeColumn’ is specified, it must be one of the partitioning columns. If information about date and time comes from the same column (e.g., DATETIME, TIMESTAMP), use the same column for both ‘dateColumn’ and ‘timeColumn’. Function replayDS and the corresponding function replay must use the same set of 'dateColumn' and 'timeColumn'.
  • ‘timeRepartitionSchema’ is a TIME or NANOTIME type vector. ‘timeRepartitionSchema’ deliminates multiple data sources on the dimension of ‘timeColumn’ within each ‘dateColumn’ partition. For example, if timeRepartitionSchema=[t1, t2, t3], then there are 4 data sources within a day: [00:00:00.000,t1), [t1,t2), [t2,t3) and [t3,23:59:59.999).

Replay a single in-memory table

replay(inputTable, outputTable, `date, `time, 10)
Enter fullscreen mode Exit fullscreen mode

Replay a single table using data sources

To replay a single table with a large number of rows, we can use function replayDS together with function replay. Function replayDSdeliminates multiple data sources on the dimension of 'timeColumn' within each 'dateColumn' partition. Parameter 'parallelLevel' of functionreplay` specifies the number of threads loading data into memory from these smaller data sources simultaneously. In this example, 'parallelLevel' is set to 2.

Image description

Replay multiple tables simultaneously using data sources

To replay multiple tables simultaneously, assign a tuple of these table names to parameter ‘inputTables’ of function replay and specify the output tables. Each of the output tables corresponds to an input table and should have the same schema as the corresponding input table. All input tables should have identical 'dateColumn' and 'timeColumn'.

Image description

Cancel replay

If function replay was called with submitJob, we can use getRecentJobs to get jobId, then cancel the replay with command cancelJob.

Image description

If function replay was called directly, we can use getConsoleJobs in another GUI session to get jobId, then cancel the replay use command cancelConsoleJob.

Image description

2. How to use replayed data

Replayed data are streaming data. We can subscribe to and process the replayed data in the following 3 ways:

  • Subscribe in DolphinDB. Write user-defined functions in DolphinDB to process streaming data.
  • Subscribe in DolphinDB. To conduct real-time calculations with streaming data, use DolphinDB’s built-in streaming aggregators such as time-series aggregator, cross-sectional aggregator and anomaly detection engine. They are very easy to use and have excellent performance. In section 3.2, we use a cross-sectional aggregator to calculate the intrinsic value of an ETF.
  • With third-party client through DolphinDB’s streaming API.

3. Examples

Replay level 1 stock quotes to calculate ETF intrinsic value

In this example, we replay the level 1 stock quotes in US stock markets on 2007/08/17, and calculate the intrinsic value of an ETF with the built-in cross-sectional aggregator in DolphinDB. The following are the schema of the input table ‘quotes’ and a preview of the data.

Image description

Image description

Image description

Image description

(1) To replay a large amount of data, if we load all data into memory first, we may have an out-of-memory problem. We can first use function replayDS and specify parameter 'timeRepartitionSchema' to divide the data into 60 parts based on the column 'time'.

Image description

(2) Define the output stream table ‘outQuotes’.

Image description

(3) Define a dictionary for the ETF components weights and function etfVal to calculate ETF intrinsic value. For simplicity we use an ETF with only 6 component stocks.

Image description

(4) Define a streaming aggregator to subscribe to the output stream table ‘outQuotes’. We specify a filtering condition for the subscription that only data with stock symbols of AAPL, IBM, MSFT, NTES, AMZN or GOOG are published to the aggregator. This significantly reduces unnecessary network overhead and data transfer.

Image description

(5) Start to replay data at the specified speed of 100,000 rows per second. The streaming aggregator conducts real-time calculation with the replayed data.

Image description

(6) Check ETF intrinsic values

Image description

Image description

4. Performance testing

We tested data replaying in DolphinDB on a server with the following configuration:

  • Server: DELL PowerEdge R730xd
  • CPU: Intel Xeon(R) CPU E5–2650 v4(24cores, 48 threads, 2.20GHz)
  • RAM: 512 GB (32GB × 16, 2666 MHz)
  • Harddisk: 17T HDD (1.7T × 10, read speed 222 MB/s, write speed 210 MB/s)
  • Network: 10 Gigabit Ethernet

DolphinDB script:

Image description

When replaying at maximum speed (parameter ‘replayRate’ is not specified) and the output table is not subscribed, it only takes about 100 seconds to replay 336,305,414 rows of data.

https://github.com/dolphindb/Tutorials_EN/blob/master/historical_data_replay.md?source=post_page-----497e24af596d--------------------------------

Top comments (0)