Data Filtering: Daily Data vs Tick Data
22 June 2022
In this article, we focus on high-frequency data filtering. What is the main difference between daily data and tick data?
When running a trading strategy, having valid, cleaned data is mandatory. Any error in your sample can cause severe losses once the strategy runs in production. For these reasons, when dealing with financial data, most of the time is spent cleaning the samples rather than exploiting them.
Let’s take the example of a trader who already owns a working strategy fueled by daily data. As the use of tick data is growing, this user would like to switch from daily to tick data and adapt their strategy. Is it really that easy to swap daily data for tick data?
Daily Data vs Tick Data
As with daily data, an important cleaning and filtering step is necessary before running the strategy. However, you cannot simply apply to tick samples the same filters you used on daily samples. Most of the reasons lie in the structure of tick data.
First, the volumes involved are on a different scale: with tick data, some instruments can reach up to half a million ticks per day. Moreover, contrary to daily data, the average volume varies from one instrument to another. In particular, the number of ticks often appears directly related to the company’s capitalization. It is not surprising to see around 500,000 ticks per day for the largest caps on the Nasdaq 100, while the smallest caps on that index “only” average around 50,000 ticks per day.
| Name | Apple Inc | Micron Technology | Comcast Corp A |
|---|---|---|---|
| Capitalization | 2320 $B | 77.7 $B | 0.606 $B |
| Average Trade Ticks (last 300 days) | 675,364 | 171,473 | 113,942 |
Average Ticks (Trades) for AAPL, MU and CMCSA. Retrieved from ICE DataVault and processed on Ganymede API
Given these large tick counts and the variability from one asset to another, the memory required to process the data can grow so large that an event-driven approach becomes necessary: processing ticks one at a time rather than loading a whole data frame at once.
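The event-driven idea can be illustrated with a minimal sketch. The tick tuples and field layout below are illustrative assumptions, not the actual ICE DataVault schema; the point is that the consumer holds only its running state in memory, never the full day of ticks.

```python
from collections import defaultdict
from typing import Iterator, Tuple

def tick_stream() -> Iterator[Tuple[str, float, int]]:
    """Stand-in for a real tick feed: yields (symbol, price, size)
    one event at a time instead of materializing a full data frame."""
    sample = [("AAPL", 168.9, 100), ("MU", 91.2, 50), ("AAPL", 169.0, 30)]
    yield from sample

def count_ticks(stream: Iterator[Tuple[str, float, int]]) -> dict:
    """Event-driven consumer: one pass, constant memory per instrument."""
    counts: dict = defaultdict(int)
    for symbol, _price, _size in stream:
        counts[symbol] += 1
    return dict(counts)

counts = count_ticks(tick_stream())  # {"AAPL": 2, "MU": 1}
```

The same pattern (a callback or generator consuming one event at a time) scales to the half-million-tick days mentioned above, since memory no longer grows with the number of ticks.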
Another important difference between daily and tick data is the information contained within the data. While daily data carries only a few, mostly numerical, fields, tick data comes with several additional fields describing the tick’s context (trade ID, trade condition, ...).
These differences show that extra caution is required when swapping daily data for tick data: the latter carries far more information and volume, and therefore many more potential sources of errors. Any data filtering and cleaning workflow used for daily data will likely require a more refined approach for tick data.
To illustrate this idea, let’s return to the situation presented in the introduction. The user needs to compute a VWAP on a stock for a specific day. To fix ideas, we’ll use the input parameters below:
- Instrument: AAPL / XNGS (Nasdaq primary exchange)
- Analytics required: VWAP, small number of shares
- Time Constraints: 2022-02-15 from 8am to 8pm
When querying our API with the parameters specified above, the VWAP service returns the trades and VWAP represented below:
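For reference, the VWAP over a window is the volume-weighted average price, i.e. the total traded notional divided by the total traded volume. A minimal sketch (the trade pairs below are made up for illustration, not actual AAPL data):

```python
def vwap(trades):
    """VWAP over an iterable of (price, size) pairs:
    sum(price * size) / sum(size)."""
    notional = sum(price * size for price, size in trades)
    volume = sum(size for _price, size in trades)
    if volume == 0:
        raise ValueError("no volume in window")
    return notional / volume

result = vwap([(100.0, 50), (101.0, 150)])  # 20150.0 / 200 = 100.75
```

In practice the service computes this server-side over the raw tick feed; the sketch only fixes the definition the rest of the example relies on.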
VWAP for AAPL/XNGS – 2022-02-15 from 8:00 to 20:00. Computed on Ganymede API. Raw data from ICE DataVault
The first observation from this chart is that some points seem out of trend and may need to be filtered out. To be clear, by “filtering” the tick data sample we only mean narrowing its scope to better fit the context of our strategy; in no way do we imply that these points are errors, as the market feed is always right.
Naturally, when confronted with such output, we might directly apply a set of numerical filters (moving average, Kalman filter, ...) to clean the raw output, as we would probably do for a daily data sample. However, rushing into numerical filters would waste the information contained in tick data. For instance, since our scenario concerns small-size trades, it may be a better solution to filter the data according to that context.
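As an illustration of the numerical-filter route, a trailing moving average over tick prices could look like the sketch below. The window length is an arbitrary choice for the example, not a recommendation:

```python
from collections import deque

def moving_average(prices, window=5):
    """Trailing moving average: each output is the mean of the
    last `window` prices seen so far (fewer at the start)."""
    buf = deque(maxlen=window)  # bounded buffer, old prices fall off
    out = []
    for p in prices:
        buf.append(p)
        out.append(sum(buf) / len(buf))
    return out

# A spike at 250.0 gets smoothed toward its neighbors:
smoothed = moving_average([100.0, 100.2, 250.0, 100.1], window=2)
```

Note that such a filter treats every tick identically, which is exactly the limitation discussed above: it ignores the contextual fields that tick data provides.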
Our API lets us query tick data with several filters. We can therefore add to our previous VWAP request a filter on trade conditions, whitelisting only Odd Lot Trades (defined as trades of fewer than 100 shares). We display the previous values alongside the new output to compare the differences.
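Conceptually, the condition-based filter amounts to keeping only trades below the round-lot threshold. The sketch below does this client-side on illustrative (price, size) pairs; the actual API applies the trade-condition filter server-side within the VWAP request:

```python
ODD_LOT_THRESHOLD = 100  # trades of fewer than 100 shares are odd lots

def odd_lots(trades):
    """Keep only odd-lot trades from (price, size) pairs."""
    return [(price, size) for price, size in trades
            if size < ODD_LOT_THRESHOLD]

trades = [(168.9, 100), (169.2, 7), (168.8, 500), (169.0, 42)]
filtered = odd_lots(trades)  # keeps the 7- and 42-share trades
```

The filtered subset can then be fed to the same VWAP computation as before, which is exactly the comparison shown in the chart below.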
VWAP for AAPL/XNGS – 2022-02-15 from 8am to 8pm. All Trade Conditions vs Odd Lot Trades Only. Computed on Ganymede API. Raw data from ICE DataVault
At first sight, many of the “off-chart” ticks are gone (notably at 13:38 and 18:45). It is also worth noting that the number of ticks used to build the VWAP was almost halved (from 438,087 to 252,910), which means that, zooming in on any part of the graph, we have most likely filtered out many marginal values in the same way.
In this use case, filtering the data by tick condition serves multiple purposes: it removes many off-chart values in a first pass without applying any numerical filter, and it discards all ticks other than Odd Lot Trades, slimming the dataset so that it better fits the context of the strategy. If necessary, a numerical filter, as would typically be used in a daily data use case, can then be run in a second pass to remove the remaining off-chart points.
This brief example alone gives a glimpse of the complexity and richness of tick data, and shows that it is unwise to apply to tick samples the same transforms that work for daily samples. In our example, the “spikes” in the chart would probably have gone under the radar in a daily data use case. Because of the high frequency of tick data, the volume variations and differing context from one instrument to another, and the specifics of each use case, data pre-processing must be tailored to each dataset.