Why does Market Data Quality matter for your business?
03 October 2022
Is there a way to identify high-quality market data providers? This article presents the results of a comparison of multiple data providers and highlights the few trends we can draw from it.
When dealing with trading strategies, risk management and regulatory reporting, market data quality is one of the key elements. Regardless of the data type, poor-quality data often leads to poor results and thus to a loss of money. To prevent such outcomes, a typical data scientist might spend more time cleaning and validating data than processing it.
Financial data can be acquired in several ways: directly from a market data provider, through resellers, or simply from public websites. Although we might all agree that quality varies a lot from one source to another, every market data provider will naturally claim to deliver reliable data.
This leads us to ask the following questions: how do we compare two providers of data? Can we blindly trust providers that are widely acknowledged as “good” in terms of reputation? Is there a way to rate them based on the accuracy and consistency of the data they offer? Does the quality of data vary, for a given provider, from one type of data to another?
Keywords: daily data, clustering, data quality, financial data provider, multidimensional scaling, aggregation
Table of contents
- Presentation of the market data consolidation workflow
- Building trust through comparison of providers
- From collection to aggregated data
- Focus on comparison process
- Conditions of observation
- Focus on a basic case: US daily prices
- General observations
- Spatial representation
- Multiple depth comparison
- Comparison with other data types
- Closing thoughts
- See more
Presentation of the market data consolidation workflow
Building trust through comparison of providers
The first rule of thumb is not to trust anyone. Rather than directly using the provider with the highest reputation that we have access to, we prefer to apply a precautionary principle: we believe that confronting different sources is the surest way to build a quality indicator and obtain accurate data.
Figure 1: Trust building through confrontation of different market data providers
For these reasons, we collect market data from as many providers as possible. Since we focused on daily data, the types of information collected will be bars, prices, splits and dividends. The data feeds might also be of different types, such as data providers, market exchange websites, financial websites, etc.
Consequently, our expectations on the data quality might vary highly from one source to another, and from one type to another within the same source.
The data we considered here is daily data rather than tick/intraday data, for several reasons:
- The volume of daily data is much lighter than with tick data, where you can quickly become overwhelmed with terabytes of files. This allows us to compare the data in a convenient amount of time
- It is easier to find, which means we’ll have more samples to compare for a better study
Nevertheless, the comparisons we have made for daily data would also be transferable to tick data.
From collection to aggregated data
To provide a clearer understanding of our observations, this part briefly explains the overall idea of the financial data consolidation process. The main steps of our workflow are defined as follows:
- Collection: collecting financial data from various providers and extracting timeseries
- Filtering: each timeseries runs through a set of rules that check whether it is valid on its own: for instance, no duplicate dates, no textual values in numerical fields, a coherent ordering of bar high and low, no large gap between dates, etc.
- Comparison: comparing timeseries and generating clusters (and thus a rating).
- Aggregation: decision process using clusters and other customizable parameters in order to “merge” the timeseries.
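The filtering step listed above can be illustrated with a minimal sketch. It is not the actual rule set used in the workflow: the field names, the `validate_bars` helper and the `max_gap_days` limit are assumptions made for the example.

```python
from datetime import date

def validate_bars(bars, max_gap_days=10):
    """Return a list of issues found in one provider's daily bars.

    `bars` is a list of dicts with keys date, open, high, low, close.
    `max_gap_days` is a hypothetical limit for the "no large gap" rule.
    """
    issues = []
    seen = set()
    previous = None
    for bar in sorted(bars, key=lambda b: b["date"]):
        day = bar["date"]
        # Rule 1: no duplicate dates.
        if day in seen:
            issues.append(f"{day}: duplicate date")
        seen.add(day)
        # Rule 2: numerical fields must hold numbers, not text.
        fields = [bar[k] for k in ("open", "high", "low", "close")]
        if not all(isinstance(v, (int, float)) and not isinstance(v, bool)
                   for v in fields):
            issues.append(f"{day}: non-numeric value")
        # Rule 3: high must be the largest field and low the smallest.
        elif bar["high"] < max(fields) or bar["low"] > min(fields):
            issues.append(f"{day}: high/low inconsistent")
        # Rule 4: no large gap between consecutive dates.
        if previous is not None and (day - previous).days > max_gap_days:
            issues.append(f"{day}: gap of {(day - previous).days} days")
        previous = day
    return issues

# Hypothetical sample: a duplicated row, then an inverted bar after a long gap.
clean_bar = {"date": date(2022, 7, 1), "open": 10.0, "high": 11.0,
             "low": 9.5, "close": 10.5}
sample = [
    clean_bar,
    dict(clean_bar),
    {"date": date(2022, 8, 1), "open": 10.0, "high": 9.0,
     "low": 11.0, "close": 10.5},
]
sample_issues = validate_bars(sample)
```

A timeseries with an empty issue list would move on to the comparison step; how invalid rows are handled (dropped, flagged, or the whole series rejected) is a design choice the text does not specify.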
Figure 2: Market data aggregation workflow
Now that the overall workflow has been explained, we will focus on the third step, where the data comparison process occurs.
Focus on comparison process
For each instrument in our scope, we iterate through the dates in the timeseries and compare the information from one provider to another. Below is an example of how we compare a pair of daily bars of AAPL, from sources A34 and B46.
Figure 3: Comparison process illustration for daily bars
This comparison process between two inputs, at the root of our workflow, is specific to each type of input we process. The reason for this is that the different types of market data do not have the same fields. Thus, the similarity threshold should not be the same.
An example might clarify this idea:
- When comparing two daily prices for a given date, we compare the price value itself as well as the volume. The tolerated difference for the price value might be 0.1%, since it is the most sought-after information. For volume, the threshold is set to 10%, since there is apparently less consensus on this value (see next sections). If both the price and the volume are within their accepted thresholds, the entries are considered similar and belong to the same group. However, if at least one of the fields varies too much, the entries are considered different.
- The case of daily bars, shown in the example above, is quite similar but involves more fields: besides the volume, each of the open, high, low and close values must be within the 0.1% range.
- The comparison for corporate actions is stricter: numerical information (dividend value, split old/new ratio) and textual information (dividend type) must be identical for two inputs to be put in the same group.
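The three comparison rules above can be sketched as follows. The 0.1% and 10% thresholds come from the text; the field names and the exact definition of the relative difference (here, divided by the larger magnitude) are assumptions.

```python
# Thresholds quoted in the text: 0.1% on prices, 10% on volume.
PRICE_TOL = 0.001
VOLUME_TOL = 0.10

def relative_diff(a, b):
    """Symmetric relative difference (assumed definition)."""
    if a == b:
        return 0.0
    return abs(a - b) / max(abs(a), abs(b))

def similar_prices(p, q):
    """Daily prices match if price AND volume are both within tolerance."""
    return (relative_diff(p["price"], q["price"]) <= PRICE_TOL
            and relative_diff(p["volume"], q["volume"]) <= VOLUME_TOL)

def similar_bars(p, q):
    """Daily bars: open, high, low and close within 0.1%, volume within 10%."""
    prices_ok = all(relative_diff(p[k], q[k]) <= PRICE_TOL
                    for k in ("open", "high", "low", "close"))
    return prices_ok and relative_diff(p["volume"], q["volume"]) <= VOLUME_TOL

def similar_dividends(p, q):
    """Corporate actions: numerical and textual fields must be identical."""
    return p["amount"] == q["amount"] and p["type"] == q["type"]

# Hypothetical values for two sources on the same date.
a = {"price": 100.00, "volume": 1000}
b = {"price": 100.05, "volume": 1080}   # 0.05% and 7.4% apart: similar
c = {"price": 100.50, "volume": 1000}   # about 0.5% apart on price: different
```

Raising or lowering `PRICE_TOL` and `VOLUME_TOL` changes which entries are grouped together, and therefore the resulting ratings.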
These thresholds are parameters that we select according to our preferences or to what we assume to be most important. Tweaking them to give more or less importance to a specific field will ultimately lead to different rankings.
To summarize the comparison: for a given date, we compare every possible pair of values, which leads to one or more clusters. In the great majority of cases, we get the following distribution:
- One big group with one single value, containing most of the inputs.
- One or more other smaller groups, with inputs that provided different values.
To illustrate this, we can examine the case below, presenting comparisons of Apple dividends for a few providers. Two entries of the same color are matched in the same group.
Since the entries are exactly the same except for 07/08/2022, the clustering leads to a single big group for those dates. As for the 07/08/2022 dividend, source B11 ends up isolated because its amount differs from all the others. (As a side note, this is likely an input error, since the 0.205 value appears in the row above.)
Figure 4: Dividends clustering example: AAPL daily bars for multiple providers.
As for the decision process in case of disagreement, the largest group would most likely be selected as the correct value, since most providers agree on it. This appears to be a reliable method, as most of the time there is a consensus on one single value. Nevertheless, we could also decide to tweak the decision by giving more trust to a specific provider than to others.
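A minimal sketch of the grouping and majority decision described above. The text does not say how groups are built from the pairwise comparisons; this example assumes that two sources end up in the same group when they match directly or through a chain of matches (a union-find structure). The function names and the amounts are illustrative; only the isolated 0.205 value for B11 echoes the example discussed above.

```python
def cluster_sources(values, similar):
    """Group the sources of one instrument and date by matching values.

    `values` maps a source name to its value; `similar` is the pairwise
    comparison function. Groups are returned largest first.
    """
    parent = {source: source for source in values}

    def find(source):
        # Follow parent links up to the representative of the group.
        while parent[source] != source:
            parent[source] = parent[parent[source]]
            source = parent[source]
        return source

    sources = list(values)
    for i, a in enumerate(sources):
        for b in sources[i + 1:]:
            if similar(values[a], values[b]):
                parent[find(a)] = find(b)

    groups = {}
    for source in sources:
        groups.setdefault(find(source), []).append(source)
    return sorted(groups.values(), key=len, reverse=True)

def consensus(values, similar):
    """Pick the value of the largest group as the presumed correct one."""
    clusters = cluster_sources(values, similar)
    return values[clusters[0][0]], clusters

# Illustrative dividend amounts for one date.
dividends = {"A34": 0.23, "B46": 0.23, "B11": 0.205, "C21": 0.23}
value, clusters = consensus(dividends, lambda x, y: x == y)
```

Giving more trust to a specific provider, as mentioned above, could be done by weighting each group by its members instead of simply counting them.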
Going back to the clusters, we then build ratings based on the results as follows:
- Pairs of data sources that belong to the same clusters are granted “bonus” or similarity points as they appear to give the same value.
- Pairs of data sources that belong to different clusters are granted “malus” or difference points as they do not agree on that specific day.
Once this process has been iterated over all the timeseries (one for each asset in our scope), we obtain the cumulative bonuses and maluses for each pair. We can then compute a ratio for each pair, simply defined as ratio = #bonus / (#bonus + #malus).
Once stored in a matrix, these ratios should give us insights into the similarity between each pair of market data providers on a large scale.
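The bonus/malus counting and the ratio can be sketched as below. This assumes the clustering result of each date is available as a list of groups of source names, and that a pair only earns a point on dates where both sources provide a value. The helper names and the four sample dates are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def update_counts(clusters, bonus, malus):
    """Add the clustering result of one date to the running pair counters."""
    group_of = {source: index
                for index, group in enumerate(clusters)
                for source in group}
    for a, b in combinations(sorted(group_of), 2):
        if group_of[a] == group_of[b]:
            bonus[(a, b)] += 1   # same cluster: similarity point
        else:
            malus[(a, b)] += 1   # different clusters: difference point

def ratio_matrix(bonus, malus):
    """ratio = #bonus / (#bonus + #malus) for every pair seen at least once."""
    pairs = set(bonus) | set(malus)
    return {pair: bonus.get(pair, 0) / (bonus.get(pair, 0) + malus.get(pair, 0))
            for pair in pairs}

# Hypothetical clustering results for four dates of one instrument.
daily_clusters = [
    [["C21", "C39", "C28"]],
    [["C21", "C39"], ["C28"]],
    [["C21", "C39"], ["C28"]],
    [["C21", "C39", "C28"]],
]
bonus, malus = defaultdict(int), defaultdict(int)
for clusters_of_day in daily_clusters:
    update_counts(clusters_of_day, bonus, malus)
ratios = ratio_matrix(bonus, malus)
```

In this toy case, C21 and C39 agree on every date, while C28 agrees with each of them on two dates out of four.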
Conditions of observation
We chose to separate the analysis based on several factors, drawing on experience manipulating the data but also common sense:
- Region and exchange: not all providers cover the same scope (e.g. US/EU/Asia). Moreover, even if a provider covers all regions, it might be more specialized in one region than in another (in the sense of better coverage and more accurate values).
- Data type: once again, a provider could have more expertise in one specific type of data than another.
- Depth of the timeseries: not all data sources offer the same historical depth.
The idea was to find a compromise between separating market data that looked too different and keeping samples large enough for the results to be meaningful. As a result, we settled on the following elements:
- US Equities: S&P 500
- Daily prices, daily bars, dividends, splits
- 39 daily data providers used at most
We chose to run the study on the following three depth horizons:
- All (>1990): a classic sample to compare with the more specific use-cases below. We decided not to go back further than 1990: before that point, too few providers offer information for our study to be relevant.
- Recent (2021+): to see the influence of closer dates.
- Old (1990-2010): to check whether the information loses quality as we go further back in time.
Finally, for obvious reasons we have decided to anonymize the names of the providers, as the idea is not to point out low performing ones, but rather to identify similarities, global trends, and the influence on quality of the parameters listed above.
Focus on a basic case: US daily prices
After running our comparison for US daily prices, for the All-dates use-case (>1990), we get the following similarity matrix.
Figure 5: Ratio matrix between all providers - DailyPrices US Equities - 1990-2022.
The overall trend suggests that many providers agree on most of the values, which is what we hoped for. The median similarity is around 97.6%, and three quarters of the scores are above 95%. Some pairs of sources, such as C39/C21 among others, agree on all their values, as shown by the perfect scores of 1 in the matrix. Even though this is the result we should normally expect, it sometimes reveals sources that share a common provider at the root.
Some data sources, such as C30 or C37, have a particularly “green” row, which means an overall high similarity with many providers. This would mean that even when there are disparities, they more consistently provide the most popular value, which we could consider the correct one. These providers could be regarded as quality ones.
On the other hand, we can see some marginal sources (C28, and C40 to a lesser extent) that disagree with the others most of the time. Those would be regarded as “bad data providers”. These sources drag the global average down to 92%.
Finally, if we look at the table in detail, a few permutations make it possible to identify clusters of providers, that is to say groups of sources that agree with each other more than average. Those clusters are easier to identify if we isolate a portion of the matrix, as shown below.
Figure 6: Zoom on a cluster of similar providers - DailyPrices US Equities - 1990-2022.
This zoom makes it easier to consider those 6 data providers as a cluster, since the lowest similarity within the group is 99.2%.
A matrix representation like the one above is quite informative, since it provides the similarity between every pair of providers. But what if we wanted a more graphical overview of the similarities, and a more practical way to see the clusters of similar providers than extracting a sub-matrix?
To do so, we used an open-source library to display the previous results on a graph. This library allowed us to run a multidimensional scaling. The idea is to use a proximity matrix to move from an n-dimensional problem to a p-dimensional one (where n > p). In our case, the starting dimension is 36 (the number of distances for a given source) and the target dimension is 2, as the aim is to project the problem onto 2 dimensions.
The multidimensional scaling process allowed us to associate a pair of coordinates with each provider. This led us to the graph below, where the size of a dot represents the volume of data, and the distance between two dots reflects how similar their information is (the closer, the more similar).
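To make the projection concrete, here is a minimal sketch of classical (Torgerson) multidimensional scaling in pure Python. It is not the open-source library used for the study, which the text does not name, and the variant it implements may differ. The conversion of a similarity ratio into a distance (1 - ratio) and the sample ratios are also assumptions.

```python
import math

def classical_mds(dist, dims=2, iterations=500):
    """Embed n items in `dims` dimensions from a symmetric distance matrix."""
    n = len(dist)
    # Double-centre the squared distances: B = -1/2 * J * D^2 * J.
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    total_mean = sum(row_mean) / n
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
          for j in range(n)] for i in range(n)]

    def times_b(v):
        return [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]

    coords = [[0.0] * dims for _ in range(n)]
    for k in range(dims):
        # Power iteration converges to the dominant eigenvector of b.
        start_norm = math.sqrt(sum((i + 1) ** 2 for i in range(n)))
        v = [(i + 1) / start_norm for i in range(n)]
        for _ in range(iterations):
            w = times_b(v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        eigenvalue = sum(x * y for x, y in zip(v, times_b(v)))
        # Only positive eigenvalues yield a real coordinate axis.
        if eigenvalue > 1e-12:
            scale = math.sqrt(eigenvalue)
            for i in range(n):
                coords[i][k] = scale * v[i]
            # Deflate so the next pass finds the next eigenvector.
            for i in range(n):
                for j in range(n):
                    b[i][j] -= eigenvalue * v[i] * v[j]
    return coords

# Hypothetical similarity ratios for three providers, turned into distances.
ratios = [[1.00, 0.98, 0.80],
          [0.98, 1.00, 0.81],
          [0.80, 0.81, 1.00]]
distances = [[1.0 - r for r in row] for row in ratios]
points = classical_mds(distances)
```

With real data, two dimensions cannot reproduce all pairwise distances exactly, so the positions are only an approximation of the matrix. This simple power iteration also assumes that the largest eigenvalues are positive, which holds when the distances are close to Euclidean.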
Figure 7: 2D projection of distance matrix after multidimensional scaling - DailyPrices US Equities - 1990-2022.
Sometimes a graphical representation makes it easier to achieve a high-level understanding of results. This first graph allows us to clearly identify in the green circle on the top left what seems to be a large cluster. The red-circled values, mostly C28 but also C40, are the few marginal sources we identified above with the matrix.
To get a clearer view, we decided to run the same process after filtering out the inputs that clearly seemed to be off the mark (C28, C13, C40, etc.).
Figure 8: 2D projection of distance matrix with marginal providers filtered - DailyPrices US Equities - 1990-2022.
First, this new graph is not technically a zoom on the green group identified above, since we removed the influence of some providers (C28, C13, C40, etc.). Still, the ranges of the X-axis and Y-axis show that it behaves much like one: the former went from a range of 0.20 to 0.07, and the latter from 0.50 to 0.07.
Moreover, this filtered view allows us to identify more clearly some sub-clusters of similar providers within the group of the first figure. Here, we can clearly see a high-level group of look-alike providers identified as a purple dot. If we go deeper in detail, we can see 2 smaller groups of similar sources, circled in red and green, the green one being the same cluster identified on the matrix. The remaining sources seem to be isolated around the central group.
Finally, we should note that such a view does not allow us to recover the exact distance between two dots on the figure, for the simple reason that we would need N-1 dimensions to do so, as is the case in a matrix. The sole purpose of this graph is to visualize the redundancy between data providers via clusters. This could prove useful if you aimed to reach a specific coverage with as few providers as possible, or if you wanted to identify some “backup” solutions.
We have seen the results with some fixed parameters, but what if for instance we changed the timeseries depth?
Multiple depth representation
To compare the influence of the history depth, we ran the same process with the 3 different historical depths (recent, all, old) and plotted the results on the same graph below. For the sake of clarity, we filtered out the marginal values as we did before.
Figure 9: 2D projection of distance matrix with marginal providers filtered - DailyPrices US Equities - Multiple depth horizons.
Our first observation when comparing these scenarios is that the dots are much more spread out when the historical depth is greater. This confirms what we raised in the conditions of observation: there are many more disagreements between providers as the period becomes longer.
This might be even more visible if we plot the same map with some annotations:
Figure 10: 2D projection of distance matrix with marginal providers filtered, zoomed on C31 - DailyPrices US Equities - Multiple depth horizons.
In the graph above, the growing radius of the circles reveals the increasing spread of the information as the history depth widens. The focus on C31 is a good example of this trend, as we clearly see the source moving from the center to the borders. Note that this figure is the superimposition of the 3 graphs on the same scale: it allows us to interpret the growing spread with depth, but the distance between dots of different colors has no significance.
We could obviously argue that the conditions in the 3 scenarios are not the same, as many providers are missing when we go back in time, due to the difference in depth from one input to another. The idea of this comparison was to identify trends over long periods of time rather than to give providers an accurate rating.
At this point we have seen that on top of identifying some clear differences in quality from one provider to another, the overall market data quality tends to diminish when we move away from recent dates. One could now wonder how it would behave if we were to consider other types of data collected.
Comparison with other data types
To illustrate whether or not there are providers with more “expertise” for specific types of data, here are the scenarios we chose to run:
- Same instruments scope: US equities S&P 500.
- 3 different historical depths: recent (2021+), all (1990-2022), old (1990-2010)
- 4 types of market data: daily prices, daily bars, dividends, splits.
To obtain an overview of the results, rather than displaying the same type of graph as we did for daily prices, we attempted to achieve a single score for each provider, and for each type of instrument. This score was custom made for the purpose of comparison of sources. For a given source x, and data type t, the score is built as follows:
- distance(x) is the Euclidean distance based on the coordinates generated by multidimensional scaling.
- min(t) is the minimum of the Euclidean distances of all sources of data.
The results are presented in these 4 adjacent arrays, one for each type of daily data:
Figure 11: Score comparison by financial data provider for each asset type.
The main observations and interpretations of the table above are the following:
- Splits and dividends are a scarce resource: most of the time this information is already reflected in the stock’s price (delivered prices are already adjusted), but the raw information is often missing or not directly available.
- The loss of quality when going back in time also occurs for the other types of data, as shown by the overall color shift from left to right in each array. This is the case for the 3 examples (A25, C35, B39) highlighted in the table, where the quality drops as the history range grows, and this happens for each type of information. A plausible explanation for this linked behavior is that, since the odds of “missing” a dividend or a split are higher for older dates, such misses have repercussions on the prices corrected with corporate actions.
- The quality ranking and the provider hierarchy seem to hold from a global point of view. Indeed, for a given source, there are few abrupt quality shifts when we switch from one type of data to another. In other words, a source with above-average data for one type of information is most likely to offer above-average data for another type.
- There are some cases of providers not matching the trends presented above. Sometimes the estimated quality remains the same, such as A27 dividends, or it can improve with time, as is the case with B16 splits. Still, those cases remain exceptions.
Before concluding, we must keep in mind that there is no real meaning in comparing, for a given provider, the ratings for type A and type B. As explained before, the requirements for being matched in the same cluster vary with the nature of the data we handle. We can therefore be stricter when building groups for one type of data than for another, which ultimately leads to different average rankings for each type of data.
The idea of this process was to provide a critical analysis of the quality of different types of daily data. Our metrics show that, as is commonly assumed in the world of market data, some providers are clearly more accurate than others. However, even the best sources of information we have identified do not reach a perfect score: cross-checking the information with other feeds appears to be a safe method.
Some reservations could be made with regard to a potential ranking created by this comparison process. Among others, although we defined a scope of instruments for the study, the size of the dots in the 2D projections shows that the coverage of each source is not always exactly 100%. One side effect of this coverage gap is that it might slightly favor smaller sources. Smaller sources are indeed more likely to stick to the top-capitalization assets (as they are more common and sought-after), for which providers tend to be more unanimous. As a consequence, the rating of those smaller sources might be inflated.
Another detail that could be misleading in the interpretation of the matrices and ratings is the distribution of errors among financial instruments.
Consider a 100-asset pool with a 100-day depth: a provider X that is 100% accurate on 99 assets and consistently inaccurate on a single one will get the same rating as a provider Y that is accurate on 99 days out of 100 for each of the 100 assets in the pool.
Both of them end up with the same rating, but source X seems more suitable for use, especially if we consider that its specific error might be a one-off mismatch in data collection.
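The arithmetic behind this example can be checked with a short sketch. As a simplification, the rating is taken here to be the share of (asset, day) values on which the provider is accurate; the variable and function names are hypothetical.

```python
ASSETS, DAYS = 100, 100

# Provider X: wrong on every day for one asset, right everywhere else.
x_correct = [[asset != 0 for _ in range(DAYS)] for asset in range(ASSETS)]
# Provider Y: wrong on one day out of 100 for every asset.
y_correct = [[day != 0 for day in range(DAYS)] for _ in range(ASSETS)]

def rating(correct):
    """Share of (asset, day) values on which the provider is accurate."""
    return sum(sum(row) for row in correct) / (ASSETS * DAYS)

def fully_correct_assets(correct):
    """Number of assets whose whole timeseries is accurate."""
    return sum(all(row) for row in correct)

print(rating(x_correct), rating(y_correct))    # 0.99 0.99
print(fully_correct_assets(x_correct),
      fully_correct_assets(y_correct))         # 99 0
```

Both providers are accurate on 9,900 values out of 10,000, yet X delivers 99 fully usable timeseries while Y delivers none. A per-asset indicator of this kind could complement the global rating.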
If we were to go deeper into the analysis of the causes of conflicts, we would see that the main sources of errors are often missing corporate action corrections (leading to invalid prices/bars), the absence of consensus on the effective date of dividends, and human input errors (missing comma, duplicated row in files, etc.). As the final table showed, there are many more errors for corporate actions than for prices/bars. The reason behind this, aside from the different thresholds used for clustering, might be the different channels used to generate the information: while bars/prices are almost 100% automated, corporate actions still require a lot of manual input.
Finally, even if such a study allowed us to rank market data providers according to the quality of the information they provide, we should not jump to conclusions: Although quality is a fundamental requirement of data, there are other key aspects to consider in this domain, notably accessibility and cost-effectiveness, in order to accurately balance the rankings of the providers.
If you want to replicate our observations or see our data completion and scoring results for yourself, feel free to apply for a Ganymede trial via this link.