How to compare your data in/with Spark

zejnilovic profile image Saša Zejnilović ・6 min read

Table of Contents


Apache Spark, as is, provides quite a lot of different capabilities and features, but it is missing one that I, as a self-proclaimed SDET, find pretty valuable. The comparison of data. I'm talking about the comparison of complex data, complex structures and generating a report that can be used to see where the problem lies; more than just a normal true/false comparison.

The problem

The main problem that we are trying to solve is that when using standard solutions, running a comparison of sorts on a large dataset returns a basic true or false result, after which you then need to comb through all of the data and try to find the root cause.
There is no fast response. Fast feedback loops are essential, but that's for a different article. You also need some basic metrics about the dataset to be provided. Testing without proper results is putting your trust in hope, and hope alone cannot build your big data solutions.

The solution

For these reasons, my teammates from AbsaOSS and I have written a tool called Hermes. Hermes consists of three modules, and one of its modules is a data comparison tool which works either as a Spark application or as a library, and it can compare whichever format is supported by Apache Spark. This tool is written in Scala, so it should be possible to use within any JVM application of your own. (I have even seen people using py-spark use our libraries, so it's not only JVM compatible. I am, however, not an expert on that, and I am not sure how "clean" of a solution that is.)

In this article, I would like to give a brief overview of the features of this Spark comparison tool and how to use it as a Spark app. Usage as a library is a bit more complex, and I believe it deserves a full article of its own. Let me first explain who we are.

Who exactly is behind this project

AbsaOSS is an initiative of Absa Group Limited, a South African bank that wants to go open source. You want to standardize your data, move it from COBOL to something current or track what and how your data is handled? We do that (Enceladus), that (Cobrix) and that (Spline). And some other interesting stuff.

Hermes and all other projects are under the Apache License. Meaning, feel free to use it and contribute. The projects are active, and we spend almost our whole days on GitHub, so we are usually quite fast to respond. All of our projects are in some way currently used by ABSA in production.

Hermes' significant advantage is that even though it is used in production, it is quite young and still looking for ideas. It is still growing.

Current real-world usages are:

  • A testing framework for the Enceladus project.
  • A data check tool that gives us an assurance that new tools work as well as the old ones that are being decommissioned.

Hermes dataset comparison Features

This feature list should be the same for the people who use it as a library as for those that use it as a Spark app. The features are as follows:

  • Can be used as an Apache Spark application or Spark library
  • Compares virtually any data type if you provide the needed library for the source type on the classpath. Spark already supports a lot of source types, but you might need to read Oracle, Hive or Avro. Just provide the application with proper packages, and you are good to go
  • JDBC, Spark and other packages are not packaged together with the application. They have a provided dependency. This allows us to keep the jar to 150 Kb and provide users with more flexibility
  • Can compare two different source types
  • Writes output as parquet (this is planned to be configurable. Issue #72)
  • Only compares data sets with the same schema. We have a complex schema comparison so the schema does not have to be aligned, but it has to be the same. (We have a plan for selective comparisons in the future)
  • Will write _METRICS file at the end (this will be written next to the parquet)
    • If you passed or failed
    • How many rows were processed
    • If any duplicate rows were found
    • Number of differences found
  • Provides a precise path to what was wrong in the datasets. Even if the structure is complex (arrays of structs and the likes). This is written to the final parquet
  • Final parquet holds only the rows that were different
  • Prints summary to STDOUT

Usage - Spark application

Disclaimer: I will try to cover all of the tool's functionalities, but I will be skipping over spark-submit configurations. That is beyond the scope of this text. I will also not cover how to set up your Hadoop and Apache Spark.

In this use case, I will try to show possibilities of Hermes's dataset comparison. This use case covers usage as a Spark application. For usage as a library, look forward to a second article.

To use Hermes's Dataset Comparison, you just need to know how to run spark-submit, your data types, their properties/options and where it is. Let's start with an easy example:

Example 1

spark-submit \
<spark-options> \
dataset-comparison-0.2.0.jar \
--new-format csv \
--new-header true \
--new-path /new/path \
--ref-format xml \
--ref-rowTag alfa \
--ref-path /ref/path \
--out-path /out/path \
--keys ID

Example 2

spark-submit \
<spark-options> \
dataset-comparison-0.2.0.jar \
--format xml \
--rowTag alfa \
--new-path /new/path \
--ref-path /ref/path \
--out-path /out/path \
--keys ID

Now, let's go over what these are. The job has one independent parameter, and that is --keys. Keys refers to the set of primary keys. You can provide either a single primary key or a number of keys as a comma-delimited list in the form ID1,ID2,ID3.

Next up is --out-path. For now, out-path can only be configured to specify the destination path for the parquet file which will contain the output differences and metrics. This is planned to change (#72), and it will have the same rules as --ref and --new prefixes.

Last and (probably) hardest to grasp are the --ref and --new parameters. These are only prefixes to the Spark source type's standard options. Just add -format to specify the source format (type). Add -path to get the input or output path, unless you are using JDBC connector, then use -dbtable and then any other options prepended with the correct prefix (--ref or --new) depending on if it is reference data or the new data that you are testing.

These options can also be generalized. Taking a look at Example 2, it has only --format; no --new-format or --ref-format. This is because both source types are XML and both have the same rowTag.
In this case, there is no need to specify this twice. If both source types were XML but had different rowTags, then the --ref-rowTag and --new-rowTag options would need to be specified.

After running this, just run hdfs dfs -ls /out/path and take a look at the results. If there were any differences, you should find a parquet file that has a new column added called err_col. This error column will be filled with paths highlighting differences in your structure.
Its schema is (pretty simple):

 |-- errCol: array (nullable = true)
 |    |-- element: string (containsNull = true)


Hermes should be easy to use testing tool and framework. Its dataset comparison module currently holds the most value, even outside of AbsaOSS, and I hope it can help you solve an issue or two. If you have any question about this or any of our projects, just send us a message or create a Question issue on GitHub.

I am looking forward to your comments and see you in the next article - usage as a library.


Editor guide