Igor Lukanin for Cube

Posted on Jun 2, 2022 • Originally published at cube.dev

DeWitt Clause, or Can You Benchmark %DATABASE% and Get Away With It


TL;DR: Below is a list of database vendors who will either praise or punish you for doing a benchmark, along with an entertaining digression into the recent history of computer science.

About 6 months ago, we witnessed what Andy Pavlo wittily characterised as an "old school database benchmark gang war". The timeline of events goes like this:

Databricks, a data lakehouse company founded by the creators of Apache Spark, published a blog post claiming that it had set a new data warehousing performance record in the 100 TB TPC-DS benchmark. The post also claimed that Databricks was 2.7x faster and 12x better in terms of price performance compared to Snowflake.
Snowflake, a data warehousing company founded by ex-Oracle and ex-VectorWise experts, responded with a blog post that critically reviewed Databricks' findings, reported different results for the same benchmark, and claimed comparable price/performance to Databricks.
After two days, Databricks followed up with another blog post that confirmed that they stand by the initially reported results.

While this back-and-forth between respected data vendors was informative and amusing, its collateral damage was even more substantial. Databricks shared that they have eliminated the anti-competitive DeWitt clause from their service terms. Snowflake followed up with a similar update to their acceptable use policy.

But what is the DeWitt clause?

More importantly, why do you need to know whether a vendor's terms of service include a DeWitt clause if you'd like to run and publish a performance benchmark? Let's explore.

Backstory of the DeWitt Clause

In 1982, at the very inception of relational databases, David DeWitt, a researcher at the Department of Computer Sciences at the University of Wisconsin-Madison, was working on measuring database performance. His team wrote the Wisconsin Benchmark, tested a number of databases, and published the results.

Apparently, some folks were displeased with the results. Michael Stonebraker, a researcher at UC Berkeley and co-creator of Ingres, called DeWitt and expressed how upset he was. However, they remained friends anyway, as DeWitt recounts in his talk on YouTube.

Oracle also didn’t perform very well and reacted hastily. According to DeWitt, Oracle CEO Larry Ellison was so displeased that he called the department chair and insisted, "You have to fire this guy." Oracle also inserted a clause in their terms of use that boiled down to this: one can’t publish benchmarks without getting explicit approval from Oracle.

DeWitt Clause vs. DeWitt Embrace Clause

Nowadays, the DeWitt clause is a common provision in end-user license agreements for proprietary software that prevents users from publishing information about the software without explicit approval from the vendor. Over time, it has become widespread in the software industry, and its implications are drastic:

People avoid publishing benchmarks about software products with the DeWitt clause.
Moreover, some won’t even bother doing a benchmark knowing they won’t be able to publish the results.

Whether DeWitt clauses are moral or even legal is still debated, because of the implications above as well as the overall corrosive effect they have on the industry. Once one supplier adds a DeWitt clause, the others can feel that they are at a disadvantage without one. Luckily, it can also work the other way round, as in the Databricks and Snowflake case. However, open source software (OSS) licenses don’t and can't have DeWitt clauses, meaning that OSS can be legally critiqued while its proprietary competitors cannot.

At the same time, some vendors do quite the opposite of including a DeWitt clause: they change their license to include the DeWitt Embrace Clause, meaning that if you disclose benchmark results for their product, you automatically allow them to do the same with yours. In a way, the DeWitt Embrace Clause is like copyleft for the DeWitt Clause.
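To make the difference between the clauses concrete, here is a minimal Python sketch of the decision they imply. It is a deliberate simplification, not a reading of any vendor's actual terms: the names `BenchmarkPolicy` and `may_publish` and the three boolean flags are hypothetical and exist only for illustration. The reproducibility condition reflects the requirement, described later in this post, that Embrace-style terms ask you to share everything needed to replicate the benchmark.

```python
from enum import Enum


class BenchmarkPolicy(Enum):
    """Hypothetical labels for the three situations discussed in this post."""
    NO_CLAUSE = "no DeWitt clause"
    DEWITT = "DeWitt clause"
    DEWITT_EMBRACE = "DeWitt Embrace clause"


def may_publish(policy, vendor_approved=False, shares_replication_info=False,
                accepts_reciprocal_benchmarks=False):
    """Return True if publishing results looks permitted in this simplified model."""
    if policy is BenchmarkPolicy.NO_CLAUSE:
        # Nothing in the terms restricts publishing benchmark results.
        return True
    if policy is BenchmarkPolicy.DEWITT:
        # Publishing requires explicit approval from the vendor.
        return vendor_approved
    if policy is BenchmarkPolicy.DEWITT_EMBRACE:
        # Publishing is allowed if the benchmark is reproducible and you
        # let the vendor benchmark your own product in return.
        return shares_replication_info and accepts_reciprocal_benchmarks
    raise ValueError(f"unknown policy: {policy!r}")


if __name__ == "__main__":
    print(may_publish(BenchmarkPolicy.DEWITT))
    print(may_publish(BenchmarkPolicy.DEWITT_EMBRACE,
                      shares_replication_info=True,
                      accepts_reciprocal_benchmarks=True))
```

Real terms are more nuanced than this model: some clauses forbid running benchmarks at all, not just publishing them, so always check the actual wording quoted in the lists below.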

Can You Benchmark %DATABASE%?

Now, do you wonder whether a particular vendor has a DeWitt clause?

Shortly after the Databricks and Snowflake debate, Mark Callaghan, an ex-MongoDB and ex-Rockset database expert, checked if some database and public cloud vendors have DeWitt clauses in their terms of use.

In this blog post, I try to extend his research and maintain an evergreen list of database and similar vendors with regard to the DeWitt clause. (If you spot any inaccuracy, which is totally possible, please get in touch via igor@cube.dev; I'll be happy to update the post.)

Last updated: June 2, 2022.

😇 Open-source vendors (without the DeWitt clause)

Open source licenses usually grant users permission to use open source software for any purpose. So, by definition, open source software can't contain DeWitt clauses (because if it did, it wouldn't be open source).

Yes, you can benchmark any of these tools:

Apache Drill, Druid, Flink, Hive, Kafka, Spark
appwrite
ClickHouse (they even maintain a list of benchmarks on GitHub)
CockroachDB (licensed under BSL, not OSI-approved)
Dremio
Elasticsearch (licensed under SSPL, not OSI-approved)
ksqlDB
Materialize (licensed under BSL, not OSI-approved)
MongoDB (licensed under SSPL, not OSI-approved)
PostgreSQL
Presto
Redpanda (licensed under BSL, not OSI-approved)
QuestDB
TimescaleDB
Trino

Also, some open-source vendors collaboratively maintain benchmarking suites such as Time Series Benchmark Suite to help choose the best tools for particular workloads.

😇 Vendors without the DeWitt clause

Cloud vendors can restrict you from benchmarking their services but sometimes they choose not to have the DeWitt clause.

Yes, you can benchmark any of these services:

Ahana Cloud
Altinity Cloud
appwrite Cloud (not launched to GA yet)
ClickHouse Cloud (not launched to GA yet)
Confluent Cloud
Firebolt
Materialize Cloud
MongoDB Cloud
QuestDB Cloud (not launched to GA yet)
Redpanda Cloud
Snowflake
Starburst Galaxy
Supabase
Timescale Cloud

However, many of these cloud vendors disallow abusive benchmarking in their acceptable use policies. So be careful and please don't get banned.

🙃 Vendors with the DeWitt Embrace clause

Some cloud vendors permit you to benchmark their service but require reciprocity: you must make the benchmark reproducible and allow benchmarking of your own service or tool in response.

Yes, you can benchmark any of these services and probably get away with it:

Amazon Web Services, including Athena, Aurora, and Redshift. You may perform benchmarks or comparative tests or evaluations (each, a “Benchmark”) of the Services. If you perform or disclose, or direct or permit any third party to perform or disclose, any Benchmark of any of the Services, you (i) will include in any disclosure, and will disclose to us, all information necessary to replicate such Benchmark, and (ii) agree that we may perform and disclose the results of Benchmarks of your products or services, irrespective of any restrictions on Benchmarks in the terms governing your products or services.
CockroachDB Serverless. You can only disclose benchmarking tests if you (i) provide us the necessary information to replicate the tests, (ii) disclose the methodology for the benchmarking tests along with the results, and (iii) allow us to run our own benchmarking tests against your products and services at our discretion.
Databricks. You may perform benchmarks or comparative tests or evaluations (each, a “Benchmark”) of the Platform Services and may disclose the results of the Benchmark other than for Beta Services. If you perform or disclose, or direct or permit any third party to perform or disclose, any Benchmark of any of the Platform Services, you (i) will include in any disclosure, and will disclose to us, all information necessary to replicate such Benchmark, and (ii) agree that we may perform and disclose the results of Benchmarks of your products or services, irrespective of any restrictions on Benchmarks in the terms governing your products or services.
SingleStore. Customer may perform industry standard benchmarks, comparative tests or evaluations (each, a "Performance Report") of the Managed Service. If Customer performs or discloses, or directs or permits any third party to perform or disclose, any Performance Report of any of the Managed Service, Customer (i) will include in any disclosure, and will disclose to SingleStore all information necessary to replicate such Performance Report and the data from the Performance Report, and (ii) agree that SingleStore may perform and disclose the results of the Performance Report of Customer's products or services, irrespective of any restrictions on Performance Report.

🤬 Vendors with the DeWitt clause

These vendors chose to restrict benchmarks with DeWitt clauses.

No, you can't benchmark any of these tools and get away with it:

Cloudera Data Platform. Restrictions. Customer may not: [...] access or use the Services for purposes of monitoring availability, performance or functionality of the Services, or for any benchmarking or competitive purposes.
Decodable. Restrictions. Customer may not, and may not cause or permit others to: [...] disclose results of any benchmark tests or performance tests of the Decodable services hereunder without Decodable’s prior written consent.
Dremio Cloud. Restrictions. Customer shall not, and shall not cause or allow any Authorized User or third party to: [...] except with Dremio’s prior written permission, publish any performance or benchmark tests or analysis relating to Dremio Cloud.
Elastic Cloud. Restrictions [...] You shall not: [...] access or use any Cloud Service for purposes of monitoring its availability, performance or functionality, or for any other benchmarking or competitive purposes, including, without limitation, for the purpose of designing and/or developing any competitive services.
Google Cloud Platform, including BigQuery & Firebase. Benchmarking. Customer may conduct benchmark tests of the Services (each a "Test"). Customer may only publicly disclose the results of such Tests if it (a) obtains Google's prior written consent, (b) provides Google all necessary information to replicate the Tests, and (c) allows Google to conduct benchmark tests of Customer's publicly available products or services and publicly disclose the results of such tests. Notwithstanding the foregoing, Customer may not do either of the following on behalf of a hyperscale public cloud provider without Google's prior written consent: (i) conduct (directly or through a third party) any Test of the Services or (ii) disclose the results of any such Test.
InfluxDB Cloud. Restrictions. Customer will not: [...] publish or provide any benchmark, comparison or performance test results.
MariaDB SkySQL and MariaDB Xpand. Benchmarking. Customer may perform benchmarks or comparative tests or evaluations (each, a "Benchmark") of the SkySQL Services. However, Customer must obtain MariaDB's prior written approval to disclose to a third party the results of any Benchmark of the SkySQL Services.
Microsoft SQL Server. BENCHMARK TESTING. You must obtain Microsoft’s prior written approval to disclose to a third party the results of any benchmark test of the software.
Oracle. RESTRICTIONS. [...] You may not: [...] disclose results of any Program benchmark tests without Oracle's prior written consent.
PlanetScale. Customer will not, and will ensure that its End Users do not: [...] access any portion of the Product for the purpose of building a similar or competitive product or service, or monitor the Product for any benchmarking or competitive purpose.

😶 Vendors for which the author didn't find the data

I wasn't able to identify where a few vendors stand with regard to the DeWitt clause:

If you can provide any pointers, please get in touch via igor@cube.dev; I'll be happy to update the post. Also, please reach out if you'd like more databases to be featured here.

What about Cube?

Cube is an open-source headless BI platform that integrates with a lot of data sources: cloud data warehouses, query engines, and databases. Many of them are designed for low latency and some of them can provide high concurrency. However, sometimes they don't; you can read our blog post about that.

In any case, Cube will have no problem delivering your data, with sub-second latency and high concurrency, to any data consumer out there: BI tools, data notebooks, front-end apps, etc. Thanks to Cube Store, Cube's REST, GraphQL, and SQL APIs are able to respond to any request within 200-300 milliseconds and allow for 100+ concurrent requests—and that can scale.

Please explore how Cube integrates with databases and other tools in the data space in our blog. Also, don't hesitate to try Cube Cloud, join our large Slack community, and give Cube a star on GitHub 🌟
