This is an incredibly in-depth post describing three new-ish OLAP SQL engines that have achieved notable traction and performance for certain use cases. While none of them are poised to take over as the next generic SQL OLAP engine tomorrow, they have impressive performance characteristics.
The subject systems run queries faster than the Big Data processing systems from the SQL-on-Hadoop family: Hive, Impala, Presto and Spark, even when the latter access the data stored in columnar format, such as Parquet or Kudu. This is because ClickHouse, Druid and Pinot
- Have their own format for storing data with indexes, and tightly integrated with their query processing engines. SQL-on-Hadoop systems are generally agnostic of the data format and therefore less “intrusive” in Big Data backends.
- Have data distributed relatively “statically” between the nodes, and the distributed query execution takes advantage of this knowledge. On the flip side, ClickHouse, Druid and Pinot don’t support queries that require movements of large amounts of data between the nodes, e. g. joins between two large tables.
Very worth following.