hive, trino & spark features (their journeys to sql, performance & durability)

Open-source frameworks for big data analytics have existed for over a decade now and have wide acceptance across all industries. Three of the most popular ones are Apache Hive, Trino, and Apache Spark. There are many aspects of these frameworks that can be compared & contrasted, but I want to focus on the following three features and walk through each framework’s journey to attain all three (spoiler: they all get there!).

  • SQL – Structured Query Language, like it or not, is THE most accepted analysis language for business data with known structure
  • Performance – Obviously, this suggests that we want these SQL queries to run as fast as possible
  • Durability – Many SQL queries & operations take a long time to complete, and the feature of durability ensures that user requests will run to completion, even in the face of software/hardware failures

Historical Timeline

Event: Apache Hadoop Surfaces (2006)

Hadoop was released to attack large-scale data analysis tasks that existing technologies either could not process at all, or could not be scaled to the needed level at a price organizations could afford. The cluster is a combination of storage (HDFS) and compute (YARN) that allows for an awesome feature called “data locality,” which means taking the processing to the data instead of the other way around.

Initially, Hadoop developers were only presented with the Java MapReduce API, which did offer data analysis processing abilities with inherent job reliability and durability features. This approach did not offer SQL (a SQL abstraction layer would arrive soon), but was focused on guaranteeing a job would complete — regardless of how long it takes to complete.

|             | Hadoop (Hive) | Trino | Spark |
|-------------|---------------|-------|-------|
| SQL         |               |       |       |
| Performance |               |       |       |
| Durability  | 2006          |       |       |

Event: Apache Hive is Created (2010)

Developers at Facebook built Hive, a SQL abstraction layer on top of Hadoop, to get past the Java programmer hurdle with Hadoop. Hive created a component called the metastore that stores the needed information for this schema-on-read data warehouse technology. This metadata for each table includes items such as:

  • the table’s column names and data types
  • the location of the underlying data files (e.g. an HDFS directory)
  • the file format and serializer/deserializer (SerDe) used to read the data
  • partitioning details, when present

Hive is tightly-coupled with Hadoop and ultimately submits MapReduce jobs (although now with the optimized Tez engine) that run in the cluster along with other queries as well as other types of workloads.
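As a sketch of what the metastore captures, here is a hypothetical Hive DDL statement (the table, columns, and location are all made up for illustration):

```sql
-- Hypothetical example: registering an existing HDFS directory as a Hive table.
-- The metastore records the schema, storage format, and data location;
-- the files themselves are only interpreted at query time (schema-on-read).
CREATE EXTERNAL TABLE web_logs (
  event_time TIMESTAMP,
  user_id    BIGINT,
  url        STRING
)
PARTITIONED BY (event_date STRING)
STORED AS ORC
LOCATION '/data/raw/web_logs';
```

Because the table is EXTERNAL, dropping it removes only the metastore entry, not the underlying files.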

|             | Hive | Trino | Spark |
|-------------|------|-------|-------|
| SQL         | 2010 |       |       |
| Performance |      |       |       |
| Durability  | 2006 |       |       |

Hive also brought us the ORC file format, but even with this and Tez it wouldn’t be fair to say it has fully checked off the Performance checkbox yet.

Event: PrestoDB (see Trino family tree) Invented for Interactive Queries (2012)

Yep, I referenced PrestoDB in this section’s header and THEN dropped a Commander Bun Bun logo right after it. Here’s a great resource describing the journey from Presto to Trino and Starburst. Armed with that awareness, bear with me as I focus on just saying Trino hereafter.

https://www.starburst.io/blog/the-journey-from-presto-to-trino-and-starburst/

Still over at Facebook, it was determined that Hive was great for long-running analytical queries and for data engineering pipelines. Trino (fka PrestoSQL) was created to execute fast queries. It did this by running on its own cluster of dedicated compute nodes separate from Hadoop, and by not spilling intermediate data to disk between the steps of a full query.

This improved speed greatly, but at the cost of reliability. If anything went wrong with the execution, an error was returned to the person or process that submitted the query.

|             | Hive | Trino | Spark |
|-------------|------|-------|-------|
| SQL         | 2010 | 2012  |       |
| Performance |      | 2012  |       |
| Durability  | 2006 |       |       |

An added benefit of separating compute and storage is that Trino could use a variety of Connectors to become a single point of access to many data systems, not just to a variety of file formats on the data lake. Additionally, federated queries can be run across these connectors.
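A federated query might look something like the sketch below (the catalog, schema, and table names are all hypothetical):

```sql
-- Hypothetical federated query: one statement spans two catalogs.
-- The hive catalog reads files on the data lake while the mysql catalog
-- reaches into an operational database; Trino joins the results.
SELECT c.customer_name,
       sum(o.order_total) AS lifetime_value
FROM hive.sales.orders AS o
JOIN mysql.crm.customers AS c
  ON o.customer_id = c.customer_id
GROUP BY c.customer_name;
```

The `catalog.schema.table` naming is what lets a single query address multiple backend systems.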

Event: Apache Spark Emerges (2014)

The “smart kids” over at UC Berkeley’s AMPLab were enjoying the new ability to run jobs on Hadoop, but they realized that for iterative processing (such as machine learning algorithms) Hadoop’s inherent resource-sharing model was hurting them. The Spark creators did realize that existing resource managers, such as Hadoop YARN, were available, and they utilized those existing tools instead of re-inventing them.

They started building Spark (still a MapReduce-style engine) and realized that if they allocated all the resources they needed at the start of a program, and coupled that with in-memory caching options (for when the processing really needed to revisit the same immutable data over and over), then they could run jobs 50-100x faster.

For non-iterative (i.e. good old-fashioned data engineering) jobs the execution could easily be 3-7x faster because resources do not need to be re-requested from one parallel task to the next, so we can declare it a performance-oriented framework for a variety of workloads.

|             | Hive | Trino | Spark |
|-------------|------|-------|-------|
| SQL         | 2010 | 2012  |       |
| Performance |      | 2012  | 2014  |
| Durability  | 2006 |       | 2014  |

At this point, the primary API was focused on the Resilient Distributed Dataset (RDD) which required programming expertise.

Event: Spark Adds SQL Support (2015)

As we know, the data analysis world is fueled by SQL. It didn’t take Spark long to add the DataFrame API, which in addition to a programmatic interface allows for classical SQL operations. This rounded out the Spark platform regarding the features of SQL, Performance, and Durability.
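The earlier in-memory caching and the new SQL interface come together in Spark SQL; a hypothetical session (table and column names are made up) might look like:

```sql
-- Hypothetical Spark SQL sketch: CACHE TABLE pins the data in memory so
-- that repeated queries (or the passes of an iterative algorithm) reread
-- it from RAM instead of going back to disk each time.
CACHE TABLE page_views;

SELECT country, count(*) AS views
FROM page_views
GROUP BY country;
```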

|             | Hive | Trino | Spark |
|-------------|------|-------|-------|
| SQL         | 2010 | 2012  | 2015  |
| Performance |      | 2012  | 2014  |
| Durability  | 2006 |       | 2014  |

Event: LLAP Hits the Scene (2017)

As this whole blog post is a testament to the facts that user requirements drive features AND that imitation is the sincerest form of flattery, the Hive community created an optional framework called Live Long and Process (LLAP). It has a lot of sophistication, but I’ll boil it down to this: it keeps processing resources allocated, online, and ready for querying, and it maintains a shared memory cache across all the nodes those resources are allocated to.

This solution can easily attain sub-second query results on datasets that can fit in the shared cache, and LLAP doesn’t have to coordinate with YARN all the time. While LLAP is an optional element of Hive, it truly does bring the feature of high performance to a stable SQL engine.

|             | Hive | Trino | Spark |
|-------------|------|-------|-------|
| SQL         | 2010 | 2012  | 2015  |
| Performance | 2017 | 2012  | 2014  |
| Durability  | 2006 |       | 2014  |

Event: Project Tardigrade (2022)

Over the years that Trino has been fulfilling its role as a fast query processor, users have also been leveraging it in their ETL pipelines. While Facebook has been using fault-tolerant execution with Presto for years now, this feature finally came to open-source Trino. This feature-release blog post offers more details.
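Fault-tolerant execution is opt-in and is enabled through configuration properties along these lines (a sketch only; the file names and the S3 bucket are assumptions, so check the current Trino docs before copying):

```
# config.properties -- retry failed tasks instead of failing the whole query
retry-policy=TASK

# exchange-manager.properties -- durable spooling for intermediate data
exchange-manager.name=filesystem
exchange.base-directories=s3://my-bucket/trino-exchange
```

The key idea is that intermediate exchange data is spooled to durable storage, so a failed task can be retried without rerunning the entire query.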

|             | Hive | Trino | Spark |
|-------------|------|-------|-------|
| SQL         | 2010 | 2012  | 2015  |
| Performance | 2017 | 2012  | 2014  |
| Durability  | 2006 | 2022  | 2014  |

Now, as promised, we finally have a full table indicating that all three popular SQL engines satisfy performance AND durability. This concludes our history lesson for today. 🙂

Published by lestermartin

Developer advocate, trainer, blogger, and data engineer focused on data lake & streaming frameworks including Trino, Hive, Spark, Flink, Kafka and NiFi.
