
When people think of Trino, they naturally home in on the benefits of querying data with SQL in this “engine that runs at ludicrous speed”. When asked if Trino would be useful as a transformation processing engine for ETL/ELT workloads, not all would immediately answer YES!
Usually, the answer depends on whether someone has only heard of Trino or has actually used it. The latter group will usually say ABSOLUTELY it can! This post focuses on ensuring everyone knows that Trino is well suited for constructing data pipelines AND that data engineers have their choice of SQL and/or Python when building them.
History of Trino for ETL workloads
The project website is a great place to start if Trino is new to you. The Starburst DevCenter is another useful resource. And, of course, I’m proud to offer my own blog posts discussing Trino and/or Starburst to the community. Please spend a little bit of time at some (or all) of these resources if needed before continuing.
Within a year after Trino (then known as Presto) was released into production at Facebook for interactive use cases, users started scheduling batch/ETL queries with Trino instead of Hive. Within a few more years, 50% of the existing ETL workloads were running on Trino as well as 85% of all new workloads. Soon others caught on, and companies like Salesforce and Lyft began utilizing Trino for their batch/ETL workloads, too.
Architectural concerns & remedies
Trino came out of the gate blowing Hive away on query execution speed because Trino prioritized performance, whereas Hive prioritized reliability. On a side note, hive, trino & spark features (their journeys to sql, performance & durability) points out that both of these engines are always improving. To understand how this played out, let’s talk about the original architecture of Trino.
Parallel processing fundamentals
Fundamentally, the Trino engine uses a distributed processing model similar to Hive and Spark, meaning it breaks work down into multiple stages organized in a directed acyclic graph (DAG). Within a stage, the same processing can be performed simultaneously on many slices of data (Trino calls them splits, Spark calls them partitions, and Hive calls them blocks).
When data needs to move to another stage of processing, it has to be redistributed so that processing can continue in a highly parallel fashion. Examples of when this is needed are common SQL operations such as GROUP BY, JOIN, and ORDER BY. If these concepts are new to you, I recommend you watch, at least, the first video in trino query plan analysis (video series), which I’m embedding next.
Initial Trino architecture
With that DAG / query plan knowledge fresh in your mind, take a look at the following image, which represents the base architecture behind Trino’s speed.

Data leaving one stage is redistributed into the next stage (often referred to as shuffling). Trino’s speed benefit comes from not persisting that intermediate data to disk; it acts more like a streaming engine internally and sends the data on to the next stage directly. The consequence is that if the query fails before completion, the initiator simply gets back an error message.
The design goals centered on interactive query performance did not provide sufficient support for long-running and memory-intensive queries.
- Long-running queries are unreliable: the all-or-nothing architecture makes it very hard to tolerate faults
- Distributed memory limit: with streaming shuffle, aggregations and joins have to hold and process everything in memory at once
Also, with the original architecture, it is really hard to apply classic techniques like adaptive query execution, speculative execution, and skew handling.
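If you want to see the stage-and-exchange structure described above for one of your own queries, Trino’s EXPLAIN (TYPE DISTRIBUTED) output lists the plan fragments (stages) and the exchanges that shuffle data between them. Here is a minimal sketch using the trino Python client; the connection details and table names are placeholders.

```python
import trino  # pip install trino

# Connect to a Trino coordinator (host, catalog, and schema are placeholders).
conn = trino.dbapi.connect(
    host="trino.example.com",
    port=8080,
    user="etl_user",
    catalog="hive",
    schema="sales",
)
cur = conn.cursor()

# EXPLAIN (TYPE DISTRIBUTED) prints the plan fragments (stages) and the
# exchanges that redistribute data between them for this GROUP BY query.
cur.execute("""
    EXPLAIN (TYPE DISTRIBUTED)
    SELECT region, count(*) AS order_count
    FROM orders
    GROUP BY region
""")
for row in cur.fetchall():
    print(row[0])
```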
Fault-tolerant execution
As hive, trino & spark features (their journeys to sql, performance & durability) calls out, platforms continue to grow over time to address as many features and limitations as possible. Trino has been no different, and Project Tardigrade added fault-tolerant execution (FTE) to the architectural design.

The issues identified earlier with the initial architecture were overcome by storing the intermediate data produced by the parallel tasks within a stage. This reintroduces the stage-by-stage execution model that engines like Hive were famous for (and that gave them the reliability we are discussing now). The following diagram shows the flow of that intermediate data to configurable storage.

The same work is still accomplished within each stage, but because not all stages have to be actively running at the same time and the data needed for a subsequent stage is persisted durably, the following benefits surface with FTE.
- Tolerate individual task failures: if a worker fails, only the work in the affected tasks of the currently executing stage needs to be restarted
- Reliably run queries of any size: all tasks within a stage do not have to be executing at the same time, so the all-in-memory constraint is eliminated
- More flexible resource management: the engine can move higher-priority queries to the front of the queue without cancelling the ones already running
- Adapt query execution dynamically: Trino can adjust to changing runtime conditions and optimize its execution strategy midstream
FTE is a cluster-wide setting, so in some organizations it would make sense to run 1+ FTE clusters as well as 1+ (classically) ‘interactive’ clusters.
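For reference, enabling FTE amounts to setting a retry policy on the cluster and pointing an exchange manager at durable storage for the spooled intermediate data. Below is a minimal sketch of the relevant properties, assuming an S3 bucket for the spooling location; the bucket name and region are placeholders, and you should check the Trino fault-tolerant execution docs for the exact options in your version.

```properties
# etc/config.properties (coordinator and workers)
retry-policy=TASK

# etc/exchange-manager.properties
exchange-manager.name=filesystem
exchange.base-directories=s3://my-spooling-bucket/trino-exchange
exchange.s3.region=us-east-1
```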
Pipelines across the medallion architecture
As the diagram below shows, the medallion architecture is a common pattern we see in data lake environments and is focused on the flow of data from ingestion all the way to consumption for analytical, ML/AI, or other data application needs.

Data pipeline construction options
The next diagram suggests that this end-to-end flow of data constitutes a comprehensive data pipeline. The diagram also shows that the E (extract) & L (load) from ELT are often referred to, in aggregate, as ingestion. Interestingly, the T (transform) output can be the input for another transformation, or even more than one.

More interesting to note is that those series of transforms (those data pipelines) can be constructed with classical SQL (no surprise for the Trino query engine), but also with Python.
SQL
Probably the best argument for using SQL to build your transforms and other data pipeline activities, such as validating, enriching, and aggregating, is that almost everyone knows SQL.
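As a taste of what that can look like, here is a hypothetical bronze-to-silver aggregation expressed as a single SQL statement and submitted from Python, reusing the trino client connection from the earlier sketch; all catalog, table, and column names are made up.

```python
# Reusing the `cur` cursor from the earlier sketch; every name below is illustrative.
cur.execute("""
    CREATE TABLE IF NOT EXISTS silver.daily_sales AS
    SELECT o.order_date,
           c.region,
           sum(o.total_price) AS revenue,
           count(*)           AS order_count
    FROM bronze.orders o
    JOIN bronze.customers c ON o.custkey = c.custkey
    WHERE o.status = 'COMPLETE'
    GROUP BY o.order_date, c.region
""")
```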

Pros for SQL
- Widespread knowledge
- Compact
- Easy to understand for simple logic
Cons for SQL
- No real-time debugging
- Can’t perform other actions on results
- No unit tests
- Version diffs often hard to read
- As complexity grows, SQL gets more and more difficult
Python
Some folks want to implement their data pipelines with a programming approach, and currently Python is the most popular language for data processing. For Python transformation jobs with Trino, we really are talking about using a Dataframe API.
A Dataframe is a two-dimensional data structure that can be conceptually understood as a table. The following example is implemented with the very popular PySpark distributed processing framework.
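To give a representative flavor, a PySpark job for the same kind of bronze-to-silver aggregation might look something like this (table and column names are made up for illustration):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("daily_sales").getOrCreate()

# Hypothetical bronze tables; the names are made up for illustration.
orders = spark.table("bronze.orders")
customers = spark.table("bronze.customers")

daily_sales = (
    orders.filter(F.col("status") == "COMPLETE")
          .join(customers, "custkey")
          .groupBy("order_date", "region")
          .agg(F.sum("total_price").alias("revenue"),
               F.count("*").alias("order_count"))
)

# Everything above is lazily defined; this action triggers the distributed work.
daily_sales.write.mode("overwrite").saveAsTable("silver.daily_sales")
```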

Pros for Python Dataframes
- Reusable functions
- Version diffs are easy to read
- Easily create unit tests
- Can combine with other Python libraries like NumPy & SciPy
- Perform real-time debugging
- Easy to follow complex logic
- Allows for logging
Cons for Python Dataframes
- A bit verbose
- Need to be a Python programmer
For a Python Dataframe application to run on Trino, the framework has to convert the code into SQL that Trino can understand. This process can end up producing SQL that looks different from what you might write directly.
Here is a simple multi-table join query written by a human.
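As a stand-in, picture something along these lines, shown here as a SQL string in Python; the tables are the same hypothetical bronze ones used above.

```python
# A hand-written multi-table join; all names are hypothetical.
human_sql = """
    SELECT c.region,
           sum(o.total_price) AS revenue
    FROM bronze.orders o
    JOIN bronze.customers c ON o.custkey = c.custkey
    WHERE o.status = 'COMPLETE'
    GROUP BY c.region
    ORDER BY revenue DESC
"""
```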

Programming that same functionality in Python using methods for each operation, such as filtering, joining, or aggregating, can end up generating SQL that looks much more verbose.

Fortunately, Trino has a feature-rich Cost-Based Optimizer (CBO) which analyzes whatever query it receives to determine the best approach to run it quickly & efficiently. Coming back to our earlier ‘parallel processing’ discussion, this means we expect the CBO to create the same query plan / DAG regardless of how the SQL is written, as long as both versions are trying to answer the same question.

Dataframe API alternatives
The primary options for programming to a Dataframe API using Python, with the goal of running the work in a Trino cluster, are PyStarburst and Ibis.
PyStarburst
PyStarburst features a PySpark-like syntax using the same lazy execution model (it waits for an input/output ‘action’ to be called before actually submitting work to the cluster). At that point, the Python code gets converted to SQL so that the heavy lifting is done by Trino.

To learn more about PyStarburst, including seeing some programming examples, check out my PyStarburst tagged content.
Ibis
Ibis provides a common Dataframe API for data manipulation in Python and compiles that API into queries for a variety of compute engines.

With this tooling, Trino is just one of the options for where your Dataframe code can be executed. Check out ibis & trino (dataframe api part deux) for code examples running in Trino.
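For a feel of the API, here is a rough sketch of that same aggregation written with Ibis against a Trino backend, including a peek at the SQL Ibis generates. The connection values and table names are placeholders, and the exact connect() arguments may differ slightly by Ibis version, so treat this as illustrative rather than copy-paste ready.

```python
import ibis
from ibis import _

# Connect to Trino through Ibis' Trino backend (all connection values are placeholders).
con = ibis.trino.connect(
    user="etl_user",
    host="trino.example.com",
    port=8080,
    database="iceberg",  # Trino catalog
    schema="bronze",
)

orders = con.table("orders")
customers = con.table("customers")

# Build the expression lazily; nothing runs on the cluster yet.
expr = (
    orders.filter(_.status == "COMPLETE")
          .join(customers, "custkey")
          .group_by("region")
          .agg(revenue=_.total_price.sum())
          .order_by(_.revenue.desc())
)

print(ibis.to_sql(expr))  # the SQL that will be handed to Trino
df = expr.execute()       # runs on the cluster, returns a pandas DataFrame
```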
Comparing Dataframe APIs
The APIs themselves for these two frameworks are similar, but certainly not identical, as the following shows.

While the SQL that gets generated by each tool might look different, they are both attempting to solve the same problem, so we rely on the CBO to generate the same, or very similar, query plan / DAG.

Orchestration & scheduling
Trino does not provide a native orchestration and/or scheduling subsystem, so you still need to bring your own solution in this space. Tools like Airflow, Dagster, and Prefect are popular choices and can orchestrate a mix of SQL, Python, and other tooling steps that need to be executed in the right order and with the right dependencies.
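As one hypothetical example, an Airflow 2.x DAG that chains an ingestion step and the SQL transform sketched earlier could be wired up like this; the task bodies are stubs, and the operator and connection choices will vary by environment.

```python
from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def daily_sales_pipeline():

    @task
    def ingest_orders():
        # Land the raw files into the bronze layer (details omitted).
        ...

    @task
    def build_daily_sales():
        # Submit the CREATE TABLE ... AS transform to Trino,
        # e.g. with the trino Python client shown earlier.
        ...

    # Run the transform only after ingestion has completed.
    ingest_orders() >> build_daily_sales()

daily_sales_pipeline()
```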
Summary
Key points
Trino is, and has been, an appropriate clustering technology for running your transformation processing jobs.
- Fault-tolerant execution mode makes it even more robust
- Data engineers can leverage SQL and/or Python when constructing their pipelines
Dataframe API recommendations
If you use Starburst products >> Choose PyStarburst
- More similar to PySpark Dataframe API (easier to port from, or even back to, Spark)
- More likely to be better optimized for Trino than Ibis, since the development community knows Trino very well
If you are using OSS Trino >> Choose Ibis
- PyStarburst is not supported here
- Gain optionality for the backend execution engine
If you could benefit from additional commentary on this topic, please check out my YouTube video below.