building scalar udf’s w/sql for trino (aka sql routines)

I’m finally getting a bit of time to try out Trino SQL routines in Starburst Galaxy and I thought I’d share my experiences. Like is often the case, the intro docs actually help quite a bit. I thought I’d show a simple example to get this post started and then add another slightly more complex one to illustrate their biggest benefit — abstraction!

A simple SQL routine

Run the following to create a SQL routine that takes a first and last name as parameters and then returns a single string in the “last_name, first_name” format.

CREATE FUNCTION combine_name(
    first_name varchar, last_name varchar)
  RETURNS varchar
  RETURN concat_ws(', ', last_name, first_name);

SELECT combine_name('Lester', 'Martin');

Starburst Galaxy has a sample table that can help test this out a bit more.

SELECT custkey, 
       combine_name(first_name, last_name) AS combined_name
  FROM sample.burstbank.customer
 LIMIT 5;

Easy-peasy, but for this simple example it would have been perfectly acceptable to just use the concat_ws function in-stream of the SELECT statement.

A slightly more complex one

For this one, let’s continue to use the sample Burst Bank customer table. Here are a few of the key columns we will work with.

SELECT custkey, paycheck_dd, 
       estimated_income, fico
  FROM sample.burstbank.customer 
 LIMIT 5;

Our business partners have created a medallion status algorithm based on creating a score by adding the following values together.

  1. If the customer is using direct deposit, start with 50,000 points (0 points if not).
  2. Their income (as-is).
  3. 100 times their credit score.

Then assign a medallion status based on the following score ranges.

bronzeup to 75,000
silverup to 150,000
goldup to 250,000
platinumeverything else

Here is the implementation.

CREATE OR REPLACE FUNCTION medallion_status(
    paycheck_dd varchar, estimated_income double, fico integer)
  RETURNS varchar
  BEGIN 
    DECLARE m_score double DEFAULT 1.00;
    SET m_score = IF(paycheck_dd = 'Y', 50000.00, 0.00) +
                  estimated_income + (fico * 100);
    IF m_score < 75000.01 THEN 
      RETURN 'bronze';
    ELSEIF m_score < 150000.01 THEN 
      RETURN 'silver';
    ELSEIF m_score < 250000.01 THEN 
      RETURN 'gold';
    ELSE 
      RETURN 'platinum';
    END IF;
    RETURN null;
  END;

Run a quick smoke-test.

SELECT paycheck_dd, estimated_income, fico,  
       medallion_status(paycheck_dd, estimated_income, fico)
  FROM sample.burstbank.customer LIMIT 5;

Let’s do a quick aggregation on the newly created medallion status to verify it makes sense across the full customer base.

WITH incl_medallion AS (
SELECT paycheck_dd, estimated_income, fico, state,
       medallion_status(paycheck_dd, estimated_income, fico) 
         AS medallion_status
  FROM sample.burstbank.customer
)
SELECT medallion_status, 
       round(avg(estimated_income),2) AS avg_income,
       round(avg(fico),0) AS avg_fico
  FROM incl_medallion
 GROUP BY medallion_status
 ORDER BY avg_income desc;

Yep, it looks good from here! Plus, we can abstract away the full user-defined function (UDF) so it doesn’t make our query so convoluted AND we can reuse it easily when needed. Cool beans!

Published by lestermartin

Developer advocate, trainer, blogger, and data engineer focused on data lake & streaming frameworks including Trino, Hive, Spark, Flink, Kafka and NiFi.

Leave a Reply

Discover more from Lester Martin (l11n)

Subscribe now to keep reading and get access to the full archive.

Continue reading