querying aviation data in the cloud (leveraging starburst galaxy)

In this blog post we will load some data into a cloud object store, configure a compute cluster that is built upon Trino, and finally explore some sample data with this cloud configuration that separates compute & storage.

NOTE: This post will make the most sense if you already have some schema-on-read knowledge with a tool like Apache Hive (to include concepts like external tables) as well as object store “bucket” creation and security. If this is not you then come on the journey anyway and I’ll try to point you to more information on these topics when appropriate.

Create a Bucket

I am most familiar with AWS, so I went ahead and created a bucket named lestermartin.blog with an empty directory within it called mygalaxy as shown below.

Those more comfortable with Azure or GCP can still follow these steps, but I’m going to let those folks figure out the appropriate physical steps for the logical ones I’m implementing on AWS.

Create a Catalog & Schema

For this exercise we are using Starburst Galaxy. Haven’t used Starburst Galaxy yet? No worries, it is free to signup and use (that’s what I’m doing!).

Once you are all setup with Starburst Galaxy, click on Catalogs on the left nav and then push the Configure a catalog button that surfaces.

From here, just follow the wizard to select Amazon S3 (or other appropriate object store if you are using this as a general exercise on another cloud provider). I decided on mycloud for the Catalog name.

For Authentication to S3, I chose to use AWS access key. If this is new to you, check out How to create an S3 Bucket and AWS Access Key ID and Secret Access Key for accessing it.

I went down the easy path of using AWS Glue for the Metastore type and configured S3 bucket name and directory to the values I created earlier. I knew I would be leveraging external tables (pointing a table’s metadata to a directory of files managed outside the control of the SQL engine), so I enabled the two switches shown below, too.

While I do want to experiment more with Iceberg, I’m going to save that for future blog posts. I changed the Default table format to be Hive before pressing the Test connection button.

That worked like a dream and I went ahead and clicked on Connect catalog. This then took me to the Set permissions screen and I kept all the defaults since I am the only user and want all those rights. Finally, the Add to cluster screen came up and I attached it to my existing tiny-cluster.

This redirected me to the Query editor and in the Cluster explorer I opened up my cluster and the mycloud schema underneath it to see there is only the metadata schema.

Lastly, we need to create a schema to contain tables for the data to be explored.

CREATE SCHEMA mycloud.aviation
  WITH (location = 's3://lestermartin.blog/mygalaxy/');

You should now have an empty schema in the Cluster explorer.

Create & Load Tables

For this post, we are going to use some FAA data. Please review FAA Dataset Information for details about the four files/tables we will be using for flights, carriers, airports & planes and for where to download this information from.

The raw data that you will download is in CSV format and has no guarantees of quality. A whole other blog series is brewing in my head of how to use this as example of building a data engineering pipeline, but for now… I will go ahead and assume (maybe HOPE is a better word) that I can fit this text file format into appropriate data types.

Upload Datasets

I created a subfolder in S3 to align with the aviation schema we identified earlier and then created four subdirectories as identified below.

Within each of those directories, place the appropriate CSV file within. As an example plane-data.csv belongs in the raw_plane folder.

Create External Tables

Now we can go back to the Query editor to create the external tables referencing this data loaded in the cloud. Be sure to verify that these new tables can be queried after creating them.

Flight

CREATE TABLE mycloud.aviation.raw_flight (
    month smallInt,
    day_of_month smallInt,
    day_of_week smallInt,
    dep_time smallInt,
    arr_time smallInt,
    unique_carrier varchar(15),
    flight_number smallInt,
    tail_number varchar(15),
    elapsed_time smallInt,
    air_time smallInt,
    arr_delay smallInt,
    dep_delay smallInt,
    origination varchar(15),
    destination varchar(15),
    distance smallInt,
    taxi_in smallInt,
    taxi_out smallInt,
    cancelled varchar(15),
    cancellation_code varchar(15),
    diverted varchar(15)
) WITH (
    external_location = 's3a://lestermartin.blog/mygalaxy/aviation/raw_flight',
    format = 'TEXTFILE',
    textfile_field_separator = ','    
);

Carrier

CREATE TABLE mycloud.aviation.raw_carrier (
    code varchar(15),
    description varchar(150)
) WITH (
    external_location = 's3a://lestermartin.blog/mygalaxy/aviation/raw_carrier',
    format = 'TEXTFILE',
    textfile_field_separator = ','    
);

Airport

CREATE TABLE mycloud.aviation.raw_airport (
    code varchar(15),
    description varchar(150),
    city varchar(150),
    state varchar(150),
    country varchar(150),
    lat decimal(10,8),
    lng decimal(11,8)
) WITH (
    external_location = 's3a://lestermartin.blog/mygalaxy/aviation/raw_airport',
    format = 'TEXTFILE',
    textfile_field_separator = ','    
);

Plane

CREATE TABLE mycloud.aviation.raw_plane (
    tail_number varchar(15),
    usage varchar(150),
    manufacturer varchar(150),
    issue_date varchar(150),
    model varchar(150),
    status varchar(150),
    aircraft_type varchar(150),
    engine_type varchar(150),
    year_built smallint
) WITH (
    external_location = 's3a://lestermartin.blog/mygalaxy/aviation/raw_plane',
    format = 'TEXTFILE',
    textfile_field_separator = ','    
);

Analyze the Data

Make sure all four tables are present by checking the Cluster explorer visualization or running the follow SQL statement.

SHOW TABLES IN mycloud.aviation;

How many rows are in each table?

SELECT count(*) from mycloud.aviation.raw_TABLE-NAME;

What countries are most of the airports located in?

SELECT country, count() AS num_airports
  FROM mycloud.aviation.raw_airport
 GROUP BY country
 ORDER BY num_airports DESC;

Ignoring the first record in the results from the prior query, what doe the rest of the results indicate? Hint: Remember what I said about data quality. 😉

What are the top 5 airline codes with the most number of flights?

SELECT unique_carrier, count() as num_flights
  FROM mycloud.aviation.raw_flight
 GROUP BY unique_carrier 
 ORDER BY num_flights DESC
 LIMIT 5;

Same question, but show the airline carrier’s name.

SELECT c.description, count() as num_flights
  FROM mycloud.aviation.raw_flight  f 
  JOIN mycloud.aviation.raw_carrier c
    ON (f.unique_carrier = c.code)
 GROUP BY c.description 
 ORDER BY num_flights DESC
 LIMIT 5;

What are the most common airplane models for flights over 1500 miles?

 SELECT p.model, count() as num_flights
   FROM mycloud.aviation.raw_flight f 
   JOIN mycloud.aviation.raw_plane  p
     ON (f.tail_number = p.tail_number)
  WHERE f.distance > 1500
    AND p.model IS NOT NULL
  GROUP BY p.model
  ORDER BY num_flights desc
  LIMIT 10;

We could go at this all day with running all kinds of fun queries, but let’s wrap up the post with a pair of more interesting questions that require the use of window functions.

What is the month over month percentage change of number of flights departing from each airport?

WITH agg_flights AS (
SELECT origination, month, 
       COUNT(*) AS num_flights
  FROM mycloud.aviation.raw_flight
 GROUP BY 1,2
),

change_flights AS (
SELECT origination, month, num_flights,
       LAG(num_flights, 1)
         OVER(PARTITION BY origination
                ORDER BY month ASC) 
           AS num_flights_before
  FROM agg_flights
)

SELECT origination, month, num_flights, num_flights_before,
       ROUND((1.0 * (num_flights - num_flights_before)) / 
             (1.0 * (num_flights_before)), 2)
          AS perc_change
  FROM change_flights;

Determine the top 3 routes departing from each airport.

WITH popular_routes AS (
SELECT origination, destination,
       COUNT(*) AS num_flights
  FROM raw_flight
 GROUP BY 1, 2
),

ranked_routes AS (
SELECT origination, destination,
       ROW_NUMBER() 
         OVER(PARTITION BY origination 
               ORDER BY num_flights DESC) 
           AS rank
  FROM popular_routes
)

SELECT origination, destination, rank
  FROM ranked_routes
  WHERE rank <= 3
  ORDER BY origination, rank;

Like I said… FUN! And in all fairness, I “lifted” these last two queries from a joint Stanford/MIT post.

For more fun, post your questions in the comments and I’ll see if I can answer them. If you feel more generous (and your don’t mind a little shameless self-promotion) then go ahead and present your answers along with your challenging SQL-oriented questions.

querying aviation data in the cloud (leveraging starburst galaxy)

Create a Bucket

Create a Catalog & Schema

Create & Load Tables

Upload Datasets

Create External Tables

Flight

Carrier

Airport

Plane

Analyze the Data

Like this:

Related

Published by lestermartin

One thought on “querying aviation data in the cloud (leveraging starburst galaxy)”

Leave a ReplyCancel reply

Create a Bucket

Create a Catalog & Schema

Create & Load Tables

Upload Datasets

Create External Tables

Flight

Carrier

Airport

Plane

Analyze the Data

Share this:

Like this:

Related

Published by lestermartin

One thought on “querying aviation data in the cloud (leveraging starburst galaxy)”

Leave a ReplyCancel reply

Discover more from Lester Martin (l11n)