finally checking out chatgpt (adding a new tool in my toolbelt)


YES… EVERYONE has been playing with (using?) ChatGPT for a good while now, but I finally decided it was my turn. I’ll leave out all the talk that my daughter and I share with so many others about where A.I. might lead. Reading Robopocalypse did a number on me!

I will say, though, that A.I. Artificial Intelligence is a cool movie. Heck, since we’re talking movies now, I think Bicentennial Man is my favorite A.I. movie, but I digress… 😉

I threw ChatGPT a softball to get it started. I asked, “What is the difference between Presto and Trino?” and got the following reply.

Presto and Trino are both distributed SQL query engines designed for interactive analytic queries on large-scale data sets. They share a common history and originate from the same codebase, but they have diverged in certain aspects due to different development and community support. Here are the key differences between Presto and Trino:

  1. Naming and Ownership: In December 2018, a major split occurred within the Presto community, resulting in the formation of two separate projects. One group of developers continued to maintain and enhance the original Presto project, while another group created a fork called Trino. As a result, Trino is essentially a community-driven fork of Presto.
  2. Governance and Community: Trino has a more community-driven development model compared to Presto. It has an open governance model where contributors from various organizations collaborate to develop and maintain the project. Trino has its own website, documentation, and community resources, distinct from the original Presto project.
  3. Branding and Trademarks: Due to the split, the original Presto project retains the rights to the Presto name and trademarks. The Trino project had to rebrand itself to avoid trademark conflicts and legal issues. Consequently, the project is now called “Trino” officially, and its codebase and associated artifacts reflect this new branding.
  4. Development Pace and Features: Both Presto and Trino continue to evolve independently, with each project introducing its own enhancements and features. Trino tends to have a faster release cycle and may introduce new features and improvements more rapidly. However, both projects strive to maintain compatibility with each other, allowing users to switch between them with minimal code changes.
  5. Compatibility: While Presto and Trino are separate projects, they remain largely compatible at the API and query level. The SQL syntax and query semantics are largely the same between the two engines. This compatibility enables users to migrate between Presto and Trino with relative ease, depending on their specific needs and preferences.

In summary, Presto and Trino are similar in functionality and design, but they have different development models, branding, and community support. Both projects continue to evolve independently, and users can choose between them based on their specific requirements and the development pace that aligns with their needs.

ChatGPT

Pretty solid first-pass response to this question I hear often from my students. So I went a bit deeper into the logical follow-up question for those interested in Starburst. I asked, “What is the difference between Trino and Starburst?”

Trino and Starburst are two distinct projects in the realm of distributed SQL query engines, but they are closely related. Here’s an overview of the differences between Trino and Starburst:

  1. Origins and Codebase: Trino, formerly known as PrestoSQL, is an open-source project that originated as a fork of the original Presto project. It focuses on providing a fast and scalable SQL query engine for big data analytics. Starburst, on the other hand, is a company that offers a commercial distribution of Trino called “Starburst Enterprise for Presto.” Starburst builds upon the Trino open-source codebase and adds enterprise-grade features, support, and additional tooling.
  2. Licensing and Support: Trino is an open-source project released under the Apache License 2.0. It has an open and community-driven development model, with contributions from various organizations and individuals. Starburst, as a commercial offering, provides enterprise-level support and additional features beyond what is available in the open-source Trino project. This includes extended security, data connectors, administration tools, and more.
  3. Enterprise Features: Starburst focuses on delivering enterprise-grade features to enhance Trino for production deployments. These features can include advanced security and authentication options, fine-grained access controls, data federation capabilities, query optimization enhancements, and integration with other enterprise systems and tools. These additions cater to the specific needs of organizations that require robust, scalable, and secure query engines.
  4. Support and Maintenance: Starburst provides commercial support and maintenance services for Trino deployments through its Starburst Enterprise for Presto offering. This includes assistance with installation, configuration, troubleshooting, performance optimization, and ongoing support to meet the needs of enterprise customers. Trino, being an open-source project, relies on community support through forums, mailing lists, and online resources.
  5. Ecosystem and Partnerships: Starburst maintains partnerships with various technology vendors and cloud providers to offer seamless integration and compatibility with their offerings. They work to ensure that Trino, through Starburst Enterprise, can connect with a wide range of data sources, data lakes, and analytics tools. These partnerships facilitate a broader ecosystem and interoperability for customers using Trino in their data infrastructure.

In summary, Trino is the open-source project that provides a distributed SQL query engine, while Starburst Enterprise for Presto is a commercial distribution of Trino with additional enterprise features, support, and tooling. Starburst focuses on enhancing Trino for enterprise use cases, providing additional functionality, and offering commercial support and maintenance services.

ChatGPT

Again, decent reply for sure. It did have some old branding names (i.e. it needs to remove “for Presto” from the end of “Starburst Enterprise”). It didn’t mention Starburst Galaxy — shame shame! Let’s give it another chance. I asked, “What are the key differences between Starburst Enterprise and Starburst Galaxy?”

Starburst Enterprise and Starburst Galaxy are two distinct offerings provided by the Starburst Data company. Here are the key differences between Starburst Enterprise and Starburst Galaxy:

  1. Starburst Enterprise: Starburst Enterprise is a commercial distribution of the open-source Trino (formerly PrestoSQL) query engine. It enhances Trino with enterprise-grade features, support, and additional tooling. Starburst Enterprise is designed for organizations that require robust, scalable, and secure SQL query engines for their big data analytics needs. It offers advanced security options, fine-grained access controls, data federation capabilities, query optimization enhancements, and integration with other enterprise systems and tools. Starburst Enterprise provides enterprise-level support and maintenance services for Trino deployments.
  2. Starburst Galaxy: Starburst Galaxy, on the other hand, is a fully managed cloud service that offers Trino as a service. It provides a simplified and managed environment for running Trino-based analytics workloads without the need for infrastructure setup, management, or maintenance. Starburst Galaxy is designed to offer the benefits of Trino’s high-performance analytics capabilities with the ease and convenience of a cloud-native platform. With Starburst Galaxy, users can focus on their analytics tasks and data exploration without the burden of infrastructure management. It offers scalability, elasticity, and integration with popular cloud storage platforms, making it suitable for organizations that prefer a fully managed analytics service in the cloud.
  3. Deployment and Management: Starburst Enterprise is typically deployed on-premises or in private cloud environments, where organizations have control over their infrastructure and data. It provides flexibility for customization and integration with existing systems. Starburst Enterprise requires users to set up and manage the infrastructure, deployment, and maintenance of Trino clusters.

In contrast, Starburst Galaxy is a fully managed service that handles infrastructure provisioning, scaling, and maintenance tasks for Trino clusters in the cloud. Users can focus on their analytics tasks without needing to worry about underlying infrastructure management.

  4. Pricing Model: Starburst Enterprise follows a commercial licensing model where customers purchase licenses based on their specific needs and requirements. The pricing typically includes subscription fees for licensing and support services. The pricing details may vary based on factors such as the number of users, data volume, and required features.

Starburst Galaxy operates on a consumption-based pricing model, where users pay for the resources and usage of the managed Trino service. The pricing is typically based on factors such as the number of queries, data scanned, storage usage, and other associated resources consumed.

In summary, Starburst Enterprise is a commercial distribution of Trino with enterprise features and support, while Starburst Galaxy is a fully managed cloud service that offers Trino as a service, handling infrastructure management and maintenance. Organizations can choose between the two based on their preferences for deployment, management, and support options.

ChatGPT

This is really what I was expecting (it can regurgitate website information) and why I’m going to start using ChatGPT alongside my normal Google searching to see how well each does and to decide how often to leverage ChatGPT in day-to-day efforts.

I gave it a chance to stroke my ego with “Are there any famous people named Lester Martin?”

As of my knowledge cutoff in September 2021, there are no widely recognized or famous individuals named Lester Martin. It’s possible that there may be individuals named Lester Martin who have achieved notable accomplishments in their specific fields or have gained recognition within certain communities, but they may not have widespread fame or recognition. It’s important to note that information about individuals can change over time, so it’s always a good idea to check reliable sources for the most up-to-date information.

ChatGPT

Fair and accurate, but sad for the merry band of Lester Martins around the globe. 😉

Okay, let’s be serious again. “Is it smarter to write technical blog posts or write technical books?”

Deciding whether to write technical blog posts or technical books depends on several factors, including your goals, audience, expertise, and available resources. Here are some considerations to help you make a decision:

  1. Scope and Depth: Technical blog posts are typically shorter and more focused on specific topics or problems. They provide concise explanations or tutorials, making them suitable for addressing specific issues, sharing insights, or showcasing your expertise in a particular area. On the other hand, technical books offer a more comprehensive and structured approach, allowing you to delve deeper into a subject, cover multiple aspects, and provide a broader context.
  2. Time and Commitment: Writing a technical book requires a significant investment of time and effort. Books require extensive planning, research, organization, and revision. On the other hand, blog posts can be written more quickly and published more frequently, allowing you to share knowledge and engage with your audience more frequently.
  3. Audience Reach and Engagement: Blog posts offer the advantage of immediacy and easier accessibility to a wider audience. They can be quickly shared, discovered through search engines, and receive feedback and comments from readers. Books, on the other hand, have the potential for greater authority and reach, as they can be published in print or digital formats, distributed through various channels, and have a longer shelf life.
  4. Expertise and Authority: Writing a technical book can establish you as an authority in your field and boost your professional reputation. It showcases your expertise and can open doors to speaking engagements, consulting opportunities, or career advancements. Blog posts can also enhance your reputation, but the impact may be more immediate and focused on specific topics or problems.
  5. Resource Requirements: Publishing a technical book often requires finding a publisher, negotiating contracts, working with editors and designers, and going through a formal publishing process. This can involve significant resources, including time, financial investment, and collaboration with publishing professionals. In contrast, blogging platforms offer a more accessible and cost-effective way to share your knowledge, as they require minimal technical setup and can be self-published.

Ultimately, the decision between writing technical blog posts or books depends on your goals, resources, and preferences. Consider the level of detail and scope you want to cover, the level of commitment you can dedicate, and the audience reach and engagement you desire. You may also find value in a combination of both, where blog posts can complement and promote a more comprehensive book project.

ChatGPT

It nailed it! And, for all those reasons I think I’ll stick to my blog posts!!

One last question to pick up where I started: “How many people are fearful of AI?”

The level of fear or concern regarding artificial intelligence (AI) can vary among individuals and is influenced by various factors such as their knowledge of AI, exposure to AI-related advancements, personal experiences, cultural and societal influences, and media portrayals. It is challenging to provide an exact number or percentage of people who are fearful of AI as it is a subjective and evolving sentiment.

While some individuals express concerns about AI’s potential impact on employment, privacy, ethics, and control, others may embrace its potential benefits and view it as a tool for solving complex problems and enhancing human capabilities. Public opinion surveys and studies conducted on this topic can provide some insights into the general attitudes towards AI, but they may not capture the full spectrum of opinions or accurately represent individual views.

It is worth noting that the perception of AI is not static and can change over time as people gain more understanding and experience with AI technologies. Additionally, efforts to address ethical considerations, transparency, and responsible AI development can help alleviate some of the fears associated with AI.

Overall, while there are individuals who express fear or concerns about AI, it is important to recognize that attitudes towards AI are diverse and can vary significantly among different populations and contexts.

ChatGPT

Okay, but it did NOT say we shouldn’t be fearful and that second-to-last paragraph was NOT reassuring for nut-jobs like me! LOL!!

To wrap up, ChatGPT has found a place in my toolbelt, and I look forward to leveraging it daily. What have your experiences been?

determining # of splits w/trino/starburst/galaxy (hive table format)

Parallel processing engines like Trino/Starburst, Hive, and Spark perform & scale because they can work on multiple “chunks” of the data at the same time. These frameworks use different terminology as shown below, but they GENERALLY mean the same thing.

| Framework          | Term for a “chunk” |
| ------------------ | ------------------ |
| Trino / Starburst  | splits             |
| Hive               | blocks             |
| Spark              | partitions         |

six of one, half a dozen of the other

This article is geared to those who understand, at least conceptually, how Hive and Spark work when consuming data from HDFS and/or object stores. And, of course, those Trino / Starburst folks who don’t know exactly how the number of splits is decided when using the Hive connector.

HDFS uses block sizes (default is 128MB) to control the “chunks” being processed in parallel.

In a nutshell

To help show how “small files” are bad for everyone in the big data space, especially with the Trino / Starburst content I’m currently teaching, I created a couple of tables with the same information, but in different setups.

| Table        | File size | # of files | Notes                           |
| ------------ | --------- | ---------- | ------------------------------- |
| 5min_ingest  | 500KB     | 10,080     | New file created every 5 mins   |
| daily_rollup | 50MB      | 105        | Each day compacted into 3 files |

35 Days of Transactions

Of course, I got a massive performance improvement with the fewer, reasonably-sized files than with the 10,080 small files. That said, the behind-the-scenes breakdown of the number of splits was not exactly what I had imagined and I wanted to find out more.

The 5min_ingest table’s behavior was easy enough to decipher; it had 10,080 splits which aligned to the 10,080 files.

Hive, which has a large amount of overhead in spinning up a JVM to read data, has config properties that allow each process to read more than one of these small files. The Trino worker daemons, by contrast, are always running, which dramatically lowers the scheduling overhead for processing splits (similar to what happens when a Spark program gets, and holds onto, its resources), so Trino doesn’t have the same concern.

Additionally, and I think it is a good thing, Trino does not try to mask this small-files problem, which itself needs to be remedied.

The mystery came when I queried daily_rollup. Ideally, the compaction could have created larger (thus fewer) files each day. I imagined these 105 somewhat reasonably-sized files would become 105 splits (i.e. one/file, as I saw with the small files). I ended up getting 205, which puzzled me. I (correctly) figured that the engine was breaking the files into 2 sections, thus giving me 2 splits/file, but the math didn’t really add up. That should be 210, not 205. My head began to hurt wondering what happened.

Thankfully, someone pointed me in the right direction (thanks, Anne!). Yep, RTFM time again! Nice warning, too!

These 3 properties are what I needed to look at a bit more.

| Property name               | Description | Default value |
| --------------------------- | ----------- | ------------- |
| hive.max-initial-splits     | For each table scan, the coordinator first assigns file sections of up to max-initial-split-size. After max-initial-splits have been assigned, max-split-size is used for the remaining splits. | 200  |
| hive.max-initial-split-size | The size of a single file section assigned to a worker until max-initial-splits have been assigned. Smaller splits result in more parallelism, which gives a boost to smaller queries. | 32MB |
| hive.max-split-size         | The largest size of a single file section assigned to a worker. Smaller splits result in more parallelism and thus can decrease latency, but also have more overhead and increase load on the system. | 64MB |
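
If you want to experiment with these knobs without editing the catalog configuration, Trino exposes corresponding catalog session properties (underscored names, scoped to the catalog — here assumed to be named hive). I haven’t verified the exact property names on every version, so treat this as a sketch and confirm with SHOW SESSION first.

-- assumption: the catalog is named hive and exposes these session properties
SET SESSION hive.max_initial_split_size = '16MB';
SET SESSION hive.max_split_size = '32MB';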

I ran through my two table setups from above and YAY!… IT ALL MADE SENSE! Let’s do the math (with a quick sanity-check query after the list)…

  • 5min_ingest table (10,080 files; 500KB each)
    • Table scan yielded 10,080 files, all smaller than hive.max-initial-split-size, so the first 200 splits were 200 of these files
    • Once the hive.max-initial-splits threshold was reached, each of the remaining 9,880 files (all smaller than the hive.max-split-size limit) was assigned to its own split
    • 200 + 9,880 = 10,080 splits
  • daily_rollup table (105 files; 50MB each)
    • Table scan yielded 105 files, all larger than hive.max-initial-split-size
    • For the first 100 of these 50MB files, the initial 32MB (ref: hive.max-initial-split-size) became one split and the remaining 18MB (which is not larger than hive.max-split-size) became a second split — those 100 × 2 file sections are what hit the 200 hive.max-initial-splits value
    • The remaining 5 files were all under the hive.max-split-size limit, so each was assigned to a single split
    • 200 + 5 = 205 splits
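
If you want to double-check that arithmetic without leaving the Query editor, a throwaway literal-math query (no tables involved) confirms both totals:

-- sanity-check the split counts derived above
SELECT 200 + 9880    AS splits_5min_ingest,   -- 200 initial splits + 9,880 remaining files
       (100 * 2) + 5 AS splits_daily_rollup;  -- 100 files cut into 2 sections, plus 5 whole files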

Now that this all makes sense (aka don’t be too proud of this technological terror you constructed), I’m wondering how Trino & Starburst determine splits with Iceberg!

starburst galaxy’s materialized views (using apache iceberg)

image BORROWED from https://trino.io/episodes/40.html

Materialized views have become available in Starburst Galaxy for catalogs using Great Lakes connectivity. For folks who are NOT already using Starburst Galaxy — come sign up, it’s FREE — especially if you want to exercise the content in this blog post. To test this all out you will need to connect to one of the cloud object stores, too. Feel free to look at my querying data in the cloud post if you need a jump start on that.

Since I’m plugging Starburst already, please check out training opportunities from Starburst Academy where an enhanced version of this information will soon be available. And yes, enough with all of that, let’s DO SOMETHING!

Define Table and Query

Create a couple of schemas in a catalog aligned to your cloud object store.

CREATE SCHEMA mycloud.mv_storage;
CREATE SCHEMA mycloud.mvtest;
USE mycloud.mvtest;

Create an Apache Iceberg table and populate it.

CREATE TABLE dune_characters (
  id integer,
  name varchar(55),
  notes varchar(255)
)
WITH (type = 'iceberg');

INSERT INTO dune_characters 
  (id, name, notes)
VALUES
  (101, 'Leto', 'Ruler of House Atreides'),
  (102, 'Jessica', 'Concubine of the Duke');

Run a very simple query to base a materialized view on.

SELECT name, notes 
  FROM dune_characters;

Create/Load the Materialized View

Create a materialized view with the previous defining query.

CREATE MATERIALIZED VIEW mat_view
WITH (storage_schema = 'mv_storage')
AS SELECT name, notes 
     FROM dune_characters;

The WITH clause is optional, but the default behavior of the Iceberg connector is to write the underlying “storage table” to the same schema. In practice, it makes sense to locate the storage table somewhere else (if only to limit consumer confusion).

If you queried mat_view now, the results would be as expected, but the query would actually access the table in the defining query. The following command will (re)build the underlying storage table.

REFRESH MATERIALIZED VIEW mat_view;

You can now see the dune_characters table and the mat_view materialized view in the mvtest schema as well as the storage table located in the mv_storage schema.
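
If you prefer SQL to clicking around the Cluster explorer, a quick listing should also surface that storage table (its generated name will start with st_):

SHOW TABLES IN mycloud.mv_storage;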

Running a query on the materialized view returns the expected results.

SELECT * FROM mat_view;

To fully verify that the storage table was accessed instead of the original table, click on Query details from the screen shot above and then the Advanced tab under Query ID heading on the page that renders.

As you can see above, the storage table that starts with st_ from the mv_storage schema is being accessed instead of the dune_characters table from the defining query.

Addressing Staleness

Unlike Starburst Enterprise’s Hive connector, there are no automatic or incremental refresh options for the Iceberg connector’s materialized views. This means that the data engineering pipeline that modifies the underlying table should take this into account and run the REFRESH command when most appropriate.

this discussion about staleness is no longer valid as described in the following blog post

BETTER ICEBERG MATERIALIZED VIEWS IN GALAXY (NO STALENESS CHECK)

A common concern with materialized views on the data lake is the fact that they can actually return stale data. The Iceberg-backed materialized views that Starburst Galaxy utilizes are susceptible to this concern UNLESS all tables in the defining query are themselves using the Iceberg table format.

If you are unaware that Iceberg keeps track of versions of the table that can be used for time-travel queries and rollbacks, please check out the Introduction to Apache Iceberg in Trino blog series.

Times like this call for the old RTFM adage.

Refreshing a materialized view also stores the snapshot-ids of all Iceberg tables that are part of the materialized view’s query in the materialized view metadata. When the materialized view is queried, the snapshot-ids are used to check if the data in the storage table is up to date. If the data is outdated, the materialized view behaves like a normal view, and the data is queried directly from the base tables. Detecting outdated data is possible only when the materialized view uses Iceberg tables only, or when it uses mix of Iceberg and non-Iceberg tables but some Iceberg tables are outdated. When the materialized view is based on non-Iceberg tables, querying it can return outdated data, since the connector has no information whether the underlying non-Iceberg tables have changed.

https://docs.starburst.io/latest/connector/iceberg.html#materialized-views

Let’s test it out by adding another row to the table in the defining query and then querying mat_view again.

INSERT INTO dune_characters 
  (id, name, notes)
VALUES
  (103, 'Paul', 'Son of Leto');

SELECT * FROM mat_view;

As you can see from the output, even the new row surfaced. Remember, this only happens because all of the tables in the defining query were created with the Iceberg table format. You can verify the actual table was accessed from the web UI.

A cool feature for sure, but one you need to understand as it wasn’t the behavior I expected initially. Since the underlying Trino Iceberg connector is open-source (and they’re ALWAYS accepting contributions!), I’ve added an item on my potential development page to create an allow_staleness property for scenarios where you might not want this behavior.

Trust, but Verify…

For those of us who like to trust what we hear, but (maybe too) often want to “validate” what we hear from others by just trying it out… you could rerun everything above, but this time create the dune_characters table using WITH (type = 'hive') instead. 😉
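
As a minimal sketch of that variant (using a new table name so the original experiment stays intact):

-- same experiment, but with a Hive-format base table; after inserting a new row,
-- the materialized view should keep returning the old results until the next REFRESH
CREATE TABLE dune_characters_hive (
  id integer,
  name varchar(55),
  notes varchar(255)
)
WITH (type = 'hive');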

What about Starburst Enterprise?

Great news!! This all works on Starburst Enterprise, too, as called out in the Starburst Iceberg connector documentation!

querying starburst / trino from apache superset (in 7 steps)

This post offers up a short list of steps for running Superset using Docker Compose on a Mac and then making a client connection to, and querying data from, a Starburst (Enterprise or Galaxy) cluster.

I think I’ve got it down to 7 (relatively) quick steps!

As always, Starburst Documentation has great instructions for making this client connection. It did seem that maybe one additional command was missing, but it is identified in the very next link of the doc page: Trino Community Broadcast 12.

Now that I have referenced where I got all the information… Let’s Do This!!

Step 1

Ensure Docker will allocate >= 6GB of memory.

Step 2

Clone the repo.

cd repos/apache
git clone https://github.com/apache/superset.git

Step 3

Set up the Trino driver.

cd superset
echo "sqlalchemy-trino" >> ./docker/requirements-local.txt

Step 4

Start Superset.

docker-compose -f docker-compose-non-dev.yml up

Step 5

Log in as user admin with password admin at http://localhost:8088.

Step 6

Create the driver by navigating to Settings (upper-right corner) > selecting Database Connections link from the pull down > clicking on the + DATABASE button > choosing Trino from SUPPORTED DATABASES pulldown > entering an appropriate SQLALCHEMY URI (syntax below) > then clicking on the CONNECT button.

trino://username:password@hostname:port
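
For example, a filled-in URI pointing at a Starburst cluster might look like the following (the credentials and hostname here are entirely made up):

trino://myuser:mypassword@mycluster.example.com:443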

OK… that was really SIX (sub)steps, or maybe six CLICKS? 😉

Step 7

Navigate to the query editor by visiting SQL > SQL Lab and then running a query.
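
Any query your cluster can serve will do. For example, if you connected to Starburst Galaxy and its built-in sample catalog is attached to your cluster, something like this should work:

SELECT nationality, count(*) AS num_astronauts
  FROM sample.demo.astronauts
 GROUP BY nationality
 ORDER BY num_astronauts DESC
 LIMIT 5;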

eliminate rollup’s null confusion (hint: grouping keyword)

This quick blog post is presented to help you distinguish aggregate null from stored null values when using ROLLUP, CUBE, and GROUPING SETS in your SQL analytical queries.

To help visualize the problem, I will focus on ROLLUP. This scenario (and solution) should work with any SQL-compliant database, but I will use Starburst Galaxy for this tutorial. Why? Well, it is FREE (and yes, I work there!!).

Once you get all set up with Starburst Galaxy, find your way to the Query editor, and under the Cluster explorer you should be able to find the astronauts table.

Let’s focus in on a small number of records to calculate our ROLLUP results from.

SELECT year_of_birth, name, 
       military_civilian, occupation
  FROM sample.demo.astronauts
 WHERE nationality = 'U.S.' 
   AND year_of_birth BETWEEN 1936 AND 1937
 ORDER BY nationality, year_of_birth, name, 
          military_civilian, occupation;

Let’s now see a ROLLUP working like we would expect.

SELECT year_of_birth,
       military_civilian,
       COUNT(*) AS nbr_trips
  FROM sample.demo.astronauts
 WHERE nationality = 'U.S.'
   AND year_of_birth BETWEEN 1936 AND 1937
 GROUP BY ROLLUP(year_of_birth, military_civilian)
 ORDER BY year_of_birth, military_civilian;

The totals are identified below.

  • 27 total records
    • 14 records for 1936
    • 13 records for 1937

That use of null works fine until we have a scenario where there are actually null values in the columns being rolled up. We can see this when we change the ROLLUP to function on the occupation column instead of military_civilian.

SELECT year_of_birth,
       occupation,
       COUNT(*) AS nbr_trips
  FROM sample.demo.astronauts
 WHERE nationality = 'U.S.'
   AND year_of_birth BETWEEN 1936 AND 1937
 GROUP BY ROLLUP(year_of_birth, occupation)
 ORDER BY year_of_birth, occupation;

The totals are calculated similarly as before.

  • 27 total records
    • 14 records for 1936
    • 13 records for 1937

NOTE: The confusing thing is that we now have two rows with 1936 & null. When looking at these results (and the original data shown earlier in this post) you can deduce what’s going on. There are two records for 1936 that have null occupation values, and those 2 roll up into the 14 total for 1936.

By adding GROUPING into the query below, for the column in question, we can get some more insight into what is going on.

SELECT year_of_birth,
       occupation,
       GROUPING(occupation) AS occ_tot_ind,
       COUNT(*) AS nbr_trips
  FROM sample.demo.astronauts
 WHERE nationality = 'U.S.'
   AND year_of_birth BETWEEN 1936 AND 1937
 GROUP BY ROLLUP(year_of_birth, occupation)
 ORDER BY year_of_birth, occupation, occ_tot_ind;

If the GROUPING function returns 0, then the null in that row is a regular stored null value. If it returns 1, then the null is the placeholder for the appropriate rollup total.
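
If you want the output to be self-explanatory, one option is to fold that indicator into a label right in the query — a quick sketch using the same table (the label strings are, of course, whatever you prefer):

SELECT year_of_birth,
       CASE
         WHEN GROUPING(occupation) = 1 THEN '(all occupations)'  -- rollup total row
         ELSE coalesce(occupation, '(no occupation recorded)')   -- regular stored null
       END AS occupation_label,
       COUNT(*) AS nbr_trips
  FROM sample.demo.astronauts
 WHERE nationality = 'U.S.'
   AND year_of_birth BETWEEN 1936 AND 1937
 GROUP BY ROLLUP(year_of_birth, occupation)
 ORDER BY year_of_birth, occupation_label;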

querying aviation data in the cloud (leveraging starburst galaxy)

In this blog post we will load some data into a cloud object store, configure a compute cluster that is built upon Trino, and finally explore some sample data with this cloud configuration that separates compute & storage.

NOTE: This post will make the most sense if you already have some schema-on-read knowledge with a tool like Apache Hive (to include concepts like external tables) as well as object store “bucket” creation and security. If this is not you then come on the journey anyway and I’ll try to point you to more information on these topics when appropriate.

Create a Bucket

I am most familiar with AWS, so I went ahead and created a bucket named lestermartin.blog with an empty directory within it called mygalaxy as shown below.

Those more comfortable with Azure or GCP can still follow these steps, but I’m going to let those folks figure out the appropriate physical steps for the logical ones I’m implementing on AWS.

Create a Catalog & Schema

For this exercise we are using Starburst Galaxy. Haven’t used Starburst Galaxy yet? No worries, it is free to sign up and use (that’s what I’m doing!).

Once you are all set up with Starburst Galaxy, click on Catalogs in the left nav and then push the Configure a catalog button that surfaces.

From here, just follow the wizard to select Amazon S3 (or other appropriate object store if you are using this as a general exercise on another cloud provider). I decided on mycloud for the Catalog name.

For Authentication to S3, I chose to use AWS access key. If this is new to you, check out How to create an S3 Bucket and AWS Access Key ID and Secret Access Key for accessing it.

I went down the easy path of using AWS Glue for the Metastore type and configured S3 bucket name and directory to the values I created earlier. I knew I would be leveraging external tables (pointing a table’s metadata to a directory of files managed outside the control of the SQL engine), so I enabled the two switches shown below, too.

While I do want to experiment more with Iceberg, I’m going to save that for future blog posts. I changed the Default table format to be Hive before pressing the Test connection button.

That worked like a dream and I went ahead and clicked on Connect catalog. This then took me to the Set permissions screen and I kept all the defaults since I am the only user and want all those rights. Finally, the Add to cluster screen came up and I attached it to my existing tiny-cluster.

This redirected me to the Query editor and in the Cluster explorer I opened up my cluster and the mycloud catalog underneath it to see there is only the metadata schema.

Lastly, we need to create a schema to contain tables for the data to be explored.

CREATE SCHEMA mycloud.aviation
  WITH (location = 's3://lestermartin.blog/mygalaxy/');

You should now have an empty schema in the Cluster explorer.

Create & Load Tables

For this post, we are going to use some FAA data. Please review FAA Dataset Information for details about the four files/tables we will be using for flights, carriers, airports & planes and for where to download this information from.

The raw data that you will download is in CSV format and has no guarantees of quality. A whole other blog series is brewing in my head on how to use this as an example of building a data engineering pipeline, but for now… I will go ahead and assume (maybe HOPE is a better word) that I can fit this text file format into appropriate data types.

Upload Datasets

I created a subfolder in S3 to align with the aviation schema we identified earlier and then created four subdirectories as identified below.

Place the appropriate CSV file within each of those directories. As an example, plane-data.csv belongs in the raw_plane folder.

Create External Tables

Now we can go back to the Query editor to create the external tables referencing this data loaded in the cloud. Be sure to verify that these new tables can be queried after creating them.

Flight

CREATE TABLE mycloud.aviation.raw_flight (
    month smallint,
    day_of_month smallint,
    day_of_week smallint,
    dep_time smallint,
    arr_time smallint,
    unique_carrier varchar(15),
    flight_number smallint,
    tail_number varchar(15),
    elapsed_time smallint,
    air_time smallint,
    arr_delay smallint,
    dep_delay smallint,
    origination varchar(15),
    destination varchar(15),
    distance smallint,
    taxi_in smallint,
    taxi_out smallint,
    cancelled varchar(15),
    cancellation_code varchar(15),
    diverted varchar(15)
) WITH (
    external_location = 's3a://lestermartin.blog/mygalaxy/aviation/raw_flight',
    format = 'TEXTFILE',
    textfile_field_separator = ','    
);

Carrier

CREATE TABLE mycloud.aviation.raw_carrier (
    code varchar(15),
    description varchar(150)
) WITH (
    external_location = 's3a://lestermartin.blog/mygalaxy/aviation/raw_carrier',
    format = 'TEXTFILE',
    textfile_field_separator = ','    
);

Airport

CREATE TABLE mycloud.aviation.raw_airport (
    code varchar(15),
    description varchar(150),
    city varchar(150),
    state varchar(150),
    country varchar(150),
    lat decimal(10,8),
    lng decimal(11,8)
) WITH (
    external_location = 's3a://lestermartin.blog/mygalaxy/aviation/raw_airport',
    format = 'TEXTFILE',
    textfile_field_separator = ','    
);

Plane

CREATE TABLE mycloud.aviation.raw_plane (
    tail_number varchar(15),
    usage varchar(150),
    manufacturer varchar(150),
    issue_date varchar(150),
    model varchar(150),
    status varchar(150),
    aircraft_type varchar(150),
    engine_type varchar(150),
    year_built smallint
) WITH (
    external_location = 's3a://lestermartin.blog/mygalaxy/aviation/raw_plane',
    format = 'TEXTFILE',
    textfile_field_separator = ','    
);

Analyze the Data

Make sure all four tables are present by checking the Cluster explorer visualization or running the following SQL statement.

SHOW TABLES IN mycloud.aviation;

How many rows are in each table?

SELECT count(*) from mycloud.aviation.raw_TABLE-NAME;
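
For example, for the airports table:

SELECT count(*) from mycloud.aviation.raw_airport;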

What countries are most of the airports located in?

SELECT country, count(*) AS num_airports
  FROM mycloud.aviation.raw_airport
 GROUP BY country
 ORDER BY num_airports DESC;

Ignoring the first record in the results from the prior query, what do the rest of the results indicate? Hint: Remember what I said about data quality. 😉

What are the top 5 airline codes with the most number of flights?

SELECT unique_carrier, count(*) as num_flights
  FROM mycloud.aviation.raw_flight
 GROUP BY unique_carrier 
 ORDER BY num_flights DESC
 LIMIT 5;

Same question, but show the airline carrier’s name.

SELECT c.description, count(*) as num_flights
  FROM mycloud.aviation.raw_flight  f 
  JOIN mycloud.aviation.raw_carrier c
    ON (f.unique_carrier = c.code)
 GROUP BY c.description 
 ORDER BY num_flights DESC
 LIMIT 5;

What are the most common airplane models for flights over 1500 miles?

 SELECT p.model, count(*) as num_flights
   FROM mycloud.aviation.raw_flight f 
   JOIN mycloud.aviation.raw_plane  p
     ON (f.tail_number = p.tail_number)
  WHERE f.distance > 1500
    AND p.model IS NOT NULL
  GROUP BY p.model
  ORDER BY num_flights desc
  LIMIT 10;

We could go at this all day with running all kinds of fun queries, but let’s wrap up the post with a pair of more interesting questions that require the use of window functions.

What is the month over month percentage change of number of flights departing from each airport?

WITH agg_flights AS (
SELECT origination, month, 
       COUNT(*) AS num_flights
  FROM mycloud.aviation.raw_flight
 GROUP BY 1,2
),

change_flights AS (
SELECT origination, month, num_flights,
       LAG(num_flights, 1)
         OVER(PARTITION BY origination
                ORDER BY month ASC) 
           AS num_flights_before
  FROM agg_flights
)

SELECT origination, month, num_flights, num_flights_before,
       ROUND((1.0 * (num_flights - num_flights_before)) / 
             (1.0 * (num_flights_before)), 2)
          AS perc_change
  FROM change_flights;

Determine the top 3 routes departing from each airport.

WITH popular_routes AS (
SELECT origination, destination,
       COUNT(*) AS num_flights
  FROM raw_flight
 GROUP BY 1, 2
),

ranked_routes AS (
SELECT origination, destination,
       ROW_NUMBER() 
         OVER(PARTITION BY origination 
               ORDER BY num_flights DESC) 
           AS rank
  FROM popular_routes
)

SELECT origination, destination, rank
  FROM ranked_routes
  WHERE rank <= 3
  ORDER BY origination, rank;

Like I said… FUN! And in all fairness, I “lifted” these last two queries from a joint Stanford/MIT post.

For more fun, post your questions in the comments and I’ll see if I can answer them. If you feel more generous (and you don’t mind a little shameless self-promotion) then go ahead and present your answers along with your challenging SQL-oriented questions.

hive, trino & spark features (their journeys to sql, performance & durability)

Open-source frameworks for big data analytics have existed for over a decade now and have wide acceptance across all industries. Three of the most popular ones are Apache Hive, Trino, and Apache Spark. There are many aspects of these frameworks that can be compared & contrasted, but I want to focus on the following three features and walk through each framework’s journey to attain all three (spoiler: they all get there!).

  • SQL – Structured Query Language, like it or not, is THE most accepted analysis language for business data with known structure
  • Performance – Obviously, this suggests that we want these SQL queries to run as fast as possible
  • Durability – Many SQL queries & operations take a long time to complete and the feature of durability would ensure that user requests will run to completion; even if there are software/hardware failures

Historical Timeline

Event: Apache Hadoop Surfaces (2006)

Hadoop was released to attack large-scale data analysis tasks that existing technologies either could not process at all, or organizations could not afford to scale those technologies to the needed level. The cluster is a combination of storage (HDFS) and compute (YARN) that allows for an awesome feature called “data locality” which basically means to take the processing to the data instead of the inverse.

Initially, Hadoop developers were only presented with the Java MapReduce API, which did offer data analysis processing abilities with inherent job reliability and durability features. This approach did not offer SQL (Hadoop would quickly gain a SQL abstraction layer), but was focused on guaranteeing a job would complete — regardless of how long it took.

|             | Hadoop (Hive) | Trino | Spark |
| ----------- | ------------- | ----- | ----- |
| SQL         |               |       |       |
| Performance |               |       |       |
| Durability  | 2006          |       |       |

Event: Apache Hive is Created (2010)

Developers at Facebook built Hive, a SQL abstraction layer on top of Hadoop, to get past the Java programmer hurdle with Hadoop. Hive created a component called the metastore that stores the needed information for this schema-on-read data warehouse technology. This metadata for each table includes items such as the column names and types, the file format, and the location of the underlying data files.

Hive is tightly-coupled with Hadoop and ultimately submits MapReduce jobs (although now with the optimized Tez engine) that run in the cluster along with other queries as well as other types of workloads.

|             | Hive | Trino | Spark |
| ----------- | ---- | ----- | ----- |
| SQL         | 2010 |       |       |
| Performance |      |       |       |
| Durability  | 2006 |       |       |

Hive also brought us the ORC file format, but even with this and Tez it wouldn’t be fair to say it has fully checked off the Performance checkbox yet.

Event: PrestoDB (see Trino family tree) Invented for Interactive Queries (2012)

Yep, I referenced PrestoDB in this section’s header and THEN dropped a Commander Bun Bun logo right after it. Here’s a great resource describing The journey from Presto to Trino and Starburst. Armed with that awareness, bear with me as I focus on just saying Trino hereafter.

https://www.starburst.io/blog/the-journey-from-presto-to-trino-and-starburst/

Still over at Facebook, it was determined that Hive was great for long-running analytical queries and for data engineering pipelines, but Trino (fka PrestoSQL) was created to execute fast queries. It did this by having its own cluster of dedicated compute nodes separate from Hadoop and by optimizing how a query could run when data is not spilled to disk in the intermediate steps of a full query.

This improved speed greatly, but at the cost of reliability. If anything went wrong with the execution, an error was returned to the person or process that submitted the query.

|             | Hive | Trino | Spark |
| ----------- | ---- | ----- | ----- |
| SQL         | 2010 | 2012  |       |
| Performance |      | 2012  |       |
| Durability  | 2006 |       |       |

An added benefit of separating compute and storage is that it allowed Trino to use a variety of Connectors to become a single point of access for a variety of data systems, not just a variety of file formats on the data lake. Additionally, federated queries can be run across these connectors, as sketched below.
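
To make that concrete, here is a hypothetical federated join — the catalog, schema, and table names below are all made up for illustration:

-- hypothetical federated join across a PostgreSQL catalog and a data lake catalog
SELECT c.customer_name, sum(o.order_total) AS lifetime_value
  FROM postgres_crm.public.customers c
  JOIN datalake.sales.orders o
    ON c.customer_id = o.customer_id
 GROUP BY c.customer_name;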

Event: Apache Spark Emerges (2014)

The “smart kids” over at UC Berkeley’s AMPLab were enjoying the new abilities to run jobs on Hadoop, but they realized that for recursive processing (such as machine learning algorithms) Hadoop’s inherent sharing-model of resource management was hurting them. The Spark creators did realize that there were existing resource managers, such as Hadoop YARN, and they utilized these existing tools instead of re-inventing them.

They started building Spark (still a MapReduce-style engine) and realized that if they allocated all the resources they needed at the start of a program, and coupled that with in-memory caching options (when the processing really needed to revisit the same immutable data over and over), then they could run jobs 50-100x faster.

For non-recursive (i.e. good old-fashioned data engineering) jobs, the execution could easily be 3-7x faster due to not needing to request resources from parallel task to parallel task, so we can declare it a performance-oriented framework for a variety of workloads.

|             | Hive | Trino | Spark |
| ----------- | ---- | ----- | ----- |
| SQL         | 2010 | 2012  |       |
| Performance |      | 2012  | 2014  |
| Durability  | 2006 |       | 2014  |

At this point, the primary API was focused on the Resilient Distributed Dataset (RDD) which required programming expertise.

Event: Spark Adds SQL Support (2015)

As we know, the data analysis world is fueled by SQL. It didn’t take Spark long to add its DataFrame API, which, in addition to a programmatic API, allows for classical SQL operations. This rounded out the Spark platform regarding the features of SQL, Performance, and Durability.

|             | Hive | Trino | Spark |
| ----------- | ---- | ----- | ----- |
| SQL         | 2010 | 2012  | 2015  |
| Performance |      | 2012  | 2014  |
| Durability  | 2006 |       | 2014  |

Event: LLAP Hits the Scene (2017)

As this whole blog post is a testament that user requirements drive features AND that imitation is the best form of flattery, the Hive community created an optional framework called Live Long and Process (LLAP). It has a lot of sophistication, but I’ll boil it down to the fact that it keeps resources allocated, online, and ready for querying, as well as a shared memory cache across all the nodes those processing resources have been allocated to.

This solution can easily attain sub-second query results on datasets that can fit in the shared cache, and LLAP doesn’t have to coordinate with YARN all the time. While LLAP is an optional element of Hive, it truly does bring the feature of high performance to this stable SQL engine.

|             | Hive | Trino | Spark |
| ----------- | ---- | ----- | ----- |
| SQL         | 2010 | 2012  | 2015  |
| Performance | 2017 | 2012  | 2014  |
| Durability  | 2006 |       | 2014  |

Event: Project Tardigrade (2022)

Over the years that Trino has been fulfilling its role as a fast query processor, users have also been leveraging it in their ETL pipelines, too. While Facebook has been using fault-tolerant execution with Presto for years now, this feature finally came to open-source Trino. This feature-release blog post offers more details.

|             | Hive | Trino | Spark |
| ----------- | ---- | ----- | ----- |
| SQL         | 2010 | 2012  | 2015  |
| Performance | 2017 | 2012  | 2014  |
| Durability  | 2006 | 2022  | 2014  |

Now, as promised, we finally have a full table indicating that all three popular SQL engines satisfy SQL, Performance, AND Durability. This concludes our history lesson for today. 🙂

workplace diversity (embrace it)

I snagged a few images from my workplace diversity training that I loved and wanted to share. Diversity is a fact of life and, IMHO, it only helps us grow in our shared humanity when it is present in all elements of our lives.

In the workplace, we all need to embrace our diversity and see that we can all work together regardless of the things that make us different.

And it is my hope and desire that we can all embrace this diversity in our hearts & minds and do so much more than “work together” — we can LIVE together.

Diversity doesn’t look like anyone. It looks like everyone.

Karen Draper

why i joined starburst (optionality and common sense)

I’ve been meaning for a while now to write a post, after wrapping up my 8-year hortonworks – cloudera adventure, declaring why I joined Starburst. I joined at the beginning of 2022 and was crowned not only our first-ever Educational Engineer (meet me in the comments section to discuss what that means), but also, formally, Employee #300. We all know what 300 focused individuals can do — what a powerful image that is!

On our way to doubling our staff by end of year, we cleared the 400-employee mark in under 4 months. Wow! I figured no better time than right now to share why I made the move. There are many reasons, but so many of them come back to this one word…

the potential for options

To give this term, and how we use it at Starburst, its due justice, I invite you to listen to our CEO on this podcast. In a nutshell, optionality is the ability to adapt (as much, or as little, as you want/need) to the reality that all of your data WILL NOT live in a single tool, database, object store, cloud vendor, etc. for a myriad of reasons. Optionality does not give up on the goal of a single source-of-truth, but concedes the benefits of a single point-of-access. Especially one with an MPP SQL engine, built on Trino, focused on performance.

And… if you are one of those folks who has not accepted that the “perfect vision/implementation” will not ever happen, I invite you to read the book titled recrEAtion to open your eyes. It was a tough read for me, a fella who had leadership roles in 3 different EA teams in my career, but I’m better for it. I digress…

In addition to optionality, there are many other factors that played into this choice. Here are a few of them:

  • Laser-Focused (open-sourced, of course!) Product — we aren’t trying to be all things for everyone; we are the BEST SQL engine to access your data, wherever it lives
  • Multiple Deployment Options — we offer several ways to get going with Starburst as installable software, including deploying on k8s
  • Cloud Strategy — our SaaS offering, Starburst Galaxy, is live and production-ready, but we are cognizant that customers need to pick the right solution for themselves, and we are not pushing customers to move to the cloud if it is not right for them
  • Rational & Reasonable Leadership — we are gritty work-horses (not unicorns because they don’t exist) with leadership, from my manager up to the CEO, who know what it takes to make aggressive AND attainable goals (again, please listen to this podcast)
  • Startup Mindset — one of my favorite things about startups is that you can get a decision made quickly (even if it isn’t the one you wanted) and oh my… I got some sweet stock options to boot!

If I’ve whetted your appetite to give us a look, surf over to the careers page and definitely let me know if I can help in any way.

federated queries on starburst galaxy (long and short videos)

As a follow-up to my querying starburst galaxy from tableau video, the Marketing team here at Starburst Data was as happy as I was to record another video. This time it was to show how easy it is to connect to multiple data sources (of multiple persistent storage types) and then perform a federated query — a single SQL statement that joins across these three technology types.

Remember, there are a BUNCH of connectors available.

Long Version

Short Version