apache spark (yet another overview)

Yes, it is early 2025 and Apache Spark has been around “some time”, but there is always an audience out there that can benefit from a good overview. I thought I’d take a crack at it myself after trying to find a good deck I could use to introduce Spark to an audience I’m speaking to. Let’s get into it!

From 20,000 feet

Why was Spark created in the first place?

TL;DR: Data scientists were excited when machine learning (ML) algorithms started becoming available on Apache Hadoop via frameworks such as Apache Mahout, but they didn’t like waiting for these jobs to finish. A smart fella at UC Berkeley created a map/reduce engine that used long-lived worker processes and featured dataset caching. Oh, and they ported over all the major ML algorithms, too.

Spark in a nutshell

Spark is a general-purpose data processing engine. It is designed to handle very large amounts of data and it distributes processing across a cluster. It supports a wide range of workload types such as ML, querying structured data, batch processing, streaming ingestion, and graph computations.

No real surprises in the types of use cases it can tackle: transformations, aggregations & windowing, interactive & batch data analysis, text mining, index building, pattern recognition, graph creation & analysis, sentiment analysis, prediction models, recommendation systems, fraud detection, log processing, event detection, and more.

Data sources

Spark can read from a variety of sources; data lakes are the sweet spot.

Ecosystem

The top layer of the diagram below shows that Spark is generally accessed via APIs for a select set of programming languages. These APIs can access the initial (and underlying core) data representation known as a Resilient Distributed Dataset (RDD). For structured data, Spark created the Dataframe API (also commonly referred to as Spark SQL) that layers on top of the RDD API.

Additional APIs such as Structured Streaming and MLlib use the Dataframe API so that programmers can quickly transition to different workload types without learning a whole new framework.

As the diagram above shows, Spark requires a resource management framework such as Kubernetes or Hadoop’s YARN at runtime.

More details

Check out the following video for more details at the 20,000 foot level.

On the surface

Developer tools

While formalizing code into production systems usually involves typical SDLC tools such as IDEs, source code control, CI/CD tooling, and the like, much of the exploratory work done by data engineers, and often most of the work done by data analysts & scientists, happens in the shell or the friendlier notebook experience.

RDD basics

An RDD is a collection of unstructured items, usually of the same type, that is composed of multiple pieces referred to as partitions. These partitions are distributed across a cluster to enable operations to execute in parallel. NOTE: the items in the collection are very often structured, but the API isn’t directly aware of this.
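As a tiny sketch of that idea (assuming the PySpark shell, where the SparkContext is already available as sc), you can create an RDD and peek at how it gets split into partitions:

# 'sc' is the SparkContext the PySpark shell provides automatically
nums = sc.parallelize(range(100), 4)   # ask for 4 partitions

nums.getNumPartitions()   # -> 4
nums.glom().collect()     # returns one list of items per partition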

Functional programming

Spark datasets are operated against using an approach called functional programming. Here are a few key points from that paradigm.

  • Immutable datasets – you can’t change them, but you can create new ones
  • Functions are deterministic and can accept functions as arguments (even anonymous functions)
  • Lazy execution – the system waits as long as possible before any real work begins, so every operation falls into one of two groups
    • Transformations – lazy operations that return a new RDD from input RDDs
    • Actions – non-lazy operations that trigger the actual execution (for example, returning results or writing output)

Example code

Given a text file called some_words.txt

these are words
these are more words
words in english

Calculate “word count” totals via transformations

rdd1 = sc.textFile("some_words.txt")                # 'sc' is the SparkContext the PySpark shell provides

rdd2 = rdd1.flatMap(lambda line: line.split(" "))   # one element per word

rdd3 = rdd2.map(lambda word: (word, 1))             # pair each word with a count of 1

rdd4 = rdd3.reduceByKey(lambda a, b: a + b)         # sum the counts for each word

Fire off an action

rdd4.collect()

Displays the totals

[('these', 2),
 ('are', 2),
 ('more', 1),
 ('in', 1),
 ('words', 3),
 ('english', 1)]

NOTE: you can rewrite this to chain the methods together…

counts = (
  sc.textFile("some_words.txt")
  .flatMap(lambda line: line.split(" "))
  .map(lambda word: (word, 1))
  .reduceByKey(lambda a, b: a + b)
)
counts.collect()

BTW, it’s OK if your head hurts; know that I’m purposely NOT trying to explain the code above in any detail.

Visualizing the flow

This next diagram attempts to visualize the immutable datasets and lazy execution concepts defined earlier.

Spark SQL

The RDD API can feel rather primitive, especially when working with structured datasets. Fortunately, the Spark SQL / Dataframe API offers a richer, more convenient approach. Think of this model as RDD + Schema = Dataframe.

Spark SQL allows for native integrations to schema-enabled file formats, databases, and metastores (such as Hive’s HMS). Additionally, a SQL optimizer is in place which allows the programmer to focus on WHAT needs to get done, not specifically HOW to do it. Note: the optimizer creates & runs RDD code under the covers.

Example code

Here is a simple example using the Dataframe API

tli = session.table("tpch.tiny.lineitem")   # 'session' is an existing SparkSession
(tli.select("orderkey", "linenumber", "quantity",
            "extendedprice", "linestatus")
    .filter("orderkey <= 5")
    .sort("orderkey", "linenumber")
    .show(50))

You can use the sql() function to change the above code to simply use SQL

tli_sql = session.sql("""
    SELECT orderkey, linenumber, quantity,
           extendedprice, linestatus
      FROM tpch.tiny.lineitem
     WHERE orderkey <= 5
     ORDER BY orderkey, linenumber
""")
tli_sql.show(50)

With the Dataframe API you can mix & match the descriptive methods such as filter() & sort() with sql() calls that use well-understood SQL.
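As a rough sketch of that mixing (reusing the same session and table assumed above; the quantity > 30 filter is just made up for illustration), you might start with SQL and keep refining with Dataframe methods:

big_items = session.sql("SELECT orderkey, linenumber, quantity FROM tpch.tiny.lineitem")
(big_items
  .filter("quantity > 30")
  .sort("orderkey", "linenumber")
  .show(10))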

Powered by Dataframes

As mentioned back in the ecosystem section, there are other cool things you can do with Dataframes.

MLlib

MLlib is Spark’s ML library. Its goal is to make practical machine learning scalable and easy. At a high level, it provides tools such as the following (a short sketch follows the list):

  • ML algorithms
  • Featurization
  • Pipelines
  • Persistence
  • Utilities
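As promised, here is a minimal, hypothetical sketch of an MLlib pipeline built on Dataframes; the Dataframe df and its x1, x2, and label columns are made up for illustration, not something from this post:

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

# hypothetical Dataframe 'df' with numeric columns 'x1' & 'x2' and a 'label' column
assembler = VectorAssembler(inputCols=["x1", "x2"], outputCol="features")   # featurization
lr = LogisticRegression(featuresCol="features", labelCol="label")           # ML algorithm

pipeline = Pipeline(stages=[assembler, lr])        # chain featurization + algorithm into a pipeline
model = pipeline.fit(df)                           # training is distributed across the cluster
model.write().overwrite().save("/tmp/lr_model")    # persistence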

Structured Streaming

Utilizing a micro-batch architecture, you can express your streaming computation the same way you would express a batch computation on static data (a short sketch follows the list). Features include:

  • Exactly-once processing
  • Streaming aggregations
  • Event-time windows
  • Stream-to-batch joins
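And here is the promised streaming sketch, assuming the same session from earlier, Spark’s built-in rate source (it just generates timestamped rows), and the console sink for the running counts:

from pyspark.sql.functions import window

stream = session.readStream.format("rate").option("rowsPerSecond", 10).load()

# a streaming aggregation over 1-minute event-time windows
windowed_counts = stream.groupBy(window("timestamp", "1 minute")).count()

query = (windowed_counts
  .writeStream
  .outputMode("complete")
  .format("console")
  .start())
# query.awaitTermination() would block until the stream is stopped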

More details

Please watch the following video for more about RDDs and Dataframes.

Below the waterline

Now that we’ve seen Spark from an airplane window AND from the programming APIs, let’s see what happens under the covers.

Revisiting partitions

RDDs & Dataframes are composed of multiple partitions distributed across a cluster of machines to enable parallel operations.

Operations (lazy or not) are classified as either:

  • Narrow – operations that can be pipelined together and run independently on each partition
  • Wide – functions that require the existing partitions to be reshuffled before beginning (this is expen$ive)
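You can actually see this classification in an RDD’s lineage. Reusing rdd4 from the word-count example, toDebugString() shows the narrow operations pipelined together and marks where the shuffle (wide) boundary falls:

# flatMap() & map() are pipelined (narrow); reduceByKey() forces a shuffle (wide)
print(rdd4.toDebugString().decode())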

Function characteristics

Every function carries two independent characteristics, as shown in the table below.

  • Categorized as a transformation or an action
  • Parallelization classification of narrow or wide
Function         Category    Parallel   Description
map()            Transform   Narrow     Creates a new RDD by running the passed-in function on each element
collect()        Action      Wide       Returns all the elements of the dataset to the driver as an array
filter()         Transform   Narrow     Creates a new RDD containing only the source elements for which the passed-in function returns true
reduceByKey()    Transform   Wide       Operates much like a GROUP BY query and returns only one element in the new RDD for a given key
saveAs...()      Action      Narrow     Writes the RDD to a repository in parallel, with each partition stored as an independent file

Visualizing partitioning

Job elements

Jobs, or queries when leveraging the Dataframe API, are broken down into the elements listed below and visualized using the same diagrams we have been using.

  • Task – a series of narrow operations that work on the same partition and are pipelined together
  • Stage – grouping of tasks that can run in parallel on different partitions of the same dataset (stage boundaries occur where the dataset must be reshuffled)
  • Job – consists of all the stages that are required when an action is performed
  • Query – made up of one or more jobs required for a Spark SQL statement to run

A job is a Directed Acyclic Graph (DAG) of all the activities needed to run it, organized with the dependencies & relationships that indicate how they should run. The DAG can also be represented as a query plan, which can be obtained programmatically via the explain() method as well as via the Spark monitoring tools.
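For instance, reusing the tli Dataframe from earlier, a quick sketch of pulling the query plan programmatically:

# show the parsed, analyzed, optimized, and physical plans (not just the physical one)
(tli.select("orderkey", "linenumber", "quantity")
  .filter("orderkey <= 5")
  .explain(True))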

Cluster execution

A job utilizes multiple nodes at execution time and can run under any of the deployment options (such as YARN or Kubernetes) mentioned earlier.

  • A single driver process is the coordinator for a given job
  • The driver schedules tasks to be run on executor processes
  • More than one process may run on a single node

While outside the scope of this overview post, here are some additional thoughts on cluster execution.

  • When a job starts, it gains access to resources to run its executors and generally will not release them until it is complete.
  • A new stage can’t start until the stage(s) it depends on have finished.
  • Spark allows for Adaptive Query Execution (AQE), which re-optimizes the plan at the end of a stage, thereby allowing it to make better decisions such as how many partitions the shuffling effort should produce.
  • Spark tries to be efficient, but can (and often does) spill intermediate results to disk.
  • RDD and Dataframe APIs support explicit caching methods with multiple persistence options.
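To make those last two bullets concrete, here is a small sketch reusing the tli Dataframe; the configuration property is a standard Spark setting (AQE is already on by default in recent releases):

from pyspark import StorageLevel

session.conf.set("spark.sql.adaptive.enabled", "true")   # Adaptive Query Execution

tli.persist(StorageLevel.MEMORY_AND_DISK)   # pick an explicit persistence option
tli.count()                                 # an action forces the cache to actually be populated
tli.unpersist()                             # release it when done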

More details

The final video provides more under-the-covers details.


There is a world of information available about Spark on the internet to continue your learning journey, but if you liked my post feel free to check out other related articles here on my blog as well as my legacy blog.

