
Yes, it is early 2025 and Apache Spark has been around “some time”, but there is always an audience out there that can benefit from a good overview. I thought I’d take my own crack at it after trying to find a good deck I could use for introducing Spark to an audience I’m speaking to. Let’s get into it!
From 20,000 feet
Why was Spark created in the first place?
TLDR; Data scientists were excited when machine learning (ML) algorithms started becoming available on Apache Hadoop via frameworks such as Apache Mahout, but they didn’t like waiting for these jobs to finish. A smart fella at UC Berkeley created a map/reduce engine that used long-lived worker processes and featured dataset caching. Oh, they ported over all the major ML algorithms, too.
Spark in a nutshell
Spark is a general-purpose data processing engine. It is designed to handle very large amounts of data and it distributes processing across a cluster. It supports a wide range of workload types such as ML, querying structured data, batch processing, streaming ingestion, and graph computations.
No real surprises in the types of use cases it can tackle: transformations, aggregations & windowing, interactive & batch data analysis, text mining, index building, pattern recognition, graph creation & analysis, sentiment analysis, prediction models, recommendation systems, fraud detection, log processing, event detection, and more.
Data sources
Can read from a variety of sources.

Data lakes are the sweet spot.

Ecosystem
The top layer of the diagram below shows that Spark is generally accessed via APIs for a select set of programming languages. These APIs can access the original, and still underlying, core data representation known as the Resilient Distributed Dataset (RDD). For structured data, Spark created the Dataframe API (also commonly referred to as Spark SQL) that layers on top of the RDD API.
Additional APIs such as Structured Streaming and MLLib use the Dataframe API so that programmers can quickly transition to different workload types without learning a whole new framework.

As the diagram above shows, Spark requires a resource management framework such as Kubernetes or Hadoop’s YARN at runtime.
More details
Check out the following video for more details at the 20,000 foot level.
On the surface
Developer tools
While formalizing code into production systems usually involves typical SDLC tools such as IDEs, source code control, CI/CD tooling, and the like, much of the exploratory work done by data engineers, and often most of the work done by data analysts & scientists, happens in the shell or the friendlier notebook experience.


RDD basics
An RDD is a collection of unstructured items, usually of the same type, that is composed of multiple pieces referred to as partitions. These partitions are distributed across a cluster to enable operations to execute in parallel. NOTE: the items in the collection are very often structured, but the API isn’t directly aware of this.
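To make the partition idea concrete, here is a tiny sketch, assuming a PySpark shell where the SparkContext is available as sc:
rdd = sc.parallelize(range(100), numSlices=4)   # ask Spark to split the data into 4 partitions
print(rdd.getNumPartitions())                   # 4
print(rdd.glom().collect())                     # the elements, grouped by the partition they live in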
Functional programming
Spark datasets are operated against using an approach called functional programming. Here are a few key points from that paradigm.
- Immutable datasets – you can’t change them, but you can create new ones
- Functions are deterministic and can accept functions as arguments (even anonymous functions)
- Lazy execution – the system waits as long as possible before real work begins; every operation is classified as either lazy or eager
- Transformations – lazy operations that return a new RDD from input RDDs
- Actions – eager operations that trigger execution of the pending transformations and produce output (results returned to the driver or data written to storage)
Example code
Given a text file called some_words.txt
these are words
these are more words
words in english
Calculate “word count” totals via transformations
rdd1 = sc.textFile("some_words.txt")                # sc is the active SparkContext
rdd2 = rdd1.flatMap(lambda line: line.split(" "))   # one element per word
rdd3 = rdd2.map(lambda word: (word, 1))             # pair each word with a count of 1
rdd4 = rdd3.reduceByKey(lambda a, b: a + b)         # sum the counts for each word
Fire off an action
rdd4.collect()
Displays the totals
[('these', 2),
('are', 2),
('more', 1),
('in', 1),
('words', 3),
('english', 1)]
NOTE: you can rewrite to chain methods…
counts = (
    sc.textFile("some_words.txt")
    .flatMap(lambda line: line.split(" "))
    .map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b)
)
counts.collect()
BTW, it’s OK if your head hurts; know that I’m purposely NOT trying to explain the code above in any detail.
Visualizing the flow
This next diagram attempts to visualize the immutable datasets and lazy execution concepts defined earlier.
Spark SQL
The RDD API can feel rather primitive, especially when working with structured datasets. Fortunately, the Spark SQL / Dataframe API offers a friendlier, higher-level approach. Think of this model as RDD + Schema = Dataframe.
Spark SQL allows for native integrations to schema-enabled file formats, databases, and metastores (such as Hive’s HMS). Additionally, a SQL optimizer is in place which allows the programmer to focus on WHAT needs to get done, not specifically HOW to do it. Note: the optimizer creates & runs RDD code under the covers.
Example code
Here is a simple example using the Dataframe API
tli = session.table("tpch.tiny.lineitem")   # session is the active SparkSession
(tli.select("orderkey", "linenumber", "quantity",
            "extendedprice", "linestatus")
    .filter("orderkey <= 5")
    .sort("orderkey", "linenumber")
    .show(50))
You can use the sql() function to change the above code to simply use SQL
tli_sql = session.sql("""
    SELECT orderkey, linenumber, quantity,
           extendedprice, linestatus
    FROM tpch.tiny.lineitem
    WHERE orderkey <= 5
    ORDER BY orderkey, linenumber""")
tli_sql.show(50)
With the Dataframe API you can mix & match the descriptive methods such as filter() & sort() with sql() calls that use well-understood SQL, as the sketch below shows.
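For instance, here is a small sketch mixing the two styles against the same table (still assuming the session and tpch.tiny.lineitem objects used above):
mixed = (session.sql("SELECT orderkey, linenumber, quantity FROM tpch.tiny.lineitem")
    .filter("quantity > 30")                 # Dataframe method layered on a sql() result
    .sort("orderkey", "linenumber"))
mixed.show(10)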

Powered by Dataframes
As mentioned back in the ecosystem section, there are other cool things you can do with Dataframes.
MLLib
MLLib is Spark’s ML library. Its goal is to make practical machine learning scalable and easy. At a high level, it provides tools such as the following (sketched briefly after the list):
- ML algorithms
- Featurization
- Pipelines
- Persistence
- Utilities
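As a taste, here is a minimal sketch of an MLLib pipeline; the train_df Dataframe and its f1, f2 & label columns are made-up placeholders, and session is the same SparkSession assumed earlier:
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

train_df = session.createDataFrame(                       # tiny made-up training set
    [(1.0, 0.5, 0.0), (2.0, 1.5, 1.0), (3.0, 2.5, 1.0)],
    ["f1", "f2", "label"])

assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")   # featurization
lr = LogisticRegression(featuresCol="features", labelCol="label")           # ML algorithm
pipeline = Pipeline(stages=[assembler, lr])                                 # pipeline

model = pipeline.fit(train_df)                    # train the whole pipeline
model.write().overwrite().save("/tmp/lr_model")   # persistence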
Structured Streaming
Utilizing a micro-batch architecture, you can express your streaming computation the same way you would express a batch computation on static data (see the sketch after this list). Features include:
- Exactly-once processing
- Streaming aggregations
- Event-time windows
- Stream-to-batch joins
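Here is a minimal sketch of that idea; the /data/events directory of JSON files is a made-up source, and session is the same SparkSession assumed earlier:
from pyspark.sql.functions import window
from pyspark.sql.types import StructType, StructField, TimestampType, StringType

event_schema = StructType([
    StructField("event_time", TimestampType()),
    StructField("user", StringType()),
])

events = (session.readStream
    .schema(event_schema)
    .json("/data/events"))                       # hypothetical folder of JSON event files

counts = (events
    .withWatermark("event_time", "10 minutes")   # tolerate late-arriving events
    .groupBy(window("event_time", "5 minutes"))  # event-time windows
    .count())                                    # streaming aggregation

query = (counts.writeStream
    .outputMode("update")
    .format("console")                           # print each micro-batch of updates
    .start())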
More details
Please watch the following video for more about RDDs and Dataframes.
Below the waterline
Now that we’ve seen Spark from an airplane window AND from the programming APIs, let’s see what happens under the covers.
Revisiting partitions
RDDs & Dataframes are composed of multiple partitions distributed across a cluster of machines to enable parallel operations.
Operations (lazy or not) are classified as either:
- Narrow – operations that can be pipelined together and run independently on each partition
- Wide – functions that require the existing partitions to be reshuffled before beginning (this is expen$ive)
Function characteristics
Every function carries two independent characteristics, as shown in the table below.
- Category – it is either a transformation or an action
- Parallelization – it is classified as either narrow or wide
| Function | Description | Category | Parallel |
| --- | --- | --- | --- |
| map() | Creates a new RDD by looping through each element and applying the passed-in function | Transform | Narrow |
| collect() | Returns all the elements of the dataset as an array to the driver program | Action | Wide |
| filter() | Creates a new RDD by selecting those elements of the source on which the passed-in function returns true | Transform | Narrow |
| reduceByKey() | Operates much like a GROUP BY query and returns only one element in the newly created RDD for a given key | Transform | Wide |
| saveAs...() | Writes the RDD to a repository in parallel, with each partition stored as an independent file | Action | Narrow |
Visualizing partitioning
Job elements
Jobs, or queries when leveraging the Dataframe API, are broken down into the elements listed below and visualized using the same diagrams we have been using.
- Task – a series of narrow operations that work on the same partition and are pipelined together
- Stage – a grouping of tasks that can run in parallel on different partitions of the same dataset; a new stage begins wherever the dataset must be reshuffled
- Job – consists of all the stages that are required when an action is performed
- Query – made up of one, or more, jobs required for a Spark SQL statement to run
A job is represented as a Directed Acyclic Graph (DAG): all the activities needed to run the job, organized with their dependencies & relationships to indicate how they should run. The DAG can also be represented as a query plan, which can be obtained programmatically via the explain() method as well as via the Spark monitoring tools.
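For example, asking for the extended plan of the earlier lineitem query looks like this (same session and table assumed above):
tli = session.table("tpch.tiny.lineitem")
(tli.filter("orderkey <= 5")
    .sort("orderkey", "linenumber")
    .explain(True))    # True requests the parsed, analyzed, optimized & physical plans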
Cluster execution
A job utilizes multiple nodes at execution time and can run under multiple deployment options; a small configuration sketch follows the list below.
- A single driver process is the coordinator for a given job
- The driver schedules tasks to be run on executor processes
- More than one process may run on a single node
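As a hedged sketch, executor resources can be requested when the SparkSession is created; the values below are illustrative only, and whether they are honored depends on the cluster manager in use:
from pyspark.sql import SparkSession

session = (SparkSession.builder
    .appName("spark-overview-demo")              # hypothetical application name
    .config("spark.executor.instances", "4")     # how many executor processes to request
    .config("spark.executor.cores", "2")         # cores per executor
    .config("spark.executor.memory", "4g")       # memory per executor
    .getOrCreate())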
While outside the scope of this overview post, here are some additional thoughts on cluster execution.
- When a job starts it will gain access to resources to run its executors and generally will not release them until it is complete.
- A new stage can’t start until the stage(s) it depends on have completed.
- Spark allows for Adaptive Query Execution (AQE) which recreates the plan at the end of a stage thereby allowing it to make better decisions on how many partitions the shuffling effort should produce.
- Spark tries to be efficient, but it can (and often does) spill intermediary results to disk.
- The RDD and Dataframe APIs support explicit caching methods with multiple persistence options (see the sketch after this list).
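A small caching sketch, assuming the tli Dataframe from earlier:
from pyspark import StorageLevel

tli.persist(StorageLevel.MEMORY_ONLY)   # tli.cache() would use the default persistence level instead
tli.count()                             # the first action materializes the cache
tli.unpersist()                         # release the cached partitions when done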
More details
The final video provides more under-the-covers details.
There is a world of information available about Spark on the internet to continue your learning journey, but if you liked my post feel free to check out other related articles here on my blog as well as my legacy blog.
Here’s a much shorter overview if interested — https://www.youtube.com/watch?v=EZfvk1Yog3o