an overview of apache spark presented from 20,000 feet, on the surface, and below the waterline
Tag Archives: streaming
iceberg snapshots affect storage footprint (not performance)
it is easy to understand why most folks initially imagine that iceberg’s ability to maintain a long history of snapshots will cause performance problems, but that is not the case — the real gotcha is that keeping many versions can quickly consume 2-10+ times the amount of data lake storage space
becoming a data engineer (yet another top 10 list)
after a recent class i was asked what skills someone needs to become a data engineer – there are plenty of these lists all over the internet, yet here i go assuming i know enough to jot down yet another; at least i put mine all in a single picture 😉
updated streaming supervision features scorecard (added flink)
added apache flink to the comparison grid of kafka streams, spark streaming, and storm, which focuses on the features they offer the operations side of the devops formula – flink measures up well
batch as a “special case” of flink streaming (yes, now we’re mv’ing streaming back to batch)
the third part of a loosely coupled trilogy on flink batch and streaming that takes us full-circle with the collapse of the DataSet API into the DataStream API – i’m not sure Run-D.M.C. could make this less tricky
mv’ing batch flink to streaming (easy breezy)
building on a prior post, this tutorial ports a simple flink batch program to become a streaming solution – put lakeside on the turntable and let’s finish up the fantastic voyage
hello world with flink (from scratch)
come along and ride on a fantastic voyage where we will set up an apache flink environment, code up a very simple job, and execute it & verify our results – we’ll just slide, glide, slippity-slide
big data apis look a lot alike (code comparison with flink, kafka, spark, trident and pig)
exploring the similarity of the APIs from flink, kafka streams, spark (RDDs & DFs), storm’s trident and yes, even good old pig by implementing the canonical word count solution with each framework
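for anyone who has not met the canonical word count before, here is a minimal sketch in plain python (no framework at all) of the split, then group, then count shape that each of those apis expresses in its own way. the function name `word_count`, the word-matching regex, and the sample lines are all assumptions made for illustration; none of this is code from the post itself.

```python
from collections import Counter
import re


def word_count(lines):
    """count how often each word appears across an iterable of text lines."""
    counts = Counter()
    for line in lines:
        # the "flat map" step: lowercase the line and split it into words
        words = re.findall(r"[a-z0-9']+", line.lower())
        # the "group by word and sum" step, handled here by Counter
        counts.update(words)
    return counts


if __name__ == "__main__":
    # hypothetical sample input standing in for a text file
    sample = ["to be or not to be", "that is the question"]
    for word, n in sorted(word_count(sample).items()):
        print(word, n)
```

the frameworks differ mostly in how the input arrives (a bounded file versus an unbounded stream) and in where the running counts are kept, rather than in the shape of the logic.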
topology supervision features of streaming frameworks (or lack thereof)
a smackdown of sort pitting kafka streams, spark streaming, and storm against each other — not for the features they give developers, but for the features they offer the operations side of the devops formula