Big Data APIs Look a Lot Alike (Code Comparison with Flink, Kafka, Spark, Trident, and Pig)

There is no shortage of big data frameworks out there, even when you focus only on open-source options. As a big data trainer, I find myself in the fortunate position of being able to investigate many different technologies. What I have noticed while working with the various options is that although the underlying architectures and the runtime/monitoring tools differ, the APIs themselves end up looking very similar. To me, that’s a good thing, as it means others are thinking the same way.

As I captured in functional programming and big data (what a team), these APIs align with functional programming’s insistence on immutable data, transformation-oriented functions, and lazy execution, which lets programmers focus on describing WHAT needs to get done rather than exactly HOW to do it.
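To make that concrete before diving into the comparison, here is a tiny Spark sketch (the file path is just a placeholder) showing those ideas in action: every transformation hands back a new, immutable dataset, and nothing actually executes until an action asks for a result.

// a minimal Spark sketch; "input.txt" is a placeholder path
val lines = sc.textFile("input.txt")           // an immutable RDD; nothing is read yet
val words = lines.flatMap(_.split("\\W+"))     // a new RDD describing WHAT to do
val nonEmpty = words.filter(_.nonEmpty)        // still just a recipe; no execution so far
val total = nonEmpty.count()                   // the action finally triggers the work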

So, let’s get to comparing the APIs, using the classic word count as our simple example. We will use these frameworks in our code comparison:

Flink
Kafka Streams
Spark RDD
Spark DF
Storm Trident
Pig

DISCLAIMER: ALL CODE IN BLOGS ALWAYS WORKS “PERFECTLY”, so feel free to let me know in the comments if I messed anything up in my hurry to publish this posting. 🙂

Step 1: Read the File

Flink
DataSet<String> lines = env.readTextFile( ... );
Kafka Streams
KStream<String, String> lines = builder.stream( ... );
Spark RDD
val lines = sc.textFile( ... )
Spark DF
val lines = spark.read.text( ... ).toDF("line")
Storm Trident
Stream lines = topology.newStream( ... );
Pig
lines = LOAD ' ... ' AS (line:chararray);
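Each of those one-liners assumes its framework’s entry point (env, builder, sc, spark, topology) already exists. As a quick illustration for the two Spark flavors, the entry points used above are typically created along these lines (the application name is a placeholder):

// a minimal sketch of the Spark entry points assumed by the snippets above
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("WordCount")        // placeholder application name
  .getOrCreate()
val sc = spark.sparkContext    // the SparkContext behind the "sc" in the RDD snippets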

Step 2: Split into Words

Flink
//bundled into next step
Kafka Streams
KStream<String, String> words = lines.flatMapValues( 
  value -> Arrays.asList(value.split("\\W+")));
Spark RDD
val words = lines.flatMap(line => line.split(" "))
Spark DF
val words = lines.select(
  explode(split(lines("line"), " ")).alias("word"))
Storm Trident
Stream words = lines.each(new Fields("sentence"), 
  new Split(), new Fields("word"));
Pig
words = FOREACH lines GENERATE 
  FLATTEN(TOKENIZE(line)) AS word;

Step 3: Prepare for Aggregation

Flink
DataSet<Tuple2<String, Integer>> KVPs = 
  lines.flatMap(new Tokenizer()); //udf (sketched after this step)
Kafka Streams
//not needed - progress to next step
Spark RDD
val KVPs = words.map(word => (word, 1))
Spark DF
//not needed - progress to next step
Storm Trident
//not needed - progress to next step
Pig
KVPs = GROUP words BY word;
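The Flink snippet above delegates its work to a Tokenizer UDF, which essentially splits each line and emits (word, 1) pairs. Just to show the shape of that UDF, here is a sketch of the same logic written with Flink’s Scala DataSet API (not the Java class itself):

// roughly what the Tokenizer UDF does, expressed with Flink's Scala DataSet API
import org.apache.flink.api.scala._

val KVPs: DataSet[(String, Int)] = lines
  .flatMap(_.toLowerCase.split("\\W+"))   // split each line into words
  .filter(_.nonEmpty)                     // drop empty tokens
  .map(word => (word, 1))                 // pair each word with a count of 1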

Step 4: Group and Count

Flink
DataSet<Tuple2<String, Integer>> wordcounts =
  KVPs.groupBy(0).sum(1);
Kafka Streams
KTable<String, Long> wordcounts = 
  words.groupBy((key, value) -> value).count();
Spark RDD
val wordcounts = KVPs.reduceByKey(_ + _)
Spark DF
val wordcounts = words.groupBy("word").count()
Storm Trident
//actual counting happens in next step
GroupedStream wordcounts = words.groupBy(new Fields("word"));
Pig
wordcounts = FOREACH KVPs GENERATE group, COUNT(words);

Step 5: Output the Results

Flink
wordcounts.writeAsCsv( ... , "\n", " ");
//the next line triggers the lazy execution to begin
env.execute("WordCount");
Kafka Streams
wordcounts.toStream().to( ... , 
  Produced.with(stringSerde, longSerde));
Spark RDD
wordcounts.collect()
Spark DF
wordcounts.show()
Storm Trident
wordcounts.persistentAggregate(new MemoryMapState.Factory(), 
  new Count(), new Fields("count"));
Pig
DUMP wordcounts;

Download PDF for another view of this API comparison, which features all of the code together for each framework.
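If you would rather stay on this page, here is the Spark RDD variant from Steps 1 through 5 stitched into a single runnable sketch (the input path is a placeholder, and the results are simply printed to the console):

// the Spark RDD snippets from Steps 1-5 stitched together; "input.txt" is a placeholder
import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("WordCount").getOrCreate()
    val sc = spark.sparkContext

    val lines = sc.textFile("input.txt")               // Step 1: read the file
    val words = lines.flatMap(line => line.split(" ")) // Step 2: split into words
    val KVPs = words.map(word => (word, 1))            // Step 3: prepare for aggregation
    val wordcounts = KVPs.reduceByKey(_ + _)           // Step 4: group and count
    wordcounts.collect().foreach(println)              // Step 5: output the results

    spark.stop()
  }
}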

