
There is no shortage of big data frameworks out there, even when you focus only on open-source options. As a big data trainer, I am in the fortunate position of being able to investigate many different technologies. What I have noticed while working with the various options is that although the underlying architectures and the runtime and monitoring tools differ, the APIs themselves end up looking very similar. To me, that’s a good thing, as it means others are thinking the same way.
As I captured in functional programming and big data (what a team), these APIs align with functional programming’s insistence on immutable data, transformation-oriented functions, and lazy execution, which lets programmers focus on describing WHAT needs to get done, not on exactly HOW to do it.
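If you want to see that lazy, transformation-oriented style without any cluster at all, here is a tiny plain-Java sketch (just java.util.stream, offered purely as an analogy and not part of the framework comparison below): the pipeline is only a description of WHAT to do until a terminal operation forces it to run.

import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class LazyWordsDemo {
    public static void main(String[] args) {
        //transformations only describe WHAT should happen; nothing runs yet
        Stream<String> words = Stream.of("to be or not to be")
                .flatMap(line -> Stream.of(line.split(" ")));

        //the terminal operation triggers execution, much like an action,
        //sink, or execute() call does in the frameworks compared below
        List<String> result = words.collect(Collectors.toList());
        System.out.println(result);
    }
}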
So, let’s get to comparing the APIs, using the classic word count as our simple example. The comparison covers Flink, Kafka Streams, Spark (both the RDD and DataFrame APIs), Storm Trident, and Pig.
DISCLAIMER: ALL CODE IN BLOGS ALWAYS WORKS “PERFECTLY”, so feel free to let me know in the comments if I messed anything up in my hurry to publish this posting. 🙂
Step 1: Read the File
Flink
DataSet<String> lines = env.readTextFile( ... );
Kafka Streams
KStream<String, String> lines = builder.stream( ... );
Spark RDD
val lines = sc.textFile( ... )
Spark DF
val lines = spark.read.text( ... ).toDF("line")
Storm Trident
Stream lines = topology.newStream( ... );
Pig
lines = LOAD ' ... ' AS (line:chararray);
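For context, the handles used above come from each framework’s usual entry point. Here is a minimal sketch of how they are typically created for the Java-based snippets (the names env, builder, and topology match the snippets; the Spark examples assume the usual sc SparkContext and spark SparkSession from a Scala shell or application):

import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.storm.trident.TridentTopology;

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment(); //Flink DataSet API
StreamsBuilder builder = new StreamsBuilder();                             //Kafka Streams
TridentTopology topology = new TridentTopology();                          //Storm Trident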
Step 2: Split into Words
Flink
//bundled into next step
Kafka Streams
KStream<String, String> words = lines.flatMapValues(
    value -> Arrays.asList(value.split("\\W+")));
Spark RDD
val words = lines.flatMap(line => line.split(" "))
Spark DF
val words = lines.select(
    explode(split(lines("line"), " ")).alias("word")) //explode yields one row per word
Storm Trident
Stream words = lines.each(new Fields("sentence"),
new Split(), new Fields("word"));
Pig
words = FOREACH lines GENERATE
FLATTEN(TOKENIZE(line)) as word;
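One note on the Storm Trident snippet: Split is not built into Trident; it is a small user-defined function. A minimal sketch of what it typically looks like (modelled on the well-known Storm word count example; the class name matches the snippet):

import org.apache.storm.trident.operation.BaseFunction;
import org.apache.storm.trident.operation.TridentCollector;
import org.apache.storm.trident.tuple.TridentTuple;
import org.apache.storm.tuple.Values;

public class Split extends BaseFunction {
    @Override
    public void execute(TridentTuple tuple, TridentCollector collector) {
        //emit one tuple per word in the incoming sentence
        for (String word : tuple.getString(0).split(" ")) {
            if (!word.isEmpty()) {
                collector.emit(new Values(word));
            }
        }
    }
}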
Step 3: Prepare for Aggregation
Flink
DataSet<Tuple2<String, Integer>> KVPs =
    lines.flatMap(new Tokenizer()); //user-defined function, sketched after this step
Kafka Streams
//not needed - progress to next step
Spark RDD
val KVPs = words.map(word => (word, 1))
Spark DF
//not needed - progress to next step
Storm Trident
//not needed - progress to next step
Pig
KVPs = GROUP words BY word;
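As noted above, the Flink entry leans on a user-defined Tokenizer. A minimal sketch of such a class (modelled on Flink’s own WordCount example) that splits each line and emits a (word, 1) pair:

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

public class Tokenizer implements FlatMapFunction<String, Tuple2<String, Integer>> {
    @Override
    public void flatMap(String value, Collector<Tuple2<String, Integer>> out) {
        //split the line into tokens and emit a (word, 1) pair for each one
        for (String word : value.toLowerCase().split("\\W+")) {
            if (word.length() > 0) {
                out.collect(new Tuple2<>(word, 1));
            }
        }
    }
}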
Step 4: Group and Count
Flink
DataSet<Tuple2<String, Integer>> wordcounts =
    KVPs.groupBy(0).sum(1);
Kafka Streams
KTable<String, Long> wordcounts =
words.groupBy((key, value) -> value).count();
Spark RDD
val wordcounts = KVPs.reduceByKey(_ + _)
Spark DF
val wordcounts = words.groupBy("word").count()
Storm Trident
//actual counting happens in next step
GroupedStream wordcounts = words.groupBy(new Fields("word"));
Pig
wordcounts = FOREACH KVPs GENERATE group, COUNT(words);
Step 5: Output the Results
Flink
wordcounts.writeAsCsv( ... , "\n", " ");
//this call triggers execution of the lazily built plan
env.execute("WordCount");
Kafka Streams
wordcounts.toStream().to( ... ,
Produced.with(stringSerde, longSerde));
Spark RDD
wordcounts.collect()
Spark DF
wordcounts.show()
Storm Trident
wordcounts.persistentAggregate(new MemoryMapState.Factory(),
    new Count(), new Fields("count"));
Pig
DUMP wordcounts;
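A quick note on the Kafka Streams sink above: stringSerde and longSerde are just the built-in serializers/deserializers, roughly:

import org.apache.kafka.common.serialization.Serde;
import org.apache.kafka.common.serialization.Serdes;

Serde<String> stringSerde = Serdes.String();
Serde<Long> longSerde = Serdes.Long();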
Download the PDF for another view of this API comparison, which shows all of the code together for each framework.