
There is no shortage of big data frameworks out there, even when you focus only on open-source options. As a big data trainer, I am in the fortunate position of being able to investigate many different technologies. What I have noticed while working with the various options is that although the underlying architectures and the runtime and monitoring tools differ, the APIs themselves end up looking very similar. To me, that’s a good thing, as it means others are thinking the same way.
As I captured in functional programming and big data (what a team), these APIs align with functional programming’s insistence on immutable data, transformation-oriented functions, and lazy execution, which lets programmers focus on describing WHAT needs to get done, not on exactly HOW to do it.
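If you want to see that lazy, transformation-oriented style without any cluster at all, here is a tiny plain-Java sketch (just java.util.stream, offered purely as an analogy and not part of the framework comparison below): the pipeline is only a description of WHAT to do until a terminal operation forces it to run.

import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class LazyWordsDemo {
    public static void main(String[] args) {
        //transformations only describe WHAT should happen; nothing runs yet
        Stream<String> words = Stream.of("to be or not to be")
                .flatMap(line -> Stream.of(line.split(" ")));

        //the terminal operation triggers execution, much like an action,
        //sink, or execute() call does in the frameworks compared below
        List<String> result = words.collect(Collectors.toList());
        System.out.println(result);
    }
}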
So, let’s get to comparing the APIs, using the classic word count as our simple example. The comparison covers Flink, Kafka Streams, Spark (both the RDD and DataFrame APIs), Storm Trident, and Pig.
DISCLAIMER: ALL CODE IN BLOGS ALWAYS WORKS “PERFECTLY”, so feel free to let me know in the comments if I messed anything up in my hurry to publish this posting. 🙂
Step 1: Read the File
Flink
DataSet<String> lines = env.readTextFile( ... );
Kafka Streams
KStream<String, String> lines = builder.stream( ... );
Spark RDD
val lines = sc.textFile( ... )
Spark DF
val lines = spark.read.text( ... ).toDF("line")
Storm Trident
Stream lines = topology.newStream( ... );
Pig
lines = LOAD ' ... ' AS (line:chararray);
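For context, the handles used above come from each framework’s usual entry point. Here is a minimal sketch of how they are typically created for the Java-based snippets (the names env, builder, and topology match the snippets; the Spark examples assume the usual sc SparkContext and spark SparkSession from a Scala shell or application):

import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.storm.trident.TridentTopology;

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment(); //Flink DataSet API
StreamsBuilder builder = new StreamsBuilder();                             //Kafka Streams
TridentTopology topology = new TridentTopology();                          //Storm Trident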
Step 2: Split into Words
Flink
//bundled into next step
Kafka Streams
KStream<String, String> words = lines.flatMapValues(
    value -> Arrays.asList(value.split("\\W+")));
Spark RDD
val words = lines.flatMap(line => line.split(" "))
Spark DF
val words = lines.select(
    explode(split(lines("line"), " ")).alias("word")) //explode yields one row per word
Storm Trident
Stream words = lines.each(new Fields("sentence"),
new Split(), new Fields("word"));
Pig
words = FOREACH lines GENERATE
FLATTEN(TOKENIZE(line)) as word;
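One note on the Storm Trident snippet: Split is not built into Trident; it is a small user-defined function. A minimal sketch of what it typically looks like (modelled on the well-known Storm word count example; the class name matches the snippet):

import org.apache.storm.trident.operation.BaseFunction;
import org.apache.storm.trident.operation.TridentCollector;
import org.apache.storm.trident.tuple.TridentTuple;
import org.apache.storm.tuple.Values;

public class Split extends BaseFunction {
    @Override
    public void execute(TridentTuple tuple, TridentCollector collector) {
        //emit one tuple per word in the incoming sentence
        for (String word : tuple.getString(0).split(" ")) {
            if (!word.isEmpty()) {
                collector.emit(new Values(word));
            }
        }
    }
}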
Step 3: Prepare for Aggregation
Flink
DataSet<Tuple2<String, Integer>> KVPs =
    lines.flatMap(new Tokenizer()); //user-defined function, sketched after this step
Kafka Streams
//not needed - progress to next step
Spark RDD
val KVPs = words.map(word => (word, 1))
Spark DF
//not needed - progress to next step
Storm Trident
//not needed - progress to next step
Pig
KVPs = GROUP words BY word;
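As noted above, the Flink entry leans on a user-defined Tokenizer. A minimal sketch of such a class (modelled on Flink’s own WordCount example) that splits each line and emits a (word, 1) pair:

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

public class Tokenizer implements FlatMapFunction<String, Tuple2<String, Integer>> {
    @Override
    public void flatMap(String value, Collector<Tuple2<String, Integer>> out) {
        //split the line into tokens and emit a (word, 1) pair for each one
        for (String word : value.toLowerCase().split("\\W+")) {
            if (word.length() > 0) {
                out.collect(new Tuple2<>(word, 1));
            }
        }
    }
}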
Step 4: Group and Count
Flink
DataSet<Tuple2<String, Integer>> wordcounts =
    KVPs.groupBy(0).sum(1);
Kafka Streams
KTable<String, Long> wordcounts =
words.groupBy((key, value) -> value).count();
Spark RDD
val wordcounts = KVPs.reduceByKey(_ + _)
Spark DF
val wordcounts = words.groupBy("word").count()
Storm Trident
//actual counting happens in next step
GroupedStream wordcounts = words.groupBy(new Fields("word"));
Pig
wordcounts = FOREACH KVPs GENERATE group, COUNT(words);
Step 5: Output the Results
Flink
wordcounts.writeAsCsv( ... , "\n", " ");
//this call triggers execution of the lazily built plan
env.execute("WordCount");
Kafka Streams
wordcounts.toStream().to( ... ,
Produced.with(stringSerde, longSerde));
Spark RDD
wordcounts.collect()
Spark DF
wordcounts.show()
Storm Trident
wordcounts.persistentAggregate(new MemoryMapState.Factory(),
    new Count(), new Fields("count"));
Pig
DUMP wordcounts;
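A quick note on the Kafka Streams sink above: stringSerde and longSerde are just the built-in serializers/deserializers, roughly:

import org.apache.kafka.common.serialization.Serde;
import org.apache.kafka.common.serialization.Serdes;

Serde<String> stringSerde = Serdes.String();
Serde<Long> longSerde = Serdes.Long();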
Download the PDF for another view of this API comparison, which shows all of the code together for each framework.