batch as a “special case” of flink streaming (yes, now we’re mv’ing streaming back to batch)

If you remember from mv’ing batch flink to streaming (easy breezy), we discussed how Flink has the DataSet API for batch and the DataStream API for streaming. The Flink team had been talking for some time about treating batch as a “special case” of streaming, and in version 1.12 they finally “soft-deprecated” the DataSet API.

Now that we’re here, we have a BATCH execution mode! What do we have to do? Basically the same work we did moving off the DataSet API to the DataStream API. For our (twisted) example, we can convert our streaming app back to batch by changing one line of code. Heck, only half a line of code.

Let’s grab the line starting with…

DataStream lines = …

Swap out…

env.socketTextStream("localhost", 9999);

With…

env.readTextFile(params.get("input"));

ALMOST DONE! We need to configure the job to run in batch mode. We could add the following line just after our instantiation of the StreamExecutionEnvironment.

env.setRuntimeMode(RuntimeExecutionMode.BATCH);
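Pieced together, the relevant setup might look something like the sketch below (this is my reconstruction, not the exact source of the post’s WordCountBatchWithDataStream class; the ParameterTool usage and the job name are assumptions, and the word-count transformations are elided):

```java
import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class WordCountBatchWithDataStream {
    public static void main(String[] args) throws Exception {
        // --input <path> comes in from the command line
        ParameterTool params = ParameterTool.fromArgs(args);

        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // run this pipeline as a bounded (batch) job; the CLI switch
        // -Dexecution.runtime-mode=BATCH would override this anyway
        env.setRuntimeMode(RuntimeExecutionMode.BATCH);

        // bounded file source replaces the unbounded socket source
        DataStream<String> lines = env.readTextFile(params.get("input"));

        // ... same word-count transformations and sink as the streaming version ...

        env.execute("WordCountBatchWithDataStream");
    }
}
```

Note that hard-coding the runtime mode pins the job to batch, which is part of why the documentation prefers the command-line switch shown next.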

That said, the documentation recommends using this command-line switch instead when submitting your job.

-Dexecution.runtime-mode=BATCH

So let’s run our job like we did before in hello world with flink (from scratch).

MBP15:bin lmartin$ pwd
/Users/lmartin/blog/flink-1.12.2/bin
MBP15:bin lmartin$ ./flink run --class wordcount.WordCountBatchWithDataStream \
>  -Dexecution.runtime-mode=BATCH \
>  ~/blog/flink-exploration/target/flink-exploration-0.0.1-SNAPSHOT.jar \
>  --input file:///Users/lmartin/blog/flink-exploration/src/test/resources/BitOfGreenEggsAndHam.txt 
Job has been submitted with JobID df8565fff40cc00356d376920249ca6c
Program execution finished
Job with JobID df8565fff40cc00356d376920249ca6c has finished.
Job Runtime: 858 ms

MBP15:bin lmartin$ 

Since I left the sink set up just like the streaming solution’s, the results show up in the “Stdout” tab again.

In summary, Flink has melded the DataSet API into the DataStream API, which should mean less confusion for most developers going forward. If you need it, here’s the source code.

Published by lestermartin

Developer advocate, trainer, blogger, and data engineer focused on data lake & streaming frameworks including Trino, Hive, Spark, Flink, Kafka and NiFi.
