i’ve got a fever and the only prescription is more pystarburst examples; this third installment is all about window functions via the dataframe api and, like before, i present the sql first for comparison
Tag Archives: spark
pystarburst analytics examples (querying aviation data part deux)
i had so much fun publishing my first pystarburst post and running it in starburst galaxy that i wanted to share some more examples – i ported my aviation dataset analytical queries to python and the dataframe api
pystarburst (the dataframe api)
the dataframe api is finally available for trino and starburst galaxy thanks to the pystarburst libraries — take a peek at some example usages in this quick validation run
hive, trino & spark features (their journeys to sql, performance & durability)
each big data sql engine is created to fill a gap that the existing ones neglected, but sooner or later they all start to look alike in their feature lists and observable behaviors
updated streaming supervision features scorecard (added flink)
added apache flink to the comparison grid of kafka streams, spark streaming, and storm focused on the features they offer the operations side of the devops formula — it measures up well
big data apis look a lot alike (code comparison with flink, kafka, spark, trident and pig)
exploring the similarity of the APIs from flink, kafka streams, spark (RDDs & DFs), storm’s trident and yes, even good old pig by implementing the canonical word count solution with each framework
functional programming and big data (what a pair)
a high-level overview of how functional programming with immutable datasets is a great partner with big data processing frameworks — code examples with spark rdds using scala
building a spark sql udf with scala (using multiple arguments)
a short & sweet code-focused tutorial declaring a scala function as a spark sql udf that can be leveraged via the api approach or in a formal sql statement
joining spark dataframes with identical column names (not just in the join condition)
a quick walkthru of spark sql dataframe code showing joining scenarios when both tables have columns with the same name; this includes when they are used in the join condition as well as when they are not
topology supervision features of streaming frameworks (or lack thereof)
a smackdown of sorts pitting kafka streams, spark streaming, and storm against each other, not for the features they give developers but for the features they offer the operations side of the devops formula