Professional – Page 8 – Lester Martin (l11n)

building a spark sql udf with scala (using multiple arguments)

a short & sweet code-focused tutorial on declaring a scala function as a spark sql udf that can be used via the api approach or in a formal sql statement

joining spark dataframes with identical column names (not just in the join condition)

a quick walkthru of spark sql dataframe code showing joining scenarios when both tables have columns with the same name; this includes when they are used in the join condition as well as when they are not

securing hive entities (ranger and atlas to the rescue)

video showing how to use ranger & atlas to create security policies on hive tables, columns and rows, as well as to implement data masking and tag-based restrictions

hive’s merge statement (it drops a lot of acid)

hive’s merge command provides another option for acid transactions beyond insert, update and delete; this post walks you through a simple example and looks at all the base, delta and delete_delta files that are created on the underlying filesystem to support this standard sql command

moving my tech blog (already missing confluence)

yes, it is time to move on and start blogging on a dedicated blogging platform — confluence… you have been very good to me (and yes, i still have all my personal blog posts there; haha)

hive delta file compaction (minor and major)

a quick walk-thru of how minor and major compactions occur for hive transactional tables, ensuring that all the delta files eventually roll into base ones

hive acid transactions with partitions (a behind the scenes perspective)

let’s take a deeper look at what happens under the hood of hive during “acid” activities such as insert, update and delete, including a look at the actual directories and orc files created

viewing the content of ORC files (using the Java ORC tool jar)

a quick tutorial about finding and using the orc java tool jar for peering into the contents of the otherwise not human-readable orc file format

topology supervision features of streaming frameworks (or lack thereof)

a smackdown of sorts pitting kafka streams, spark streaming, and storm against each other; not for the features they give developers, but for the features they offer the operations side of the devops formula

presenting at hadoop summit (archiving evolving databases in hive)

overview of, and links to related artifacts for, my presentation at hadoop summit about strategies to handle changing data in hive’s immutable architecture