joining spark dataframes with identical column names (not just in the join condition)

a quick walkthrough of spark sql dataframe code showing join scenarios where both tables have columns with the same name, both when those columns are used in the join condition and when they are not

hive’s merge statement (it drops a lot of acid)

hive’s merge command provides another option for acid transactions beyond insert, update and delete; this post walks you through a simple example and inspects the underlying filesystem to see all the base, delta and delete_delta files that are created to support this standard sql command

hive acid transactions with partitions (a behind the scenes perspective)

let’s take a deeper look at what happens under the hood of hive during “acid” operations such as insert, update and delete, including a look at the actual directories and orc files created

presenting at hadoop summit (archiving evolving databases in hive)

overview of, and links to related artifacts for, my presentation at hadoop summit about strategies to handle changing data in hive’s immutable architecture

how do i load a fixed-width formatted file into hive? (with a little help from pig)

presents a couple of options for converting a fixed-width formatted file into a delimited one to prepare it to be exposed as a hive table
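
to make the fixed-width idea concrete, here is a minimal sketch in plain python rather than pig (so it is not the approach from the post itself). the column widths, the sample record and the function name `fixed_width_to_delimited` are all made up for illustration; a real file would need its own layout.

```python
def fixed_width_to_delimited(line, widths, delimiter="|"):
    """Slice one fixed-width record into fields and join them with a delimiter.

    `widths` is a hypothetical column layout: the character count of each field,
    in order. Padding spaces around each field are stripped.
    """
    fields = []
    start = 0
    for width in widths:
        fields.append(line[start:start + width].strip())
        start += width
    return delimiter.join(fields)


# Hypothetical layout: 4-char id, 10-char name (space padded), 3-char code.
widths = [4, 10, 3]
sample = "1001" + "alice".ljust(10) + "042"

print(fixed_width_to_delimited(sample, widths))
```

the delimited output could then be placed under a hive table declared with a matching field delimiter; the pipe is just one choice and should be a character that never appears in the data.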