joining spark dataframes with identical column names (an easier way)

It seems whenever I look at my blog site stats, the joining spark dataframes with identical column names (not just in the join condition) post still wins out on number of views. NGL, it is nice to know that something posted back in 2020 is still helping folks out.

With all that traffic, I thought it was warranted to publish a short follow-up with an easier way to solve that same problem.

Note: All the code presented here can be found in this Jupyter notebook.

The problem

As the code shows below, joining dataframes that contain duplicate column names causes problems. I’m using PyStarburst for this example, but you could use PySpark and see a slightly different problem.

orders.show()

custs.show()
orders.join(custs, orders.cust_id == custs.cust_id).show()

As you can see PyStarburst creates some “cute” names to avoid ambiguity, but those names are random which means they won’t be useful in your code if you need to modify or drop the columns in any way.

Prior recommendation

My original joining spark dataframes with identical column names (not just in the join condition) post suggested manually renaming the columns that were going to be a problem. As you can see below, that’s still a decent way to go.

ordersMod = orders.with_column_renamed(
    "notes", "order_notes")

custsMod = custs.with_column_renamed(
    "cust_id", "id")

ordersMod.join(custsMod, 
               ordersMod.cust_id == custsMod.id).show()

Easier approach

As the title of this post suggests, there is an even easier way. The join command allow for optional lsuffix and/or rsuffix arguments to be passed. These will tackle the renames on collisions for you auto-magically AND still give you predictable column names.

orders.join(custs, 
            orders.cust_id == custs.cust_id, 
            rsuffix = '_c').show()

Reminder: All the code presented here can be found in this Jupyter notebook.

Published by lestermartin

Developer advocate, trainer, blogger, and data engineer focused on data lake & streaming frameworks including Trino, Hive, Spark, Flink, Kafka and NiFi.

One thought on “joining spark dataframes with identical column names (an easier way)

Leave a Reply

Discover more from Lester Martin (l11n)

Subscribe now to keep reading and get access to the full archive.

Continue reading