
Whenever I look at my blog site stats, the joining spark dataframes with identical column names (not just in the join condition) post still wins out on number of views. NGL, it's nice to know that something posted back in 2020 is still helping folks out.
With all that traffic, I thought it was warranted to publish a short follow-up with an easier way to solve that same problem.
Note: All the code presented here can be found in this Jupyter notebook.
The problem
As the code below shows, joining dataframes that contain duplicate column names causes problems. I'm using PyStarburst for this example; with PySpark you'd see a slightly different problem.
orders.show()
custs.show()

# ambiguous: both dataframes have cust_id and notes columns
orders.join(custs, orders.cust_id == custs.cust_id).show()

As you can see, PyStarburst creates some “cute” names to avoid ambiguity, but those names are random, which means they won’t be useful in your code if you need to modify or drop those columns later.
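If you want to see the same collision problem without standing up a cluster, pandas makes a handy stand-in: its merge hits the exact same issue and falls back to default suffixes. The sample data below is hypothetical, just shaped like the orders/custs dataframes in this post.

```python
import pandas as pd

# Hypothetical sample data mirroring the orders/custs dataframes.
orders = pd.DataFrame({
    "order_id": [1, 2],
    "cust_id": [10, 20],
    "notes": ["rush", "gift wrap"],
})
custs = pd.DataFrame({
    "cust_id": [10, 20],
    "name": ["Ann", "Bob"],
    "notes": ["VIP", "new customer"],
})

# With no guidance, pandas renames the colliding "notes" column with its
# default _x/_y suffixes -- deterministic, but not very descriptive.
joined = orders.merge(custs, on="cust_id")
print(list(joined.columns))
```

Different engines handle the collision differently (random names, default suffixes, or a hard error), but the underlying problem is the same.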
Prior recommendation
My original joining spark dataframes with identical column names (not just in the join condition) post suggested manually renaming the columns that were going to be a problem. As you can see below, that’s still a decent way to go.
ordersMod = orders.with_column_renamed(
    "notes", "order_notes")
custsMod = custs.with_column_renamed(
    "cust_id", "id")

ordersMod.join(custsMod,
    ordersMod.cust_id == custsMod.id).show()
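The same rename-first pattern works anywhere, not just in PyStarburst. Here's a quick pandas sketch of it, using hypothetical sample data:

```python
import pandas as pd

# Hypothetical sample data with colliding column names.
orders = pd.DataFrame({"cust_id": [10], "notes": ["rush"]})
custs = pd.DataFrame({"cust_id": [10], "notes": ["VIP"]})

# Rename the colliding columns up front (the with_column_renamed idea),
# then join on the now-distinct key names.
orders_mod = orders.rename(columns={"notes": "order_notes"})
custs_mod = custs.rename(columns={"cust_id": "id"})

joined = orders_mod.merge(custs_mod, left_on="cust_id", right_on="id")
print(list(joined.columns))
```

It works, but you have to know every colliding column ahead of time, which is exactly the chore the next section gets rid of.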

Easier approach
As the title of this post suggests, there is an even easier way. The join method accepts optional lsuffix and/or rsuffix arguments. These handle the renames on collisions for you auto-magically AND still give you predictable column names.
orders.join(custs,
    orders.cust_id == custs.cust_id,
    rsuffix='_c').show()
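For a cluster-free way to play with the suffix idea, pandas has an analogous knob: the suffixes argument to merge. The sketch below uses hypothetical sample data; an empty left suffix keeps the left-hand names untouched, like passing only rsuffix.

```python
import pandas as pd

# Hypothetical sample data with colliding column names.
orders = pd.DataFrame({"cust_id": [10], "notes": ["rush"]})
custs = pd.DataFrame({"cust_id": [10], "notes": ["VIP"]})

# suffixes plays the role of lsuffix/rsuffix: only colliding columns get
# renamed, and the resulting names are predictable.
joined = orders.merge(custs, on="cust_id", suffixes=("", "_c"))
print(list(joined.columns))

# Because the name is predictable, dropping the duplicate is trivial.
joined = joined.drop(columns=["notes_c"])
```

That predictability is the whole point: unlike the random auto-generated names, a suffixed column can be referenced, modified, or dropped by name in later code.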

Reminder: All the code presented here can be found in this Jupyter notebook.