Spark SQL dataframe: drop all columns from alias table after join - apache-spark

I am trying to join two dataframes with the same column names and compute some new values. After that I need to drop all columns of the second table. The number of columns is huge. How can I do this more easily? I tried .drop("table2.*"), but it doesn't work.

You can use select with aliases:
df1.alias("df1")
.join(df2.alias("df2"), Seq("someJoinColumn"))
.select($"df1.*", $"someComputedColumn", ...)
or reference the columns through the parent DataFrame:
df1.join(df2, Seq("someJoinColumn")).select(df1("*"), $"someComputedColumn", ...)

Instead of dropping, you can select all the necessary columns that you want to hold on to for further operations, something like below:
val newDataFrame = joinedDataFrame.select($"col1", $"col4", $"col6")
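For the original question (dropping everything that came from df2), a minimal sketch of the same alias idea, assuming the clashing columns can all be resolved through the parent DataFrame, is to build the projection from df1's column list so df2's many columns never have to be typed out or dropped one by one; "someComputedColumn" stands in for whatever new value is being computed:
import spark.implicits._   // for the $"..." syntax

val joined = df1.join(df2, Seq("someJoinColumn"))
// keep only df1's columns plus the newly computed value
val result = joined.select((df1.columns.map(df1(_)) :+ $"someComputedColumn"): _*)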

Related

How to avoid key column name duplication in join?

I'm trying to join two tables in Spark SQL. Each table has 50+ columns. Both have a column id as the key.
spark.sql("select * from tbl1 join tbl2 on tbl1.id = tbl2.id")
The joined table has duplicated id column.
We can of course specify which id column to keep like below:
spark.sql("select tbl1.id, .....from tbl1 join tbl2 on tbl1.id = tbl2.id")
But since we have so many columns in both tables, I do not want to type all the other column names in the query above. (Other than the id column, there are no other duplicated column names.)
What should I do? Thanks.
If id is the only column name in common, you can take advantage of the USING clause:
spark.sql("select * from tbl1 join tbl2 using (id) ")
The USING clause matches columns that have the same name in both tables. With select *, the matched column appears only once in the output.
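A quick way to see the difference, using the tables from the question, is to compare the two schemas:
val withOn    = spark.sql("select * from tbl1 join tbl2 on tbl1.id = tbl2.id")
val withUsing = spark.sql("select * from tbl1 join tbl2 using (id)")
withOn.printSchema()    // id appears twice, once per table
withUsing.printSchema() // id appears exactly once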
Assuming you want to preserve the "duplicates", you can use the internal row-id or an equivalent to help. This helped me in the past when I had to delete exactly one of two identical rows.
select *, ctid from table;
In PostgreSQL this also outputs the internal row identifier, so rows that were exactly identical before become distinguishable. I don't know about spark.sql, but I assume you can access a similar attribute there.
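Spark has no ctid, but as a rough equivalent of the row-id idea above, a sketch using monotonically_increasing_id() (which assigns each row a unique, non-contiguous id) would be:
import org.apache.spark.sql.functions.monotonically_increasing_id

// rows that were exact duplicates now differ in row_id and can be told apart
val tbl1WithRowId = spark.sql("select * from tbl1")
  .withColumn("row_id", monotonically_increasing_id())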
val joined = spark
  .sql("select * from tbl1")
  .join(
    spark.sql("select * from tbl2"),
    Seq("id"),
    "inner" // optional, inner is the default
  )
joined should have only one id column. Tested with Spark 2.4.8

Spark Union vs adding columns using lit in Spark

This is a Spark-related question. I have to add static data to various types of records, each type of record being processed as a different dataframe (say df1, df2, ..., df6).
The static data that I intend to add has to be repeated across all 6 dataframes.
What would be a more performant way:
For each of the 6 dataframes, use:
.witColumn("testA", lit("somethingA"))
.witColumn("testB", lit("somethingB"))
.witColumn("testC", lit("somethingC"))
or
Create a new DF, say staticDF, which has all the columns that I intend to append to each of the 6 dataframes, and use a union?
or
Any other option that I have not considered?
The first way is correct. The second way wouldn't work because union adds rows to a dataframe, not columns.
Another way is to use select to select all new columns at the same time:
from pyspark.sql.functions import lit

df2 = df.select(
    '*',
    lit('somethingA').alias('testA'),
    lit('somethingB').alias('testB'),
    lit('somethingC').alias('testC')
)
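The same single-select idea in Scala, wrapped in a small helper so the static columns are declared once and applied to each of the six dataframes (addStatic is a hypothetical name):
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit}

def addStatic(df: DataFrame): DataFrame =
  df.select(
    col("*"),
    lit("somethingA").as("testA"),
    lit("somethingB").as("testB"),
    lit("somethingC").as("testC")
  )

val withStatic = Seq(df1, df2, df3, df4, df5, df6).map(addStatic)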

Merge/combine 2 columns in Spark dataframe

I have 2 different dataframes and I was able to join them together based on g_id. Just like below:
df1 = dfx.join(df_gi, regexp_extract(trim(dfx.LOCATION), ".*/GDocs/([0-9]{1,5})/.*", 1) == df_gi.g_id, "inner")\
    .select(dfx["*"], df_gi["G_Number2"])
Now, the dfx dataframe has a column called G_Number1 and the df_gi dataframe has a similar column called G_Number2. Both of these columns combined fill in the missing pieces, meaning one column has some of the information and the other has some. Combining both together is the output needed.
How can I achieve this in PySpark? I tried the concat function, but I was way off.
Thank you in advance.
You can use coalesce:
import pyspark.sql.functions as f
df.withColumn('Output', f.coalesce('G_Number2', 'G_Number1'))
Notice this will prioritize the G_Number2 column when both are not null; if you need the other way around, just switch the order of the two columns.

Spark Dataset: how to change alias of the columns after a flatmap?

I have two spark datasets that I'm trying to join. The join keys are nested in dataset A, so I must flatmap it out first before joining with dataset B. The problem is that as soon as I flatmap that field, the column name becomes the default "_1", "_2", etc. Is it possible to change the alias somehow?
A.flatMap(a => a.keys).join(B).where(...)
After applying a transformation like flatMap you lose the column names, which is logical: after a transformation like flatMap or map there is no guarantee that the number of columns or the datatype inside each column remains the same. That's why the column names are lost there.
What you can do is fetch all the previous columns and then apply them to the dataset like this:
val columns = A.columns
A.flatMap(a => a.keys).toDF(columns: _*).join(B).where(...)
This will only work if the number of columns is the same after applying flatMap.
Hope this clears your issue
Thanks
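If the flatMapped keys have a fixed, known shape, the aliases can also be set explicitly instead of copying A's column names; the names below are just placeholders for whatever the key fields actually are:
val keys = A.flatMap(a => a.keys).toDF("keyPart1", "keyPart2")
// keys can now be joined to B by name, e.g. keys.join(B, Seq("keyPart1", "keyPart2"))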

Spark Deduplicate column in dataframe based on column in other dataframe

I am trying to deduplicate values in a Spark dataframe column based on values in another dataframe column. It seems that withColumn() only works within a single dataframe, and subqueries won't be fully available until version 2. I suppose I could try to join the tables but that seems a bit messy. Here is the general idea:
df.take(1)
[Row(TIMESTAMP='20160531 23:03:33', CLIENT ID=233347, ROI NAME='my_roi', ROI VALUE=1, UNIQUE_ID='173888')]
df_re.take(1)
[Row(UNIQUE_ID='6866144:ST64PSIMT5MB:1')]
Basically just want to take the values from df and remove any that are found in df_re and then return the whole dataframe with the rows containing those duplicates removed. I'm sure I could iterate each one, but I am wondering if there is a better way.
Any ideas?
The way to do this is a left_outer join, and then filter for rows where the right-hand side of the join is null. Something like:
val df1 = Seq((1,2),(2,123),(3,101)).toDF("uniq_id", "payload")
val df2 = Seq((2,432)).toDF("uniq_id", "other_data")
df1.as("df1").join(
df2.as("df2"),
col("df1.uniq_id") === col("df2.uniq_id"),
"left_outer"
).filter($"df2.uniq_id".isNull)
