Dropping all the columns from one dataframe while joining 2 dataframes in Spark - apache-spark

I'm joining two dataframes and adding some columns using the withColumn method. In my final dataframe I want all the columns from the first dataframe plus the new columns I added with withColumn, and I want to drop all the columns from the second dataframe. Is there any method to drop all the columns from the 2nd dataframe? Currently I'm using a separate drop call for every column.
val df3 = df1.join(df2, df1("id") === df2("id"))
.drop(df2("name"))
.drop(df2("lastname"))
Is there any way to drop all the columns with a single method call instead of dropping them separately?

It can be done as below; please find the inline comments for the code explanation.
val df2ColumnList = df2.columns // Get the list of df2 columns
val df3 = df1.join(df2, df1("id") === df2("id"))
.drop(df2ColumnList : _*) // You can pass the list to drop function

The problem is that drop will only take one value of type Column, but multiple values of type String.
If you pass multiple values of type String and the same column name exists in both joined dataframes, you might lose that column's data from the first dataframe as well.
Instead of dropping columns, select only the required columns, like below.
// All the df1 columns, plus the columns added via withColumn
val columns = df1.columns.map(c => df1(c)).toList ::: List(col("with_column_a"), col("with_column_b"))
val df3 = df1.join(df2, df1("id") === df2("id")).select(columns: _*)
Or
val df3 = df1.join(df2, df1("id") === df2("id"))
// Drop each df2 column by its Column reference, so the df1 copies of any shared names are preserved
val df4 = df2.columns.map(column => df2(column)).foldLeft(df3)((ddf, column) => ddf.drop(column))

The best approach when you have multiple columns to drop after a join is to use .select:
val df3 = df1.join(df2, df1("id") === df2("id"))
.select("Select all the columns you need")
This way you don't need to worry about whether you have dropped a column you need, and it avoids problems with ambiguous columns that exist in both dataframes.
You can also use .selectExpr() to alias columns with as while selecting them, for example:
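A minimal sketch, assuming df1 and df2 from the question; name and salary are hypothetical column names used only for illustration:
import org.apache.spark.sql.functions.col

// Alias each dataframe so references stay unambiguous, then keep only the
// df1 columns you need; "salary as monthly_salary" shows the aliasing with "as"
val df3 = df1.as("a")
  .join(df2.as("b"), col("a.id") === col("b.id"))
  .selectExpr("a.id", "a.name", "a.salary as monthly_salary")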

Related

Spark Dataframe column transformations using a lookup into another Dataframe

I need to transform multiple column values of a dataframe by looking them up in another dataframe.
The other dataframe, on the right, will not have too many rows, say around 5000 records.
I need to replace, for example, the field_1 column values with ratios, like field_1,0 to 8 and field_1,3 to 25, by looking up into the right data frame.
So eventually the field columns will be filled with the looked-up ratios.
Option 1 is to load and collect the lookup dataframe into memory and broadcast it as a broadcast variable. A Map of Maps could be used, I believe, and should not take too much memory on the executors.
Option 2 is to join the lookup dataframe for each column. But I believe this will be highly inefficient, as the number of field columns can be large, say 50 to 100.
Which of the above options is better? Or is there a better way of filling in the values?
I would go for option 1, e.g.:
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.functions.udf
import spark.implicits._  // assumes a SparkSession named spark

val dfBig: DataFrame = ???
val dfLookup: DataFrame = ???

// Collect the small lookup dataframe into a map keyed by (category, field value)
val lookupMap = dfLookup
  .map { case Row(category: String, field_values: Int, ratio: Int) => ((category, field_values), ratio) }
  .collect()
  .toMap
val bc_lookupMap = spark.sparkContext.broadcast(lookupMap)

// Look up both fields in one pass; the UDF returns the two ratios as a struct
val lookupUdf = udf((field1: Int, field2: Int) =>
  (bc_lookupMap.value(("field_1", field1)), bc_lookupMap.value(("field_2", field2)))
)

dfBig
  .withColumn("udfResult", lookupUdf($"field_1", $"field_2"))
  .select($"primaryId", $"udfResult._1".as("field_1"), $"udfResult._2".as("field_2"))

How to merge edits from one dataframe into another dataframe in Spark?

I have a dataframe df1 with 150 columns and many rows. I also have a dataframe df2 with the same schema but very few rows, containing edits that should be applied to df1 (there's a key column id to identify which row to update). df2 has only the columns with updates populated; the rest of the columns are null. What I want to do is update the rows in df1 with the corresponding rows from df2 in the following way:
if a column in df2 is null, it should not cause any changes in df1
if a column in df2 contains a tilde "~", it should result in nullifying that column in df1
otherwise the value in column in df1 should get replaced with the value from df2
How can I best do it? Can it be done in a generic way without listing all the columns but rather iterating over them? Can it be done using dataframe API or do I need to switch to RDDs?
(Of course by updating dataframe df1 I mean creating a new, updated dataframe.)
Example
Let's say the schema is: id:Int, name:String, age: Int.
df1 is:
1,"Greg",18
2,"Kate",25
3,"Chris",30
df2 is:
1,"Gregory",null
2,~,26
The updated dataframe should look like this:
1,"Gregory",18
2,null,26
3,"Chris",30
You can also use case/when or coalesce with a full outer join to merge the two dataframes; see the link below for an explanation, and the sketch after it.
Spark incremental loading overwrite old record
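For the example schema (id, name, age), a rough sketch of that coalesce idea could look like the following; it assumes df1 and df2 as above, and applies the "~" check to every column for simplicity (it only really matters for string columns):
import org.apache.spark.sql.functions.{coalesce, col, lit, when}

// For each non-key column: "~" in df2 nullifies the value, otherwise prefer
// df2's value when it is present and fall back to df1's value.
val editedCols = df1.columns.filterNot(_ == "id").map { c =>
  when(col(s"b.$c") === "~", lit(null))
    .otherwise(coalesce(col(s"b.$c"), col(s"a.$c")))
    .as(c)
}
val merged = df1.as("a")
  .join(df2.as("b"), col("a.id") === col("b.id"), "full_outer")
  .select(coalesce(col("a.id"), col("b.id")).as("id") +: editedCols: _*)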
I figured out how to do it with an intermediate conversion to RDD. First, create a map idsToEdits where keys are row ids and values are maps of column numbers to values (only the non-null ones).
val idsToEdits = df2.rdd.map { row =>
  (row(0),
   row.getValuesMap[AnyVal](row.schema.fieldNames.filterNot(colName => row.isNullAt(row.fieldIndex(colName))))
     .map { case (k, v) => (row.fieldIndex(k), if (v == "~") null else v) })
}.collectAsMap()
Broadcast that map and define an editRow function that updates a row.
val idsToEditsBr = sc.broadcast(idsToEdits)
import org.apache.spark.sql.Row
val editRow: Row => Row = { row =>
  idsToEditsBr
    .value
    .get(row(0))
    .map { edits =>
      Row.fromSeq(edits.foldLeft(row.toSeq) { case (rowSeq, (idx, newValue)) => rowSeq.updated(idx, newValue) })
    }
    .getOrElse(row)
}
Finally, use that function on RDD derived from df1 and convert back to a dataframe.
val updatedDF=spark.createDataFrame(df1.rdd.map(editRow),df1.schema)
It sounds like your question is how to perform this without explicitly naming all the columns, so I will assume you have some "doLogic" udf function or dataframe functions to perform your logic after joining.
import org.apache.spark.sql.functions.{col, when}
import org.apache.spark.sql.types.StringType

// Rename every df2 column with a "2" suffix so the two sides never clash
val df2Renamed = df2.select(df2.columns.map(x => col(x).alias(x + "2")): _*)

// For string columns apply the custom logic; otherwise take the df2 value when present
val cols = df1.schema.filterNot(x => x.name == "id").map { x =>
  if (x.dataType == StringType) {
    doLogicUdf(col(x.name), col(x.name + "2")).as(x.name)
  } else {
    when(col(x.name + "2").isNotNull, col(x.name + "2")).otherwise(col(x.name)).as(x.name)
  }
} :+ col("id")

// left_outer keeps the df1 rows that have no edits in df2
df1.join(df2Renamed, col("id") === col("id2"), "left_outer").select(cols : _*)

spark: apply explode to a list of columns in a dataFrame, but not to all columns

Suppose I have a list of column names
val expFields = List("f1", "f2") and a dataFrame df, and I'd like to explode the columns in the expFields list of df. That means I'd like to apply explode to a select number of columns and return a new dataFrame. I don't want to manually specify the column names like df.withColumn("f1", explode(col("f1"))).withColumn("f2", explode(col("f2"))). I'd like to use the expFields list to specify these columns. How do I do that in Spark?
Just fold over the list of columns:
expFields.foldLeft(df)((acc, c) => acc.withColumn(c, explode(col(c))))
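For example, with a toy dataframe (this assumes a SparkSession named spark is in scope):
import org.apache.spark.sql.functions.{col, explode}
import spark.implicits._  // from the assumed SparkSession "spark"

// Two array columns to explode, one scalar column left untouched
val df = Seq((1, Seq("a", "b"), Seq(10, 20))).toDF("id", "f1", "f2")
val expFields = List("f1", "f2")

val exploded = expFields.foldLeft(df)((acc, c) => acc.withColumn(c, explode(col(c))))
// Note the explodes compound: the result has one row per combination,
// here (1,a,10), (1,a,20), (1,b,10), (1,b,20)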

efficiently get joined and not joined data of a dataframe against other dataframe

I have two dataframes, let's say A and B. They have different schemas.
I want to get the records from dataframe A which join with B on a key, and I also want the records which didn't get joined.
Can this be done in a single query?
Going over the same data twice would reduce performance. Dataframe A is much bigger in size than B.
Dataframe B's size will be around 50GB-100GB,
hence I can't broadcast B in that case.
I am okay with getting a single dataframe C as a result, which can have a partition column "Joined" with values "Yes" or "No", signifying whether the data in A got joined with B or not.
What if A has duplicates? I don't want them.
I was thinking that I'll do a reduceByKey later on the C dataframe. Any suggestions around that?
I am using Hive tables to store the data in ORC file format on HDFS.
Writing code in Scala.
Yes, you just need to do a left-outer join:
import sqlContext.implicits._
val A = sc.parallelize(List(("id1", 1234),("id1", 1234),("id3", 5678))).toDF("id1", "number")
val B = sc.parallelize(List(("id1", "Hello"),("id2", "world"))).toDF("id2", "text")
val joined = udf((id: String) => id match {
case null => "No"
case _ => "Yes"
})
val C = A
.distinct
.join(B, 'id1 === 'id2, "left_outer")
.withColumn("joined",joined('id2))
.drop("id2")
.drop("text")
This will yield a dataframe C:[id1: string, number: int, joined: string] that looks like this:
[id1,1234,Yes]
[id3,5678,No]
Note that I have added a distinct to filter out duplicates in A, and that the last column in C indicates whether or not the row was joined.
EDIT: Following a remark from the OP, I have added the drop lines to remove the columns that came from B.

Spark Deduplicate column in dataframe based on column in other dataframe

I am trying to deduplicate values in a Spark dataframe column based on values in another dataframe column. It seems that withColumn() only works within a single dataframe, and subqueries won't be fully available until version 2. I suppose I could try to join the tables but that seems a bit messy. Here is the general idea:
df.take(1)
[Row(TIMESTAMP='20160531 23:03:33', CLIENT ID=233347, ROI NAME='my_roi', ROI VALUE=1, UNIQUE_ID='173888')]
df_re.take(1)
[Row(UNIQUE_ID='6866144:ST64PSIMT5MB:1')]
Basically I just want to take the values from df and remove any that are found in df_re, then return the whole dataframe with the rows containing those duplicates removed. I'm sure I could iterate over each one, but I am wondering if there is a better way.
Any ideas?
The way to do this is to do a left_outer join, and then filter for where the right-hand side of the join is empty. Something like:
val df1 = Seq((1,2),(2,123),(3,101)).toDF("uniq_id", "payload")
val df2 = Seq((2,432)).toDF("uniq_id", "other_data")
df1.as("df1").join(
df2.as("df2"),
col("df1.uniq_id") === col("df2.uniq_id"),
"left_outer"
).filter($"df2.uniq_id".isNull)
