How to merge edits from one dataframe into another dataframe in Spark? - apache-spark

I have a dataframe df1 with 150 columns and many rows. I also have a dataframe df2 with the same schema but very few rows, containing edits that should be applied to df1 (there's a key column id to identify which row to update). df2 has only the columns with updates populated; the other columns are null. What I want to do is update the rows in df1 with the corresponding rows from df2 in the following way:
if a column in df2 is null, it should not cause any changes in df1
if a column in df2 contains a tilde "~", it should result in nullifying that column in df1
otherwise the value in column in df1 should get replaced with the value from df2
How can I best do it? Can it be done in a generic way without listing all the columns but rather iterating over them? Can it be done using dataframe API or do I need to switch to RDDs?
(Of course by updating dataframe df1 I mean creating a new, updated dataframe.)
Example
Let's say the schema is: id:Int, name:String, age: Int.
df1 is:
1,"Greg",18
2,"Kate",25
3,"Chris",30
df2 is:
1,"Gregory",null
2,~,26
The updated dataframe should look like this:
1,"Gregory",18
2,null,26
3,"Chris",30

You can also use CASE or COALESCE over a full outer join to merge the two dataframes. See the linked question below for an explanation.
Spark incremental loading overwrite old record
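A minimal sketch of how the three rules from the question might be expressed with the DataFrame API alone, using a left outer join and when/otherwise instead of the full outer join suggested above. The helper name applyEdits and the aliases a and b are hypothetical, and the sketch assumes column names without dots or other special characters.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit, when}

val spark = SparkSession.builder().master("local[1]").appName("merge-edits-sketch").getOrCreate()
import spark.implicits._

// Hypothetical helper (not from the original answers): applies the edits in df2 to df1.
def applyEdits(df1: DataFrame, df2: DataFrame, key: String): DataFrame = {
  // Build one expression per column by iterating over the schema instead of listing columns.
  val merged = df1.columns.map { c =>
    if (c == key) col(s"a.$c")
    else
      when(col(s"b.$c").isNull, col(s"a.$c"))                  // null edit: keep the df1 value
        .when(col(s"b.$c").cast("string") === "~", lit(null))  // tilde: nullify the column
        .otherwise(col(s"b.$c"))                               // anything else: take the edit
        .as(c)
  }
  // Left outer join so that df1 rows without an edit are kept unchanged.
  df1.as("a")
    .join(df2.as("b"), col(s"a.$key") === col(s"b.$key"), "left_outer")
    .select(merged: _*)
}

// Sample data taken from the question.
val df1 = Seq((1, "Greg", 18), (2, "Kate", 25), (3, "Chris", 30)).toDF("id", "name", "age")
val df2 = Seq[(Int, String, Option[Int])]((1, "Gregory", None), (2, "~", Some(26)))
  .toDF("id", "name", "age")

val updated = applyEdits(df1, df2, "id").orderBy("id")
updated.show()
```

With the sample data from the question this should reproduce the expected rows (1, Gregory, 18), (2, null, 26) and (3, Chris, 30). Because unmatched rows see nulls on the df2 side, they fall into the first branch and pass through untouched.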

I figured out how to do it with an intermediate conversion to RDD. First, create a map idsToEdits where keys are row ids and values are maps of column numbers to values (only the non-null ones).
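val idsToEdits = df2.rdd.map { row =>
  // Names of the columns that actually carry a value in this edit row
  val editedCols = row.schema.fieldNames.filterNot(colName => row.isNullAt(row.fieldIndex(colName)))
  (row(0),
    row.getValuesMap[Any](editedCols)
      // Key by column index; a tilde means "set this column to null"
      .map { case (k, v) => (row.fieldIndex(k), if (v == "~") null else v) })
}.collectAsMap()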
Broadcast that map and define an editRow function that updates a row.
val idsToEditsBr=sc.broadcast(idsToEdits)
import org.apache.spark.sql.Row
val editRow: Row => Row = { row =>
  idsToEditsBr.value
    .get(row(0))
    .map { edits =>
      Row.fromSeq(edits.foldLeft(row.toSeq) { case (rowSeq, (idx, newValue)) =>
        rowSeq.updated(idx, newValue)
      })
    }
    .getOrElse(row)
}
Finally, use that function on RDD derived from df1 and convert back to a dataframe.
val updatedDF=spark.createDataFrame(df1.rdd.map(editRow),df1.schema)

It sounds like your question is how to perform this without explicitly naming all the columns, so I will assume you have some "doLogic" UDF or dataframe functions to perform your logic after joining.
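import org.apache.spark.sql.functions.{col, when}
import org.apache.spark.sql.types.StringType
// doLogicUdf is the assumed UDF that combines an original value with its edit
val cols = df1.schema.filterNot(x => x.name == "id").map { x =>
  if (x.dataType == StringType) {
    doLogicUdf(col(x.name), col(x.name + "2")).as(x.name)
  } else {
    when(col(x.name + "2").isNotNull, col(x.name + "2")).otherwise(col(x.name)).as(x.name)
  }
} :+ col("id")
// Suffix every df2 column with "2" so the names do not clash after the join
val df2Renamed = df2.select(df2.columns.map(x => col(x).alias(x + "2")): _*)
// A left outer join keeps the df1 rows that have no edit in df2
df1.join(df2Renamed, col("id") === col("id2"), "left_outer").select(cols: _*)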
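The answer above leaves doLogicUdf undefined. One possible shape for it, following the three rules in the question, is sketched here; the name doLogic and the exact signature are assumptions, not part of the original answer.

```scala
import org.apache.spark.sql.functions.udf

// Hypothetical implementation of the "doLogic" function assumed by the answer:
// it takes the original string value and the edit, and returns the merged value.
val doLogic: (String, String) => String = (original, edit) =>
  if (edit == null) original   // no edit supplied: keep the original
  else if (edit == "~") null   // tilde: nullify the value
  else edit                    // otherwise: take the edited value

// Wrap the plain function so it can be applied to Column arguments.
val doLogicUdf = udf(doLogic)
```

Keeping the logic in a plain Scala function makes it easy to check without a Spark session. For string columns the same effect can usually be had with when/otherwise, which avoids the UDF overhead.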

Related

dropping all the column from dataframe while joining 2 dataframe in spark

I'm joining two DataFrames and adding some columns with the withColumn method. In the final dataframe I want all the columns from the first dataframe plus the new columns added with withColumn, and I want to drop all the columns that came from the second dataframe. Is there a method to drop all of the second dataframe's columns at once? Currently I'm calling drop separately for every column.
val df3 = df1.join(df2, df1("id") === df2("id"))
.drop(df2("name"))
.drop(df2("lastname"))
Is there any way to drop all the columns with a single call instead of dropping them separately?
It can be done as below; see the inline comments for an explanation of the code.
val df2ColumnList = df2.columns // Get the list of df2 columns
val df3 = df1.join(df2, df1("id") === df2("id"))
.drop(df2ColumnList : _*) // You can pass the list to drop function
The catch is that drop accepts either a single value of type Column or multiple values of type String.
If you pass multiple String names and a column with the same name exists in both joined dataframes, you may lose that column's data from df1 as well.
Instead of dropping columns, select only the required columns, as below.
val columns = df1.columns.map(c => df1(c)).toList ::: List(col("with_column_a"),col("with_column_b"))
val df3 = df1.join(df2, df1("id") === df2("id")).select(columns:_*)
Or
val df3 = df1.join(df2, df1("id") === df2("id"))
df2.columns.map(column => df2(column)).foldLeft(df3)((ddf,column) => ddf.drop(column))
The best approach when you have multiple columns to drop from a join is by using .select
val df3 = df1.join(df2, df1("id") === df2("id"))
.select("Select all the columns you need")
This way you don't have to worry about whether you have dropped a column you need, since the two dataframes may contain ambiguous column names.
You can also use .selectExpr() to alias columns with "as" while selecting them.
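As an illustration of the select-based approach, the sketch below aliases both sides of the join and selects every column of the first dataframe with a star expression, plus one derived column. The sample data, the aliases l and r, and the full_name column are made up for the example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, concat_ws}

val spark = SparkSession.builder().master("local[1]").appName("select-left-side-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample data; only the "id", "name" and "lastname" names come from the question.
val df1 = Seq((1, "a"), (2, "b")).toDF("id", "value")
val df2 = Seq((1, "John", "Smith")).toDF("id", "name", "lastname")

val df3 = df1.as("l")
  .join(df2.as("r"), col("l.id") === col("r.id"))
  // "l.*" expands to every column of df1; df2 columns are used only to derive the new column.
  .select(col("l.*"), concat_ws(" ", col("r.name"), col("r.lastname")).as("full_name"))

df3.printSchema()
```

Since nothing from the right side is selected directly, there is no ambiguous id column left to drop afterwards.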

PySpark: do I need to re-cache a DataFrame?

Say I have a dataframe:
rdd = sc.textFile(file)
df = sqlContext.createDataFrame(rdd)
df.cache()
and I add a column
df = df.withColumn('c1', lit(0))
I want to use df repeatedly. So do I need to re-cache() the dataframe, or does Spark automatically do it for me?
You will have to re-cache the dataframe every time you transform it, because each transformation returns a new dataframe that is not itself cached. However, the entire dataframe doesn't have to be recomputed.
df = df.withColumn('c1', lit(0))
In the above statement a new dataframe is created and reassigned to variable df. But this time only the new column is computed and the rest is retrieved from the cache.
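The question is about PySpark, but the same behaviour can be inspected from Scala. The sketch below assumes Spark 2.1 or later, where Dataset.storageLevel is available, and uses made-up names base and withC1.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().master("local[1]").appName("recache-sketch").getOrCreate()

// Cache a small dataframe, then derive a new one from it.
val base = spark.range(5).toDF("n").cache()
val withC1 = base.withColumn("c1", lit(0))

// The original dataframe is marked as cached; the derived one is not.
println(base.storageLevel)
println(withC1.storageLevel)

// The physical plan of the derived dataframe can be checked for a scan of the cached data.
withC1.explain()
```

If the derived dataframe is going to be reused many times, calling cache() on it as well is the usual choice; otherwise reading through the cached parent may be enough.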

efficiently get joined and not joined data of a dataframe against other dataframe

I have two dataframes lets say A and B. They have different schemas.
I want to get records from dataframe A which joins with B on a key and the records which didn't get joined, I want those as well.
Can this be done in a single query?
Going over the same data twice would hurt performance, and DataFrame A is much bigger than B.
Dataframe B's size will be around 50Gb-100gb.
Hence I can't broadcast B in that case.
I am okay with getting a single Dataframe C as a result, which can have a partition column "Joined" with values "Yes" or "No", signifying whether the data in A got joined or not with B.
What if A has duplicates that I don't want?
I was thinking of doing a reduceByKey later on the C dataframe. Any suggestions around that?
I am using hive tables to store the Data in ORC file format on HDFS.
Writing code in scala.
Yes, you just need to do a left-outer join:
import sqlContext.implicits._
val A = sc.parallelize(List(("id1", 1234),("id1", 1234),("id3", 5678))).toDF("id1", "number")
val B = sc.parallelize(List(("id1", "Hello"),("id2", "world"))).toDF("id2", "text")
val joined = udf((id: String) => id match {
case null => "No"
case _ => "Yes"
})
val C = A
.distinct
.join(B, 'id1 === 'id2, "left_outer")
.withColumn("joined",joined('id2))
.drop('id2)
.drop('text)
This will yield a dataframe C:[id1: string, number: int, joined: string] that looks like this:
[id1,1234,Yes]
[id3,5678,No]
Note that I have added a distinct to filter out duplicates in A, and that the last column in C indicates whether or not the row was joined.
EDIT: Following remark from OP, I have added the drop lines to remove the columns from B.
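The same flag can probably be computed without a UDF by using the built-in when/otherwise functions, which keeps the expression visible to the optimizer. This is a sketch on the sample data from the answer, not a change to the answer's approach.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, when}

val spark = SparkSession.builder().master("local[1]").appName("joined-flag-sketch").getOrCreate()
import spark.implicits._

// Same sample data as in the answer above.
val A = Seq(("id1", 1234), ("id1", 1234), ("id3", 5678)).toDF("id1", "number")
val B = Seq(("id1", "Hello"), ("id2", "world")).toDF("id2", "text")

val C = A.dropDuplicates()
  .join(B, col("id1") === col("id2"), "left_outer")
  // A null key on the right side means the row of A found no match in B.
  .withColumn("joined", when(col("id2").isNull, "No").otherwise("Yes"))
  .select("id1", "number", "joined")

C.show()
```

Note that if B contains several rows for the same key, the matching rows of A will be repeated once per match, so B may need deduplicating on the key as well.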

Spark Deduplicate column in dataframe based on column in other dataframe

I am trying to deduplicate values in a Spark dataframe column based on values in another dataframe column. It seems that withColumn() only works within a single dataframe, and subqueries won't be fully available until version 2. I suppose I could try to join the tables but that seems a bit messy. Here is the general idea:
df.take(1)
[Row(TIMESTAMP='20160531 23:03:33', CLIENT ID=233347, ROI NAME='my_roi', ROI VALUE=1, UNIQUE_ID='173888')]
df_re.take(1)
[Row(UNIQUE_ID='6866144:ST64PSIMT5MB:1')]
Basically just want to take the values from df and remove any that are found in df_re and then return the whole dataframe with the rows containing those duplicates removed. I'm sure I could iterate each one, but I am wondering if there is a better way.
Any ideas?
The way to do this is to do a left_outer join, and then filter for where the right-hand side of the join is empty. Something like:
val df1 = Seq((1,2),(2,123),(3,101)).toDF("uniq_id", "payload")
val df2 = Seq((2,432)).toDF("uniq_id", "other_data")
df1.as("df1").join(
df2.as("df2"),
col("df1.uniq_id") === col("df2.uniq_id"),
"left_outer"
).filter($"df2.uniq_id".isNull)
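On Spark 2.0 and later, the join-then-filter pattern can also be written as a single anti join. A minimal sketch using the same sample data, with the join type string "left_anti":

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("left-anti-sketch").getOrCreate()
import spark.implicits._

val df1 = Seq((1, 2), (2, 123), (3, 101)).toDF("uniq_id", "payload")
val df2 = Seq((2, 432)).toDF("uniq_id", "other_data")

// Keep only the rows of df1 whose uniq_id has no match in df2.
val kept = df1.join(df2, Seq("uniq_id"), "left_anti")

kept.show()
```

An anti join returns only the columns of the left dataframe, so no columns from df2 need to be dropped afterwards.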

Intersect dataframes that have a List column

I have two dataframes that have List as a column. Both the dataframes are identical except for the fact that the order of the list is different in the dataframes.
e.g. Schema: (id text, name List<text>)
df1: (5,WrappedArray(abc, pqr, xyz))
df2: (5,WrappedArray(abc, xyz, pqr))
When I use intersect I don't get this record in the results. How can I get the intersection of such records?
I think you are right that the easiest way would be to sort the list column.
import org.apache.spark.sql.functions.{col, udf}
// Sort each list so that element order no longer affects equality
val sortListFunc = udf((inputList: Seq[String]) => inputList.sorted)
val df1Sorted = df1
  .withColumn("name_sorted", sortListFunc(col("name")))
  .select($"id", $"name_sorted".as("name"))
val df2Sorted = df2
  .withColumn("name_sorted", sortListFunc(col("name")))
  .select($"id", $"name_sorted".as("name"))
Then you should be able to join or intersect.
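The sorting step can likely be done without a UDF by using the built-in sort_array function. The sketch below rebuilds the two one-row dataframes from the question and intersects them after sorting; the variable name common is made up.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, sort_array}

val spark = SparkSession.builder().master("local[1]").appName("sorted-intersect-sketch").getOrCreate()
import spark.implicits._

// The same record with the list elements in a different order.
val df1 = Seq((5, Seq("abc", "pqr", "xyz"))).toDF("id", "name")
val df2 = Seq((5, Seq("abc", "xyz", "pqr"))).toDF("id", "name")

// Normalise the element order on both sides before intersecting.
val common = df1.withColumn("name", sort_array(col("name")))
  .intersect(df2.withColumn("name", sort_array(col("name"))))

common.show(false)
```

This treats the lists as unordered; if element order carries meaning in the data, sorting before the comparison would not be appropriate.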
