PySpark: do I need to re-cache a DataFrame? - apache-spark

Say I have a dataframe:
rdd = sc.textFile(file)
df = sqlContext.createDataFrame(rdd)
df.cache()
and I add a column
df = df.withColumn('c1', lit(0))
I want to use df repeatedly. So do I need to re-cache() the dataframe, or does Spark automatically do it for me?

You will have to re-cache() the DataFrame every time you manipulate/change it. However, the entire DataFrame doesn't have to be recomputed.
df = df.withColumn('c1', lit(0))
In the statement above, a new DataFrame is created and reassigned to the variable df. But this time only the new column is computed, and the rest is retrieved from the cache.

Related

dropping all the column from dataframe while joining 2 dataframe in spark

I'm joining two DataFrames and adding some columns using the withColumn method. In my final dataframe I want all the columns from the first dataframe plus the new columns I added with withColumn, and I want to drop all the columns from the second dataframe. Is there any method to drop all the columns of the 2nd dataframe at once? Currently I'm using a separate drop call for every column.
val df3 = df1.join(df2, df1("id") === df2("id"))
.drop(df2("name"))
.drop(df2("lastname"))
Is there any way to drop all the columns with a single method call instead of dropping them separately?
It can be done as below; see the inline comments for an explanation of the code.
val df2ColumnList = df2.columns // Get the list of df2 columns
val df3 = df1.join(df2, df1("id") === df2("id"))
.drop(df2ColumnList : _*) // You can pass the list to drop function
The problem is that drop takes either a single value of type Column or multiple values of type String.
If you pass multiple values of type String and the same column name exists in both joined DataFrames, you might lose the data of the column you wanted to keep.
Instead of dropping columns, select only the required columns, like below.
val columns = df1.columns.map(c => df1(c)).toList ::: List(col("with_column_a"),col("with_column_b"))
val df3 = df1.join(df2, df1("id") === df2("id")).select(columns:_*)
Or
val df3 = df1.join(df2, df1("id") === df2("id"))
df2.columns.map(column => df2(column)).foldLeft(df3)((ddf,column) => ddf.drop(column))
The best approach when you have multiple columns to drop from a join is to use .select:
val df3 = df1.join(df2, df1("id") === df2("id"))
.select("Select all the columns you need")
This way you don't need to worry about whether you have dropped the columns you need, since there may be ambiguous columns present in both DataFrames.
You can also use .selectExpr() to alias columns with as while selecting them.

Python: DataFrame Index shifting

I have several dataframes that I have concatenated with pandas in the line:
xspc = pd.concat([df1,df2,df3], axis = 1, join_axes = [df3.index])
In df2 the index values read one day later than those of df1 and df3. So, for instance, when the most current date is 7/1/19, the index values for df1 and df3 read '7/1/19' while df2 reads '7/2/19'. I would like to concatenate the series so that each dataframe is joined on its most recent date; in other words, I would like all the dataframe values from df1 index value '7/1/19' to be concatenated with dataframe 2 index value '7/2/19' and dataframe 3 index value '7/1/19'. What methods can I use to shift the data around to join on these non-matching index values?
You can reset the index of each dataframe and then concat the dataframes:
df1=df1.reset_index()
df2=df2.reset_index()
df3=df3.reset_index()
df_final = pd.concat([df1,df2,df3],axis=1, join_axes=[df3.index])
This should work since you mentioned that the date in df2 will always be one day after df1 or df3.
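A small sketch with toy data (the values and dates are invented for illustration; note that join_axes was removed in newer pandas, so this version relies on plain positional concatenation after resetting the index):

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2]}, index=pd.to_datetime(["2019-06-30", "2019-07-01"]))
df2 = pd.DataFrame({"b": [3, 4]}, index=pd.to_datetime(["2019-07-01", "2019-07-02"]))
df3 = pd.DataFrame({"c": [5, 6]}, index=pd.to_datetime(["2019-06-30", "2019-07-01"]))

# Dropping the date index aligns the frames by position instead of by label,
# so df2's one-day-later dates no longer prevent the rows from lining up.
df_final = pd.concat(
    [df1.reset_index(drop=True), df2.reset_index(drop=True), df3.reset_index(drop=True)],
    axis=1,
)
```

With drop=True the old dates are discarded entirely; use plain reset_index() instead if you want to keep each frame's original dates as columns.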

How to merge edits from one dataframe into another dataframe in Spark?

I have a dataframe df1 with 150 columns and many rows. I also have a dataframe df2 with the same schema but very few rows, containing edits that should be applied to df1 (there's a key column id to identify which row to update). In df2, only the columns with updates are populated; the rest of the columns are null. What I want to do is to update the rows in df1 with the corresponding rows from dataframe df2 in the following way:
if a column in df2 is null, it should not cause any changes in df1
if a column in df2 contains a tilde "~", it should result in nullifying that column in df1
otherwise the value in column in df1 should get replaced with the value from df2
How can I best do it? Can it be done in a generic way without listing all the columns but rather iterating over them? Can it be done using dataframe API or do I need to switch to RDDs?
(Of course by updating dataframe df1 I mean creating a new, updated dataframe.)
Example
Let's say the schema is: id:Int, name:String, age: Int.
df1 is:
1,"Greg",18
2,"Kate",25
3,"Chris",30
df2 is:
1,"Gregory",null
2,~,26
The updated dataframe should look like this:
1,"Gregory",18
2,null,26
3,"Chris",30
You can also use case/when or coalesce with a full outer join to merge the two dataframes. See the link below for an explanation.
Spark incremental loading overwrite old record
I figured out how to do it with an intermediate conversion to RDD. First, create a map idsToEdits where keys are row ids and values are maps from column numbers to values (only the non-null ones).
val idsToEdits = df2.rdd.map { row =>
  (row(0),
   row.getValuesMap[AnyVal](row.schema.fieldNames.filterNot(colName => row.isNullAt(row.fieldIndex(colName))))
      .map { case (k, v) => (row.fieldIndex(k), if (v == "~") null else v) })
}.collectAsMap()
Broadcast that map and define an editRow function that updates a row.
val idsToEditsBr=sc.broadcast(idsToEdits)
import org.apache.spark.sql.Row
val editRow: Row => Row = { row =>
  idsToEditsBr
    .value
    .get(row(0))
    .map { edits =>
      Row.fromSeq(edits.foldLeft(row.toSeq) {
        case (rowSeq, (idx, newValue)) => rowSeq.updated(idx, newValue)
      })
    }
    .getOrElse(row)
}
Finally, use that function on RDD derived from df1 and convert back to a dataframe.
val updatedDF=spark.createDataFrame(df1.rdd.map(editRow),df1.schema)
It sounds like your question is how to perform this without explicitly naming all the columns, so I will assume you have some "doLogic" udf function or dataframe functions to perform your logic after joining.
import org.apache.spark.sql.types.StringType
val cols = df1.schema.filterNot(_.name == "id").map { x =>
  if (x.dataType == StringType)
    doLogicUdf(col(x.name), col(x.name + "2"))
  else
    when(col(x.name + "2").isNotNull, col(x.name + "2")).otherwise(col(x.name))
} :+ col("id")
// Suffix df2's columns so names don't clash after the join
// (a val cannot be reassigned, so use a new name for the renamed frame).
val df2Renamed = df2.select(df2.columns.map(x => col(x).alias(x + "2")): _*)
df1.join(df2Renamed, col("id") === col("id2"), "inner").select(cols: _*)

Spark Deduplicate column in dataframe based on column in other dataframe

I am trying to deduplicate values in a Spark dataframe column based on values in another dataframe column. It seems that withColumn() only works within a single dataframe, and subqueries won't be fully available until version 2. I suppose I could try to join the tables but that seems a bit messy. Here is the general idea:
df.take(1)
[Row(TIMESTAMP='20160531 23:03:33', CLIENT ID=233347, ROI NAME='my_roi', ROI VALUE=1, UNIQUE_ID='173888')]
df_re.take(1)
[Row(UNIQUE_ID='6866144:ST64PSIMT5MB:1')]
Basically just want to take the values from df and remove any that are found in df_re and then return the whole dataframe with the rows containing those duplicates removed. I'm sure I could iterate each one, but I am wondering if there is a better way.
Any ideas?
The way to do this is to do a left_outer join, and then filter for where the right-hand side of the join is empty. Something like:
val df1 = Seq((1,2),(2,123),(3,101)).toDF("uniq_id", "payload")
val df2 = Seq((2,432)).toDF("uniq_id", "other_data")
df1.as("df1").join(
df2.as("df2"),
col("df1.uniq_id") === col("df2.uniq_id"),
"left_outer"
).filter($"df2.uniq_id".isNull)

Computing difference between Spark DataFrames

I have two DataFrames df1 and df2. I want to compute a third DataFrame df3 such that df3 = (df1 - df2), i.e. all elements present in df1 but not in df2. Is there any built-in library function to achieve that, something like df1.subtract(df2)?
You are probably searching for the except function: http://spark.apache.org/docs/1.5.2/api/scala/index.html#org.apache.spark.sql.DataFrame
From the description:
def except(other: DataFrame): DataFrame
Returns a new DataFrame containing rows in this frame but not in
another frame. This is equivalent to EXCEPT in SQL.
