How to subtract two DataFrames keeping duplicates in Spark 2.3.0 - apache-spark

Spark 2.4.0 introduces a handy new function, exceptAll, which allows subtracting one DataFrame from another while keeping duplicates.
Example
val df1 = Seq(
("a", 1L),
("a", 1L),
("a", 1L),
("b", 2L)
).toDF("id", "value")
val df2 = Seq(
("a", 1L),
("b", 2L)
).toDF("id", "value")
df1.exceptAll(df2).collect()
// will return
Seq(("a", 1L),("a", 1L))
However I can only use Spark 2.3.0.
What is the best way to implement this using only functions from Spark 2.3.0?

One option is to use row_number to generate a sequential number for each duplicate row and use it in a left join to find the rows missing from df2.
A PySpark solution is shown here.
from pyspark.sql.functions import row_number
from pyspark.sql import Window
# partition by all columns so each set of identical rows is numbered 1, 2, 3, ...
w1 = Window.partitionBy(df1.id, df1.value).orderBy(df1.value)
w2 = Window.partitionBy(df2.id, df2.value).orderBy(df2.value)
df1 = df1.withColumn("rnum", row_number().over(w1))
df2 = df2.withColumn("rnum", row_number().over(w2))
# rows of df1 with no match in df2 (null on the right) are the ones exceptAll would keep
res_like_exceptAll = df1.join(df2, (df1.id == df2.id) & (df1.value == df2.value) & (df1.rnum == df2.rnum), 'left') \
    .filter(df2.id.isNull()) \
    .select(df1.id, df1.value)
res_like_exceptAll.show()
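Since the question itself is written in Scala, a rough Scala equivalent for Spark 2.3 can use the same row-number trick together with a left_anti join; this is a sketch assuming the df1 and df2 from the question:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
// number the duplicates within each group of identical rows
val w = Window.partitionBy("id", "value").orderBy("value")
val left  = df1.withColumn("rnum", row_number().over(w))
val right = df2.withColumn("rnum", row_number().over(w))
// rows of df1 whose (id, value, rnum) has no match in df2 are exactly
// the rows exceptAll would keep
val resLikeExceptAll = left
  .join(right, Seq("id", "value", "rnum"), "left_anti")
  .drop("rnum")
resLikeExceptAll.show()
The left_anti join type replaces the explicit null filter used in the PySpark version above.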

Related

Performance benefit of separate actions in a series of transformations

Consider a series of heavy transformations:
val df1 = spark.sql("select * from big table where...")
val df2 = df1.groupBy(...).agg(..)
val df3 = df2.join(...)
val df4 = df3.columns.foldLeft(df3)((inputDF, column) => (...))
val df5 = df4.withColumn("newColName", row_number().over(Window.orderBy(.., ..)))
df5.count
Is there any performance benefit to adding an action after each transformation, instead of taking a single action after the last transformation?
val df1 = spark.sql("select * from big table where...")
val df2 = df1.groupBy(...).agg(..)
df2.count // does it help to ease the performance impact of df5.count?
val df3 = df2.join(...)
df3.count // does it help to ease the performance impact of df5.count?
val df4 = df3.columns.foldLeft(df3)((inputDF, column) => (...))
df4.count // does it help to ease the performance impact of df5.count?
val df5 = df4.withColumn("newColName", row_number().over(Window.orderBy(.., ..)))
df5.count

Passing an entire row as an argument to spark udf through spark dataframe - throws AnalysisException

I am trying to pass an entire row to a Spark UDF along with a few other arguments. I am not using Spark SQL; rather, I am using the DataFrame withColumn API, but I am getting the following exception:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Resolved attribute(s) col3#9 missing from col1#7,col2#8,col3#13 in operator !Project [col1#7, col2#8, col3#13, UDF(col3#9, col2, named_struct(col1, col1#7, col2, col2#8, col3, col3#9)) AS contcatenated#17]. Attribute(s) with the same name appear in the operation: col3. Please check if the right attribute(s) are used.;;
The above exception can be replicated using the below code:
addRowUDF() // call invokes
def addRowUDF() {
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().config(new SparkConf().setMaster("local[*]")).appName(this.getClass.getSimpleName).getOrCreate()
import spark.implicits._
val df = Seq(
("a", "b", "c"),
("a1", "b1", "c1")).toDF("col1", "col2", "col3")
execute(df)
}
def execute(df: org.apache.spark.sql.DataFrame) {
import org.apache.spark.sql.Row
def concatFunc(x: Any, y: String, row: Row) = x.toString + ":" + y + ":" + row.mkString(", ")
import org.apache.spark.sql.functions.{ udf, struct, lit }
val combineUdf = udf((x: Any, y: String, row: Row) => concatFunc(x, y, row))
def udf_execute(udf: String, args: org.apache.spark.sql.Column*) = (combineUdf)(args: _*)
val columns = df.columns.map(df(_))
val df2 = df.withColumn("col3", lit("xxxxxxxxxxx"))
val df3 = df2.withColumn("contcatenated", udf_execute("uudf", df2.col("col3"), lit("col2"), struct(columns: _*)))
df3.show(false)
}
output should be:
+----+----+-----------+----------------------------+
|col1|col2|col3 |contcatenated |
+----+----+-----------+----------------------------+
|a |b |xxxxxxxxxxx|xxxxxxxxxxx:col2:a, b, c |
|a1 |b1 |xxxxxxxxxxx|xxxxxxxxxxx:col2:a1, b1, c1 |
+----+----+-----------+----------------------------+
That happens because you refer to a column that is no longer in scope. When you call:
val df2 = df.withColumn("col3", lit("xxxxxxxxxxx"))
it shadows the original col3 column, effectively making the preceding column with the same name inaccessible. Even if that weren't the case, let's say after:
val df2 = df.select($"*", lit("xxxxxxxxxxx") as "col3")
the new col3 would be ambiguous, and indistinguishable by name from the one brought in by *.
So to achieve the required output you'll have to use another name:
val df2 = df.withColumn("col3_", lit("xxxxxxxxxxx"))
and then adjust the rest of your code accordingly:
df2.withColumn(
  "contcatenated",
  udf_execute("uudf", df2.col("col3_") as "col3",
    lit("col2"), struct(columns: _*))
).drop("col3").withColumnRenamed("col3_", "col3")
If the logic is as simple as the one in the example, you can of course just inline things:
df.withColumn(
  "contcatenated",
  udf_execute("uudf", lit("xxxxxxxxxxx") as "col3",
    lit("col2"), struct(columns: _*))
)
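Putting the pieces together, a minimal self-contained sketch of a corrected execute (it reuses the question's UDF pattern; the final drop and rename simply restore the column name shown in the expected output and are an assumption about the intent):
def execute(df: org.apache.spark.sql.DataFrame): Unit = {
  import org.apache.spark.sql.Row
  import org.apache.spark.sql.functions.{lit, struct, udf}
  val combineUdf = udf((x: Any, y: String, row: Row) =>
    x.toString + ":" + y + ":" + row.mkString(", "))
  // capture the original columns before anything is renamed or replaced
  val columns = df.columns.map(df(_))
  // use a fresh name so the original col3 attribute stays resolvable
  val df2 = df.withColumn("col3_", lit("xxxxxxxxxxx"))
  val df3 = df2
    .withColumn("contcatenated",
      combineUdf(df2.col("col3_"), lit("col2"), struct(columns: _*)))
    .drop("col3")                       // drop the original col3 ...
    .withColumnRenamed("col3_", "col3") // ... and restore the expected name
  df3.show(false)
}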

Efficient way to reduceByKey ignoring keys not in another RDD?

I have a large collection of data in PySpark. The format is key-value pairs, and I need to do a reduceByKey operation, ignoring all data whose key isn't in an RDD of 'interesting' keys that I also have.
I found a solution on SO that uses the subtractByKey operation to achieve this. It works, but crashes due to low memory on my cluster. I have not been able to fix this by tweaking the settings, so I'm hoping there's a more efficient solution.
Here's my solution that works on smaller datasets:
# The keys I'm interested in
edges = sc.parallelize([("a", "b"), ("b", "c"), ("a", "d")])
# Data containing both interesting and uninteresting stuff
data1 = sc.parallelize([(("a", "b"), [42]), (("a", "c"), [60]), (("a", "d"), [13, 37])])
data2 = sc.parallelize([(("a", "b"), [43]), (("b", "c"), [23, 24]), (("a", "c"), [13, 37])])
all_data = [data1, data2]
mask = edges.map(lambda t: (tuple(t), None))
rdds = []
for datum in all_data:
combined = datum.reduceByKey(lambda a, b: a+b)
unwanted = combined.subtractByKey(mask)
wanted = combined.subtractByKey(unwanted)
rdds.append(wanted)
edge_alltimes = sc.union(rdds).reduceByKey(lambda a,b: a+b)
edge_alltimes.collect()
As desired, this outputs [(('a', 'd'), [13, 37]), (('a', 'b'), [42, 43]), (('b', 'c'), [23, 24])]
(i.e. data for the 'interesting' key tuples have been combined and the rest has been dropped).
The reason I have the data in several RDDs is to mimic behavior on my cluster where I can't load all the data simultaneously due to its size.
Any help would be great.
Example with join. A small drawback is that you need an RDD of pairs before the join, and you need to strip the extra data after the join.
import org.apache.spark.{SparkConf, SparkContext}
object Main {
val conf = new SparkConf().setAppName("myapp").setMaster("local[*]")
val sc = new SparkContext(conf)
def main(args: Array[String]): Unit = {
val goodKeys = sc.parallelize(Seq(1, 2))
val allData = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c")))
val goodPairs = goodKeys.map(v => (v, 0))
val goodData = allData.join(goodPairs).mapValues(p => p._1)
goodData.collect().foreach(println)
}
}
Output:
(1,a)
(2,b)
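Adapted to the shape of the data in the question (tuple keys, list values), the same join-based filtering could look roughly like this; the names mirror the question and the snippet assumes an existing SparkContext sc, as in a spark-shell:
// the interesting keys and two chunks of data, as in the question
val edges = sc.parallelize(Seq(("a", "b"), ("b", "c"), ("a", "d")))
val data1 = sc.parallelize(Seq((("a", "b"), List(42)), (("a", "c"), List(60)), (("a", "d"), List(13, 37))))
val data2 = sc.parallelize(Seq((("a", "b"), List(43)), (("b", "c"), List(23, 24)), (("a", "c"), List(13, 37))))
// pair RDD of the interesting keys so it can be joined against
val mask = edges.map(e => (e, ()))
// filter each chunk with a join (instead of two subtractByKey passes), then combine the chunks
val filtered = Seq(data1, data2).map { datum =>
  datum.reduceByKey(_ ++ _).join(mask).mapValues(_._1)
}
val edgeAlltimes = sc.union(filtered).reduceByKey(_ ++ _)
edgeAlltimes.collect().foreach(println)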

How to convert a Spark DataFrame column into a list?

I want to convert a Spark DataFrame into another DataFrame in the following manner:
I have Spark DataFrame:
col des
A a
A b
B b
B c
As a result of the operation I would like to have a Spark DataFrame like:
col des
A a,b
B b,c
I tried to use:
result <- summarize(groupBy(df, df$col), des = n(df$des))
As a result I obtained the count. Is there any parameter of summarize (or agg) that converts a column into a list or something similar, with the assumption that all operations are done in Spark?
Thank you in advance
Here is the solution in Scala; you will need to adapt it for SparkR.
import org.apache.spark.sql.functions.{collect_list, struct}
import spark.implicits._
val dataframe = spark.sparkContext.parallelize(Seq(
("A", "a"),
("A", "b"),
("B", "b"),
("B", "c")
)).toDF("col", "desc")
dataframe.groupBy("col").agg(collect_list(struct("desc")).as("desc")).show
Hope this helps!
sparkR code:
sc <- sparkR.init()
sqlContext <- sparkRSQL.init(sc)
#create R data frame
df <- data.frame(col= c("A","A","B","B"),des= c("a","b","b","c"))
#converting to spark dataframe
sdf <- createDataFrame( sqlContext, df)
registerTempTable(sdf, "sdf")
head(sql(sqlContext, "SQL QUERY"))
Write the corresponding SQL query there and execute it.

How to get all columns after groupby on Dataset<Row> in spark sql 2.1.0

First, I am very new to Spark.
I have millions of records in my Dataset and I want to group by the name column and find, for each name, the row with the maximum age. I am getting correct results, but I need all columns in my result set.
Dataset<Row> resultset = studentDataSet.select("*").groupBy("name").max("age");
resultset.show(1000,false);
I am getting only name and max(age) in my resulting Dataset.
For your solution you have to try a different approach. You were almost there; let me help you understand.
Dataset<Row> resultset = studentDataSet.groupBy("name").max("age");
Now what you can do is join the resultset back with studentDataSet:
Dataset<Row> joinedDS = studentDataset.join(resultset, "name");
The problem with groupBy is that after applying it you get a RelationalGroupedDataset, so the result depends on which aggregation you perform next (sum, min, mean, max, etc.); the result of that aggregation is returned alongside the grouping column.
In your case the name column is returned alongside the max of age, so only those two columns come back; likewise, if you grouped by age and then applied max on the 'age' column, you would get two columns, age and max(age).
Note: the code is not tested, please make changes if needed.
Hope this clears up your query.
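As a rough end-to-end sketch of this join-back approach in Scala (assuming a DataFrame named studentDataSet like the one in the question; the alias max_age and the extra filter, which keeps only the rows whose age equals the per-name maximum, are additions for illustration):
import org.apache.spark.sql.functions.{col, max}
// per-name maxima, aliased so the column is easy to reference afterwards
val maxAges = studentDataSet.groupBy("name").agg(max("age").as("max_age"))
// join back on name and keep only the rows that actually carry the maximum age
val withAllColumns = studentDataSet
  .join(maxAges, "name")
  .where(col("age") === col("max_age"))
  .drop("max_age")
withAllColumns.show(1000, false)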
The accepted answer isn't ideal because it requires a join. Joining big DataFrames can cause a big shuffle that'll execute slowly.
Let's create a sample data set and test the code:
val df = Seq(
("bob", 20, "blah"),
("bob", 40, "blah"),
("karen", 21, "hi"),
("monica", 43, "candy"),
("monica", 99, "water")
).toDF("name", "age", "another_column")
This code should run faster with large DataFrames.
df
.groupBy("name")
.agg(
max("name").as("name1_dup"),
max("another_column").as("another_column"),
max("age").as("age")
).drop(
"name1_dup"
).show()
+------+--------------+---+
| name|another_column|age|
+------+--------------+---+
|monica| water| 99|
| karen| hi| 21|
| bob| blah| 40|
+------+--------------+---+
What you're trying to achieve is:
group rows by name
reduce each group to one row with the maximum age
This alternative achieves that output without using an aggregation:
import org.apache.spark.sql._
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
object TestJob5 {
def main (args: Array[String]): Unit = {
val sparkSession = SparkSession
.builder()
.appName(this.getClass.getName.replace("$", ""))
.master("local")
.getOrCreate()
val sc = sparkSession.sparkContext
sc.setLogLevel("ERROR")
import sparkSession.sqlContext.implicits._
val rawDf = Seq(
("Moe", "Slap", 7.9, 118),
("Larry", "Spank", 8.0, 115),
("Curly", "Twist", 6.0, 113),
("Laurel", "Whimper", 7.53, 119),
("Hardy", "Laugh", 6.0, 118),
("Charley", "Ignore", 9.7, 115),
("Moe", "Spank", 6.8, 118),
("Larry", "Twist", 6.0, 115),
("Charley", "fall", 9.0, 115)
).toDF("name", "requisite", "funniness_of_requisite", "age")
rawDf.show(false)
rawDf.printSchema
val nameWindow = Window
.partitionBy("name")
val aggDf = rawDf
.withColumn("id", monotonically_increasing_id)
.withColumn("maxFun", max("funniness_of_requisite").over(nameWindow))
.withColumn("count", count("name").over(nameWindow))
.withColumn("minId", min("id").over(nameWindow))
.where(col("maxFun") === col("funniness_of_requisite") && col("minId") === col("id") )
.drop("maxFun")
.drop("minId")
.drop("id")
aggDf.printSchema
aggDf.show(false)
}
}
Bear in mind that a group could potentially have more than one row with the maximum age, so you need to pick one by some logic. In the example I assume it doesn't matter, so I just assign a unique number to each row and keep the row with the smallest one.
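If you want the tie-breaking to be explicit rather than based on monotonically_increasing_id, one common alternative is a row_number window; the sketch below runs against the rawDf defined above, and the secondary sort key (requisite) is just an illustrative choice:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}
// deterministic tie-breaking: highest funniness first, then requisite name
val rankWindow = Window
  .partitionBy("name")
  .orderBy(col("funniness_of_requisite").desc, col("requisite"))
val onePerName = rawDf
  .withColumn("rank", row_number().over(rankWindow))
  .where(col("rank") === 1)   // keep exactly one row per name
  .drop("rank")
onePerName.show(false)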
Noting that a subsequent join means extra shuffling, and that some of the other solutions are inaccurate in what they return or even turn the Dataset into a DataFrame, I sought a better solution. Here is mine:
case class People(name: String, age: Int, other: String)
val df = Seq(
People("Rob", 20, "cherry"),
People("Rob", 55, "banana"),
People("Rob", 40, "apple"),
People("Ariel", 55, "fox"),
People("Vera", 43, "zebra"),
People("Vera", 99, "horse")
).toDS
val oldestResults = df
.groupByKey(_.name)
.mapGroups{
case (nameKey, peopleIter) => {
var oldestPerson = peopleIter.next
while(peopleIter.hasNext) {
val nextPerson = peopleIter.next
if(nextPerson.age > oldestPerson.age) oldestPerson = nextPerson
}
oldestPerson
}
}
oldestResults.show
The following produces:
+-----+---+------+
| name|age| other|
+-----+---+------+
|Ariel| 55| fox|
| Rob| 55|banana|
| Vera| 99| horse|
+-----+---+------+
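The same idea can also be written more compactly with reduceGroups; a minimal sketch, assuming the People case class, the df Dataset and the implicits from the example above:
// reduceGroups keeps one People per key by pairwise comparison
val oldestPerName = df
  .groupByKey(_.name)
  .reduceGroups((a, b) => if (a.age >= b.age) a else b)
  .map(_._2)   // reduceGroups yields (key, value) pairs; keep only the value
oldestPerName.show()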
You need to remember that aggregate functions reduce the rows, so you need to specify which row's age you want via a reducing function. If you want to retain all rows of a group (warning: this can cause explosions or skewed partitions), you can collect them as a list. You can then use a UDF (user-defined function) to reduce them by your criteria, in this example funniness_of_requisite, and then expand the columns of the reduced row with another UDF.
For the purpose of this answer I assume you wish to retain the age of the person who has the max funniness_of_requisite.
import org.apache.spark.sql._
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{IntegerType, StringType}
import scala.collection.mutable
object TestJob4 {
def main (args: Array[String]): Unit = {
val sparkSession = SparkSession
.builder()
.appName(this.getClass.getName.replace("$", ""))
.master("local")
.getOrCreate()
val sc = sparkSession.sparkContext
import sparkSession.sqlContext.implicits._
val rawDf = Seq(
(1, "Moe", "Slap", 7.9, 118),
(2, "Larry", "Spank", 8.0, 115),
(3, "Curly", "Twist", 6.0, 113),
(4, "Laurel", "Whimper", 7.53, 119),
(5, "Hardy", "Laugh", 6.0, 18),
(6, "Charley", "Ignore", 9.7, 115),
(2, "Moe", "Spank", 6.8, 118),
(3, "Larry", "Twist", 6.0, 115),
(3, "Charley", "fall", 9.0, 115)
).toDF("id", "name", "requisite", "funniness_of_requisite", "age")
rawDf.show(false)
rawDf.printSchema
val rawSchema = rawDf.schema
val fUdf = udf(reduceByFunniness, rawSchema)
val nameUdf = udf(extractAge, IntegerType)
val aggDf = rawDf
.groupBy("name")
.agg(
count(struct("*")).as("count"),
max(col("funniness_of_requisite")),
collect_list(struct("*")).as("horizontal")
)
.withColumn("short", fUdf($"horizontal"))
.withColumn("age", nameUdf($"short"))
.drop("horizontal")
aggDf.printSchema
aggDf.show(false)
}
def reduceByFunniness= (x: Any) => {
val d = x.asInstanceOf[mutable.WrappedArray[GenericRowWithSchema]]
val red = d.reduce((r1, r2) => {
val funniness1 = r1.getAs[Double]("funniness_of_requisite")
val funniness2 = r2.getAs[Double]("funniness_of_requisite")
val r3 = funniness1 match {
case a if a >= funniness2 =>
r1
case _ =>
r2
}
r3
})
red
}
def extractAge = (x: Any) => {
val d = x.asInstanceOf[GenericRowWithSchema]
d.getAs[Int]("age")
}
}
Here is the output:
+-------+-----+---------------------------+-------------------------------+---+
|name   |count|max(funniness_of_requisite)|short                          |age|
+-------+-----+---------------------------+-------------------------------+---+
|Hardy  |1    |6.0                        |[5, Hardy, Laugh, 6.0, 18]     |18 |
|Moe    |2    |7.9                        |[1, Moe, Slap, 7.9, 118]       |118|
|Curly  |1    |6.0                        |[3, Curly, Twist, 6.0, 113]    |113|
|Larry  |2    |8.0                        |[2, Larry, Spank, 8.0, 115]    |115|
|Laurel |1    |7.53                       |[4, Laurel, Whimper, 7.53, 119]|119|
|Charley|2    |9.7                        |[6, Charley, Ignore, 9.7, 115] |115|
+-------+-----+---------------------------+-------------------------------+---+
