Question about joining dataframes in Spark - apache-spark

Suppose I have two partitioned dataframes:
import pyspark.sql.functions as F

df1 = spark.createDataFrame(
    [(x, x, x) for x in range(5)], ['key1', 'key2', 'time']
).repartition(3, 'key1', 'key2')
df2 = spark.createDataFrame(
    [(x, x, x) for x in range(7)], ['key1', 'key2', 'time']
).repartition(3, 'key1', 'key2')
(scenario 1) If I join them by [key1, key2], the join is performed within each partition without a shuffle (the number of partitions in the result dataframe stays the same):
x = df1.join(df2, on=['key1', 'key2'], how='left')
assert x.rdd.getNumPartitions() == 3
(scenario 2) But if I join them by [key1, key2, time], a shuffle takes place (the number of partitions in the result dataframe is 200, which is driven by the spark.sql.shuffle.partitions option):
x = df1.join(df2, on=['key1', 'key2', 'time'], how='left')
assert x.rdd.getNumPartitions() == 200
At the same time, groupBy and window operations by [key1, key2, time] preserve the number of partitions and are done without a shuffle:
x = df1.groupBy('key1', 'key2', 'time').agg(F.count('*'))
assert x.rdd.getNumPartitions() == 3
I can't understand whether this is a bug or whether there are reasons for performing a shuffle in the second scenario. And how can I avoid the shuffle, if that's possible?

I guess I was able to figure out the reason for the different results in Python and Scala.
The reason is the broadcast join optimisation. If spark-shell is started with broadcast joins disabled, Python and Scala work identically.
./spark-shell --conf spark.sql.autoBroadcastJoinThreshold=-1
val df1 = Seq(
  (1, 1, 1)
).toDF("key1", "key2", "time").repartition(3, col("key1"), col("key2"))
val df2 = Seq(
  (1, 1, 1),
  (2, 2, 2)
).toDF("key1", "key2", "time").repartition(3, col("key1"), col("key2"))
val x = df1.join(df2, usingColumns = Seq("key1", "key2", "time"))
x.rdd.getNumPartitions == 200
So it looks like Spark 2.4.0 isn't able to optimise the described case out of the box, and a Catalyst optimizer extension is needed, as suggested by #user10938362.
BTW, here is some info about writing Catalyst optimizer extensions: https://developer.ibm.com/code/2017/11/30/learn-extension-points-apache-spark-extend-spark-catalyst-optimizer/
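For reference, here is a minimal sketch of how such an extension is wired in; the class and rule names are illustrative, and the rule body is a no-op placeholder rather than a working optimisation:
import org.apache.spark.sql.{SparkSession, SparkSessionExtensions}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Hypothetical rule: the logic that recognises the compatible partitioning
// on [key1, key2] would still have to be written for this specific case.
object PreservePartitioningRule extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan // no-op placeholder
}

class MyExtensions extends (SparkSessionExtensions => Unit) {
  override def apply(ext: SparkSessionExtensions): Unit =
    ext.injectOptimizerRule(_ => PreservePartitioningRule)
}

// Registered either programmatically ...
val spark = SparkSession.builder().withExtensions(new MyExtensions).getOrCreate()
// ... or via --conf spark.sql.extensions=MyExtensions (fully qualified class name).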

The behaviour of the Catalyst Optimizer differs between pyspark and Scala (using Spark 2.4 at least).
I ran both and got two different plans.
Indeed, you get 200 partitions in pyspark unless you explicitly set:
spark.conf.set("spark.sql.shuffle.partitions", 3)
Then 3 partitions are processed, and thus 3 are retained under pyspark.
I was a little surprised, as I would have thought the behaviour would be the same under the hood. So people keep telling me; it just goes to show.
Physical Plan for pyspark with param set via conf:
== Physical Plan ==
*(5) Project [key1#344L, key2#345L, time#346L]
+- SortMergeJoin [key1#344L, key2#345L, time#346L], [key1#350L, key2#351L, time#352L], LeftOuter
:- *(2) Sort [key1#344L ASC NULLS FIRST, key2#345L ASC NULLS FIRST, time#346L ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(key1#344L, key2#345L, time#346L, 3)
: +- *(1) Scan ExistingRDD[key1#344L,key2#345L,time#346L]
+- *(4) Sort [key1#350L ASC NULLS FIRST, key2#351L ASC NULLS FIRST, time#352L ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(key1#350L, key2#351L, time#352L, 3)
+- *(3) Filter ((isnotnull(key1#350L) && isnotnull(key2#351L)) && isnotnull(time#352L))
+- *(3) Scan ExistingRDD[key1#350L,key2#351L,time#352L]

Related

Spark streaming with Cassandra direct join doesn't work

Hi guys! I'm trying to develop a Spark streaming app but have some problems.
Some details:
We have a Kafka topic, Spark 3.2.1, and Cassandra 4.0.4 with the DataStax connector com.datastax.spark:spark-cassandra-connector_2.12:3.1.0.
I need the following data flow:
get a Kafka message and transform it to a DataFrame in Spark -> left join with the existing Cassandra table on two columns, which form the composite primary key in the Cassandra table -> if a row with those keys already exists, do nothing; otherwise, write the data.
The documentation mentions a feature available since SCC 2.5 in the DataFrame API (and not only for DSE): a direct join, the equivalent of joinWithCassandraTable in the RDD API. But if I use the Datasource V2 API, I get the usual SortMergeJoin on the Spark side. To be frank, it's not really a "streaming" app; I add data to Cassandra in a micro-batch way.
== Physical Plan ==
AppendData (12)
+- * Project (11)
+- * Filter (10)
+- * SortMergeJoin LeftOuter (9)
:- * Sort (4)
: +- Exchange (3)
: +- * SerializeFromObject (2)
: +- Scan (1)
+- * Sort (8)
+- Exchange (7)
+- * Project (6)
+- BatchScan (5)
(1) Scan
Output [1]: [obj#342]
Arguments: obj#342: org.apache.spark.sql.Row, MapPartitionsRDD[82] at start at RunnableStream.scala:13
(2) SerializeFromObject [codegen id : 1]
Input [1]: [obj#342]
Arguments: [validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 0, user_id), LongType) AS user_id#343L, if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 1, user_type), StringType), true, false, true) AS user_type#344, if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 2, order_id), StringType), true, false, true) AS order_id#345, if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 3, status_name), StringType), true, false, true) AS status_name#346, if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.sql.catalyst.util.DateTimeUtils$, TimestampType, fromJavaTimestamp, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 4, status_dttm), TimestampType), true, false, true) AS status_dttm#347]
(3) Exchange
Input [5]: [user_id#343L, user_type#344, order_id#345, status_name#346, status_dttm#347]
Arguments: hashpartitioning(user_id#343L, user_type#344, 16), ENSURE_REQUIREMENTS, [id=#793]
(4) Sort [codegen id : 2]
Input [5]: [user_id#343L, user_type#344, order_id#345, status_name#346, status_dttm#347]
Arguments: [user_id#343L ASC NULLS FIRST, user_type#344 ASC NULLS FIRST], false, 0
(5) BatchScan
Output [2]: [user_id#348L, user_type#349]
Cassandra Scan: keyspace_name.table_name
- Cassandra Filters: []
- Requested Columns: [user_id,user_type]
(6) Project [codegen id : 3]
Output [2]: [user_id#348L, user_type#349]
Input [2]: [user_id#348L, user_type#349]
(7) Exchange
Input [2]: [user_id#348L, user_type#349]
Arguments: hashpartitioning(user_id#348L, user_type#349, 16), ENSURE_REQUIREMENTS, [id=#801]
(8) Sort [codegen id : 4]
Input [2]: [user_id#348L, user_type#349]
Arguments: [user_id#348L ASC NULLS FIRST, user_type#349 ASC NULLS FIRST], false, 0
(9) SortMergeJoin [codegen id : 5]
Left keys [2]: [user_id#343L, user_type#344]
Right keys [2]: [user_id#348L, user_type#349]
Join condition: None
(10) Filter [codegen id : 5]
Input [7]: [user_id#343L, user_type#344, order_id#345, status_name#346, status_dttm#347, user_id#348L, user_type#349]
Condition : (isnull(user_id#348L) = true)
(11) Project [codegen id : 5]
Output [5]: [user_id#343L, user_type#344, order_id#345, status_name#346, status_dttm#347]
Input [7]: [user_id#343L, user_type#344, order_id#345, status_name#346, status_dttm#347, user_id#348L, user_type#349]
(12) AppendData
Input [5]: [user_id#343L, user_type#344, order_id#345, status_name#346, status_dttm#347]
Arguments: org.apache.spark.sql.execution.datasources.v2.DataSourceV2Strategy$$Lambda$3358/1878168161#32616db8, org.apache.spark.sql.connector.write.WriteBuilder$1#1d354f3b
Alternatively, if I try to use Datasource V1 and explicitly set directJoinSetting when reading the Cassandra table as a DataFrame, like
spark.read.cassandraFormat("tableName", "keyspace").option("directJoinSetting", "on").load
the join fails with the following error:
Caused by: java.lang.NoSuchMethodError: org.apache.spark.sql.execution.UnaryExecNode.children$(Lorg/apache/spark/sql/execution/UnaryExecNode;)Lscala/collection/Seq;
at org.apache.spark.sql.cassandra.execution.CassandraDirectJoinExec.children(CassandraDirectJoinExec.scala:18)
at org.apache.spark.sql.cassandra.execution.CassandraDirectJoinStrategy$.hasCassandraChild(CassandraDirectJoinStrategy.scala:206)
at org.apache.spark.sql.cassandra.execution.CassandraDirectJoinStrategy$$anonfun$1.applyOrElse(CassandraDirectJoinStrategy.scala:241)
at org.apache.spark.sql.cassandra.execution.CassandraDirectJoinStrategy$$anonfun$1.applyOrElse(CassandraDirectJoinStrategy.scala:240)
Full spark-submit command:
/opt/spark-3.2.1-bin-hadoop3.2/bin/spark-submit --master yarn --deploy-mode cluster --name "name" \
--conf spark.driver.cores=1 \
--conf spark.driver.memory=1g \
--conf spark.driver.extraJavaOptions="-XX:+UseG1GC -Duser.timezone=GMT -Dfile.encoding=utf-8 -Dlog4j.configuration=name_Log4j.properties" \
--conf spark.executor.instances=1 \
--conf spark.executor.cores=4 \
--conf spark.executor.memory=8g \
--conf spark.executor.extraJavaOptions="-XX:+UseG1GC -Duser.timezone=GMT -Dfile.encoding=utf-8 -Dlog4j.configuration=name_Log4j.properties" \
--conf spark.yarn.queue=default \
--conf spark.yarn.submit.waitAppCompletion=true \
--conf spark.eventLog.enabled=true \
--conf spark.eventLog.dir=hdfs:///spark3-history/ \
--conf spark.eventLog.compress=true \
--conf spark.sql.shuffle.partitions=16 \
--conf spark.sql.extensions=com.datastax.spark.connector.CassandraSparkExtensions \
--conf spark.sql.catalog.cassandracatalog=com.datastax.spark.connector.datasource.CassandraCatalog \
--conf spark.sql.dse.search.enableOptimization=on \
--conf spark.cassandra.connection.host=cassandra_host \
--conf spark.cassandra.auth.username=user_name \
--conf spark.cassandra.auth.password=*** \
--conf spark.sql.directJoinSetting=on \
--class ...
The connector class for Cassandra:
import org.apache.spark.sql._

class CassandraConnector(
    val ss: SparkSession,
    catalog: String,
    keyspace: String,
    table: String
) extends Serializable {
  def read: DataFrame = ss.read.table(s"$catalog.$keyspace.$table")
  def writeDirect(dataFrame: DataFrame): Unit = dataFrame.writeTo(s"$catalog.$keyspace.$table").append()
}
Cassandra DDL for the table:
CREATE KEYSPACE IF NOT EXISTS keyspace_name
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
CREATE TABLE IF NOT EXISTS keyspace_name.table_name
(
user_id BIGINT,
user_type VARCHAR,
order_id VARCHAR,
status_name VARCHAR,
status_dttm timestamp,
PRIMARY KEY (user_id, user_type)
);
The method that performs the join and writes to Cassandra:
override def writeBatch(batch: Dataset[Row], batchId: Long): Unit = {
  val result = batch
    .as("df")
    .join(
      cassandraConnector.read.as("cass"),
      col("df.user_id") === col("cass.user_id")
        && col("df.user_type") === col("cass.user_type"),
      "left"
    )
    .withColumn("need_write", when(col("cass.user_id").isNull, true).otherwise(false))
    .filter(col("need_write") === true)
    .select("df.user_id", "df.user_type", "df.order_id", "df.status_name", "df.status_dttm")

  cassandraConnector.writeDirect(result)
}
Can someone explain what I'm doing wrong, please?
Yes, the version of the Spark Cassandra Connector is the source of the problem. Advanced functionality such as the direct join depends heavily on Spark internal classes that may change between versions. So if you use Spark 3.2, then you need to use the corresponding version of the SCC: com.datastax.spark:spark-cassandra-connector_2.12:3.2.0.
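For example, if the connector is supplied through spark-submit's --packages option (a sketch; the build in the question may instead bundle the connector into the application jar), only the coordinates need to change:
--packages com.datastax.spark:spark-cassandra-connector_2.12:3.2.0 \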
Please note that there is no version for Spark 3.3 yet...
P.S. I have a blog post about using direct joins - it could be interesting for you.

Spark Data frame Join: Non matching Records from first Dataframe

Hi all, I have 2 dataframes and I'm applying a join condition on them.
After the join, I want all the data from the first dataframe whose name, id, code, and lastname do not match the second dataframe. I have written the code below.
val df3 = df1.join(df2, df1("name") !== df2("name_2") &&
    df1("id") !== df2("id_2") &&
    df1("code") !== df2("code_2") &&
    df1("lastname") !== df2("lastname_2"), "inner")
  .drop(df2("id_2"))
  .drop(df2("name_2"))
  .drop(df2("code_2"))
  .drop(df2("lastname_2"))
Expected result:
DF1
id,name,code,lastname
1,A,001,p1
2,B,002,p2
3,C,003,p3
DF2
id_2,name_2,code_2,lastname_2
1,A,001,p1
2,B,002,p4
4,D,004,p4
DF3
id,name,code,lastname
3,C,003,p3
Can someone please help me: is this the correct way to do this, or should I use a SQL inner query with 'not in'? I am new to Spark and am using dataframe methods for the first time, so I am not sure whether this is the correct way or not.
I recommend using the Spark API to work with the data:
val df1 = Seq((1, "20181231"), (2, "20190102"), (3, "20190103"), (4, "20190104"), (5, "20190105"))
  .toDF("id", "date")
val df2 = Seq((1, "20181231"), (2, "20190102"), (4, "20190104"), (5, "20190105"))
  .toDF("id", "date")
Option 1. You can get all rows that are not included in the other dataframe:
val df3 = df1.except(df2)
Option 2. You can use specific fields to do an anti join, for example 'id':
val df3 = df1.as("table1").join(df2.as("table2"), $"table1.id" === $"table2.id", "leftanti")
df3.show()
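Applied to the schema from the original question (a sketch that assumes the df1 and df2 shown above, with the _2 suffix on the second dataframe's columns), Option 2 becomes:
import org.apache.spark.sql.functions.col

// Keep only the rows of df1 whose id has no match in df2
// (a left anti join returns the left-hand columns only).
val df3 = df1.join(df2, col("id") === col("id_2"), "leftanti")
df3.show()  // expected: 3, C, 003, p3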

Parquet filter pushdown is not working with Spark Dataset API [duplicate]

This question already has answers here:
Why is predicate pushdown not used in typed Dataset API (vs untyped DataFrame API)?
Spark 2.0 Dataset vs DataFrame
Here is the sample code which I am running.
First, I create a test Parquet dataset with the mod column as the partition column.
scala> val test = spark.range(0 , 100000000).withColumn("mod", $"id".mod(40))
test: org.apache.spark.sql.DataFrame = [id: bigint, mod: bigint]
scala> test.write.partitionBy("mod").mode("overwrite").parquet("test_pushdown_filter")
After that, I read this data as a dataframe and apply a filter on the partition column, i.e. mod.
scala> val df = spark.read.parquet("test_pushdown_filter").filter("mod = 5")
df: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: bigint, mod: int]
scala> df.queryExecution.executedPlan
res1: org.apache.spark.sql.execution.SparkPlan =
*FileScan parquet [id#16L,mod#17] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/C:/Users/kprajapa/WorkSpace/places/test_pushdown_filter], PartitionCount: 1, PartitionFilters: [
isnotnull(mod#17), (mod#17 = 5)], PushedFilters: [], ReadSchema: struct<id:bigint>
You can see in the execution plan that it reads only 1 partition.
But if you apply the same filter with the Dataset API, it reads all the partitions and then applies the filter.
scala> case class Test(id: Long, mod: Long)
defined class Test
scala> val ds = spark.read.parquet("test_pushdown_filter").as[Test].filter(_.mod==5)
ds: org.apache.spark.sql.Dataset[Test] = [id: bigint, mod: int]
scala> ds.queryExecution.executedPlan
res2: org.apache.spark.sql.execution.SparkPlan =
*Filter <function1>.apply
+- *FileScan parquet [id#22L,mod#23] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/C:/Users/kprajapa/WorkSpace/places/test_pushdown_filter], PartitionCount: 40, PartitionFilter
s: [], PushedFilters: [], ReadSchema: struct<id:bigint>
Is this how the Dataset API works, or am I missing something?
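For comparison, one workaround (a sketch, not part of the original question) is to keep the filter as an untyped Column expression before converting to the typed Dataset, so Catalyst can still see it and prune partitions; in spark-shell, spark.implicits._ is already in scope:
case class Test(id: Long, mod: Long)

// Column-expression filter first (visible to Catalyst), then switch to the typed API.
val ds = spark.read.parquet("test_pushdown_filter")
  .filter($"mod" === 5)
  .as[Test]

ds.queryExecution.executedPlan  // should now show PartitionFilters: [..., (mod = 5)]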

Find median in spark SQL for multiple double datatype columns

I have a requirement to find the median for multiple double-datatype columns. Suggestions on the correct approach would be appreciated.
Below is my sample dataset with one column. I expect the median value returned to be 1 for my sample.
scala> sqlContext.sql("select num from test").show();
+---+
|num|
+---+
|0.0|
|0.0|
|1.0|
|1.0|
|1.0|
|1.0|
+---+
I tried the following options:
1) Hive UDAF percentile: it works only for BigInt.
2) Hive UDAF percentile_approx: it does not work as expected (it returns 0.25 instead of 1).
sqlContext.sql("select percentile_approx(num,0.5) from test").show();
+----+
| _c0|
+----+
|0.25|
+----+
3) Spark window function percent_rank: the way I see to find the median is to look for all percent_rank values above 0.5 and pick the num value corresponding to the maximum percent_rank. But this does not work in all cases, especially with an even record count; in that case the median is the average of the two middle values in the sorted distribution.
Also, with percent_rank, since I have to find the median for multiple columns, I have to calculate it in separate dataframes, which seems to me a somewhat complex method. Please correct me if my understanding is not right.
+---+------------+
|num|percent_rank|
+---+------------+
|0.0|         0.0|
|0.0|         0.0|
|1.0|         0.4|
|1.0|         0.4|
|1.0|         0.4|
|1.0|         0.4|
+---+------------+
Which version of Apache Spark are you using, out of curiosity? There were some fixes in Apache Spark 2.0+ that included changes to approxQuantile.
If I were to run the pySpark code snippet below:
rdd = sc.parallelize([[1, 0.0], [1, 0.0], [1, 1.0], [1, 1.0], [1, 1.0], [1, 1.0]])
df = rdd.toDF(['id', 'num'])
df.createOrReplaceTempView("df")
with the median calculation using approxQuantile as:
df.approxQuantile("num", [0.5], 0.25)
or
spark.sql("select percentile_approx(num, 0.5) from df").show()
the results are:
Spark 2.0.0: 0.25
Spark 2.0.1: 1.0
Spark 2.1.0: 1.0
Note that these are approximate numbers (via approxQuantile), though in general this should work well. If you need the exact median, one approach is to use numpy.median. The code snippet below is updated for this df example, based on gench's SO response to How to find the median in Apache Spark with Python Dataframe API?:
from pyspark.sql.types import *
import pyspark.sql.functions as F
import numpy as np
def find_median(values):
    try:
        median = np.median(values)  # get the median of the values in each row's list
        return round(float(median), 2)
    except Exception:
        return None  # if there is anything wrong with the given values

median_finder = F.udf(find_median, FloatType())

df2 = df.groupBy("id").agg(F.collect_list("num").alias("nums"))
df2 = df2.withColumn("median", median_finder("nums"))
# print out
df2.show()
with the output of:
+---+--------------------+------+
| id| nums|median|
+---+--------------------+------+
| 1|[0.0, 0.0, 1.0, 1...| 1.0|
+---+--------------------+------+
Updated: Spark 1.6 Scala version using RDDs
If you are using Spark 1.6, you can calculate the median using Scala code via Eugene Zhulenev's response to How can I calculate the exact median with Apache Spark. Below is the modified code that works with our example.
import org.apache.spark.SparkContext._
val rdd: RDD[Double] = sc.parallelize(Seq((0.0), (0.0), (1.0), (1.0), (1.0), (1.0)))
val sorted = rdd.sortBy(identity).zipWithIndex().map {
case (v, idx) => (idx, v)
}
val count = sorted.count()
val median: Double = if (count % 2 == 0) {
val l = count / 2 - 1
val r = l + 1
(sorted.lookup(l).head + sorted.lookup(r).head).toDouble / 2
} else sorted.lookup(count / 2).head.toDouble
with the output of:
// output
import org.apache.spark.SparkContext._
rdd: org.apache.spark.rdd.RDD[Double] = ParallelCollectionRDD[227] at parallelize at <console>:34
sorted: org.apache.spark.rdd.RDD[(Long, Double)] = MapPartitionsRDD[234] at map at <console>:36
count: Long = 6
median: Double = 1.0
Note, this is calculating the exact median using RDDs - i.e. you will need to convert the DataFrame column into an RDD to perform this calculation.
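For example, a sketch of that conversion, assuming a DataFrame df with a double column num as in the pySpark example above:
import org.apache.spark.rdd.RDD

// Pull the single column out of the DataFrame as an RDD[Double]
// before running the RDD-based exact-median code above.
val rdd: RDD[Double] = df.select("num").rdd.map(_.getDouble(0))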

Does repartition not shuffle data to all nodes?

Consider the following simple example, run the Spark Shell connected to a Cluster with 4 Executors:
scala> val rdd = sc.parallelize(Seq(1, 2, 3, 4, 5, 6), 4).cache.setName("rdd")
rdd: org.apache.spark.rdd.RDD[Int] = rdd ParallelCollectionRDD[0] at parallelize at <console>:27
scala> rdd.count()
res0: Long = 6
scala> val singlePartition = rdd.repartition(1).cache.setName("singlePartition")
singlePartition: org.apache.spark.rdd.RDD[Int] = singlePartition MapPartitionsRDD[4] at repartition at <console>:29
scala> singlePartition.count()
res1: Long = 6
scala> val multiplePartitions = singlePartition.repartition(6).cache.setName("multiplePartitions")
multiplePartitions: org.apache.spark.rdd.RDD[Int] = multiplePartitions MapPartitionsRDD[8] at repartition at <console>:31
scala> multiplePartitions.count()
res2: Long = 6
The original rdd has 4 partitions, which, when I check in the UI, are distributed across the 4 Executors. The singlePartition RDD is obviously contained on only one Executor. When the multiplePartitions RDD is created by repartitioning the singlePartition RDD, I would expect that to shuffle the data across the 4 Executors. What I see instead is that there are 6 partitions for multiplePartitions, but they are all on one Executor, the same one where singlePartition has its partition.
Shouldn't the data be shuffled across the 4 Executors by the repartition?
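One way to see where the rows actually end up, beyond the Spark UI, is to tag each element with its partition index and the host it was processed on (a sketch, not from the original post):
import java.net.InetAddress

val placement = multiplePartitions.mapPartitionsWithIndex { (idx, iter) =>
  val host = InetAddress.getLocalHost.getHostName
  iter.map(v => (idx, host, v))
}.collect()

placement.foreach(println)  // e.g. (0,worker-1,3), (1,worker-1,5), ...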
