Hi, everyone! I'm trying to develop a Spark streaming app but have run into some problems.
Some details:
We have a Kafka topic, Spark 3.2.1, and Cassandra 4.0.4 with the DataStax Spark Cassandra Connector, version com.datastax.spark:spark-cassandra-connector_2.12:3.1.0.
I need the following data flow:
Get a Kafka message and transform it to a DataFrame in Spark -> left join with an existing Cassandra table on two columns, which form the composite primary key of the Cassandra table -> if a row with those keys already exists, do nothing; otherwise, write the data.
The documentation describes a feature available since SCC 2.5 in the DataFrame API (no longer limited to DSE): DirectJoin, which is the equivalent of joinWithCassandraTable in the RDD API. But when I try to use the Data Source V2 API, I get the usual SortMergeJoin on the Spark side. To be frank, it's not really a "streaming" app; I add data to Cassandra in micro-batches.
== Physical Plan ==
AppendData (12)
+- * Project (11)
+- * Filter (10)
+- * SortMergeJoin LeftOuter (9)
:- * Sort (4)
: +- Exchange (3)
: +- * SerializeFromObject (2)
: +- Scan (1)
+- * Sort (8)
+- Exchange (7)
+- * Project (6)
+- BatchScan (5)
(1) Scan
Output [1]: [obj#342]
Arguments: obj#342: org.apache.spark.sql.Row, MapPartitionsRDD[82] at start at RunnableStream.scala:13
(2) SerializeFromObject [codegen id : 1]
Input [1]: [obj#342]
Arguments: [validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 0, user_id), LongType) AS user_id#343L, if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 1, user_type), StringType), true, false, true) AS user_type#344, if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 2, order_id), StringType), true, false, true) AS order_id#345, if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 3, status_name), StringType), true, false, true) AS status_name#346, if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.sql.catalyst.util.DateTimeUtils$, TimestampType, fromJavaTimestamp, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 4, status_dttm), TimestampType), true, false, true) AS status_dttm#347]
(3) Exchange
Input [5]: [user_id#343L, user_type#344, order_id#345, status_name#346, status_dttm#347]
Arguments: hashpartitioning(user_id#343L, user_type#344, 16), ENSURE_REQUIREMENTS, [id=#793]
(4) Sort [codegen id : 2]
Input [5]: [user_id#343L, user_type#344, order_id#345, status_name#346, status_dttm#347]
Arguments: [user_id#343L ASC NULLS FIRST, user_type#344 ASC NULLS FIRST], false, 0
(5) BatchScan
Output [2]: [user_id#348L, user_type#349]
Cassandra Scan: keyspace_name.table_name
- Cassandra Filters: []
- Requested Columns: [user_id,user_type]
(6) Project [codegen id : 3]
Output [2]: [user_id#348L, user_type#349]
Input [2]: [user_id#348L, user_type#349]
(7) Exchange
Input [2]: [user_id#348L, user_type#349]
Arguments: hashpartitioning(user_id#348L, user_type#349, 16), ENSURE_REQUIREMENTS, [id=#801]
(8) Sort [codegen id : 4]
Input [2]: [user_id#348L, user_type#349]
Arguments: [user_id#348L ASC NULLS FIRST, user_type#349 ASC NULLS FIRST], false, 0
(9) SortMergeJoin [codegen id : 5]
Left keys [2]: [user_id#343L, user_type#344]
Right keys [2]: [user_id#348L, user_type#349]
Join condition: None
(10) Filter [codegen id : 5]
Input [7]: [user_id#343L, user_type#344, order_id#345, status_name#346, status_dttm#347, user_id#348L, user_type#349]
Condition : (isnull(user_id#348L) = true)
(11) Project [codegen id : 5]
Output [5]: [user_id#343L, user_type#344, order_id#345, status_name#346, status_dttm#347]
Input [7]: [user_id#343L, user_type#344, order_id#345, status_name#346, status_dttm#347, user_id#348L, user_type#349]
(12) AppendData
Input [5]: [user_id#343L, user_type#344, order_id#345, status_name#346, status_dttm#347]
Arguments: org.apache.spark.sql.execution.datasources.v2.DataSourceV2Strategy$$Lambda$3358/1878168161#32616db8, org.apache.spark.sql.connector.write.WriteBuilder$1#1d354f3b
On the other hand, if I try to use Data Source V1 and explicitly set directJoinSetting when reading the Cassandra table as a DataFrame, like
spark.read.cassandraFormat("tableName", "keyspace").option("directJoinSetting", "on").load
the join fails with this error:
Caused by: java.lang.NoSuchMethodError: org.apache.spark.sql.execution.UnaryExecNode.children$(Lorg/apache/spark/sql/execution/UnaryExecNode;)Lscala/collection/Seq;
at org.apache.spark.sql.cassandra.execution.CassandraDirectJoinExec.children(CassandraDirectJoinExec.scala:18)
at org.apache.spark.sql.cassandra.execution.CassandraDirectJoinStrategy$.hasCassandraChild(CassandraDirectJoinStrategy.scala:206)
at org.apache.spark.sql.cassandra.execution.CassandraDirectJoinStrategy$$anonfun$1.applyOrElse(CassandraDirectJoinStrategy.scala:241)
at org.apache.spark.sql.cassandra.execution.CassandraDirectJoinStrategy$$anonfun$1.applyOrElse(CassandraDirectJoinStrategy.scala:240)
Full spark-submit command:
/opt/spark-3.2.1-bin-hadoop3.2/bin/spark-submit --master yarn --deploy-mode cluster --name "name" \
--conf spark.driver.cores=1 \
--conf spark.driver.memory=1g \
--conf spark.driver.extraJavaOptions="-XX:+UseG1GC -Duser.timezone=GMT -Dfile.encoding=utf-8 -Dlog4j.configuration=name_Log4j.properties" \
--conf spark.executor.instances=1 \
--conf spark.executor.cores=4 \
--conf spark.executor.memory=8g \
--conf spark.executor.extraJavaOptions="-XX:+UseG1GC -Duser.timezone=GMT -Dfile.encoding=utf-8 -Dlog4j.configuration=name_Log4j.properties" \
--conf spark.yarn.queue=default \
--conf spark.yarn.submit.waitAppCompletion=true \
--conf spark.eventLog.enabled=true \
--conf spark.eventLog.dir=hdfs:///spark3-history/ \
--conf spark.eventLog.compress=true \
--conf spark.sql.shuffle.partitions=16 \
--conf spark.sql.extensions=com.datastax.spark.connector.CassandraSparkExtensions \
--conf spark.sql.catalog.cassandracatalog=com.datastax.spark.connector.datasource.CassandraCatalog \
--conf spark.sql.dse.search.enableOptimization=on \
--conf spark.cassandra.connection.host=cassandra_host \
--conf spark.cassandra.auth.username=user_name \
--conf spark.cassandra.auth.password=*** \
--conf spark.sql.directJoinSetting=on \
--class ...
The connector class for Cassandra:
import org.apache.spark.sql._

class CassandraConnector(
    val ss: SparkSession,
    catalog: String,
    keyspace: String,
    table: String
) extends Serializable {
  // reads the table through the Spark catalog (Data Source V2)
  def read: DataFrame = ss.read.table(s"$catalog.$keyspace.$table")
  // appends rows with the DataFrameWriterV2 API
  def writeDirect(dataFrame: DataFrame): Unit = dataFrame.writeTo(s"$catalog.$keyspace.$table").append()
}
Cassandra DDL for the table:
CREATE KEYSPACE IF NOT EXISTS keyspace_name
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
CREATE TABLE IF NOT EXISTS keyspace_name.table_name
(
user_id BIGINT,
user_type VARCHAR,
order_id VARCHAR,
status_name VARCHAR,
status_dttm timestamp,
PRIMARY KEY (user_id, user_type)
);
The method that performs the join and writes to Cassandra:
override def writeBatch(batch: Dataset[Row], batchId: Long): Unit = {
  val result =
    batch
      .as("df")
      .join(
        cassandraConnector.read.as("cass"),
        col("df.user_id") === col("cass.user_id")
          && col("df.user_type") === col("cass.user_type"),
        "left"
      )
      // keep only the rows whose keys are not already present in Cassandra
      .withColumn("need_write", when(col("cass.user_id").isNull, true).otherwise(false))
      .filter(col("need_write") === true)
      .select("df.user_id", "df.user_type", "df.order_id", "df.status_name", "df.status_dttm")

  cassandraConnector.writeDirect(result)
}
Can someone explain what I'm doing wrong, please?
Yes, the version of the Spark Cassandra Connector is the source of the problem: advanced functionality such as Direct Join depends heavily on Spark internal classes that may change between versions. So if you use Spark 3.2, you need to use the corresponding version of the SCC: com.datastax.spark:spark-cassandra-connector_2.12:3.2.0.
Please note that there is no SCC version for Spark 3.3 yet...
P.S. I have a blog post about using direct joins that may be interesting for you.
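As an illustration only, here is a minimal sketch of the dependency change, shown as an sbt coordinate (adjust it to your build tool, or pass the same artifact via --packages on spark-submit):
// build.sbt: match the connector version to the Spark minor version (3.2.x here)
libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "3.2.0"
// spark-submit equivalent: --packages com.datastax.spark:spark-cassandra-connector_2.12:3.2.0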
Related
I am using an Ubuntu 20.04.4 server with 2x NVIDIA A100 GPUs. Spark (3.3.0) works fine normally, but when I try to use the GPUs through RAPIDS, it just keeps waiting without loading data. I tried loading the data as CSV and as parquet files, but it fails. The current way in which I am invoking the GPU is shown below, though I have tried many combinations I could find on the internet. I also used spark-submit to submit jobs, which caused the same problems shown below. I would be grateful for any help in fixing the errors.
$ nvidia-smi
Mon Aug 8 17:00:05 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.29.05 Driver Version: 495.29.05 CUDA Version: 11.5 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-PCI... Off | 00000000:25:00.0 Off | 0 |
| N/A 26C P0 35W / 250W | 0MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-PCI... Off | 00000000:E1:00.0 Off | 0 |
| N/A 24C P0 35W / 250W | 0MiB / 40536MiB | 33% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
The errors I get are as follows:
$ echo $SPARK_RAPIDS_PLUGIN_JAR
/home/softy/soft/rapids-4-spark/rapids-4-spark_2.12-22.06.0-cuda11.jar
(base) softy@genome:~/spark/jclust-3.3.0-gpu$ spark-shell \
> --conf spark.executor.resource.gpu.amount=1 \
> --conf spark.task.resource.gpu.amount=1 \
> --conf spark.executor.resource.gpu.discoveryScript=/home/softy/soft/spark-3.3.0-scala2.12/examples/src/main/scripts/getGpusResources.sh \
> --num-executors 1 \
> --conf spark.executor.cores=10 \
> --conf spark.rapids.sql.concurrentGpuTasks=1 \
> --conf spark.sql.files.maxPartitionBytes=512m \
> --conf spark.sql.shuffle.partitions=10 \
> --conf spark.rapids.sql.explain=ALL \
> --driver-memory=200g \
> --conf spark.local.dir=/tmp \
> --conf spark.rpc.message.maxSize=2047 \
> --conf spark.plugins=com.nvidia.spark.SQLPlugin \
> --jars ${SPARK_RAPIDS_PLUGIN_JAR}
22/08/08 17:27:19 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/08/08 17:27:22 WARN SparkConf: Note that spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone/kubernetes and LOCAL_DIRS in YARN).
22/08/08 17:27:22 WARN ResourceUtils: The configuration of cores (exec = 10 task = 1, runnable tasks = 10) will result in wasted resources due to resource gpu limiting the number of runnable tasks per executor to: 1. Please adjust your configuration.
22/08/08 17:27:23 WARN RapidsPluginUtils: RAPIDS Accelerator 22.06.0 using cudf 22.06.0.
22/08/08 17:27:23 WARN RapidsPluginUtils: RAPIDS Accelerator is enabled, to disable GPU support set `spark.rapids.sql.enabled` to false.
22/08/08 17:27:23 WARN ResourceUtils: The configuration of cores (exec = 256 task = 1, runnable tasks = 256) will result in wasted resources due to resource gpu limiting the number of runnable tasks per executor to: 1. Please adjust your configuration.
22/08/08 17:27:30 WARN RapidsConf: CUDA runtime/driver does not support the ASYNC allocator, falling back to ARENA
Spark context Web UI available at http://genome:4040
Spark context available as 'sc' (master = local[*], app id = local-1659959843286).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 3.3.0
/_/
Using Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java 17.0.3-internal)
Type in expressions to have them evaluated.
Type :help for more information.
scala> spark.read.format("csv").option("delimiter", "\t").option("header", "true").csv("t.csv").show(4)
22/08/08 17:27:36 WARN GpuOverrides:
!Exec <CollectLimitExec> cannot run on GPU because the Exec CollectLimitExec has been disabled, and is disabled by default because Collect Limit replacement can be slower on the GPU, if huge number of rows in a batch it could help by limiting the number of rows transferred from GPU to CPU. Set spark.rapids.sql.exec.CollectLimitExec to true if you wish to enable it
#Partitioning <SinglePartition$> could run on GPU
*Exec <FilterExec> will run on GPU
*Expression <GreaterThan> (length(trim(value#0, None)) > 0) will run on GPU
*Expression <Length> length(trim(value#0, None)) will run on GPU
*Expression <StringTrim> trim(value#0, None) will run on GPU
!Exec <FileSourceScanExec> cannot run on GPU because unsupported file format: org.apache.spark.sql.execution.datasources.text.TextFileFormat
22/08/08 17:27:39 WARN Signaling: Cancelling all active jobs, this can take a while. Press Ctrl+C again to exit now.
org.apache.spark.SparkException: Job 0 cancelled as part of cancellation of all jobs
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2672)
at org.apache.spark.scheduler.DAGScheduler.handleJobCancellation(DAGScheduler.scala:2554)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$doCancelAllJobs$2(DAGScheduler.scala:1067)
at scala.runtime.java8.JFunction1$mcVI$sp.apply(JFunction1$mcVI$sp.java:23)
at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
at org.apache.spark.scheduler.DAGScheduler.doCancelAllJobs(DAGScheduler.scala:1066)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2825)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2802)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2791)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:952)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2228)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2249)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2268)
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:506)
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:459)
at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:48)
at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3868)
at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2863)
at org.apache.spark.sql.Dataset.$anonfun$withAction$2(Dataset.scala:3858)
at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:510)
at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3856)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:109)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:169)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3856)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2863)
at org.apache.spark.sql.Dataset.take(Dataset.scala:3084)
at org.apache.spark.sql.execution.datasources.csv.TextInputCSVDataSource$.infer(CSVDataSource.scala:112)
at org.apache.spark.sql.execution.datasources.csv.CSVDataSource.inferSchema(CSVDataSource.scala:65)
at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:62)
at org.apache.spark.sql.execution.datasources.DataSource.$anonfun$getOrInferFileFormatSchema$11(DataSource.scala:210)
at scala.Option.orElse(Option.scala:447)
at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:207)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:411)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:228)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:210)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:210)
at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:537)
at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:443)
... 47 elided
scala>
scala> val columns = Seq("Name", "X1", "X2", "X3", "X4")
columns: Seq[String] = List(Name, X1, X2, X3, X4)
scala> val data = Seq(("id1", "1", "2", "3", "4"),("id2", "2", "2", "1", "8"),("id3", "1", "2", "5", "8"))
data: Seq[(String, String, String, String, String)] = List((id1,1,2,3,4), (id2,2,2,1,8), (id3,1,2,5,8))
scala> val rdd = spark.sparkContext.parallelize(data)
rdd: org.apache.spark.rdd.RDD[(String, String, String, String, String)] = ParallelCollectionRDD[6] at parallelize at <console>:23
scala> spark.createDataFrame(rdd).toDF(columns:_*).show()
22/08/08 17:28:04 WARN GpuOverrides:
!Exec <CollectLimitExec> cannot run on GPU because the Exec CollectLimitExec has been disabled, and is disabled by default because Collect Limit replacement can be slower on the GPU, if huge number of rows in a batch it could help by limiting the number of rows transferred from GPU to CPU. Set spark.rapids.sql.exec.CollectLimitExec to true if you wish to enable it
#Partitioning <SinglePartition$> could run on GPU
*Exec <ProjectExec> will run on GPU
*Expression <Alias> _1#21 AS Name#46 will run on GPU
*Expression <Alias> _2#22 AS X1#47 will run on GPU
*Expression <Alias> _3#23 AS X2#48 will run on GPU
*Expression <Alias> _4#24 AS X3#49 will run on GPU
*Expression <Alias> _5#25 AS X4#50 will run on GPU
! <SerializeFromObjectExec> cannot run on GPU because not all expressions can be replaced; GPU does not currently support the operator class org.apache.spark.sql.execution.SerializeFromObjectExec
#Expression <Alias> staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(assertnotnull(input[0, scala.Tuple5, true]))._1, true, false, true) AS _1#21 could run on GPU
! <StaticInvoke> staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(assertnotnull(input[0, scala.Tuple5, true]))._1, true, false, true) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.objects.StaticInvoke
! <Invoke> knownnotnull(assertnotnull(input[0, scala.Tuple5, true]))._1 cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.objects.Invoke
!Expression <KnownNotNull> knownnotnull(assertnotnull(input[0, scala.Tuple5, true])) cannot run on GPU because input expression AssertNotNull assertnotnull(input[0, scala.Tuple5, true]) (ObjectType(class scala.Tuple5) is not supported); expression KnownNotNull knownnotnull(assertnotnull(input[0, scala.Tuple5, true])) produces an unsupported type ObjectType(class scala.Tuple5)
! <AssertNotNull> assertnotnull(input[0, scala.Tuple5, true]) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.objects.AssertNotNull
! <BoundReference> input[0, scala.Tuple5, true] cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.BoundReference
#Expression <Alias> staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(assertnotnull(input[0, scala.Tuple5, true]))._2, true, false, true) AS _2#22 could run on GPU
! <StaticInvoke> staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(assertnotnull(input[0, scala.Tuple5, true]))._2, true, false, true) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.objects.StaticInvoke
! <Invoke> knownnotnull(assertnotnull(input[0, scala.Tuple5, true]))._2 cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.objects.Invoke
!Expression <KnownNotNull> knownnotnull(assertnotnull(input[0, scala.Tuple5, true])) cannot run on GPU because input expression AssertNotNull assertnotnull(input[0, scala.Tuple5, true]) (ObjectType(class scala.Tuple5) is not supported); expression KnownNotNull knownnotnull(assertnotnull(input[0, scala.Tuple5, true])) produces an unsupported type ObjectType(class scala.Tuple5)
! <AssertNotNull> assertnotnull(input[0, scala.Tuple5, true]) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.objects.AssertNotNull
! <BoundReference> input[0, scala.Tuple5, true] cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.BoundReference
#Expression <Alias> staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(assertnotnull(input[0, scala.Tuple5, true]))._3, true, false, true) AS _3#23 could run on GPU
! <StaticInvoke> staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(assertnotnull(input[0, scala.Tuple5, true]))._3, true, false, true) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.objects.StaticInvoke
! <Invoke> knownnotnull(assertnotnull(input[0, scala.Tuple5, true]))._3 cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.objects.Invoke
!Expression <KnownNotNull> knownnotnull(assertnotnull(input[0, scala.Tuple5, true])) cannot run on GPU because input expression AssertNotNull assertnotnull(input[0, scala.Tuple5, true]) (ObjectType(class scala.Tuple5) is not supported); expression KnownNotNull knownnotnull(assertnotnull(input[0, scala.Tuple5, true])) produces an unsupported type ObjectType(class scala.Tuple5)
! <AssertNotNull> assertnotnull(input[0, scala.Tuple5, true]) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.objects.AssertNotNull
! <BoundReference> input[0, scala.Tuple5, true] cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.BoundReference
#Expression <Alias> staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(assertnotnull(input[0, scala.Tuple5, true]))._4, true, false, true) AS _4#24 could run on GPU
! <StaticInvoke> staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(assertnotnull(input[0, scala.Tuple5, true]))._4, true, false, true) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.objects.StaticInvoke
! <Invoke> knownnotnull(assertnotnull(input[0, scala.Tuple5, true]))._4 cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.objects.Invoke
!Expression <KnownNotNull> knownnotnull(assertnotnull(input[0, scala.Tuple5, true])) cannot run on GPU because input expression AssertNotNull assertnotnull(input[0, scala.Tuple5, true]) (ObjectType(class scala.Tuple5) is not supported); expression KnownNotNull knownnotnull(assertnotnull(input[0, scala.Tuple5, true])) produces an unsupported type ObjectType(class scala.Tuple5)
! <AssertNotNull> assertnotnull(input[0, scala.Tuple5, true]) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.objects.AssertNotNull
! <BoundReference> input[0, scala.Tuple5, true] cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.BoundReference
#Expression <Alias> staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(assertnotnull(input[0, scala.Tuple5, true]))._5, true, false, true) AS _5#25 could run on GPU
! <StaticInvoke> staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(assertnotnull(input[0, scala.Tuple5, true]))._5, true, false, true) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.objects.StaticInvoke
! <Invoke> knownnotnull(assertnotnull(input[0, scala.Tuple5, true]))._5 cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.objects.Invoke
!Expression <KnownNotNull> knownnotnull(assertnotnull(input[0, scala.Tuple5, true])) cannot run on GPU because input expression AssertNotNull assertnotnull(input[0, scala.Tuple5, true]) (ObjectType(class scala.Tuple5) is not supported); expression KnownNotNull knownnotnull(assertnotnull(input[0, scala.Tuple5, true])) produces an unsupported type ObjectType(class scala.Tuple5)
! <AssertNotNull> assertnotnull(input[0, scala.Tuple5, true]) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.objects.AssertNotNull
! <BoundReference> input[0, scala.Tuple5, true] cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.BoundReference
! <ExternalRDDScanExec> cannot run on GPU because not all expressions can be replaced; GPU does not currently support the operator class org.apache.spark.sql.execution.ExternalRDDScanExec
!Expression <AttributeReference> obj#20 cannot run on GPU because expression AttributeReference obj#20 produces an unsupported type ObjectType(class scala.Tuple5)
[Stage 1:> (0 + 0) / 1]
Most of the time a hang means that Spark could not allocate all of the resources needed to fulfill the resource request. Here you are running in local mode local[*] which means that Spark is going to try and allocate a task per CPU thread on your computer. But you have launched Spark with
> --conf spark.executor.resource.gpu.amount=1 \
> --conf spark.task.resource.gpu.amount=1 \
This asks Spark to have 1 GPU per executor and 1 GPU per task. I am assuming that you have more than one core on your machine, so Spark is now stuck: it wants to allocate X tasks, which would need X GPUs to run, but there is only 1 GPU available. Spark could be a lot better about throwing errors in these deadlock/misconfiguration cases.
When running in local mode you can only use 1 GPU. So the simplest way to launch it is to just remove all of the resource requests.
~/spark/jclust-3.3.0-gpu$ spark-shell \
> --conf spark.rapids.sql.concurrentGpuTasks=1 \
> --conf spark.sql.files.maxPartitionBytes=512m \
> --conf spark.sql.shuffle.partitions=10 \
> --conf spark.rapids.sql.explain=ALL \
> --driver-memory=200g \
> --conf spark.local.dir=/tmp \
> --conf spark.rpc.message.maxSize=2047 \
> --conf spark.plugins=com.nvidia.spark.SQLPlugin \
> --jars ${SPARK_RAPIDS_PLUGIN_JAR}
That should let you run.
@Quiescent
I would suggest you consider creating a Spark cluster. For example, it could be a Spark standalone cluster, a Spark on YARN cluster, or even a Spark on K8s cluster.
Starting with a Spark standalone cluster is probably easiest in the beginning; you just need to start the Spark master and worker processes as daemon processes first.
Make sure the Spark Master UI shows the correct CPU, memory, and GPU resources.
Then submit the Spark job (whether via spark-shell, spark-sql, or spark-submit) to the Spark standalone cluster.
We have a Spark Structured Streaming application that consumes from a Kafka topic in Avro format. The payload is part of the state object in the mapGroupsWithState function. Given that we enforce FULL compatibility for our Avro schemas, we generally do not face problems when evolving our schemas. However, we have now evolved our schema by adding a nested object, and Kryo serialization is failing with the following, where xyz is the field that is the nested object within ObjectV1:
Caused by: com.esotericsoftware.kryo.KryoException: Unable to find class: 020-12-14T19:23:49
Serialization trace:
xyz (x.y.z.ObjectV1)
Logical Plan:
SerializeFromObject [encodeusingserializer(input[0, java.lang.Object, true], true) AS value#38]
+- MapPartitions <function1>, obj#37: scala.Tuple2
+- DeserializeToObject decodeusingserializer(cast(value#34 as binary), scala.Option, true), obj#36: scala.Option
+- SerializeFromObject [encodeusingserializer(input[0, java.lang.Object, true], true) AS value#34]
+- FlatMapGroupsWithState <function3>, newInstance(class scala.Tuple4), decodeusingserializer(cast(value#23 as binary), scala.Tuple2, true), [_1#29, _2#30, _3#31, _4#32L], [value#23], obj#33: scala.Option, class[value[0]: binary], Update, true, ProcessingTimeTimeout
+- AppendColumns <function1>, class scala.Tuple2, [StructField(value,BinaryType,true)], decodeusingserializer(cast(value#23 as binary), scala.Tuple2, true), [assertnotnull(assertnotnull(input[0, scala.Tuple4, true]))._1 AS _1#29, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(assertnotnull(input[0, scala.Tuple4, true]))._2, true, false) AS _2#30, assertnotnull(assertnotnull(input[0, scala.Tuple4, true]))._3 AS _3#31, assertnotnull(assertnotnull(input[0, scala.Tuple4, true]))._4 AS _4#32L]
+- SerializeFromObject [encodeusingserializer(input[0, java.lang.Object, true], true) AS value#23]
+- MapElements <function1>, interface org.apache.spark.sql.Row, [StructField(key,BinaryType,true), StructField(value,BinaryType,true), StructField(topic,StringType,true), StructField(partition,IntegerType,true), StructField(offset,LongType,true), StructField(timestamp,TimestampType,true), StructField(timestampType,IntegerType,true)], obj#22: scala.Tuple2
+- DeserializeToObject createexternalrow(key#7, value#8, topic#9.toString, partition#10, offset#11L, staticinvoke(class org.apache.spark.sql.catalyst.util.DateTimeUtils$, ObjectType(class java.sql.Timestamp), toJavaTimestamp, timestamp#12, true, false), timestampType#13, StructField(key,BinaryType,true), StructField(value,BinaryType,true), StructField(topic,StringType,true), StructField(partition,IntegerType,true), StructField(offset,LongType,true), StructField(timestamp,TimestampType,true), StructField(timestampType,IntegerType,true)), obj#21: org.apache.spark.sql.Row
+- StreamingExecutionRelation KafkaV2[Subscribe[ext_object_v1]], [key#7, value#8, topic#9, partition#10, offset#11L, timestamp#12, timestampType#13]
The Spark version is 2.4.5. Has anyone come across something similar? Deleting the checkpoint folder resolves this, but naturally we would like to avoid that.
Suppose I have two partitioned dataframes:
df1 = spark.createDataFrame(
[(x,x,x) for x in range(5)], ['key1', 'key2', 'time']
).repartition(3, 'key1', 'key2')
df2 = spark.createDataFrame(
[(x,x,x) for x in range(7)], ['key1', 'key2', 'time']
).repartition(3, 'key1', 'key2')
(Scenario 1) If I join them on [key1, key2], the join is performed within each partition without a shuffle (the number of partitions in the result dataframe stays the same):
x = df1.join(df2, on=['key1', 'key2'], how='left')
assert x.rdd.getNumPartitions() == 3
(Scenario 2) But if I join them on [key1, key2, time], a shuffle takes place (the number of partitions in the result dataframe becomes 200, which is driven by the spark.sql.shuffle.partitions option):
x = df1.join(df2, on=['key1', 'key2', 'time'], how='left')
assert x.rdd.getNumPartitions() == 200
At the same time, groupBy and window operations on [key1, key2, time] preserve the number of partitions and are done without a shuffle:
x = df1.groupBy('key1', 'key2', 'time').agg(F.count('*'))
assert x.rdd.getNumPartitions() == 3
I can't understand whether this is a bug or whether there are reasons for performing a shuffle in the second scenario. And how can I avoid the shuffle, if that's possible?
I think I was able to figure out the reason for the different results in Python and Scala.
The reason is broadcast optimisation. If spark-shell is started with broadcasting disabled, both Python and Scala work identically.
./spark-shell --conf spark.sql.autoBroadcastJoinThreshold=-1
val df1 = Seq(
(1, 1, 1)
).toDF("key1", "key2", "time").repartition(3, col("key1"), col("key2"))
val df2 = Seq(
(1, 1, 1),
(2, 2, 2)
).toDF("key1", "key2", "time").repartition(3, col("key1"), col("key2"))
val x = df1.join(df2, usingColumns = Seq("key1", "key2", "time"))
x.rdd.getNumPartitions == 200
So it looks like Spark 2.4.0 isn't able to optimise the described case out of the box, and a Catalyst optimizer extension is needed, as suggested by @user10938362.
BTW, here is some info about writing Catalyst optimizer extensions: https://developer.ibm.com/code/2017/11/30/learn-extension-points-apache-spark-extend-spark-catalyst-optimizer/
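For illustration, a minimal sketch of wiring a custom rule into Catalyst through SparkSessionExtensions; the rule body is a no-op placeholder and the class names are made up for the example, not taken from the question:
import org.apache.spark.sql.SparkSessionExtensions
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// placeholder rule: returns the plan unchanged; real logic would rewrite the join here
object NoopRule extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan
}

// enabled with: --conf spark.sql.extensions=MyExtensions (fully-qualified class name)
class MyExtensions extends (SparkSessionExtensions => Unit) {
  override def apply(extensions: SparkSessionExtensions): Unit =
    extensions.injectOptimizerRule(_ => NoopRule)
}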
The behaviour of Catalyst Optimizer differs between pyspark and Scala (using Spark 2.4 at least).
I ran both and got two different plans.
Indeed, you get 200 partitions in pyspark unless you explicitly set the following for pyspark:
spark.conf.set("spark.sql.shuffle.partitions", 3)
Then 3 partitions are processed, and thus 3 are retained under pyspark.
I'm a little surprised, as I would have thought that under the hood it would be common to both, or so people keep telling me. It just goes to show.
Physical Plan for pyspark with param set via conf:
== Physical Plan ==
*(5) Project [key1#344L, key2#345L, time#346L]
+- SortMergeJoin [key1#344L, key2#345L, time#346L], [key1#350L, key2#351L, time#352L], LeftOuter
:- *(2) Sort [key1#344L ASC NULLS FIRST, key2#345L ASC NULLS FIRST, time#346L ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(key1#344L, key2#345L, time#346L, 3)
: +- *(1) Scan ExistingRDD[key1#344L,key2#345L,time#346L]
+- *(4) Sort [key1#350L ASC NULLS FIRST, key2#351L ASC NULLS FIRST, time#352L ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(key1#350L, key2#351L, time#352L, 3)
+- *(3) Filter ((isnotnull(key1#350L) && isnotnull(key2#351L)) && isnotnull(time#352L))
+- *(3) Scan ExistingRDD[key1#350L,key2#351L,time#352L]
I'm trying to perform an aggregation followed by a self-join on a Structured Streaming DataFrame. Let's suppose the df looks as follows:
sourceDf.show(false)
+-----+-------+
|owner|fruits |
+-----+-------+
|Brian|apple |
|Brian|pear |
|Brian|melon |
|Brian|avocado|
|Bob |avocado|
|Bob |apple |
+-----+-------+
On a static DataFrame, it is easy:
val aggDf = sourceDf.groupBy($"owner").agg(collect_list(col("fruits")) as "fruitsA")
sourceDf.join(aggDf, Seq("owner")).show(false)
+-----+-------+-----------------------------+
|owner|fruits |fruitsA |
+-----+-------+-----------------------------+
|Brian|apple |[apple, pear, melon, avocado]|
|Brian|pear |[apple, pear, melon, avocado]|
|Brian|melon |[apple, pear, melon, avocado]|
|Brian|avocado|[apple, pear, melon, avocado]|
|Bob |avocado|[avocado, apple] |
|Bob |apple |[avocado, apple] |
+-----+-------+-----------------------------+
Unfortunately, I'm unable to figure out how to do this in the case of a Streaming DataFrame. So, I tried using the following complete code that uses Kafka for both Source and Sink:
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{StringType, StructType}
object Test {
val spark: SparkSession = SparkSession.builder().getOrCreate()
import spark.implicits._
val brokers = "kafka:9092"
val inputTopic = "test.kafka.sink.input"
val aggTopic = "test.kafka.sink.agg"
val outputTopicSelf = "test.kafka.sink.output.self"
val outputTopicTwo = "test.kafka.sink.output.two"
val payloadSchema: StructType = new StructType()
.add("owner", StringType)
.add("fruits", StringType)
val payloadSchemaA: StructType = new StructType()
.add("owner", StringType)
.add("fruitsA", StringType)
var joinedDfSchema: StructType = _
val sourceDf: DataFrame = Seq(
("Brian", "apple"),
("Brian", "pear"),
("Brian", "melon"),
("Brian", "avocado"),
("Bob", "avocado"),
("Bob", "apple")
)
.toDF("owner", "fruits")
val additionalData: DataFrame = Seq(("Bob", "grapes")).toDF("owner", "fruits")
def saveDfToKafka(df: DataFrame): Unit = {
df
.select(to_json(struct(df.columns.map(column): _*)).alias("value"))
.write
.format("kafka")
.option("kafka.bootstrap.servers", brokers)
.option("topic", inputTopic)
.save()
}
// save data to kafka (batch)
saveDfToKafka(sourceDf)
// kafka source
val farmDF: DataFrame = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", brokers)
.option("startingOffsets", "earliest")
.option("subscribe", inputTopic)
.load()
.byteArrayToString("value")
.withColumn("value", from_json($"value", payloadSchema))
.expand("value")
farmDF.printSchema()
implicit class DFHelper(df: DataFrame) {
def expand(column: String): DataFrame = {
val wantedColumns = df.columns.filter(_ != column) :+ s"$column.*"
df.select(wantedColumns.map(col): _*)
}
def byteArrayToString(column: String): DataFrame = {
val selectedCols = df.columns.filter(_ != column) :+ s"CAST($column AS STRING)"
df.selectExpr(selectedCols: _*)
}
}
def testSelfAggJoinFail(): Unit = {
// aggregated df
val myFarmDF = farmDF
.groupBy($"owner")
.agg(collect_list(col("fruits")) as "fruitsA")
// joined df
val joinedDF = farmDF
.join(myFarmDF.as("myFarmDF"), Seq("owner"))
.select("owner", "fruits", "myFarmDF.fruitsA")
joinedDfSchema = joinedDF.schema
// stream sink
joinedDF
.select(to_json(struct(joinedDF.columns.map(column): _*)).alias("value"))
.writeStream
.outputMode("append")
.option("kafka.bootstrap.servers", brokers)
.option("checkpointLocation", "/data/kafka/checkpointSelf")
.option("topic", outputTopicSelf)
.format("kafka")
.start()
// let's give time to process the stream
Thread.sleep(10000)
}
def testSelfAggJoin(): Unit = {
// aggregated df
val myFarmDF = farmDF
.withWatermark("timestamp", "30 seconds")
.groupBy(
window($"timestamp", "30 seconds", "15 seconds"),
$"owner"
)
.agg(collect_list(col("fruits")) as "fruitsA")
.select("owner", "fruitsA", "window")
// joined df
val joinedDF = farmDF
.as("farmDF")
.withWatermark("timestamp", "30 seconds")
.join(
myFarmDF.as("myFarmDF"),
expr(
"""
|farmDF.owner = myFarmDF.owner AND
|farmDF.timestamp >= myFarmDF.window.start AND
|farmDF.timestamp <= myFarmDF.window.end
""".stripMargin))
.select("farmDF.owner", "farmDF.fruits", "myFarmDF.fruitsA")
joinedDfSchema = joinedDF.schema
// stream sink
joinedDF
.select(to_json(struct(joinedDF.columns.map(column): _*)).alias("value"))
.writeStream
.outputMode("append")
.option("kafka.bootstrap.servers", brokers)
.option("checkpointLocation", "/data/kafka/checkpointSelf")
.option("topic", outputTopicSelf)
.format("kafka")
.start()
// let's give time to process the stream
Thread.sleep(10000)
}
def testTwoDfAggJoin(): Unit = {
// aggregated df
val myFarmDF = farmDF
.withWatermark("timestamp", "30 seconds")
.groupBy(
$"owner"
)
.agg(collect_list(col("fruits")) as "fruitsA")
.select("owner", "fruitsA")
// save the aggregated df to kafka
myFarmDF
.select(to_json(struct(myFarmDF.columns.map(column):_*)).alias("value"))
.writeStream
.outputMode("update")
.option("kafka.bootstrap.servers", brokers)
.option("checkpointLocation", "/data/kafka/checkpointAgg")
.option("topic", aggTopic)
.format("kafka")
.start()
// let's give time to process the stream
Thread.sleep(10000)
// read the aggregated df from kafka as a stream
val aggDF = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", brokers)
.option("startingOffsets", "earliest")
.option("subscribe", aggTopic)
.load()
.byteArrayToString("value")
.withColumn("value", from_json($"value", payloadSchemaA))
.expand("value")
.withWatermark("timestamp", "30 seconds")
// joined df
val joinedDF = farmDF
.as("farmDF")
.join(
aggDF.as("myFarmDF"),
expr(
"""
|farmDF.owner = myFarmDF.owner AND
|farmDF.timestamp >= myFarmDF.timestamp - interval 1 hour AND
|farmDF.timestamp <= myFarmDF.timestamp + interval 1 hour
""".stripMargin))
.select("farmDF.owner", "myFarmDF.fruitsA", "farmDF.fruits")
joinedDfSchema = joinedDF.schema
// stream sink
joinedDF
.select(to_json(struct(joinedDF.columns.map(column):_*)).alias("value"))
.writeStream
.outputMode("append")
.option("kafka.bootstrap.servers", brokers)
.option("checkpointLocation", "/data/kafka/checkpointTwo")
.option("topic", outputTopicTwo)
.format("kafka")
.start()
// let's give time to process the stream
Thread.sleep(10000)
}
def data(topic: String): DataFrame = {
// let's read back the output topic using kafka batch
spark
.read
.format("kafka")
.option("kafka.bootstrap.servers", brokers)
.option("subscribe", topic)
.load()
.byteArrayToString("value")
.withColumn("value", from_json($"value", joinedDfSchema))
.expand("value")
}
}
Now, if I test on a Streaming DataFrame:
scala> Test.testSelfAggJoinFail
org.apache.spark.sql.AnalysisException: Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets without watermark;;
Project [structstojson(named_struct(owner, owner#59, fruits, fruits#60, fruitsA, fruitsA#78), Some(Etc/UTC)) AS value#96]
+- Project [owner#59, fruits#60, fruitsA#78]
+- Project [owner#59, key#29, topic#31, partition#32, offset#33L, timestamp#34, timestampType#35, fruits#60, fruitsA#78]
+- Join Inner, (owner#59 = owner#82)
:- Project [key#29, topic#31, partition#32, offset#33L, timestamp#34, timestampType#35, value#51.owner AS owner#59, value#51.fruits AS fruits#60]
: +- Project [key#29, topic#31, partition#32, offset#33L, timestamp#34, timestampType#35, jsontostructs(StructField(owner,StringType,true), StructField(fruits,StringType,true), value#43, Some(Etc/UTC), true) AS value#51]
: +- Project [key#29, topic#31, partition#32, offset#33L, timestamp#34, timestampType#35, cast(value#30 as string) AS value#43]
: +- StreamingRelationV2 org.apache.spark.sql.kafka010.KafkaSourceProvider#3269e790, kafka, Map(startingOffsets -> earliest, subscribe -> test.kafka.sink.input, kafka.bootstrap.servers -> kafka:9092), [key#29, value#30, topic#31, partition#32, offset#33L, timestamp#34, timestampType#35], StreamingRelation DataSource(org.apache.spark.sql.SparkSession#42eeb996,kafka,List(),None,List(),None,Map(startingOffsets -> earliest, subscribe -> test.kafka.sink.input, kafka.bootstrap.servers -> kafka:9092),None), kafka, [key#22, value#23, topic#24, partition#25, offset#26L, timestamp#27, timestampType#28]
+- SubqueryAlias myFarmDF
+- Aggregate [owner#82], [owner#82, collect_list(fruits#83, 0, 0) AS fruitsA#78]
+- Project [key#29, topic#31, partition#32, offset#33L, timestamp#34, timestampType#35, value#51.owner AS owner#82, value#51.fruits AS fruits#83]
+- Project [key#29, topic#31, partition#32, offset#33L, timestamp#34, timestampType#35, jsontostructs(StructField(owner,StringType,true), StructField(fruits,StringType,true), value#43, Some(Etc/UTC), true) AS value#51]
+- Project [key#29, topic#31, partition#32, offset#33L, timestamp#34, timestampType#35, cast(value#30 as string) AS value#43]
+- StreamingRelationV2 org.apache.spark.sql.kafka010.KafkaSourceProvider#3269e790, kafka, Map(startingOffsets -> earliest, subscribe -> test.kafka.sink.input, kafka.bootstrap.servers -> kafka:9092), [key#29, value#30, topic#31, partition#32, offset#33L, timestamp#34, timestampType#35], StreamingRelation DataSource(org.apache.spark.sql.SparkSession#42eeb996,kafka,List(),None,List(),None,Map(startingOffsets -> earliest, subscribe -> test.kafka.sink.input, kafka.bootstrap.servers -> kafka:9092),None), kafka, [key#22, value#23, topic#24, partition#25, offset#26L, timestamp#27, timestampType#28]
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:374)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.checkForStreaming(UnsupportedOperationChecker.scala:110)
at org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:235)
at org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:299)
at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:296)
at Test$.testSelfAggJoinFail(<console>:123)
... 51 elided
It fails with "Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets without watermark" because I don't use watermarks.
Now, if I run the second test with
Test.testSelfAggJoin
I get these warnings:
2018-09-12 16:07:33 WARN StreamingJoinHelper:66 - Failed to extract state value watermark from condition (window#70-T30000ms.start - timestamp#139-T30000ms) due to window#70-T30000ms.start
2018-09-12 16:07:33 WARN StreamingJoinHelper:66 - Failed to extract state value watermark from condition (timestamp#139-T30000ms - window#70-T30000ms.end) due to window#70-T30000ms.end
2018-09-12 16:07:33 WARN StreamingJoinHelper:66 - Failed to extract state value watermark from condition (window#70-T30000ms.start - timestamp#139-T30000ms) due to window#70-T30000ms.start
2018-09-12 16:07:33 WARN StreamingJoinHelper:66 - Failed to extract state value watermark from condition (timestamp#139-T30000ms - window#70-T30000ms.end) due to window#70-T30000ms.end
And I can check the result with
Test.data(Test.outputTopicSelf).show(false)
2018-09-12 16:08:01 WARN NetworkClient:882 - [Consumer clientId=consumer-5, groupId=spark-kafka-relation-02f5512f-cc3c-40ad-938f-e3dfdca95f8c-driver-0] Error while fetching metadata with correlation id 2 : {test.kafka.sink
.output.self=LEADER_NOT_AVAILABLE}
2018-09-12 16:08:01 WARN NetworkClient:882 - [Consumer clientId=consumer-5, groupId=spark-kafka-relation-02f5512f-cc3c-40ad-938f-e3dfdca95f8c-driver-0] Error while fetching metadata with correlation id 6 : {test.kafka.sink
.output.self=LEADER_NOT_AVAILABLE}
+---+-----+---------+------+---------+-------------+-----+------+-------+
|key|topic|partition|offset|timestamp|timestampType|owner|fruits|fruitsA|
+---+-----+---------+------+---------+-------------+-----+------+-------+
+---+-----+---------+------+---------+-------------+-----+------+-------+
which returns an empty DataFrame (probably because of the warning?).
I was unable to find a solution with a self-join.
Finally, I tried sinking the aggregation to Kafka and re-reading it as a second Streaming DataFrame, as in:
scala> Test.data(Test.outputTopicTwo).show(false)
+----+--------------------------+---------+------+-----------------------+-------------+-----+----------------------------------+-------+
|key |topic |partition|offset|timestamp |timestampType|owner|fruitsA |fruits |
+----+--------------------------+---------+------+-----------------------+-------------+-----+----------------------------------+-------+
|null|test.kafka.sink.output.two|0 |0 |2018-09-12 16:57:04.376|0 |Brian|["avocado","apple","pear","melon"]|avocado|
|null|test.kafka.sink.output.two|0 |1 |2018-09-12 16:57:04.376|0 |Bob |["apple","avocado"] |apple |
|null|test.kafka.sink.output.two|0 |2 |2018-09-12 16:57:04.38 |0 |Brian|["avocado","apple","pear","melon"]|apple |
|null|test.kafka.sink.output.two|0 |3 |2018-09-12 16:57:04.38 |0 |Bob |["apple","avocado"] |avocado|
|null|test.kafka.sink.output.two|0 |4 |2018-09-12 16:57:04.381|0 |Brian|["avocado","apple","pear","melon"]|pear |
|null|test.kafka.sink.output.two|0 |5 |2018-09-12 16:57:04.382|0 |Brian|["avocado","apple","pear","melon"]|melon |
+----+--------------------------+---------+------+-----------------------+-------------+-----+----------------------------------+-------+
which works (although not very efficiently, I'd say), but if I add additional data to the source topic:
scala> Test.saveDfToKafka(Test.additionalData)
scala> Test.data(Test.outputTopicTwo).show(false)
+----+--------------------------+---------+------+-----------------------+-------------+-----+----------------------------------+-------+
|key |topic |partition|offset|timestamp |timestampType|owner|fruitsA |fruits |
+----+--------------------------+---------+------+-----------------------+-------------+-----+----------------------------------+-------+
|null|test.kafka.sink.output.two|0 |0 |2018-09-12 16:57:04.376|0 |Brian|["avocado","apple","pear","melon"]|avocado|
|null|test.kafka.sink.output.two|0 |1 |2018-09-12 16:57:04.376|0 |Bob |["apple","avocado"] |apple |
|null|test.kafka.sink.output.two|0 |2 |2018-09-12 16:57:04.38 |0 |Brian|["avocado","apple","pear","melon"]|apple |
|null|test.kafka.sink.output.two|0 |3 |2018-09-12 16:57:04.38 |0 |Bob |["apple","avocado"] |avocado|
|null|test.kafka.sink.output.two|0 |4 |2018-09-12 16:57:04.381|0 |Brian|["avocado","apple","pear","melon"]|pear |
|null|test.kafka.sink.output.two|0 |5 |2018-09-12 16:57:04.382|0 |Brian|["avocado","apple","pear","melon"]|melon |
|null|test.kafka.sink.output.two|0 |6 |2018-09-12 16:59:37.125|0 |Bob |["apple","avocado"] |grapes |
|null|test.kafka.sink.output.two|0 |7 |2018-09-12 16:59:40.001|0 |Bob |["apple","avocado","grapes"] |apple |
|null|test.kafka.sink.output.two|0 |8 |2018-09-12 16:59:40.002|0 |Bob |["apple","avocado","grapes"] |avocado|
|null|test.kafka.sink.output.two|0 |9 |2018-09-12 16:59:40.002|0 |Bob |["apple","avocado","grapes"] |grapes |
+----+--------------------------+---------+------+-----------------------+-------------+-----+----------------------------------+-------+
I get many more rows, probably because I had to use .outputMode("update") while sinking the aggregation Df.
Is there a way to perform this aggregation without sending the aggregation back to Kafka as a separate topic?
If not, is it possible to modify testTwoDfAggJoin to use .outputMode("append")?
As of Spark 2.3, joining two streaming DataFrames is not possible when aggregate functions are involved before the join.
From the Spark documentation:
Additional details on supported joins:
Joins can be cascaded, that is, you can do df1.join(df2, ...).join(df3, ...).join(df4, ....).
As of Spark 2.3, you can use joins only when the query is in Append output mode. Other output modes are not yet supported.
As of Spark 2.3, you cannot use other non-map-like operations before joins. Here are a few examples of what cannot be used.
Cannot use streaming aggregations before joins.
Cannot use mapGroupsWithState and flatMapGroupsWithState in Update mode before joins.
I encountered similar error info. The outputMode is important for aggregations; I solved it by adding df.writeStream.outputMode(OutputMode.Update()) or df.writeStream.outputMode(OutputMode.Complete()).
Ref:
Output Modes: There are a few types of output modes.
Append mode (default) - This is the default mode, where only the new rows added to the Result Table since the last trigger will be outputted to the sink. This is supported for only those queries where rows added to the Result Table is never going to change. Hence, this mode guarantees that each row will be output only once (assuming fault-tolerant sink). For example, queries with only select, where, map, flatMap, filter, join, etc. will support Append mode.
Complete mode - The whole Result Table will be outputted to the sink after every trigger. This is supported for aggregation queries.
Update mode - (Available since Spark 2.1.1) Only the rows in the Result Table that were updated since the last trigger will be outputted to the sink. More information to be added in future releases.
http://blog.madhukaraphatak.com/introduction-to-spark-structured-streaming-part-3/
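As a minimal, illustrative sketch of the point above (aggDF and the console sink are placeholders, not the asker's exact code), an aggregation query can be written with Update mode so that only the rows changed since the last trigger are emitted:
import org.apache.spark.sql.streaming.OutputMode

// aggDF stands for a streaming aggregation such as farmDF.groupBy($"owner").agg(...)
val query = aggDF.writeStream
  .outputMode(OutputMode.Update())  // or OutputMode.Complete() to re-emit the whole result table
  .format("console")                // placeholder sink for illustration
  .start()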
This question already has answers here:
Why is predicate pushdown not used in typed Dataset API (vs untyped DataFrame API)? (1 answer)
Spark 2.0 Dataset vs DataFrame (3 answers)
Closed 4 years ago.
Here is the sample code that I am running.
Creating a test parquet dataset with the mod column as the partition column:
scala> val test = spark.range(0 , 100000000).withColumn("mod", $"id".mod(40))
test: org.apache.spark.sql.DataFrame = [id: bigint, mod: bigint]
scala> test.write.partitionBy("mod").mode("overwrite").parquet("test_pushdown_filter")
After that, I am reading this data as a DataFrame and applying a filter on the partition column, i.e. mod.
scala> val df = spark.read.parquet("test_pushdown_filter").filter("mod = 5")
df: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: bigint, mod: int]
scala> df.queryExecution.executedPlan
res1: org.apache.spark.sql.execution.SparkPlan =
*FileScan parquet [id#16L,mod#17] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/C:/Users/kprajapa/WorkSpace/places/test_pushdown_filter], PartitionCount: 1, PartitionFilters: [
isnotnull(mod#17), (mod#17 = 5)], PushedFilters: [], ReadSchema: struct<id:bigint>
You can see in the execution plan that it is only reading 1 partition.
But if you apply the same filter with the Dataset API, it reads all the partitions and then applies the filter.
scala> case class Test(id: Long, mod: Long)
defined class Test
scala> val ds = spark.read.parquet("test_pushdown_filter").as[Test].filter(_.mod==5)
ds: org.apache.spark.sql.Dataset[Test] = [id: bigint, mod: int]
scala> ds.queryExecution.executedPlan
res2: org.apache.spark.sql.execution.SparkPlan =
*Filter <function1>.apply
+- *FileScan parquet [id#22L,mod#23] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/C:/Users/kprajapa/WorkSpace/places/test_pushdown_filter], PartitionCount: 40, PartitionFilter
s: [], PushedFilters: [], ReadSchema: struct<id:bigint>
Is this how the Dataset API works, or am I missing something?
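For what it's worth, a hedged sketch based on the duplicate linked above: keeping the predicate as an untyped Column expression on the typed Dataset lets Catalyst see it and prune partitions, whereas a typed lambda such as _.mod == 5 is an opaque Scala function to the optimizer, so the pruning is lost.
// Column-based predicate on the typed Dataset: Catalyst can still push it down to the partition scan
val dsCol = spark.read.parquet("test_pushdown_filter").as[Test].filter($"mod" === 5)
// typed lambda predicate: evaluated row by row after the scan, so no partition pruning
val dsLambda = spark.read.parquet("test_pushdown_filter").as[Test].filter(_.mod == 5)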