Queries with streaming sources must be executed with writeStream.start();; - apache-spark

I am trying to read data from Kafka using Spark Structured Streaming and predict from the incoming data, using a model I trained with Spark ML.
val spark = SparkSession
  .builder()
  .appName("Spark SQL basic example")
  .master("local")
  .getOrCreate()
import spark.implicits._

val toString = udf((payload: Array[Byte]) => new String(payload))

val sentenceDataFrame = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "topicname1")
  .load()
  .selectExpr("CAST(value AS STRING)")
  .as[String]
sentenceDataFrame.printSchema()

val regexTokenizer = new RegexTokenizer()
  .setInputCol("value")
  .setOutputCol("words")
  .setPattern("\\W")
val tokencsv = regexTokenizer.transform(sentenceDataFrame)

val remover = new StopWordsRemover()
  .setInputCol("words")
  .setOutputCol("filtered")
val removestopdf = remover.transform(tokencsv)

// Learn a mapping from words to Vectors.
val word2Vec = new Word2Vec()
  .setInputCol("filtered")
  .setOutputCol("result")
  .setVectorSize(300)
  .setMinCount(0)
val model = word2Vec.fit(removestopdf)
val result = model.transform(removestopdf)

val featureIndexer = new VectorIndexer()
  .setInputCol("result")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(2)
  .fit(result)
val some = featureIndexer.transform(result)

val model1 = RandomForestClassificationModel.load("/home/akhil/Documents/traindata/stages/2_rfc_80e12c5d1259")
val predict = model1.transform(result)

val query = predict.writeStream
  .outputMode("append")
  .format("console")
  .start()
query.awaitTermination()
When I run prediction on the streaming data, it gives me the following error:
Exception in thread "main" org.apache.spark.sql.AnalysisException:
Queries with streaming sources must be executed with
writeStream.start();;
kafka
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:196)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:35)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:33)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:128)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:127)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:127)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:127)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:127)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:127)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:127)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:127)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:127)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:127)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:127)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.checkForBatch(UnsupportedOperationChecker.scala:33)
at org.apache.spark.sql.execution.QueryExecution.assertSupported(QueryExecution.scala:58)
at org.apache.spark.sql.execution.QueryExecution.withCachedData$lzycompute(QueryExecution.scala:69)
at org.apache.spark.sql.execution.QueryExecution.withCachedData(QueryExecution.scala:67)
at org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:73)
at org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:73)
at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:79)
at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:75)
at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:84)
at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:84)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:87)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:87)
at org.apache.spark.sql.Dataset.rdd$lzycompute(Dataset.scala:2547)
at org.apache.spark.sql.Dataset.rdd(Dataset.scala:2544)
at org.apache.spark.ml.feature.Word2Vec.fit(Word2Vec.scala:175)
at predict1model$.main(predict1model.scala:53)
at predict1model.main(predict1model.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
The error refers to the word2Vec.fit(removestopdf) line. Any help would be really appreciated.

In general, Structured Streaming cannot (yet, as of Spark 2.2) be used to train Spark ML models.
Some operations are not supported on a streaming Dataset; one of them is converting the Dataset to its RDD representation.
In the particular case of Word2Vec, fit needs to go down to the RDD level, which is why it fails here.
Nevertheless, it is possible to train the model on a static dataset and apply the predictions to the streaming data: the transform operation is usable on a streaming Dataset, as in val result = model.transform(removestopdf) above.
In a nutshell: fit the model on a static dataset; the resulting transformer can then be applied to a streaming Dataset, as sketched below.
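A minimal sketch of that split, reusing the transformers defined in the question and assuming the training corpus is available as a static DataFrame (the Parquet path and its layout are placeholders):

// Train on a static (batch) DataFrame -- placeholder path, same "value" column as the streaming data
val staticSentences = spark.read.parquet("/path/to/training/data")
  .selectExpr("CAST(value AS STRING) AS value")

val staticTokens   = regexTokenizer.transform(staticSentences)
val staticFiltered = remover.transform(staticTokens)

// fit() is only legal on static data; the result is a plain Transformer
val w2vModel = word2Vec.fit(staticFiltered)

// Apply the fitted transformers to the streaming Dataset -- transform() is supported on streams
val streamingFeatures = w2vModel.transform(
  remover.transform(regexTokenizer.transform(sentenceDataFrame)))

val streamingQuery = streamingFeatures.writeStream
  .outputMode("append")
  .format("console")
  .start()
streamingQuery.awaitTermination()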

You can find a proof of concept in the GitHub project "Spark Structured Streaming ML".
There is also SPARK-16424 that you can follow.

Related

Structured Streaming: Reading from multiple Kafka topics at once

I have a Spark Structured Streaming application which has to read from 12 Kafka topics (different schemas, Avro format) at once, deserialize the data, and store it in HDFS. When I read from a single topic, my code works fine and without errors, but when running multiple queries together I get the following error:
java.lang.IllegalStateException: Race while writing batch 0
My code is as follows:
def main(args: Array[String]): Unit = {
  val kafkaProps = Util.loadProperties(kafkaConfigFile).asScala
  val topic_list = List("topic1", "topic2", "topic3", "topic4")

  topic_list.foreach(x => {
    kafkaProps.update("subscribe", x)

    val source = Source.fromInputStream(Util.getInputStream("/schema/topics/" + x)).getLines.mkString
    val schemaParser = new Schema.Parser
    val schema = schemaParser.parse(source)
    val sqlTypeSchema = SchemaConverters.toSqlType(schema).dataType.asInstanceOf[StructType]

    val kafkaStreamData = spark
      .readStream
      .format("kafka")
      .options(kafkaProps)
      .load()

    val udfDeserialize = udf(deserialize(source), DataTypes.createStructType(sqlTypeSchema.fields))

    val transformedDeserializedData = kafkaStreamData.select("value").as(Encoders.BINARY)
      .withColumn("rows", udfDeserialize(col("value")))
      .select("rows.*")

    val query = transformedDeserializedData
      .writeStream
      .trigger(Trigger.ProcessingTime("5 seconds"))
      .outputMode("append")
      .format("parquet")
      .option("path", "/output/topics/" + x)
      .option("checkpointLocation", checkpointLocation + "//" + x)
      .start()
  })

  spark.streams.awaitAnyTermination()
}
Alternative: you can use Kafka Connect (from Confluent), NiFi, StreamSets, etc., since your use case seems to fit "dump/persist to HDFS". That said, you need to have these tools installed. You state that the small-files problem is not an issue, so be it.
From Apache Kafka 0.9 onwards you can use the Kafka Connect API for a Kafka --> HDFS sink (various HDFS formats are supported). You do need a Kafka Connect cluster, but it runs alongside your existing cluster in any event, so that is not a big deal. Someone does need to maintain it, though.
Some links to get you on your way:
https://data-flair.training/blogs/kafka-connect/
https://github.com/confluentinc/kafka-connect-hdfs
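For reference, a minimal HDFS sink connector configuration might look roughly like the following; the connector name, topic list, and URLs are illustrative placeholders, so check the confluentinc/kafka-connect-hdfs documentation linked above for the exact properties your version expects:

name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=4
# topics to persist (placeholder names)
topics=topic1,topic2,topic3,topic4
# target HDFS namenode (placeholder URL)
hdfs.url=hdfs://namenode:8020
# number of records to accumulate before writing a file
flush.size=1000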

Getting error saying "Queries with streaming sources must be executed with writeStream.start()" on spark structured streaming [duplicate]

This question already has answers here:
How to display a streaming DataFrame (as show fails with AnalysisException)?
(2 answers)
Closed 4 years ago.
I am getting some issues while executing Spark SQL on top of Spark Structured Streaming.
PFA for the error.
Here is my code:
object sparkSqlIntegration {
  def main(args: Array[String]) {
    val spark = SparkSession
      .builder
      .appName("StructuredStreaming")
      .master("local[*]")
      .config("spark.sql.warehouse.dir", "file:///C:/temp") // Necessary to work around a Windows bug in Spark 2.0.0; omit if you're not on Windows.
      .config("spark.sql.streaming.checkpointLocation", "file:///C:/checkpoint")
      .getOrCreate()

    setupLogging()

    val userSchema = new StructType().add("name", "string").add("age", "integer")

    // Create a stream of text files dumped into the logs directory
    val rawData = spark.readStream.option("sep", ",").schema(userSchema).csv("file:///C:/Users/R/Documents/spark-poc-centri/csvFolder")

    // Must import spark.implicits for conversion to DataSet to work!
    import spark.implicits._

    rawData.createOrReplaceTempView("updates")
    val sqlResult = spark.sql("select * from updates")
    println("sql results here")
    sqlResult.show()
    println("Otheres")

    val query = rawData.writeStream.outputMode("append").format("console").start()

    // Keep going until we're stopped.
    query.awaitTermination()
    spark.stop()
  }
}
During execution, I am getting the following error. As I am new to streaming, can anyone tell me how I can execute Spark SQL queries on Spark Structured Streaming?
2018-12-27 16:02:40 INFO BlockManager:54 - Initialized BlockManager: BlockManagerId(driver, LAPTOP-5IHPFLOD, 6829, None)
2018-12-27 16:02:41 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler#6731787b{/metrics/json,null,AVAILABLE,#Spark}
sql results here
Exception in thread "main" org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
FileSource[file:///C:/Users/R/Documents/spark-poc-centri/csvFolder]
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:374)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:37)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:35)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
at scala.collection.immutable.List.foreach(List.scala:392)
You don't need any of these lines:
import spark.implicits._
rawData.createOrReplaceTempView("updates")
val sqlResult= spark.sql("select * from updates")
println("sql results here")
sqlResult.show()
println("Otheres")
Most importantly, select * isn't needed: when the DataFrame is printed, you already see all of its columns, so you also don't need to register the temp view just to give it a name.
And using format("console") eliminates the need for .show().
Refer to the Spark examples for reading from a network socket and outputting to the console.
val words = // omitted ... some streaming DataFrame

// Generate a running word count
val wordCounts = words.groupBy("value").count()

// Start running the query that prints the running counts to the console
val query = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()
query.awaitTermination()
Takeaway: use DataFrame operations like .select() and .groupBy() rather than raw SQL, as in the sketch below.
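For instance, a minimal sketch reusing the rawData stream and the name/age schema defined in the question (the aggregation itself is just illustrative):

// Aggregate the streaming CSV data with DataFrame operations instead of spark.sql(...)
val countsByAge = rawData
  .groupBy("age")
  .count()

// Streaming DataFrames are printed via a streaming query, not .show()
val ageQuery = countsByAge.writeStream
  .outputMode("complete")   // aggregations need "complete" (or "update") mode
  .format("console")
  .start()
ageQuery.awaitTermination()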
Alternatively, you can use Spark Streaming (DStreams). As shown in those examples, you foreachRDD over each stream batch and convert it to a DataFrame, which you can then query:
/** Case class for converting RDD to DataFrame */
case class Record(word: String)

val words = // omitted ... some DStream

// Convert RDDs of the words DStream to DataFrames and run SQL queries
words.foreachRDD { (rdd: RDD[String], time: Time) =>
  // Get the singleton instance of SparkSession
  val spark = SparkSessionSingleton.getInstance(rdd.sparkContext.getConf)
  import spark.implicits._

  // Convert RDD[String] to RDD[case class] to DataFrame
  val wordsDataFrame = rdd.map(w => Record(w)).toDF()

  // Create a temporary view using the DataFrame
  wordsDataFrame.createOrReplaceTempView("words")

  // Do word count on the table using SQL and print it
  val wordCountsDataFrame =
    spark.sql("select word, count(*) as total from words group by word")
  println(s"========= $time =========")
  wordCountsDataFrame.show()
}

ssc.start()
ssc.awaitTermination()

Streaming avro files from a directory

I'm trying to set up a structured stream from a directory of Avro files. We already have some non-streaming code to deal with exactly the same data, so the least-effort step toward streaming would be to re-use that code.
To move to Structured Streaming, I tried the following, which works in the non-streaming manner (using read instead of readStream) but gives me a serialization error in the streaming approach.
import com.databricks.spark.avro._
import org.apache.avro._
import org.apache.spark.sql.types._

val schemaStr = """ {our_schema_here} """

val parser = new Schema.Parser()
val avroSchema = parser.parse(schemaStr)

val structType = SchemaConverters.toSqlType(avroSchema).dataType match {
  case t: StructType => Some(t)
  case _ => throw new RuntimeException(
    s"""Avro schema cannot be converted to a Spark SQL StructType:
       |
       |${avroSchema.toString(true)}
       |""".stripMargin)
}

val path = "dbfs://path/to/avro/files/*"

val avroStream = sqlContext
  .readStream
  .schema(structType.get)
  .format("com.databricks.spark.avro")
  .option("maxFilesPerTrigger", 5)
  .load(path)
  .writeStream
  .outputMode("append")
  .format("memory")
  .queryName("counts")
  .start()
The exception I get is shown below. Note that I can't get the full stack trace, as I'm on Databricks and cannot access the executor logs. I'm a bit at a loss as to what exactly the object is that can't be serialized.
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2326)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2125)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:937)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.collect(RDD.scala:936)
at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:299)
at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:291)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:2966)
at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2456)
at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2456)
at org.apache.spark.sql.Dataset$$anonfun$57.apply(Dataset.scala:2950)
at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:80)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:99)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2949)
at org.apache.spark.sql.Dataset.collect(Dataset.scala:2456)
at org.apache.spark.sql.execution.streaming.MemorySink.addBatch(memory.scala:217)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(StreamExecution.scala:731)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$1$$anonfun$apply$mcV$sp$1.apply(StreamExecution.scala:731)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$1$$anonfun$apply$mcV$sp$1.apply(StreamExecution.scala:731)
at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:80)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:99)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$1.apply$mcV$sp(StreamExecution.scala:730)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$1.apply(StreamExecution.scala:730)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$1.apply(StreamExecution.scala:730)
at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:62)
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch(StreamExecution.scala:729)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(StreamExecution.scala:328)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$apply$mcZ$sp$1.apply(StreamExecution.scala:316)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$apply$mcZ$sp$1.apply(StreamExecution.scala:316)
at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:62)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1.apply$mcZ$sp(StreamExecution.scala:316)
at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches(StreamExecution.scala:312)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:226)
Caused by: java.io.NotSerializableException: scala.collection.immutable.MapLike$$anon$1
Serialization stack:
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
... 41 more

Decoding Java enums/custom non case classes using Structured Spark Streaming

I am trying to use Structured Streaming in Spark 2.1.1 to read from Kafka and decode Avro-encoded messages. I have a UDF defined as per this question.
val sr = new CachedSchemaRegistryClient(conf.kafkaSchemaRegistryUrl, 100)
val deser = new KafkaAvroDeserializer(sr)
val decodeMessage = udf { bytes: Array[Byte] => deser.deserialize("topic.name", bytes).asInstanceOf[DeviceRead] }

val topic = conf.inputTopic
val df = session
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", conf.kafkaServers)
  .option("subscribe", topic)
  .load()
df.printSchema()

val result = df.selectExpr("CAST(key AS STRING)", """decodeMessage($"value") as "value_des"""")

val query = result.writeStream
  .format("console")
  .outputMode(OutputMode.Append())
  .start()
However, I get the following failure:
Exception in thread "main" java.lang.UnsupportedOperationException: Schema for type DeviceRelayStateEnum is not supported
It fails on this line
val decodeMessage = udf { bytes:Array[Byte] => deser.deserialize("topic.name", bytes).asInstanceOf[DeviceRead] }
An alternative approach was to define encoders for the custom classes I have:
implicit val enumEncoder = Encoders.javaSerialization[DeviceRelayStateEnum]
implicit val messageEncoder = Encoders.product[DeviceRead]
but that fails with the following error when the messageEncoder is being registered:
Exception in thread "main" java.lang.UnsupportedOperationException: No Encoder found for DeviceRelayStateEnum
- option value class: "DeviceRelayStateEnum"
- field (class: "scala.Option", name: "deviceRelayState")
- root class: "DeviceRead"
at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:602)
at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:476)
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$9.apply(ScalaReflection.scala:596)
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$9.apply(ScalaReflection.scala:587)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
When I attempt to do this using a map after the load(), I get the following compilation error:
val result = df.map((bytes: Row) => deser.deserialize("topic", bytes.getAs[Array[Byte]]("value")).asInstanceOf[DeviceRead])
Error:(76, 26) not enough arguments for method map: (implicit evidence$6: org.apache.spark.sql.Encoder[DeviceRead])org.apache.spark.sql.Dataset[DeviceRead].
Unspecified value parameter evidence$6.
val result = df.map((bytes: Row) => deser.deserialize("topic", bytes.getAs[Array[Byte]]("value")).asInstanceOf[DeviceRead])
Error:(76, 26) Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
val result = df.map((bytes: Row) => deser.deserialize("topic", bytes.getAs[Array[Byte]]("value")).asInstanceOf[DeviceRead])
Does that essentially mean that I cannot use Structured Streaming with Java enums, and that it can only be used with either primitives or case classes?
I read a few related questions (1, 2, 3) around this, and it seems the possibility of specifying a custom Encoder for a class (i.e. a UDT) was removed in 2.1 and the new functionality has not been added.
Any help will be appreciated.
I think you may be asking for too much from the current version of Structured Streaming (and Spark SQL in general).
I have not yet fully worked out how to deal with the issue of missing encoders in a more principled way, but you would get the same issue if you tried to create a Dataset of enums. That may simply not be supported yet.
Structured Streaming is just a streaming library on top of Spark SQL and uses it for serialization-deserialization (SerDe).
To make the story short and to get you going (until you figure out a better way), I'd recommend avoiding enums in the business objects you use to represent the schema of your datasets.
So I'd recommend doing something along these lines:
val decodeMessage = udf { bytes: Array[Byte] =>
  val dr = deser.deserialize("topic.name", bytes).asInstanceOf[DeviceRead]
  // Do additional transformation here so you use a custom streaming-specific class.
  // Here I'm using a simple tuple to hold what might be relevant;
  // you could create a case class instead to have proper names.
  (dr.id, dr.value)
}
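One possible way to apply that UDF to the streaming DataFrame from the question (the id and value fields on DeviceRead are the answer's illustrative placeholders, not a confirmed API) might be:

import org.apache.spark.sql.functions.col

// Apply the UDF to the Kafka value column; the tuple comes back as a struct column
val result = df.select(
  col("key").cast("string").as("key"),
  decodeMessage(col("value")).as("value_des"))

val query = result.writeStream
  .format("console")
  .outputMode(OutputMode.Append())
  .start()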

How to write null value from Spark sql expression of DataFrame to a database table? (IllegalArgumentException: Can't get JDBC type for null)

I receive the error java.lang.IllegalArgumentException: Can't get JDBC type for null when I try to run the following example:
...
val spark = SparkSession.builder
  .master("local[*]")
  .appName("Demo")
  .getOrCreate()
import spark.implicits._

// Load the first table
val df_one = spark.read
  .format("jdbc")
  .option("url", myDbUrl)
  .option("dbtable", myTableOne)
  .option("user", myUser)
  .option("password", myPassw)
  .load()
df_one.createGlobalTempView("table_one")

// Load the second table
val df_two = spark.read
  .format("jdbc")
  .option("url", myUrl)
  .option("dbtable", myTableTwo)
  .option("user", myUser)
  .option("password", myPassw)
  .load()
df_two.createGlobalTempView("table_two")

// Perform the join of the two tables
val df_result = spark.sql(
  "select o.field_one, t.field_two, null as field_three " +
  " from global_temp.table_one o, global_temp.table_two t where o.key_one = t.key_two"
)

// Error occurs here:
df_result.write
  .format("jdbc")
  .option("dbtable", myResultTable)
  .option("url", myDbUrl)
  .option("user", myUser)
  .option("password", myPassw)
  .mode(SaveMode.Append)
  .save()
...
I receive the error:
Exception in thread "main" java.lang.IllegalArgumentException: Can't get JDBC type for null
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$getJdbcType$2.apply(JdbcUtils.scala:148)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$getJdbcType$2.apply(JdbcUtils.scala:148)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$getJdbcType(JdbcUtils.scala:147)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$18.apply(JdbcUtils.scala:663)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$18.apply(JdbcUtils.scala:662)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.saveTable(JdbcUtils.scala:662)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:77)
at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:426)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:215)
A workaround, which dramatically slows down the workflow:
...
// Create a case class for the Dataset
case class ResultCaseClass(field_one: Option[Int], field_two: Option[Int], field_three: Option[Int])

// Perform the join of the two tables
val ds_result = spark.sql(
  "select o.field_one, t.field_two, null as field_three " +
  " from global_temp.table_one o, global_temp.table_two t where o.key_one = t.key_two"
)
  .withColumn("field_one", $"field_one".cast(IntegerType))
  .withColumn("field_two", $"field_two".cast(IntegerType))
  .withColumn("field_three", $"field_three".cast(IntegerType))
  .as[ResultCaseClass]

// Success:
ds_result.write......
...
I encountered the same problem as you, and then found the relevant error information in the Java source code. If you insert a null value into a database without specifying its data type, you will get "Can't get JDBC type for null". The way to fix this is to cast the null to the data type that matches the database's field type.
Example:
lit(null).cast(StringType) or lit(null).cast("string")
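Applied to the query in the question, a minimal sketch (assuming field_three is an integer column in the target table) gives the null an explicit type, either in the SQL itself or with withColumn afterwards:

// Option 1: cast the null inside the SQL
val df_result = spark.sql(
  "select o.field_one, t.field_two, cast(null as int) as field_three " +
  " from global_temp.table_one o, global_temp.table_two t where o.key_one = t.key_two"
)

// Option 2: keep the original query and add the typed null column afterwards
// val df_result = originalQuery.withColumn("field_three", lit(null).cast(IntegerType))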
