I am running a Spark Streaming job and I want to publish my result_DStream, which is of type DStream[GenericData.Record]. I used the code below for that purpose, but I am getting a Task not serializable exception:
val prod_props : Properties = new Properties()
prod_props.put("bootstrap.servers" , "localhost:9092")
prod_props.put("key.serializer" , "org.apache.kafka.common.serialization.StringSerializer")
prod_props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer")
val _producer : KafkaProducer[String , Array[Byte]] = new KafkaProducer(prod_props)
result_DStream.foreachRDD(r => {
r.foreachPartition(it => {
while(it.hasNext)
{
val schema = new Schema.Parser().parse(schema_string)
val recordInjection : Injection[GenericRecord , Array[Byte]] = GenericAvroCodecs.toBinary(schema)
val record : GenericData.Record = it.next()
val byte : Array[Byte] = recordInjection.apply(record)
val prod_record : ProducerRecord[String , Array[Byte]] = new ProducerRecord("sample_topic_name_9" , byte)
_producer.send(prod_record)
}
})
})
What can I do to solve this problem? I have tried suggestions such as creating the non-serializable object inside the lambda function and using foreachPartition instead of foreach. In my view the problem is in schema or in recordInjection.
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2062)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:919)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:918)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:918)
at HadoopMetrics_Online$$anonfun$main$3.apply(HadoopMetrics_Online.scala:187)
at HadoopMetrics_Online$$anonfun$main$3.apply(HadoopMetrics_Online.scala:186)
at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:661)
at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:661)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:50)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:50)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:50)
at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:426)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:49)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:49)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:49)
at scala.util.Try$.apply(Try.scala:161)
at org.apache.spark.streaming.scheduler.Job.run(Job.scala:39)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:224)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:224)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:224)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:223)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.NotSerializableException: org.apache.kafka.clients.producer.KafkaProducer
Serialization stack:
- object not serializable (class: org.apache.kafka.clients.producer.KafkaProducer, value: org.apache.kafka.clients.producer.KafkaProducer@252f5489)
- field (class: HadoopMetrics_Online$$anonfun$main$3, name: _producer$1, type: class org.apache.kafka.clients.producer.KafkaProducer)
- object (class HadoopMetrics_Online$$anonfun$main$3, <function1>)
- field (class: HadoopMetrics_Online$$anonfun$main$3$$anonfun$apply$1, name: $outer, type: class HadoopMetrics_Online$$anonfun$main$3)
- object (class HadoopMetrics_Online$$anonfun$main$3$$anonfun$apply$1, <function1>)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:301)
... 30 more
KafkaProducer isn't serializable, and you're closing over it in your foreachPartition call. You'll need to declare it inside the partition function instead (and while you're at it, the schema parsing and the injection only need to be built once per partition, not once per record):
resultDStream.foreachRDD(r => {
  r.foreachPartition(it => {
    // Build the non-serializable objects here, on the executor, once per partition
    val producer: KafkaProducer[String, Array[Byte]] = new KafkaProducer(prod_props)
    val schema = new Schema.Parser().parse(schema_string)
    val recordInjection: Injection[GenericRecord, Array[Byte]] = GenericAvroCodecs.toBinary(schema)
    while (it.hasNext) {
      val record: GenericData.Record = it.next()
      val byte: Array[Byte] = recordInjection.apply(record)
      val prod_record: ProducerRecord[String, Array[Byte]] = new ProducerRecord("sample_topic_name_9", byte)
      producer.send(prod_record)
    }
  })
})
Side note - Scala naming conventions are camelCase for variable names, not snake_case.
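Side note 2 - if creating a producer for every partition of every micro-batch turns out to be too expensive, a common variant is to keep one lazily initialized producer per executor JVM and reuse it across batches. A rough sketch only (ProducerHolder is a hypothetical top-level object, not part of the code above; the properties are repeated inside it so nothing has to be shipped from the driver):
object ProducerHolder {
  // One producer per executor JVM, created lazily on first use, never serialized
  lazy val producer: KafkaProducer[String, Array[Byte]] = {
    val props = new java.util.Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer")
    new KafkaProducer[String, Array[Byte]](props)
  }
}

resultDStream.foreachRDD(r => {
  r.foreachPartition(it => {
    val schema = new Schema.Parser().parse(schema_string)
    val recordInjection: Injection[GenericRecord, Array[Byte]] = GenericAvroCodecs.toBinary(schema)
    it.foreach { record =>
      val bytes = recordInjection.apply(record)
      ProducerHolder.producer.send(new ProducerRecord[String, Array[Byte]]("sample_topic_name_9", bytes))
    }
  })
})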
Related
I have written a simple piece of code in Spark. It takes a file location from a DataFrame column and returns a string indicating whether that file exists. But once I run it, it throws "Task not serializable".
Can someone please help me get past this error?
object filetospark{
def main(args: Array[String]) : Unit = {
val spark = SparkSession
.builder()
.appName("app1")
.master("local")
.getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val path: String => String = (Path: String) => {
val exists = fs.exists(new Path(Path))
var result = " "
if (exists) {
result = "Y"
}
else {
result = "N"
}
result
}
val PATH = udf(path)
val config_df=spark.read.
option("header","true").
option("inferSchema","true").
csv("pathlocation")
val current_date=LocalDate.now()
val instance_table_df=instance_df.withColumn("is_available",PATH(col("file_name")))
The error looks like this:
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:403)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:393)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2326)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:850)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:849)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:849)
at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:613)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247)
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:339)
at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3384)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2545)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2545)
at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3365)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3364)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2545)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2759)
at org.apache.spark.sql.Dataset.getRows(Dataset.scala:255)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:292)
at org.apache.spark.sql.Dataset.show(Dataset.scala:746)
at org.apache.spark.sql.Dataset.show(Dataset.scala:705)
at org.apache.spark.sql.Dataset.show(Dataset.scala:714)
at filetospark$.main(filetospark.scala:40)
at filetospark.main(filetospark.scala)
Caused by: java.io.NotSerializableException: org.apache.hadoop.fs.LocalFileSystem
Serialization stack:
- object not serializable (class: org.apache.hadoop.fs.LocalFileSystem, value: org.apache.hadoop.fs.LocalFileSystem@7fd3fd06)
- field (class: filetospark$$anonfun$1, name: fs$1, type: class org.apache.hadoop.fs.FileSystem)
- object (class filetospark$$anonfun$1, <function1>)
- element of array (index: 4)
- array (class [Ljava.lang.Object;, size 5)
- field (class: org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11, name: references$1, type: class [Ljava.lang.Object;)
- object (class org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11, <function2>)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:400)
... 36 more
That is the error it shows; could someone please help me with this problem? Here is the code after moving the SparkSession and the FileSystem outside the main function:
object filetospark{
val spark = SparkSession
.builder()
.appName("app1")
.master("local")
.getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val path: String => String = (Path: String) => {
val exists = fs.exists(new Path(Path))
var result = " "
if (exists) {
result = "Y"
}
else {
print("N")
result = "N"
}
result
}
def main(args: Array[String]) : Unit = {
val PATH = udf(path)
val newfu=udf(newfun)
val config_df=spark.read.
option("header","true").
option("inferSchema","true").
csv("filepath")
val current_date=LocalDate.now()
val instance_table_df=instance_df.withColumn("is_available",PATH(col("file_name")))
instance_table_df.show()
}
}
I don't know what is happening here. That error is now cleared, but my doubt remains:
I just created the SparkSession (and the FileSystem) outside the main function and it works fine, but I don't know why. If anyone knows what is going on, please post here.
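For what it's worth, another way to keep the non-serializable FileSystem out of the closure entirely is to build it inside the UDF body, so the closure captures only serializable values. A sketch only, assuming the executors can reach the same file system through a default Hadoop Configuration:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.functions.udf

// The FileSystem is created on the executor, inside the UDF, so nothing
// non-serializable is captured from the driver.
val pathExists = udf { location: String =>
  val fs = FileSystem.get(new Configuration())
  if (fs.exists(new Path(location))) "Y" else "N"
}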
I have seen many posts about this serialization error, but I am new to this.
There is a DataFrame, modProductsData, and a Map, L2L3Map. I want to replace the values in the column PRIMARY_CATEGORY with the corresponding values from L2L3Map.
val L2L3Map = L2.collect.map(row => (row.get(0).toString, row.get(1).toString)).toMap
val L2L3MapUDF = udf { s: String => L2L3Map.get(s) }
val productsData = spark.read.format("solr").options(readFromSourceClusterOpts).load
var modProductsData = productsData.withColumn("Prime_L2_s",
  when(col("PRIMARY_CATEGORY").isNotNull,
    when(col("PRIMARY_CATEGORY").isin(L3ids:_*), L2L3MapUDF(col("PRIMARY_CATEGORY")))
      .otherwise(
        when(col("PRIMARY_CATEGORY").isin(L2ids:_*), col("PRIMARY_CATEGORY"))
          .otherwise(lit(null))))
    .otherwise(lit(null)))
Below is more of the error log:
java.lang.RuntimeException: org.apache.spark.SparkException: Task not serializable
at solr.DefaultSource.createRelation(DefaultSource.scala:31)
at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:518)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:215)
... 89 elided
Caused by: org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2101)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:841)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:840)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:840)
at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:371)
at
It worked with the code below, presumably because the UDF now closes over only the map returned by mapPhantom rather than the enclosing, non-serializable object. :)
def mapPhantom(flagMap: Map[String, String]): (String) => String = {
(id: String) =>
{
flagMap.getOrElse(id,null)
}
}
val L2L3Map = L2.collect.map(row => (row.get(0).toString, row.get(1).toString)).toMap
val L2L3UDF = udf(mapPhantom(L2L3Map))
var modProductsData = productsData.withColumn("Prime_L2_s",
  when(col("PRIMARY_CATEGORY").isNotNull,
    when(col("PRIMARY_CATEGORY").isin(L3ids:_*), L2L3UDF(col("PRIMARY_CATEGORY")))
      .otherwise(
        when(col("PRIMARY_CATEGORY").isin(L2ids:_*), col("PRIMARY_CATEGORY"))
          .otherwise(lit(null))))
    .otherwise(lit(null)))
Below is part of my spark job:
def parse(evt: Event): String = {
try {
val config = new java.util.HashMap[java.lang.String, AnyRef] // Line1
config.put("key", "value") // Line2
val decoder = new DeserializerHelper(config, classOf[GenericRecord]) // Line3
val payload = decoder.deserializeData(evt.getId, evt.toBytes)
val record = payload.get("data")
record.toString
} catch {
case e :Exception => "exception:" + e.toString
}
}
try {
val inputStream = KafkaUtils.createDirectStream(
ssc,
PreferConsistent,
Subscribe[String, String](Array(inputTopic), kafkaParams)
)
val processedStream = inputStream.map(record => parse(record.value()))
processedStream.print()
} finally {
}
If I move Line1-Line3 in the code above outside the parse() function, I get:
Caused by: java.io.NotSerializableException: SchemaDeserializerHelper
Serialization stack:
- object not serializable (class: SchemaDeserializerHelper, value: SchemaDeserializerHelper@2e23c180)
- field (class: App$$anonfun$1, name: decoder$1, type: class SchemaDeserializerHelper)
- object (class App$$anonfun$1, <function1>)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:342)
... 22 more
Why? I would prefer not to put Line1-Line3 inside the parse() function; how can I optimize this?
Thanks
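One common workaround, sketched below, is to hold the decoder in a lazily initialized singleton object, so it is built once per executor JVM rather than once per record and is never captured by the closure. (DecoderHolder is a hypothetical helper; DeserializerHelper, Event and the config values are taken from the snippet above.)
object DecoderHolder {
  // Built lazily on first use, on whichever JVM touches it, so it is never serialized
  lazy val decoder: DeserializerHelper = {
    val config = new java.util.HashMap[java.lang.String, AnyRef]
    config.put("key", "value")
    new DeserializerHelper(config, classOf[GenericRecord])
  }
}

def parse(evt: Event): String =
  try {
    val payload = DecoderHolder.decoder.deserializeData(evt.getId, evt.toBytes)
    payload.get("data").toString
  } catch {
    case e: Exception => "exception:" + e.toString
  }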
I have run into a wall with a Task not serializable error while trying to break a Spark application into classes and also use Try.
The code pulls the schema from S3 and does a streaming read from Kafka (the topic is in Avro format with a schema registry).
I have tried using the class and not using the class, but in both cases I get a serialization error relating to a closure, so I guess something extra is being pulled in when Spark tries to serialize it. This error always haunts me and is a huge pain to get around. If someone could shed some light on how I can avoid this issue, that would be awesome. These Java classes seem to have more issues than they are worth sometimes.
import java.util.Properties
import com.databricks.spark.avro._
import io.confluent.kafka.schemaregistry.client.rest.RestService
import io.confluent.kafka.serializers.{AbstractKafkaAvroSerDeConfig, KafkaAvroDecoder, KafkaAvroDeserializerConfig}
import org.apache.avro.Schema
import org.apache.avro.generic.GenericData
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.streaming.StreamingQuery
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.{DataFrame, SparkSession}
import scala.util.{Failure, Success, Try}
case class DeserializedFromKafkaRecord(value: String)
class sparkS3() extends Serializable {
def readpeopleSchemaDF(spark: SparkSession, topicSchemaLocation: String): scala.util.Try[StructType] = {
val read: scala.util.Try[StructType] = Try(
spark
.read
.option("header", "true")
.format("com.databricks.spark.avro")
.load(topicSchemaLocation)
.schema
)
read
}
def writeTopicDF(peopleDFstream: DataFrame,
peopleDFstreamCheckpoint: String,
peopleDFstreamLocation: String): scala.util.Try[StreamingQuery] = {
val write: scala.util.Try[StreamingQuery] = Try(
peopleDFstream
.writeStream
.option("checkpointLocation", peopleDFstreamCheckpoint)
.format("com.databricks.spark.avro")
.option("path", peopleDFstreamLocation)
.start()
)
write
}
}
class sparkKafka() extends Serializable {
def readpeopleTopicDF(spark: SparkSession, topicSchema: StructType): scala.util.Try[DataFrame] = {
val brokers = "URL:9092"
val schemaRegistryURL = "URL:8081"
val kafkaParams = Map[String, String](
"kafka.bootstrap.servers" -> brokers,
"key.deserializer" -> "KafkaAvroDeserializer",
"value.deserializer" -> "KafkaAvroDeserializer",
"group.id" -> "structured-kafka",
//"auto.offset.reset" -> "latest",
"failOnDataLoss" -> "false",
"schema.registry.url" -> schemaRegistryURL
)
var kafkaTopic = "people"
object avroDeserializerWrapper {
val props = new Properties()
props.put(AbstractKafkaAvroSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG, schemaRegistryURL)
props.put(KafkaAvroDeserializerConfig.SPECIFIC_AVRO_READER_CONFIG, "true")
val vProps = new kafka.utils.VerifiableProperties(props)
val deser = new KafkaAvroDecoder(vProps)
val avro_schema = new RestService(schemaRegistryURL).getLatestVersion(kafkaTopic + "-value")
val messageSchema = new Schema.Parser().parse(avro_schema.getSchema)
}
import spark.implicits._
val read: scala.util.Try[DataFrame] = Try(
{
val peopleStringDF = {
spark
.readStream
.format("kafka")
.option("subscribe", kafkaTopic)
.option("kafka.bootstrap.servers", brokers)
.options(kafkaParams)
.load()
.map(x => {
DeserializedFromKafkaRecord(avroDeserializerWrapper.deser.fromBytes(
x
.getAs[Array[Byte]]("value"), avroDeserializerWrapper.messageSchema)
.asInstanceOf[GenericData.Record].toString)
})
}
val peopleJsonDF = {
peopleStringDF
.select(
from_json(col("value")
.cast("string"), topicSchema)
.alias("people"))
}
peopleJsonDF.select("people.*")
})
read
}
}
object peopleDataLakePreprocStage1 {
def main(args: Array[String]): Unit = {
val spark = SparkSession
.builder
.appName("peoplePreProcConsumerStage1")
.getOrCreate()
val topicSchemaLocation = "URL"
val topicDFstreamCheckpoint = "URL"
val topicDFstreamLocation = "URL"
val sparkKafka = new sparkKafka()
val sparkS3 = new sparkS3()
sparkS3.readpeopleSchemaDF(spark, topicSchemaLocation) match {
case Success(topicSchema) => {
sparkKafka.readpeopleTopicDF(spark, topicSchema) match {
case Success(df) => {
sparkS3.writeTopicDF(df, topicDFstreamCheckpoint, topicDFstreamLocation) match {
case Success(query) => {
query.awaitTermination()
}
case Failure(f) => println(f)
}
}
case Failure(f) => println(f)
}
}
case Failure(f) => println(f)
}
}
}
Here is the error
java.lang.IllegalStateException: s3a://... when compacting batch 9 (compactInterval: 10)
at org.apache.spark.sql.execution.streaming.CompactibleFileStreamLog$$anonfun$4$$anonfun$apply$1.apply(CompactibleFileStreamLog.scala:174)
at org.apache.spark.sql.execution.streaming.CompactibleFileStreamLog$$anonfun$4$$anonfun$apply$1.apply(CompactibleFileStreamLog.scala:174)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.execution.streaming.CompactibleFileStreamLog$$anonfun$4.apply(CompactibleFileStreamLog.scala:173)
at org.apache.spark.sql.execution.streaming.CompactibleFileStreamLog$$anonfun$4.apply(CompactibleFileStreamLog.scala:172)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.NumericRange.foreach(NumericRange.scala:73)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at org.apache.spark.sql.execution.streaming.CompactibleFileStreamLog.compact(CompactibleFileStreamLog.scala:172)
at org.apache.spark.sql.execution.streaming.CompactibleFileStreamLog.add(CompactibleFileStreamLog.scala:156)
at org.apache.spark.sql.execution.streaming.ManifestFileCommitProtocol.commitJob(ManifestFileCommitProtocol.scala:64)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:213)
at org.apache.spark.sql.execution.streaming.FileStreamSink.addBatch(FileStreamSink.scala:123)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3$$anonfun$apply$16.apply(MicroBatchExecution.scala:477)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3.apply(MicroBatchExecution.scala:475)
at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:474)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:133)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:121)
at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:117)
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:279)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189)
18/08/10 13:04:07 ERROR MicroBatchExecution: Query [id = 2876ded4-f223-40c4-8634-0c8feec94bf6, runId = 9b9a1347-7a80-4295-bb6e-ff2de18eeaf4] terminated with error
org.apache.spark.SparkException: Job aborted.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:224)
at org.apache.spark.sql.execution.streaming.FileStreamSink.addBatch(FileStreamSink.scala:123)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3$$anonfun$apply$16.apply(MicroBatchExecution.scala:477)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3.apply(MicroBatchExecution.scala:475)
at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:474)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:133)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:121)
at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:117)
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:279)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189)
Caused by: java.lang.IllegalStateException: s3a://..../_spark_metadata/0 doesn't exist when compacting batch 9 (compactInterval: 10)
at org.apache.spark.sql.execution.streaming.CompactibleFileStreamLog$$anonfun$4$$anonfun$apply$1.apply(CompactibleFileStreamLog.scala:174)
at org.apache.spark.sql.execution.streaming.CompactibleFileStreamLog$$anonfun$4$$anonfun$apply$1.apply(CompactibleFileStreamLog.scala:174)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.execution.streaming.CompactibleFileStreamLog$$anonfun$4.apply(CompactibleFileStreamLog.scala:173)
at org.apache.spark.sql.execution.streaming.CompactibleFileStreamLog$$anonfun$4.apply(CompactibleFileStreamLog.scala:172)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.NumericRange.foreach(NumericRange.scala:73)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at org.apache.spark.sql.execution.streaming.CompactibleFileStreamLog.compact(CompactibleFileStreamLog.scala:172)
at org.apache.spark.sql.execution.streaming.CompactibleFileStreamLog.add(CompactibleFileStreamLog.scala:156)
at org.apache.spark.sql.execution.streaming.ManifestFileCommitProtocol.commitJob(ManifestFileCommitProtocol.scala:64)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:213)
... 17 more
The resolution was one of two things (or both): extending Serializable on the classes, and putting them in separate files in the same namespace. I have updated the code above to reflect this.
Just a stab. In class sparkS3 you are using 'var' to define those values - did you mean 'val'?
I have a requirement to create my own UnaryTransformer instance that accepts a DataFrame column of type Array[String] and outputs the same type. While trying to do so, I encountered a ClassCastException on Spark 2.1.0.
I've put together a sample test that shows my case.
import org.apache.spark.SparkConf
import org.apache.spark.ml.UnaryTransformer
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{ArrayType, DataType, StringType}
class MyTransformer(override val uid:String) extends UnaryTransformer[Array[String],Array[String],MyTransformer] {
override protected def createTransformFunc: (Array[String]) => Array[String] = {
param1 => {
param1.foreach(println(_))
param1
}
}
override protected def outputDataType: DataType = ArrayType(StringType)
override protected def validateInputType(inputType: DataType): Unit = {
require(inputType == ArrayType(StringType), s"Data type mismatch between Array[String] and provided type $inputType.")
}
def this() = this( Identifiable.randomUID("tester") )
}
object Tester {
def main(args: Array[String]): Unit = {
val config = new SparkConf().setAppName("Tester")
implicit val sparkSession = SparkSession.builder().config(config).getOrCreate()
import sparkSession.implicits._
val dataframe = Seq(Array("Firstly" , "F1"),Array("Driving" , "S1" ),Array("Ran" , "T3" ),Array("Fourth" ,"F4"), Array("Running" , "F5")
,Array("Gone" , "S6")).toDF("input")
val transformer = new MyTransformer().setInputCol("input").setOutputCol("output")
val transformed = transformer.transform(dataframe)
transformed.select("output").show()
println("Complete....")
sparkSession.close()
}
}
Attaching the stack trace for reference
Exception in thread "main" org.apache.spark.SparkException: Failed to
execute user defined function($anonfun$createTransformFunc$1:
(array) => array) at
org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1072)
at
org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:144)
at
org.apache.spark.sql.catalyst.expressions.InterpretedProjection.apply(Projection.scala:48)
at
org.apache.spark.sql.catalyst.expressions.InterpretedProjection.apply(Projection.scala:30)
at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:392) at
scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:296) at
org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation$$anonfun$apply$21.applyOrElse(Optimizer.scala:1078)
at
org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation$$anonfun$apply$21.applyOrElse(Optimizer.scala:1073)
at
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:288)
at
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:288)
at
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:287)
at
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:293)
at
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:293)
at
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:331)
at
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
at
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:329)
at
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:293)
at
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:293)
at
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:293)
at
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:331)
at
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
at
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:329)
at
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:293)
at
org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:277)
at
org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation$.apply(Optimizer.scala:1073)
at
org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation$.apply(Optimizer.scala:1072)
at
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
at
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
at
scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
at
scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
at
scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:35)
at
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
at
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
at scala.collection.immutable.List.foreach(List.scala:392) at
org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74)
at
org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:73)
at
org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:73)
at
org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:79)
at
org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:75)
at
org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:84)
at
org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:84)
at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2791)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2112) at
org.apache.spark.sql.Dataset.take(Dataset.scala:2327) at
org.apache.spark.sql.Dataset.showString(Dataset.scala:248) at
org.apache.spark.sql.Dataset.show(Dataset.scala:636) at
org.apache.spark.sql.Dataset.show(Dataset.scala:595) at
org.apache.spark.sql.Dataset.show(Dataset.scala:604) at
Tester$.main(Tester.scala:45) at Tester.main(Tester.scala)
Caused by: java.lang.ClassCastException: scala.collection.mutable.WrappedArray$ofRef cannot be cast to
[Ljava.lang.String; at
MyTransformer$$anonfun$createTransformFunc$1.apply(Tester.scala:9)
at
org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:89)
at
org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:88)
at
org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1069)
... 53 more
ArrayType is represented as Seq (a WrappedArray at runtime), not Array:
override protected def createTransformFunc: (Seq[String]) => Seq[String] = {
param1 => {
param1.foreach(println(_))
param1
}
}
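Applied to the transformer above, only the element types in the UnaryTransformer signature and the transform function need to change. A sketch of the corrected class, everything else kept as in the original:
class MyTransformer(override val uid: String)
  extends UnaryTransformer[Seq[String], Seq[String], MyTransformer] {

  def this() = this(Identifiable.randomUID("tester"))

  // The column arrives as Seq[String] (a WrappedArray at runtime), not Array[String]
  override protected def createTransformFunc: Seq[String] => Seq[String] = {
    param1 => {
      param1.foreach(println(_))
      param1
    }
  }

  override protected def outputDataType: DataType = ArrayType(StringType)

  override protected def validateInputType(inputType: DataType): Unit = {
    require(inputType == ArrayType(StringType), s"Data type mismatch between Array[String] and provided type $inputType.")
  }
}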