I want to save a DStream to HDFS in Parquet format. The problem is that my case class uses joda.DateTime, which Spark SQL doesn't support. For example:
case class Log (timestamp: DateTime, ...dozen of other fields here...)
But I get the error java.lang.UnsupportedOperationException: Schema for type org.joda.time.DateTime is not supported when trying to convert the RDD to a DataFrame:
def output(logdstream: DStream[Log]) {
  logdstream.foreachRDD(elem => {
    val df = elem.toDF()
    df.saveAsParquet(...)
  });
}
My models are complex and have a lot of fields, so I don't want to write separate case classes just to get rid of joda.DateTime. Another option would be to save directly from JSON to Parquet, but that's not ideal. Is there an easy way to automatically convert joda.DateTime to java.sql.Timestamp so that the data can be converted to a Spark DataFrame?
Thanks.
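It's a little verbose, but you can try mapping Log to a Spark SQL Row:
logdstream.foreachRDD(rdd => {
  val rows = rdd.map(log => Row(
    new java.sql.Timestamp(log.timestamp.getMillis), // Spark SQL expects java.sql.Timestamp
    log.field2,
    ...
  ))
  // an RDD[Row] needs an explicit schema (a StructType matching the Row fields)
  sqlContext.createDataFrame(rows, schema).write.parquet(...)
})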
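For reference, here is a minimal, self-contained sketch of the Row-based approach. It rests on several assumptions: Log is reduced to a hypothetical two-field class, the schema field names and the basePath parameter are made up for illustration, and a Spark version with DataFrameWriter (1.4 or later) is assumed. As far as I know, an RDD[Row] has to be paired with an explicit StructType, and timestamp columns have to hold java.sql.Timestamp values.

```scala
import java.sql.Timestamp

import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{StringType, StructField, StructType, TimestampType}
import org.apache.spark.streaming.dstream.DStream
import org.joda.time.DateTime

object JodaLogToParquet {
  // Hypothetical two-field stand-in for the real Log class.
  case class Log(timestamp: DateTime, message: String)

  // Explicit schema for the rows; the field names are assumptions.
  val logSchema: StructType = StructType(Seq(
    StructField("timestamp", TimestampType, nullable = false),
    StructField("message", StringType, nullable = true)))

  // Convert the Joda value to java.sql.Timestamp via epoch milliseconds.
  def toRow(log: Log): Row =
    Row(new Timestamp(log.timestamp.getMillis), log.message)

  // basePath is a hypothetical output directory; each batch is written
  // to its own subdirectory named after the batch time.
  def output(sqlContext: SQLContext, logdstream: DStream[Log], basePath: String): Unit =
    logdstream.foreachRDD { (rdd, time) =>
      val df = sqlContext.createDataFrame(rdd.map(toRow), logSchema)
      df.write.parquet(s"$basePath/batch-${time.milliseconds}")
    }
}
```

Note that getMillis keeps the instant but drops the Joda time zone, so store the zone in a separate column if you need it later.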
Related
I have a long XML string that I am converting to JSON for easier processing in Spark, but I am running into issues with automatic schema inference. What is an efficient way to convert a Dataset xmlStringData into a Dataset with a structure? Should I define a schema using StructType and read the data back as a Dataset<Row>, as shown below:
Dataset<Row> jsonDatset = sparkSession.read().schema(schema).json(xmlStringData );
OR
Dataset<myClass> jsonDataset = xmlStringData.map((MapFunction<Row, String>) xmlRow -> {
return new myClass(xmlRow);
}, myClassEncode);
What is the difference in later processing going either route?
All I need to do later is process the data and save to CSV.
thank you
I am trying to convert a DataFrame built from multiple case classes back to an RDD of those case classes. I can't find any solution, and this WrappedArray has driven me crazy :P
For example, assume I have the following:
case class randomClass(a:String,b: Double)
case class randomClass2(a:String,b: Seq[randomClass])
case class randomClass3(a:String,b:String)
val anRDD = sc.parallelize(Seq(
(randomClass2("a",Seq(randomClass("a1",1.1),randomClass("a2",1.1))),randomClass3("aa","aaa")),
(randomClass2("b",Seq(randomClass("b1",1.2),randomClass("b2",1.2))),randomClass3("bb","bbb")),
(randomClass2("c",Seq(randomClass("c1",3.2),randomClass("c2",1.2))),randomClass3("cc","Ccc"))))
val aDF = anRDD.toDF()
Given aDF, how can I get anRDD back?
I tried something like this just to get the second column but it was giving an error:
aDF.map { case r:Row => r.getAs[randomClass3]("_2")}
You can convert indirectly using Dataset[randomClass3]:
aDF.select($"_2.*").as[randomClass3].rdd
A Spark DataFrame / Dataset[Row] represents data as Row objects, using the mapping described in the Spark SQL, DataFrames and Datasets Guide. Any call to getAs should use this mapping.
For the second column, which is struct<a: string, b: string>, it would be a Row as well:
aDF.rdd.map { _.getAs[Row]("_2") }
As commented by Tzach Zohar, to get back the full RDD you'll need:
aDF.as[(randomClass2, randomClass3)].rdd
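If you prefer to stay on the RDD API, the nested Row can also be turned back into the case class by hand. The sketch below assumes the struct column is named "_2" and has string fields "a" and "b", as in the question; the helper names toRandomClass3 and secondColumn are made up for illustration.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row}

object StructColumnToCaseClass {
  case class randomClass3(a: String, b: String)

  // Rebuild the case class from the nested Row, looking fields up by name.
  def toRandomClass3(r: Row): randomClass3 =
    randomClass3(r.getAs[String]("a"), r.getAs[String]("b"))

  // Assumes aDF has a struct column "_2" with string fields "a" and "b".
  def secondColumn(aDF: DataFrame): RDD[randomClass3] =
    aDF.rdd.map(row => toRandomClass3(row.getAs[Row]("_2")))
}
```

This is more code than the as[...] conversion, but it makes the Row-to-class mapping explicit, which can help when only some fields are needed.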
I don't know the Scala API, but have you considered the rdd value?
Maybe something like:
aDF.rdd.map { case r: Row => r.getAs[randomClass3]("_2") }
I am trying to use Structured Streaming in Spark 2.1.1 to read from Kafka and decode Avro-encoded messages. I have a UDF defined as per this question.
val sr = new CachedSchemaRegistryClient(conf.kafkaSchemaRegistryUrl, 100)
val deser = new KafkaAvroDeserializer(sr)
val decodeMessage = udf { bytes:Array[Byte] => deser.deserialize("topic.name", bytes).asInstanceOf[DeviceRead] }
val topic = conf.inputTopic
val df = session
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", conf.kafkaServers)
  .option("subscribe", topic)
  .load()
df.printSchema()
val result = df.selectExpr("CAST(key AS STRING)", """decodeMessage($"value") as "value_des"""")
val query = result.writeStream
  .format("console")
  .outputMode(OutputMode.Append())
  .start()
However I get the following failure.
Exception in thread "main" java.lang.UnsupportedOperationException: Schema for type DeviceRelayStateEnum is not supported
It fails on this line
val decodeMessage = udf { bytes:Array[Byte] => deser.deserialize("topic.name", bytes).asInstanceOf[DeviceRead] }
An alternative approach was to define encoders for my custom classes:
implicit val enumEncoder = Encoders.javaSerialization[DeviceRelayStateEnum]
implicit val messageEncoder = Encoders.product[DeviceRead]
but that fails with the following error when the messageEncoder is getting registered.
Exception in thread "main" java.lang.UnsupportedOperationException: No Encoder found for DeviceRelayStateEnum
- option value class: "DeviceRelayStateEnum"
- field (class: "scala.Option", name: "deviceRelayState")
- root class: "DeviceRead"
at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:602)
at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:476)
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$9.apply(ScalaReflection.scala:596)
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$9.apply(ScalaReflection.scala:587)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
When I attempt to do this using a map after the load() I get the following compilation error.
val result = df.map((bytes: Row) => deser.deserialize("topic", bytes.getAs[Array[Byte]]("value")).asInstanceOf[DeviceRead])
Error:(76, 26) not enough arguments for method map: (implicit evidence$6: org.apache.spark.sql.Encoder[DeviceRead])org.apache.spark.sql.Dataset[DeviceRead].
Unspecified value parameter evidence$6.
val result = df.map((bytes: Row) => deser.deserialize("topic", bytes.getAs[Array[Byte]]("value")).asInstanceOf[DeviceRead])
Error:(76, 26) Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
val result = df.map((bytes: Row) => deser.deserialize("topic", bytes.getAs[Array[Byte]]("value")).asInstanceOf[DeviceRead])
Does that essentially mean that I cannot use Structured Streaming for Java enums? And it can only be used with either primitives or case classes?
I read a few related questions (1, 2, 3) about this, and it seems that the ability to specify a custom Encoder for a class (i.e. a UDT) was removed in 2.1 and the new functionality has not been added.
Any help will be appreciated.
I think you may be asking for too much of the current version of Structured Streaming (and Spark SQL in general).
I haven't yet fully worked out how to deal with missing encoders in a cleaner way, but you'd hit the same issue if you tried to create a Dataset of enums. That may simply not be supported yet.
Structured Streaming is just a streaming library on top of Spark SQL and uses it for serialization-deserialization (SerDe).
To cut a long story short and to get you going (until you figure out a better way), I'd recommend avoiding enums in the business objects you use to represent the schema of your datasets.
So I'd recommend doing something along these lines:
val decodeMessage = udf { bytes: Array[Byte] =>
  val dr = deser.deserialize("topic.name", bytes).asInstanceOf[DeviceRead]
  // do additional transformation here so you use a custom streaming-specific class
  // Here I'm using a simple tuple to hold what might be relevant
  // You could create a case class instead to have proper names
  (dr.id, dr.value)
}
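The case-class variant mentioned in the comments could look roughly like the sketch below. Everything in it is an assumption for illustration: the real DeviceRead and DeviceRelayStateEnum are not shown in the question, so java.util.concurrent.TimeUnit stands in for the enum, and the field types and the names DeviceReadFlat and flatten are hypothetical. The idea is only to replace the enum with its name as a String, which Spark SQL can encode.

```scala
import java.util.concurrent.TimeUnit

object FlattenEnum {
  // Hypothetical stand-in for the Avro-generated class;
  // TimeUnit plays the role of DeviceRelayStateEnum here.
  case class DeviceRead(id: String, value: Double, deviceRelayState: Option[TimeUnit])

  // Streaming-specific class with only types Spark SQL has encoders for.
  case class DeviceReadFlat(id: String, value: Double, deviceRelayState: Option[String])

  // Replace the enum with its name; None stays None.
  def flatten(dr: DeviceRead): DeviceReadFlat =
    DeviceReadFlat(dr.id, dr.value, dr.deviceRelayState.map(_.name()))
}
```

Returning flatten(dr) from the UDF instead of the tuple gives the resulting struct column named fields, and the enum can be recovered later with valueOf on the stored name if needed.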
I'm trying to write a DataFrame from Spark to Kafka and I couldn't find any solution out there. Can you please show me how to do that?
Here is my current code:
activityStream.foreachRDD { rdd =>
  val activityDF = rdd
    .toDF()
    .selectExpr(
      "timestamp_hour", "referrer", "action",
      "prevPage", "page", "visitor", "product", "inputProps.topic as topic")
  val producerRecord = new ProducerRecord(topicc, activityDF)
  kafkaProducer.send(producerRecord) // <--- this shows an error
}
type mismatch; found : org.apache.kafka.clients.producer.ProducerRecord[Nothing,org.apache.spark.sql.DataFrame] (which expands to) org.apache.kafka.clients.producer.ProducerRecord[Nothing,org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]] required: org.apache.kafka.clients.producer.ProducerRecord[Nothing,String] Error occurred in an application involving default arguments.
Do collect on the activityDF to get the records (not Dataset[Row]) and save them to Kafka.
Note that you'll end up with a collection of records after collect so you probably have to iterate over it, e.g.
val activities = activityDF.collect()
// the following is pure Scala and has nothing to do with Spark
activities.foreach { a: Row =>
  val pr: ProducerRecord = // map a to pr
  kafkaProducer.send(pr)
}
Use pattern matching on Row to destructure it to fields/columns, e.g.
activities.foreach { case Row(timestamp_hour, referrer, action, prevPage, page, visitor, product, topic) =>
  // ...transform a to ProducerRecord
  kafkaProducer.send(pr)
}
PROTIP: I'd strongly suggest using a case class and transforming the DataFrame (= Dataset[Row]) to Dataset[YourCaseClass].
See Spark SQL's Row and Kafka's ProducerRecord docs.
As Joe Nate pointed out in the comments:
If you do "collect" before writing to any endpoint, it's going to make all the data aggregate at the driver and then make the driver write it out. 1) Can crash the driver if too much data (2) no parallelism in write.
That's 100% correct. I wish I had said it :)
You may want to use the approach as described in Writing Stream Output to Kafka instead.
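If you stay with foreachRDD, one way to avoid collect is to send from the executors, one producer per partition. This is only a sketch under assumptions: Spark 2.x (where toJSON returns a Dataset[String]), the kafka-clients producer API, and each row serialized as a JSON string, which may not be the payload format you want. The names KafkaPartitionWriter, writeToKafka, brokers and topic are made up for illustration.

```scala
import java.util.Properties

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.sql.DataFrame

object KafkaPartitionWriter {
  // Sends every row of df to the given topic as a JSON string.
  // brokers is a bootstrap server list such as "host1:9092,host2:9092".
  def writeToKafka(df: DataFrame, brokers: String, topic: String): Unit =
    df.toJSON.rdd.foreachPartition { rows: Iterator[String] =>
      // The producer is created on the executor, once per partition,
      // so it never has to be serialized from the driver.
      val props = new Properties()
      props.put("bootstrap.servers", brokers)
      props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      val producer = new KafkaProducer[String, String](props)
      try {
        rows.foreach(json => producer.send(new ProducerRecord[String, String](topic, json)))
      } finally {
        producer.close() // flushes pending records before closing
      }
    }
}
```

Inside foreachRDD you would call KafkaPartitionWriter.writeToKafka(activityDF, brokers, topic). The data stays distributed, so the driver neither holds all records nor becomes the single writer.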
I can collect a column like this using the RDD API.
df.map(r => r.getAs[String]("column")).collect
However, since I start with a Dataset, I would rather not switch API levels. A simple df.select("column").collect returns an Array[Row], on which the .flatten operator no longer works.
How can I collect to an Array[T] (e.g. Array[String]) directly?
With Datasets (Spark version >= 2.0.0), you just need to convert the DataFrame to a typed Dataset and then collect it.
df.select("column").as[String].collect()
This returns an Array[String].
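The as[String] call needs an Encoder[String] in scope, which normally comes from import spark.implicits._. A small sketch that passes the encoder explicitly instead, so it works without the implicits import (the helper name collectStrings is made up, and the column is assumed to be of string type):

```scala
import org.apache.spark.sql.{DataFrame, Encoders}

object CollectColumn {
  // Collects a single string column into an Array[String].
  // Assumes the named column exists and has StringType.
  def collectStrings(df: DataFrame, column: String): Array[String] =
    df.select(column).as[String](Encoders.STRING).collect()
}
```

For example, CollectColumn.collectStrings(df, "column") gives the same result as the one-liner above.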