Spark SQL : custom Hive UDF GenericInternalRow cannot be cast ArrayData

Spark SQL : custom Hive UDF GenericInternalRow cannot be cast ArrayData - apache-spark

I'm using Spark 1.6 with Scala and R (throught SparkR and SparkLyr)
I have a dataframe containing binary data representing a Double 2D array. I want to deserialize binary data with an Hive UDF (for compatibility with R), but Spark crash with the error :
java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast to org.apache.spark.sql.catalyst.util.ArrayData
at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getArray(rows.scala:48)
at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getArray(rows.scala:221)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toScalaImpl(CatalystTypeConverters.scala:190)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toScalaImpl(CatalystTypeConverters.scala:153)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toScala(CatalystTypeConverters.scala:110)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toScala(CatalystTypeConverters.scala:283)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toScala(CatalystTypeConverters.scala:250)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToScalaConverter$2.apply(CatalystTypeConverters.scala:414)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeCollectPublic$1.apply(SparkPlan.scala:174)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeCollectPublic$1.apply(SparkPlan.scala:174)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174)
at org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499)
at org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
at org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2086)
at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1498)
at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1505)
at org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1375)
at org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1374)
at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2099)
at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1374)
at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1456)
at org.apache.spark.sql.DataFrame.showString(DataFrame.scala:170)
at org.apache.spark.sql.DataFrame.show(DataFrame.scala:350)
at org.apache.spark.sql.DataFrame.show(DataFrame.scala:311)
at org.apache.spark.sql.DataFrame.show(DataFrame.scala:319)
Here is the UDF class :
class DeserBytesTo2DArrayDouble extends UDF {
def evaluate(input: BytesWritable): Array[Array[Double]] = {
if (input == null) return null
val res = SerializationUtils.deserializeFromByteArray(input.getBytes,
classOf[Array[Array[Double]]])
logger.info("Deserialized data : {}",res)
res
}
}
And an example for using it :
val data = Array(Array(1d,2d,3d))
val bytArr = SerializationUtils.serializeToByteArray(data)
val df = List(("a", bytArr)).toDF("ID", "DATAB")
df.registerTempTable("toto")
sqlContext.sql("CREATE TEMPORARY FUNCTION b2_2darrD as 'package.to.DeserBytesTo2DArrayDouble'")
sqlContext.sql("select id, b2_2darrD(datab) from toto").show

Related

Value Type is binary after Spark Dataset mapGroups operation even return a String in the function

Environment:
Spark version: 2.3.0
Run Mode: Local
Java version: Java 8
The spark application trys to do the following
1) Convert input data into a Dataset[GenericRecord]
2) Group by the key propery of the GenericRecord
3) Using mapGroups after group to iterate the value list and get some result in String format
4) Output the result as String in text file.
The error happens when writing to text file. Spark deduced that the Dataset generated in step 3 has a binary column, not a String column. But actually it returns a String in the mapGroups function.
Is there a way to do the column data type convertion or let Spark knows that it is actually a string column not binary?
val dslSourcePath = args(0)
val filePath = args(1)
val targetPath = args(2)
val df = spark.read.textFile(filePath)
implicit def kryoEncoder[A](implicit ct: ClassTag[A]): Encoder[A] = Encoders.kryo[A](ct)
val mapResult = df.flatMap(abc => {
JavaConversions.asScalaBuffer(some how return a list of Avro GenericRecord using a java library).seq;
})
val groupResult = mapResult.groupByKey(result => String.valueOf(result.get("key")))
.mapGroups((key, valueList) => {
val result = StringBuilder.newBuilder.append(key).append(",").append(valueList.count(_=>true))
result.toString()
})
groupResult.printSchema()
groupResult.write.text(targetPath + "-result-" + System.currentTimeMillis())
And the output said it is a bin
root
|-- value: binary (nullable = true)
Spark gives out an error that it can't write binary as text:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Text data source supports only a string column, but you have binary.;
at org.apache.spark.sql.execution.datasources.text.TextFileFormat.verifySchema(TextFileFormat.scala:55)
at org.apache.spark.sql.execution.datasources.text.TextFileFormat.prepareWrite(TextFileFormat.scala:78)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:140)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:154)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:654)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:273)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:267)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:225)
at org.apache.spark.sql.DataFrameWriter.text(DataFrameWriter.scala:595)

As #user10938362 said, the reason is the following code will encode all data to bytes
implicit def kryoEncoder[A](implicit ct: ClassTag[A]): Encoder[A] = Encoders.kryo[A](ct)
Replacing it with the following code will just enable this encoding for GenericRecord
implicit def kryoEncoder: Encoder[GenericRecord] = Encoders.kryo

Streaming avro files from a directory

I'm trying to set up a structured stream from a directory of Avro files. We already have some non-streaming code to deal with exact the same data, so the least-effort step forward to streaming would be to re-use that code.
To move to StructuredStreaming, I tried the following, which works in the non-streaming manner (using read in stead of readStream) but gives me a serialization error in the streaming approach.
import com.databricks.spark.avro._
import org.apache.avro._
import org.apache.spark.sql.types._
import com.databricks.spark.avro._
val schemaStr = """ {our_schema_here} """
val parser = new Schema.Parser()
val avroSchema = parser.parse(schemaStr)
val structType = SchemaConverters.toSqlType(avroSchema).dataType match {
case t: StructType => Some(t)
case _ => throw new RuntimeException(
s"""Avro schema cannot be converted to a Spark SQL StructType:
|
|${avroSchema.toString(true)}
|""".stripMargin)
}
val path = "dbfs://path/to/avro/files/*"
val avroStream = sqlContext
.readStream
.schema(structType.get)
.format("com.databricks.spark.avro")
.option("maxFilesPerTrigger", 5)
.load(path)
.writeStream
.outputMode("append")
.format("memory")
.queryName("counts")
.start()
The exception I get is shown below. Note, I can't get the full stacktrace as I'm on Databricks and get access the executors logs. I'm a bit at loss what exactly is the object that can't be serialized.
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2326)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2125)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:937)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.collect(RDD.scala:936)
at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:299)
at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:291)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:2966)
at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2456)
at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2456)
at org.apache.spark.sql.Dataset$$anonfun$57.apply(Dataset.scala:2950)
at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:80)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:99)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2949)
at org.apache.spark.sql.Dataset.collect(Dataset.scala:2456)
at org.apache.spark.sql.execution.streaming.MemorySink.addBatch(memory.scala:217)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(StreamExecution.scala:731)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$1$$anonfun$apply$mcV$sp$1.apply(StreamExecution.scala:731)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$1$$anonfun$apply$mcV$sp$1.apply(StreamExecution.scala:731)
at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:80)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:99)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$1.apply$mcV$sp(StreamExecution.scala:730)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$1.apply(StreamExecution.scala:730)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$1.apply(StreamExecution.scala:730)
at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:62)
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch(StreamExecution.scala:729)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(StreamExecution.scala:328)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$apply$mcZ$sp$1.apply(StreamExecution.scala:316)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$apply$mcZ$sp$1.apply(StreamExecution.scala:316)
at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:62)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1.apply$mcZ$sp(StreamExecution.scala:316)
at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches(StreamExecution.scala:312)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:226)
Caused by: java.io.NotSerializableException: scala.collection.immutable.MapLike$$anon$1
Serialization stack:
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
... 41 more

Convert org.apache.avro.generic.GenericRecord to org.apache.spark.sql.Row

I have list of org.apache.avro.generic.GenericRecord, avro schemausing this we need to create dataframe with the help of SQLContext API, to create dataframe it needs RDD of org.apache.spark.sql.Row and avro schema. Pre-requisite to create DF is we should have RDD of org.apache.spark.sql.Row and it can be achieved using below code but some how it is not working and giving error, sample code.
1. Convert GenericRecord to Row
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
import org.apache.avro.Schema
import org.apache.spark.sql.types.StructType
def convertGenericRecordToRow(genericRecords: Seq[GenericRecord], avroSchema: Schema, schemaType: StructType): Seq[Row] =
{
val fields = avroSchema.getFields
var rows = new Seq[Row]
for (avroRecord <- genericRecords) {
var avroFieldsSeq = Seq[Any]();
for (i <- 0 to fields.size - 1) {
avroFieldsSeq = avroFieldsSeq :+avroRecord.get(fields.get(i).name)
}
val avroFieldArr = avroFieldsSeq.toArray
val genericRow = new GenericRowWithSchema(avroFieldArr, schemaType)
rows = rows :+ genericRow
}
return rows;
}
2. Convert `Avro schema` to `Structtype`
Use `com.databricks.spark.avro.SchemaConverters -> toSqlType` function , it will convert avro schema to StructType
3. Create `Dataframe` using `SQLContext`
val rowSeq= convertGenericRecordToRow(genericRecords, avroSchema, schemaType)
val rowRdd = sc.parallelize(rowSeq, 1)
val finalDF =sqlContext.createDataFrame(rowRDD,structType)
But it is throwing an error at creation of DataFrame. Can someone please help me what is wrong in above code. Apart from this if someone has different logic for converting and creation of dataframe.
Whenever I will invoke any action on Dataframe, it will execute DAG and try to create DF object but in this it is failing with below exception as
ERROR TaskSetManager: Task 0 in stage 0.0 failed 4 times; aborting job
Error :Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, hdpoc-c01-r06-01, executor 1): java.io.InvalidClassException: org.apache.commons.lang3.time.FastDateFormat; local class incompatible: stream classdesc serialVersionUID = 2, local class serialVersionUID = 1
at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:617)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1622)
After this I am trying to give correct version jar in jar parameter of spark submit and with other parameter as --conf spark.driver.userClassPathFirst=true
but now it is failing with MapR as
ERROR CLDBRpcCommonUtils: Exception during init
java.lang.UnsatisfiedLinkError: com.mapr.security.JNISecurity.SetClusterOption(Ljava/lang/String;Ljava/lang/String;Ljava/lang/String;)
at com.mapr.security.JNISecurity.SetClusterOption(Native Method)
at com.mapr.baseutils.cldbutils.CLDBRpcCommonUtils.init(CLDBRpcCommonUtils.java:163)
at com.mapr.baseutils.cldbutils.CLDBRpcCommonUtils.<init>(CLDBRpcCommonUtils.java:73)
at com.mapr.baseutils.cldbutils.CLDBRpcCommonUtils.<clinit>(CLDBRpcCommonUtils.java:63)
at org.apache.hadoop.conf.CoreDefaultProperties.<clinit>(CoreDefaultProperties.java:69)
at java.lang.Class.forName0(Native Method)
We are using MapR distribution and after class path change in spark-submit, it is failing with above exception.
Can someone please help here or my basic need it to convert Avro GenericRecord into Spark Row so i can create Dataframe with it, please help
Thanks.

Maybe this helps somebody coming a bit later to the game.
Since spark-avro is deprecated and now integrated in Spark, there is a different way this can be accomplished.
import org.apache.spark.sql.avro._
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.catalyst.encoders.RowEncoder
...
val avroSchema = data.head.getSchema
val sparkTypes = SchemaConverters.toSqlType(avroSchema).dataType.asInstanceOf[StructType]
val converter = new AvroDeserializer(avroSchema, sparkTypes)
val enconder = RowEncoder.apply(sparkTypes).resolveAndBind()
val rows = data.map { record =>
enconder.fromRow(converter.deserialize(record).asInstanceOf[InternalRow])
}
val df = sparkSession.sqlContext.createDataFrame(sparkSession.sparkContext.parallelize(rows), sparkTypes)

While creating dataframe from RDD[GenericRecord] there are few steps
First need to convert org.apache.avro.generic.GenericRecord into org.apache.spark.sql.Row
Use com.databricks.spark.avro.SchemaConverters.createConverterToSQL(
sourceAvroSchema: Schema,targetSqlType: DataType)
this is private method in spark-avro 3.2 version. If we are having same or less than 3.2 then copy this method into your own util class and use it else directly use it.
Create Dataframe from collection of Row (rowSeq).
val rdd = ssc.sparkContext.parallelize(rowSeq,numParition) val
dataframe = sparkSession.createDataFrame(rowRDD, schemaType)
This resolves my problem.

Hopefully this will help. In the first part you can find how convert from GenericRecord to Row
How to convert RDD[GenericRecord] to dataframe in scala?

How to write null value from Spark sql expression of DataFrame to a database table? (IllegalArgumentException: Can't get JDBC type for null)

I receive the error java.lang.IllegalArgumentException: Can't get JDBC type for null when try to run the following
example:
...
val spark = SparkSession.builder
.master("local[*]")
.appName("Demo")
.detOrCreate()
import spark.implicits._
//load first table
val df_one = spark.read
.format("jdbc")
.option("url",myDbUrl)
.option("dbtable",myTableOne)
.option("user",myUser)
.option("password",myPassw)
.load()
df_one.createGlobalTempView("table_one")
//load second table
val df_two = spark.read
.format("jdbc")
.option("url",myUrl)
.option("dbtable",myTableTwo)
.option("user",myUser)
.option("password",myPassw)
.load()
df_two.createGlobalTempView("table_two")
//perform join of two tables
val df_result = spark.sql(
"select o.field_one, t.field_two, null as field_three "+
" from global_temp.table_one o, global_temp.table_two t where o.key_one = t.key_two"
)
//Error there:
df_result.write
.format(jdbc)
.option("dbtable",myResultTable)
.option("url",myDbUrl)
.option("user",myUser)
.option("password",myPassw)
.mode(SaveMode.Append)
.save
...
I receive the error:
Exception in thread "main" java.lang.IllegalArgumentException: Can't get JDBC type for null
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$getJdbcType$2.apply(JdbcUtils.scala:148)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$getJdbcType$2.apply(JdbcUtils.scala:148)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$getJdbcType(JdbcUtils.scala:147)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$18.apply(JdbcUtils.scala:663)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$18.apply(JdbcUtils.scala:662)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.saveTable(JdbcUtils.scala:662)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:77)
at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:426)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:215)
Workaround, which dramastically slows down the workflow:
...
// create case class for DataSet
case class ResultCaseClass(field_one: Option[Int], field_two: Option[Int], field_three: Option[Int])
//perform join of two tables
val ds_result = spark.sql(
"select o.field_one, t.field_two, null as field_three "+
" from global_temp.table_one o, global_temp.table_two t where o.key_one = t.key_two"
)
.withColumn("field_one",$"field_one".cast(IntegerType))
.withColumn("field_two",$"field_two".cast(IntegerType))
.withColumn("field_three",$"field_three".cast(IntegerType))
.as[ResultCaseClass]
//Success:
ds_result.write......
...

I encountered the same question as yours.Then I found the error information from java source code. If you insert a null value into a database without specifying the datatype,you will get "Can't get JDBC type for null".The way to fix this problem is casting null to the datatype which is equal to database's filed type.
example:
lit(null).cast(StringType) or lit(null).cast("string")

How to convert row rdd to typed rdd

Is it possible to convert a Row RDD to Typed RDD. In code below, can I convert row JavaRDD to Counter type JavaRDD
code :
JavaRDD<Counter> rdd = sc.parallelize(counters);
Dataset<Counter> ds = sqlContext.createDataset(rdd.rdd(), encoder);
DataFrame df = ds.toDF();
df.show()
df.write().parquet(path);
DataFrame newDataDF = sqlContext.read().parquet(path);
newDataDF.toJavaRDD(); // This gives a row type rdd
In Scala :
case class A(countId: Long, bytes: Array[Byte], blist: List[B])
case class B(id: String, count: Long)
val b1 = B("a", 1L)
val b2 = B("b", 2L)
val a1 = A(1L, Array(1.toByte,2.toByte), List(a1, a2))
val rdd = sc.parallelize(List(a1))
val dataSet: Dataset[A] = sqlContext.createDataset(rdd)
val df = dataSet.toDF()
// this shows, so this last entry is for List[B] in which it is storing string as null
|1|[01 02]| [[null,3984726108...|]
df.show
df.write.parquet(path)
val roundTripRDD = sqlContext.read.parquet(path).as[A].rdd
//throws error here when run show on df
Caused by: org.codehaus.commons.compiler.CompileException: File 'generated.java',
Line 300, Column 68:
No applicable constructor/method found for actual parameters
"long, byte[], scala.collection.Seq"; candidates are:
"test.data.A(long, byte[], scala.collection.immutable.List)"
roundTripRDD.toDF.show
assertEquals(roundTripRDD, rdd)
DO I need to provide some kind of constructor for case class?

Try:
sqlContext.read().parquet(path).as(encoder).rdd().toJavaRDD();

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Spark SQL : custom Hive UDF GenericInternalRow cannot be cast ArrayData - apache-spark

Related

Value Type is binary after Spark Dataset mapGroups operation even return a String in the function

Streaming avro files from a directory

Convert org.apache.avro.generic.GenericRecord to org.apache.spark.sql.Row

How to write null value from Spark sql expression of DataFrame to a database table? (IllegalArgumentException: Can't get JDBC type for null)

How to convert row rdd to typed rdd

Categories

Resources