Default schema value conversion fails in to_avro() while publishing data to Kafka using databricks spark-avro - databricks

I am trying to publish data into a Kafka topic using the Confluent schema registry.
Following is my schema registration code:
schemaRegistryClient.register("primitive_type_str_avsc", new Schema.Parser().parse(
  s"""
     |{
     |  "type": "record",
     |  "name": "RecordLevel",
     |  "fields": [
     |    {"name": "id", "type": ["string", "null"], "default": null}
     |  ]
     |}
  """.stripMargin
))
The following case class is used to match the schema:
case class myCaseClass(id: Option[String] = None)
Here is my notebook code snippet:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StringType
import scala.util.Try
import spark.implicits._

val df1 = Seq(("Welcome")).toDF("a")
  .map(row => myCaseClass(Some(row.getAs("a"))))
val cols = df1.columns

df1.select(struct(cols.map(column): _*).as('struct))
  .select(to_avro('struct, lit("primitive_type_str_avsc"), schemaRegistryAddress).as('value))
  .show()
I am facing the following exception:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 77.0 failed 4 times, most recent failure: Lost task 0.3 in stage 77.0 (TID 186, 10.73.122.72, executor 3): org.spark_project.confluent.kafka.schemaregistry.client.rest.exceptions.RestClientException: Schema not found; error code: 40403
at org.spark_project.confluent.kafka.schemaregistry.client.rest.RestService.sendHttpRequest(RestService.java:191)
at org.spark_project.confluent.kafka.schemaregistry.client.rest.RestService.httpRequest(RestService.java:218)
at org.spark_project.confluent.kafka.schemaregistry.client.rest.RestService.lookUpSubjectVersion(RestService.java:284)
at org.spark_project.confluent.kafka.schemaregistry.client.rest.RestService.lookUpSubjectVersion(RestService.java:272)
at org.spark_project.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient.getIdFromRegistry(CachedSchemaRegistryClient.java:78)
at org.spark_project.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient.getId(CachedSchemaRegistryClient.java:205)
at org.apache.spark.sql.avro.SchemaRegistryClientProxy.getId(SchemaRegistryClientProxy.java:52)
at org.apache.spark.sql.avro.SchemaRegistryAvroEncoder.encoder(SchemaRegistryUtils.scala:97)
at org.apache.spark.sql.avro.CatalystDataToAvroWithSchemaRegistry.nullSafeEval(CatalystDataToAvroWithSchemaRegistry.scala:57)
at org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:544)
Could you please help in resolving this issue? Thanks in advance.
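Error code 40403 ("Schema not found") means the registry lookup for the subject did not find a schema identical to the one the encoder derived from the struct column, so a useful first step is to print both schemas and compare them. This is a diagnostic sketch only, assuming the plain Confluent client and the OSS Spark Avro SchemaConverters are reachable from the notebook, and reusing df1 and schemaRegistryAddress from above:
import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient
import org.apache.spark.sql.avro.SchemaConverters

// Schema currently stored under the subject passed to to_avro()
val client = new CachedSchemaRegistryClient(schemaRegistryAddress, 100)
val registered = client.getLatestSchemaMetadata("primitive_type_str_avsc").getSchema

// Avro schema Spark derives from the struct being encoded
val derived = SchemaConverters.toAvroType(df1.schema, nullable = false, recordName = "RecordLevel")

println(s"registered: $registered")
println(s"derived:    $derived")
// Any difference between the two (field names, union branch order, namespace,
// defaults) makes the subject lookup fail with "Schema not found; error code: 40403".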

Related

kafka-python : avro.io.SchemaResolutionException: Can't access branch index 55 for union with 2 branches

I am using kafka-python 2.0.1 to consume Avro data. Following is the code I have tried:
from kafka import KafkaConsumer
import avro.schema
from avro.io import DatumReader, BinaryDecoder
import io

schema_path = "schema.avsc"
schema = avro.schema.parse(open(schema_path).read())
reader = DatumReader(schema)

consumer = KafkaConsumer(
    bootstrap_servers='xxx.xxx.xxx.xxx:9093',
    security_protocol='SASL_SSL',
    sasl_mechanism='GSSAPI',
    auto_offset_reset='latest',
    ssl_check_hostname=False,
    api_version=(1, 0, 0))
consumer.subscribe(['test'])

for message in consumer:
    message_val = message.value
    print(message_val)
    bytes_reader = io.BytesIO(message_val)
    bytes_reader.seek(5)
    decoder = avro.io.BinaryDecoder(bytes_reader)
    record = reader.read(decoder)
    print(record)
I am getting the following error:
avro.io.SchemaResolutionException: Can't access branch index 55 for union with 2 branches
Writer's Schema: [
"null",
"int"
]
Reader's Schema: [
"null",
"int"
]
Can anyone please suggest what the possible cause of this error might be? I already followed this thread to skip the initial 5 bytes:
How to decode/deserialize Avro with Python from Kafka
I got it working. The issue was that the wrong schema was being referenced. Thanks.
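For reference, the 5 bytes being skipped are the Confluent wire-format header: a magic byte (0) followed by the writer schema's registry ID as a big-endian 32-bit integer. When decoded values look wrong, pulling that ID out and checking it against the registry is a quick way to confirm which schema the producer actually used. A small sketch of that check, in Scala:
import java.nio.ByteBuffer

// Confluent wire format: byte 0 is the magic byte (0x0), bytes 1-4 hold the
// schema ID big-endian, and the Avro payload starts at byte 5 -- which is why
// the consumer above seeks past the first 5 bytes before decoding.
def schemaIdOf(message: Array[Byte]): Int = {
  require(message.length > 5 && message(0) == 0, "not a Confluent-framed message")
  ByteBuffer.wrap(message, 1, 4).getInt
}

// Fetching that ID from the registry (GET /schemas/ids/<id>) and comparing it
// with schema.avsc shows whether the reader schema matches what was written.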

Spark spark.sql.session.timeZone doesn't work with JSON source

Does Spark v2.3.1 depend on the local timezone when reading from a JSON file?
My src/test/resources/data/tmp.json:
[
  {
    "timestamp": "1970-01-01 00:00:00.000"
  }
]
and Spark code:
SparkSession.builder()
    .appName("test")
    .master("local")
    .config("spark.sql.session.timeZone", "UTC")
    .getOrCreate()
    .read()
    .option("multiLine", true).option("mode", "PERMISSIVE")
    .schema(new StructType()
        .add(new StructField("timestamp", DataTypes.TimestampType, true, Metadata.empty())))
    .json("src/test/resources/data/tmp.json")
    .show();
Result:
+-------------------+
| timestamp|
+-------------------+
|1969-12-31 22:00:00|
+-------------------+
How can I make Spark return 1970-01-01 00:00:00.000?
P.S. This question is not a duplicate of Spark Structured Streaming automatically converts timestamp to local time, because the solution provided there does not work for me and is already included in my question (see .config("spark.sql.session.timeZone", "UTC")).
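One plausible cause is that "1970-01-01 00:00:00.000" does not match Spark 2.3's default JSON timestampFormat, so parsing falls back to a legacy code path that uses the driver JVM's local zone rather than spark.sql.session.timeZone. A hedged sketch of a workaround along those lines, reusing the question's setup but with an explicit timestampFormat (written in Scala here):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DataTypes, Metadata, StructField, StructType}

val spark = SparkSession.builder()
  .appName("test")
  .master("local")
  .config("spark.sql.session.timeZone", "UTC")
  .getOrCreate()

spark.read
  .option("multiLine", true)
  .option("mode", "PERMISSIVE")
  // an explicit pattern matching the data keeps parsing on the
  // session-timezone-aware formatter instead of the legacy fallback
  .option("timestampFormat", "yyyy-MM-dd HH:mm:ss.SSS")
  .schema(new StructType()
    .add(StructField("timestamp", DataTypes.TimestampType, true, Metadata.empty())))
  .json("src/test/resources/data/tmp.json")
  .show()
Another commonly reported workaround for this class of issue is pinning the driver JVM itself to UTC, e.g. with -Duser.timezone=UTC via spark.driver.extraJavaOptions.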

org.apache.spark.sql.AnalysisException: 'write' can not be called on streaming Dataset/DataFrame

I'm trying to write a Spark Structured Streaming (2.3) dataset to ScyllaDB (Cassandra).
My code to write the dataset:
def saveStreamSinkProvider(ds: Dataset[InvoiceItemKafka]) = {
  ds
    .writeStream
    .format("cassandra.ScyllaSinkProvider")
    .outputMode(OutputMode.Append)
    .queryName("KafkaToCassandraStreamSinkProvider")
    .options(
      Map(
        "keyspace" -> namespace,
        "table" -> StreamProviderTableSink,
        "checkpointLocation" -> "/tmp/checkpoints"
      )
    )
    .start()
}
My ScyllaDB Streaming Sinks:
class ScyllaSinkProvider extends StreamSinkProvider {
  override def createSink(sqlContext: SQLContext,
                          parameters: Map[String, String],
                          partitionColumns: Seq[String],
                          outputMode: OutputMode): ScyllaSink =
    new ScyllaSink(parameters)
}

class ScyllaSink(parameters: Map[String, String]) extends Sink {
  override def addBatch(batchId: Long, data: DataFrame): Unit =
    data.write
      .cassandraFormat(
        parameters("table"),
        parameters("keyspace")
        //parameters("cluster")
      )
      .mode(SaveMode.Append)
      .save()
}
However, when I run this code, I receive an exception:
...
[error] +- StreamingExecutionRelation KafkaSource[Subscribe[transactions_load]], [key#7, value#8, topic#9, partition#10, offset#11L, timestamp#12, timestampType#13]
[error] at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:295)
[error] at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189)
[error] Caused by: org.apache.spark.sql.AnalysisException: 'write' can not be called on streaming Dataset/DataFrame;
[error] at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
[error] at org.apache.spark.sql.Dataset.write(Dataset.scala:3103)
[error] at cassandra.ScyllaSink.addBatch(CassandraDriver.scala:113)
[error] at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3$$anonfun$apply$16.apply(MicroBatchExecution.scala:477)
...
I have seen a similar question, but that is for CosmosDB - Spark CosmosDB Sink: org.apache.spark.sql.AnalysisException: 'write' can not be called on streaming Dataset/DataFrame
You could convert it to an RDD first and then write:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.catalyst.CatalystTypeConverters

class ScyllaSink(parameters: Map[String, String]) extends Sink {
  override def addBatch(batchId: Long, data: DataFrame): Unit = synchronized {
    val schema = data.schema
    // this ensures that the same query plan will be used
    val rdd: RDD[Row] = data.queryExecution.toRdd.mapPartitions { rows =>
      val converter = CatalystTypeConverters.createToScalaConverter(schema)
      rows.map(converter(_).asInstanceOf[Row])
    }
    // write the RDD to Cassandra
  }
}
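For completeness, a minimal sketch of the write step that the comment above leaves out, assuming the same spark-cassandra-connector cassandraFormat helper the question already uses (via the org.apache.spark.sql.cassandra._ implicits) and the rdd, schema, data, and parameters values from the snippet above:
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.cassandra._ // provides .cassandraFormat on DataFrameWriter

// Rebuild a plain (non-streaming) DataFrame from the converted rows,
// then write it with the same connector call the original sink used.
val batchDf = data.sparkSession.createDataFrame(rdd, schema)
batchDf.write
  .cassandraFormat(parameters("table"), parameters("keyspace"))
  .mode(SaveMode.Append)
  .save()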

Using Sparksql and SparkCSV with SparkJob Server

I am trying to package (as a JAR) a simple Scala application which uses spark-csv and Spark SQL to create a DataFrame from a CSV file stored in HDFS, and then run a simple query to return the max and min of a specific column of the CSV file.
I am getting an error when I use the sbt command to create the JAR, which I will later curl to the job server's /jars folder and execute from a remote machine.
Code:
import com.typesafe.config.{Config, ConfigFactory}
import org.apache.spark.SparkContext._
import org.apache.spark._
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

object sparkSqlCSV extends SparkJob {
  def main(args: Array[String]) {
    val conf = new SparkConf().setMaster("local[4]").setAppName("sparkSqlCSV")
    val sc = new SparkContext(conf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    val config = ConfigFactory.parseString("")
    val results = runJob(sc, config)
    println("Result is " + results)
  }

  override def validate(sc: sqlContext, config: Config): SparkJobValidation = {
    SparkJobValid
  }

  override def runJob(sc: sqlContext, config: Config): Any = {
    val value = "com.databricks.spark.csv"
    val ControlDF = sqlContext.load(value, Map("path" -> "hdfs://mycluster/user/Test.csv", "header" -> "true"))
    ControlDF.registerTempTable("Control")
    val aggDF = sqlContext.sql("select max(DieX) from Control")
    aggDF.collectAsList()
  }
}
Error:
[hduser#ptfhadoop01v spark-jobserver]$ sbt ashesh-jobs/package
[info] Loading project definition from /usr/local/hadoop/spark-jobserver/project
Missing bintray credentials /home/hduser/.bintray/.credentials. Some bintray features depend on this.
Missing bintray credentials /home/hduser/.bintray/.credentials. Some bintray features depend on this.
Missing bintray credentials /home/hduser/.bintray/.credentials. Some bintray features depend on this.
Missing bintray credentials /home/hduser/.bintray/.credentials. Some bintray features depend on this.
[info] Set current project to root (in build file:/usr/local/hadoop/spark-jobserver/)
[info] scalastyle using config /usr/local/hadoop/spark-jobserver/scalastyle-config.xml
[info] Processed 2 file(s)
[info] Found 0 errors
[info] Found 0 warnings
[info] Found 0 infos
[info] Finished in 9 ms
[success] created output: /usr/local/hadoop/spark-jobserver/ashesh-jobs/target
[warn] Credentials file /home/hduser/.bintray/.credentials does not exist
[info] Updating {file:/usr/local/hadoop/spark-jobserver/}ashesh-jobs...
[info] Resolving org.fusesource.jansi#jansi;1.4 ...
[info] Done updating.
[info] scalastyle using config /usr/local/hadoop/spark-jobserver/scalastyle-config.xml
[info] Processed 5 file(s)
[info] Found 0 errors
[info] Found 0 warnings
[info] Found 0 infos
[info] Finished in 1 ms
[success] created output: /usr/local/hadoop/spark-jobserver/job-server-api/target
[info] Compiling 2 Scala sources and 1 Java source to /usr/local/hadoop/spark-jobserver/ashesh-jobs/target/scala-2.10/classes...
[error] /usr/local/hadoop/spark-jobserver/ashesh-jobs/src/spark.jobserver/sparkSqlCSV.scala:8: object sql is not a member of package org.apache.spark
[error] import org.apache.spark.sql.SQLContext
[error] ^
[error] /usr/local/hadoop/spark-jobserver/ashesh-jobs/src/spark.jobserver/sparkSqlCSV.scala:14: object sql is not a member of package org.apache.spark
[error] val sqlContext = new org.apache.spark.sql.SQLContext(sc)
[error] ^
[error] /usr/local/hadoop/spark-jobserver/ashesh-jobs/src/spark.jobserver/sparkSqlCSV.scala:25: not found: type sqlContext
[error] override def runJob(sc: sqlContext, config: Config): Any = {
[error] ^
[error] /usr/local/hadoop/spark-jobserver/ashesh-jobs/src/spark.jobserver/sparkSqlCSV.scala:21: not found: type sqlContext
[error] override def validate(sc: sqlContext, config: Config): SparkJobValidation = {
[error] ^
[error] /usr/local/hadoop/spark-jobserver/ashesh-jobs/src/spark.jobserver/sparkSqlCSV.scala:27: not found: value sqlContext
[error] val ControlDF = sqlContext.load(value,Map("path"->"hdfs://mycluster/user/Test.csv","header"->"true"))
[error] ^
[error] /usr/local/hadoop/spark-jobserver/ashesh-jobs/src/spark.jobserver/sparkSqlCSV.scala:29: not found: value sqlContext
[error] val aggDF = sqlContext.sql("select max(DieX) from Control")
[error] ^
[error] 6 errors found
[error] (ashesh-jobs/compile:compileIncremental) Compilation failed
[error] Total time: 10 s, completed May 26, 2016 4:42:52 PM
[hduser#ptfhadoop01v spark-jobserver]$
I guess the main issue is that the dependencies for spark-csv and Spark SQL are missing, but I have no idea where to declare the dependencies before compiling the code with sbt.
I am issuing the following command to package the application; the source code is placed under the "ashesh_jobs" directory:
[hduser#ptfhadoop01v spark-jobserver]$ sbt ashesh-jobs/package
I hope someone can help me resolve this issue. Can you tell me in which file I should specify the dependencies, and in what format?
The following link has more information on creating other contexts: https://github.com/spark-jobserver/spark-jobserver/blob/master/doc/contexts.md
You also need job-server-extras.
Add the library dependency in build.sbt:
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.6.2"
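As a sketch only, the dependency section of build.sbt for this job might look like the following; the versions are illustrative and should be aligned with the Spark version your job server runs, with Spark itself marked provided because the job server supplies it at runtime:
// build.sbt (sketch) -- versions are examples, align them with your cluster
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.6.2" % "provided",
  "org.apache.spark" %% "spark-sql"  % "1.6.2" % "provided",
  "com.databricks"   %% "spark-csv"  % "1.4.0"
)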

Error when using SparkJob with NamedRddSupport

The goal is to create the following on a local instance of Spark JobServer:
object foo extends SparkJob with NamedRddSupport
Question: How can I fix the following error which happens on every job:
{
  "status": "ERROR",
  "result": {
    "message": "Ask timed out on [Actor[akka://JobServer/user/context-supervisor/439b2467-spark.jobserver.genderPrediction#884262439]] after [10000 ms]",
    "errorClass": "akka.pattern.AskTimeoutException",
    "stack": ["akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:334)", "akka.actor.Scheduler$$anon$7.run(Scheduler.scala:117)", "scala.concurrent.Future$InternalCallbackExecutor$.scala$concurrent$Future$InternalCallbackExecutor$$unbatchedExecute(Future.scala:694)", "scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:691)", "akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:467)", "akka.actor.LightArrayRevolverScheduler$$anon$8.executeBucket$1(Scheduler.scala:419)", "akka.actor.LightArrayRevolverScheduler$$anon$8.nextTick(Scheduler.scala:423)", "akka.actor.LightArrayRevolverScheduler$$anon$8.run(Scheduler.scala:375)", "java.lang.Thread.run(Thread.java:745)"]
  }
}
A more detailed error description by the Spark JobServer:
job-server[ERROR] Exception in thread "pool-100-thread-1" java.lang.AbstractMethodError: spark.jobserver.genderPrediction$.namedObjectsPrivate()Ljava/util/concurrent/atomic/AtomicReference;
job-server[ERROR] at spark.jobserver.JobManagerActor$$anonfun$spark$jobserver$JobManagerActor$$getJobFuture$4.apply(JobManagerActor.scala:248)
job-server[ERROR] at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
job-server[ERROR] at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
job-server[ERROR] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
job-server[ERROR] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
job-server[ERROR] at java.lang.Thread.run(Thread.java:745)
In case somebody wants to see the code:
package spark.jobserver

import org.apache.spark.SparkContext._
import org.apache.spark.{SparkContext}
import com.typesafe.config.{Config, ConfigFactory}
import collection.JavaConversions._
import scala.io.Source

object genderPrediction extends SparkJob with NamedRddSupport {
  // Main function
  def main(args: scala.Array[String]) {
    val sc = new SparkContext()
    sc.hadoopConfiguration.set("fs.tachyon.impl", "tachyon.hadoop.TFS")
    val config = ConfigFactory.parseString("")
    val results = runJob(sc, config)
  }

  def validate(sc: SparkContext, config: Config): SparkJobValidation = { SparkJobValid }

  def runJob(sc: SparkContext, config: Config): Any = {
    return "ok";
  }
}
Version information:
Spark is 1.5.0 - SparkJobServer is latest version
Thank you all very much in advance!
Adding more explanation to #noorul's answer:
It seems like you compiled the code with an old version of SJS and you are running it with the latest.
NamedObjects were recently added. You are getting AbstractMethodError because your server expects NamedObjects support and you didn't compile the code with that.
Also: you don't need the main method there since it won't be executed by SJS.
Ensure that your compile-time and runtime library versions of the dependent packages are the same.
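As an illustration of keeping those versions aligned, the SJS API dependency in build.sbt can be pinned to the exact version your local server was built from; the version below is only an example, not necessarily the one your server runs:
// build.sbt (sketch) -- job-server-api must match the deployed server's version
libraryDependencies += "spark.jobserver" %% "job-server-api" % "0.6.2" % "provided"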