I'm very new to Spark Streaming. I have Spark 2.2 running in standalone mode with one worker. I'm using a socket source and trying to read the incoming stream into an object called MicroserviceMessage.
val message = spark.readStream
.format("socket")
.option("host", host)
.option("port", port)
.load()
val df = message.as[MicroserviceMessage].flatMap(microserviceMessage =>
microserviceMessage.DataPoints.map(datapoint => (datapoint, microserviceMessage.ServiceProperties, datapoint.EpochUTC)))
.toDF("datapoint", "properties", "timestamp")
I'm hoping this will be a DataFrame with columns "datapoint", "properties", and "timestamp".
The data I'm pasting into my netcat terminal looks like this (this is what I'm trying to read in as MicroserviceMessage):
{
  "SystemType": "mytype",
  "SystemGuid": "6c84fb90-12c4-11e1-840d-7b25c5ee775a",
  "TagType": "Raw Tags",
  "ServiceType": "FILTER",
  "DataPoints": [
    {
      "TagName": "013FIC003.PV",
      "EpochUTC": 1505247956001,
      "ItemValue": 25.47177,
      "ItemValueStr": "NORMAL",
      "Quality": "Good",
      "TimeOffset": "P0000"
    },
    {
      "TagName": "013FIC003.PV",
      "EpochUTC": 1505247956010,
      "ItemValue": 26.47177,
      "ItemValueStr": "NORMAL",
      "Quality": "Good",
      "TimeOffset": "P0000"
    }
  ],
  "ServiceProperties": [
    {
      "Key": "OutputTagName",
      "Value": "FI12102.PV_CL"
    },
    {
      "Key": "OutputTagType",
      "Value": "Cleansing Flow Tags"
    }
  ]
}
Instead what I see is:
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '`SystemType`' given input columns: [value];
The MicroserviceMessage case class (and the case classes it refers to) looks like this:
case class DataPoints
(
TagName: String,
EpochUTC: Double,
ItemValue: Double,
ItemValueStr: String,
Quality: String,
TimeOffset: String
)
case class ServiceProperties
(
Key: String,
Value: String
)
case class MicroserviceMessage
(
SystemType: String,
SystemGuid: String,
TagType: String,
ServiceType: String,
DataPoints: List[DataPoints],
ServiceProperties: List[ServiceProperties]
)
EDIT:
After reading this post, I was able to start the job by doing:
val messageEncoder = Encoders.bean(classOf[MicroserviceMessage])
val df = message.select($"value").as(messageEncoder).map(
msmg => (msmg.ServiceType, msmg.SystemGuid)
).toDF("service", "guid")
But this causes issues when I start sending data.
Caused by: java.lang.BootstrapMethodError: java.lang.NoClassDefFoundError: scala/runtime/LambdaDeserialize
Full stacktrace
This:
message.as[MicroserviceMessage]
is incorrect as explained by the error message:
cannot resolve 'SystemType' given input columns: [value];
Data that comes from the socket source is just a string (or a string plus a timestamp). To make it usable as a strongly typed Dataset, you have to parse it first, for example with org.apache.spark.sql.functions.from_json, as sketched below.
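A minimal sketch of that approach (untested against the exact setup above, and assuming the case classes defined in the question are in scope):

// Derive a Catalyst schema from the case class, then parse the socket's "value"
// column with from_json before switching to the typed Dataset API.
import org.apache.spark.sql.Encoders
import org.apache.spark.sql.functions.from_json
import spark.implicits._

val schema = Encoders.product[MicroserviceMessage].schema

val typedMessages = message
  .select(from_json($"value", schema).as("msg"))
  .select("msg.*")
  .as[MicroserviceMessage]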
The reason for the exception
Caused by: java.lang.BootstrapMethodError: java.lang.NoClassDefFoundError: scala/runtime/LambdaDeserialize
is that you compiled your Spark Structured Streaming application with Scala 2.12.4 (or any other version in the 2.12 line), which is not supported by Spark 2.2.
From the scaladoc of scala.runtime.LambdaDeserializer:
This class is only intended to be called by synthetic $deserializeLambda$ method that the Scala 2.12 compiler will add to classes hosting lambdas.
Spark 2.2 supports Scala versions up to and including 2.11.12, with 2.11.8 being the most "blessed" version.
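If the build uses sbt, a minimal (hypothetical) build.sbt excerpt to keep the compiler on a supported Scala line would look like this:

// Hypothetical build.sbt excerpt: Spark 2.2 artifacts are only published for Scala 2.11,
// so the project must be compiled with a 2.11.x compiler.
scalaVersion := "2.11.12"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.2.0" % "provided"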
Related
I'm trying to use the to_avro() function to create Avro records. However, I'm not able to encode multiple columns; some columns are simply lost after encoding. A simple example to reproduce the problem:
val schema = StructType(List(
StructField("entity_type", StringType),
StructField("entity", StringType)
))
val rdd = sc.parallelize(Seq(
Row("PERSON", "John Doe")
))
val df = sqlContext.createDataFrame(rdd, schema)
df
.withColumn("struct", struct(col("entity_type"), col("entity")))
.select("struct")
.collect()
.foreach(println)
// prints [[PERSON, John Doe]]
df
.withColumn("struct", struct(col("entity_type"), col("entity")))
.select(to_avro(col("struct")).as("value"))
.select(from_avro(col("value"), entitySchema).as("entity"))
.collect()
.foreach(println)
// prints [[, PERSON]]
My schema looks like this:
{
"type" : "record",
"name" : "Entity",
"fields" : [ {
"name" : "entity_type",
"type" : "string"
},
{
"name" : "entity",
"type" : "string"
} ]
}
Interestingly, if I change the column order in the struct, the result is [, John Doe].
I'm using Spark 2.4.5. According to the Spark documentation: "to_avro() can be used to turn structs into Avro records. This method is particularly useful when you would like to re-encode multiple columns into a single one when writing data out to Kafka."
It works after changing the field types from "string" to ["string", "null"]. I'm not sure whether this behavior is intended, though.
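For illustration, a sketch of the adjusted schema and the same round trip (assuming entitySchema is the JSON string passed to from_avro above, and that the spark-avro module is on the classpath as in the original snippet):

import org.apache.spark.sql.avro.{from_avro, to_avro}
import org.apache.spark.sql.functions.{col, struct}

// Same record schema as before, but with both fields declared nullable.
val entitySchema =
  """{
    |  "type": "record",
    |  "name": "Entity",
    |  "fields": [
    |    { "name": "entity_type", "type": ["string", "null"] },
    |    { "name": "entity", "type": ["string", "null"] }
    |  ]
    |}""".stripMargin

df.withColumn("struct", struct(col("entity_type"), col("entity")))
  .select(to_avro(col("struct")).as("value"))
  .select(from_avro(col("value"), entitySchema).as("entity"))
  .collect()
  .foreach(println)
// per the fix above, this should now print [[PERSON, John Doe]]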
I have an .avsc schema like the one below:
{
  "name": "address",
  "type": [
    "null",
    {
      "type": "record",
      "name": "Address",
      "namespace": "com.data",
      "fields": [
        {
          "name": "address",
          "type": ["null", "com.data.Address"],
          "default": null
        }
      ]
    }
  ],
  "default": null
}
On loading this data in pyspark:
jsonFormatSchema = open("Address.avsc", "r").read()
spark = SparkSession.builder.appName('abc').getOrCreate()
df = spark.read.format("avro")\
.option("avroSchema", jsonFormatSchema)\
.load("xxx.avro")
I get the following exception:
"Found recursive reference in Avro schema, which can not be processed by Spark"
I tried many other configurations, but without any success.
To execute it, I use spark-submit with:
--packages org.apache.spark:spark-avro_2.12:3.0.1
This is intended behavior; you can take a look at the issue:
https://issues.apache.org/jira/browse/SPARK-25718
I have a Spark Structured Streaming process built with PySpark that reads an Avro message from a Kafka topic, makes some transformations, and loads the data as Avro into a target topic.
I use the ABRiS package (https://github.com/AbsaOSS/ABRiS) to serialize/deserialize Avro from Confluent, integrating with the Schema Registry.
The schema contains integer columns as follows:
{
"name": "total_images",
"type": [
"null",
"int"
],
"default": null
},
{
"name": "total_videos",
"type": [
"null",
"int"
],
"default": null
},
The process raises the following error: Cannot convert Catalyst type IntegerType to Avro type ["null","int"].
I've tried converting the columns to be nullable, but the error persists.
If anyone has a suggestion, I would appreciate it.
I burned hours on this one.
Actually, it is unrelated to the ABRiS dependency (the behaviour is the same with the native spark-avro APIs).
There may be several root causes, but in my case (Spark 3.0.1, Scala, typed Dataset) it was related to the encoder and a wrong type in the case class holding the data.
In short, an Avro field defined with "type": ["null","int"] cannot be mapped to a Scala Int; it needs to be Option[Int].
Using the following code:
test("Avro Nullable field") {
val schema: String =
"""
|{
| "namespace": "com.mberchon.monitor.dto.avro",
| "type": "record",
| "name": "TestAvro",
| "fields": [
| {"name": "strVal", "type": ["null", "string"]},
| {"name": "longVal", "type": ["null", "long"]}
| ]
|}
""".stripMargin
val topicName = "TestNullableAvro"
val testInstance = TestAvro("foo",Some(Random.nextInt()))
import sparkSession.implicits._
val dsWrite:Dataset[TestAvro] = Seq(testInstance).toDS
val allColumns = struct(dsWrite.columns.head, dsWrite.columns.tail: _*)
dsWrite
.select(to_avro(allColumns,schema) as 'value)
.write
.format("kafka")
.option("kafka.bootstrap.servers", bootstrap)
.option("topic", topicName)
.save()
val dsRead:Dataset[TestAvro] = sparkSession.read
.format("kafka")
.option("kafka.bootstrap.servers", bootstrap)
.option("subscribe", topicName)
.option("startingOffsets", "earliest")
.load()
.select(from_avro(col("value"), schema) as 'Metric)
.select("Metric.*")
.as[TestAvro]
assert(dsRead.collect().contains(testInstance))
}
It fails if the case class is defined as follows:
case class TestAvro(strVal:String,longVal:Long)
Cannot convert Catalyst type LongType to Avro type ["null","long"].
org.apache.spark.sql.avro.IncompatibleSchemaException: Cannot convert Catalyst type LongType to Avro type ["null","long"].
at org.apache.spark.sql.avro.AvroSerializer.newConverter(AvroSerializer.scala:219)
at org.apache.spark.sql.avro.AvroSerializer.$anonfun$newStructConverter$1(AvroSerializer.scala:239)
It works properly with:
case class TestAvro(strVal:String,longVal:Option[Long])
By the way, it would be more than nice to have support for SpecificRecord within Spark encoders (you can fall back to Kryo, but it is inefficient; a sketch follows below).
As it stands, in order to use a strongly typed Dataset with my Avro data, I need to create additional case classes (which are duplicates of my SpecificRecords).
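For completeness, a rough sketch of the Kryo fallback mentioned above, reusing the sparkSession from the test code and with a hypothetical placeholder class standing in for an Avro-generated SpecificRecord; the Dataset works, but every row is serialized into a single opaque binary column, so Catalyst loses column pruning and predicate pushdown:

import org.apache.spark.sql.{Dataset, Encoder, Encoders}

// Hypothetical stand-in for an Avro-generated SpecificRecord class.
class MetricRecord(var strVal: String, var longVal: Long)

// Kryo-based encoder: accepts arbitrary JVM classes, but stores each row as one binary blob.
implicit val metricRecordEncoder: Encoder[MetricRecord] = Encoders.kryo[MetricRecord]

val ds: Dataset[MetricRecord] =
  sparkSession.createDataset(Seq(new MetricRecord("foo", 42L)))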
I have written the following Scala program in Eclipse to read a CSV file from a location in HDFS and then save that data into a Hive table (I am using an HDP 2.4 sandbox running in VMware on my local machine):
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive.HiveContext
object HDFS2HiveFileRead {
  def main(args: Array[String]) {
    val conf = new SparkConf()
      .setAppName("HDFS2HiveFileRead")
      .setMaster("local")
    val sc = new SparkContext(conf)
    val hiveContext = new HiveContext(sc)

    println("loading data")
    val loadDF = hiveContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .option("delimiter", ",")
      .load("hdfs://192.168.159.129:8020/employee.csv")
    println("data loaded")
    loadDF.printSchema()

    println("creating table")
    loadDF.write.saveAsTable("%s.%s".format("default", "tblEmployee2"))
    println("table created")

    val selectQuery = "SELECT * FROM default.tblEmployee2"
    println("selecting data")
    val result = hiveContext.sql(selectQuery)
    result.show()
  }
}
When I run this program from Eclipse using the Run As -> Scala Application option, it shows the following results on the Eclipse console:
loading data
data loaded
root
|-- empid: string (nullable = true)
|-- empname: string (nullable = true)
|-- empage: string (nullable = true)
creating table
17/06/29 13:27:08 INFO CatalystWriteSupport: Initialized Parquet WriteSupport with Catalyst schema:
{
  "type": "struct",
  "fields": [
    { "name": "empid",   "type": "string", "nullable": true, "metadata": { } },
    { "name": "empname", "type": "string", "nullable": true, "metadata": { } },
    { "name": "empage",  "type": "string", "nullable": true, "metadata": { } }
  ]
}
and corresponding Parquet message type:
message spark_schema {
  optional binary empid (UTF8);
  optional binary empname (UTF8);
  optional binary empage (UTF8);
}
table created
selecting data
+-----+--------+------+
|empid| empname|empage|
+-----+--------+------+
| 1201| satish| 25|
| 1202| krishna| 28|
| 1203| amith| 39|
| 1204| javed| 23|
| 1205| prudvi| 23|
+-----+--------+------+
17/06/29 13:27:14 ERROR ShutdownHookManager: Exception while deleting
Spark temp dir:
C:\Users\c.b\AppData\Local\Temp\spark-c65aa16b-6448-434f-89dc-c318f0797e10
java.io.IOException: Failed to delete:
C:\Users\c.b\AppData\Local\Temp\spark-c65aa16b-6448-434f-89dc-c318f0797e10
This shows that the CSV data has been loaded from the desired HDFS location (inside HDP) and that a table named tblEmployee2 has been created in Hive, since I can see the results in the console. I can even read this table again and again by running any Spark job that reads from it.
BUT the issue is that as soon as I connect to my HDP 2.4 box through PuTTY and try to look at this table in Hive:
1) I cannot see the table there.
2) I expected this code to create a managed/internal table in Hive, so the CSV file at the given HDFS location should also be moved from its base location into the Hive warehouse directory, but that is not happening.
3) I can also see a metastore_db folder being created in my Eclipse workspace. Does that mean tblEmployee2 is being created on my local/Windows machine?
4) How can I resolve this issue and make my code create the Hive table in HDP? Is there any configuration I am missing?
5) Why am I getting that last error during execution?
Any quick response/pointer would be appreciated.
UPDATE: After a lot of thought, I added hiveContext.setConf("hive.metastore.uris", "thrift://192.168.159.129:9083").
The code moved a bit further, but some permission-related issues started appearing. I can now see the table (tblEmployee2) in Hive's default database in my VMware box, but Spark SQL persists it in its own format:
17/06/29 22:43:21 WARN HiveContext$$anon$2: Could not persist `default`.`tblEmployee2` in a Hive compatible way. Persisting it into Hive metastore in Spark SQL specific format.
Hence, I am still not able to use HiveContext as intended, and my above-mentioned issues 2-5 still persist.
Regards,
Bhupesh
You are running Spark in local mode.
val conf = new SparkConf()
.setAppName("HDFS2HiveFileRead")
.setMaster("local")
In local mode, when you call saveAsTable, Spark will try to create the table on the local machine. Change your configuration to run in YARN mode.
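A rough sketch of the kind of change meant here, reusing the metastore URI from the question's update (the exact master string and URI depend on your HDP setup, and the driver also needs the cluster's Hadoop/Hive configuration on its classpath):

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext

val conf = new SparkConf()
  .setAppName("HDFS2HiveFileRead")
  .setMaster("yarn-client") // Spark 1.x style master string, instead of "local"

val sc = new SparkContext(conf)
val hiveContext = new HiveContext(sc)

// Point Spark at the cluster's Hive metastore instead of a local Derby metastore_db.
hiveContext.setConf("hive.metastore.uris", "thrift://192.168.159.129:9083")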
You can refer to the URL below for more details:
http://www.coding-daddy.xyz/node/7
I have some Avro classes that I generated and am now trying to use in Spark. I imported my Avro-generated Java class, "twitter_schema", and refer to it when I deserialize. It seems to work, but I get a cast exception at the end.
My Schema:
$ more twitter.avsc
{ "type" : "record", "name" : "twitter_schema", "namespace" :
"com.miguno.avro", "fields" : [ {
"name" : "username",
"type" : "string",
"doc" : "Name of the user account on Twitter.com" }, {
"name" : "tweet",
"type" : "string",
"doc" : "The content of the user's Twitter message" }, {
"name" : "timestamp",
"type" : "long",
"doc" : "Unix epoch time in seconds" } ], "doc:" : "A basic schema for storing Twitter messages" }
My code:
import java.nio.ByteBuffer
import org.apache.hadoop.conf.Configuration
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.avro.mapred.AvroKey
import org.apache.hadoop.io.NullWritable
import org.apache.avro.mapred.AvroInputFormat
import org.apache.avro.mapred.AvroWrapper
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.specific.SpecificDatumReader;
import com.miguno.avro.twitter_schema
val path = "/app/avro/data/twitter.avro"
val conf = new Configuration
var avroRDD = sc.newAPIHadoopFile(path,classOf[AvroKeyInputFormat[twitter_schema]],
classOf[AvroKey[ByteBuffer]], classOf[NullWritable], conf)
var avroRDD = sc.hadoopFile(path,classOf[AvroInputFormat[twitter_schema]],
classOf[AvroWrapper[twitter_schema]], classOf[NullWritable], 5)
avroRDD.map(l => {
  // transformations here
  new String(l._1.datum.username)
}).first
And I get an error on the last line:
scala> avroRDD.map(l => {
| new String(l._1.datum.username)}).first
<console>:30: error: overloaded method constructor String with alternatives:
(x$1: StringBuilder)String <and>
(x$1: StringBuffer)String <and>
(x$1: Array[Byte])String <and>
(x$1: Array[Char])String <and>
(x$1: String)String
cannot be applied to (CharSequence)
new String(l._1.datum.username)}).first
What am I doing wrong? I don't understand the error.
Is this the right way to deserialize? I read about Kryo, but it seems to add complexity; I also read about the Spark SQL context accepting Avro in 1.2, but that sounds like a performance hog/workaround. Does anyone have best practices for this?
thanks,
Matt
I think your problem is that Avro has deserialized the string into a CharSequence, but Spark expected a Java String. Avro has three ways to deserialize strings in Java: into CharSequence, into String, and into Utf8 (an Avro class for storing strings, somewhat like Hadoop's Text).
You control that by adding the "avro.java.string" property to your Avro schema. Possible values are (case sensitive): "String", "CharSequence", "Utf8". There may be a way to control this dynamically through the input format as well, but I don't know exactly how.
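For example, the username field from the schema earlier could be declared with that property so the generated class exposes java.lang.String (only a sketch of the relevant field; the rest of the schema stays unchanged):

{
  "name" : "username",
  "type" : {
    "type" : "string",
    "avro.java.string" : "String"
  },
  "doc" : "Name of the user account on Twitter.com"
}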
OK, since CharSequence is an interface that String implements, I can keep my Avro schema the way it was and just turn my Avro value into a String via toString(), i.e.:
scala> avroRDD.map(l => {
| new String(l._1.datum.get("username").toString())
| } ).first
res2: String = miguno