I use Spark 1.6 and Kafka 0.8.2.1.
I am trying to fetch some data from Kafka using Spark Streaming and do some operations on that data.
For that I need to know the schema of the fetched data. Is there some way to get it, or can we access values from the stream by field name?
TL;DR It's not possible directly (esp. with the old Spark 1.6), but not impossible either.
Kafka sees bytes, and bytes are what Spark Streaming gets. You'd have to pass the schema along somehow as extra information in a fixed field (possibly as a JSON-encoded string) and use it to decode the other field. This is not available out of the box, but it is certainly doable.
As a suggestion, I'd send messages whose value is always a two-field data structure: the schema (of the value field) and the value itself, both JSON-encoded.
You could then use one of the from_json functions:
from_json(e: Column, schema: StructType): Column Parses a column containing a JSON string into a StructType with the specified schema.
Given from_json was added in Spark 2.1.0, you'd have to register your own custom user-defined function (UDF) that'd deserialize the string value into a corresponding structure (just see how from_json does it and copy it).
Note that the DataType object comes with a fromJson method that can "map" a JSON-encoded string into a DataType describing your schema.
fromJson(json: String): DataType
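For illustration, here is a minimal sketch of that idea in Spark 2.1+ syntax (the envelope field name payload, the example schema, and the tiny in-memory DataFrame are all assumptions): DataType.fromJson rebuilds the StructType that was shipped with the message, and from_json then parses the payload with it. On Spark 1.6 the parsing step would have to live in your own UDF instead.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{DataType, StructType}

val spark = SparkSession.builder().master("local[*]").appName("schema-in-message").getOrCreate()
import spark.implicits._

// The schema travels with the message as a JSON-encoded string (DataType.json format).
val schemaJson =
  """{"type":"struct","fields":[
    |  {"name":"id","type":"long","nullable":false,"metadata":{}},
    |  {"name":"name","type":"string","nullable":true,"metadata":{}}]}""".stripMargin
val payloadSchema = DataType.fromJson(schemaJson).asInstanceOf[StructType]

// Stand-in for the payload part of the envelope extracted from the Kafka value (assumption).
val envelopeDf = Seq("""{"id":1,"name":"alice"}""").toDF("payload")

val parsed = envelopeDf
  .select(from_json($"payload", payloadSchema).as("data"))
  .select("data.*")   // fields are now addressable by name
```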
Related
Is it possible to extract the schema of Kafka input data in Spark Streaming?
Even if I were able to extract the schema from an RDD, streaming works fine when there is data in the Kafka topics, but fails when an RDD is empty.
Data in Kafka is stored as JSON.
JSON is another format for data that is written to Kafka. You can use the built-in from_json function along with the expected schema to convert a binary value into a Spark SQL struct.
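For example, a hedged Structured Streaming sketch (broker address, topic name, and the expected schema are assumptions):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

val spark = SparkSession.builder().master("local[*]").appName("kafka-json").getOrCreate()

// The schema you expect the JSON messages to follow (assumption).
val expectedSchema = StructType(Seq(
  StructField("id", LongType),
  StructField("name", StringType)))

val parsed = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")   // assumption
  .option("subscribe", "json-topic")                      // assumption
  .load()
  .select(from_json(col("value").cast("string"), expectedSchema).as("value"))
  .select("value.*")
```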
The application is streaming from a Kafka topic that gets messages with different structures. The only way to know which structure a message has is to use the message key. Is there a way to apply a schema based on the message type, such as using a function in the from_json call and passing the key into that function?
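from_json itself takes one fixed schema per call, so the switching has to happen outside of it. A rough sketch, assuming the possible keys (here "typeA"/"typeB") and their schemas are known up front, is to split the stream by key and parse each branch with its matching schema:

```scala
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// Hypothetical schemas for the two message types (assumptions).
val schemaA = StructType(Seq(StructField("id", LongType), StructField("name", StringType)))
val schemaB = StructType(Seq(StructField("id", LongType), StructField("score", LongType)))

// kafkaDf: the raw Kafka DataFrame with binary "key" and "value" columns (assumption).
val keyed = kafkaDf.select(
  col("key").cast("string").as("key"),
  col("value").cast("string").as("value"))

val typeA = keyed.filter(col("key") === "typeA")
  .select(from_json(col("value"), schemaA).as("data"))
val typeB = keyed.filter(col("key") === "typeB")
  .select(from_json(col("value"), schemaB).as("data"))
// Each branch can then be processed and written by its own streaming query.
```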
Suppose we subscribe to two topics in the stream, one carrying Avro and the other plain strings. Is it possible to deserialize dynamically based on the topic name?
In theory, yes
The Deserializer interface accepts the topic name as a parameter, which you could perform a check against.
However, getting access to this in Spark would require your own UDF wrapper around it.
Ultimately, I think it would be better to define two streaming DataFrames, one per topic/format, or simply to produce the strings as Avro-encoded as well.
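A rough sketch of the two-DataFrames approach (topic names and broker address are assumptions; from_avro comes from the external spark-avro module, Spark 2.4+, and moved to org.apache.spark.sql.avro.functions in Spark 3.x):

```scala
import org.apache.spark.sql.avro.from_avro   // org.apache.spark.sql.avro.functions.from_avro in Spark 3.x
import org.apache.spark.sql.functions.col

// spark: an existing SparkSession; avroSchemaJson: the Avro schema as a JSON string (assumptions).
val stringStream = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "string-topic")
  .load()
  .select(col("value").cast("string").as("value"))

val avroStream = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "avro-topic")
  .load()
  .select(from_avro(col("value"), avroSchemaJson).as("value"))
```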
I'm looking for how to read Avro messages with a complex structure from Kafka using Spark Structured Streaming.
I then want to parse these messages, compare them with reference values in HBase, and save the outcome to HDFS or another HBase table.
I started with the sample code below:
https://github.com/Neuw84/spark-continuous-streaming/blob/master/src/main/java/es/aconde/structured/StructuredDemo.java
Avro message schema:
struct[mTimeSeries:
struct[cName:string,
eIpAddr:string,
pIpAddr:string,
pTime:string,
mtrcs:array[struct[mName:string,
xValues:array[bigint],
yValues:array[string],
rName:string]]]]
I am struggling to create a row using RowFactory.create for this schema. Do I need to iterate through the array fields? I understand that we can use explode functions on the dataset to denormalize or access the inner fields of the struct array once we have created a dataset with this structure, as I do in Hive. So I would like to create a row as is, i.e. exactly how the Avro message looks, and then use SQL functions to transform it further.
sparkSession.udf().register("deserialize", (byte[] data) -> {
    GenericRecord record = recordInjection.invert(data).get();
    // This works for a flat record, but not for the nested struct/array schema above.
    return RowFactory.create(record.get("machine").toString(), record.get("sensor").toString(),
        record.get("data"), record.get("eventTime"));
}, DataTypes.createStructType(type.fields()));
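For what it's worth, here is a rough Scala sketch (the original snippet is Java; the field access below is an assumption about how the Avro record is laid out) of building a Row that mirrors the nested schema: every struct becomes a nested Row and every array becomes a Seq, so the mtrcs elements do have to be converted one by one.

```scala
import scala.collection.JavaConverters._
import org.apache.avro.generic.GenericRecord
import org.apache.spark.sql.Row

def toRow(record: GenericRecord): Row = {
  val ts = record.get("mTimeSeries").asInstanceOf[GenericRecord]
  // Each element of the mtrcs array becomes a nested Row; Avro arrays map to Seq.
  val metrics = ts.get("mtrcs").asInstanceOf[java.util.List[GenericRecord]].asScala.map { m =>
    Row(
      m.get("mName").toString,
      m.get("xValues").asInstanceOf[java.util.List[java.lang.Long]].asScala.toSeq,
      m.get("yValues").asInstanceOf[java.util.List[AnyRef]].asScala.map(_.toString).toSeq,
      m.get("rName").toString)
  }.toSeq
  // The outer Row wraps a single nested Row, mirroring struct[mTimeSeries: struct[...]].
  Row(Row(
    ts.get("cName").toString,
    ts.get("eIpAddr").toString,
    ts.get("pIpAddr").toString,
    ts.get("pTime").toString,
    metrics))
}
```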
I used to_avro to serialize a struct field in a DataFrame into 'value' and wrote it to a Kafka topic.
My struct has several string fields. I tested, and every field has a value.
Using Spark Streaming, I read the topic and use from_avro to deserialize the value with the exact schema that was used to serialize the struct.
select(from_avro($"value", schema).as("value"))
The result is a struct field in the output DataFrame. However, some fields in the struct always come back empty; only some of the fields have the correct value.
Could this be a bug in the to_avro/from_avro functions, or did I not use them correctly?
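For comparison, here is a minimal round-trip sketch (Spark 2.4+ with the spark-avro module; the input DataFrame df, SparkSession spark, broker, and topic are assumptions). It derives the from_avro schema from the very struct that to_avro serializes, via the SchemaConverters developer API, so the reader and writer schemas cannot drift apart in field order or nullability, which is worth double-checking when some fields come back empty.

```scala
import org.apache.spark.sql.avro.{SchemaConverters, from_avro, to_avro}  // ...avro.functions in Spark 3.x
import org.apache.spark.sql.functions.{col, struct}

// df: the DataFrame to publish; spark: an existing SparkSession (assumptions).
val packed = df.select(struct(df.columns.map(col): _*).as("event"))

// Derive the Avro schema from the exact struct type that gets serialized.
val avroSchema = SchemaConverters.toAvroType(
  packed.schema("event").dataType, nullable = false).toString

// Write: serialize the struct into the Kafka value (batch write for brevity).
packed.select(to_avro(col("event")).as("value"))
  .write.format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "avro-roundtrip")
  .save()

// Read back and deserialize with exactly the same schema.
val roundTrip = spark.read.format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "avro-roundtrip")
  .load()
  .select(from_avro(col("value"), avroSchema).as("event"))
  .select("event.*")
```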