Spark Streaming extracting schema from input data - apache-spark

Is it possible to extract the schema of Kafka input data in Spark Streaming?
Even though I was able to extract the schema from an RDD, the streaming job works fine when there is data in the Kafka topics but fails when it receives an empty RDD.

Data in Kafka is stored as JSON.
JSON is just another format for data written to Kafka. Use the built-in from_json function together with the expected schema to convert the binary value column into a Spark SQL struct.
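For example, here's a minimal sketch (assuming spark is an existing SparkSession; the broker address, topic name and the two JSON fields are placeholders):
import static org.apache.spark.sql.functions.*;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Schema you expect the JSON in the Kafka value to have (placeholder fields)
StructType schema = new StructType()
        .add("id", DataTypes.StringType)
        .add("amount", DataTypes.DoubleType);

Dataset<Row> parsed = spark.readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", "host:9092")  // placeholder broker
        .option("subscribe", "events")                    // placeholder topic
        .load()
        // value arrives as binary; cast to string and parse it with the schema
        .select(from_json(col("value").cast("string"), schema).as("data"))
        .select("data.*");
Because the schema is supplied up front rather than inferred from the data, an empty micro-batch does not break the query.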

Related

Convert Spark SQL DataFrames to Structured Streaming DataFrames

I'd like to convert Java Spark SQL DataFrames to Structured Streaming DataFrames, in such a way that every batch is unioned to the Structured Streaming DataFrame. That way I could use Spark Structured Streaming functionality (such as a continuous job) on DataFrames I've got from a batch source.
This has nothing to do with Java, and the title is a little off-beam.
It is not supported as a standard operation, as you state.
Look in the docs at the foreachBatch implementation (see https://spark.apache.org/docs/3.1.2/structured-streaming-programming-guide.html#foreachbatch) and do the UNION there, after reading the static DataFrame in.
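As a rough sketch of that idea (streamingDf, the static path and the output path are placeholders):
import org.apache.spark.api.java.function.VoidFunction2;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Static DataFrame read once from a batch source (placeholder path)
Dataset<Row> staticDf = spark.read().parquet("/path/to/static");

streamingDf.writeStream()
        .foreachBatch((VoidFunction2<Dataset<Row>, Long>) (batchDf, batchId) -> {
            // UNION each micro-batch with the static data, then write the result out
            batchDf.union(staticDf)
                   .write().mode("append").parquet("/path/to/out");  // placeholder sink
        })
        .start();
foreachBatch hands you each micro-batch as an ordinary batch Dataset, so batch-only operations such as the union with a static DataFrame are available inside the callback.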

XML parsing in Spark Structured Streaming

I'm trying to analyze data using Kinesis source in PySpark Structured Streaming on Databricks.
I created a DataFrame as shown below.
kinDF = spark.readStream.format("kinesis").option("streamName", "test-stream-1").load()
Later I converted the data from base64 encoding as below.
df = kinDF.withColumn("xml_data", expr("CAST(data as string)"))
Now, I need to extract a few fields from the df.xml_data column using XPath. Can you suggest a possible solution?
If I create a dataframe directly for these xml files as xml_df = spark.read.format("xml").options(rowTag='Consumers').load("s3a://bkt/xmldata"), I'm able to query using xpath:
xml_df.select("Analytics.Amount1").show()
But I'm not sure how to extract elements similarly on a Spark Streaming DataFrame where the data is in text format.
Are there any xml functions to convert text data using schema? I saw an example for json data using from_json.
Is it possible to use spark.read on a dataframe column?
I need to find the aggregated "Amount1" for every 5-minute window.
Thanks for your help
You can use com.databricks.spark.xml.XmlReader to read XML data from a column, but it requires an RDD, which means you need to transform your df to an RDD using df.rdd, and that may impact performance.
Below is untested Java code:
import com.databricks.spark.xml.XmlReader;

// Pull the xml_data column out as an RDD of strings, then parse it with XmlReader
JavaRDD<String> xmlRdd = kinDF.select("xml_data").toJavaRDD().map(r -> r.getString(0));
Dataset<Row> xmlDf = new XmlReader().xmlRdd(spark, xmlRdd.rdd());
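Once the XML is parsed into columns (xmlDf above), the 5-minute aggregation on Amount1 can be a standard windowed aggregation. A sketch, assuming a hypothetical event-time column called eventTime:
import static org.apache.spark.sql.functions.*;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Group by 5-minute windows over the (assumed) eventTime column and sum Amount1
Dataset<Row> perWindow = xmlDf
        .groupBy(window(col("eventTime"), "5 minutes"))
        .agg(sum("Analytics.Amount1").alias("total_amount1"));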

Java code for reading Kafka Avro messages in Spark 2.1.1 Structured Streaming

I'm looking for how to read Avro messages with a complex structure from Kafka using Spark Structured Streaming.
I then want to parse these messages, compare them with HBase reference values, and save the outcome into HDFS or another HBase table.
I started with the sample code below:
https://github.com/Neuw84/spark-continuous-streaming/blob/master/src/main/java/es/aconde/structured/StructuredDemo.java
Avro message schema:
struct[mTimeSeries:
struct[cName:string,
eIpAddr:string,
pIpAddr:string,
pTime:string,
mtrcs:array[struct[mName:string,
xValues:array[bigint],
yValues:array[string],
rName:string]]]]
I am struggling to create a row using RowFactory.create for this schema. Do I need to iterate through the array fields? I understand that we can use explode functions on the Dataset to denormalize or access inner fields of the struct array once the Dataset is created with this structure, as I do in Hive. So I would like to create a row as-is, i.e. exactly how the Avro message looks, and then use SQL functions to transform it further.
sparkSession.udf().register("deserialize", (byte[] data) -> {
    GenericRecord record = recordInjection.invert(data).get();
    return RowFactory.create(record.get("machine").toString(), record.get("sensor").toString(), record.get("data"), record.get("eventTime"));
}, DataTypes.createStructType(type.fields()));
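You do need to build the nested structure yourself, array fields included. As an untested sketch, the body of the deserialize lambda could build the row roughly like this for the schema above (it assumes mtrcs, xValues and yValues arrive as java.util.List instances, which is what Avro's GenericData.Array is):
// Imports needed at the top of the file
import java.util.List;
import java.util.stream.Collectors;
import org.apache.avro.generic.GenericRecord;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;

// Inside the deserialize lambda, after recordInjection.invert(data).get():
GenericRecord ts = (GenericRecord) record.get("mTimeSeries");

// Map the mtrcs array of Avro records to a list of Rows (one Row per metric struct)
List<Row> metrics = ((List<GenericRecord>) ts.get("mtrcs")).stream()
        .map(m -> RowFactory.create(
                m.get("mName").toString(),
                ((List<?>) m.get("xValues")).toArray(),               // array[bigint]
                ((List<?>) m.get("yValues")).stream()                 // array[string]: Avro Utf8 -> String
                        .map(Object::toString).collect(Collectors.toList()),
                m.get("rName").toString()))
        .collect(Collectors.toList());

// The outer Row mirrors the single mTimeSeries struct field of the schema
return RowFactory.create(RowFactory.create(
        ts.get("cName").toString(),
        ts.get("eIpAddr").toString(),
        ts.get("pIpAddr").toString(),
        ts.get("pTime").toString(),
        metrics));
Once the Dataset has this nested structure, explode and dot-notation selects can denormalize it, as you describe.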

How to find the schema of values in DStream at runtime?

I use Spark 1.6 and Kafka 0.8.2.1.
I am trying to fetch some data from Kafka using Spark Streaming and do some operations on that data.
For that I need to know the schema of the fetched data. Is there some way to do this, or can we get values from the stream by field name?
TL;DR It's not possible directly (esp. with the old Spark 1.6), but not impossible either.
Kafka sees bytes and that's what Spark Streaming expects. You'd have to somehow pass some extra information on fixed fields to get the schema (possibly as a JSON-encoded string) and decode the other field. It is not available out of the box, but is certainly doable.
As a suggestion, I'd send messages where the value field is always a two-field data structure: the schema (of the value field) and the value itself (in JSON format).
You could then use one of from_json functions:
from_json(e: Column, schema: StructType): Column Parses a column containing a JSON string into a StructType with the specified schema.
Given that from_json was only added in Spark 2.1.0, you'd have to register your own user-defined function (UDF) that deserializes the string value into a corresponding structure (just see how from_json does it and copy it).
Note that DataType object comes with fromJson method that can "map" a JSON-encoded string into a DataType that would describe your schema.
fromJson(json: String): DataType
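For example, a minimal sketch of that last step (the schema JSON and field names are made up):
import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.types.StructType;

// A JSON-encoded schema, as it could travel alongside the value (placeholder fields)
String schemaJson =
    "{\"type\":\"struct\",\"fields\":["
  + "{\"name\":\"id\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},"
  + "{\"name\":\"amount\",\"type\":\"double\",\"nullable\":true,\"metadata\":{}}]}";

// Rebuild the StructType on the consumer side and use it to parse the value field,
// e.g. with sqlContext.read().schema(schema).json(payloadRdd) inside foreachRDD
StructType schema = (StructType) DataType.fromJson(schemaJson);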

Convert Xml to Avro from Kafka to hdfs via spark streaming or flume

I want to convert XML files to Avro. The data will be in XML format and will hit the Kafka topic first. Then, I can use either Flume or Spark Streaming to ingest it, convert it from XML to Avro, and land the files in HDFS. I have a Cloudera environment.
When the avro files hit hdfs, I want the ability to read them into hive tables later.
I was wondering what the best method to do this is. I have tried automated schema conversion such as spark-avro (without spark-streaming), but the problem is that while spark-avro converts the data, Hive cannot read it. spark-avro converts the XML to a DataFrame and then from the DataFrame to Avro, but the resulting Avro file can only be read by my Spark application. I am not sure if I am using this correctly.
I think I will need to define an explicit Avro schema, but I'm not sure how to go about this for the XML file. It has multiple namespaces and is quite massive.
If you are on Cloudera (since you have Flume, you may well have it), you can use Morphlines to do the conversion at record level. It works for both batch and streaming. See here for more info.
