Kafka Spark Streaming json file with varying schema - apache-spark

I have created a pipeline to consume streaming data from a Kafka topic, which is in JSON format. The problem is that the JSON messages have varying schemas, so when I apply a schema to the dataframe, it produces corrupted records.
How can I handle the varying schema in Kafka streaming in Databricks using PySpark?

There are several approaches here, depending on what kind of schema variation is in your topic:
If all schemas have compatible data types (no columns with the same name but different types), then you can just create a schema that is a superset of all schemas and apply that schema when performing from_json (a minimal sketch follows the code below).
If your schemas are incompatible, for example you have fields with the same name but different data types, then you can read the data as text lines first and then decode with different schemas depending on the row type. Suppose that all JSONs have a type field of string type (not tested):
import pyspark.sql.functions as F

df = spark.readStream.format("kafka")....load() \
    .withColumn("value", F.col("value").cast("string"))
# extract only the discriminator field first
df_typed = df.withColumn("jsn", F.from_json(F.col("value"), "type string")) \
    .select("value", "jsn.*")
# decode rows of one type with their specific schema
df1 = df_typed.filter(F.col("type") == some_value) \
    .withColumn("jsn", F.from_json(F.col("value"), schema1)).select("jsn.*")
# process df1
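For the first (superset schema) approach, a minimal sketch, assuming messages with compatible types; all field names, the topic, and the bootstrap servers below are placeholders:

import pyspark.sql.functions as F
from pyspark.sql.types import StructType, StructField, StringType, LongType

# Superset of all variants: every field that can appear in any message is
# declared once; fields missing from a given message simply come back as null.
superset_schema = StructType([
    StructField("type", StringType()),
    StructField("id", LongType()),
    StructField("payload", StringType()),
    StructField("extra_field", StringType()),  # present only in some variants
])

df_all = (spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "host:9092")  # placeholder
    .option("subscribe", "topic")                    # placeholder
    .load()
    .withColumn("value", F.col("value").cast("string"))
    .withColumn("jsn", F.from_json(F.col("value"), superset_schema))
    .select("jsn.*"))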

Related

Spark Streaming extracting schema from input data

Is it possible to extract the schema of Kafka input data in Spark Streaming?
Even if I extract the schema from an RDD, streaming works fine when there is data in the Kafka topics, but it fails when there is an empty RDD.
Data in Kafka is stored as JSON.
JSON is another format for data that is written to Kafka. You can use the built-in from_json function along with the expected schema to convert a binary value into a Spark SQL struct.
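A minimal sketch of that, assuming kafka_df is a dataframe read from Kafka and the expected schema is known (the field names here are made up):

import pyspark.sql.functions as F

# Expected schema of the JSON payload, given here as a DDL string (made-up fields)
expected_schema = "event STRING, ts LONG"

# The Kafka value column is binary: cast it to string, then parse it with the schema
parsed = kafka_df.select(
    F.from_json(F.col("value").cast("string"), expected_schema).alias("data")
).select("data.*")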

How to select columns that contain any of the given strings as part of the column name in Pyspark [duplicate]

What is the most efficient way to read only a subset of columns in Spark from a parquet file that has many columns? Is using spark.read.format("parquet").load(<parquet>).select(...col1, col2) the best way to do that? I would also prefer to use a typesafe dataset with case classes to pre-define my schema, but I am not sure.
val df = spark.read.parquet("fs://path/file.parquet").select(...)
This will only read the corresponding columns. Indeed, Parquet is a columnar storage format and it is exactly meant for this type of use case. Try running df.explain and Spark will tell you that only the corresponding columns are read (it prints the execution plan). explain would also tell you which filters are pushed down to the physical plan of execution in case you also use a where condition. Finally, use the following code to convert the dataframe (a dataset of rows) to a dataset of your case class.
case class MyData...
val ds = df.as[MyData]
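To verify the column pruning described above, something along these lines works (PySpark syntax; the path and column names are placeholders):

df = spark.read.parquet("fs://path/file.parquet").select("col1", "col2")
# The FileScan node in the printed plan should list only col1 and col2 in ReadSchema,
# and any where() condition shows up under PushedFilters.
df.explain()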
At least in some cases, getting a dataframe with all columns and then selecting a subset won't work. E.g., the following will fail if the parquet file contains at least one field with a type that is not supported by Spark:
spark.read.format("parquet").load("<path_to_file>").select("col1", "col2")
One solution is to provide a schema that contains only the requested columns to load:
spark.read.format("parquet").load("<path_to_file>",
    schema="col1 bigint, col2 float")
Using this you will be able to load a subset of Spark-supported parquet columns even if loading the full file is not possible. I'm using PySpark here, but would expect the Scala version to have something similar.
Spark supports pushdowns with Parquet so
load(<parquet>).select(...col1, col2)
is fine.
I would also prefer to use typesafe dataset with case classes to pre-define my schema but not sure.
This could be an issue, as it looks like some optimizations don't work in this context (see Spark 2.0 Dataset vs DataFrame).
Parquet is a columnar file format. It is designed exactly for this kind of use case.
val df = spark.read.parquet("<PATH_TO_FILE>").select(...)
should do the job for you.

Java code for reading Kafka avro messages in spark 2.1.1 structure streaming

I'm looking for how to read Avro messages with a complex structure from Kafka using Spark Structured Streaming.
I then want to parse these messages, compare them with HBase reference values, and then save the outcome into HDFS or another HBase table.
I started with the sample code below:
https://github.com/Neuw84/spark-continuous-streaming/blob/master/src/main/java/es/aconde/structured/StructuredDemo.java
Avro message schema:
struct[mTimeSeries:
struct[cName:string,
eIpAddr:string,
pIpAddr:string,
pTime:string,
mtrcs:array[struct[mName:string,
xValues:array[bigint],
yValues:array[string],
rName:string]]]]
I am struggling to create a row using RowFactory.create for this schema. Do I need to iterate through the array fields? I understand that we can use explode functions on the dataset to denormalize or access the inner fields of the struct array once we create a dataset with this structure, as I do in Hive. So I would like to create a row as-is, i.e. exactly how the Avro message looks, and then use SQL functions to further transform it.
sparkSession.udf().register("deserialize", (byte[] data) -> {
    GenericRecord record = recordInjection.invert(data).get();
    return RowFactory.create(record.get("machine").toString(), record.get("sensor").toString(), record.get("data"), record.get("eventTime"));
}, DataTypes.createStructType(type.fields()));
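For reference, the explode step mentioned in the question could look roughly like this once the stream has been deserialized into that structure (PySpark syntax for brevity; column names follow the schema above):

import pyspark.sql.functions as F

# df is assumed to already contain the deserialized mTimeSeries struct column
metrics = (df
    .select("mTimeSeries.cName", "mTimeSeries.pTime",
            F.explode("mTimeSeries.mtrcs").alias("m"))
    .select("cName", "pTime", "m.mName", "m.xValues", "m.yValues", "m.rName"))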

Applied Schema on JSON kafka topic gives all null fields

I'm using the hortonworks suite of tools and trying to parse json data that is coming in from a kafka topic into a dataframe. However, when I query the in-memory table, the schema of the dataframe seems to be correct, but all the values are null and I don't really know why.
The json data going into the kafka topic looks as such:
{"index":"0","Conrad":"Persevering system-worthy intranet","address":"8905 Robert Prairie\nJoefort, LA 41089","bs":"envisioneer web-enabled mindshare","city":"Davidland","date_time":"1977-06-26 06:12:48","email":"eric56#parker-robinson.com","paragraph":"Kristine Nash","randomdata":"Growth special factor bit only. Thing agent follow moment seat. Nothing agree that up view write include.","state":"1030.0"}
The code in my Zeppelin notebook is as such:
%dep
z.load("org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.1")
%pyspark
#Defining my schema
from pyspark.sql.types import StructType, StringType, LongType, IntegerType
from pyspark.sql.functions import from_json, col
schema = StructType().add("index", IntegerType()).add("Conrad", StringType()).add("address",StringType()).add("bs",StringType()).add("city",StringType()).add("date_time",LongType()).add("email",StringType()).add("name",StringType()).add("paragraph",StringType()).add("randomdata",IntegerType()).add("state",StringType())
# Read data from kafka topic
lines = spark.readStream.format("kafka").option("kafka.bootstrap.servers","x.x.x.x:2181").option("startingOffsets", "latest").option("subscribe","testdata").load().select(from_json(col("value").cast("string"), schema).alias("parsed_value"))
# Start the stream and query the in-memory table
query=lines.writeStream.format("memory").queryName("t10").start()
raw= spark.sql("select parsed_value.* from t10")
I am currently explicitly defining the schema, but my ultimate goal is to get the Avro schema from the Hortonworks Schema Registry. It would be good if someone could show me how to do this as well.
Thanks!

How to find the schema of values in DStream at runtime?

I use Spark 1.6 and Kafka 0.8.2.1.
I am trying to fetch some data from Kafka using Spark Streaming and do some operations on that data.
For that I need to know the schema of the fetched data. Is there some way to do this, or can we get values from the stream by using field names?
TL;DR It's not possible directly (esp. with the old Spark 1.6), but not impossible either.
Kafka sees bytes and that's what Spark Streaming expects. You'd have to somehow pass some extra information on fixed fields to get the schema (possibly as a JSON-encoded string) and decode the other field. It is not available out of the box, but is certainly doable.
As a suggestion, I'd send a message where the value field would always be a two-field data structure with the schema (of the value field) and the value itself (in JSON format).
You could then use one of from_json functions:
from_json(e: Column, schema: StructType): Column Parses a column containing a JSON string into a StructType with the specified schema.
Given that from_json was only added in Spark 2.1.0, on Spark 1.6 you'd have to register your own custom user-defined function (UDF) that would deserialize the string value into a corresponding structure (just see how from_json does it and copy it).
Note that DataType object comes with fromJson method that can "map" a JSON-encoded string into a DataType that would describe your schema.
fromJson(json: String): DataType
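As a rough illustration of that envelope idea in PySpark (Spark 2.x syntax rather than the 1.6 DStream API; the envelope layout and field names are assumptions): the schema travels with each message as a JSON-encoded string and is rebuilt with StructType.fromJson before the value is parsed.

import json
import pyspark.sql.functions as F
from pyspark.sql.types import StructType

# Hypothetical two-field envelope written to Kafka: schema + value, both as JSON strings
envelope = {
    "schema": '{"type":"struct","fields":['
              '{"name":"id","type":"long","nullable":true,"metadata":{}},'
              '{"name":"name","type":"string","nullable":true,"metadata":{}}]}',
    "value": '{"id": 1, "name": "a"}',
}

# Rebuild the Spark schema from its JSON description
value_schema = StructType.fromJson(json.loads(envelope["schema"]))

# With the schema in hand, the value part can be parsed (from_json needs Spark 2.1+)
df = spark.createDataFrame([(envelope["value"],)], ["value"])
df.select(F.from_json("value", value_schema).alias("v")).select("v.*").show()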
