I'm using the Hortonworks suite of tools and trying to parse JSON data coming in from a Kafka topic into a dataframe. However, when I query the in-memory table, the schema of the dataframe seems to be correct, but all the values are null and I don't really know why.
The JSON data going into the Kafka topic looks like this:
{"index":"0","Conrad":"Persevering system-worthy intranet","address":"8905 Robert Prairie\nJoefort, LA 41089","bs":"envisioneer web-enabled mindshare","city":"Davidland","date_time":"1977-06-26 06:12:48","email":"eric56#parker-robinson.com","paragraph":"Kristine Nash","randomdata":"Growth special factor bit only. Thing agent follow moment seat. Nothing agree that up view write include.","state":"1030.0"}
The code in my Zeppelin notebook is as follows:
%dep
z.load("org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.1")
%pyspark
#Defining my schema
from pyspark.sql.types import StructType, StringType, LongType, IntegerType
schema = (StructType()
    .add("index", IntegerType())
    .add("Conrad", StringType())
    .add("address", StringType())
    .add("bs", StringType())
    .add("city", StringType())
    .add("date_time", LongType())
    .add("email", StringType())
    .add("name", StringType())
    .add("paragraph", StringType())
    .add("randomdata", IntegerType())
    .add("state", StringType()))
# Read data from kafka topic
from pyspark.sql.functions import from_json, col
lines = (spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "x.x.x.x:2181")
    .option("startingOffsets", "latest")
    .option("subscribe", "testdata")
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("parsed_value")))
# Start the stream and query the in-memory table
query = lines.writeStream.format("memory").queryName("t10").start()
raw = spark.sql("select parsed_value.* from t10")
I am currently defining the schema explicitly, but my ultimate goal is to get the Avro schema from the Hortonworks Schema Registry. It would be great if someone could show me how to do that as well.
Thanks!
Related
I have created a pipeline to consume streaming data in JSON format from a Kafka topic. The problem is that the JSON messages have varying schemas, so when I apply a schema to the dataframe, it produces corrupted records.
How can I handle the varying schemas when streaming from Kafka in Databricks using PySpark?
There are several approaches here, depending on what kind of schema variation is in your topic:
If all schemas have compatible data types (no columns with the same name but different types), then you can just create a schema that is a superset of all schemas and apply that schema when performing from_json.
If your schemas are incompatible, for example fields with the same name but different data types, then you can read the data as text lines first and then decode each line with a different schema depending on its row type. Suppose that all JSONs have a type field of string type (not tested):
import pyspark.sql.functions as F

# read the raw Kafka payload and decode the value column to a string
df = spark.readStream.format("kafka")....load()\
    .withColumn("value", F.col("value").cast("string"))

# first pass: extract only the discriminator field
df_typed = df.withColumn("jsn", F.from_json(F.col("value"), "type string"))\
    .select("value", "jsn.*")

# second pass: decode rows of a given type with their own schema
df1 = df_typed.filter("type = 'some_value'")\
    .withColumn("jsn", F.from_json(F.col("value"), schema1)).select("jsn.*")
# process df1
I'm looking for a way to read Avro messages with a complex structure from Kafka using Spark Structured Streaming.
I then want to parse these messages, compare them with reference values in HBase, and then save the outcome into HDFS or another HBase table.
I started with the sample code below:
https://github.com/Neuw84/spark-continuous-streaming/blob/master/src/main/java/es/aconde/structured/StructuredDemo.java
Avro message schema:
struct[mTimeSeries:
struct[cName:string,
eIpAddr:string,
pIpAddr:string,
pTime:string,
mtrcs:array[struct[mName:string,
xValues:array[bigint],
yValues:array[string],
rName:string]]]]
I am struggling to create a row using RowFactory.create for this schema. Do I need to iterate through the array fields? I understand that we can use explode functions on the dataset to denormalize or access the inner fields of the struct array once we create a dataset with this structure, as I do in Hive. So I would like to create a row as-is, i.e. exactly how the Avro message looks, and then use SQL functions to transform it further.
sparkSession.udf().register("deserialize", (byte[] data) -> {
    GenericRecord record = recordInjection.invert(data).get();
    return RowFactory.create(record.get("machine").toString(), record.get("sensor").toString(), record.get("data"), record.get("eventTime"));
}, DataTypes.createStructType(type.fields()));
I have a Spark Structured Streaming application that reads from Kafka.
Here is the basic structure of my code.
I create the Spark session.
val spark = SparkSession
.builder
.appName("app_name")
.getOrCreate()
Then I read from the stream
val data_stream = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "server_list")
.option("subscribe", "topic")
.load()
From the Kafka record, I cast the "value" field to a string, which converts it from binary to string. At this point there is one column in the dataframe:
val df = data_stream
.select($"value".cast("string") as "json")
Based on a pre-defined schema, I try to parse the JSON structure out into columns. The problem is that if the data is "bad" or in a different format, it doesn't match the defined schema, so the next dataframe (df2) gets null values in its columns.
val df2 = df.select(from_json($"json", schema) as "data")
.select("data.*")
I'd like to be able to filter out of df2 the rows that have null in a certain column (one that I use as a primary key in a database), i.e. ignore bad data that doesn't match the schema.
EDIT: I was somewhat able to accomplish this but not the way I intended to.
In my pipeline, I use a query that writes with .foreach(writer). This opens a connection to a database, processes each row, and then closes the connection; the Structured Streaming documentation describes what this process requires. In the process method, I get the values from each row and check whether my primary key is null; if it is, I don't insert the row into the database.
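For reference, a rough sketch of that writer (untested; the JDBC URL, credentials, and the "primaryKey" column name are placeholders for my actual setup):
import org.apache.spark.sql.{ForeachWriter, Row}

val writer = new ForeachWriter[Row] {
  var connection: java.sql.Connection = _

  override def open(partitionId: Long, version: Long): Boolean = {
    // open the database connection once per partition/epoch
    connection = java.sql.DriverManager.getConnection("jdbc:...", "user", "password") // placeholder
    true
  }

  override def process(row: Row): Unit = {
    // skip rows whose primary-key column came back null from from_json
    if (!row.isNullAt(row.fieldIndex("primaryKey"))) {
      // insert the row into the database here
    }
  }

  override def close(errorOrNull: Throwable): Unit = {
    if (connection != null) connection.close()
  }
}

val query = df2.writeStream.foreach(writer).start()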
Just filter out any null values you don't want:
df2
  .filter(row => row.getAs[Any]("colName") != null)
// or, with the Column API: df2.filter($"colName".isNotNull)
Kafka stores data as a raw byte array. Data producers and consumers need to agree on the structure of the data for processing.
If the produced message format changes, the consumer needs to adjust to read the new format. The problem comes when your data structure is evolving: you may need to maintain compatibility on the consumer side.
Defining the message format with Protobuf solves this problem.
I read that Spark Structured Streaming doesn't support schema inference for reading Kafka messages as JSON. Is there a way to retrieve the schema the same way Spark Streaming does?
val dataFrame = spark.read.json(rdd.map(_.value()))
dataFrame.printSchema
Here is one possible way to do this:
Before you start streaming, get a small batch of the data from Kafka
Infer the schema from the small batch
Start streaming the data using the extracted schema.
The pseudo-code below illustrates this approach.
Step 1:
Extract a small (two records) batch from Kafka,
val smallBatch = spark.read.format("kafka")
.option("kafka.bootstrap.servers", "node:9092")
.option("subscribe", "topicName")
.option("startingOffsets", "earliest")
.option("endingOffsets", """{"topicName":{"0":2}}""")
.load()
.selectExpr("CAST(value AS STRING) as STRING").as[String].toDF()
Step 2:
Write the small batch to a file:
smallBatch.write.mode("overwrite").format("text").save("/batch")
This command writes the small batch into the HDFS directory /batch. The file it creates is named part-xyz*, so you first need to rename it using the Hadoop FileSystem API (see org.apache.hadoop.fs._ and org.apache.hadoop.conf.Configuration; here's an example: https://stackoverflow.com/a/41990859) and then read the file as JSON:
val smallBatchSchema = spark.read.json("/batch/batchName.txt").schema
Here, batchName.txt is the new name of the file and smallBatchSchema contains the schema inferred from the small batch.
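For reference, the rename step might look roughly like this (a sketch only; the part-file glob and target name follow the example above and may need adjusting for your environment):
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
// pick up the single part-file written by the small batch and give it a stable name
val partFile = fs.globStatus(new Path("/batch/part*"))(0).getPath
fs.rename(partFile, new Path("/batch/batchName.txt"))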
Finally, you can stream the data as follows (Step 3):
val inputDf = spark.readStream.format("kafka")
.option("kafka.bootstrap.servers", "node:9092")
.option("subscribe", "topicName")
.option("startingOffsets", "earliest")
.load()
val dataDf = inputDf.selectExpr("CAST(value AS STRING) as json")
.select( from_json($"json", schema=smallBatchSchema).as("data"))
.select("data.*")
Hope this helps!
It is possible using this construct:
myStream = spark.readStream.schema(spark.read.json("my_sample_json_file_as_schema.json").schema).json("my_json_file")
How can this be? Well, since spark.read.json("..").schema returns exactly the inferred schema you want, you can use this returned schema as an argument for the mandatory schema parameter of spark.readStream.
What I did was to specify a one-line sample JSON as input for inferring the schema, so it doesn't unnecessarily take up memory. If your data changes, simply update the sample JSON.
It took me a while to figure out (constructing StructTypes and StructFields by hand was a pain in the ..), so I'll be happy for all upvotes :-)
It is not possible. Structured Streaming supports limited schema inference for development use, with spark.sql.streaming.schemaInference set to true:
By default, Structured Streaming from file based sources requires you to specify the schema, rather than rely on Spark to infer it automatically. This restriction ensures a consistent schema will be used for the streaming query, even in the case of failures. For ad-hoc use cases, you can reenable schema inference by setting spark.sql.streaming.schemaInference to true.
but it cannot be used to extract JSON from Kafka messages and DataFrameReader.json doesn't support streaming Datasets as arguments.
You have to provide the schema manually, as described in How to read records in JSON format from Kafka using Structured Streaming?
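For completeness, a minimal sketch of what that limited inference looks like for a file-based source (the path is hypothetical); it does not help with the Kafka source:
spark.conf.set("spark.sql.streaming.schemaInference", "true")

// file source only: the schema is inferred from the files already in the directory
val fileStream = spark.readStream
  .format("json")
  .load("/path/to/json/dir")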
It is possible to convert JSON to a DataFrame without having to manually type the schema, if that is what you meant to ask.
Recently I ran into a situation where I was receiving massively long nested JSON packets via Kafka, and manually typing the schema would have been both cumbersome and error-prone.
With a small sample of the data and some trickery you can provide the schema to Spark2+ as follows:
import spark.implicits._ // needed for .toDS below
val jsonstr = """ copy paste a representative sample of data here"""
val jsondf = spark.read.json(Seq(jsonstr).toDS) //jsondf.schema has the nested json structure we need
val event = spark.readStream.format..option...load() //configure your source
val eventWithSchema = event.select($"value" cast "string" as "json")
  .select(from_json($"json", jsondf.schema) as "data").select("data.*")
Now you can do whatever you want with this val, just as you would with Direct Streaming: create a temp view, run SQL queries, whatever.
Taking Arnon's solution a step further (since it's deprecated in newer Spark versions and would require iterating over the whole dataframe just for a type cast):
spark.read.json(df.as[String])
Anyway, as of now it's still experimental.
I have existing Hive data stored in Avro format. For whatever reason, reading this data by executing a SELECT is very slow, and I haven't figured out why yet. The data is partitioned, and my WHERE clause always follows the partition columns. So I decided to read the data directly by navigating to the partition path and using the Spark SQLContext, which works much faster. However, the problem I have is reading the DOUBLE values: Avro stores them in a binary format.
When I execute the following query in Hive:
select myDoubleValue from myTable;
I'm getting the correct expected values
841.79
4435.13
.....
but the following Spark code:
import org.apache.spark.sql.SQLContext
import com.databricks.spark.avro._ // provides .avro(...) on DataFrameReader

val path = "PathToMyPartition"
val sqlContext = new SQLContext(sc)
val df = sqlContext.read.avro(path)
df.select("myDoubleValue").rdd.map(x => x.getAs[Double](0))
gives me this exception
java.lang.ClassCastException : [B cannot be cast to java.lang.Double
What would be the right way to either provide a schema or convert the value stored in binary format into a double?
I found a partial solution for converting the Avro schema to a Spark SQL StructType. com.databricks.spark.avro.SchemaConverters, developed by Databricks, has a bug in its toSqlType(avroSchema: Schema) method, which was incorrectly converting the logicalType
{"name":"MyDecimalField","type":["null",{"type":"bytes","logicalType":"decimal","precision":38,"scale":18}],"doc":"","default":null}
into
StructField("MyDecimalField",BinaryType,true)
I fixed this bug in my local version of the code and now it is converting into
StructField("MyDecimalField",DecimalType(38,18),true)
Now, the following code reads the Avro file and creates a Dataframe:
val avroSchema = new Schema.Parser().parse(QueryProvider.getQueryString(pathSchema))
val sqlContext = new SQLContext(sc)
val df = sqlContext.read
  .schema(MyAvroSchemaConverter.toSqlType(avroSchema).dataType.asInstanceOf[StructType])
  .avro(path)
However, when I select the field that I expect to be decimal with
df.select("MyDecimalField")
I'm getting the following exception:
scala.MatchError: [B#3e6e0d8f (of class [B)
This is where I'm stuck at the moment, and I would appreciate it if anyone could suggest what to do next or any other workaround.
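One direction I'm considering (untested) is to leave the field as BinaryType, i.e. read it with the unmodified converter, and decode the Avro decimal bytes myself, since the decimal logical type stores a big-endian two's-complement unscaled value plus the scale declared in the schema (18 here). A rough sketch:
import java.math.{BigDecimal => JBigDecimal, BigInteger}
import org.apache.spark.sql.functions.udf

// decode the raw Avro decimal bytes: unscaled big-endian integer + scale 18 from the schema
val bytesToDecimal = udf((bytes: Array[Byte]) =>
  if (bytes == null) null else new JBigDecimal(new BigInteger(bytes), 18))

val withDecimal = df.withColumn("MyDecimalField", bytesToDecimal(df("MyDecimalField")))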