I am having a problem deserializing data from an .avro file. My process consists of these steps:
Reading from Kafka
df = (
spark.read.format("kafka")
.option("kafka.security.protocol", "PLAINTEXT")
.option("kafka.sasl.mechanism", "GSSAPI")
.option("kafka.bootstrap.servers", KAFKA_BROKERS)
.option("subscribe", KAKFA_TOPIC)
.option("group.id", "noid")
.option("startingOffsets", "earliest")
.option("failOnDataLoss", "false")
.option("maxOffsetsPerTrigger", 2)
.load()
.selectExpr(
"CAST(value AS binary)"
)
)
Writing to HDFS
df = df.select("value")
df.write.format("avro").save("path/to/avro_files")
Loading the .avro file
schema = """{
"type":"record",
"name":"GPRSIOT",
"fields":[
{
"name":"field_1",
"type":"string",
"default":"null"
}
...
]
}"""
dff = spark.read.format("avro") \
.option("avroSchema", schema) \
.load("path/to/avro_files/part-00000-ff1186a6-dce4-492e-87b8-8bb4332f4bc1-c000.avro")
The problem is that I get nulls as values.
+-------+-------+-------+
|field_1|field_2|field_3|
+-------+-------+-------+
|null |0 |null |
+-------+-------+-------+
A small part of my .avro file:
Ɖ�v|y
f�30387f��������������������������������������Z�19807F� ؖ��1.2�A0Z��"�b
0FDDZv�
169r&Z��~�~�~�~�~�~�~�~�~�~�~�~�~�~�~�~�~�~�~Z~32210F�+��ߘv~y�1.1Z}*�L2FF3E�N�
�~841576��+��+��+��+��+��+��+��+��+��+��+��+��+��+��+��+��+��+��+��+05321F�
Any ideas on how to deserialize an avro file with real values instead of nulls?
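Worth noting: df.select("value").write.format("avro") stores each raw Kafka payload as a single binary value field inside the Avro container files, so reading those files back with the GPRSIOT record schema cannot line up with what was actually written, which is a likely explanation for the nulls/defaults. Below is a minimal, hypothetical sketch of decoding the payloads with from_avro after reading the files back; it assumes Spark 3.0+ (for pyspark.sql.avro.functions) and that the producer wrote plain binary-encoded Avro without a Confluent schema-registry header. The trimmed schema and the path are illustrative.
from pyspark.sql.avro.functions import from_avro
# Writer schema of the original Kafka messages (trimmed to one field for the sketch).
value_schema = """{
  "type": "record",
  "name": "GPRSIOT",
  "fields": [
    {"name": "field_1", "type": ["null", "string"], "default": null}
  ]
}"""
# Read the saved files with their own schema: this yields one binary column named "value".
raw = spark.read.format("avro").load("path/to/avro_files")
# Decode each payload using the writer schema, then flatten the resulting struct.
decoded = (raw
    .select(from_avro("value", value_schema).alias("record"))
    .select("record.*"))
decoded.show(truncate=False)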
Related
My DataFrame df looks like
[Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')]
I am creating a Kafka sink for streaming queries, but I receive nothing from Kafka. Why?
ds = df \
.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") \
.writeStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "host1:port1,host2:port2") \
.option("topic", "topic1") \
.start()
You will not receive anything from Kafka because, based on your code, you are trying to select the columns key and value from a DataFrame that only has the columns age and name. You need to select them as shown below.
Also, you do not need writeStream if your DataFrame is static; in that case you should use write and save.
import org.apache.spark.sql.functions.{col, struct, to_json}
import org.apache.spark.sql.SparkSession
object Main extends App {
val spark = SparkSession.builder()
.appName("myAppName")
.master("local[*]")
.getOrCreate()
// create DataFrame
import spark.implicits._
val df = Seq((3, "Alice"), (5, "Bob")).toDF("age", "name")
df.show(false)
// +---+-----+
// |age|name |
// +---+-----+
// |3 |Alice|
// |5 |Bob |
// +---+-----+
// write to Kafka as is with "age" as key and "name" as value
df.selectExpr("CAST(age AS STRING) as key", "CAST(name AS STRING) as value")
.write
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("topic", "test-topic")
.save()
If you want to store your data as a JSON string, you can apply the following:
// convert columns into json string
val df2 = df.select(col("name"),to_json(struct($"*"))).toDF("key", "value")
df2.show(false)
// +-----+------------------------+
// |key |value |
// +-----+------------------------+
// |Alice|{"age":3,"name":"Alice"}|
// |Bob |{"age":5,"name":"Bob"} |
// +-----+------------------------+
My end goal is to write the aggregated data out to a new Kafka topic and read it back, in the batches in which it gets processed. I followed the official documentation and a couple of other posts, but no luck. I first read the topic, perform the aggregation, save the results to another Kafka topic, and then read that topic and print it to the console. Below is my code:
package com.sparkKafka
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming._
import scala.concurrent.duration._
object SparkKafkaTopic3 {
def main(ar: Array[String]) {
val spark = SparkSession.builder().appName("SparkKafka").master("local[*]").getOrCreate()
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "songDemo5")
.option("startingOffsets", "earliest")
.load()
import spark.implicits._
df.printSchema()
val newDf = df.select($"value".cast("string"), $"timestamp").select(split(col("value"), ",")(0).as("userName"), split(col("value"), ",")(1).as("songName"), col("timestamp"))
val windowedCount = newDf
.withWatermark("timestamp", "40000 milliseconds")
.groupBy(
window(col("timestamp"), "20 seconds"), col("songName"))
.agg(count(col("songName")).alias("numberOfTimes"))
val outputTopic = windowedCount
.select(struct("*").cast("string").as("value")) // Added this line.
.writeStream
.format("kafka")
.option("topic", "songDemo6")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("checkpointLocation", "/tmp/spark_ss/")
.start()
val finalOutput = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "songDemo6").option("startingOffsets", "earliest")
.load()
.writeStream.format("console")
.outputMode("append").start()
spark.streams.awaitAnyTermination()
}
}
When I run this, I initially get the exception below in the console:
java.lang.IllegalStateException: Cannot find earliest offsets of Set(songDemo4-0). Some data may have been missed.
Some data may have been lost because they are not available in Kafka any more; either the
data was aged out by Kafka or the topic may have been deleted before all the data in the
topic was processed. If you don't want your streaming query to fail on such cases, set the
source option "failOnDataLoss" to "false".
Also, if I run this code without the part that writes to the topic and reads it back, everything works fine.
I tried to read the topic from the shell using the console consumer command, but no records are displayed. Is there anything I am missing here?
Below is my dataset:
>sid,Believer
>sid,Thunder
>sid,Stairway to heaven
>sid,Heaven
>sid,Heaven
>sid,thunder
>sid,Believer
When I ran @Srinivas's code and read the new topic, I got data as below:
[[2020-06-07 18:18:40, 2020-06-07 18:19:00], Heaven, 1]
[[2020-06-07 18:17:00, 2020-06-07 18:17:20], Believer, 1]
[[2020-06-07 18:18:40, 2020-06-07 18:19:00], Heaven, 1]
[[2020-06-07 18:17:00, 2020-06-07 18:17:20], Believer, 1]
[[2020-06-07 18:17:00, 2020-06-07 18:17:20], Stairway to heaven, 1]
[[2020-06-07 18:40:40, 2020-06-07 18:41:00], Heaven, 1]
[[2020-06-07 18:17:00, 2020-06-07 18:17:20], Thunder, 1]
Here you can see that for Believer the window frame is the same, but the entries are still separate. Why is that? It should be a single entry with a count of 2, since the window frame is the same.
Check the code below.
I added windowedCount.select(struct("*").cast("string").as("value")): before you write anything to Kafka, you have to convert all columns into a single string column whose alias is value.
val spark = SparkSession.builder().appName("SparkKafka").master("local[*]").getOrCreate()
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "songDemo")
.option("startingOffsets", "earliest")
.load()
import spark.implicits._
df.printSchema()
val newDf = df.select($"value".cast("string"),$"timestamp").select(split(col("value"), ",")(0).as("userName"), split(col("value"), ",")(1).as("songName"), col("timestamp"))
val windowedCount = newDf
.withWatermark("timestamp", "40000 milliseconds")
.groupBy(
window(col("timestamp"), "20 seconds"), col("songName"))
.agg(count(col("songName")).alias("numberOfTimes"))
val outputTopic = windowedCount
.select(struct("*").cast("string").as("value")) // Added this line.
.writeStream
.format("kafka")
.option("topic", "songDemoA")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("checkpointLocation", "/tmp/spark_ss/")
.start()
val finalOutput = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "songDemoA").option("startingOffsets", "earliest")
.load()
.writeStream.format("console")
.outputMode("append").start()
spark.streams.awaitAnyTermination()
Updated - Ordering Output
val windowedCount = newDf
.withWatermark("timestamp", "40000 milliseconds")
.groupBy(
window(col("timestamp"), "20 seconds"), col("songName"))
.agg(count(col("songName")).alias("numberOfTimes"))
.orderBy($"window.start".asc) // Add this line if you want order.
Ordering or sorting the result works only if the output mode is complete; for any other output mode it will throw an error.
For example, check the code below.
val outputTopic = windowedCount
.writeStream
.format("console")
.option("truncate","false")
.outputMode("complete")
.start()
I am reading data from Kafka in Spark (Structured Streaming), but the data arriving in Spark from Kafka is not in string format.
Spark: 2.3.4
Kafka Data format:
{"Patient_ID":316,"Name":"Richa","MobileNo":{"long":7049123177},"BDate":{"int":740},"Gender":"female"}
Here is the code for reading from Kafka into Spark Structured Streaming:
# spark-submit --jars kafka-clients-0.10.0.1.jar --packages org.apache.spark:spark-avro_2.11:2.4.0,org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.0,org.apache.spark:spark-streaming-kafka-0-8-assembly_2.11:2.3.4,org.apache.spark:spark-streaming-kafka-0-8_2.11:2.2.0 /home/kinjalpatel/kafka_sppark.py
import pyspark
from pyspark import SparkContext
from pyspark.sql.session import SparkSession
from pyspark.sql.types import *
from pyspark.sql.functions import *
import json
from pyspark.sql.functions import from_json, col, struct
from pyspark.sql.types import StructField, StructType, StringType, DoubleType
from confluent_kafka.avro.serializer.message_serializer import MessageSerializer
from confluent_kafka.avro.cached_schema_registry_client import CachedSchemaRegistryClient
from pyspark.sql.column import Column, _to_java_column
sc = SparkContext()
sc.setLogLevel("ERROR")
spark = SparkSession(sc)
schema_registry_client = CachedSchemaRegistryClient(
url='http://localhost:8081')
serializer = MessageSerializer(schema_registry_client)
df = spark.readStream.format("kafka") \
.option("kafka.bootstrap.servers", "localhost:9092") \
.option("subscribe", "mysql-01-Patient") \
.option("partition.assignment.strategy", "range") \
.option("valueConverter", "org.apache.spark.examples.pythonconverters.AvroWrapperToJavaConverter") \
.load()
df.printSchema()
mta_stream=df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "CAST(topic AS STRING)", "CAST(partition AS STRING)", "CAST(offset AS STRING)", "CAST(timestamp AS STRING)", "CAST(timestampType AS STRING)")
mta_stream.printSchema()
qry = mta_stream.writeStream.outputMode("append").format("console").start()
qry.awaitTermination()
This is the output I get:
+----+--------------------+----------------+---------+------+--------------------+-------------+
| key| value| topic|partition|offset| timestamp|timestampType|
+----+--------------------+----------------+---------+------+--------------------+-------------+
|null|�
Richa���...|mysql-01-Patient| 0| 160|2019-12-27 11:56:...| 0|
+----+--------------------+----------------+---------+------+--------------------+-------------+
How do I get the value column in string format?
From the Spark documentation:
import java.nio.file.{Files, Paths}
import org.apache.spark.sql.avro._
// `from_avro` requires Avro schema in JSON string format.
val jsonFormatSchema = new String(Files.readAllBytes(Paths.get("./examples/src/main/resources/user.avsc" )))
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("subscribe", "topic1")
.load()
val output = df
.select(from_avro('value, jsonFormatSchema) as 'user)
.where("user.favorite_color == \"red\"")
.select(to_avro($"user.name") as 'value)
val query = output
.writeStream
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("topic", "topic2")
.start()
From the Databricks documentation:
import org.apache.spark.sql.avro._
import org.apache.avro.SchemaBuilder
// When reading the key and value of a Kafka topic, decode the
// binary (Avro) data into structured data.
// The schema of the resulting DataFrame is: <key: string, value: int>
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", servers)
.option("subscribe", "t")
.load()
.select(
from_avro($"key", SchemaBuilder.builder().stringType()).as("key"),
from_avro($"value", SchemaBuilder.builder().intType()).as("value"))
For reading Avro messages from a Kafka topic and parsing them in PySpark Structured Streaming, there are no direct libraries. But we can read/parse the Avro messages by writing a small wrapper and calling that function as a UDF in your PySpark streaming code, as sketched below.
Please refer:
Reading avro messages from Kafka in spark streaming/structured streaming
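A minimal sketch of that wrapper approach, assuming the messages use the Confluent wire format and a schema registry at http://localhost:8081 as in the question's code; the function name decode_avro, the JSON-string return type, and the per-call client creation are illustrative choices, not a fixed API.
import json
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType
SCHEMA_REGISTRY_URL = "http://localhost:8081"
def decode_avro(payload):
    # Build the client inside the function so the UDF can run on executors
    # without pickling the driver-side client (a real job would cache it).
    from confluent_kafka.avro.cached_schema_registry_client import CachedSchemaRegistryClient
    from confluent_kafka.avro.serializer.message_serializer import MessageSerializer
    if payload is None:
        return None
    serializer = MessageSerializer(CachedSchemaRegistryClient(url=SCHEMA_REGISTRY_URL))
    # decode_message understands the Confluent wire format (magic byte + schema id + Avro body).
    return json.dumps(serializer.decode_message(bytes(payload)))
decode_avro_udf = udf(decode_avro, StringType())
decoded = df.select(decode_avro_udf(col("value")).alias("value_json"))
decoded.writeStream.outputMode("append").format("console").start().awaitTermination()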
I am reading data from Kafka with Spark Structured Streaming and want to include the Kafka timestamp in the message:
sparkSession.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "kafka-broker:10000")
.option("subscribe", "topicname")
.option("includeTimestamp", true)
.load()
.selectExpr("CAST(topic AS STRING)", "CAST(key AS STRING)", "CAST(value AS STRING)", "timestamp")
.as[(String, String, String, Long)]
When I then look at the timestamp it is truncated from milliseconds to seconds. Is there any way I can get the millisecond precision back after reading?
The truncation happens when the timestamp is read as a Long value. This happens in the last line of:
sparkSession.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "kafka-broker:10000")
.option("subscribe", "topicname")
.option("includeTimestamp", true)
.load()
.selectExpr("CAST(topic AS STRING)", "CAST(key AS STRING)", "CAST(value AS STRING)", "timestamp")
.as[(String, String, String, Long)]
It does not truncate when you change the last line to:
.as[(String, String, String, Timestamp)]
I just tried this quickly in IntelliJ with my local Kafka set-up.
If you are referring to the three dots at the end of the timestamp field as truncation (as in the output below):
Batch: 1
-------------------------------------------
+-----+----+--------+--------------------+
|topic| key| value| timestamp|
+-----+----+--------+--------------------+
| test|null|test-123|2018-10-07 03:10:...|
| test|null|test-234|2018-10-07 03:10:...|
+-----+----+--------+--------------------+
Then you just need to add the following line:
.option("truncate", false)
in your writeStream() portion, like this:
Dataset<Row> df = sparkSession
.readStream()
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "test")
.option("includeTimestamp", "true")
.load()
.selectExpr("CAST(topic AS STRING)", "CAST(key AS STRING)", "CAST(value AS STRING)", "CAST(timestamp as STRING)");
try {
df.writeStream()
.outputMode("append")
.format("console")
.option("truncate", false)
.start()
.awaitTermination();
} catch (StreamingQueryException e) {
e.printStackTrace();
}
This change gave me the full timestamp in the output:
Batch: 1
-------------------------------------------
+-----+----+--------+-----------------------+
|topic|key |value |timestamp |
+-----+----+--------+-----------------------+
|test |null|test-123|2018-10-07 03:19:50.677|
|test |null|test-234|2018-10-07 03:19:52.673|
+-----+----+--------+-----------------------+
I hope this helps.
Is it possible to have two separate ReadStreams in one app? I'm trying to listen to two separate Kafka topics and do calculations based on both DataFrames.
You could simply subscribe to multiple topics:
// Subscribe to multiple topics
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("subscribe", "topic1,topic2")
.load()
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.as[(String, String)]
Or, if you specifically want to use two separate readStream definitions within one app:
// read stream A
val dfA = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("subscribe", "topic1")
.load()
dfA.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.as[(String, String)]
// read stream B
val dfB = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("subscribe", "topic2")
.load()
dfB.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.as[(String, String)]
You should be able to achieve this by using join() in Spark 2.3.0:
val stream1 = spark.readStream. ...
val stream2 = spark.readStream. ...
val joinedDf = stream1.join(stream2, "join_column_id")
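For completeness, here is a minimal PySpark sketch of the two-readStreams-plus-join pattern, assuming each topic's value is a plain string that can be used directly as the join key (the brokers, topic names, and the join_column_id column are illustrative):
stream1 = (spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
    .option("subscribe", "topic1")
    .load()
    .selectExpr("CAST(value AS STRING) AS join_column_id", "timestamp AS ts1"))
stream2 = (spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
    .option("subscribe", "topic2")
    .load()
    .selectExpr("CAST(value AS STRING) AS join_column_id", "timestamp AS ts2"))
# Inner stream-stream join on the shared key; for long-running jobs, add watermarks
# and a time-range condition so Spark can clean up the buffered join state.
joined = stream1.join(stream2, "join_column_id")
query = (joined.writeStream
    .format("console")
    .option("truncate", False)
    .start())
query.awaitTermination()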