Can't Transform Kafka JSON Data in Spark Structured Streaming - apache-spark

I am trying to read Kafka messages and process them with Spark in standalone mode. Kafka stores the data in JSON format. I can read the Kafka messages but cannot parse the JSON data by defining a schema.
When I run the bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic my_kafka_topic --from-beginning command to see the messages in the Kafka topic, it outputs the following:
"{\"timestamp\":1553792312117,\"values\":[{\"id\":\"Simulation.Simulator.Temperature\",\"v\":21,\"q\":true,\"t\":1553792311686}]}"
"{\"timestamp\":1553792317117,\"values\":[{\"id\":\"Simulation.Simulator.Temperature\",\"v\":22,\"q\":true,\"t\":1553792316688}]}"
And I can read this data successfully with this code block in Spark:
df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "my_kafka_topic") \
    .load() \
    .select(col("value").cast("string"))
The schema is like this:
df.printSchema()
root
|-- value: string (nullable = true)
And when I write this dataframe to the console, it prints the Kafka messages:
Batch: 9
-------------------------------------------
+--------------------+
| value|
+--------------------+
|"{\"timestamp\":1...|
+--------------------+
But I want to parse the JSON data with a defined schema, and this is the code block I've tried:
schema = StructType([
    StructField("timestamp", LongType(), False),
    StructField("values", ArrayType(
        StructType([StructField("id", StringType(), True),
                    StructField("v", IntegerType(), False),
                    StructField("q", BooleanType(), False),
                    StructField("t", LongType(), False)]), True), True)])
parsed = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "my_kafka_topic") \
    .load() \
    .select(from_json(col("value").cast("string"), schema).alias("opc"))
And the schema of the parsed dataframe:
parsed.printSchema()
root
|-- opc: struct (nullable = true)
| |-- timestamp: string (nullable = true)
| |-- values: struct (nullable = true)
| | |-- id: string (nullable = true)
| | |-- v: integer (nullable = true)
| | |-- q: boolean (nullable = true)
| | |-- t: string (nullable = true)
These code blocks run without error. But when I write the parsed dataframe to the console:
query = parsed \
    .writeStream \
    .format("console") \
    .start()

query.awaitTermination()
it writes null to the console, like this:
+----+
| opc|
+----+
|null|
+----+
So it seems there is a problem with parsing the JSON data, but I can't figure out what it is.
Can you tell me what is wrong?

It seems that the schema was not correct for your case; please try to apply the following one:
schema = StructType([
    StructField("timestamp", LongType(), False),
    StructField("values", ArrayType(
        StructType([StructField("id", StringType(), True),
                    StructField("v", IntegerType(), False),
                    StructField("q", BooleanType(), False),
                    StructField("t", LongType(), False)]), True), True)])
Also remember that Spark can infer the schema from a sample of the data, so you could let Spark discover the schema once and save it for later use.
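For example, here is a minimal sketch of that idea (the file names are hypothetical assumptions): read a handful of raw messages from a static file, let Spark infer the schema, and save it so it can be reused for the stream:

import json
from pyspark.sql.types import StructType

# Let Spark infer the schema from a small static sample of messages
# ("sample_messages.json" is a hypothetical file with a few raw records).
inferred_schema = spark.read.json("sample_messages.json").schema

# Persist the inferred schema as JSON and restore it later for the stream.
with open("schema.json", "w") as f:
    f.write(inferred_schema.json())

with open("schema.json") as f:
    schema = StructType.fromJson(json.load(f))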
Another issue is that your JSON data has leading and trailing double quotes (") and also contains backslashes (\), which make it invalid JSON and prevent Spark from parsing the message.
In order to remove the invalid characters, your code should be modified as follows:
from pyspark.sql.functions import col, from_json, regexp_replace

parsed = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "my_kafka_topic") \
    .load() \
    .withColumn("value", regexp_replace(col("value").cast("string"), "\\\\", "")) \
    .withColumn("value", regexp_replace(col("value"), "^\"|\"$", "")) \
    .select(from_json(col("value"), schema).alias("opc"))
Now your output should be:
+------------------------------------------------------------------------------------------------------------------+
|value |
+------------------------------------------------------------------------------------------------------------------+
|{"timestamp":1553588718638,"values":[{"id":"Simulation.Simulator.Temperature","v":26,"q":true,"t":1553588717036}]}|
+------------------------------------------------------------------------------------------------------------------+
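Once the value parses correctly, the nested fields can be pulled out of the opc struct; here is a small sketch of that, assuming the parsed dataframe and schema above:

from pyspark.sql.functions import col, explode

# Flatten the struct: keep the timestamp and explode the values array into rows
flat = parsed \
    .select(col("opc.timestamp").alias("timestamp"),
            explode(col("opc.values")).alias("val")) \
    .select("timestamp", "val.id", "val.v", "val.q", "val.t")

query = flat.writeStream \
    .format("console") \
    .option("truncate", "false") \
    .start()
query.awaitTermination()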
Good luck!

Related

Null in output when reading/writing structured streams from multiple topics in Pyspark

I am consuming data as a structured stream from multiple topics with PySpark. I have a value column, and in that column I have a JSON-like string: {"message":"abc", "metrics":{"metric1":"abc", "metric2":123, "metric3":"01/01/2022 00:00:00"}}. I am extracting some specific keys (columns) from the metrics column this way:
value_schema = StructType([StructField("metrics", StringType(), True)])

topic1_schema = StructType([StructField("metrics1", StringType(), True),
                            ...
                            StructField("metricsN", StringType(), True)])

topic1_raw = spark \
    .readStream \
    ...
    .load() \
    .selectExpr("CAST(value AS STRING)") \
    .withColumn('value', from_json(col('value'), value_schema)) \
    .withColumn('metrics', from_json(col('value.metrics'), topic1_schema)) \
    .select(col('metrics.*'))

topic1_batch = topic1_raw \
    .writeStream \
    .outputMode("append") \
    .format("console") \
    .start()

topic2_schema = StructType([StructField("metricsA", StringType(), True),
                            ...
                            StructField("metricsZ", StringType(), True)])

topic2_raw = spark \
    .readStream \
    ...
    .load() \
    .selectExpr("CAST(value AS STRING)") \
    .withColumn('value', from_json(col('value'), value_schema)) \
    .withColumn('metrics', from_json(col('value.metrics'), topic2_schema)) \
    .select(col('metrics.*'))

topic2_batch = topic2_raw \
    .writeStream \
    .outputMode("append") \
    .format("console") \
    .start()

topic1_batch.awaitTermination()
topic2_batch.awaitTermination()
When I run the above code separately for each topic, I get my expected result, something like this:
+--------------------+-------+--------+---------+
| metric1|metric3| metric7| metricN|
+--------------------+-------+--------+---------+
|01/01/2022 00:00:...| abc...|12345678| 0|
+--------------------+-------+--------+---------+
+-------+--------+---------+
|metricA| metricB| metricC|
+-------+--------+---------+
| abc...|01/01/22| 123...|
+-------+--------+---------+
But when I run the whole code together, meaning streaming/writing data from all topics, some values are missing:
+--------------------+-------+--------+---------+
| metric1|metric3| metric7| metricN|
+--------------------+-------+--------+---------+
|01/01/2022 00:00:...| abc...|12345678| ***|
+--------------------+-------+--------+---------+
+-------+--------+---------+
|metricA| metricB| metricC|
+-------+--------+---------+
| abc...| null | 123...|
+-------+--------+---------+
What might be the possible reason for that and how to fix it?

Read data from Kafka and print to console with Spark Structured Streaming in Python

I have kafka_2.13-2.7.0 on Ubuntu 20.04. I run the Kafka server and ZooKeeper, then create a topic and send a text file into it via nc -lk 9999. The topic is full of data. Also, I have spark-3.0.1-bin-hadoop2.7 on my system. I want to use the Kafka topic as a source for Spark Structured Streaming with Python. My code is like this:
spark = SparkSession \
    .builder \
    .appName("APP") \
    .getOrCreate()

df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "sparktest") \
    .option("startingOffsets", "earliest") \
    .load()

df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
df.printSchema()
I run the above code via spark-submit with this command:
./spark-submit --packages org.apache.spark:spark-streaming-kafka-0-10_2.12:3.0.1,org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1 /home/spark/PycharmProjects/testSparkStream/KafkaToSpark.py
The code runs without any exception and I receive this output, just as shown on the Spark site:
root
|-- key: binary (nullable = true)
|-- value: binary (nullable = true)
|-- topic: string (nullable = true)
|-- partition: integer (nullable = true)
|-- offset: long (nullable = true)
|-- timestamp: timestamp (nullable = true)
|-- timestampType: integer (nullable = true)
My question is: the Kafka topic is full of data, but running the code produces no data in the output. Would you please guide me on what is wrong here?
The code as it is will not print out any data; it only prints the schema once.
You can follow the instructions given in the general Structured Streaming Guide and the Structured Streaming + Kafka Integration Guide to see how to print out data to the console. Remember that reading data in Spark is a lazy operation and nothing happens without an action (typically, starting a writeStream query).
If you complete the code as below, you should see the selected data (key and value) printed out to the console:
spark = SparkSession \
    .builder \
    .appName("APP") \
    .getOrCreate()

df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "sparktest") \
    .option("startingOffsets", "earliest") \
    .load()

query = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") \
    .writeStream \
    .format("console") \
    .option("checkpointLocation", "path/to/HDFS/dir") \
    .start()

query.awaitTermination()

Issue in writing records into MySQL from Spark Structured Streaming Dataframe

I am using the below code to write a Spark Streaming dataframe into a MySQL DB. Below are the Kafka topic JSON data format and the MySQL table schema. Column names and types match exactly.
But I am unable to see any records written to the MySQL table. The table is empty, with zero records. Please suggest.
Kafka Topic Data Format
ssingh#RENLTP2N073:/mnt/d/confluent-6.0.0/bin$ ./kafka-console-consumer --topic sarvtopic --from-beginning --bootstrap-server localhost:9092
{"id":1,"firstname":"James ","middlename":"","lastname":"Smith","dob_year":2018,"dob_month":1,"gender":"M","salary":3000}
{"id":2,"firstname":"Michael ","middlename":"Rose","lastname":"","dob_year":2010,"dob_month":3,"gender":"M","salary":4000}
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("SSKafka") \
    .getOrCreate()

dsraw = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "sarvtopic") \
    .option("startingOffsets", "earliest") \
    .load()

ds = dsraw.selectExpr("CAST(value AS STRING)")
dsraw.printSchema()

from pyspark.sql.types import StructField, StructType, StringType, LongType
from pyspark.sql.functions import *

custom_schema = StructType([
    StructField("id", LongType(), True),
    StructField("firstname", StringType(), True),
    StructField("middlename", StringType(), True),
    StructField("lastname", StringType(), True),
    StructField("dob_year", StringType(), True),
    StructField("dob_month", LongType(), True),
    StructField("gender", StringType(), True),
    StructField("salary", LongType(), True),
])

Person_details_df2 = ds \
    .select(from_json(col("value"), custom_schema).alias("Person_details"))
Person_details_df3 = Person_details_df2.select("Person_details.*")

from pyspark.sql import DataFrameWriter

def foreach_batch_function(df, epoch_id):
    Person_details_df3.write.jdbc(url='jdbc:mysql://172.16.23.27:30038/securedb', driver='com.mysql.jdbc.Driver', dbtable="sparkkafka", user='root', password='root$1234')
    pass

query = Person_details_df3.writeStream.trigger(processingTime='20 seconds').outputMode("append").foreachBatch(foreach_batch_function).start()
query
Out[14]: <pyspark.sql.streaming.StreamingQuery at 0x1fb25503b08>
MySQL table schema:
create table sparkkafka(
id int,
firstname VARCHAR(40) NOT NULL,
middlename VARCHAR(40) NOT NULL,
lastname VARCHAR(40) NOT NULL,
dob_year int(40) NOT NULL,
dob_month int(40) NOT NULL,
gender VARCHAR(40) NOT NULL,
salary int(40) NOT NULL,
PRIMARY KEY (id)
);
I presume Person_details_df3 is your streaming dataframe and your Spark version is above 2.4.0.
To use the foreachBatch API, write it as below:
db_target_properties = {"user": "xxxx", "password": "yyyyy"}

def foreach_batch_function(df, epoch_id):
    df.write.jdbc(url='jdbc:mysql://172.16.23.27:30038/securedb', table="sparkkafka", properties=db_target_properties)
    pass

query = Person_details_df3.writeStream.outputMode("append").foreachBatch(foreach_batch_function).start()
query.awaitTermination()
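As a side note, the MySQL JDBC driver still has to be available to the job, and the driver class can be passed explicitly in the connection properties. Here is a sketch reusing the values from the question (the mode and driver entries are assumptions about the intent, not part of the original answer):

db_target_properties = {
    "user": "xxxx",
    "password": "yyyyy",
    "driver": "com.mysql.jdbc.Driver",  # driver class taken from the question's original code
}

def foreach_batch_function(df, epoch_id):
    # Write the micro-batch dataframe passed into the function, not the streaming dataframe
    df.write.jdbc(url="jdbc:mysql://172.16.23.27:30038/securedb",
                  table="sparkkafka",
                  mode="append",
                  properties=db_target_properties)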

Pyspark : Problem with FloatType while writing data to parquet file

I have the following schema:
root
|-- A: string (nullable = true)
|-- B: float (nullable = true)
And when I apply the schema to the data, the dataframe values for the float column are populated incorrectly.
Original Data :-
("floadVal1", 0.404413386),
("floadVal2", 0.28563),
("floadVal3", 0.591290286),
("floadVal4", 0.404413386),
("floadVal5", 15.37610198),
("floadVal6", 15.261798303),
("floadVal7", 19.887814583),
("floadVal8", 0.0)
Please help me understand what exactly Spark is doing here to generate the output below.
Dataframe after applying the schema:
+---------+----------+
| A| B|
+---------+----------+
|floadVal1|0.40441337|
|floadVal2| 0.28563|
|floadVal3| 0.5912903|
|floadVal4|0.40441337|
|floadVal5| 15.376102|
|floadVal6| 15.261798|
|floadVal7| 19.887815|
|floadVal8| 0.0|
+---------+----------+
After writing to parquet :-
A B
0 floadVal1 0.404413
1 floadVal2 0.285630
2 floadVal3 0.591290
3 floadVal4 0.404413
4 floadVal5 15.376102
5 floadVal6 15.261798
6 floadVal7 19.887815
7 floadVal8 0.000000
As per the Spark 2.4.5 documentation:
FloatType: Represents 4-byte single-precision floating point numbers.
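That 4-byte single-precision limit (roughly 7 significant decimal digits) is enough to explain the rounded values. Here is a small sketch, independent of Spark, that round-trips one of the original literals through a 4-byte float:

import struct

original = 0.404413386
# Pack into a 4-byte IEEE 754 single-precision float and unpack it again
as_float32 = struct.unpack("f", struct.pack("f", original))[0]
print(as_float32)  # roughly 0.40441337..., matching the dataframe output above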
Sample Code
from pyspark.sql import SparkSession, SQLContext
from pyspark.sql.types import StructType, StructField, StringType, FloatType

spark = SparkSession.builder.master('local').config(
    "spark.sql.parquet.writeLegacyFormat", 'true').getOrCreate()
sc = spark.sparkContext
sqlContext = SQLContext(sc)

schema = StructType([
    StructField("A", StringType(), True),
    StructField("B", FloatType(), True)])

df = spark.createDataFrame([
    ("floadVal1", 0.404413386),
    ("floadVal2", 0.28563),
    ("floadVal3", 0.591290286),
    ("floadVal4", 0.404413386),
    ("floadVal5", 15.37610198),
    ("floadVal6", 15.261798303),
    ("floadVal7", 19.887814583),
    ("floadVal8", 0.0)
], schema)

df.printSchema()
df.show()
df.write.format("parquet").save('floatTestParFile')
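If the original precision needs to be preserved, one option (an assumption about the intent, not part of the question) is to declare the column with DoubleType, Spark's 8-byte double-precision type:

from pyspark.sql.types import StructType, StructField, StringType, DoubleType

schema = StructType([
    StructField("A", StringType(), True),
    StructField("B", DoubleType(), True)])  # 8-byte double keeps the original digits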

PySpark Structured Streaming data writing into Cassandra not populating data

I want to write Spark Structured Streaming data into Cassandra. My Spark version is 2.4.0.
My input source is Kafka with JSON data, so writing to the console works fine, but when I query Cassandra in cqlsh, no records are appended to the table. Can you tell me what is wrong?
schema = StructType() \
    .add("humidity", IntegerType(), True) \
    .add("time", TimestampType(), True) \
    .add("temperature", IntegerType(), True) \
    .add("ph", IntegerType(), True) \
    .add("sensor", StringType(), True) \
    .add("id", StringType(), True)

def writeToCassandra(writeDF, epochId):
    writeDF.write \
        .format("org.apache.spark.sql.cassandra") \
        .mode('append') \
        .options("spark.cassandra.connection.host", "cassnode1, cassnode2") \
        .options(table="sensor", keyspace="sensordb") \
        .save()

# Load json format to dataframe
df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "kafkanode") \
    .option("subscribe", "iot-data-sensor") \
    .load() \
    .select([
        get_json_object(col("value").cast("string"), "$.{}".format(c)).alias(c)
        for c in ["humidity", "time", "temperature", "ph", "sensor", "id"]])

df.writeStream \
    .foreachBatch(writeToCassandra) \
    .outputMode("update") \
    .start()
I had the same issue in PySpark. Try the steps below.
First, validate that it is connecting to Cassandra. For example, point to a table which does not exist and check whether it fails with "table not found".
Second, try writeStream as below (include the trigger and output mode before calling the Cassandra update):
df.writeStream \
    .trigger(processingTime="10 seconds") \
    .outputMode("update") \
    .foreachBatch(writeToCassandra) \
    .start()
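For reference, here is a minimal sketch of the batch writer itself, with the connection host configured on the SparkSession and the table/keyspace passed as writer options. This assumes the spark-cassandra-connector package is on the classpath; the app name is hypothetical and the other names are taken from the question:

from pyspark.sql import SparkSession

# Configure the Cassandra contact points on the session
spark = SparkSession.builder \
    .appName("CassandraSink") \
    .config("spark.cassandra.connection.host", "cassnode1,cassnode2") \
    .getOrCreate()

def writeToCassandra(writeDF, epochId):
    # Write each micro-batch to the sensordb.sensor table
    writeDF.write \
        .format("org.apache.spark.sql.cassandra") \
        .mode("append") \
        .options(table="sensor", keyspace="sensordb") \
        .save()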
