Mapping Kafka to a Spark DataFrame with a schema - apache-spark

I have an application which runs a query on Kafka topics with a specified schema.
Below is my code:
SparkSession spark = SparkSession.builder()
.appName("Spark-Kafka-Integration")
.config("spark.master", "local")
.getOrCreate();
Dataset<Row> df = spark
.readStream()
.format("kafka")
.option("kafka.bootstrap.servers", "abc:9092,bcs:9092")
.option("subscribe","topic")
.option("auto.offset.reset", "latest")
.option("checkpointLocation", "/tmp")
.load();
// Mapping it to the schema
Dataset<Row> ds2 = df.select( from_json(col("value").cast("string") , Kafkaschema).as("rows"),col("timestamp"));
ds2.createOrReplaceTempView("ds2");
// Making a Row having timestamp and the values
Dataset<Row> ds3 = spark.sql("select rows.* , timestamp from ds2 ");
ds3.createOrReplaceTempView("table");
Dataset<Row> result2 = spark.sql(query.getQuery());
This runs fine. I now have a view named table which has all the columns plus the timestamp, so I can run SQL such as: SELECT column1, column2 FROM table GROUP BY window(timestamp, '1 minutes'), column1, column2
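For reference, the windowed query above would be executed against the view roughly like this (a minimal sketch; count(*) is a hypothetical aggregate added so the GROUP BY returns something useful, and in streaming mode the result has to be written out with writeStream rather than shown):
Dataset<Row> windowed = spark.sql(
    "SELECT window(timestamp, '1 minutes') AS time_window, column1, column2, count(*) AS cnt " +
    "FROM table " +
    "GROUP BY window(timestamp, '1 minutes'), column1, column2");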
My question:
Is this an efficient way to do it? If I have multiple topics, i.e. .option("subscribe", "topic1,topic2,..."), then I have to create multiple DataFrames in order to run a join query on them. How should I handle the timestamp column in that case?
In the case of multiple topics I would have the following code:
Dataset<Row> df = spark
.readStream()
.format("kafka")
.option("kafka.bootstrap.servers", "abc:9092,bcs:9092")
.option("subscribe","topic1, topic2,....topicn")
.option("auto.offset.reset", "latest")
.option("checkpointLocation", "/tmp")
.load();
Dataset<Row> ds = df.select(from_json(col("value").cast("string"), Kafkaschema).as("rows"), col("timestamp")).where("topic = 'topic1'");
Dataset<Row> ds1 = df.select(from_json(col("value").cast("string"), Kafkaschema).as("rows"), col("timestamp")).where("topic = 'topic2'");
... and so on, doing the same for every other topic's DataFrame.
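One possible way to structure the multi-topic case (a sketch only, not necessarily the most efficient; schema1/schema2 and the payload fields id, amount and status are hypothetical): subscribe once, split the stream on the topic column, keep each branch's Kafka timestamp under a distinct name, add a watermark to each branch, and join with a time-range condition so Spark can eventually drop old state.
import static org.apache.spark.sql.functions.*;

// schema1 / schema2 are the StructTypes describing each topic's JSON payload (defined elsewhere)
Dataset<Row> t1 = df.filter(col("topic").equalTo("topic1"))
    .select(from_json(col("value").cast("string"), schema1).as("r1"), col("timestamp").as("ts1"))
    .selectExpr("r1.id AS id1", "r1.amount AS amount", "ts1")
    .withWatermark("ts1", "1 minute");

Dataset<Row> t2 = df.filter(col("topic").equalTo("topic2"))
    .select(from_json(col("value").cast("string"), schema2).as("r2"), col("timestamp").as("ts2"))
    .selectExpr("r2.id AS id2", "r2.status AS status", "ts2")
    .withWatermark("ts2", "1 minute");

// Time-bounded stream-stream join; distinct column names avoid ambiguity between the two branches
Dataset<Row> joined = t1.join(t2,
    expr("id1 = id2 AND ts2 >= ts1 AND ts2 <= ts1 + interval 1 minute"));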

Related

How to guarantee sequence of execution of multiple sinks in spark structured streaming

In my scenario, I have a Structured Streaming application which reads from Kafka and writes to HDFS and Kafka using 3 different sinks. The primary sink is the HDFS one and the others are secondary. I want the primary sink to run first and then the secondary sinks. All have a trigger time of 60 seconds. Is there a way to achieve that in Spark Structured Streaming? Here is the code snippet:
val spark = SparkSession
.builder
.master(StreamerConfig.sparkMaster)
.appName(StreamerConfig.sparkAppName)
.getOrCreate()
spark.conf.set("spark.sql.orc.impl", "native")
spark.conf.set("spark.streaming.stopGracefullyOnShutdown","true")
spark.conf.set("spark.sql.files.ignoreCorruptFiles","true")
spark.conf.set("spark.dynamicAllocation.enabled","true")
spark.conf.set("spark.shuffle.service.enabled","true")
val readData = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", StreamerConfig.kafkaBootstrapServer)
.option("subscribe", StreamerConfig.topicName)
.option("failOnDataLoss", false)
.option("startingOffsets", StreamerConfig.kafkaStartingOffset)
.option("maxOffsetsPerTrigger", StreamerConfig.maxOffsetsPerTrigger)
.load()
val deserializedRecords = StreamerUtils.deserializeAndMapData(readData,spark)
val streamingQuery = deserializedRecords.writeStream
.queryName(s"Persist data to hive table for ${StreamerConfig.topicName}")
.outputMode("append")
.format("orc")
.option("path",StreamerConfig.hdfsLandingPath)
.option("checkpointLocation",StreamerConfig.checkpointLocation)
.partitionBy("date","hour")
.option("truncate","false")
.trigger(Trigger.ProcessingTime(StreamerConfig.triggerTime))
.start()
deserializedRecords.select(to_json(struct("*")).alias("value"))
.writeStream
.format("kafka") // Local Testing - "console"
.option("topic", StreamerConfig.watermarkKafkaTopic)
.option("kafka.bootstrap.servers", StreamerConfig.kafkaBroker)
.option("checkpointLocation", StreamerConfig.phase1Checkpoints)
.trigger(Trigger.ProcessingTime(StreamerConfig.triggerTime))
.start()
deserializedRecords.select(to_json(struct("*")).alias("value"))
.writeStream
.format("kafka") // Local Testing - "console"
.option("topic", StreamerConfig.watermarkKafkaTopic)
.option("kafka.bootstrap.servers", StreamerConfig.kafkaBroker)
.option("checkpointLocation", StreamerConfig.phase2Checkpoints)
.trigger(Trigger.ProcessingTime(StreamerConfig.triggerTime))
.start()
PS: I am using Spark 2.3.2
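There is no built-in ordering guarantee between the three queries above: each .start() creates an independent streaming query with its own trigger. One workaround that is often suggested (it relies on foreachBatch, which only exists from Spark 2.4 on, so it would mean upgrading from 2.3.2) is to run a single query and write to the sinks sequentially inside foreachBatch. A rough sketch in Java, where deserializedRecords stands for the same dataset as above and the paths, broker and topic are hypothetical placeholders for the StreamerConfig values:
import org.apache.spark.api.java.function.VoidFunction2;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.streaming.Trigger;

deserializedRecords.writeStream()
    .trigger(Trigger.ProcessingTime("60 seconds"))
    .option("checkpointLocation", "/tmp/checkpoints/combined")        // hypothetical path
    .foreachBatch((VoidFunction2<Dataset<Row>, Long>) (batch, batchId) -> {
        batch.persist();
        // 1. Primary sink first: HDFS as ORC
        batch.write()
            .format("orc")
            .partitionBy("date", "hour")
            .mode("append")
            .save("/data/landing");                                   // hypothetical path
        // 2. Secondary sink(s) run only after the primary write has succeeded
        batch.selectExpr("to_json(struct(*)) AS value")
            .write()
            .format("kafka")
            .option("kafka.bootstrap.servers", "broker:9092")         // hypothetical broker
            .option("topic", "secondary-topic")                       // hypothetical topic
            .save();
        batch.unpersist();
    })
    .start();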

How to write data to Apache Iceberg tables using Spark SQL?

I am trying to familiarize myself with Apache Iceberg and I'm having some trouble understanding how to write some external data to a table using Spark SQL.
I have a file, one.csv, sitting in a directory, /data
my Iceberg catalog is configured to point to this directory, /warehouse
I want to write this one.csv to an Apache Iceberg table (preferably using Spark SQL)
Is it even possible to read external data using Spark SQL, and then write it to Iceberg tables? Do I have to use Scala or Python to do this? I've been through the Iceberg and Spark 3.0.1 documentation a bunch, but maybe I'm missing something.
Code Update
Here is some code that I hope will help
spark.conf.set("spark.sql.catalog.spark_catalog", "org.apache.iceberg.spark.SparkSessionCatalog")
spark.conf.set("spark.sql.catalog.spark_catalog.type", "hive")
spark.conf.set("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
spark.conf.set("spark.sql.catalog.local.type", "hadoop")
spark.conf.set("spark.sql.catalog.local.warehouse", "data/warehouse")
I have the data I need to use sitting in a directory /one/one.csv
How do I get it into an Iceberg table using Spark? Can all of this be done purely using SparkSQL?
spark.sql(
"""
CREATE or REPLACE TABLE local.db.one
USING iceberg
AS SELECT * FROM `/one/one.csv`
"""
)
Then the goal is I can work with this iceberg table directly for example:
select * from local.db.one
and this would give me all the content from the /one/one.csv file.
To use Spark SQL, read the file into a DataFrame, then register it as a temp view. The temp view can then be referenced in SQL:
var df = spark.read.format("csv").load("/data/one.csv");
df.createOrReplaceTempView("tempview");
spark.sql("CREATE or REPLACE TABLE local.db.one USING iceberg AS SELECT * FROM tempview");
To answer your other question, Scala or Python is not required; the above example is in Java.
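If you would rather stay purely in SQL, Spark can also query files directly, so something along these lines should work as well (a sketch, not tested against an Iceberg catalog; note that without header inference the CSV columns come through as _c0, _c1, ...):
spark.sql(
    "CREATE OR REPLACE TABLE local.db.one " +
    "USING iceberg " +
    "AS SELECT * FROM csv.`/data/one.csv`");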
val sparkConf = new SparkConf()
sparkConf.set("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
sparkConf.set("spark.sql.catalog.spark_catalog", "org.apache.iceberg.spark.SparkSessionCatalog")
sparkConf.set("spark.sql.catalog.spark_catalog.type", "hive")
sparkConf.set("spark.sql.catalog.hive_catalog", "org.apache.iceberg.spark.SparkCatalog")
sparkConf.set("spark.sql.catalog.hive_catalog.type", "hadoop")
sparkConf.set("spark.sql.catalog.hive_catalog.warehouse", "hdfs://host:port/user/hive/warehouse")
sparkConf.set("hive.metastore.uris", "thrift://host:19083")
sparkConf.set("spark.sql.catalog.hive_prod", " org.apache.iceberg.spark.SparkCatalog")
sparkConf.set("spark.sql.catalog.hive_prod.type", "hive")
sparkConf.set("spark.sql.catalog.hive_prod.uri", "thrift://host:19083")
sparkConf.set("hive.metastore.warehouse.dir", "hdfs://host:port/user/hive/warehouse")
val spark: SparkSession = SparkSession.builder()
.enableHiveSupport()
.config(sparkConf)
.master("yarn")
.appName("kafkaTableTest")
.getOrCreate()
spark.sql(
"""
|
|create table if not exists hive_catalog.icebergdb.kafkatest1(
| company_id int,
| event string,
| event_time timestamp,
| position_id int,
| user_id int
|)using iceberg
|PARTITIONED BY (days(event_time))
|""".stripMargin)
import spark.implicits._
val df: DataFrame = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "kafka_server")
.option("subscribe", "topic")
.option("startingOffsets", "latest")
.load()
//.selectExpr("cast (value as string)")
val value: DataFrame = df.selectExpr("CAST(value AS STRING)")
.as[String]
.map(data => {
val json_str: JSONObject = JSON.parseObject(data)
val company_id: Integer = json_str.getInteger("company_id")
val event: String = json_str.getString("event")
val event_time: String = json_str.getString("event_time")
val position_id: Integer = json_str.getInteger("position_id")
val user_id: Integer = json_str.getInteger("user_id")
(company_id, event, event_time, position_id, user_id)
})
.toDF("company_id", "event", "event_time", "position_id", "user_id")
value.createOrReplaceTempView("table")
spark.sql(
"""
|select
| company_id,
| event,
| to_timestamp(event_time,'yyyy-MM-dd HH:mm:ss') as event_time,
| position_id,
| user_id
|from table
|""".stripMargin)
.writeStream
.format("iceberg")
.outputMode("append")
.trigger(Trigger.ProcessingTime(1, TimeUnit.MINUTES))
.option("path","hive_catalog.icebergdb.kafkatest1") // tablePath: catalog.db.tableName
.option("checkpointLocation","hdfspath")
.start()
.awaitTermination()
This example reads data from Kafka and writes it to an Iceberg table.
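Once the stream is running, the table can be queried like any other (a small usage sketch in Java; the table name comes from the DDL above):
spark.sql("SELECT event, count(*) AS events FROM hive_catalog.icebergdb.kafkatest1 GROUP BY event").show();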

Performance of Spark structured streaming Stream-Stream join

I am trying out the stream-stream join feature of Spark Structured Streaming using Spark 2.4.0.
I am joining two simple sets of data just to observe the performance of the stream-stream join. I am currently running this on my local machine with just a few input records. I observe that it takes more than a couple of minutes to join data from the two streams and write the output to Kafka.
Here is what I have been trying:
val in1Df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", s"$kafkaHost:$kafkaPort")
.option("subscribe", config.getString("SparkStrucStreamingPoc.inTopic1"))
.load()
.select($"timestamp" as "timestamp1",$"value" cast "string" as "value1")
val in2Df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", s"$kafkaHost:$kafkaPort")
.option("subscribe", config.getString("SparkStrucStreamingPoc.inTopic2"))
.load()
.select($"timestamp" as "timestamp2", $"value" cast "string" as "value2")
val in1DfWithWatermark = in1Df
.select($"timestamp1",$"value1")
.withWatermark("timestamp1", "10 seconds")
val in2DfWithWatermark = in2Df
.select($"timestamp2",$"value2")
.withWatermark("timestamp2", "20 seconds")
val joinedDf = in1DfWithWatermark.join(in2DfWithWatermark,
expr(("""value1 = value2 AND
timestamp2 >= timestamp1 AND
timestamp2 <= timestamp1 + interval 1 minutes""")))
joinedDf.select(($"value1").alias("value"))
.writeStream
.format("kafka")
.option("topic", config.getString("SparkStrucStreamingPoc.outTopic"))
.option("kafka.bootstrap.servers", s"$kafkaHost:$kafkaPort")
.option("checkpointLocation", config.getString("SparkStrucStreamingPoc.checkpoint"))
.start()
.awaitTermination()
Has anyone else observed this kind of behavior? Does it usually take this long to join two streams?
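One thing worth checking when running this locally (a general observation about Spark defaults, not something stated in the question): stream-stream joins are stateful, and every micro-batch shuffles into spark.sql.shuffle.partitions state partitions, which defaults to 200. With only a handful of input records, scheduling 200 tiny tasks per stateful operator per batch can easily dominate the runtime. Lowering the value before the first run (it is baked into the checkpoint afterwards) is a common tuning step, for example:
SparkSession spark = SparkSession.builder()
    .appName("SparkStrucStreamingPoc")
    .master("local[*]")
    .config("spark.sql.shuffle.partitions", "8")   // hypothetical small value for local testing; default is 200
    .getOrCreate();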

Read data from a Kafka topic and aggregate using a Spark temp view?

I want to read data from a Kafka topic and create a Spark temp view to group by some columns.
+----+--------------------+
| key| value|
+----+--------------------+
|null|{"e":"trade","E":...|
|null|{"e":"trade","E":...|
|null|{"e":"trade","E":...|
But I am not able to aggregate the data from the temp view. Is the value column data stored as a String?
Dataset<Row> data = spark
.readStream()
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092,localhost:9093")
.option("subscribe", "data2-topic")
.option("startingOffsets", "latest")
.option ("group.id", "test")
.option("enable.auto.commit", "true")
.option("auto.commit.interval.ms", "1000")
.load();
data.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)");
data.createOrReplaceTempView("Tempdata");
data.show();
Dataset<Row> df2=spark.sql("SELECT e FROM Tempdata group by e");
df2.show();
Is the value column data stored as a String?
Yes, because you CAST(value AS STRING).
You'll want to use the from_json function, which will load the row into a proper DataFrame that you can search within.
See Databricks' blog on Structured Streaming on Kafka for some examples.
If the primary goal is just grouping of some fields, then KSQL might be an alternative.
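A minimal sketch of the from_json approach (in Java, matching the question's code; the schema fields e and E are hypothetical guesses based on the sample payload shown above). Note also that in the question's code the result of data.selectExpr(...) is discarded; transformations return a new Dataset and have to be assigned:
import static org.apache.spark.sql.functions.*;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Hypothetical schema covering two fields of the {"e":"trade","E":...} payload
StructType schema = new StructType()
    .add("e", DataTypes.StringType)
    .add("E", DataTypes.LongType);

Dataset<Row> parsed = data
    .select(from_json(col("value").cast("string"), schema).as("j"))
    .select("j.*");
parsed.createOrReplaceTempView("Tempdata");

// A streaming aggregation cannot be show()n directly; write it to a sink instead
spark.sql("SELECT e, count(*) AS cnt FROM Tempdata GROUP BY e")
    .writeStream()
    .outputMode("complete")
    .format("console")
    .start();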

Spark Streaming: Text data source supports only a single column

I am consuming Kafka data and then streaming the data to HDFS.
The data stored in Kafka topic trial is like:
hadoop
hive
hive
kafka
hive
However, when I submit my code, it returns:
Exception in thread "main"
org.apache.spark.sql.streaming.StreamingQueryException: Text data source supports only a single column, and you have 7 columns.;
=== Streaming Query ===
Identifier: [id = 2f3c7433-f511-49e6-bdcf-4275b1f1229a, runId = 9c0f7a35-118a-469c-990f-af00f55d95fb]
Current Committed Offsets: {KafkaSource[Subscribe[trial]]: {"trial":{"2":13,"1":13,"3":12,"0":13}}}
Current Available Offsets: {KafkaSource[Subscribe[trial]]: {"trial":{"2":13,"1":13,"3":12,"0":14}}}
My question is: as shown above, the data stored in Kafka comprises only ONE column, so why does the program say there are 7 columns?
Any help is appreciated.
My Spark Streaming code:
def main(args: Array[String]): Unit = {
val spark = SparkSession
.builder.master("local[4]")
.appName("SpeedTester")
.config("spark.driver.memory", "3g")
.getOrCreate()
val ds = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "192.168.95.20:9092")
.option("subscribe", "trial")
.option("startingOffsets" , "earliest")
.load()
.writeStream
.format("text")
.option("path", "hdfs://192.168.95.21:8022/tmp/streaming/fixed")
.option("checkpointLocation", "/tmp/checkpoint")
.start()
.awaitTermination()
}
That is explained in the Structured Streaming + Kafka Integration Guide:
Each row in the source has the following schema:
Column Type
key binary
value binary
topic string
partition int
offset long
timestamp long
timestampType int
That gives exactly seven columns. If you want to write only the payload (value), select it and cast it to string:
spark.readStream
...
.load()
.selectExpr("CAST(value as string)")
.writeStream
...
.awaitTermination()
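Applying that fix to the question's code gives roughly the following (rendered in Java as a sketch; all values are taken from the question):
public static void main(String[] args) throws Exception {
    SparkSession spark = SparkSession.builder()
        .master("local[4]")
        .appName("SpeedTester")
        .config("spark.driver.memory", "3g")
        .getOrCreate();

    spark.readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", "192.168.95.20:9092")
        .option("subscribe", "trial")
        .option("startingOffsets", "earliest")
        .load()
        .selectExpr("CAST(value AS STRING)")   // keep only the payload: one column instead of seven
        .writeStream()
        .format("text")
        .option("path", "hdfs://192.168.95.21:8022/tmp/streaming/fixed")
        .option("checkpointLocation", "/tmp/checkpoint")
        .start()
        .awaitTermination();
}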
