How does Structured Streaming execute separate streaming queries (in parallel or sequentially)? - apache-spark

I'm writing a test application that consumes messages from Kafka topics and then pushes data into S3 and into RDBMS tables (the flow is similar to the one presented here: https://databricks.com/blog/2017/04/26/processing-data-in-apache-kafka-with-structured-streaming-in-apache-spark-2-2.html). So I read data from Kafka and then:
every message should be saved to S3
some messages should be saved to table A in an external database (based on a filter condition)
some other messages should be saved to table B in an external database (another filter condition)
So I have something like:
Dataset<Row> df = spark
    .readStream()
    .format("kafka")
    .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
    .option("subscribe", "topic1,topic2,topic3")
    .option("startingOffsets", "earliest")
    .load()
    .select(from_json(col("value").cast("string"), schema, jsonOptions).alias("parsed_value"));
(please notice that I'm reading from more than one Kafka topic).
Next I define required datasets:
Dataset<Row> allMessages = df.select(.....)
Dataset<Row> messagesOfType1 = df.select() //some unique conditions applied on JSON elements
Dataset<Row> messagesOfType2 = df.select() //some other unique conditions
and now for each Dataset I create query to start processing:
StreamingQuery s3Query = allMessages
    .writeStream()
    .format("parquet")
    .option("startingOffsets", "latest")
    .option("path", "s3_location")
    .start();

StreamingQuery firstQuery = messagesOfType1
    .writeStream()
    .foreach(new CustomForEachWiriterType1()) // class that extends ForeachWriter[T] and saves data into an external RDBMS table
    .start();

StreamingQuery secondQuery = messagesOfType2
    .writeStream()
    .foreach(new CustomForEachWiriterType2()) // class that extends ForeachWriter[T] and saves data into an external RDBMS table (possibly even a different database than before)
    .start();
Now I'm wondering:
Will those queries be executed in parallel, or one after another in FIFO order (in which case should I assign them to separate scheduler pools)?

Will those queries be executed in parallel
Yes. These queries are going to be executed in parallel (on every trigger, which you did not specify, so they run as often as possible).
Internally, when you execute start on a DataStreamWriter, you create a StreamExecution that in turn immediately creates the so-called daemon microBatchThread (quoted from the Spark source code below):
val microBatchThread =
  new StreamExecutionThread(s"stream execution thread for $prettyIdString") {
    override def run(): Unit = {
      // To fix call site like "run at <unknown>:0", we bridge the call site from the caller
      // thread to this micro batch thread
      sparkSession.sparkContext.setCallSite(callSite)
      runBatches()
    }
  }
So every query runs in its own daemon thread with the name:
stream execution thread for [prettyIdString]
You can check the separate threads using jstack or jconsole.
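You can also observe this from the application itself by asking the StreamingQueryManager once all start() calls have returned. A minimal sketch in Scala (the question uses Java, but the API is analogous):
// Every start() returns immediately; each query then runs concurrently
// on its own "stream execution thread for ..." daemon thread.
spark.streams.active.foreach { q =>
  println(s"query id=${q.id}, isActive=${q.isActive}")
}
// Block the driver until any of the concurrently running queries terminates.
spark.streams.awaitAnyTermination()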

Related

Sink from Delta Live Table to Kafka, initial sink works, but any subsequent updates fail

I have a DLT pipeline that ingests a topic from my Kafka stream and transforms it into a DLT table; I then wish to write that table back into Kafka under a new topic.
So far I have this working, however it only works on the first load of the table; any subsequent updates will crash my read and write streams.
My DLT table updates correctly, so I see updates from my pipeline in the Gold table:
CREATE OR REFRESH LIVE TABLE deal_gold1
TBLPROPERTIES ("quality" = "gold")
COMMENT "Gold Deals"
AS SELECT
documentId,
eventTimestamp,
substring(fullDocument.owner_id, 11, 24) as owner_id,
fullDocument.owner_type as owner_type,
substring(fullDocument.account_id, 11, 24) as account_id,
substring(fullDocument.manager_account_id, 11, 24) as manager_account_id,
fullDocument.hubspot_deal_id as hubspot_deal_id,
fullDocument.stage as stage,
fullDocument.status as status,
fullDocument.title as title
FROM LIVE.deal_bronze_cleansed
but then when I try to read from it via a separate notebook, these updates cause it to crash
import pyspark.sql.functions as fn
from pyspark.sql.types import StringType
# this one is the problem not the write stream
df = spark.readStream.format("delta").table("deal_stream_test.deal_gold1")
display(df)
writeStream = (
    df
    .selectExpr("CAST(documentId AS STRING) AS key", "to_json(struct(*)) AS value")
    .writeStream
    .format("kafka")
    .outputMode("append")
    .option("ignoreChanges", "true")
    .option("checkpointLocation", "/tmp/benperram21/checkpoint")
    .option("kafka.bootstrap.servers", confluentBootstrapServers)
    .option("ignoreChanges", "true")
    .option("kafka.security.protocol", "SASL_SSL")
    .option("kafka.sasl.jaas.config", "kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required username='{}' password='{}';".format(confluentApiKey, confluentSecret))
    .option("kafka.ssl.endpoint.identification.algorithm", "https")
    .option("kafka.sasl.mechanism", "PLAIN")
    .option("topic", confluentTopicName)
    .start()
)
I was looking around and can see this might be a result of it not being read as "Append". Any thoughts on this? Everything works except for the updates.
Right now DLT doesn't support output to arbitrary sinks. Also, all Spark operations should be done inside the nodes of the execution graph (functions annotated with the dlt.table or dlt.view decorators).
Right now the workaround would be to run that notebook outside of the DLT pipeline, as a separate task in a multitask job (workflow).
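A minimal sketch of such a standalone task (written here in Scala and assuming the same table name, checkpoint path and Kafka variables as in the question; the SASL/security options are omitted for brevity):
// Stream from the published gold table and write to Kafka, outside the DLT pipeline.
val gold = spark.readStream
  .format("delta")
  // read-side Delta option for tolerating update/rewrite commits in the source table
  // (in the question it was set on the Kafka writer instead)
  .option("ignoreChanges", "true")
  .table("deal_stream_test.deal_gold1")

gold
  .selectExpr("CAST(documentId AS STRING) AS key", "to_json(struct(*)) AS value")
  .writeStream
  .format("kafka")
  .outputMode("append")
  .option("checkpointLocation", "/tmp/benperram21/checkpoint")
  .option("kafka.bootstrap.servers", confluentBootstrapServers)
  .option("topic", confluentTopicName)
  .start()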

Change filter/where condition when restarting a Structured Streaming query reading data from Delta Table

In Structured Streaming, will the checkpoints keep track of which data has already been processed from a Delta Table?
def fetch_data_streaming(source_table: str):
    print("Fetching now")
    streamingInputDF = (
        spark
        .readStream
        .format("delta")
        .option("maxBytesPerTrigger", 1024)
        .table(source_table)
        .where("measurementId IN (1351,1350)")
        .where("year >= '2021'")
    )
    query = (
        streamingInputDF
        .writeStream
        .outputMode("append")
        .option("checkpointLocation", "/streaming_checkpoints/5")
        .foreachBatch(customWriter)
        .start()
        .awaitTermination()
    )
    return query

def customWriter(batchDF, batchId):
    print(batchId)
    print(batchDF.count())
    batchDF.show(10)
    length = batchDF.count()
    print("batchId, batch size:", batchId, length)
If I change the where clause in streamingInputDF to add more measurementIds, the Structured Streaming job doesn't always acknowledge the change and fetch the new data values. It continues to run as if nothing has changed, whereas at times it starts fetching new values.
Isn't the checkpoint supposed to identify the change?
Edit: Schema of delta table:
col_name        data_type
measurementId   int
year            int
time            timestamp
q               smallint
v               string
"In structured streaming, will the checkpoints will keep track of which data has already been processed?"
Yes, the Structured Streaming job will store the read version of the Delta table in its checkpoint files to avoid producing duplicates.
Within the checkpoint directory in the folder "offsets", you will see that Spark stored the progress per batchId. For example it will look like below:
v1
{"batchWatermarkMs":0,"batchTimestampMs":1619695775288,"conf":[...]}
{"sourceVersion":1,"reservoirId":"d910a260-6aa2-4a7c-9f5c-1be3164127c0","reservoirVersion":2,"index":2,"isStartingVersion":true}
Here, the important part is the "reservoirVersion":2 which tells you that the streaming job has consumed all data from the Delta Table as of version 2.
If you restart your Structured Streaming query with an additional filter condition, that filter will therefore not be applied to historic records, but only to records that were added to the Delta Table after version 2.
To see this behavior in action, you can use the code below and analyse the content of the checkpoint files.
val deltaPath = "file:///tmp/delta/table"
val checkpointLocation = "file:///tmp/checkpoint/"

// run the following two lines once
val deltaDf = Seq(("1", "foo1"), ("2", "foo2"), ("3", "foo2")).toDF("id", "value")
deltaDf.write.format("delta").mode("append").save(deltaPath)

// run this code for the first time, then add a filter condition, then run again
val query = spark.readStream
  .format("delta")
  .load(deltaPath)
  .filter(col("id").isin("1")) // in the second run add "2"
  .writeStream
  .format("console")
  .outputMode("append")
  .option("checkpointLocation", checkpointLocation)
  .start()

query.awaitTermination()
Now, if you append some more data to the Delta table while the streaming query is shut down and then restart it with the new filter condition, the filter will be applied to the new data.
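For completeness, a sketch of that second run, continuing the example above (the appended values are just illustrative):
// While the query is stopped, append a few new rows to the Delta table ...
val moreData = Seq(("2", "bar2"), ("4", "bar4")).toDF("id", "value")
moreData.write.format("delta").mode("append").save(deltaPath)

// ... then restart the stream with the broadened filter and the SAME checkpoint.
// Only the newly appended rows are read; the historic row with id = "2" is not
// re-emitted because its Delta version is already recorded in the checkpoint.
val restartedQuery = spark.readStream
  .format("delta")
  .load(deltaPath)
  .filter(col("id").isin("1", "2")) // filter broadened on restart
  .writeStream
  .format("console")
  .outputMode("append")
  .option("checkpointLocation", checkpointLocation)
  .start()

restartedQuery.awaitTermination()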

Why is a new batch triggered without getting any new offsets from the streaming source?

I have multiple Spark Structured Streaming jobs, and the usual behaviour I see is that a new batch is triggered only when there are new offsets in Kafka, which is used as the source for the streaming query.
But when I run this example, which demonstrates arbitrary stateful operations using mapGroupsWithState, I see that a new batch is triggered even if there is no new data in the streaming source. Why is that, and can it be avoided?
Update-1
I modified the example code above and removed the state-related operations like updating/removing it. The function simply outputs zero. But a batch is still triggered every 10 seconds without any new data on the netcat server.
import java.sql.Timestamp

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming._

object Stateful {

  def main(args: Array[String]): Unit = {
    val host = "localhost"
    val port = "9999"

    val spark = SparkSession
      .builder
      .appName("StructuredSessionization")
      .master("local[2]")
      .getOrCreate()

    import spark.implicits._

    // Create DataFrame representing the stream of input lines from connection to host:port
    val lines = spark.readStream
      .format("socket")
      .option("host", host)
      .option("port", port)
      .option("includeTimestamp", true)
      .load()

    // Split the lines into words, treat words as sessionId of events
    val events = lines
      .as[(String, Timestamp)]
      .flatMap { case (line, timestamp) =>
        line.split(" ").map(word => Event(sessionId = word, timestamp))
      }

    val sessionUpdates = events
      .groupByKey(event => event.sessionId)
      .mapGroupsWithState[SessionInfo, Int](GroupStateTimeout.ProcessingTimeTimeout) {
        case (sessionId: String, events: Iterator[Event], state: GroupState[SessionInfo]) =>
          0
      }

    val query = sessionUpdates
      .writeStream
      .outputMode("update")
      .trigger(Trigger.ProcessingTime("10 seconds"))
      .format("console")
      .start()

    query.awaitTermination()
  }
}

case class Event(sessionId: String, timestamp: Timestamp)

case class SessionInfo(
  numEvents: Int,
  startTimestampMs: Long,
  endTimestampMs: Long)
The reason for the empty batches showing up is the usage of timeouts within the mapGroupsWithState call.
According to the book "Learning Spark 2.0":
"The next micro-batch will call the function on this timed-out key even if there is no data for that key in that micro-batch. [...] Since the timeouts are processed during the micro-batches, the timing of their execution is imprecise and depends heavily on the trigger interval [...]."
As you have set the timeout to GroupStateTimeout.ProcessingTimeTimeout, it aligns with the query's trigger interval of 10 seconds. The alternative would be to set a timeout based on event time (i.e. GroupStateTimeout.EventTimeTimeout).
The ScalaDocs on GroupState provide some more details:
When the timeout occurs for a group, the function is called for that group with no values, and GroupState.hasTimedOut() set to true.
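If the empty batches are undesirable and you do not actually rely on timeouts (as in the stripped-down example above, where the state function just returns 0), one way to avoid them is to declare no timeout at all. A sketch based on the question's code:
// With NoTimeout the state function is only invoked for keys that received
// data in the current micro-batch, so no batches are triggered purely to
// process timed-out state. Note that state then never expires on its own.
val sessionUpdates = events
  .groupByKey(event => event.sessionId)
  .mapGroupsWithState[SessionInfo, Int](GroupStateTimeout.NoTimeout) {
    case (sessionId: String, events: Iterator[Event], state: GroupState[SessionInfo]) =>
      0
  }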

How to enrich data of a streaming query and write the result to Elasticsearch?

For a given dataset (originalData) I'm required to map the values and then prepare a new dataset combining the search results from elasticsearch.
Dataset<Row> orignalData = spark
    .readStream()
    .format("kafka")
    .option("kafka.bootstrap.servers", "test")
    .option("subscribe", "test")
    .option("startingOffsets", "latest")
    .load();

Dataset<Row> esData = JavaEsSparkSQL
    .esDF(spark.sqlContext(), "spark_correlation/doc");
esData.createOrReplaceTempView("es_correlation");

List<SGEvent> listSGEvent = new ArrayList<>();

orignalData.foreach((ForeachFunction<Row>) row -> {
    SGEvent event = new SGEvent();
    String sourceKey = row.get(4).toString();
    String searchQuery = "select id from es_correlation where es_correlation.key='" + sourceKey + "'";
    Dataset<Row> result = spark.sqlContext().sql(searchQuery);
    String id = null;
    if (result != null) {
        result.show();
        id = result.first().toString();
    }
    event.setId(id);
    event.setKey(sourceKey);
    listSGEvent.add(event);
});

Encoder<SGEvent> eventEncoderSG = Encoders.bean(SGEvent.class);
Dataset<Row> finalData = spark.createDataset(listSGEvent, eventEncoderSG).toDF();

finalData
    .writeStream()
    .outputMode(OutputMode.Append())
    .format("org.elasticsearch.spark.sql")
    .option("es.mapping.id", "id")
    .option("es.write.operation", "upsert")
    .option("checkpointLocation", "/tmp/checkpoint/sg_event")
    .start("spark_index/doc").awaitTermination();
Spark throws the following exception:
org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:389)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:38)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:36)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
at scala.collection.immutable.List.foreach(List.scala:392)
Is my approach of combining Elasticsearch values with the Dataset valid? Is there a better solution for this?
There are a couple of issues here.
As the exception says, orignalData is a streaming query (a streaming Dataset) and the only way to execute it is to use writeStream.start(). That's one issue.
You did call writeStream.start(), but on another query, finalData, which is not streaming but batch. That's another issue.
For "enrichment" cases like yours, you can use a streaming join (the Dataset.join operator) or one of DataStreamWriter.foreach and DataStreamWriter.foreachBatch. I think DataStreamWriter.foreachBatch would be more efficient.
public DataStreamWriter<T> foreachBatch(VoidFunction2<Dataset<T>,Long> function)
(Java-specific) Sets the output of the streaming query to be processed using the provided function. This is supported only in the micro-batch execution modes (that is, when the trigger is not continuous). In every micro-batch, the provided function will be called with (i) the output rows as a Dataset and (ii) the batch identifier. The batchId can be used to deduplicate and transactionally write the output (that is, the provided Dataset) to external systems. The output Dataset is guaranteed to be exactly the same for the same batchId (assuming all operations are deterministic in the query).
Not only would you get all the data of a streaming micro-batch in one shot (the first input argument of type Dataset<T>), but also a way to submit another Spark job (across executors) based on the data.
The pseudo-code could look as follows (I'm using Scala as I'm more comfortable with the language):
val dsWriter = originalData.writeStream.foreachBatch { case (data, batchId) =>
  // make sure the data is small enough to collect on the driver
  // Otherwise expect OOME
  // It'd also be nice to have a Java bean to convert the rows to proper types and names
  val localData = data.collect

  // Please note that localData is no longer Spark's Dataset
  // It's a local Java collection

  // Use Java Collection API to work with the localData
  // e.g. using Scala
  // You're mapping over localData (for a single micro-batch)
  // And creating finalData
  // I'm using the same names as your code to be as close to your initial idea as possible
  val finalData = localData.map { row =>
    // row is the old row from your original code
    // do something with it
    // e.g. using Java
    String sourceKey=row.get(4).toString();
    ...
  }

  // Time to save the data processed to ES
  // finalData is a local Java/Scala collection not Spark's DataFrame!
  // Let's convert it to a DataFrame (and leverage the Spark distributed platform)

  // Note that I'm almost using your code, but it's a batch query not a streaming one
  // We're inside foreachBatch
  finalData
    .toDF // Convert a local collection to a Spark DataFrame
    .write // this creates a batch query
    .format("org.elasticsearch.spark.sql")
    .option("es.mapping.id", "id")
    .option("es.write.operation", "upsert")
    .option("checkpointLocation", "/tmp/checkpoint/sg_event")
    .save("spark_index/doc") // save (not start) as it's a batch query inside a streaming query
}
dsWriter is a DataStreamWriter, and you can now start it to run the streaming query.
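For completeness, a sketch of starting it (the checkpoint location is taken from the original code):
// Start the foreachBatch-based streaming query. The checkpoint location belongs
// on the streaming writer, not on the batch write inside foreachBatch.
val query = dsWriter
  .option("checkpointLocation", "/tmp/checkpoint/sg_event")
  .start()

query.awaitTermination()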
I was able to achieve the actual solution by using SQL joins.
Please refer to the code below.
Dataset<Row> orignalData = spark
    .readStream()
    .format("kafka")
    .option("kafka.bootstrap.servers", "test")
    .option("subscribe", "test")
    .option("startingOffsets", "latest")
    .load();
orignalData.createOrReplaceTempView("stream_data");

Dataset<Row> esData = JavaEsSparkSQL
    .esDF(spark.sqlContext(), "spark_correlation/doc");
esData.createOrReplaceTempView("es_correlation");

Dataset<Row> joinedData = spark.sqlContext().sql("select * from stream_data,es_correlation where es_correlation.key=stream_data.key");
// Or
/* By using the Dataset join operator
Dataset<Row> joinedData = orignalData.join(esFirst, "key");
*/

Encoder<SGEvent> eventEncoderSG = Encoders.bean(SGEvent.class);
Dataset<SGEvent> finalData = joinedData.map((MapFunction<Row, SGEvent>) row -> {
    SGEvent event = new SGEvent();
    event.setId(row.get(0).toString());
    event.setKey(row.get(3).toString());
    return event;
}, eventEncoderSG);

finalData
    .writeStream()
    .outputMode(OutputMode.Append())
    .format("org.elasticsearch.spark.sql")
    .option("es.mapping.id", "id")
    .option("es.write.operation", "upsert")
    .option("checkpointLocation", "/tmp/checkpoint/sg_event")
    .start("spark_index/doc").awaitTermination();

Split dataset based on column value

I have a Dataset<Row> which is the result of a Kafka readStream, as shown in the Java code snippet below.
m_oKafkaEvents = getSparkSession().readStream().format("kafka")
    .option("kafka.bootstrap.servers", strKafkaAddress)
    .option("subscribe", getInsightEvent().getTopic())
    .option("maxOffsetsPerTrigger", "100000")
    .option("startingOffsets", "latest")
    .option("failOnDataLoss", false)
    .load()
    .select(functions.from_json(functions.col("value").cast("string"), oSchema).as("events"))
    .select("events.*");
m_oKafkaEvents
{
{"EventTime":"1527005246864000000","InstanceID":"231","Model":"Opportunity_1","Milestone":"OrderProcessed"},
{"EventTime":"1527005246864000002","InstanceID":"232","Model":"Opportunity_2","Milestone":"OrderProcessed"},
{"EventTime":"1527005246864000001","InstanceID":"233","Model":"Opportunity_1","Milestone":"OrderProcessed"},
{"EventTime":"1527005246864000002","InstanceID":"234","Model":"Opportunity_2","Milestone":"OrderProcessed"}
}
I need to split this dataset based on the column "Model", which would result in two Datasets as below:
m_oKafkaEvents_for_Opportunity_1_topic
{
{"EventTime":"1527005246864000000","InstanceID":"231","Model":"Opportunity_1","Milestone":"OrderProcessed"},
{"EventTime":"1527005246864000001","InstanceID":"233","Model":"Opportunity_1","Milestone":"OrderProcessed"}
}
m_oKafkaEvents_for_Opportunity_2_topic
{
{"EventTime":"1527005246864000002","InstanceID":"232","Model":"Opportunity_2","Milestone":"OrderProcessed"},
{"EventTime":"1527005246864000002","InstanceID":"234","Model":"Opportunity_2","Milestone":"OrderProcessed"}
}
These Datasets would be published into a Kafka sink. The topic name would be the model value, i.e. Opportunity_1 and Opportunity_2.
Hence I need a handle on the column "Model" value and the respective list of events.
Since I am new to Spark, I am looking for help on how this can be achieved via Java code.
Appreciate any help.
The simplest solution would look like:
allEvents.selectExpr("CONCAT('m_oKafkaEvents_for_', Model, '_topic') AS topic", "to_json(struct(*)) AS value")
    .write()
    .format("kafka")
    .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
    .save();
You can see an example here: https://spark.apache.org/docs/2.2.0/structured-streaming-kafka-integration.html#writing-the-output-of-batch-queries-to-kafka. But after looking at Spark's code, it seems that we can have only one topic per write, i.e. it will choose as the topic the first encountered row:
def write(
    sparkSession: SparkSession,
    queryExecution: QueryExecution,
    kafkaParameters: ju.Map[String, Object],
    topic: Option[String] = None): Unit = {
  val schema = queryExecution.analyzed.output
  validateQuery(schema, kafkaParameters, topic)
  queryExecution.toRdd.foreachPartition { iter =>
    val writeTask = new KafkaWriteTask(kafkaParameters, schema, topic)
    Utils.tryWithSafeFinally(block = writeTask.execute(iter))(
      finallyBlock = writeTask.close())
  }
}
You can try this approach, though, and report here whether it works as described above. If it doesn't work, you have alternative solutions, such as:
Cache main DataFrame and create 2 other DataFrames, filtered by Model attribute
Use foreachPartition and Kafka writer to send the messages without splitting the main dataset
The first solution is pretty easy to implement, and you use all the Spark facilities to do it. On the other hand, at least theoretically, splitting the dataset should be slightly slower than the second proposal. But try to measure before choosing one option or the other; maybe the difference will be really small, and it's always better to use a clear, community-approved approach.
Below you can find some code showing both situations:
SparkSession spark = SparkSession
    .builder()
    .appName("JavaStructuredNetworkWordCount")
    .getOrCreate();

Dataset<Row> allEvents = spark.readStream().format("kafka")
    .option("kafka.bootstrap.servers", "")
    .option("subscribe", "event")
    .option("maxOffsetsPerTrigger", "100000")
    .option("startingOffsets", "latest")
    .option("failOnDataLoss", false)
    .load()
    // pass your real schema (e.g. oSchema from the question) instead of null
    .select(functions.from_json(functions.col("value").cast("string"), null).as("events"))
    .select("events.*");

// First solution
// (for a streaming Dataset you would use writeStream()...start() instead of write()...save())
Dataset<Row> opportunity1Events = allEvents.filter("Model = 'Opportunity_1'");
opportunity1Events.write().format("kafka").option("kafka.bootstrap.servers", "")
    .option("topic", "m_oKafkaEvents_for_Opportunity_1_topic").save();
Dataset<Row> opportunity2Events = allEvents.filter("Model = 'Opportunity_2'");
opportunity2Events.write().format("kafka").option("kafka.bootstrap.servers", "")
    .option("topic", "m_oKafkaEvents_for_Opportunity_2_topic").save();
// Note: Kafka writer was added in 2.2.0 https://github.com/apache/spark/commit/b0a5cd89097c563e9949d8cfcf84d18b03b8d24c

// Another approach: iterate over the messages accumulated within each partition
allEvents.foreachPartition(new ForeachPartitionFunction<Row>() {
    private KafkaProducer<String, Row> localProducer = new KafkaProducer<>(new HashMap<>());
    private final Map<String, String> modelsToTopics = new HashMap<>();
    {
        modelsToTopics.put("Opportunity_1", "m_oKafkaEvents_for_Opportunity_1_topic");
        modelsToTopics.put("Opportunity_2", "m_oKafkaEvents_for_Opportunity_2_topic");
    }

    @Override
    public void call(Iterator<Row> rows) throws Exception {
        // If your message is Opportunity_1 => send it to the Opportunity_1 topic,
        // otherwise it goes to the Opportunity_2 topic
        while (rows.hasNext()) {
            Row currentRow = rows.next();
            // you can reformat your row here or directly in Spark's map transformation
            localProducer.send(new ProducerRecord<>(modelsToTopics.get(currentRow.getAs("Model")),
                "some_message_key", currentRow));
        }
        // KafkaProducer accumulates messages in an in-memory buffer and sends them when a threshold is reached
        // Flush them synchronously here to be sure that every stored message was correctly delivered
        // You can also play with features added in Kafka 0.11: the idempotent producer and the transactional producer
        localProducer.flush();
    }
});
