Spark Structured Streaming - update data frame's schema on the fly

I have a simple structured streaming job which monitors a directory for CSV files and writes parquet files - no transformation in between.
The job starts by building a data frame by reading CSV files with readStream(), using a schema that I get from a function called buildSchema(). Here is the code:
var df = spark
  .readStream
  .option("sep", "|")
  .option("header", "true")
  .schema(buildSchema(spark, table_name).get) // buildSchema() gets the schema for me
  .csv(input_base_dir + table_name + "*")

logger.info(" new batch indicator")

if (df.schema != buildSchema(spark, table_name).get) {
  df = spark.sqlContext.createDataFrame(df.collectAsList(), buildSchema(spark, table_name).get)
}

val query =
  df.writeStream
    .format("parquet")
    .queryName("convertCSVtoParquet for table " + table_name)
    .option("path", output_base_dir + table_name + "/")
    .trigger(ProcessingTime(60.seconds))
    .start()
The job runs fine, but my question is: I'd like to always use the latest schema to build my data frame, or in other words, to read the CSV files with. While buildSchema() can get me the latest schema, I'm not sure how to call it periodically (or once per CSV file) and then use the latest schema to somehow re-generate or modify the data frame.
When testing, my observation is that only the query object runs continuously, batch after batch; the log statement that I put in, and the if() statement for the schema comparison, only ran once at the beginning of the application.
Can the schema of a data frame in a structured streaming job be modified after query.start() is called? What would you suggest as a good workaround if we cannot change the schema of a data frame?
Thanks in advance.

You can use the foreachBatch method to load the latest schema periodically and then compare it to the schema of the concrete micro-batch DataFrame.
Example:
var streamingDF = spark
  .readStream
  .option("sep", "|")
  .option("header", "true")
  .schema(buildSchema(spark, table_name).get) // buildSchema() gets the schema for me
  .csv(input_base_dir + table_name + "*")

val query =
  streamingDF
    .writeStream
    .foreachBatch((ds, i) => {
      logger.info(s"New batch indicator(${i})")
      val batchDf =
        if (ds.schema != buildSchema(spark, table_name).get) {
          spark.sqlContext.createDataFrame(ds.collectAsList(), buildSchema(spark, table_name).get)
        } else {
          ds
        }
      // append so subsequent micro-batches don't fail on an already-existing output path
      batchDf.write.mode("append").parquet(output_base_dir + table_name + "/")
    })
    .trigger(ProcessingTime(60.seconds))
    .start()
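A follow-up note on the sketch above: since foreachBatch takes over the sink, it can be worth giving the query an explicit checkpoint location so it recovers its progress after a restart. A minimal variant of the same writer, with the checkpoint path chosen purely for illustration:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.streaming.Trigger

// same idea as above, written as an explicitly typed function plus a checkpoint location
val writeBatch: (DataFrame, Long) => Unit = (ds, batchId) => {
  logger.info(s"New batch indicator($batchId)")
  // append so repeated micro-batches keep adding files to the same output directory
  ds.write.mode("append").parquet(output_base_dir + table_name + "/")
}

val query =
  streamingDF
    .writeStream
    .foreachBatch(writeBatch)
    .option("checkpointLocation", output_base_dir + table_name + "/_checkpoint") // illustrative path
    .trigger(Trigger.ProcessingTime("60 seconds"))
    .start()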

Related

Spark always broadcasts tables greater than spark.sql.autoBroadcastJoinThreshold when performing streaming merge on DeltaTable sink

I am trying to do a streaming merge between Delta tables using this guide: https://docs.delta.io/latest/delta-update.html#upsert-from-streaming-queries-using-foreachbatch
Our code sample (Java):
Dataset<Row> sourceDf = sparkSession
    .readStream()
    .format("delta")
    .option("inferSchema", "true")
    .load(sourcePath);

DeltaTable deltaTable = DeltaTable.forPath(sparkSession, targetPath);

sourceDf.createOrReplaceTempView("vTempView");

StreamingQuery sq = sparkSession.sql("select * from vTempView").writeStream()
    .format("delta")
    .foreachBatch((microDf, id) -> {
        deltaTable.alias("e").merge(microDf.alias("d"), "e.SALE_ID = d.SALE_ID")
            .whenMatched().updateAll()
            .whenNotMatched().insertAll()
            .execute();
    })
    .outputMode("update")
    .option("checkpointLocation", util.getFullS3Path(target)+"/_checkpoint")
    .trigger(Trigger.Once())
    .start();
Problem:
Here the source path and target path are already in sync via the checkpoint folder. The target has around 8 million rows of data, amounting to around 450 MB of parquet files.
When new data arrives in the source path (let's say 987 rows), the above code picks it up and performs a merge with the target table. During this operation Spark tries to perform a BroadcastHashJoin and broadcasts the target table, which has 8M rows.
Here's a DAG snippet for the merge operation (with a table of 1M rows):
Expectation:
I am expecting the smaller dataset (i.e. 987 rows) to be broadcast. If not, then at least Spark should not broadcast the target table, as it is larger than the configured spark.sql.autoBroadcastJoinThreshold and we are not providing any broadcast hint anywhere.
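For reference, the threshold mentioned above is an ordinary Spark SQL setting, and a blunt session-level workaround is to disable automatic broadcast joins entirely. A minimal sketch in Scala (the question's code is Java, but the configuration calls are the same); the values shown are purely illustrative, not a recommendation for this workload:

// inspect the current threshold (the Spark default is 10485760 bytes, i.e. 10 MB)
println(spark.conf.get("spark.sql.autoBroadcastJoinThreshold"))

// setting it to -1 disables automatic broadcasting, so Spark falls back to a shuffle-based join (typically sort-merge)
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")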
Things I have tried:
I searched around and found this article: https://learn.microsoft.com/en-us/azure/databricks/kb/sql/bchashjoin-exceeds-bcjointhreshold-oom.
It provides 2 solutions:
Run "ANALYZE TABLE ..." (not possible, since we are reading the target table from a path and not from a registered table).
Cache the table you are broadcasting; DeltaTable does not have any provision to cache a table, so we can't do this either.
I thought this was because we are using the DeltaTable.forPath() method for reading the target table and Spark is unable to calculate the target table metrics, so I also tried a different approach:
Dataset<Row> sourceDf = sparkSession
    .readStream()
    .format("delta")
    .option("inferSchema", "true")
    .load(sourcePath);

Dataset<Row> targetDf = sparkSession
    .read()
    .format("delta")
    .option("inferSchema", "true")
    .load(targetPath);

sourceDf.createOrReplaceTempView("vtempview");
targetDf.createOrReplaceTempView("vtemptarget");
targetDf.cache();

StreamingQuery sq = sparkSession.sql("select * from vtempview").writeStream()
    .format("delta")
    .foreachBatch((microDf, id) -> {
        microDf.createOrReplaceTempView("vtempmicrodf");
        microDf.sparkSession().sql(
            "MERGE INTO vtemptarget as t USING vtempmicrodf as s ON t.SALE_ID = s.SALE_ID WHEN MATCHED THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT * "
        );
    })
    .outputMode("update")
    .option("checkpointLocation", util.getFullS3Path(target)+"/_checkpoint")
    .trigger(Trigger.Once())
    .start();
In the above snippet I am also caching targetDf so that Spark can calculate the metrics and avoid broadcasting the target table, but it didn't help and Spark still broadcasts it.
Now I am out of options. Can anyone give me some guidance on this?

Change filter/where condition when restarting a Structured Streaming query reading data from Delta Table

In Structured Streaming, will the checkpoints keep track of which data has already been processed from a Delta Table?
def fetch_data_streaming(source_table: str):
    print("Fetching now")
    streamingInputDF = (
        spark
        .readStream
        .format("delta")
        .option("maxBytesPerTrigger", 1024)
        .table(source_table)
        .where("measurementId IN (1351,1350)")
        .where("year >= '2021'")
    )
    query = (
        streamingInputDF
        .writeStream
        .outputMode("append")
        .option("checkpointLocation", "/streaming_checkpoints/5")
        .foreachBatch(customWriter)
        .start()
        .awaitTermination()
    )
    return query

def customWriter(batchDF, batchId):
    print(batchId)
    print(batchDF.count())
    batchDF.show(10)
    length = batchDF.count()
    print("batchId, batch size:", batchId, length)
If I change the where clause in streamingInputDF to add more measurementIds, the Structured Streaming job doesn't always acknowledge the change and fetch the new data values. It continues to run as if nothing has changed, whereas at times it starts fetching new values.
Isn't the checkpoint supposed to identify the change?
Edit: Schema of the delta table:
col_name        data_type
measurementId   int
year            int
time            timestamp
q               smallint
v               string
"In structured streaming, will the checkpoints will keep track of which data has already been processed?"
Yes, the Structured Streaming job will store the read version of the Delta table in its checkpoint files to avoid producing duplicates.
Within the checkpoint directory in the folder "offsets", you will see that Spark stored the progress per batchId. For example it will look like below:
v1
{"batchWatermarkMs":0,"batchTimestampMs":1619695775288,"conf":[...]}
{"sourceVersion":1,"reservoirId":"d910a260-6aa2-4a7c-9f5c-1be3164127c0","reservoirVersion":2,"index":2,"isStartingVersion":true}
Here, the important part is "reservoirVersion":2, which tells you that the streaming job has consumed all data from the Delta table as of version 2.
Restarting your Structured Streaming query with an additional filter condition will therefore not apply it to historic records, but only to those that were added to the Delta table after version 2.
To see this behavior in action, you can use the code below and analyse the content of the checkpoint files.
import org.apache.spark.sql.functions.col
import spark.implicits._

val deltaPath = "file:///tmp/delta/table"
val checkpointLocation = "file:///tmp/checkpoint/"

// run the following two lines once
val deltaDf = Seq(("1", "foo1"), ("2", "foo2"), ("3", "foo2")).toDF("id", "value")
deltaDf.write.format("delta").mode("append").save(deltaPath)

// run this code for the first time, then add the filter condition, then run again
val query = spark.readStream
  .format("delta")
  .load(deltaPath)
  .filter(col("id").isin("1")) // in the second run add "2"
  .writeStream
  .format("console")
  .outputMode("append")
  .option("checkpointLocation", checkpointLocation)
  .start()

query.awaitTermination()
Now, if you append some more data to the Delta table while the streaming query is shut down and then restart it with the new filter condition, the filter will be applied to the new data.
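If you actually want the new filter applied to the historic data as well, a common approach (an assumption on my part, not something stated in the question) is to start the query against a fresh checkpoint location, so the stream re-reads the Delta table from its starting version. A minimal sketch, reusing deltaPath from the snippet above:

val reprocessQuery = spark.readStream
  .format("delta")
  .load(deltaPath)
  .filter(col("id").isin("1", "2")) // the new filter
  .writeStream
  .format("console")
  .outputMode("append")
  .option("checkpointLocation", "file:///tmp/checkpoint_v2/") // new checkpoint => progress starts over
  .start()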

Structured Streaming: Reading from multiple Kafka topics at once

I have a Spark Structured Streaming application which has to read from 12 Kafka topics (different schemas, Avro format) at once, deserialize the data and store it in HDFS. When I read from a single topic using my code, it works fine and without errors, but on running multiple queries together, I'm getting the following error:
java.lang.IllegalStateException: Race while writing batch 0
My code is as follows:
def main(args: Array[String]): Unit = {
  val kafkaProps = Util.loadProperties(kafkaConfigFile).asScala
  val topic_list = Seq("topic1", "topic2", "topic3", "topic4")
  topic_list.foreach(x => {
    kafkaProps.update("subscribe", x)
    val source = Source.fromInputStream(Util.getInputStream("/schema/topics/" + x)).getLines.mkString
    val schemaParser = new Schema.Parser
    val schema = schemaParser.parse(source)
    val sqlTypeSchema = SchemaConverters.toSqlType(schema).dataType.asInstanceOf[StructType]
    val kafkaStreamData = spark
      .readStream
      .format("kafka")
      .options(kafkaProps)
      .load()
    val udfDeserialize = udf(deserialize(source), DataTypes.createStructType(sqlTypeSchema.fields))
    val transformedDeserializedData = kafkaStreamData.select("value").as(Encoders.BINARY)
      .withColumn("rows", udfDeserialize(col("value")))
      .select("rows.*")
    val query = transformedDeserializedData
      .writeStream
      .trigger(Trigger.ProcessingTime("5 seconds"))
      .outputMode("append")
      .format("parquet")
      .option("path", "/output/topics/" + x)
      .option("checkpointLocation", checkpointLocation + "/" + x)
      .start()
  })
  spark.streams.awaitAnyTermination()
}
Alternative: you can use Kafka Connect (from Confluent), NiFi, StreamSets, etc., as your use case seems to fit "dump/persist to HDFS". That said, you need to have these tools installed. The small-files problem you mention is not an issue, so be it.
From Apache Kafka 0.9 onwards you can use the Kafka Connect API for a Kafka --> HDFS sink (various HDFS formats are supported). You need a Kafka Connect cluster, but that can be based on your existing cluster in any event, so it's not a big deal. Someone does need to maintain it, though.
Some links to get you on your way:
https://data-flair.training/blogs/kafka-connect/
https://github.com/confluentinc/kafka-connect-hdfs
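If you prefer to stay inside Spark, note that the Kafka source itself can also subscribe to several topics in one streaming DataFrame (a comma-separated subscribe list, or subscribePattern). A minimal Scala sketch, assuming you then branch per topic on the topic column that each record carries; the broker address is illustrative:

import org.apache.spark.sql.functions.col

val multiTopicStream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // illustrative broker address
  .option("subscribe", "topic1,topic2,topic3,topic4") // one stream, several topics
  .load()

// every record exposes its source topic, so per-topic handling can filter on it
val topic1Records = multiTopicStream.filter(col("topic") === "topic1")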

How to enrich data of a streaming query and write the result to Elasticsearch?

For a given dataset (orignalData) I'm required to map the values and then prepare a new dataset combining the search results from Elasticsearch.
Dataset<Row> orignalData = spark
    .readStream()
    .format("kafka")
    .option("kafka.bootstrap.servers", "test")
    .option("subscribe", "test")
    .option("startingOffsets", "latest")
    .load();

Dataset<Row> esData = JavaEsSparkSQL
    .esDF(spark.sqlContext(), "spark_correlation/doc");

esData.createOrReplaceTempView("es_correlation");
List<SGEvent> listSGEvent = new ArrayList<>();

orignalData.foreach((ForeachFunction<Row>) row -> {
    SGEvent event = new SGEvent();
    String sourceKey = row.get(4).toString();
    String searchQuery = "select id from es_correlation where es_correlation.key='" + sourceKey + "'";
    Dataset<Row> result = spark.sqlContext().sql(searchQuery);
    String id = null;
    if (result != null) {
        result.show();
        id = result.first().toString();
    }
    event.setId(id);
    event.setKey(sourceKey);
    listSGEvent.add(event);
});

Encoder<SGEvent> eventEncoderSG = Encoders.bean(SGEvent.class);
Dataset<Row> finalData = spark.createDataset(listSGEvent, eventEncoderSG).toDF();

finalData
    .writeStream()
    .outputMode(OutputMode.Append())
    .format("org.elasticsearch.spark.sql")
    .option("es.mapping.id", "id")
    .option("es.write.operation", "upsert")
    .option("checkpointLocation", "/tmp/checkpoint/sg_event")
    .start("spark_index/doc").awaitTermination();
Spark throws the following exception:
org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:389)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:38)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:36)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
at scala.collection.immutable.List.foreach(List.scala:392)
Is my approach to combining Elasticsearch values with a Dataset valid? Is there any better solution for this?
There are a couple of issues here.
As the exception says, orignalData is a streaming query (streaming Dataset) and the only way to execute it is to use writeStream.start(). That's one issue.
You did writeStream.start(), but with another query, finalData, which is not streaming but batch. That's another issue.
For "enrichment" cases like yours, you can use a streaming join (the Dataset.join operator) or one of DataStreamWriter.foreach and DataStreamWriter.foreachBatch. I think DataStreamWriter.foreachBatch would be more efficient.
public DataStreamWriter<T> foreachBatch(VoidFunction2<Dataset<T>,Long> function)
(Java-specific) Sets the output of the streaming query to be processed using the provided function. This is supported only in the micro-batch execution modes (that is, when the trigger is not continuous). In every micro-batch, the provided function will be called with (i) the output rows as a Dataset and (ii) the batch identifier. The batchId can be used to deduplicate and transactionally write the output (that is, the provided Dataset) to external systems. The output Dataset is guaranteed to be exactly the same for the same batchId (assuming all operations are deterministic in the query).
Not only would you get all the data of a streaming micro-batch in one shot (the first input argument of type Dataset<T>), but also a way to submit another Spark job (across executors) based on the data.
The pseudo-code could look as follows (I'm using Scala as I'm more comfortable with the language):
val dsWriter = originalData.writeStream.foreachBatch { (data, batchId) =>
  // make sure the data is small enough to collect on the driver
  // Otherwise expect OOME
  // It'd also be nice to have a Java bean to convert the rows to proper types and names
  val localData = data.collect

  // Please note that localData is no longer Spark's Dataset
  // It's a local Java collection

  // Use the Java Collection API to work with localData
  // e.g. using Scala
  // You're mapping over localData (for a single micro-batch)
  // And creating finalData
  // I'm using the same names as your code to be as close to your initial idea as possible
  val finalData = localData.map { row =>
    // row is the old row from your original code
    // do something with it
    // e.g. using Java
    String sourceKey=row.get(4).toString();
    ...
  }

  // Time to save the processed data to ES
  // finalData is a local Java/Scala collection, not Spark's DataFrame!
  // Let's convert it to a DataFrame (and leverage the Spark distributed platform)
  // Note that I'm almost using your code, but it's a batch query not a streaming one
  // We're inside foreachBatch
  finalData
    .toDF // Convert a local collection to a Spark DataFrame
    .write // this creates a batch query
    .format("org.elasticsearch.spark.sql")
    .option("es.mapping.id", "id")
    .option("es.write.operation", "upsert")
    .option("checkpointLocation","/tmp/checkpoint/sg_event")
    .save("spark_index/doc") // save (not start) as it's a batch query inside a streaming query
}
dsWriter is a DataStreamWriter, and you can now start it to run the streaming query.
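A minimal sketch of that last step (nothing beyond the standard start/awaitTermination calls):

// start the streaming query defined by the foreachBatch writer and block until it terminates
val query = dsWriter.start()
query.awaitTermination()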
I was able to achieve the actual solution by using SQL joins.
Please refer to the code below.
Dataset<Row> orignalData = spark
    .readStream()
    .format("kafka")
    .option("kafka.bootstrap.servers", "test")
    .option("subscribe", "test")
    .option("startingOffsets", "latest")
    .load();

orignalData.createOrReplaceTempView("stream_data");

Dataset<Row> esData = JavaEsSparkSQL
    .esDF(spark.sqlContext(), "spark_correlation/doc");

esData.createOrReplaceTempView("es_correlation");

Dataset<Row> joinedData = spark.sqlContext().sql("select * from stream_data, es_correlation where es_correlation.key = stream_data.key");

// Or
/* By using the Dataset join operator
Dataset<Row> joinedData = orignalData.join(esFirst, "key");
*/

Encoder<SGEvent> eventEncoderSG = Encoders.bean(SGEvent.class);

Dataset<SGEvent> finalData = joinedData.map((MapFunction<Row, SGEvent>) row -> {
    SGEvent event = new SGEvent();
    event.setId(row.get(0).toString());
    event.setKey(row.get(3).toString());
    return event;
}, eventEncoderSG);

finalData
    .writeStream()
    .outputMode(OutputMode.Append())
    .format("org.elasticsearch.spark.sql")
    .option("es.mapping.id", "id")
    .option("es.write.operation", "upsert")
    .option("checkpointLocation", "/tmp/checkpoint/sg_event")
    .start("spark_index/doc").awaitTermination();

Call a function with each element of a stream in Databricks

I have a DataFrame stream in Databricks, and I want to perform an action on each element. On the net I found special-purpose methods, like writing it to the console or dumping it into memory, but I want to add some business logic and put some results into Redis.
To be more specific, this is how it would look in the non-streaming case:
val someDataFrame = Seq(
  ("key1", "value1"),
  ("key2", "value2"),
  ("key3", "value3"),
  ("key4", "value4")
).toDF()

def someFunction(keyValuePair: (String, String)) = {
  println(keyValuePair)
}

someDataFrame.collect.foreach(r => someFunction((r(0).toString, r(1).toString)))
But if someDataFrame is not a simple data frame but a streaming data frame (indeed coming from Kafka), the error message is this:
org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
Could anyone please help me solve this problem?
Some important notes:
I've read the relevant documentation, like Spark Streaming or Databricks Streaming, and a few other descriptions as well.
I know that there must be something like start() and awaitTermination, but I don't know the exact syntax. The descriptions did not help.
It would take pages to list all the possibilities I tried, so I'd rather not provide them.
I do not want to solve the specific problem of displaying the result, i.e. please do not provide a solution to this specific case. The someFunction would look something like this:
val someData = readSomeExternalData()
if (condition containing keyValuePair and someData) {
  doSomething(keyValuePair);
}
(The question What is the purpose of ForeachWriter in Spark Structured Streaming? does not provide a working example and therefore does not answer my question.)
Here is an example using foreachBatch to save every item to Redis with the streaming API.
Related to a previous question (DataFrame to RDD[(String, String)] conversion).
// import spark and spark-redis
import org.apache.spark._
import org.apache.spark.sql._
import org.apache.spark.streaming._
import org.apache.spark.sql.types._
import com.redislabs.provider.redis._

// schema of the csv files
val userSchema = new StructType()
  .add("name", "string")
  .add("age", "string")

// create a data stream reader from a dir with csv files
val csvDF = spark
  .readStream
  .format("csv")
  .option("sep", ";")
  .schema(userSchema)
  .load("./data") // directory where the CSV files are

// redis
val redisConfig = new RedisConfig(new RedisEndpoint("localhost", 6379))
implicit val readWriteConfig: ReadWriteConfig = ReadWriteConfig.Default

csvDF.map(r => (r.getString(0), r.getString(1))) // converts the dataset to a Dataset[(String, String)] of (name, age) pairs
  .writeStream // create a data stream writer
  .foreachBatch((df, _) => sc.toRedisKV(df.rdd)(redisConfig)) // save each batch to redis after converting it to an RDD
  .start // start processing
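Since the question also mentions ForeachWriter: if you need per-row rather than per-micro-batch processing, roughly the same thing can be written with DataStreamWriter.foreach. A minimal sketch, reusing csvDF from above and leaving the business logic as the question's hypothetical someFunction:

import org.apache.spark.sql.{ForeachWriter, Row}

val perRowQuery = csvDF
  .writeStream
  .foreach(new ForeachWriter[Row] {
    def open(partitionId: Long, epochId: Long): Boolean = true // e.g. open a Redis connection here
    def process(row: Row): Unit =
      someFunction((row.getString(0), row.getString(1)))       // business logic for one element
    def close(errorOrNull: Throwable): Unit = ()                // release the connection here
  })
  .start()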
Call a simple user-defined function from foreachBatch in Spark Streaming.
Please try this; it will print 'hello world' for every message from the TCP socket.
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from pyspark.sql.functions import split

spark = SparkSession \
    .builder \
    .appName("StructuredNetworkWordCount") \
    .getOrCreate()

# Create DataFrame representing the stream of input lines from connection to localhost:9999
lines = spark \
    .readStream \
    .format("socket") \
    .option("host", "localhost") \
    .option("port", 9999) \
    .load()

# Split the lines into words
words = lines.select(
    explode(
        split(lines.value, " ")
    ).alias("word")
)

# Generate running word count
wordCounts = words.groupBy("word").count()

# Function called by foreachBatch for every micro-batch
def process_row(df, epoch_id):
    # Write the batch to storage here; for now just print
    print('hello world')

query = words.writeStream.foreachBatch(process_row).start()
# query = wordCounts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
