Avoiding multiple streaming queries - apache-spark

I have a structured streaming query which sinks to Kafka. This query has complex aggregation logic.
I would like to sink the output DataFrame of this query to multiple Kafka topics, each partitioned on a different 'key' column. I don't want to have a separate Kafka sink for each topic, because that would mean running multiple streaming queries, one per topic, each repeating the complex aggregation.
Questions:
Is there a way to output the results of a structured streaming query to multiple Kafka topics each with a different key column but without having to execute multiple streaming queries?
If not, would it be efficient to cascade the multiple queries such that the first query does the complex aggregation and writes output to Kafka and then the other queries just read the output of the first query and write their topics to Kafka thus avoiding doing the complex aggregation again?
Thanks in advance for any help.

So the answer was kind of staring me in the face. It's documented as well; link below.
One can write to multiple Kafka topics from a single query. If the DataFrame you want to write has a column named "topic" (along with the "key" and "value" columns), each row is written to the topic named in that row's "topic" column. This happens automatically, so the only thing you need to figure out is how to generate the value of that column.
This is documented - https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#writing-data-to-kafka
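For illustration, here is a minimal sketch of that pattern, assuming a hypothetical aggregated DataFrame with region, customerId and total columns (the broker address, topic names and checkpoint path are placeholders as well):

import org.apache.spark.sql.functions._

// aggregatedDF is the output of the complex aggregation (hypothetical schema)
val out = aggregatedDF.select(
  // route each row to a topic based on a column value
  when(col("region") === "EU", lit("events-eu")).otherwise(lit("events-us")).as("topic"),
  col("customerId").cast("string").as("key"),
  to_json(struct(col("customerId"), col("region"), col("total"))).as("value")
)

out.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("checkpointLocation", "/tmp/checkpoints/multi-topic")
  .outputMode("update") // adjust to your aggregation/watermark setup
  .start()

Note that the writer must not set the topic option, because that option overrides any per-row "topic" column.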

I am also looking for a solution to this problem, and in my case it's not necessarily a Kafka sink. I want to write some records of a DataFrame to sink1 and other records to sink2 (depending on some condition), without reading the same data twice in two streaming queries.
Currently this does not seem possible with the current implementation (the createSink() method in DataSource.scala provides support for a single sink).
However, Spark 2.4.0 introduces a new API, foreachBatch(), which gives you a handle to the DataFrame of each micro-batch. That DataFrame can be cached, written to different sinks, or processed multiple times before being unpersisted again.
Something like this:
streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
  batchDF.cache()                      // cache the micro-batch so it is computed only once
  batchDF.write.format(...).save(...)  // location 1
  batchDF.write.format(...).save(...)  // location 2
  batchDF.unpersist()                  // release the cached micro-batch
}.start()
Right now this feature is only available in the Databricks runtime:
https://docs.databricks.com/spark/latest/structured-streaming/foreach.html#reuse-existing-batch-data-sources-with-foreachbatch
EDIT 15/Nov/18:
It is now available in Spark 2.4.0 (https://issues.apache.org/jira/browse/SPARK-24565)

There is no out-of-the-box way to have a single read and multiple writes in Structured Streaming. The only way is to implement a custom sink that writes into multiple topics.
Whenever you call dataset.writeStream().start(), Spark starts a new stream that reads from a source (readStream()) and writes into a sink (writeStream()).
Even if you try to cascade it, Spark will create two separate streams with one source and one sink each. In other words, it will read, process and write the data twice:
Dataset<Row> df = <aggregation>;
StreamingQuery sq1 = df.writeStream()...start();
StreamingQuery sq2 = df.writeStream()...start();
There is a way to cache read data in Spark Streaming, but this option is not available for Structured Streaming yet.

Related

Is it possible to have a single kafka stream for multiple queries in structured streaming?

I have a spark application that has to process multiple queries in parallel using a single Kafka topic as the source.
The behavior I noticed is that each query has its own consumer (in its own consumer group), causing the same data to be streamed to the application multiple times (please correct me if I'm wrong), which seems very inefficient. Instead, I would like to have a single stream of data that would then be processed in parallel by Spark.
What would be the recommended way to improve performance in the scenario above? Should I focus on optimizing Kafka partitions instead of how Spark interacts with Kafka?
Any thoughts are welcome,
Thank you.
The behavior I noticed is that each query has its own consumer (in its own consumer group), causing the same data to be streamed to the application multiple times (please correct me if I'm wrong), which seems very inefficient. Instead, I would like to have a single stream of data that would then be processed in parallel by Spark.
tl;dr Not possible in the current design.
A single streaming query "starts" from a sink, and there can only be one sink per streaming query (I'm repeating it to myself to remember it better, as I seem to have been caught out by this multiple times with Spark Structured Streaming, Kafka Streams and recently with ksqlDB).
Once you have a sink (output), the streaming query can be started (on its own daemon thread).
For exactly the reasons you mentioned (queries must not share the data, which with the Kafka Consumer API means their group.id has to be different), every streaming query creates a unique group ID (cf. this code and the comment in 3.3.0) so the same records can be transformed by different streaming queries:
// Each running query should use its own group id. Otherwise, the query may be only assigned
// partial data since Kafka will assign partitions to multiple consumers having the same group
// id. Hence, we should generate a unique id for each query.
val uniqueGroupId = KafkaSourceProvider.batchUniqueGroupId(sourceOptions)
And that makes sense IMHO.
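As a side note, if you need more control over those consumer group names (e.g. for Kafka ACLs or monitoring), newer Spark versions expose source options for this. A hedged sketch, assuming Spark 3.0+ (broker and topic names are placeholders):

val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  // prefix for the auto-generated, per-query consumer group ids
  .option("groupIdPrefix", "my-app")
  // .option("kafka.group.id", "my-fixed-group") // pins the group id; use with care, it defeats the per-query isolation described above
  .load()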
Should I focus on optimizing Kafka partitions instead of how Spark interacts with Kafka?
Guess so.
You can separate your source data frame into different stages, yes.
val df = spark.readStream.format("kafka")...
val strDf = df.select($"value".cast("string"))...
val df1 = strDf.filter(...) // in "parallel"
val df2 = strDf.filter(...) // in "parallel"
Only the first line should be creating Kafka consumer instance(s), not the other stages, as they depend on the consumer records from the first stage.

How does spark structured streaming job handle stream - static DataFrame join?

I have a Spark Structured Streaming job which reads a mapping table from Cassandra and Delta Lake and joins it with a streaming DataFrame. I would like to understand the exact mechanism here. Does Spark hit these data sources (Cassandra and Delta Lake) for every micro-batch cycle? I ask because in the Spark web UI I see these tables being read only once.
Please help me understand this.
Thanks in advance
"Does spark hit these data sources(cassandra and deltalake) for every cycle of microbatch?"
According to the book "Learning Spark, 2nd edition" from O'Reilly on static-stream joins it is mentioned that the static DataFrame is read in every micro-batch.
To be more precise, I find the following section in the book quite helpful:
Stream-static joins are stateless operations, and therefore do not require any kind of watermarking.
The static DataFrame is read repeatedly while joining with the streaming data of every micro-batch, so you can cache the static DataFrame to speed up reads.
If the underlying data in the data source on which the static DataFrame was defined changes, whether those changes are seen by the streaming query depends on the specific behavior of the data source. For example, if the static DataFrame was defined on files, then changes to those files (e.g. appends) will not be picked up until the streaming query is restarted.
When applying a stream-static join it is assumed that the static part is not changing at all, or is only changing slowly. If you plan to join two rapidly changing data sources, you need to switch to a stream-stream join.
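A minimal sketch of such a stream-static join, with the static side cached so its files are not re-scanned for every micro-batch (the paths, broker address and join column are assumptions):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("stream-static-join").getOrCreate()

// static mapping table, e.g. from Delta Lake; cache() avoids re-reading the files on every micro-batch
val mappingDF = spark.read.format("delta").load("s3://tables/mapping").cache()

// streaming side
val streamDF = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .load()
  .selectExpr("CAST(key AS STRING) AS id", "CAST(value AS STRING) AS payload")

// stream-static join: stateless, no watermark required
val joined = streamDF.join(mappingDF, Seq("id"), "left_outer")

joined.writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/checkpoints/stream-static-join")
  .start()

Keep in mind that caching pins the snapshot that was read, so (as the quoted passage notes for file sources) changes to the mapping table will generally not be visible until the query is restarted.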

How to ingest different spark dataframes in a single spark job

I want to write an ETL pipeline in Spark that handles different input sources while using as few computing resources as possible, and I am having trouble with the 'traditional' Spark ETL approach.
I have a number of streaming data sources which need to be persisted into Delta Lake tables. Each data source is simply a folder in S3 with Avro files. Each data source has a different schema. Each data source should be persisted into its own Delta Lake table. Little conversion other than Avro -> Delta is needed, only enrichment with some additional fields derived from the filename.
New files are added at a moderate rate, from once a min to once a day, depending on the datasource. I have a kafka notification when new data lands, describing what kind of data and s3 file path.
Assume there are two data sources, A and B. A is s3://bucket/A/* files, B is s3://bucket/B/*. Whenever a new file is added I get a Kafka message with a payload like {'datasource': 'A', 'filename': 's3://bucket/A/file1', ... other fields}. A files should go to the Delta table s3://delta/A/, B files to s3://delta/B/.
How can I ingest them all in a single spark application with minimal latency?
As new data is constantly arriving, this sounds like streaming. But in Spark streaming one needs to define the stream schema upfront, and I have different sources with different schemas that are not known upfront.
Spinning up a dedicated spark application per datasource is not an option - there are 100+ datasources with very small files arriving. Having 100+ spark applications is a waste of money. All should be ingested using single cluster of moderate size.
The only idea I have now: in the driver process, run a normal Kafka consumer; for each record, read a DataFrame, enrich it with additional fields and persist it to its Delta table. For more parallelism, consume multiple messages and process them in futures, so multiple jobs run concurrently.
Some pseudo-code, in a driver process:
val consumer = KafkaConsumer(...)            // plain Kafka consumer running in the driver
consumer.foreach { record =>
  val ds   = record.datasource
  val file = record.filename
  val df = spark.read.format("avro").load(file)
    .withColumn("id", lit(record.id))        // enrich with fields from the notification (lit from org.apache.spark.sql.functions)
  val dest = s"s3://delta/$ds"
  df.write.format("delta").mode("append").save(dest)  // append to the existing Delta table
  consumer.commit(...)                       // offset from record
}
Sounds good (and PoC shows it works), but I wonder if there are other options? Any other ideas are appreciated.
Spark runs in a DataBricks platform.
Spark does not constrain you to one Spark application per data source. You can group data sources into a couple of Spark apps, or go with one Spark application for all of them, which is a feasible approach as long as the app has enough resources to ingest and process all the data sources.
You can do something like:
object StreamingJobs extends SparkApp {
  // consume from Kafka Topic 1
  StreamProcess_1.runStream(spark)
  // consume from Kafka Topic 2
  StreamProcess_2.runStream(spark)
  // consume from Kafka Topic n
  StreamProcess_N.runStream(spark)
  // wait until termination
  spark.streams.awaitAnyTermination()
}
and maybe another Spark job for batch processing:
object BatchedJobs extends SparkApp {
  // consume from data source 1
  BatchedProcess_1.run(spark)
  // consume from data source 2
  BatchedProcess_2.run(spark)
  // consume from data source n
  BatchedProcess_N.run(spark)
}
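For completeness, a runStream implementation in this hypothetical layout could just be a thin readStream/writeStream wrapper. A sketch under those assumptions (the SparkApp trait, broker, topic and paths are all placeholders):

import org.apache.spark.sql.SparkSession

object StreamProcess_1 {
  def runStream(spark: SparkSession): Unit = {
    spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")          // placeholder broker
      .option("subscribe", "topic-1")                            // placeholder topic
      .load()
      .selectExpr("CAST(value AS STRING) AS value")
      .writeStream
      .format("delta")                                           // or any other sink
      .option("checkpointLocation", "s3://checkpoints/topic-1")  // placeholder path
      .start("s3://delta/topic-1")                               // placeholder table path
  }
}

Each runStream call starts its own streaming query; awaitAnyTermination() then keeps the driver alive while all of them run concurrently.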

Spark Streaming to Hive, too many small files per partition

I have a spark streaming job with a batch interval of 2 mins (configurable).
This job reads from a Kafka topic and creates a Dataset and applies a schema on top of it and inserts these records into the Hive table.
The Spark Job creates one file per batch interval in the Hive partition like below:
dataset.coalesce(1).write().mode(SaveMode.Append).insertInto(targetEntityName);
Now the data that comes in is not that big, and even if I increase the batch duration to, say, 10 mins, I might end up getting only 2-3 MB of data, which is way less than the block size.
This is the expected behaviour in Spark Streaming.
I am looking for efficient ways to do a post processing to merge all these small files and create one big file.
If anyone's done it before, please share your ideas.
I would encourage you to not use Spark to stream data from Kafka to HDFS.
The Kafka Connect HDFS plugin by Confluent (or Apache Gobblin by LinkedIn) exists for this very purpose. Both offer Hive integration.
Find my comments about compaction of small files in this Github issue
If you need to write Spark code to process Kafka data into a schema, you can still do that and write into another topic in (preferably) Avro format, which Hive can easily read without a predefined table schema.
I personally have written a "compaction" process that grabs a bunch of hourly Avro data partitions from a Hive table and converts them into a daily-partitioned Parquet table for analytics. It's been working great so far.
If you want to batch the records before they land on HDFS, that's where Kafka Connect or Apache Nifi (mentioned in the link) can help, given that you have enough memory to store records before they are flushed to HDFS.
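If you do go the Spark route described above, a hedged sketch of reading Kafka, applying a schema, and writing Avro-encoded records to another topic might look like this (assumes Spark 3.x with the external spark-avro package on the classpath; the schema, broker and topic names are made up):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.avro.functions.to_avro   // from the spark-avro module
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("kafka-to-avro-topic").getOrCreate()

// schema applied to the incoming JSON records (hypothetical)
val schema = new StructType()
  .add("id", StringType)
  .add("ts", TimestampType)
  .add("amount", DoubleType)

val parsed = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "raw-events")
  .load()
  .select(from_json(col("value").cast("string"), schema).as("data"))

parsed
  .select(to_avro(col("data")).as("value"))   // Avro-encode the whole record as the Kafka message value
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("topic", "clean-events")
  .option("checkpointLocation", "/tmp/checkpoints/kafka-to-avro")
  .start()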
I have exactly the same situation as you. I solved it as follows:
Let's assume that your newly arriving data is stored in a dataset: dataset1
1- Partition the table with a good partition key; in my case I found that I can partition using a combination of keys to get around 100 MB per partition.
2- Save using Spark core, not Spark SQL:
a- Load the whole partition into memory (into a dataset: dataset2) when you want to save.
b- Then apply the dataset union function: dataset3 = dataset1.union(dataset2)
c- Make sure that the resulting dataset is partitioned as you wish, e.g.: dataset3.repartition(1)
d- Save the resulting dataset in "Overwrite" mode to replace the existing file (see the sketch below).
If you need more details about any step, please reach out.
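A minimal sketch of steps a-d above, assuming a Parquet-backed table laid out by a hypothetical partition_key directory (the paths and the new-data source are placeholders; the union is checkpointed so it is safe to overwrite the same path that was just read):

import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("partition-compaction").getOrCreate()
spark.sparkContext.setCheckpointDir("s3://tmp/spark-checkpoints")   // placeholder

val tablePath    = "s3://warehouse/my_table"   // placeholder table location
val partitionKey = "2023-01-01"                // placeholder value of the affected partition

// dataset1: the newly arrived records for this partition (placeholder source)
val dataset1 = spark.read.parquet("s3://staging/new-data")

// a) load the whole existing partition
val dataset2 = spark.read.parquet(s"$tablePath/partition_key=$partitionKey")

// b) union new and existing data; checkpoint() materializes it and cuts the lineage,
//    so the overwrite below does not read and truncate the same files at once
val dataset3 = dataset1.union(dataset2).checkpoint()

// c) + d) collapse to a single file and overwrite the partition in place
dataset3.repartition(1)
  .write
  .mode(SaveMode.Overwrite)
  .parquet(s"$tablePath/partition_key=$partitionKey")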

spark transform method behaviour in multiple partition

I am using Kafka streaming to read data from a Kafka topic, and I want to join every RDD that I get in the stream to an existing RDD. So I think using "transform" is the best option (unless anyone disagrees and suggests a better approach).
I also read the following example of the "transform" method on DStreams in Spark:
val spamInfoRDD = ssc.sparkContext.newAPIHadoopRDD(...) // RDD containing spam information

val cleanedDStream = wordCounts.transform { rdd =>
  rdd.join(spamInfoRDD).filter(...) // join data stream with spam information to do data cleaning
  ...
}
But let's say I have 3 partitions in the Kafka topic, and I invoke 3 consumers to read from them. Now, this transform method will be called in three separate threads in parallel.
I am not sure if joining the RDDs in this case will be thread-safe and whether this could result in data loss (considering that RDDs are immutable).
Also, if you say that it is thread-safe, wouldn't the performance be low, since we are creating so many RDDs and then joining them?
Can anybody suggest?
