How does the Kafka sink support update mode in Structured Streaming? - apache-spark

I have read about the different output modes like:
Complete Mode - The entire updated Result Table will be written to the sink.
Append Mode - Only the new rows appended in the Result Table since the last trigger will be written to the external storage.
Update Mode - Only the rows that were updated in the Result Table since the last trigger will be written to the external storage.
At first I thought I understood the explanations above.
Then I came across this:
File sink supported modes: Append
Kafka sink supported modes: Append,Update,Complete
Wait!! What??!!
Why couldn't we just write out the entire result table to file?
How can we update an already existing entry in Kafka? It's a stream; you can't just look for certain messages and change/update them.
This makes no sense at all.
Could you help me understand this? I just don't get how this works technically.

Spark writes one file per partition, often with one file per executor. Executors run in a distributed fashion, and the files are local to each executor, so append is the only mode that really makes sense - you cannot fully replace individual files, especially not without losing data within the stream. That leaves you with "appending new files to the filesystem", or inserting into existing files.
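For illustration, a minimal file-sink query might look like the sketch below (the paths and schema are made up); append is the only output mode the file sink will accept:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.types.StructType;

    public class FileSinkDemo {
        public static void main(String[] args) throws Exception {
            SparkSession spark = SparkSession.builder().appName("file-sink-demo").getOrCreate();

            // hypothetical input schema; file sources need one up front
            StructType schema = new StructType()
                    .add("key", "string")
                    .add("value", "string");

            Dataset<Row> input = spark.readStream()
                    .format("json")
                    .schema(schema)
                    .load("/data/in");

            input.writeStream()
                    .format("parquet")                 // file sink: new files are appended, never rewritten
                    .outputMode("append")              // the only output mode the file sink supports
                    .option("path", "/data/out")
                    .option("checkpointLocation", "/data/checkpoints/file-sink-demo")
                    .start()
                    .awaitTermination();
        }
    }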
Kafka has no update functionality... The Kafka Integration Guide doesn't mention any of these modes, so it is unclear what you are referring to. You use write or writeStream, and it will always "append" the "complete" dataframe batch(es) to the end of the Kafka topic. The way Kafka implements something like updates is with compacted topics, but that has nothing to do with Spark.
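As a sketch (the broker, topics, and checkpoint path below are made up): whichever output mode you choose, every trigger simply produces new records onto the target topic; "update" only changes which rows Spark hands to the sink, not what Kafka does with them:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class KafkaSinkDemo {
        public static void main(String[] args) throws Exception {
            SparkSession spark = SparkSession.builder().appName("kafka-sink-demo").getOrCreate();

            // read a stream from one topic, aggregate, and write the running counts to another topic
            Dataset<Row> counts = spark.readStream()
                    .format("kafka")
                    .option("kafka.bootstrap.servers", "broker:9092")
                    .option("subscribe", "events")
                    .load()
                    .selectExpr("CAST(value AS STRING) AS key")
                    .groupBy("key")
                    .count()
                    .selectExpr("key", "CAST(`count` AS STRING) AS value");  // Kafka sink expects key/value columns

            counts.writeStream()
                    .format("kafka")
                    .option("kafka.bootstrap.servers", "broker:9092")
                    .option("topic", "event_counts")
                    .option("checkpointLocation", "/tmp/checkpoints/kafka-sink-demo")
                    .outputMode("update")   // each trigger, the changed counts are appended to the topic as new records
                    .start()
                    .awaitTermination();
        }
    }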

Related

Problems reading existing multilevel partitioned file data from the middle in Spark Structured Streaming

I am working on Spark Structured Streaming with an existing multilevel partitioned Parquet file as the source. I have the following issue while using it:
starting the streaming job so that it reads data from a particular partition instead of starting from the beginning.
Suppose we observe a data quality issue in the partition year=2018/month=10/hour=10, and suppose I have since corrected that data by replacing the bad files with correct ones.
The question now is how to reprocess the data starting from that day instead of starting from the beginning. In Structured Streaming, if I use a file stream as the source, it will load all files, but I want to ignore a few of them. I would also need to remove my checkpoint directory, because it holds offsets up to the current date.
So: given a data quality issue observed in the partition year=2018/month=10/hour=10, how do I reprocess the data starting from that day instead of starting from the beginning?
I don't think it's possible in Spark Structured Streaming (I wish I were mistaken).
Since we're talking about a streaming query, you'd have to rewind the "stream". The only way to achieve that (that I can think of) is to re-upload the data (no idea how to do that), or to process data that would "delete" the previous version of the partition year=2018/month=10/hour=10 and then upload a new, corrected version.
The question is how to inform the Parquet data source that whatever has already been processed should be "evicted" from the result (which may already have been sent out to external sources for further processing).

Parquet File Output Sink - Spark Structured Streaming

Wondering what triggers a Spark Structured Streaming query (with the Parquet file output sink configured) to write data to the Parquet files, and how to modify it. I periodically feed the stream input data (using StreamReader to read in files), but it does not write output to a Parquet file for each file provided as input. Once I have given it a few files, it tends to write a Parquet file just fine.
I am wondering how to control this. I would like to be able to force a new write to a Parquet file for every new file provided as input. Any tips appreciated!
Note: I have maxFilesPerTrigger set to 1 on the read stream call. I also see the streaming query process the single input file; however, a single input file does not appear to result in the streaming query writing the output to a Parquet file.
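For reference, the read side described in the note would look roughly like the sketch below (the format, schema, and path are hypothetical):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.types.StructType;

    public class MaxFilesPerTriggerDemo {
        // Returns a streaming Dataset that pulls in at most one new input file per micro-batch.
        static Dataset<Row> readOneFilePerTrigger(SparkSession spark) {
            StructType schema = new StructType()       // file sources need an explicit schema
                    .add("eventTime", "timestamp")
                    .add("key", "string");

            return spark.readStream()
                    .format("csv")
                    .schema(schema)
                    .option("maxFilesPerTrigger", 1)   // consider at most one new file per trigger
                    .load("/data/in");
        }
    }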
After further analysis, and after working with the foreach output sink using the default append mode, I believe the issue I was running into was the combination of append mode with the watermarking feature.
After re-reading https://spark.apache.org/docs/2.2.1/structured-streaming-programming-guide.html#starting-streaming-queries it appears that when append mode is used with a watermark set, Spark Structured Streaming will not write aggregation results out to the result table until the watermark time limit has passed. Append mode does not allow updates to records, so it must wait for the watermark to pass in order to ensure the row will not change...
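As a sketch of that behaviour (column names and durations below are made up), an aggregation like this one, run in append mode, only emits a window's row once the watermark has moved past the end of that window:

    import static org.apache.spark.sql.functions.col;
    import static org.apache.spark.sql.functions.window;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    public class AppendWithWatermark {
        // events is assumed to be a streaming Dataset with an "eventTime" timestamp column and a "key" column.
        static void startQuery(Dataset<Row> events) throws Exception {
            Dataset<Row> counts = events
                    .withWatermark("eventTime", "10 minutes")                  // tolerate 10 minutes of lateness
                    .groupBy(window(col("eventTime"), "5 minutes"), col("key"))
                    .count();

            counts.writeStream()
                    .outputMode("append")      // a window's count is written only after the watermark passes its end
                    .format("parquet")
                    .option("path", "/data/out")
                    .option("checkpointLocation", "/data/checkpoints/append-with-watermark")
                    .start()
                    .awaitTermination();
        }
    }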
I believe the Parquet file sink does not allow the update mode; however, after switching to the foreach output sink and using the update mode, I observed data coming out of the sink as I expected. Essentially, for each record in, at least one record out, with none of the delay observed before.
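And a rough sketch of that foreach variant in update mode (the writer below just prints each row; a real implementation would open a connection in open() and write in process()):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.ForeachWriter;
    import org.apache.spark.sql.Row;

    public class ForeachUpdateDemo {
        // counts is assumed to be a streaming aggregation, e.g. events.groupBy("key").count().
        static void startQuery(Dataset<Row> counts) throws Exception {
            counts.writeStream()
                    .outputMode("update")                       // emit changed rows on every trigger
                    .foreach(new ForeachWriter<Row>() {
                        @Override
                        public boolean open(long partitionId, long epochId) {
                            return true;                        // no connection to set up in this sketch
                        }

                        @Override
                        public void process(Row row) {
                            System.out.println(row);            // replace with a real write to an external system
                        }

                        @Override
                        public void close(Throwable errorOrNull) {
                            // nothing to clean up
                        }
                    })
                    .option("checkpointLocation", "/data/checkpoints/foreach-update")
                    .start()
                    .awaitTermination();
        }
    }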
Hopefully this is helpful to others.

Storing Kafka Offsets in a File vs HBase

I am developing a Spark-Kafka streaming program where I need to capture the Kafka partition offsets in order to handle failure scenarios.
Most devs are using HBase as the storage for offsets, but how would it be if I used a file on HDFS or local disk to store offsets, which is simple and easy?
I am trying to avoid using a NoSQL store for storing offsets.
Can I know what the advantages and disadvantages of using a file over HBase for storing offsets are?
Just use Kafka. Out of the box, Apache Kafka stores consumer offsets within Kafka itself.
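The pattern from the Spark + Kafka integration guide looks roughly like the sketch below (the broker, topic, and group id are made up): grab each batch's offset ranges, and commit them back to Kafka only after your output has succeeded:

    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.common.serialization.StringDeserializer;
    import org.apache.spark.streaming.api.java.JavaInputDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import org.apache.spark.streaming.kafka010.CanCommitOffsets;
    import org.apache.spark.streaming.kafka010.ConsumerStrategies;
    import org.apache.spark.streaming.kafka010.HasOffsetRanges;
    import org.apache.spark.streaming.kafka010.KafkaUtils;
    import org.apache.spark.streaming.kafka010.LocationStrategies;
    import org.apache.spark.streaming.kafka010.OffsetRange;

    public class KafkaManagedOffsets {
        static void run(JavaStreamingContext jssc) {
            Map<String, Object> kafkaParams = new HashMap<>();
            kafkaParams.put("bootstrap.servers", "broker:9092");
            kafkaParams.put("key.deserializer", StringDeserializer.class);
            kafkaParams.put("value.deserializer", StringDeserializer.class);
            kafkaParams.put("group.id", "my-consumer-group");
            kafkaParams.put("enable.auto.commit", false);   // commit manually, after processing

            JavaInputDStream<ConsumerRecord<String, String>> stream = KafkaUtils.createDirectStream(
                    jssc,
                    LocationStrategies.PreferConsistent(),
                    ConsumerStrategies.<String, String>Subscribe(Arrays.asList("events"), kafkaParams));

            stream.foreachRDD(rdd -> {
                OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();

                // ... process and write out the batch here ...

                // only after the output has succeeded, commit the offsets back to Kafka itself
                ((CanCommitOffsets) stream.inputDStream()).commitAsync(offsetRanges);
            });
        }
    }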
I too have a similar use case; I prefer HBase for the following reasons:
Easy retrieval: it stores data in sorted order of row key, which is helpful when the offsets belong to different data groups.
I had to capture the start and end offsets for a group of data; capturing the start is easy, but the end offset is tough to capture in streaming mode. I didn't want to open a file, update only the end offset, and close it. I thought about S3 as well, but S3 objects are immutable.
Zookeeper can also be an option.
Hope it helps.
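A minimal sketch of the HBase approach (the "stream_offsets" table, "offsets" column family, and row-key layout are hypothetical and assumed to already exist); because HBase keeps rows sorted by key, a prefix scan returns a group's offsets in order:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseOffsetStore {
        // Stores one offset per (group, topic, partition, batchTime); the row key sorts
        // naturally, so scanning a prefix returns a group's offsets in order.
        static void saveOffset(String group, String topic, int partition,
                               long batchTime, long untilOffset) throws IOException {
            Configuration conf = HBaseConfiguration.create();
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("stream_offsets"))) {

                String rowKey = group + ":" + topic + ":" + partition + ":" + batchTime;
                Put put = new Put(Bytes.toBytes(rowKey));
                put.addColumn(Bytes.toBytes("offsets"), Bytes.toBytes("until"),
                        Bytes.toBytes(untilOffset));
                table.put(put);
            }
        }
    }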

Spark Streaming to Hive, too many small files per partition

I have a Spark Streaming job with a batch interval of 2 minutes (configurable).
This job reads from a Kafka topic, creates a Dataset, applies a schema on top of it, and inserts these records into the Hive table.
The Spark job creates one file per batch interval in the Hive partition, like below:
dataset.coalesce(1).write().mode(SaveMode.Append).insertInto(targetEntityName);
Now, the data that comes in is not that big, and even if I increase the batch duration to, say, 10 minutes, I might still end up with only 2-3 MB of data, which is way less than the block size.
This is the expected behaviour in Spark Streaming.
I am looking for efficient ways to do post-processing that merges all these small files and creates one big file.
If anyone's done it before, please share your ideas.
I would encourage you to not use Spark to stream data from Kafka to HDFS.
The Kafka Connect HDFS plugin by Confluent (or Apache Gobblin by LinkedIn) exists for this very purpose. Both offer Hive integration.
Find my comments about compaction of small files in this GitHub issue.
If you need to write Spark code to process Kafka data into a schema, you can still do that, and write into another topic, preferably in Avro format, which Hive can easily read without a predefined table schema.
I personally have written a "compaction" process that grabs a bunch of hourly Avro data partitions from a Hive table, then converts them into a daily-partitioned Parquet table for analytics. It's been working great so far.
If you want to batch the records before they land on HDFS, that's where Kafka Connect or Apache NiFi (mentioned in the link) can help, given that you have enough memory to store records before they are flushed to HDFS.
I have exactly the same situation as you. I solved it as follows (a rough sketch of the steps is shown after this list).
Let's assume that your newly arriving data is stored in a dataset: dataset1.
1- Partition the table with a good partition key; in my case I found that I could partition using a combination of keys to get around 100MB per partition.
2- Save using the Spark core API rather than Spark SQL:
a- Load the whole partition into memory (into a dataset: dataset2) when you want to save.
b- Then apply the dataset union function: dataset3 = dataset1.union(dataset2).
c- Make sure the resulting dataset is partitioned as you wish, e.g. dataset3.repartition(1).
d- Save the resulting dataset in "Overwrite" mode to replace the existing files.
If you need more details about any step, please reach out.
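A rough sketch of those steps, assuming a Parquet-backed partition directory (the path and method names are made up). Note that overwriting a directory you are still reading from in the same job is unsafe, so the merged result goes to a staging path first and would be swapped into place afterwards:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SaveMode;
    import org.apache.spark.sql.SparkSession;

    public class PartitionCompaction {
        // dataset1 is the newly arrived data; partitionPath is the directory holding the
        // existing files for the partition being compacted, e.g. /warehouse/events/dt=2018-10-05
        static void mergeIntoPartition(SparkSession spark, Dataset<Row> dataset1, String partitionPath) {
            // a) load the whole existing partition
            Dataset<Row> dataset2 = spark.read().parquet(partitionPath);

            // b) union with the new data, c) repartition down to a single large file
            Dataset<Row> dataset3 = dataset1.union(dataset2).repartition(1);

            // d) overwrite: write the merged result to a staging directory, then swap it
            //    into place (a filesystem rename) once the job has finished
            dataset3.write().mode(SaveMode.Overwrite).parquet(partitionPath + "_compacted");
        }
    }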

Unable to read streaming data from a single file in Spark Streaming

I am trying to read streaming data from a text file that gets appended to continuously, using the Spark Streaming API textFileStream. But I am unable to read the continuously appended data with Spark Streaming. How can I achieve this in Spark?
This is expected behavior. For file-based sources (like fileStream):
The files must be created in the dataDirectory by atomically moving or renaming them into the data directory.
Once moved, the files must not be changed. So if the files are being continuously appended to, the new data will not be read.
If you want to read continuously appended data, you'll have to create your own source, or use a separate process which monitors changes and pushes records to, for example, Kafka (though it is rare to combine Spark with file systems that support appending).
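As a very rough sketch of that "separate process" idea (the file path, topic, and broker below are made up), a small tailer can poll the file for newly appended lines and produce each one to Kafka, from where Spark can consume it:

    import java.io.RandomAccessFile;
    import java.util.Properties;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class FileTailToKafka {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker:9092");
            props.put("key.serializer", StringSerializer.class.getName());
            props.put("value.serializer", StringSerializer.class.getName());

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props);
                 RandomAccessFile file = new RandomAccessFile("/data/in/app.log", "r")) {
                long position = file.length();        // start at the current end of the file

                while (true) {
                    if (file.length() > position) {   // new data has been appended
                        file.seek(position);
                        String line;
                        // naive line handling: a partially written last line may be split across polls
                        while ((line = file.readLine()) != null) {
                            producer.send(new ProducerRecord<>("appended-lines", line));
                        }
                        position = file.getFilePointer();
                    }
                    Thread.sleep(1000);               // poll once a second
                }
            }
        }
    }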
