Parquet File Output Sink - Spark Structured Streaming - apache-spark

I'm wondering what triggers a Spark Structured Streaming query (with the Parquet file output sink configured) to write data to the Parquet files, and how to modify that behavior. I periodically feed the stream input data (using StreamReader to read in files), but it does not write output to a Parquet file for each file provided as input. Once I have given it a few files, it tends to write a Parquet file just fine.
I am wondering how to control this. I would like to be able to force a new write to a Parquet file for every new file provided as input. Any tips appreciated!
Note: I have maxFilesPerTrigger set to 1 on the read stream call. I can also see the streaming query process the single input file; however, a single file on input does not appear to result in the streaming query writing the output to the Parquet file.
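For context, here is roughly what the setup described above looks like; this is a hedged sketch, not the asker's actual code - the input path, schema, output path, checkpoint location, and trigger interval are all placeholders:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().appName("parquet-sink-example").getOrCreate()

// Read at most one new input file per micro-batch (schema and path are placeholders)
val input = spark.readStream
  .schema("id INT, event_time TIMESTAMP, value DOUBLE")
  .option("maxFilesPerTrigger", 1)
  .json("/data/incoming")

// The file (Parquet) sink only supports append output mode
val query = input.writeStream
  .format("parquet")
  .outputMode("append")
  .option("path", "/data/parquet-out")
  .option("checkpointLocation", "/data/checkpoints/parquet-out")
  .trigger(Trigger.ProcessingTime("10 seconds"))
  .start()
```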

After further analysis, and after working with the ForEach output sink using the default Append mode, I believe the issue I was running into was the combination of Append mode with the watermarking feature.
After re-reading https://spark.apache.org/docs/2.2.1/structured-streaming-programming-guide.html#starting-streaming-queries, it appears that when Append mode is used with a watermark set, Spark Structured Streaming will not write aggregation results to the Result table until the watermark time limit has passed. Append mode does not allow updates to records, so it must wait for the watermark to pass to ensure the row will not change...
I believe the Parquet file sink does not allow Update mode; however, after switching to the ForEach output sink and using Update mode, I observed data coming out of the sink as I expected. Essentially, for each record in, at least one record out, with no delay (as was observed before).
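To make that concrete, here is a hedged sketch of the two behaviors; `events`, the `event_time` column, and the paths are assumptions standing in for the original job, and the console sink stands in for the ForeachWriter sink:

```scala
import org.apache.spark.sql.functions.{col, window}

// `events` stands in for the streaming DataFrame read from the input files.
val counts = events
  .withWatermark("event_time", "10 minutes")
  .groupBy(window(col("event_time"), "5 minutes"))
  .count()

// Append mode to the Parquet sink: a window's row is only written once the
// watermark has passed the end of that window, hence the observed delay.
counts.writeStream
  .format("parquet")
  .outputMode("append")
  .option("path", "/data/agg-out")
  .option("checkpointLocation", "/data/checkpoints/agg-out")
  .start()

// Update mode (console sink shown here as a stand-in for ForeachWriter):
// changed rows are emitted on every trigger, with no watermark delay.
counts.writeStream
  .outputMode("update")
  .format("console")
  .start()
```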
Hopefully this is helpful to others.

Related

How Kafka sink supports update mode in structured streaming?

I have read about the different output modes like:
Complete Mode - The entire updated Result Table will be written to the sink.
Append Mode - Only the new rows appended in the Result Table since the last trigger will be written to the external storage.
Update Mode - Only the rows that were updated in the Result Table since the last trigger will be written to the external storage.
At first I thought I understood the above explanations.
Then I came across this:
File sink supported modes: Append
Kafka sink supported modes: Append, Update, Complete
Wait!! What??!!
Why couldn't we just write out the entire result table to file?
How can we update an already existing entry in Kafka? It's a stream; you can't just look for certain messages and change/update them.
This makes no sense at all.
Could you help me understand this? I just don't get how this works technically.
Spark writes one file per partition, often with one file per executor. Executors run in a distributed fashion. Files are local to each executor, so append just makes sense - you cannot fully replace individual files, especially without losing data within the stream. So that leaves you with "appending new files to the filesystem", or inserting into existing files.
Kafka has no update functionality... the Kafka Integration Guide doesn't mention any of these modes, so it is unclear what you are referring to. You use write or writeStream; it will always "append" the "complete" DataFrame batch(es) to the end of the Kafka topic. The way Kafka implements something like updates is with compacted topics, but this has nothing to do with Spark.
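To illustrate what Update mode to the Kafka sink looks like in practice, here is a hedged sketch (the `lines` DataFrame with a `word` column, the broker address, the topic name, and the checkpoint path are all made up): each trigger simply appends one new message per changed key; nothing already in the topic is rewritten.

```scala
// `lines` stands in for some streaming DataFrame that has a `word` column.
val wordCounts = lines.groupBy("word").count()

wordCounts
  .selectExpr("CAST(word AS STRING) AS key", "CAST(count AS STRING) AS value")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("topic", "word-counts")
  .option("checkpointLocation", "/data/checkpoints/word-counts")
  .outputMode("update")
  .start()
```

If the topic is compacted and keyed by the word, downstream consumers eventually see only the latest count per key, which is as close to an "update" as Kafka gets - and, as noted above, that compaction is Kafka's doing, not Spark's.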

Spark reading from a new location keeping the output directory same

I have a spark job which reads (using structured streaming API) from a source s3://bucket/source-1 folder and writes to s3://bucket/destination-1 folder. The checkpoints are saved at s3://bucket/checkpoint-1.
Now I want to read data with the same schema from s3://bucket/source-2 (with checkpointing at s3://bucket/checkpoint-2), but I want to append it to the same s3://bucket/destination-1 folder. Is that possible?
Yes, of course it is possible to write into the same location. But there are different things that you need to take into account, such as:
what data format you're using as output (Parquet, Delta, something else...)?
are both streaming jobs running at the same time? Could you have conflicts when writing data?
(potentially) what is the partitioning schema for the destination?
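Assuming a plain Parquet output with no write conflicts, the second job is just another streaming query with its own checkpoint pointing at the same destination; a rough sketch (the schema is an assumption, since file sources in Structured Streaming need one up front):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DoubleType, IntegerType, StructType, TimestampType}

val spark = SparkSession.builder().appName("source2-to-same-destination").getOrCreate()

// Placeholder schema for the source data
val schema = new StructType()
  .add("id", IntegerType)
  .add("event_time", TimestampType)
  .add("value", DoubleType)

spark.readStream
  .schema(schema)
  .parquet("s3://bucket/source-2")
  .writeStream
  .format("parquet")
  .outputMode("append")
  .option("path", "s3://bucket/destination-1")
  .option("checkpointLocation", "s3://bucket/checkpoint-2")
  .start()
```

The new checkpoint keeps the two jobs' offsets independent; only the output path is shared.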

Problems in reading existing multilevel partitioned file data from the middle in Spark Structured Streaming

I am working on Spark Structured Streaming with existing multilevel partitioned Parquet data as the source. I have the following issue while using it:
starting the Spark streaming job so that it reads data from a particular partition instead of starting from the beginning.
Suppose we observe that there is a data quality issue in partition year=2018/month=10/hour=10, and suppose I have since corrected that data by replacing the files with correct ones.
Now the question is how to reprocess data starting from this day instead of starting from the beginning. In Structured Streaming, if I use a file stream as the source, it will load all files, and I want to ignore a few of them. I would also need to remove my checkpoint directory, because it holds offsets up to the present.
Suppose we observe that there is a data quality issue in partition year=2018/month=10/hour=10. Then how do we reprocess data starting from this day instead of starting from the beginning?
I don't think it's possible in Spark Structured Streaming (I wish I were mistaken).
Since we're talking about a streaming query, you'd have to rewind the "stream". The only way to achieve it that I can think of is to re-upload the data (no idea how to do that) or simply process data that would "delete" the previous version of the partition year=2018/month=10/hour=10 and then upload a new, corrected version.
The question is how to inform the parquet data source that whatever has already been processed should be "evicted" from the result (that may've been sent out to external sources for further processing).

HDFS file sink output as file stream input for another stream - race condition?

I'm evaluating a particular data-flow in a 15-node Spark cluster using structured streaming. I've defined 2 streaming queries in my application:
SQ1 - Reads data from Kafka -> processes -> writes to HDFS file sink (path - hdfs://tmp/output)
SQ2 - Reads data as file stream from HDFS (same path as above) -> further processing -> writes to external database using ForeachWriter
Both queries are set to trigger every 15 seconds.
My question - am I looking at a race condition here, where SQ2 picks up the partially written files (which are generated by SQ1) from HDFS? A more general question would be, is the file sink writer for HDFS "atomic"? I've tried to dig through the streaming source code in Spark but haven't made much progress.
The main problem with this approach is that all file sinks (such as HDFS) in Spark Structured Streaming can only operate in append mode. In addition, a file is read as soon as it has been created; any subsequent updates or finalization of writes will be ignored.
According to the book "Learning Spark - 2nd Edition", on reading from files:
"each file must appear in the directory listing atomically - that is, the whole file must be available at once for reading, and once it is available, the file cannot be updated or modified."
"[Writing to files] ... it only supports append mode, because while it is easy to write new files in the output directory (i.e., append data to a directory), it is hard to modify existing data files (as would be expected with update and complete modes)."
To overcome the issue you are facing, you could change your streaming queries to something like the following (see the sketch after this list):
SQ1 - Reads data from Kafka -> processes -> further processing -> cache/persist
SQ2a - Writes the cached DataFrame to the HDFS file sink (path - hdfs://tmp/output)
SQ2b - Writes the cached DataFrame to the external database using ForeachWriter
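As far as I know you cannot call cache/persist directly on a streaming DataFrame, but foreachBatch (available since Spark 2.4) achieves the same effect: each micro-batch is materialized once and written to both sinks from a single query. A rough sketch, with the Kafka options, JDBC connection, table name, and paths all as placeholders:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().appName("kafka-to-two-sinks").getOrCreate()

// SQ1: read from Kafka and do the processing (broker and topic are placeholders)
val processed = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .load()
  .selectExpr("CAST(value AS STRING) AS value")

// Write each micro-batch to both sinks from a single query, so the database
// never depends on files that are still being written to HDFS.
val writeBothSinks: (DataFrame, Long) => Unit = (batchDF, batchId) => {
  batchDF.persist()
  // SQ2a: append the batch to the HDFS output path as Parquet
  batchDF.write.mode("append").parquet("hdfs://tmp/output")
  // SQ2b: write the same batch to an external database over JDBC
  batchDF.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://db-host:5432/mydb") // placeholder connection
    .option("dbtable", "events")
    .mode("append")
    .save()
  batchDF.unpersist()
}

processed.writeStream
  .trigger(Trigger.ProcessingTime("15 seconds"))
  .option("checkpointLocation", "hdfs://tmp/checkpoint")
  .foreachBatch(writeBothSinks)
  .start()
```

With this layout the file-stream reader (the original SQ2) disappears entirely, which removes the race condition rather than working around it.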

Unable to read the streaming data from the single file in Spark streaming

I am trying to read streaming data from a text file which gets appended continuously, using the Spark Streaming API textFileStream, but I am unable to read the continuously appended data. How can I achieve this in Spark?
This is expected behavior. For file based sources (like fileStream):
The files must be created in the dataDirectory by atomically moving or renaming them into the data directory.
Once moved, the files must not be changed. So if the files are being continuously appended, the new data will not be read.
If you want to read continuously appended data, you'll have to create your own source, or use a separate process which will monitor the changes and push records to, for example, Kafka (though it is rare to combine Spark Streaming with file systems that support appending).
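In other words, the producer should write the whole file somewhere else first and then move it into the watched directory in one step. A small sketch of that pattern (the paths and file name are made up; ATOMIC_MOVE requires both locations to be on the same filesystem):

```scala
import java.nio.file.{Files, Paths, StandardCopyOption}

// Write the complete file outside the watched directory first...
val staging = Paths.get("/data/staging/batch-001.txt")
Files.write(staging, "one complete batch of records\n".getBytes("UTF-8"))

// ...then move it in atomically, so textFileStream (or any file-based source)
// sees the file exactly once, fully written.
Files.move(
  staging,
  Paths.get("/data/incoming/batch-001.txt"),
  StandardCopyOption.ATOMIC_MOVE)
```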
