Spark reading from a new location keeping the output directory same - apache-spark

I have a Spark job which reads (using the Structured Streaming API) from a source folder s3://bucket/source-1 and writes to s3://bucket/destination-1. The checkpoints are saved at s3://bucket/checkpoint-1.
Now I want to read data with the same schema from s3://bucket/source-2 (with checkpointing at s3://bucket/checkpoint-2), but I want to append it to the same s3://bucket/destination-1 folder. Is that possible?

Yes, it is of course possible to write into the same location. But there are different things that you need to take into account, such as:
what data format are you using as output (Parquet, Delta, something else...)?
are both streaming jobs running at the same time? Could you have conflicts when writing data?
(potentially) what is the partitioning scheme for the destination?
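For instance, with Parquet output the second job could look roughly like the sketch below (this assumes an existing SparkSession named spark and uses a placeholder schema; note that two file-sink queries writing to one destination also share its _spark_metadata log, which is one of the conflicts worth checking for):

import org.apache.spark.sql.types.StructType

val schema = new StructType().add("id", "string").add("value", "double")  // placeholder schema

val q2 = spark.readStream
  .schema(schema)
  .parquet("s3://bucket/source-2")                           // the new source
  .writeStream
  .format("parquet")                                         // the file sink only supports append
  .option("checkpointLocation", "s3://bucket/checkpoint-2")  // separate checkpoint per query
  .option("path", "s3://bucket/destination-1")               // same destination as the first job
  .start()

q2.awaitTermination()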

Related

Rollback write failure in S3 prefix-partition via Spark

We are using an Apache Spark (2.4.5) job on EMR. It reads an S3 prefix {bucket}/{prefix}/*.json, does some data massaging, and then rewrites it back to the same {bucket}/{prefix} via the Spark save() in overwrite mode. My question is: if the Spark job fails while it is re-writing the data to the S3 prefix-partition, is there any way to restore the data in that prefix-partition in an atomic/transactional way?
Do Spark, EMR, or S3 (any or all of these) support that?
Spark writes multiple new files to the folder because the cluster nodes write files in parallel, and writing multiple files is more efficient. When you use the overwrite action, Spark removes the folder contents first and then writes the result.
The problem is that Spark does not cache the whole original dataset, only the part of the data that is needed by the code. So if you write the result to the original path, it deletes the original first and writes the cached result into the folder.
You could use the append mode instead, but it will create new files rather than adding the data to the original files. Spark is not designed to do this, and there is no way to revert when you overwrite or when the job fails.
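To make the failure mode concrete, here is a rough sketch of the pattern described in the question (the paths are the question's placeholders, the massaging step is a stand-in, and an existing SparkSession named spark is assumed):

// Read the prefix, massage it, and overwrite the same prefix.
val df = spark.read.json("s3://bucket/prefix/")
val massaged = df.dropDuplicates()              // stand-in for the real data massaging
// Overwrite deletes the existing objects before the new result is fully written,
// so a failure in the middle of this save() leaves the prefix partially written
// with no built-in way to roll back on S3.
massaged.write.mode("overwrite").format("json").save("s3://bucket/prefix/")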

Problems in reading existing multilevel partitioned files data from the middle in Spark Structured Streaming

I am working with Spark Structured Streaming using an existing multilevel-partitioned Parquet file as the source. I have the following issue while using it:
starting the Spark streaming job so that it reads data from a particular partition instead of starting from the beginning.
Suppose we observe that there is a data quality issue in partition year=2018/month=10/hour=10, and suppose I have corrected that data up to today by replacing the files with correct ones.
Now the question is how to reprocess data starting from that day instead of starting from the beginning. In Structured Streaming, if I use the file stream as a source it will load all files, and I want to ignore a few of them. I would also need to remove my checkpoint directory, because it holds offsets up to today.
Suppose we observe that there is a data quality issue in partition year=2018/month=10/hour=10. Then how do we reprocess data starting from that day instead of starting from the beginning?
I don't think it's possible in Spark Structured Streaming (I wish I were mistaken).
Since we're talking about a streaming query, you'd have to rewind the "stream". The only way to achieve that (that I can think of) is to re-upload the data (no idea how to do it) or simply process data that would "delete" the previous version of the partition year=2018/month=10/hour=10 and then upload a new, corrected version.
The question is how to inform the Parquet data source that whatever has already been processed should be "evicted" from the result (which may already have been sent out to external systems for further processing).

Does Spark lock the File while writing to HDFS or S3

I have an S3 location with the below directory structure with a Hive table created on top of it:
s3://<Mybucket>/<Table Name>/<day Partition>
Let's say I have a Spark program which writes data into the above table location, spanning multiple partitions, using the below line of code:
Df.write.partitionBy("orderdate").parquet("s3://<Mybucket>/<Table Name>/")
If another program, such as a Hive SQL query or an AWS Athena query, starts reading data from the table at the same time:
Do they consider temporary files being written?
Does spark lock the data file while writing into S3 location?
How can we handle such concurrency situations using Spark as an ETL tool?
No locks. Not implemented in S3 or HDFS.
The process of committing work in HDFS is not atomic; there is some renaming going on in job commit, which is fast but not instantaneous.
With S3 things are pathologically slow with the classic output committers, which assume that rename is atomic and fast.
The Apache S3A committers avoid the renames and only make the output visible in job commit, which is fast but not atomic.
Amazon EMR now has its own S3 committer, but it makes files visible when each task commits, so it exposes readers to incomplete output for longer.
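If you can move to the S3A committers, the switch is mostly configuration. A rough sketch (assuming Spark's hadoop-cloud integration module is on the classpath; check the S3A committer documentation for the exact settings matching your Spark and Hadoop versions):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("s3a-committer-example")
  // Use the S3A "directory" staging committer instead of the classic rename-based one.
  .config("spark.hadoop.fs.s3a.committer.name", "directory")
  // Route Spark SQL writes through the path-output commit protocol from the cloud module.
  .config("spark.sql.sources.commitProtocolClass",
    "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
  .config("spark.sql.parquet.output.committer.class",
    "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
  .getOrCreate()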
Spark writes the output in a two-step process. First, it writes the data to a _temporary directory; then, once the write operation is complete and successful, it moves the files to the output directory.
Do they consider temporary files being written?
As the files starting with _ are hidden files, you cannot read them from Hive or AWS Athena.
Does spark lock the data file while writing into S3 location?
Locking or any other concurrency mechanism is not required because of Spark's simple two-step write process.
How can we handle such concurrency situations using Spark as an ETL tool?
Again, by relying on the same write-to-a-temporary-location mechanism.
One more thing to note: in your example above, after writing output to the output directory you need to add the partition to the Hive external table using an ALTER TABLE <tbl_name> ADD PARTITION (...) command or MSCK REPAIR TABLE tbl_name, otherwise the data won't be available in Hive.
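For that last step, a minimal sketch using Spark SQL (the table name my_table and the partition value are placeholders, and an existing SparkSession named spark is assumed):

// After the partitioned write finishes, make the new partitions visible to Hive/Athena:
spark.sql("MSCK REPAIR TABLE my_table")
// or register a single, known partition explicitly:
spark.sql("ALTER TABLE my_table ADD IF NOT EXISTS PARTITION (orderdate='2020-01-01')")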

Unable to read streaming data from a single file in Spark Streaming

I am trying to read streaming data from a text file which gets appended continuously, using the Spark Streaming API textFileStream. But I am unable to read the continuously appended data with Spark Streaming. How can I achieve this in Spark?
This is expected behavior. For file-based sources (like fileStream):
The files must be created in the dataDirectory by atomically moving or renaming them into the data directory.
Once moved, the files must not be changed. So if the files are being continuously appended, the new data will not be read.
If you want to read continuously appended data, you'll have to create your own source, or use a separate process which monitors the changes and pushes records to, for example, Kafka (though it is rare to combine Spark with file systems that support appending).
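For reference, a minimal textFileStream sketch (the directory path is a placeholder); it only sees files newly moved into the watched directory, which is why appending to a single file produces nothing:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("text-file-stream-example")
val ssc = new StreamingContext(conf, Seconds(10))

// Picks up files atomically moved into the directory; lines appended to existing files are ignored.
val lines = ssc.textFileStream("hdfs:///data/incoming")
lines.foreachRDD { rdd => rdd.take(10).foreach(println) }   // placeholder processing per batch

ssc.start()
ssc.awaitTermination()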

How to process new files in an HDFS directory once their writing has eventually finished?

In my scenario I have CSV files continuously uploaded to HDFS.
As soon as a new file gets uploaded I'd like to process the new file with Spark SQL (e.g., compute the maximum of a field in the file, transform the file into Parquet). That is, I have a one-to-one mapping between each input file and a transformed/processed output file.
I was evaluating Spark Streaming to listen to the HDFS directory, then to process the "streamed file" with Spark.
However, in order to process the whole file I would need to know when the "file stream" completes. I'd like to apply the transformation to the whole file in order to preserve the end-to-end one-to-one mapping between files.
How can I transform the whole file and not its micro-batches?
As far as I know, Spark Streaming can only apply transformations to batches (DStreams mapped to RDDs) and not to the whole file at once (when its finite stream has completed).
Is that correct? If so, what alternative should I consider for my scenario?
I may have misunderstood your question on the first try...
As far as I know, Spark Streaming can only apply transformations to batches (DStreams mapped to RDDs) and not to the whole file at once (when its finite stream has completed).
Is that correct?
No. That's not correct.
Spark Streaming will apply the transformation to the whole file at once, i.e. to whatever had been written to HDFS by the time Spark Streaming's batch interval elapsed.
Spark Streaming takes the current content of a file and starts processing it.
As soon as a new file gets uploaded I need to process the new file with Spark/SparkSQL
That is almost impossible with Spark due to its architecture: some time always passes between the moment the file "gets uploaded" and the moment Spark processes it.
You should consider using the brand new and shiny Structured Streaming or the (soon to be obsolete) Spark Streaming.
Both solutions support watching a directory for new files and triggering a Spark job once a new file gets uploaded (which is exactly your use case).
Quoting Structured Streaming's Input Sources:
In Spark 2.0, there are a few built-in sources.
File source - Reads files written in a directory as a stream of data. Supported file formats are text, csv, json, parquet. See the docs of the DataStreamReader interface for a more up-to-date list, and supported options for each file format. Note that the files must be atomically placed in the given directory, which in most file systems, can be achieved by file move operations.
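A minimal sketch of that file source for the CSV-to-Parquet case in the question (the schema and the paths below are assumptions):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StructType

val spark = SparkSession.builder().appName("csv-file-source").getOrCreate()
val schema = new StructType().add("id", "string").add("value", "double")  // placeholder schema

val csvStream = spark.readStream
  .schema(schema)                      // streaming file sources require an explicit schema
  .csv("hdfs:///data/incoming")        // the watched directory (placeholder)

csvStream.writeStream
  .format("parquet")                   // e.g. convert each incoming CSV into Parquet
  .option("checkpointLocation", "hdfs:///data/checkpoints/csv-to-parquet")
  .option("path", "hdfs:///data/parquet")
  .start()
  .awaitTermination()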
See also Spark Streaming's Basic Sources:
Besides sockets, the StreamingContext API provides methods for creating DStreams from files as input sources.
File Streams: For reading data from files on any file system compatible with the HDFS API (that is, HDFS, S3, NFS, etc.), a DStream can be created as:
streamingContext.fileStream[KeyClass, ValueClass, InputFormatClass](dataDirectory)
Spark Streaming will monitor the directory dataDirectory and process any files created in that directory (files written in nested directories not supported).
One caveat though given your requirement:
I would need to know when the "file stream" completes.
Don't do this with Spark.
Quoting Spark Streaming's Basic Sources again:
The files must be created in the dataDirectory by atomically moving or renaming them into the data directory.
Once moved, the files must not be changed. So if the files are being continuously appended, the new data will not be read.
Wrapping up... you should only move files into the directory that Spark watches once they are complete and ready for processing. That step is outside the scope of Spark.
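In practice that means the uploader writes the complete file somewhere else first and then moves it into the watched directory; rename is atomic within HDFS. A sketch with placeholder paths:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Finish writing the file under a staging directory, then rename it into the watched directory.
val fs = FileSystem.get(new Configuration())
fs.rename(new Path("/data/staging/report-2018-10-10.csv"),
          new Path("/data/incoming/report-2018-10-10.csv"))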
You can use DFSInotifyEventInputStream to watch a Hadoop directory and then execute a Spark job programmatically when a file is created.
See this post:
HDFS file watcher
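Roughly, that looks like the following (a sketch against the Hadoop 2.7+ inotify API; the NameNode URI, the watched path, and the job-triggering step are placeholders, and reading the inotify stream typically requires HDFS superuser privileges):

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hdfs.client.HdfsAdmin
import org.apache.hadoop.hdfs.inotify.Event

val admin = new HdfsAdmin(URI.create("hdfs://namenode:8020"), new Configuration())
val events = admin.getInotifyEventStream()

while (true) {
  val batch = events.take()                      // blocks until new edit-log events arrive
  batch.getEvents.foreach {
    case create: Event.CreateEvent if create.getPath.startsWith("/data/incoming/") =>
      println(s"new file: ${create.getPath}")    // trigger/submit the Spark job here
    case _ => ()
  }
}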
