How to change the location of _spark_metadata directory? - apache-spark

I am using Spark Structured Streaming's streaming query to write parquet files to S3 using the following code:
ds.writeStream().format("parquet").outputMode(OutputMode.Append())
.option("queryName", "myStreamingQuery")
.option("checkpointLocation", "s3a://my-kafka-offset-bucket-name/")
.option("path", "s3a://my-data-output-bucket-name/")
.partitionBy("createdat")
.start();
I get the desired output in the S3 bucket my-data-output-bucket-name, but along with the output I also get the _spark_metadata folder in it. How do I get rid of it? And if I can't get rid of it, how do I change its location to a different S3 bucket?

My understanding is that this is not possible, at least up to Spark 2.3:
The name of the metadata directory is always _spark_metadata.
The _spark_metadata directory always lives under the location the path option points to.
I think the only way to "fix" it is to report an issue in Apache Spark's JIRA and hope someone picks it up.
Internals
The flow is that DataSource is requested to create the sink of a streaming query and takes the path option. With that, it creates a FileStreamSink. The path option simply becomes the basePath where the results, as well as the metadata, are written.
You can find the initial commit quite useful to understand the purpose of the metadata directory.
In order to correctly handle partial failures while maintaining exactly once semantics, the files for each batch are written out to a unique directory and then atomically appended to a metadata log. When a parquet based DataSource is initialized for reading, we first check for this log directory and use it instead of file listing when present.
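A minimal sketch of what that means for downstream consumers, assuming the output bucket from the question: a plain batch read of the sink's path finds _spark_metadata and uses it instead of file listing, so only files committed by completed batches are picked up.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-streaming-sink").getOrCreate()

# Spark checks s3a://my-data-output-bucket-name/_spark_metadata before listing files.
df = spark.read.parquet("s3a://my-data-output-bucket-name/")
df.show()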

Related

Spark reading from a new location keeping the output directory same

I have a spark job which reads (using structured streaming API) from a source s3://bucket/source-1 folder and writes to s3://bucket/destination-1 folder. The checkpoints are saved at s3://bucket/checkpoint-1.
Now I want to read data with the same schema from s3://bucket/source-2 (with checkpointing at s3://bucket/checkpoint-2), but I want to append it to the same s3://bucket/destination-2 folder. Is that possible?
Yes, it is of course possible to write into the same location, but there are different things that you need to take into account (see the sketch after this list), such as:
what data format you're using as output (Parquet, Delta, something else...)?
are both streaming jobs running at the same time? Could you have conflicts when writing data?
(potentially) what is the partitioning scheme for the destination?
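A minimal sketch of the setup in PySpark, assuming JSON input and the plain Parquet file sink; the bucket and folder names come from the question, while the schema is hypothetical. Each query keeps its own checkpoint, but both append to the same destination. Note that with the plain file sink both queries would also share the destination's _spark_metadata log (see the _spark_metadata discussion at the top of this page), which is exactly the kind of write conflict to check for; a table format such as Delta avoids it.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.appName("two-sources-one-destination").getOrCreate()

# Hypothetical shared schema; the question only says both sources have the same schema.
input_schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("payload", StringType(), True),
])

def start_query(source, checkpoint, destination):
    # One streaming query: read JSON from `source`, append Parquet to `destination`.
    return (spark.readStream
            .schema(input_schema)
            .json(source)
            .writeStream
            .format("parquet")
            .outputMode("append")
            .option("checkpointLocation", checkpoint)
            .option("path", destination)
            .start())

# Separate checkpoints, same destination folder.
q1 = start_query("s3://bucket/source-1", "s3://bucket/checkpoint-1", "s3://bucket/destination-1")
q2 = start_query("s3://bucket/source-2", "s3://bucket/checkpoint-2", "s3://bucket/destination-1")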

Read file while it is being written by spark structured streaming

I am using Spark Structured Streaming for my application. I have a use case where I need to read a file while it is still being written.
I tried with spark structured streaming as below:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

sch = StructType([StructField("ID", IntegerType(), True), StructField("COUNTRY", StringType(), True)])

df_str = (spark.readStream.format("csv")
          .schema(sch)
          .option("header", True)
          .option("delimiter", ",")
          .load("<Load Path>"))

query = (df_str.writeStream.format("parquet")
         .outputMode("append")
         .trigger(processingTime="10 seconds")
         .option("path", "<HDFS location>")
         .option("checkpointLocation", "<chckpoint_loc>")
         .start())
But it only reads the file initially; after that, the file is not read incrementally. As a workaround I am thinking of writing the file to a temp directory, creating a new file after some time, and copying it into the directory the Spark Structured Streaming job reads from, but this causes latency.
Is there any other way to handle this(I can not use kafka)?
Sorry if this question is not a good fit for Stack Overflow, but I did not find any other place to ask it.
Unfortunately Spark doesn't support this. The unit of the file stream source is a "file". Spark assumes that the files it reads are "immutable", which means the files shouldn't be changed once they're placed in the source path. This makes offset management much simpler (there is no need to track per-file offsets), even though the number of files in the source path keeps increasing. A reasonable limitation, but still a limitation.
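A minimal sketch of the usual workaround, assuming the producer can write somewhere else first: finish the file in a staging directory outside the streamed path, then move it atomically into the directory the readStream watches. The paths and the use of the hdfs dfs -mv command are assumptions for illustration.

import subprocess

STAGING_DIR = "/tmp/staging"   # hypothetical: the producer writes/appends the file here
SOURCE_DIR = "<Load Path>"     # the directory spark.readStream loads from

def publish(filename):
    # A rename within the same HDFS filesystem is atomic, so the stream only ever
    # sees complete, immutable files.
    subprocess.run(
        ["hdfs", "dfs", "-mv", STAGING_DIR + "/" + filename, SOURCE_DIR + "/" + filename],
        check=True,
    )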

Rollback write failure in S3 prefix-partition via Spark

We are running an Apache Spark (2.4.5) job via EMR; it reads an S3 prefix {bucket}/{prefix}/*.json, does some data massaging, and then rewrites it back to the same {bucket}/{prefix} via the Spark job's save() in overwrite mode. My question is: if the Spark job fails while it is re-writing the data to that S3 prefix-partition, is there any way we can restore the data in that prefix-partition in an atomic/transactional way?
Does spark/EMR/S3 any/all of these support it?
Spark writes multiple new files to the folder because the cluster nodes write files at the same time, and it is better to write multiple files. So when you use the overwrite action, Spark removes the folder contents first and then writes the result.
The problem is that Spark does not cache the whole original data, only the part of the data that is needed by the code. If you write the result to the original path, it deletes the original first and then writes the cached result into the folder.
You could use append mode instead, but it creates new files rather than adding data to the original files. Spark is not designed for this, and there is no way to revert once you overwrite or the job fails.
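A minimal sketch of one way to reduce the blast radius, assuming you can afford an extra prefix (the staging prefix name is hypothetical, not from the answer): write the massaged data to a new prefix first and only promote it over the original after the job has succeeded, so a failed write never touches the source data.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rewrite-prefix-safely").getOrCreate()

source_prefix = "s3a://bucket/prefix/"           # the question's {bucket}/{prefix}
staging_prefix = "s3a://bucket/prefix_staging/"  # hypothetical staging prefix

df = spark.read.json(source_prefix)
massaged = df   # ... data massaging goes here ...

# Overwrite only the staging prefix; the original prefix stays intact if this fails.
massaged.write.mode("overwrite").json(staging_prefix)

# Promote the staging prefix (e.g. aws s3 sync, or switching the prefix readers use)
# only after the job has finished successfully.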

Spark concurrent writes on same HDFS location

I have a spark code which saves a dataframe to a HDFS location (date partitioned location) in Json format using append mode.
df.write.mode("append").format("json").save(hdfsPath)
Sample HDFS location: /tmp/table1/datepart=20190903
I am consuming data from an upstream NiFi cluster. Each node in the NiFi cluster creates a flow file for the consumed data, and my Spark code processes that flow file. As NiFi is distributed, my Spark code gets executed from different NiFi nodes in parallel, all trying to save data into the same HDFS location.
I cannot store output of spark job in different directories as my data is partitioned on date.
This process has been running once a day for the last 14 days, and my Spark job has failed 4 times with different errors.
First Error:
java.io.IOException: Failed to rename FileStatus{path=hdfs://tmp/table1/datepart=20190824/_temporary/0/task_20190824020604_0000_m_000000/part-00000-101aa2e2-85da-4067-9769-b4f6f6b8f276-c000.json; isDirectory=false; length=0; replication=3; blocksize=268435456; modification_time=1566630365451; access_time=1566630365034; owner=hive; group=hive; permission=rwxrwx--x; isSymlink=false} to hdfs://tmp/table1/datepart=20190824/part-00000-101aa2e2-85da-4067-9769-b4f6f6b8f276-c000.json
Second Error:
java.io.FileNotFoundException: File hdfs://tmp/table1/datepart=20190825/_temporary/0 does not exist.
Third Error:
java.io.FileNotFoundException: File hdfs://tmp/table1/datepart=20190901/_temporary/0/task_20190901020450_0000_m_000000 does not exist.
Fourth Error:
java.io.FileNotFoundException: File hdfs://tmp/table1/datepart=20190903/_temporary/0 does not exist.
Here are the problems/issues:
I am not able to recreate this scenario. How can I do that?
On all 4 occasions the errors are related to the _temporary directory. Is it because 2 or more jobs are trying to save data to the same HDFS location in parallel, and while doing that Job A might have deleted the _temporary directory of Job B? (Because of the shared location, all jobs use folders with the common name _temporary/0/.)
If it is a concurrency problem, then I can run all NiFi processors from the primary node, but then I will lose performance.
Need your expert advice.
Thanks in advance.
It seems the problem is that two spark nodes are independently trying to write to the same place, causing conflicts as the fastest one will clear up the working directory before the second one expects it.
The most straightforward solution may be to avoid this.
As I understand your NiFi and Spark setup, the node where NiFi runs also determines the node where Spark runs (there is a 1-1 relationship?).
If that is the case you should be able to solve this by routing the work in Nifi to nodes that do not interfere with each other. Check out the load balancing strategy (property of the queue) that depends on attributes. Of course you would need to define the right attribute, but something like directory or table name should go a long way.
Try enabling output committer v2:
spark.conf.set("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
It doesn't use a shared temp directory for the files, but creates independent .sparkStaging-<...> temp directories for each write.
It also speeds up writes, but allows some rare hypothetical cases of partially written data.
Check this doc for more info:
https://spark.apache.org/docs/3.0.0-preview/cloud-integration.html#recommended-settings-for-writing-to-object-stores
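A minimal sketch of wiring that same config key in at session build time instead (or via spark-submit --conf), which makes sure it reaches the Hadoop configuration before any write runs; this is just another way to set the setting from the answer, not a different committer.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("committer-v2")
         .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
         .getOrCreate())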

How to process new files in HDFS directory once their writing has eventually finished?

In my scenario I have CSV files continuously uploaded to HDFS.
As soon as a new file gets uploaded I'd like to process the new file with Spark SQL (e.g., compute the maximum of a field in the file, transform the file into Parquet). That is, I have a one-to-one mapping between each input file and a transformed/processed output file.
I was evaluating Spark Streaming to listen to the HDFS directory, then to process the "streamed file" with Spark.
However, in order to process the whole file I would need to know when the "file stream" completes. I'd like to apply the transformation to the whole file in order to preserve the end-to-end one-to-one mapping between files.
How can I transform the whole file and not its micro-batches?
As far as I know, Spark Streaming can only apply transformation to batches (DStreams mapped to RDDs) and not to the whole file at once (when its finite stream has completed).
Is that correct? If so, what alternative should I consider for my scenario?
I may have misunderstood your question on the first try...
As far as I know, Spark Streaming can only apply transformation to batches (DStreams mapped to RDDs) and not to the whole file at once (when its finite stream has completed).
Is that correct?
No. That's not correct.
Spark Streaming will apply the transformation to the whole file at once, i.e. to whatever had been written to HDFS by the time Spark Streaming's batch interval elapsed.
Spark Streaming takes the current content of a file and starts processing it.
As soon as a new file gets uploaded I need to process the new file with Spark/SparkSQL
Almost impossible with Spark, due to its architecture: some time passes between the moment a file "gets uploaded" and the moment Spark processes it.
You should consider using the brand new and shiny Structured Streaming or the (soon obsolete) Spark Streaming.
Both solutions support watching a directory for new files and trigger Spark job once a new file gets uploaded (which is exactly your use case).
Quoting Structured Streaming's Input Sources:
In Spark 2.0, there are a few built-in sources.
File source - Reads files written in a directory as a stream of data. Supported file formats are text, csv, json, parquet. See the docs of the DataStreamReader interface for a more up-to-date list, and supported options for each file format. Note that the files must be atomically placed in the given directory, which in most file systems, can be achieved by file move operations.
See also Spark Streaming's Basic Sources:
Besides sockets, the StreamingContext API provides methods for creating DStreams from files as input sources.
File Streams: For reading data from files on any file system compatible with the HDFS API (that is, HDFS, S3, NFS, etc.), a DStream can be created as:
streamingContext.fileStream[KeyClass, ValueClass, InputFormatClass](dataDirectory)
Spark Streaming will monitor the directory dataDirectory and process any files created in that directory (files written in nested directories not supported).
One caveat though given your requirement:
I would need to know when the "file stream" completes.
Don't do this with Spark.
Quoting Spark Streaming's Basic Sources again:
The files must be created in the dataDirectory by atomically moving or renaming them into the data directory.
Once moved, the files must not be changed. So if the files are being continuously appended, the new data will not be read.
Wrapping up: you should only move the files into the directory that Spark watches once the files are complete and ready for processing with Spark. This part is outside the scope of Spark.
You can use DFSInotifyEventInputStream to watch a Hadoop directory and then execute a Spark job programmatically when a file is created.
See this post:
HDFS file watcher
