Spark concurrent writes on same HDFS location - apache-spark

I have Spark code which saves a dataframe to an HDFS location (a date-partitioned location) in JSON format using append mode.
df.write.mode("append").format('json').save(hdfsPath)
sample hdfs location : /tmp/table1/datepart=20190903
I consume data from upstream in a NiFi cluster. Each node in the NiFi cluster creates a flow file for the consumed data, and my Spark code processes that flow file. Because NiFi is distributed, my Spark code gets executed from different NiFi nodes in parallel, all trying to save data into the same HDFS location.
I cannot store the output of the Spark job in different directories because my data is partitioned on date.
This process has been running once daily for the last 14 days, and my Spark job has failed 4 times with different errors.
First Error:
java.io.IOException: Failed to rename FileStatus{path=hdfs://tmp/table1/datepart=20190824/_temporary/0/task_20190824020604_0000_m_000000/part-00000-101aa2e2-85da-4067-9769-b4f6f6b8f276-c000.json; isDirectory=false; length=0; replication=3; blocksize=268435456; modification_time=1566630365451; access_time=1566630365034; owner=hive; group=hive; permission=rwxrwx--x; isSymlink=false} to hdfs://tmp/table1/datepart=20190824/part-00000-101aa2e2-85da-4067-9769-b4f6f6b8f276-c000.json
Second Error:
java.io.FileNotFoundException: File hdfs://tmp/table1/datepart=20190825/_temporary/0 does not exist.
Third Error:
java.io.FileNotFoundException: File hdfs://tmp/table1/datepart=20190901/_temporary/0/task_20190901020450_0000_m_000000 does not exist.
Fourth Error:
java.io.FileNotFoundException: File hdfs://tmp/table1/datepart=20190903/_temporary/0 does not exist.
Following are the problems/issues:
I am not able to recreate this scenario again. How can I do that?
On all 4 occasions, the errors are related to the _temporary directory. Is it because 2 or more jobs are trying to save data into the same HDFS location in parallel, and while doing that Job A might have deleted the _temporary directory of Job B? (Because of the same location, and all folders have the common name /_temporary/0/.)
If it is a concurrency problem then I can run all NiFi processors from the primary node, but then I will lose performance.
Need your expert advice.
Thanks in advance.

It seems the problem is that two Spark jobs are independently trying to write to the same place, causing conflicts: the faster one cleans up the shared working directory while the slower one still expects it to be there.
The most straightforward solution may be to avoid this.
As I understand how you use Nifi and spark, the node where Nifi runs also determines the node where spark runs (there is a 1-1 relationship?)
If that is the case you should be able to solve this by routing the work in Nifi to nodes that do not interfere with each other. Check out the load balancing strategy (property of the queue) that depends on attributes. Of course you would need to define the right attribute, but something like directory or table name should go a long way.
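A hedged illustration from the editor (not something posted in the thread), which also speaks to the "how do I recreate this" question above: fire two writes at the same partition path at the same time. The paths and generated data below are placeholders, and since this is a race, the failure may take a few runs to show up.

# Sketch: two concurrent appends to the same partition path. Both writers share
# <target>/_temporary/0, and whichever job commits first deletes that directory,
# which is what the FileNotFoundException messages above point at.
from concurrent.futures import ThreadPoolExecutor
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("repro-concurrent-append").getOrCreate()
target = "/tmp/table1/datepart=20190903"   # same date partition for both writers

def write_once(i):
    df = spark.range(0, 1000000).withColumnRenamed("id", "col_%d" % i)
    df.write.mode("append").format("json").save(target)

with ThreadPoolExecutor(max_workers=2) as pool:
    list(pool.map(write_once, range(2)))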

Try enabling the output committer v2 algorithm:
spark.conf.set("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
With algorithm version 2, each task commits its output files directly into the destination directory as it finishes, instead of relying on a single job-level move of everything under the shared _temporary directory.
It also speeds up the write, but it allows some rare hypothetical cases of partially written data if a job fails mid-write.
Check this doc for more info:
https://spark.apache.org/docs/3.0.0-preview/cloud-integration.html#recommended-settings-for-writing-to-object-stores
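For completeness (an editor's sketch, not part of the original answer), the same property can also be set when the session is built, or passed as --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 to spark-submit, so it is in place before any write runs. The input and output paths below are placeholders.

# Sketch: committer algorithm v2 set at session build time; paths are placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("nifi-json-append")
    .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
    .getOrCreate())

df = spark.read.json("hdfs:///tmp/flowfile_input")   # placeholder input
df.write.mode("append").format("json").save("/tmp/table1/datepart=20190903")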

Related

How to make sure that spark.write.parquet() writes the data frame on to the file specified using relative path and not in HDFS, on EMR?

My problem is as below:
A pyspark script that runs perfectly on a local machine and on an EC2 instance is ported onto an EMR cluster for scaling up. There's a config file that mentions relative locations for the outputs.
An example:
Config
feature_outputs= /outputs/features/
File structure:
classifier_setup/
    feature_generator.py
    model_execution.py
    config.py
    utils.py
    logs/
    models/
    resources/
    outputs/
The code reads the config, generates features, and writes them into the path mentioned above. On EMR, this is getting saved into HDFS (spark.write.parquet writes into HDFS; on the other hand, df.toPandas().to_csv() writes to the relative output path mentioned). The next part of the script reads the same path mentioned in the config, tries to read the parquet from that location, and fails.
How do I make sure that the outputs are created in the relative path that is specified in the config?
If that's not possible, how can I make sure that I read them from HDFS in the subsequent steps?
I referred to these discussions on HDFS paths; however, it's not very clear to me. Can someone help me with this?
Thanks.
Short Answer to your question:
Writing using Pandas and writing using Spark are two different things. Pandas doesn't use Hadoop to process, read, and write; it writes to the local file system of the node it runs on, which is not HDFS. Spark, on the other hand, uses distributed computing to get things onto multiple machines at the same time, and it's built on top of Hadoop, so by default when you write using Spark it writes into HDFS.
While writing from EMR, you can choose to write either into
EMR local filesystem,
HDFS, or
EMRFS (which is backed by S3 buckets).
Refer to the AWS documentation.
If at the end of your job you are writing using a Pandas dataframe and you want to write it into an HDFS location (maybe because your next-step Spark job reads from HDFS, or for some other reason), you might have to use PyArrow for that; refer to the PyArrow documentation.
If at the end of your job you are writing into HDFS using a Spark dataframe, then in the next step you can read it back using a path like hdfs://<feature_outputs>.
Also, while saving data into EMR HDFS, keep in mind that if you are using the default EMR storage it is volatile, i.e. all the data will be lost once the EMR cluster goes down (gets terminated). If you want to keep the data, you might have to attach an external EBS volume that can be reused in another EMR cluster, or use some other storage solution that AWS provides.
If your data needs to be persisted, the best option is to write it into S3 instead of keeping it only in EMR storage.
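To make the distinction concrete, here is an editor's sketch of being explicit about the scheme in each path on EMR; every bucket name and path below is a placeholder, not taken from the question's config.

# Sketch: spell out where each write lands on EMR; all paths are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("feature-generator").getOrCreate()
features = spark.read.parquet("s3://my-bucket/raw/")   # hypothetical input

# Bare paths on EMR resolve against HDFS, so make the scheme explicit:
features.write.mode("overwrite").parquet("hdfs:///outputs/features/")          # HDFS
features.write.mode("overwrite").parquet("s3://my-bucket/outputs/features/")   # EMRFS (S3)

# Pandas goes through the local OS, so this lands on the driver node's disk, not in HDFS:
features.limit(1000).toPandas().to_csv("/home/hadoop/features_sample.csv")

# A later step that needs the HDFS copy must read it back with the same scheme:
df_back = spark.read.parquet("hdfs:///outputs/features/")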

Dilemma about Spark partitions

I am working on a project where I have to read S3 files (each about 3 MB, zipped) using boto3. I have a small pyspark script that runs every hour to process the file and generate 2 types of output data, which are written back to S3. The pyspark script uses the 'xmltodict' python library to read some static data into a dictionary object needed for file processing. I have a small Amazon EMR cluster (v5.28) running with 1 Master and 1 Core node. This might be excessive, but it is not my main concern right now.
Questions:
1. How do I know IF I should partition the data? I have read articles on how many partitions to create, etc., but couldn't find anything on IF and WHEN. What are the criteria that drive partitioning - the number of rows, columns, and data types in the source file, the actions taken in the script, etc.? I read the source file into an RDD, convert it to a DF, and perform various operations by adding columns, grouping data, counting data, etc. How does Spark handle partitioning behind the scenes?
2. Currently, I manually execute the pyspark script as follows:
spark-submit --master spark://x.x.x.x:7077 --deploy-mode client test.py
on the master node, as I have decided to stick with the standalone cluster manager. The 'xmltodict' library is installed on this node, but it is not installed on the Core node. It doesn't seem like it needs to be installed there, or even that python3 needs to be configured on the Core node, since I am not seeing any errors. Is that correct, and can somebody shed some light on this confusion? I tried to install the python libraries via a shell file as a bootstrap action
when I created the cluster, but it failed and, quite frankly, after trying a few times, I gave up.
3. Based on partitioning, I am slightly confused about whether or not to use coalesce() or collect(). Again, the question is when to use them and when not to?
Sorry, too many questions. Now that I have the pyspark script written, I am trying to work on the efficiencies.
Thanks
Partitioning is the mechanism by which data is divided into optimally sized chunks, and based on that, multiple tasks are run, each processing one piece of the data. As you can see, this is the core of parallelism, and without it there is no significant use of Spark (or any big-data processing framework). Most file formats are splittable, and some are splittable even when compressed, like Avro, Parquet, ORC, etc. Some file formats are not splittable when compressed, like zip, gzip, etc. Based on the size of the files being processed and their ability to be split, Spark automatically creates multiple partitions and processes the data in parallel. In your case, the data being zip, one file will be one partition, and no more than 1 CPU can work on it at once. If the zip is small then that's OK, but if it is big then its processing will be slow.
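As a small illustration of the knobs the question mentions (an editor's sketch with arbitrary numbers): getNumPartitions() shows what Spark chose, repartition() and coalesce() change it, and collect() has nothing to do with partitioning, since it is an action that pulls every row to the driver.

# Sketch: inspecting and adjusting partitions; the file path and counts are examples only.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-check").getOrCreate()
df = spark.read.json("s3://my-bucket/hourly/input.json.gz")   # gzip is not splittable

print(df.rdd.getNumPartitions())   # likely 1 for a single gzipped file

df_wide = df.repartition(8)        # full shuffle; increases parallelism downstream
df_small = df_wide.coalesce(2)     # narrow; useful to avoid many tiny output files

# collect() is not a partitioning tool: it brings all rows to the driver,
# so reserve it for small results.
preview = df_small.limit(10).collect()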

Spark failing because S3 files are updated. How to eliminate this error?

My Spark script is failing because the S3 bucket from which the df is drawn gets updated with new files while the script is running. I don't care about the newly arriving files, but apparently Spark does.
I've tried adding the REFRESH TABLE command per the error message, but that doesn't work because it is impossible to know at execution time when the new files will arrive, so there is no way to know where to put that command. I have tried putting that REFRESH command in 4 different places in the script (in other words, invoking it 4 times at different points in the script), all with the same failure message:
Caused by: java.io.FileNotFoundException: No such file or directory '<snipped for posting>.snappy.parquet'
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
I create the df with: df = spark.table('data_base.bal_daily_posts')
So what can I do to make sure that files arriving in S3 after the script kicks off are ignored and do not error out the script?
Move the files you're going to process to a different folder (key) and point Spark to work with that folder only.
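A hedged sketch of that approach with boto3 (an editor's illustration, not from the thread): the bucket, prefixes, and run id are placeholders, and the frozen snapshot is read directly as parquet rather than through the Hive table used in the question.

# Sketch: copy the keys that exist at kickoff into a staging prefix, then have Spark
# read only the staging prefix. All names below are placeholders.
import boto3
from pyspark.sql import SparkSession

s3 = boto3.client("s3")
bucket = "my-bucket"
src_prefix = "incoming/bal_daily_posts/"
stage_prefix = "staging/bal_daily_posts/run_20190903/"

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=src_prefix):
    for obj in page.get("Contents", []):
        s3.copy_object(
            Bucket=bucket,
            Key=stage_prefix + obj["Key"][len(src_prefix):],
            CopySource={"Bucket": bucket, "Key": obj["Key"]},
        )

spark = SparkSession.builder.appName("bal-daily").getOrCreate()
df = spark.read.parquet("s3://%s/%s" % (bucket, stage_prefix))   # frozen input snapshot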
This can happen if the same process (e.g. if the reading and writing directories are the same, Spark will first delete the files in the destination) or another process has updated the files while they are being read.

why does _spark_metadata has all parquet partitioned files inside 0 but cluster having 2 workers?

I have a small spark cluster with one master and two workers. I have a Kafka streaming app which streams data from Kafka and writes to a directory in parquet format and in append mode.
So far I am able to read from Kafka stream and write it to a parquet file using the following key line.
val streamingQuery = mydf.writeStream.format("parquet").option("path", "/root/Desktop/sampleDir/myParquet").outputMode(OutputMode.Append).option("checkpointLocation", "/root/Desktop/sampleDir/myCheckPoint").start()
I have checked on both of the workers. There are 3-4 snappy parquet files created, with file names prefixed with part-00006-XXX.snappy.parquet.
But when I try to read this parquet output using the following command:
val dfP = sqlContext.read.parquet("/root/Desktop/sampleDir/myParquet")
it shows file-not-found exceptions for some of the parquet part files. The strange thing is that those files are already present on one of the worker nodes.
When I checked the logs further, I observed that Spark tries to get all the parquet files from only ONE worker node, and since not all parquet files are present on that one worker, it hits the exception that those files were not found in the given parquet path.
Am I missing some critical step in the streaming query or while reading data?
NOTE: I don't have a HADOOP infrastructure. I want to use filesystem only.
You need a shared file system.
Spark assumes the same file system is visible from all nodes (driver and workers).
If you are using the basic (local) file system, then each node sees its own file system, which is different from the file system of the other nodes.
HDFS is one way of getting a common, shared file system; another would be to use a common NFS mount (i.e. mount the same remote file system from all nodes to the same path). Other shared file systems also exist.
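For illustration (an editor's sketch, not from the thread), assuming a shared mount visible at /mnt/shared on every node, a PySpark version of the query above would point both the sink and the checkpoint at that shared location; with HDFS you would use hdfs:// URIs instead. The broker and topic are placeholders, and the Kafka source requires the spark-sql-kafka package to be on the classpath.

# Sketch: same streaming write, but the sink and checkpoint live on a path all nodes can see.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-to-parquet").getOrCreate()

mydf = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
    .option("subscribe", "mytopic")                      # placeholder topic
    .load())

streamingQuery = (mydf.selectExpr("CAST(value AS STRING) AS value")
    .writeStream
    .format("parquet")
    .option("path", "/mnt/shared/sampleDir/myParquet")
    .option("checkpointLocation", "/mnt/shared/sampleDir/myCheckPoint")
    .outputMode("append")
    .start())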

Spark streaming + processing files in HDFS stage directory

I have a daemon process which dumps data as files into HDFS. I need to create an RDD over the new files, de-duplicate them, and store them back on HDFS. The file names should be maintained while dumping back to HDFS.
Any pointers to achieve this?
I am open to achieve it with or without spark streaming.
I tried creating a Spark Streaming process which processes data directly (using Java code on worker nodes) and pushes it into HDFS without creating an RDD.
But this approach fails for larger files (greater than 15 GB).
I am looking into JavaStreamingContext.fileStream now.
Any pointers would be a great help.
Thanks and Regards,
Abhay Dandekar
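No answer was recorded for this question. As an editor's sketch only, here is one way the "de-duplicate but keep the file name" requirement could be approached, under several stated assumptions: the paths are invented, de-duplication means dropping repeated lines within a file, and the write goes through the Hadoop FileSystem API reached via Spark's JVM gateway (an internal but commonly used handle). Because each file's content is pulled to the driver, this sketch would not cope with the 15 GB files mentioned above; those would need a per-file Spark job instead.

# Sketch: de-duplicate newly landed HDFS files while keeping their original names.
# All paths are placeholders; sc._jvm and sc._jsc are internal PySpark handles.
import posixpath
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-dedup").getOrCreate()
sc = spark.sparkContext

staging_dir = "hdfs:///data/staging"   # hypothetical landing directory
output_dir = "hdfs:///data/clean"      # hypothetical output directory

jvm = sc._jvm
fs = jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
Path = jvm.org.apache.hadoop.fs.Path

# wholeTextFiles keeps (path, content) pairs, so the original file name is available.
for path, content in sc.wholeTextFiles(staging_dir).toLocalIterator():
    deduped = "\n".join(dict.fromkeys(content.splitlines()))   # drop repeated lines
    target = Path(posixpath.join(output_dir, posixpath.basename(path)))
    out = fs.create(target, True)   # overwrite if the target already exists
    out.write(bytearray(deduped.encode("utf-8")))
    out.close()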
