How does Spark process XML files? - apache-spark

How does Spark process XML files in a distributed manner? An XML file is not splittable, right? Will it be processed by only a single node? I'm a little confused; it would be helpful if someone could clarify this for me. Thanks in advance.

I came across the same question in a recent use case I developed with Spark.
From what I observed in the Spark Web UI, a single XML file is indeed not splittable, but the transformations (read/parse, etc.) are still handled by multiple nodes in a distributed manner.
My summary: assuming you have 100 XML files to read and process and you have 10 nodes, you can only process 10 files at a time before moving on to the next batch of 10 (10 -> 20 -> 30 ... 100).
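As a minimal sketch of how this usually looks in practice, the snippet below reads a directory of XML files with the spark-xml data source (the rowTag value, paths and package version are placeholders, not taken from the question); each file becomes at least one partition, so many files are parsed in parallel even though an individual file is not split:

```python
from pyspark.sql import SparkSession

# Assumes the spark-xml package is on the classpath, e.g. started with
#   spark-submit --packages com.databricks:spark-xml_2.12:0.15.0 parse_xml.py
spark = SparkSession.builder.appName("xml-parse-sketch").getOrCreate()

# "record" and the S3 path are hypothetical placeholders.
df = (spark.read
      .format("xml")                 # spark-xml data source
      .option("rowTag", "record")    # element that maps to one row
      .load("s3a://my-bucket/input/*.xml"))

# Each input file contributes at least one partition, so with many files
# the parse work is spread across executors.
print(df.rdd.getNumPartitions())
df.show(5)
```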

Related

Hadoop input with 100k paths cause extremely long time during splits

I am using the Flink batch API with Hadoop FileInputFormat to process a large number of input files (approx. 100k). I found that job preparation is extremely slow. In the FileInputFormat.getSplits() method, it iterates over all input paths and fetches block locations for every path, so I think it sends 100k requests to HDFS, which causes the problem. Is there any approach to speed up the split-generation procedure? I think Spark and MapReduce may have a similar problem as well. Thank you very much!
Try increasing this parameter: mapreduce.input.fileinputformat.list-status.num-threads
Also, compacting those 100k files would definitely help.
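For Spark specifically, a hedged sketch of one way to raise that thread count is to pass it through the spark.hadoop.* prefix, which forwards the value to the underlying Hadoop Configuration used when listing input paths (the value 32 and the path below are arbitrary examples, not recommendations):

```python
from pyspark.sql import SparkSession

# The spark.hadoop.* prefix copies the setting into the Hadoop Configuration
# used by file input formats when listing input paths and computing splits.
spark = (SparkSession.builder
         .appName("fast-split-listing-sketch")
         .config("spark.hadoop.mapreduce.input.fileinputformat.list-status.num-threads", "32")
         .getOrCreate())

# Hypothetical directory with ~100k small files; listing should now use 32 threads.
df = spark.read.text("hdfs:///data/many-small-files/*")
print(df.count())
```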

Dilemma about Spark partitions

I am working on a project where I have to read S3 files (each about 3MB zipped) using boto3. I have a small pyspark script that runs every hour to process the file and generate 2 types of output data, which are written back to S3. The pyspark script uses the 'xmltodict' Python library to read some static data into a dictionary object needed for file processing. I have a small Amazon EMR cluster v5.28 running with 1 Master and 1 Core. This might be excessive but is not my main concern right now.
Questions:
1. How do I know 'IF' I should partition the data? I have read articles on how many partitions to create, etc., but couldn't find anything on IF and WHEN. What are the criteria that drive partitioning - the number of rows, columns, data types, actions taken in the script, etc. in the source data file? I read the source file into an RDD, convert it to a DataFrame, and perform various operations by adding columns, grouping data, counting data, etc. How does Spark handle partitioning behind the scenes?
2. Currently, I manually execute the pyspark script as follows:
spark-submit --master spark://x.x.x.x:7077 --deploy-mode client test.py
on the master node, as I have decided to stick with the standalone cluster manager. The 'xmltodict' library is installed on this node, but not on the Core node. It doesn't seem like it needs to be installed (or even Python 3 configured) on the Core node, since I am not seeing any errors. Is that correct, and can somebody shed some light on this confusion? I tried to install the Python libraries via a shell script as a bootstrap action when I created the cluster, but it failed, and quite frankly, after trying a few times, I gave up.
3. Based on partitioning, I think I am slightly confused about whether or not to use coalesce() or collect(). Again, the question is when to use them and when not to?
Sorry for the many questions. Now that I have the pyspark script written, I am trying to work on efficiencies.
Thanks
Partitioning is the mechanism by which data is divided into optimally sized chunks, and based on those chunks multiple tasks are run, each processing one piece of data. As you can see, this is the core of parallelism, and without it there is no significant benefit to Spark (or any big-data processing framework). Most file formats are splittable, and some remain splittable even when compressed, such as Avro, Parquet and ORC. Others are not splittable when compressed, such as zip and gzip. Based on the size of the files being processed and whether they can be split, Spark automatically creates multiple partitions and processes the data in parallel. In your case, since the data is zipped, one file will be one partition, and no more than one CPU core can work on it at a time. If the zip is small then that's fine, but if it is big then its processing will be slow.
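A minimal sketch of the usual workaround, assuming a small gzipped CSV like the one described: read it (it lands in a single partition because gzip is not splittable) and then repartition explicitly so the downstream transformations run in parallel. The paths, column name and partition count are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gzip-repartition-sketch").getOrCreate()

# A gzipped CSV cannot be split, so the read produces a single partition.
df = spark.read.option("header", "true").csv("s3a://my-bucket/input/data.csv.gz")
print(df.rdd.getNumPartitions())   # typically 1 for a single gzip file

# Repartition after reading so grouping/aggregation work is spread across
# the cluster; 8 is an arbitrary example value.
df = df.repartition(8)
result = df.groupBy("some_column").count()
result.write.mode("overwrite").parquet("s3a://my-bucket/output/counts")
```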

S3 based streaming solution using apache spark or flink

We have batch pipelines writing files (mostly csv) into an s3 bucket. Some of these pipelines write per minute and some of them every 5 mins. Currently, we have a batch application which runs every hour processing these files.
The business wants data to be available every 5 mins. Instead of running batch jobs every 5 mins, we decided to use Apache Spark Structured Streaming and process the data in near real time. My question is: how easy/difficult is it to productionise this solution?
My only worry is that if the checkpoint location gets corrupted, deleting the checkpoint directory will reprocess the last year of data. Has anyone productionised a solution on S3 using Spark Structured Streaming, or do you think Flink is better for this use case?
If you think there is a better architecture/pattern for this problem, kindly point me in the right direction.
PS: We already thought of putting these files into Kafka and ruled it out due to bandwidth constraints and the large size of the files.
I found a way to do this, though not the most effective way. Since we had already productionised a Kafka-based solution before, we could push an event into Kafka using S3 event notifications and a Lambda. The event contains only metadata such as the file location and size.
This makes the Spark program a bit more challenging, as the file will be read and processed inside a single executor, which does not really utilise distributed processing. Alternatively, read it in the executor and bring the data back to the driver so it can be parallelised again across the cluster. Either way, the Spark app needs to be planned a lot more carefully in terms of memory, because input file sizes vary a lot.
https://databricks.com/blog/2019/05/10/how-tilting-point-does-streaming-ingestion-into-delta-lake.html
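For reference, a hedged sketch of the Structured Streaming approach the question proposes: file-source streaming over the S3 prefix with an explicit checkpoint location. The paths, schema and trigger interval are placeholders, not taken from the question:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StringType, DoubleType

spark = SparkSession.builder.appName("s3-csv-stream-sketch").getOrCreate()

# File-source streaming needs an explicit schema; these columns are made up.
schema = (StructType()
          .add("id", StringType())
          .add("value", DoubleType()))

stream = (spark.readStream
          .schema(schema)
          .option("header", "true")
          .csv("s3a://my-bucket/incoming/"))      # batch pipelines drop CSVs here

query = (stream.writeStream
         .format("parquet")
         .option("path", "s3a://my-bucket/processed/")
         .option("checkpointLocation", "s3a://my-bucket/checkpoints/csv-stream/")
         .trigger(processingTime="5 minutes")
         .start())

query.awaitTermination()
```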

Kafka, Spark and a large CSV file (4 GB)

I am developing an integration channel with Kafka and Spark, which will handle both batch and streaming processing.
For batch processing, I receive huge CSV files (4 GB).
I'm considering two solutions:
1. Send the whole file to the file system and send a message to Kafka with the file address; the Spark job then reads the file from the FS and works on it.
2. Cut the file up before Kafka into unit messages (with Apache NiFi) and send them, so the Spark job treats the batch as streaming.
What do you think is the best solution?
Thanks
If you're already writing code to place the file on the file system, you can use that same code to submit the Spark job to the job tracker. The job tracker then becomes the task queue and processes your submitted files as Spark jobs.
This would be a simpler way of implementing #1, but it has drawbacks. The main one is that you have to tune resource allocation to make sure you don't under-allocate in cases where your data set is extremely large. If you over-allocate resources for the job, your task queue can grow while tasks wait for resources. The advantage is that there aren't very many moving parts to maintain and troubleshoot.
Using NiFi to cut a large file down and having Spark handle the pieces as a stream would probably make it easier to utilise the cluster resources effectively. If your cluster is servicing random jobs on top of this data ingestion, this might be the better way to go. The drawbacks are that you may need extra work to process all parts of a single file in one transactional context, and you may have to do a few extra things to make sure you don't lose the data delivered by Kafka, etc.
If this is for a batch operation, method 2 might be overkill. The setup seems pretty complex for reading a CSV file, even a potentially very large one. If you had a problem with the velocity of the CSV files, a number of ever-changing sources for the CSV, or a high error rate, then NiFi would make a lot of sense.
It's hard to suggest the best solution. If it were me, I'd start with the variation of #1 to make it work first. Then you can make it work better by introducing more system parts, depending on how your approach performs with an acceptable level of accuracy in handling anomalies in the input file. You may find that your biggest problem is identifying errors in input files during a large-scale ingestion.
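A minimal sketch of option #1, assuming the Kafka message carries only the path of the CSV file that was dropped on shared storage; all names and paths below are illustrative:

```python
import sys
from pyspark.sql import SparkSession

# Hypothetical driver script: launched (e.g. via spark-submit) with the path
# that was published to Kafka when the 4 GB CSV landed on the file system.
#   spark-submit process_csv.py hdfs:///landing/batch_2024.csv
def main(path):
    spark = SparkSession.builder.appName("csv-batch-sketch").getOrCreate()

    # An uncompressed CSV is splittable, so Spark reads it in parallel chunks.
    df = spark.read.option("header", "true").csv(path)

    # Placeholder transformation; the real business logic goes here.
    summary = df.groupBy("category").count()
    summary.write.mode("overwrite").parquet("hdfs:///output/summary")

    spark.stop()

if __name__ == "__main__":
    main(sys.argv[1])
```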

Spark Stand Alone - Last Stage saveAsTextFile takes many hours using very little resources to write CSV part files

We run Spark in Standalone mode with 3 nodes on a 240 GB "large" EC2 box to merge three CSV files, read into DataFrames and converted to JavaRDDs, into output CSV part files on S3 using s3a.
We can see from the Spark UI that the first stages, reading and merging to produce the final JavaRDD, run at 100% CPU as expected, but the final stage, writing out the CSV files using saveAsTextFile at package.scala:179, gets "stuck" for many hours on 2 of the 3 nodes, with 2 of the 32 tasks taking hours (the box is at 6% CPU, 86% memory, 15 kB/s network IO, and 0 disk IO for the entire period).
We are reading and writing uncompressed CSV (we found uncompressed was much faster than gzipped CSV), with repartition(16) on each of the three input DataFrames and no coalesce on the write.
We would appreciate any hints as to what we can investigate to understand why the final stage takes so many hours doing very little on 2 of the 3 nodes in our standalone local cluster.
Many thanks
--- UPDATE ---
I tried writing to local disk rather than s3a, and the symptoms are the same: 2 of the 32 tasks in the final saveAsTextFile stage get "stuck" for hours.
If you are writing to S3, via s3n, s3a or otherwise, do not set spark.speculation = true unless you want to run the risk of corrupted output.
What I suspect is happening is that the final stage of the process is renaming the output files, which on an object store involves copying lots (many GB?) of data. The rename takes place on the server side, with the client just keeping an HTTPS connection open until it finishes. I'd estimate the S3A rename rate at about 6-8 megabytes/second... does that number tie in with your results?
Write to local HDFS and then, afterwards, upload the output.
gzip compression can't be split, so Spark will not assign parts of processing a file to different executors. One file: one executor.
Try to avoid CSV; it's an ugly format. Embrace Avro, Parquet or ORC instead. Avro is great for other apps to stream into; the others are better for downstream processing in other queries. Significantly better.
And consider compressing the files with a format such as lzo or snappy, both of which can be split.
see also slides 21-22 on: http://www.slideshare.net/steve_l/apache-spark-and-object-stores
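To illustrate that advice, a hedged sketch of replacing the saveAsTextFile CSV output with a snappy-compressed Parquet write; the paths are placeholders and the actual merge logic is elided:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-output-sketch").getOrCreate()

# Hypothetical stand-in for the three merged DataFrames from the question.
merged = spark.read.option("header", "true").csv("s3a://my-bucket/input/*.csv")

# Writing Parquet with snappy compression instead of plain-text CSV:
# columnar, compressed, and friendlier for downstream jobs.
(merged.repartition(16)
       .write
       .mode("overwrite")
       .option("compression", "snappy")
       .parquet("s3a://my-bucket/output/merged-parquet/"))
```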
I have seen similar behavior. There is a bug fix in HEAD as of October 2016 that may be relevant. But for now you might enable
spark.speculation=true
in the SparkConf or in spark-defaults.conf.
Let us know if that mitigates the issue.
