Dilemma about Spark partitions - apache-spark

I am working on a project where I have to read S3 files (each about 3MB zipped) using boto3. I have a small pyspark script that runs every hour to process the file and generate 2 types of output data which is written back to S3. The pyspark script uses 'xmltodict' python library to read some static data into a dictionary object needed for file processing. I have a small Amazon EMR cluster v5.28 running with 1 Master and 1 Core. This might be excessive but is not my main concern right now.
Questions:
1. How do I know 'IF' i should partition the data? I have read articles on how many partitions to create, etc but couldn't find anything on IF and WHEN. What is the criteria that drives partitioning - number of rows, columns, data type, actions taken in the script, etc in the source data file? I read the source file into an RDD and convert it to a DF and perform various operations by adding columns, grouping data, counting data, etc. How does spark handle partitioning behind the scenes?
2. Currently, I manually execute the pyspark script as follows:
spark-submit --master spark://x.x.x.x:7077 --deploy-mode client test.py
on the master node as I have decided to stick with Standalone CM. The 'xmltodict' is installed on this node, but is not installed on the Core node. It doesn't seem like it needs to be installed or even python3 configured on Core node since I am not seeing any errors. Is that correct and can somebody shed some light on this confusion? I tried to install the python libraries via shell file as a bootstrap
when I created the cluster, but it failed and quite frankly after trying it a few times, I gave up.
3. Based on partitioning I think I am slightly confused on whether or not to use coalesce() or collect(). Again, the question is when to use and when not to?
Sorry too many questions. Now, that I have the pyspark script written, I am trying to work the efficiencies.
Thanks

Partitioning is the mechanism with which data is divided into optimum size chunks and based on that multiple tasks are run, each processing one piece of data. As you see this is the core of parallelism and without this there is no significant use of Spark (or any bigdata processing framework). Most of the file formats are splittable and some are splittable when compressed like Avro, parquet, orc etc. Some file formats are not splittable when compressed like - zip, gzip etc. Based on the size of the file being processed and their ability to be split, Spark automatically creates multiple partitions and processes data in parallel. In your case the data being zip, one file will be one partition and no more than 1 CPU can work on it at once. If this zip is small then its ok, but if it is big then its processing will be slow.

Related

PySpark Parallelism

I am new to spark and am trying to implement reading data from a parquet file and then after some transformation returning it to web ui as a paginated way. Everything works no issue there.
So now I want to improve the performance of my application, after some google and stack search I found out about pyspark parallelism.
What I know is that :
pyspark parallelism works by default and It creates a parallel process based on the number of cores the system has.
Also for this to work data should be partitioned.
Please correct me if my understanding is not right.
Questions/doubt:
I am reading data from one parquet file, so my data is not partitioned and if I use the .repartition() method on my dataframe that is expensive. so how should I use PySpark Parallelism here ?
Also I could not find any simple implementation of pyspark parallelism, which could explain how to use it.
In spark cluster 1 core reads one partition so if you are on multinode spark cluster
then you need to leave some meory for existing system manager like Yarn etc.
https://spoddutur.github.io/spark-notes/distribution_of_executors_cores_and_memory_for_spark_application.html
you can use reparation and specify number of partitions
df.repartition(n)
where n is the number of partition. Repartition is for parlelleism, it will be ess expensive then process your single file without any partition.

Distribute file copy to executors

I have a bunch of data (on S3) that I am copying to a local HDFS (on amazon EMR). Right now I'm doing that using org.apache.hadoop.fs.FileUtil.copy, but it's not clear if this distributes the file copy to the executors. There's certainly nothing showing up in the Spark History server.
Hadoop DistCp seems like the thing (note I'm on S3, so it's actually supposed to be s3-dist-cp which is built on top of dist-cp) except that it's a command-line tool. I'm looking for a way to invoke this from a Scala script (aka, Java).
Any ideas / leads?
cloudcp is an example of using Spark to do the copy; the list of files is turned into an RDD, each row == a copy. That design is optimised for upload from HDFS, as it tries to schedule the upload close to the files in HDFS.
For download, you want to
use listFiles(path, recursive) for maximum performance in listing an object store.
randomise the list of source files so that you don't get throttled by AWS
randomise the placement across the HDFS cluster so that the blocks end up scattered evenly round the cluster

Importing a large text file into Spark

I have a pipe delimited text file that is 360GB, compressed (gzip). The file is in an S3 bucket.
This is my first time using Spark. I understand that you can partition a file in order to allow multiple worker nodes to operate on the data which results in huge performance gains. However, I'm trying to find an efficient way to turn my one 360GB file into a partitioned file. Is there a way to use multiple spark worker nodes to work on my one, compressed file in order to partition it? Unfortunately, I have no control over the fact that I'm just getting one huge file. I could uncompress the file myself and break it into many files (say 360 1GB files), but I'll just be using one machine to do that and it will be pretty slow. I need to run some expensive transformations on the data using Spark so I think partitioning the file is necessary. I'm using Spark inside of Amazon Glue so I know that it can scale to a large number of machines. Also, I'm using python (pyspark).
Thanks.
If i'm not mistaken, Spark uses Hadoop's TextInputFormat if you read a file using SparkContext.textFile. If a compression codec is set, the TextInputFormat determines if the file is splittable by checking if the code is an instance of SplittableCompressionCodec.
I believe GZIP is not splittable, Spark can only generate one partition to read the entire file.
What you could do is:
1. Add a repartition after SparkContext.textFile so you at least have more than one of your transformations process parts of the data.
2. Ask for multiple files instead of just a single GZIP file
3. Write an application that decompresses and splits the files into multiple output files before running your Spark application on it.
4. Write your own compression codec for GZIP (this is a little more complex).
Have a look at these links:
TextInputFormat
source code for TextInputFormat
GzipCodec
source code for GZIPCodec
These are in java, but i'm sure there are equivalent Python/Scala versions of them.
First I suggest you have to used ORC format with zlib compression so you get almost 70% compression and as per my research ORC is the most suitable file format for fastest data processing. So you have to load your file and simply write it into orc format with repartition.
df.repartition(500).write.option("compression","zlib").mode("overwrite").save("testoutput.parquet")
One potential solution could be to use Amazon's S3DistCp as a step on your EMR cluster to copy the 360GB file in the HDFS file system available on the cluster (this requires Hadoop to be deployed on the EMR).
A nice thing about S3DistCp is that you can change the codec of the output file, and transform the original gzip file into a format which will allow Spark to spawn multiple partitions for its RDD.
However I am not sure about how long it will take for S3DistCp to perform the operation (which is an Hadoop Map/Reduce over S3. It benefits from optimised S3 libraries when run from an EMR, but I am concerned that Hadoop will face the same limitations as Spark when generating the Map tasks).

Spark Stand Alone - Last Stage saveAsTextFile takes many hours using very little resources to write CSV part files

We run Spark in Standalone mode with 3 nodes on a 240GB "large" EC2 box to merge three CSV files read into DataFrames to JavaRDDs into output CSV part files on S3 using s3a.
We can see from the Spark UI, the first stages reading and merging to produce the final JavaRDD run at 100% CPU as expected, but the final stage writing out as CSV files using saveAsTextFile at package.scala:179 gets "stuck" for many hours on 2 of the 3 nodes with 2 of the 32 tasks taking hours (box is at 6% CPU, memory 86%, Network IO 15kb/s, Disk IO 0 for the entire period).
We are reading and writing uncompressed CSV (we found uncompressed was much faster than gzipped CSV) with re partition 16 on each of the three input DataFrames and not coleaseing the write.
Would appreciate any hints what we can investigate as to why the final stage takes so many hours doing very little on 2 of the 3 nodes in our standalone local cluster.
Many thanks
--- UPDATE ---
I tried writing to local disk rather than s3a and the symptoms are the same - 2 of the 32 tasks in the final stage saveAsTextFile get "stuck" for hours:
If you are writing to S3, via s3n, s3a or otherwise, do not set spark.speculation = true unless you want to run the risk of corrupted output.
What I suspect is happening is that the final stage of the process is renaming the output file, which on an object store involves copying lots (many GB?) of data. The rename takes place on the server, with the client just keeping an HTTPS connection open until it finishes. I'd estimate S3A rename time as about 6-8 Megabytes/second...would that number tie in with your results?
Write to local HDFS then, afterwards, upload the output.
gzip compression can't be split, so Spark will not assign parts of processing a file to different executors. One file: one executor.
Try and avoid CSV, it's an ugly format. Embrace: Avro, Parquet or ORC. Avro is great for other apps to stream into, the others better for downstream processing in other queries. Significantly better.
And consider compressing the files with a format such as lzo or snappy, both of which can be split.
see also slides 21-22 on: http://www.slideshare.net/steve_l/apache-spark-and-object-stores
I have seen similar behavior. There is a bug fix in HEAD as of October 2016 that may be relevant. But for now you might enable
spark.speculation=true
in the SparkConf or in spark-defaults.conf .
Let us know if that mitigates the issue.

library to process .rrd(round robin data) using spark

I have huge time series data which is in .rrd(round robin database) format stored in S3. I am planning to use apache spark for running analysis on this to get different performance matrix.
Currently I am downloading the .rrd file from s3 and processing it using rrd4j library. I am going to do processing for longer terms like year or more. it involves processing of hundreds of thousands of .rrd files. I want spark nodes to get the file directly from s3 and run the analysis.
how can I make spark to use the rrd4j to read the .rrd files? is there any library which helps me do that?
is there any support in spark for processing this kind of data?
The spark part is rather easy, use either wholeTextFiles or binaryFiles on sparkContext (see docs). According to the documentation, rrd4j usually wants a path to construct an rrd, but with the RrdByteArrayBackend, you could load the data in there - but that might be a problem, because most of the API is protected. You'll have to figure out a way to load an Array[Byte] into rrd4j.

Resources