How to print memory usage of an algorithm in Pyspark? - apache-spark

I wrote two algorithms in PySpark and I need to compare their memory usage and report the better one. Is there any way to calculate the memory (RAM) usage of a chunk of code in PySpark?
I searched the Spark documentation but did not find anything, and I am also fairly new to PySpark.

Related

Do we need all the data in memory for running group by on Spark

I'm trying to run a group-by operation on a huge dataset (around 50 TB), something like this:
df_grouped = df.groupby(df['col1'], df['col2']).sum('col3')
I'm using the DataFrame API in PySpark and running this on EMR with 12 r5.4xlarge machines. The job takes a long time to process and is eventually killed with an OOM error.
My questions are:
Are there any best practices for running a group-by operation with Spark?
Do we need all the data to fit in memory when running this?
The groupBy operation is not efficient for such large datasets. The OOM in groupBy suggests that the data may be skewed, because the groupBy implementation reads all the data in a partition into memory. You can take a look at the implementation here: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L731

Optimized caching via spark

I am working on a solution to provide low-latency results using Spark. For this, I was planning to cache, ahead of time, the data that a user wants to query.
I am able to achieve good performance on the queries. One thing I noticed is that the data on the cluster (Parquet format) explodes in size when cached. I understand this is due to deserializing and decoding the data. I am just wondering if there are any other options to reduce the memory footprint.
I tried using
sqlContext.cacheTable("table_name") and also
tbl.persist(StorageLevel.MEMORY_AND_DISK_SER)
But neither helps reduce the memory footprint.
Perhaps you want to try ORC? There have been improvements in ORC support recently (more here: https://www.slideshare.net/Hadoop_Summit/orc-improvement-in-apache-spark-23-95295487). I am not an expert, but I heard that ORC uses an in-memory columnar format. This format gives opportunities for doing things like compressing via techniques like run-length encoding of repeated values -- which tends to lower the memory footprint.
It also explodes when not caching.
Caching has nothing to do with reducing the memory footprint. You do not state RDD or DF, but I presume the latter. This write-up on the RDD memory footprint in Spark gives an idea for RDDs and the improvements for DFs / DSs: https://spoddutur.github.io/spark-notes/deep_dive_into_storage_formats.html.
You cannot reuse the data for different users. What you could consider is Apache Ignite. See https://ignite.apache.org/use-cases/spark/shared-memory-layer.html

Spark dataset exceeds total ram size

I have recently started working with Spark and came across a few questions which I still couldn't resolve.
Let's say I have a dataset of 100 GB and the total RAM of the cluster is 16 GB.
Now, I know that simply reading the file and saving it to HDFS will work, as Spark will do it partition by partition. What will happen when I perform a sorting or aggregation transformation on 100 GB of data? How will it process 100 GB in memory, since we need the entire data in the case of sorting?
I have gone through the link below, but it only explains what Spark does in the case of persisting. What I am looking for is how Spark handles aggregations or sorting on a dataset larger than the RAM size.
Spark RDD - is partition(s) always in RAM?
Any help is appreciated.
There are 2 things you might want to know.
Once Spark reaches the memory limit, it will start spilling data to disk. Please check this Spark FAQ; there are also several questions on SO about the same topic, for example, this one.
There is an algorithm called external sort that allows you to sort datasets which do not fit in memory. Essentially, you divide the large dataset into chunks that do fit in memory, sort each chunk, and write each chunk to disk. Finally, you merge the sorted chunks to get the whole dataset sorted. Spark supports external sorting as you can see here, and here is the implementation.
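The chunk, sort, spill, and merge steps described above can be sketched in plain Python. This is only an illustration of the general technique, not Spark's implementation; the function name external_sort and the chunk_size parameter are made up for this example, and it assumes the values are integers.

```python
import heapq
import os
import tempfile


def external_sort(values, chunk_size):
    """Yield the ints in `values` in sorted order, buffering at most chunk_size at a time."""
    chunk_paths = []

    def flush(buf):
        # Sort one in-memory chunk and spill it to its own temporary file.
        buf.sort()
        fd, path = tempfile.mkstemp(suffix=".chunk")
        with os.fdopen(fd, "w") as f:
            f.writelines(f"{v}\n" for v in buf)
        chunk_paths.append(path)

    def read_chunk(path):
        # Stream one sorted chunk back from disk, one value at a time.
        with open(path) as f:
            for line in f:
                yield int(line)

    chunk = []
    for v in values:
        chunk.append(v)
        if len(chunk) >= chunk_size:
            flush(chunk)
            chunk = []
    if chunk:
        flush(chunk)

    try:
        # heapq.merge lazily merges inputs that are already sorted,
        # keeping only one pending value per chunk in memory.
        yield from heapq.merge(*(read_chunk(p) for p in chunk_paths))
    finally:
        for p in chunk_paths:
            os.remove(p)


result = list(external_sort([5, 3, 9, 1, 7, 2, 8], chunk_size=3))
```

The merge step only needs one value per chunk in memory at a time, which is why the total dataset can be much larger than RAM.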
To answer your question: you do not need your data to fit in memory in order to sort it, as explained above. Now, I would encourage you to think about an algorithm for data aggregation that divides the data into chunks, just like external sort does.
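One possible shape for such a chunked aggregation, as a minimal plain-Python sketch (not Spark code): read a bounded chunk, aggregate it, and fold the partial result into a running total. The name chunked_sum_by_key is hypothetical, and the sketch assumes the set of distinct keys is small enough to fit in memory even though the rows are not.

```python
from collections import Counter
from itertools import islice


def chunked_sum_by_key(pairs, chunk_size):
    """Sum values per key while materializing at most chunk_size (key, value) pairs."""
    it = iter(pairs)
    total = Counter()
    while True:
        # Pull the next bounded chunk from the (possibly huge) input stream.
        chunk = list(islice(it, chunk_size))
        if not chunk:
            break
        # Aggregate the chunk on its own first.
        partial = Counter()
        for key, value in chunk:
            partial[key] += value
        # Fold the partial aggregate into the running total.
        total.update(partial)
    return dict(total)


pairs = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
sums = chunked_sum_by_key(pairs, chunk_size=2)
```

This works for sums because partial sums can be combined in any order; an aggregate without that property would need a different approach.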
There are multiple things you need to consider. Because you have 16 GB of RAM and a 100 GB dataset, it is a good idea to persist to disk. Aggregation may be difficult if the dataset has high cardinality. If the cardinality is low, you will be better off aggregating within each RDD before merging into the whole dataset. Also make sure that each partition in the RDD is smaller than the available memory (default value 0.4*container_size).

Identifying why data is skewed in Spark

I am investigating a Spark SQL job (Spark 1.6.0) that is performing poorly due to badly skewed data across the 200 partitions; most of the data is in 1 partition.
What I'm wondering is: is there anything in the Spark UI to help me find out more about how the data is partitioned? From looking at this I don't know which columns the dataframe is partitioned on. How can I find that out (other than looking at the code)? I'm wondering if there's anything in the logs and/or UI that could help me.
Additional details, this is using Spark's dataframe API, Spark version 1.6. Underlying data is stored in parquet format.
The Spark UI and logs will not be terribly helpful for this. Spark uses a simple hash partitioning algorithm as the default for almost everything. As you can see here this basically recycles the Java hashCode method.
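To make the "hash, then take the modulus" idea concrete, here is a small plain-Python sketch. It reimplements Java's String.hashCode for ASCII strings and counts how many keys land in each partition. The helper names are made up for illustration, and the hash Spark actually applies may differ by version and API, so treat this as a model of the description above rather than Spark's exact behavior.

```python
from collections import Counter


def java_string_hash(s):
    """Mimic Java's String.hashCode for ASCII strings, as a signed 32-bit int."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    # Convert the unsigned 32-bit value to Java's signed int range.
    return h - 0x100000000 if h >= 0x80000000 else h


def partition_for(key, num_partitions):
    # Python's % is already non-negative for a positive modulus,
    # so negative hash codes still map to a valid partition index.
    return java_string_hash(key) % num_partitions


def partition_sizes(keys, num_partitions):
    """Count how many keys are assigned to each partition index."""
    return Counter(partition_for(k, num_partitions) for k in keys)


keys = ["US"] * 8 + ["DE", "FR"]
sizes = partition_sizes(keys, num_partitions=4)
```

With a low-cardinality key like this one, every row with the same value lands in the same partition no matter how many partitions there are, which is exactly the skew pattern described in the question.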
I would suggest the following:
Try to debug by sampling and printing the contents of the RDD or data frame. See if there are obvious issues with the data distribution (i.e. low variance or low cardinality) of the key.
If that's ineffective, you can work back from the logs and UI to figure out how many partitions there are. You can find the hashCode of the data using Spark and then take the modulus to see where the collisions are.
Once you find the source of the collision you can try a few techniques to remove it:
See if there's a better key you can use
See if you can improve the hashCode function of the key (the default one in Java isn't that great)
See if you can process the data in two steps by doing an initial scatter/gather step to force some parallelism and reduce the processing overhead for that one partition. This is probably the trickiest optimization to get right of those mentioned here. Basically, partition the data once using a random number generator to force some initial parallel combining of the data, then push it through again with the natural partitioner to get the final result. This requires that the operation you're applying be commutative and associative. This technique hits the network twice and is therefore very expensive unless the data really is that highly skewed.
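The two-step scatter/gather idea can be sketched in plain Python as follows. This is a toy model and not Spark code: the function name two_stage_sum, the num_salts parameter, and the fixed seed are assumptions made for the example. A random "salt" stands in for the random partitioner in the first step, and the natural key is used in the second.

```python
import random
from collections import defaultdict


def two_stage_sum(pairs, num_salts, seed=0):
    """Sum values per key in two stages; returns (totals, number_of_stage1_groups)."""
    rng = random.Random(seed)

    # Stage 1 (scatter): a random salt spreads each key over up to
    # num_salts groups, so one hot key no longer maps to a single group.
    stage1 = defaultdict(int)
    for key, value in pairs:
        stage1[(key, rng.randrange(num_salts))] += value

    # Stage 2 (gather): drop the salt and combine the partial sums
    # under the natural key to get the final result.
    stage2 = defaultdict(int)
    for (key, _salt), partial in stage1.items():
        stage2[key] += partial

    return dict(stage2), len(stage1)


pairs = [("hot", 1)] * 1000 + [("cold", 5)]
totals, stage1_groups = two_stage_sum(pairs, num_salts=4)
```

Summation works here because partial sums can be combined in any grouping and order; an operation without that property would give different results after salting.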

which is faster in spark, collect() or toLocalIterator()

I have a Spark application in which I need to get the data from the executors to the driver, and I am using collect(). However, I also came across toLocalIterator(). As far as I have read about toLocalIterator() on the Internet, it returns an iterator rather than sending the whole RDD at once, so it has better memory performance, but what about speed? How do collect() and toLocalIterator() compare when it comes to execution/computation time?
The answer to this question depends on what you do after calling df.collect() or df.rdd.toLocalIterator(). For example, suppose you are processing a considerably big file of about 7M rows and, after doing all the required transformations, you need to iterate over each of the records in the DataFrame and make service calls in batches of 100.
In the case of df.collect(), it will dump the entire set of records to the driver, so the driver will need an enormous amount of memory. In the case of toLocalIterator(), it will only return an iterator over one partition of the total records at a time, hence the driver does not need an enormous amount of memory. So if you are going to load such big files in parallel workflows inside the same cluster, df.collect() will be expensive, whereas toLocalIterator() will not, and it will be faster and more reliable as well.
On the other hand, if you plan on doing some transformations after df.collect() or df.rdd.toLocalIterator(), then df.collect() will be faster.
Also, if your file is so small that Spark's default partitioning logic does not break it down into partitions at all, then df.collect() will be faster.
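For the "service calls in batches of 100" scenario, a small helper that groups any iterator into fixed-size batches might look like the sketch below. The name batches is made up for this example, and it is demonstrated on a plain range so it runs without a cluster; the comment shows how it could be combined with toLocalIterator(), where call_service is a hypothetical function.

```python
from itertools import islice


def batches(iterator, batch_size):
    """Yield lists of up to batch_size items, pulling lazily from the iterator."""
    it = iter(iterator)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


# With PySpark this could be used as (call_service is hypothetical):
#   for batch in batches(df.rdd.toLocalIterator(), 100):
#       call_service(batch)
batch_sizes = [len(b) for b in batches(range(250), 100)]
```

Because the helper never asks for more than one batch ahead, only about one batch of rows (plus whatever the iterator itself buffers) is held on the driver at a time.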
To quote from the documentation on toLocalIterator():
This results in multiple Spark jobs, and if the input RDD is the result of a wide transformation (e.g. join with different partitioners), to avoid recomputing the input RDD should be cached first.
It means that in the worst-case scenario (no caching at all) it can be n-partitions times more expensive than collect. Even if the data is cached, the overhead of starting multiple Spark jobs can be significant on large datasets. However, the lower memory footprint can partially compensate for that, depending on the particular configuration.
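The worst-case cost argument can be illustrated with a deliberately simplified toy model in plain Python. ToyRDD and its fields are invented for this sketch and are not Spark classes; the model just assumes that every job recomputes the upstream input unless it is cached, which is the worst case described above.

```python
class ToyRDD:
    """Toy stand-in for an RDD that counts how often its upstream input is computed."""

    def __init__(self, partitions, cached=False):
        self.partitions = partitions
        self.cached = cached
        self.upstream_runs = 0
        self._materialized = False

    def _run_upstream(self):
        # A cached input is computed once; an uncached one is recomputed per job.
        if self.cached and self._materialized:
            return
        self.upstream_runs += 1
        self._materialized = True

    def collect(self):
        # One job that brings every partition to the driver at once.
        self._run_upstream()
        return [row for part in self.partitions for row in part]

    def to_local_iterator(self):
        # One job per partition; rows arrive one partition at a time.
        for part in self.partitions:
            self._run_upstream()
            yield from part


parts = [[1, 2], [3, 4], [5, 6]]

collected = ToyRDD(parts)
collected.collect()

uncached = ToyRDD(parts)
rows_uncached = list(uncached.to_local_iterator())

cached = ToyRDD(parts, cached=True)
rows_cached = list(cached.to_local_iterator())
```

In this model collect() computes the input once, the uncached iterator computes it once per partition (three times here), and the cached iterator computes it once but still runs three separate jobs.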
Overall, both methods are inefficient and should be avoided on large datasets.
As for toLocalIterator, it is used to collect the data from the RDD scattered around your cluster onto a single node, the one from which the program is running, and do something with all the data on that node. It is similar to the collect method, but instead of returning a List it returns an Iterator.
So, after applying a function to an RDD using foreach, you can call toLocalIterator to get an iterator over all the contents of the RDD and process it. However, bear in mind that if your RDD is very big, you may have memory issues. If you want to turn the result into an RDD again after doing the operations you need, use the SparkContext to parallelize it.
