Compression rate in Spark Application - apache-spark

I am doing some benchmarking in a cluster using Spark. Among other things, I want a good approximation of the average size reduction achieved by serialization and compression. I am running in client deploy-mode with the local master, and have tried the shells of both Spark 1.6 and 2.2.
I want to compute this from the in-memory size and the on-disk size, so their ratio should be my answer. I obviously have no problem getting the on-disk size, but I am really struggling with the in-memory one.
Since my RDD is made of doubles, which occupy 8 bytes each in memory, I tried counting the number of elements in the RDD and multiplying by 8, but that leaves out a lot of overhead.
The second approach was using SizeEstimator (https://spark.apache.org/docs/2.3.0/api/scala/index.html#org.apache.spark.util.SizeEstimator$), but this gives me crazy results! In Spark 1.6 it is randomly 30, 130 or 230 (47 MB on disk); in Spark 2.2 it starts at 30 and every time I execute it, it increases by 0 or 1. I know it says it is not super accurate, but I can't even find any consistency! I even tried setting the persistence level to memory only:
rdd.persist(StorageLevel.MEMORY_ONLY)
but still, nothing changed.
Is there any other way I can get the in-memory size of the RDD? Or should I try another approach? I am writing to disk with rdd.saveAsTextFile and generating the RDD via RandomRDDs.uniformRDD.
EDIT
sample code:
write
val rdd = RandomRDDs.uniformRDD(sc, nBlocks, nThreads)
rdd.persist(StorageLevel.MEMORY_ONLY_SER)
println("RDD count: " + rdd.count)
rdd.saveAsObjectFile("file:///path/to/folder")
read
val rdd = sc.wholeTextFiles(name, nThreads)
rdd.count() //action so I'm sure the file is actually read
webUI

Try caching the RDD as you mentioned and check the Storage tab of the Spark UI.
By default an RDD is stored in memory deserialised. If you want it serialised, explicitly use persist with the MEMORY_ONLY_SER option; the memory consumption will be lower. On disk, data is always stored serialised.
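If clicking through the Storage tab is not enough, here is a rough programmatic sketch (assuming a spark-shell session; sc.getRDDStorageInfo is a developer API, so treat its exact output as version-dependent):
import org.apache.spark.storage.StorageLevel

val rdd = sc.parallelize(1 to 1000000).map(_.toDouble)  // stand-in for the RandomRDDs output
rdd.persist(StorageLevel.MEMORY_ONLY_SER)
rdd.count()  // materialise the cache

// Reports what the Storage tab shows: cached size in memory and on disk per RDD
sc.getRDDStorageInfo.foreach { info =>
  println(s"${info.name}: memSize=${info.memSize} B, diskSize=${info.diskSize} B, " +
    s"cached ${info.numCachedPartitions}/${info.numPartitions} partitions")
}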

Related

PySpark OOM for multiple data files

I want to process several independent CSV files of similar size (100 MB) in parallel with PySpark.
I'm running PySpark on a single machine:
spark.driver.memory 20g
spark.executor.memory 2g
local[1]
File content:
type (has the same value within each csv), timestamp, price
First I tested it on one CSV (note I used 35 different window functions):
from pyspark.sql import functions as f
from pyspark.sql.window import Window
logData = spark.read.csv("TypeA.csv", header=False, schema=schema)
# Compute moving averages; I used 35 different window widths (i = 1..35).
for i in range(1, 36):
    w = (Window.partitionBy("type").orderBy(f.col("timestamp").cast("long")).rangeBetween(-24 * 7 * 3600 * i, 0))
    logData = logData.withColumn("moving_avg_%d" % i, f.avg("price").over(w))
# Some other simple operations... no aggregation, no sort
logData.write.parquet("res.pr")
This works great. However, I had two issues with scaling this job:
When I increased the number of window functions to 50, the job OOMed. I am not sure why PySpark doesn't spill to disk in this case, since the window functions are independent of each other.
When I ran the job on 2 CSV files, it also OOMed. Again, it is not clear why it does not spill to disk, since the window functions are essentially partitioned by CSV file, so they are independent.
The question is: why doesn't PySpark spill to disk in these two cases to prevent OOM, and how can I hint Spark to do it?
If your machine cannot run all of these at once, you can process them in sequence and write out the data of each batch of files before loading the next batch.
I'm not sure if this is what you mean, but you can try to hint Spark to write some of the data to disk instead of keeping it all in RAM with:
df.persist(StorageLevel.MEMORY_AND_DISK)
Update me if it helps.
In theory, you could process all these 600 files on one single machine. Spark should spill to disk when memory is not enough. But there are some points to consider:
The logic involves window aggregation, which results in a heavy shuffle operation. You need to check whether the OOM happened in the map or the reduce phase. The map phase processes each partition of a file, then writes the shuffle output to files. The reduce phase then needs to fetch all of these shuffle outputs from all map tasks. It is obvious that in your case you can't hold all map tasks running.
So it's highly likely that the OOM happened in the map phase. If that is the case, it means the memory per core can't process one single partition of a file. Be aware that Spark does a rough estimation of memory usage and spills when it thinks it should; since the estimation is not accurate, OOM is still possible. You can tune the partition size with the config below:
spark.sql.files.maxPartitionBytes (default 128MB)
Usually, a 128 MB input partition needs about 2 GB of heap out of a ~4 GB executor, because only part of the executor memory is usable for execution: unified execution/storage memory ≈ (executor JVM heap − ~300 MB reserved) × spark.memory.fraction (default 0.6), of which spark.memory.storageFraction (default 0.5) is set aside for storage; spark.executor.memoryOverhead (default 10%) is allocated on top of the heap.
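For example, a short Scala sketch of applying that config before reading the CSV (the 32 MB value is just an illustration; the same conf can be set from PySpark):
// Pack fewer bytes into each input partition so a single map task holds less data
spark.conf.set("spark.sql.files.maxPartitionBytes", 32L * 1024 * 1024)  // 32 MB instead of the 128 MB default
val logData = spark.read.csv("TypeA.csv")
println(s"input partitions: ${logData.rdd.getNumPartitions}")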
You can post all your configs shown in the Spark UI for further investigation.

How to tell if spark session will be able to hold data size in dataframe?

I intend to read data from an Oracle DB with PySpark (running in local mode) and store it locally as Parquet. Is there a way to tell whether a Spark session's dataframe will be able to hold the amount of data from the query (which will be the whole table, i.e. select * from mytable)? Are there common solutions for when the data would not fit in a dataframe?
* Saw a similar question here, but was a little confused by the discussion in the comments
Since you are running in local mode, I assume it is not on a cluster. You cannot say exactly how much memory will be required, but you can get close to it. Check how much disk space your table is using. Suppose mytable occupies 1 GB of disk; then Spark will require more RAM than that, because Spark's engine needs some memory for its own processing. To be on the safe side, allow about 2 GB more than the actual table size.
To check your table size in Oracle, you can use the query below:
select segment_name,segment_type,bytes/1024/1024 MB
from dba_segments
where segment_type='TABLE' and segment_name='<yourtablename>';
It will give you a result in MB.
To configure JVM-related parameters in Apache Spark, you can check this.
It doesn't matter how big the table is if you are running Spark in a distributed manner. You would need to worry about memory if:
You are reading the data in the driver and then doing a broadcast.
Caching the dataframe for some computation.
Usually a DAG is generated for your Spark application, and if you are using the JDBC source the workers read the data directly, using shuffle space and off-heap/disk spill for memory-intensive computation.
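For illustration, a hedged sketch in Scala (connection details, column names and values are made up) of a partitioned JDBC read, so the data is pulled in parallel chunks instead of one big query:
// Hypothetical connection details; partitionColumn must be a numeric, date or timestamp column
val df = spark.read.format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/service")
  .option("dbtable", "mytable")
  .option("user", "scott")
  .option("password", "****")
  .option("partitionColumn", "id")
  .option("lowerBound", "1")
  .option("upperBound", "1000000")
  .option("numPartitions", "16")  // 16 parallel reads instead of a single one
  .option("fetchsize", "10000")   // rows per round trip, keeps buffers small
  .load()

df.write.parquet("file:///path/to/output")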

Memory difference between pyspark and spark?

I have been trying to get a PySpark job to work which creates an RDD from a bunch of binary files, and then uses a flatMap operation to process the binary data into a bunch of rows. This has led to a bunch of out-of-memory errors, and after playing around with memory settings for a while I decided to get the simplest thing possible working, which is just counting the number of files in the RDD.
This also fails with an OOM error. So I opened both spark-shell and PySpark and ran the commands in the REPL/shell with default settings; the only additional parameter was --master yarn. The spark-shell version works, while the PySpark version shows the same OOM error.
Is there that much overhead to running PySpark? Or is this a problem with binaryFiles being new? I am using Spark version 2.2.0.2.6.4.0-91.
The difference:
Scala will load the records as PortableDataStream - this means the process is lazy, and unless you call toArray on the values, no data will be loaded at all.
Python will call the Java backend, but loads the data as a byte array. This part will be eager-ish and therefore might fail on both sides.
Additionally, PySpark will use at least twice as much memory - one copy for Java and one for Python.
Finally, binaryFiles (same as wholeTextFiles) is very inefficient and doesn't perform well if individual input files are large. In cases like this it is better to implement a format-specific Hadoop InputFormat.
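A small Scala sketch of that laziness (the path is illustrative; PortableDataStream only materialises bytes when asked):
// RDD[(String, PortableDataStream)]: file names plus handles, no contents read yet
val files = sc.binaryFiles("hdfs:///data/binfiles")

files.count()  // cheap: counts files without loading their bytes

// toArray() is what actually pulls a whole file into memory as an Array[Byte]
val sizes = files.map { case (path, stream) => (path, stream.toArray().length) }
sizes.take(5).foreach(println)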
Since you are reading multiple binary files with binaryFiles(): starting with Spark 2.1, the minPartitions argument of binaryFiles() is ignored.
1. Try to repartition the input files based on the following:
rdd = sc.binaryFiles(<path to the binary files>, minPartitions=...).repartition(...)
2. You may try reducing the partition size to 64 MB or less, depending on the size of your data, using the configs below (see the sketch after this list):
spark.files.maxPartitionBytes, default 128 MB
spark.files.openCostInBytes, default 4 MB
spark.default.parallelism
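For example, a hedged Scala sketch of the two suggestions (paths and numbers are illustrative):
// These are core configs, so set them before the SparkContext is created, e.g.:
//   spark-shell --conf spark.files.maxPartitionBytes=67108864 \
//               --conf spark.files.openCostInBytes=4194304
val rdd = sc.binaryFiles("hdfs:///data/binfiles")
println(s"partitions from binaryFiles: ${rdd.getNumPartitions}")

// ...or simply repartition after reading, as in point 1
val repartitioned = rdd.repartition(200)  // 200 is just an example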

Apache Spark running out of memory with smaller amount of partitions

I have a Spark application that keeps running out of memory. The cluster has two nodes with around 30 GB of RAM each, and the input data size is a few hundred GB.
The application is a Spark SQL job: it reads data from HDFS, creates a table and caches it, then runs some Spark SQL queries and writes the result back to HDFS.
Initially I split the data into 64 partitions and got OOM; then I was able to fix the memory issue by using 1024 partitions. But why did using more partitions help me solve the OOM issue?
The solution to big data is partitioning (divide and conquer), since not all the data can fit into memory, nor can it be processed on a single machine.
Each partition can fit into memory and be processed (map) in a relatively short time. After the data is processed for each partition, it needs to be merged (reduce). This is traditional map-reduce.
Splitting the data into more partitions means that each partition gets smaller.
[Edit]
Spark uses a concept called Resilient Distributed Dataset (RDD).
There are two types of operations: transformations and actions.
Transformations map one RDD to another. They are lazily evaluated; those RDDs can be treated as intermediate results we don't want to materialise.
Actions are used when you really want to get the data; those RDDs/data can be treated as the result we want, for example taking the top N elements.
Spark analyses all the operations and creates a DAG (Directed Acyclic Graph) before execution.
Spark starts computing from the source RDD when an action is fired, and then forgets the intermediates.
(DAG diagram omitted; source: cloudera.com)
I made a small screencast for a presentation on YouTube: Spark Makes Big Data Sparking.
"Spark's operators spill data to disk if it does not fit in memory, allowing it to run well on any sized data." The issue is with large partitions generating OOM.
Partitions determine the degree of parallelism. The Apache Spark docs say that the number of partitions should be at least equal to the number of cores in the cluster.
Fewer partitions result in:
less concurrency,
increased memory pressure for transformations that involve a shuffle,
more susceptibility to data skew.
Too many partitions can also have a negative impact:
too much time spent scheduling many tasks.
When you store your data on HDFS, it is already partitioned into 64 MB or 128 MB blocks as per your HDFS configuration. When reading HDFS files with Spark, the number of DataFrame partitions df.rdd.getNumPartitions depends on the following properties (see the sketch after this list):
spark.default.parallelism (Cores available for the application)
spark.sql.files.maxPartitionBytes (default 128MB)
spark.sql.files.openCostInBytes (default 4MB)
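A small Scala sketch (illustrative path, format and values) of checking and influencing that partition count before caching:
val df = spark.read.parquet("hdfs:///warehouse/mytable")
println(s"partitions with defaults: ${df.rdd.getNumPartitions}")

// Ask for smaller input splits, so each task holds less data
spark.conf.set("spark.sql.files.maxPartitionBytes", 64L * 1024 * 1024)  // 64 MB
val dfSmall = spark.read.parquet("hdfs:///warehouse/mytable")
println(s"partitions after tuning: ${dfSmall.rdd.getNumPartitions}")

// Or set the partition count explicitly before caching and running the SQL queries
val prepared = dfSmall.repartition(1024).cache()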
Links :
https://spark.apache.org/docs/latest/tuning.html
https://databricks.com/session/a-deeper-understanding-of-spark-internals
https://spark.apache.org/faq.html
During a Spark Summit talk, Aaron Davidson gave some tips about partition tuning. He summarised a reasonable number of partitions in the 3 points below:
Commonly between 100 and 10000 partitions (note: the two points below are more reliable, because "commonly" depends on the size of the dataset and the cluster)
lower bound = at least 2*the number of cores in the cluster
upper bound = task must finish within 100 ms
Rockie's answer is right, but it doesn't address the point of your question.
When you cache an RDD, all of its partitions are persisted (according to the storage level) - respecting the spark.memory.fraction and spark.memory.storageFraction properties.
Besides that, at a certain moment Spark can automatically evict some partitions from memory (or you can do this manually for the entire RDD with RDD.unpersist()), according to the documentation.
Thus, with more (and therefore smaller) partitions, Spark keeps only as many of them in the LRU cache as fit, so they do not cause OOM (this may have a negative impact too, like the need to re-cache partitions).
Another important point is that when you write the result back to HDFS using X partitions, you have X tasks for all your data: take the total data size and divide it by X, and that is roughly the memory for each task, executed on each (virtual) core. So it is not difficult to see that X = 64 leads to OOM while X = 1024 does not (see the worked example below).
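A rough worked example with the numbers from the question: assuming ~300 GB of input ("a few hundred GB") on two nodes with ~30 GB of RAM each, X = 64 gives about 300 GB / 64 ≈ 4.7 GB per task, which a handful of concurrent tasks cannot hold next to the cached table; X = 1024 gives about 300 GB / 1024 ≈ 0.3 GB per task, which fits comfortably.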

high cache size for RDD in apache Spark

I am reading a text file of ~20 MB consisting of rows of space-separated integers, converting it to an RDD and caching it. On caching, I observed it consumed ~200 MB of RAM!
I don't understand why caching consumes such a high amount of RAM (10x).
val filea = sc.textFile("a.txt")
val fileamapped = filea.map(_.split(" ").map(_.toInt))
fileamapped.persist(org.apache.spark.storage.StorageLevel.MEMORY_ONLY)
fileamapped.collect()
I am running Spark in local interactive mode (spark-shell) and reading the data file from HDFS.
Questions
What are the reasons behind the high RAM use for caching?
Is there a way I can directly read integers from the file? sc.textFile gives me an RDD[String].
I checked fileamapped with the estimate() method and it shows ~64 MB; would that be Java object sizes?
thanks,
