Is there a size limit for Spark's RDD - apache-spark

Does Spark's RDD have a size limit?
In my specific case, can an RDD have 2^400 columns?

The first part of Avishek's answer is a bit out of date as of Spark 2.4.0. At the time of writing, almost all of the 2GB limits throughout the Spark source have been resolved: https://issues.apache.org/jira/browse/SPARK-6235. That said, your table width is still problematic.
In practice, your RDD is all but guaranteed poor read/write times even when each partition holds only a single row; 2^400 is still an enormous number! Conservatively assuming each column holds 10 bytes of data, a single row is approximately:
(10 bytes / col) * 2.6 * 10^120 cols
= 2.6 * 10^121 bytes
= 2.6 * 10^112 gigabytes
That is huge! Do you really need 2^400 columns?
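To get a feel for the magnitude, here is a small sketch (Scala, using BigInt since 2^400 overflows every primitive type) that reproduces the estimate above; the 10-bytes-per-column figure is the same assumption as in the answer.
val cols         = BigInt(2).pow(400)           // ≈ 2.58e120 columns
val bytesPerCol  = BigInt(10)                   // assumption: ~10 bytes per column
val rowBytes     = cols * bytesPerCol           // ≈ 2.6e121 bytes for a single row
val rowGigabytes = rowBytes / BigInt(10).pow(9) // ≈ 2.6e112 GB
println(s"one row ≈ $rowBytes bytes ≈ $rowGigabytes GB")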

Theoretically an RDD doesn't have a size limit, nor does it limit the number of columns you can store. However, there is a limitation in Spark that caps each RDD partition at 2GB. See Here
So you can store the 2^400 columns in an RDD, as long as each partition stays below 2GB.
There are, however, practical problems with having 2^400 columns. To stay within the current Spark limitation, such a huge number of columns would force you to repartition the data into a very large number of partitions, which will probably reduce efficiency.
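As a rough illustration of that sizing exercise, here is a sketch (Scala) of picking a partition count that keeps every partition under the 2GB cap; the row count and per-row size are made-up numbers, and rdd stands in for the questioner's RDD.
// Hypothetical numbers, purely for illustration.
val rowCount        = 1000000L                     // assumed number of rows
val estRowSizeBytes = 50L * 1024 * 1024            // assumed ~50 MB per (very wide) row
val partitionCap    = 2L * 1024 * 1024 * 1024      // the 2GB partition limit discussed above
// Smallest partition count keeping every partition under the cap (assuming an even spread).
val numPartitions = ((rowCount * estRowSizeBytes + partitionCap - 1) / partitionCap).toInt
val resized       = rdd.repartition(numPartitions) // rdd: the questioner's RDD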

Related

How to repartition spark dataframe into smaller partitions

I have a dataframe which is partitioned by date.
In normal processing, I am processing a week of data at a time, so this means I have 7 partitions. I would like to increase this number of partitions, but without having to shuffle data or have a mix of dates in the same partition.
I've tried using df.repartition(20, my_date_column), but this just results in 13 empty partitions since the hash partitioner will only get 7 distinct values.
I've also tried using df.repartition(20, my_date_column, unique_id), which does increase the number of partitions to 20, but it means that dates are mixed within the partitions.
Is what I'm trying to do possible?
Perhaps you can increase the number of partitions by setting spark.sql.files.maxPartitionBytes to a smaller value. According to the tuning guide:
Property Name: spark.sql.files.maxPartitionBytes
Default: 134217728 (128 MB)
Meaning: The maximum number of bytes to pack into a single partition when reading files. This configuration is effective only when using file-based sources such as Parquet, JSON and ORC.
Since Version: 2.0.0
Alternatively, you can try spark.sql.files.minPartitionNum, but it is only a suggestion, not a guarantee, of the number of partitions.
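For concreteness, a minimal sketch of setting these two properties before the read; the 32 MB value, the target count and the path are illustrative assumptions, not recommendations.
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("partition-tuning").getOrCreate()
// A smaller read-side split size means more, smaller partitions from file-based sources.
spark.conf.set("spark.sql.files.maxPartitionBytes", 32L * 1024 * 1024)
// Only a hint to the planner, not a guarantee of the final partition count.
spark.conf.set("spark.sql.files.minPartitionNum", 20L)
val df = spark.read.parquet("/data/events_by_date")  // hypothetical path
println(df.rdd.getNumPartitions)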

Spark Repartition and Coalesce

If I want to repartition a dataframe, how do I decide on the number of partitions to create? And how do I decide whether to use repartition or coalesce?
I understand that coalesce is basically used only to reduce the number of partitions. But how can we decide which to use in what scenario?
We can't decide this based on one specific parameter; multiple factors determine how many partitions to use and whether to repartition or coalesce:
*Based on the size of the data: if a file is very big you can assign 2 or 3 partitions per block to increase performance, but too many partitions split the data into small files, and in big data small files lower performance.
1 block (128 MB) --> 128/2 = 64 MB per partition, so 1 mapper will run per 64 MB.
*Based on the cluster size: if you have more executors/cores sitting free, you can size the partition count accordingly.
*repartition causes a complete shuffle, while coalesce avoids the complete shuffle (see the sketch below).
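A minimal sketch contrasting the two calls; the path and the partition counts are placeholders.
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("repartition-vs-coalesce").getOrCreate()
val df = spark.read.parquet("/data/input")  // hypothetical path
// repartition: full shuffle, can increase or decrease the count, evens out the data.
val widened = df.repartition(200)
// coalesce: merges existing partitions without a full shuffle, can only decrease the count.
val narrowed = df.coalesce(10)
println(widened.rdd.getNumPartitions)   // 200
println(narrowed.rdd.getNumPartitions)  // at most 10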

Empty Files in output spark

I am writing my dataframe like below
df.write().format("com.databricks.spark.avro").save("path");
However, I am getting around 200 files, of which around 30-40 are empty. I understand that this might be due to empty partitions. I then updated my code like:
df.coalesce(50).write().format("com.databricks.spark.avro").save("path");
But I feel it might impact performance. Is there any better approach to limit the number of output files and remove the empty files?
You can remove the empty partitions in your RDD before writing by using the repartition method.
The default number of partitions is 200.
The suggested number of partitions is: number of partitions = number of cores * 4.
Repartition your dataframe using this method. To eliminate skew and ensure an even distribution of data, choose columns with high cardinality (a large number of unique values) for the partitionExprs argument.
Since the default number of RDD partitions is 200, you have to shuffle to remove the skewed partitions.
You can either use the repartition method on the RDD, or use the DISTRIBUTE BY clause on the dataframe, which repartitions while distributing the data evenly among partitions.
def repartition(numPartitions: Int, partitionExprs: Column*): Dataset[T]
Returns a dataset instance with the proper partitions.
You may also use repartitionAndSortWithinPartitions, which can improve the compression ratio.
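Putting the advice above together, a sketch of the write path; the column name, the output path, and the use of the question's df and an active spark session are assumptions.
import org.apache.spark.sql.functions.col
// Assumes the question's `df` and an active `spark` session.
val numPartitions = spark.sparkContext.defaultParallelism * 4  // the "cores * 4" rule of thumb above
df.repartition(numPartitions, col("user_id"))  // high-cardinality column => even, non-empty partitions
  .write
  .format("com.databricks.spark.avro")
  .save("/output/path")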

Apache Spark running out of memory with smaller amount of partitions

I have a Spark application that keeps running out of memory. The cluster has two nodes with around 30G of RAM, and the input data size is a few hundred GB.
The application is a Spark SQL job: it reads data from HDFS, creates a table and caches it, then runs some Spark SQL queries and writes the result back to HDFS.
Initially I split the data into 64 partitions and got OOM; I was then able to fix the memory issue by using 1024 partitions. But why did using more partitions help me solve the OOM issue?
The solution to big data is partitioning (divide and conquer): not all of the data can fit into memory, nor can it all be processed on a single machine.
Each partition can fit into memory and be processed (map) in a relatively short time. After each partition has been processed, the results need to be merged (reduce). This is traditional map-reduce.
Splitting the data into more partitions means that each partition gets smaller.
[Edit]
Spark uses a concept called the Resilient Distributed Dataset (RDD).
There are two types of operations: transformations and actions.
Transformations map one RDD to another and are lazily evaluated. Those RDDs can be treated as intermediate results we don't want to materialize.
Actions are used when you really want to get the data. The resulting RDDs/data are what we actually want, such as taking the top N elements.
Spark analyses all the operations and creates a DAG (Directed Acyclic Graph) before execution.
Spark starts computing from the source RDD when an action is fired, and then forgets the intermediates.
(DAG diagram omitted; source: cloudera.com)
I made a small screencast for a presentation on YouTube: Spark Makes Big Data Sparking.
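A small sketch of the transformation/action distinction described above; the file path is hypothetical.
import org.apache.spark.sql.SparkSession
val spark  = SparkSession.builder().appName("lazy-eval").getOrCreate()
val lines  = spark.sparkContext.textFile("/data/app.log")  // nothing is read yet
val errors = lines.filter(_.contains("ERROR"))             // transformation: only extends the DAG
val count  = errors.count()                                // action: triggers the actual computation
println(count)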
The Spark FAQ says that "Spark's operators spill data to disk if it does not fit in memory, allowing it to run well on any sized data." The issue here is large partitions generating OOM.
Partitions determine the degree of parallelism. The Apache Spark docs say that the number of partitions should be at least equal to the number of cores in the cluster.
Fewer partitions result in:
Less concurrency,
Increased memory pressure for transformations that involve a shuffle,
More susceptibility to data skew.
Too many partitions can also have a negative impact:
Too much time spent scheduling many small tasks.
When you store your data on HDFS, it is already partitioned into 64 MB or 128 MB blocks as per your HDFS configuration. When reading HDFS files with Spark, the number of DataFrame partitions (df.rdd.getNumPartitions) depends on the following properties:
spark.default.parallelism (Cores available for the application)
spark.sql.files.maxPartitionBytes (default 128MB)
spark.sql.files.openCostInBytes (default 4MB)
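Roughly speaking, and glossing over version differences, the read split size is derived from those three properties; below is a sketch of the calculation, modelled on FilePartition.maxSplitBytes in recent Spark releases (the example numbers are illustrative).
// Sketch of how the read split size is derived from the three properties above.
def estimatedSplitBytes(totalFileBytes: Long,
                        numFiles: Long,
                        maxPartitionBytes: Long = 128L * 1024 * 1024,
                        openCostInBytes: Long = 4L * 1024 * 1024,
                        parallelism: Long = 8L): Long = {
  val bytesPerCore = (totalFileBytes + numFiles * openCostInBytes) / parallelism
  math.min(maxPartitionBytes, math.max(openCostInBytes, bytesPerCore))
}
// e.g. 10 GB spread over 100 files on 8 cores: splits are capped at 128 MB,
// so roughly 10 GB / 128 MB ≈ 80 read partitions.
println(estimatedSplitBytes(10L * 1024 * 1024 * 1024, 100))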
Links:
https://spark.apache.org/docs/latest/tuning.html
https://databricks.com/session/a-deeper-understanding-of-spark-internals
https://spark.apache.org/faq.html
During Spark Summit, Aaron Davidson gave some tips about partition tuning. He also summarized a reasonable number of partitions in the three points below:
Commonly between 100 and 10,000 partitions (note: the two points below are more reliable, because "commonly" depends on the sizes of the dataset and the cluster)
lower bound = at least 2 * the number of cores in the cluster
upper bound = each task should finish within 100 ms
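A quick sketch applying those bounds; the cluster size here is just an example.
// Illustration only: derive a starting partition count from the heuristics above.
val coresInCluster = 48                         // example cluster size
val lowerBound     = 2 * coresInCluster         // at least 2x the number of cores
val numPartitions  = math.max(lowerBound, 100)  // and commonly at least ~100
println(numPartitions)                          // tune upward until tasks finish in ~100 ms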
Rockie's answer is right, but he doesn't get the point of your question.
When you cache an RDD, all of its partitions are persisted (according to the storage level), respecting the spark.memory.fraction and spark.memory.storageFraction properties.
Besides that, at a certain moment Spark can automatically drop some partitions from memory (or you can do it manually for the entire RDD with RDD.unpersist()), according to the documentation.
Thus, when you have more partitions, each one is smaller, and Spark's LRU eviction can drop individual partitions without causing OOM (this can have a negative impact too, such as the need to re-cache partitions).
Another important point: when you write the result back to HDFS using X partitions, you have X tasks over all of your data. Take the total data size and divide it by X, and that is roughly the memory each task needs, where each task runs on one (virtual) core. So it's not difficult to see why X = 64 leads to OOM while X = 1024 does not.
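A rough per-task calculation along those lines; the ~300 GB figure is an assumption standing in for the question's "few hundred GB".
// Rough arithmetic: per-task data volume for the two partition counts from the question.
val totalBytes = 300L * 1024 * 1024 * 1024     // assume ~300 GB of input
println(totalBytes / 64   / (1024 * 1024))     // ≈ 4800 MB per task: heavy for a 30G node running several tasks
println(totalBytes / 1024 / (1024 * 1024))     // ≈ 300 MB per task: fits comfortably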

Maximum size of rows in Spark jobs using Avro/Parquet

I am planning to use Spark to process data where each individual element/row in the RDD or DataFrame may occasionally be large (up to several GB).
The data will probably be stored in Avro files in HDFS.
Obviously, each executor must have enough RAM to hold one of these "fat rows" in memory, and some to spare.
But are there other limitations on row size for Spark/HDFS or for the common serialisation formats (Avro, Parquet, Sequence File...)? For example, can individual entries/rows in these formats be much larger than the HDFS block size?
I am aware of published limitations for HBase and Cassandra, but not Spark...
There are currently some fundamental limitations related to block size, both for partitions in use and for shuffle blocks - both are limited to 2GB, which is the maximum size of a ByteBuffer (because it takes an int index, so is limited to Integer.MAX_VALUE bytes).
The maximum size of an individual row will normally need to be much smaller than the maximum block size, because each partition will normally contain many rows, and the largest rows might not be evenly distributed among partitions - if by chance a partition contains an unusually large number of big rows, this may push it over the 2GB limit, crashing the job.
See:
Why does Spark RDD partition has 2GB limit for HDFS?
Related Jira tickets for these Spark issues:
https://issues.apache.org/jira/browse/SPARK-1476
https://issues.apache.org/jira/browse/SPARK-5928
https://issues.apache.org/jira/browse/SPARK-6235
