Is there a way to check if a variable in Spark is parallelizable? - apache-spark

So I am using groupByKey function in spark, but its not being parallelized, as I can see that during its execution, only 1 core is being used. It seems that the data I'm working with doesn't allow parallelization. Is there a way in spark to know if the input data is amicable to parallelization or if it's not a proper RDD?

The unit of parallelization in Spark is the 'partition'. That is, RDDs are split in partitions and transformations are applied to each partition in parallel. How RDD data is distributed across partitions is determined by the Partitioner. By default, the HashPartitioner is used which should work fine for most purposes.
You can check how many partitions your RDD is split into using:
rdd.partitions // Array of partitions

Related

Difference between shuffle partition and repartition

I am a newbie in spark and I am trying to understand shuffle partition and repartition function. But i still dont understand how they are different. Both reduces the number of partition??
Thank you
The biggest difference between shuffle partition and repartition is when things are defined.
The configuration spark.sql.shuffle.partitions is a property and according to the documentation
Configures the number of partitions to use when shuffling data for joins or aggregations.
That means, every time you run a Join or any type of aggregation in spark that will shuffle the data according to the configuration, where the default value is 200. So if you join two datasets the number of partitions in the shuffle will be 200.
The repartition(numPartitions, *cols) function is applied during an execution, where you can define how many partitions you will write, that usually is for output writing based in partition columns or just number. The example in the documentation is pretty good to show.
So in general, Shuffle Partition is for Joins and Aggregations during the execution. Repartition is for number of output files, based in number or partition column.

Why Spark create less partitions than the number of files whem reading from S3

I'm using Spark 2.3.1.
I have a job that reads 5.000 small parquet files into s3.
When I do a mapPartitions followed by a collect, only 278 tasks are used (I would have expected 5000). Why ?
Spark is grouping multiple files into each partition due to their small size. You should see as much when you print out the partitions.
Example (Scala):
val df = spark.read.parquet("/path/to/files")
df.rdd.partitions.foreach(println)
If you want to use 5,000 task you could do a repartition transformation.
Quote from the docs about repartition:
Reshuffle the data in the RDD randomly to create either more or fewer
partitions and balance it across them. This always shuffles all data
over the network.
I recommend you take a look at the RDD Programming Guide. Remember that shuffle is an expensive operation.

Managing Spark partitions after DataFrame unions

I have a Spark application that will need to make heavy use of unions whereby I'll be unioning lots of DataFrames together at different times, under different circumstances. I'm trying to make this run as efficiently as I can. I'm still pretty much brand-spanking-new to Spark, and something occurred to me:
If I have DataFrame 'A' (dfA) that has X number of partitions (numAPartitions), and I union that to DataFrame 'B' (dfB) which has Y number of partitions (numBPartitions), then what will the resultant unioned DataFrame (unionedDF) look like, with result to partitions?
// How many partitions will unionedDF have?
// X * Y ?
// Something else?
val unionedDF : DataFrame = dfA.unionAll(dfB)
To me, this seems like its very important to understand, seeing that Spark performance seems to rely heavily on the partitioning strategy employed by DataFrames. So if I'm unioning DataFrames left and right, I need to make sure I'm constantly managing the partitions of the resultant unioned DataFrames.
The only thing I can think of (so as to properly manage partitions of unioned DataFrames) would be to repartition them and then subsequently persist the DataFrames to memory/disk as soon as I union them:
val unionedDF : DataFrame = dfA.unionAll(dfB)
unionedDF.repartition(optimalNumberOfPartitions).persist(StorageLevel.MEMORY_AND_DISK)
This way, as soon as they are unioned, we repartition them so as to spread them over the available workers/executors properly, and then the persist(...) call tells to Spark to not evict the DataFrame from memory, so we can continue working on it.
The problem is, repartitioning sounds expensive, but it may not be as expensive as the alternative (not managing partitions at all). Are there generally-accepted guidelines about how to efficiently manage unions in Spark-land?
Yes, Partitions are important for spark.
I am wondering if you could find that out yourself by calling:
yourResultedRDD.getNumPartitions()
Do I have to persist, post union?
In general, you have to persist/cache an RDD (no matter if it is the result of a union, or a potato :) ), if you are going to use it multiple times. Doing so will prevent spark from fetching it again in memory and can increase the performance of your application by 15%, in some cases!
For example if you are going to use the resulted RDD just once, it would be safe not to do persist it.
Do I have to repartition?
Since you don't care about finding the number of partitions, you can read in my memoryOverhead issue in Spark
about how the number of partitions affects your application.
In general, the more partitions you have, the smaller the chunk of data every executor will process.
Recall that a worker can host multiple executors, you can think of it like the worker to be the machine/node of your cluster and the executor to be a process (executing in a core) that runs on that worker.
Isn't the Dataframe always in memory?
Not really. And that's something really lovely with spark, since when you handle bigdata you don't want unnecessary things to lie in the memory, since this will threaten the safety of your application.
A DataFrame can be stored in temporary files that spark creates for you, and is loaded in the memory of your application only when needed.
For more read: Should I always cache my RDD's and DataFrames?
Union just add up the number of partitions in dataframe 1 and dataframe 2. Both dataframe have same number of columns and same order to perform union operation. So no worries, if partition columns different in both the dataframes, there will be max m + n partitions.
You doesn't need to repartition your dataframe after join, my suggestion is to use coalesce in place of repartition, coalesce combine common partitions or merge some small partitions and avoid/reduce shuffling data within partitions.
If you cache/persist dataframe after each union, you will reduce performance and lineage is not break by cache/persist, in that case, garbage collection will clean cache/memory in case of some heavy memory intensive operation and recomputing will increase computation time for the same, may be this time partial computation is required for clear/removed data.
As spark transformation are lazy, i.e; unionAll is lazy operation and coalesce/repartition is also lazy operation and come in action at the time of first action, so try to coalesce unionall result after an interval like counter of 8 and reduce partition in resulting dataframe. Use checkpoints to break lineage and store data, if there is lots of memory intensive operation in your solution.

Spark SQL(Hive query through HiveContext) always creating 31 partitions

I am running hive queries using HiveContext from my Spark code. No matter which query I run and how much data it is, it always generates 31 partitions. Anybody knows the reason? Is there a predefined/configurable setting for it? I essentially need more partitions.
I using this code snippet to execute hive query:
var pairedRDD = hqlContext.sql(hql).rdd.map(...)
I am using Spark 1.3.1
Thanks,
Nitin
The number of partitions in an RDD is the same as the number of partitions in the RDD on which it depends, with a couple exceptions: the coalesce transformation allows creating an RDD with fewer partitions than its parent RDD, the union transformation creates an RDD with the sum of its parents’ number of partitions, and cartesian creates an RDD with their product.
To increase number of partitions
Use the repartition transformation, which will trigger a shuffle.
Configure your InputFormat to create more splits.
Write the input data out to HDFS with a smaller block size.
This link here has good explanation of how the number of partitions are defined and how to increase the number of partitions.

How to split the input file in Apache Spark

Suppose I have an input file of size 100MB. It contains large number of points (lat-long pair) in CSV format. What should I do in order to split the input file in 10 10MB files in Apache Spark or how do I customize the split.
Note: I want to process a subset of the points in each mapper.
Spark's abstraction doesn't provide explicit split of data. However you can control the parallelism in several ways.
Assuming you use YARN, HDFS file is automatically split into HDFS blocks and they're processed concurrently when Spark action is running.
Apart from HDFS parallelism, consider using partitioner with PairRDD. PairRDD is data type of RDD of key-value pairs and a partitioner manages mapping from a key to a partition. Default partitioner reads spark.default.parallelism. The partitioner helps to control the distribution of data as well as its locality in PairRDD-specific actions, e.g., reduceByKey.
Take a look at following documentation about Spark data parallelism.
http://spark.apache.org/docs/1.2.0/tuning.html
After searching through the Spark API I have found one method partition which returns the number of partitions of the JavaRDD. At the time of JavaRDD creation we have repartitioned it to desired number of partitions as told by #Nick Chammas.
JavaRDD<String> lines = ctx.textFile("/home/hduser/Spark_programs/file.txt").repartition(5);
List<Partition> partitions = lines.partitions();
System.out.println(partitions.size());

Resources