Apache Spark: get the number of RDDs created in memory - apache-spark

I'm using MLlib with Python (PySpark) and would like to know the number of RDDs created in memory before my code executes. I'm performing transformations and actions on RDDs, so I would just like to know the total number of RDDs that get created in memory.

The number of RDDs depends on your program.
But I think what you really want to know here is the number of partitions an RDD is created with:
for that you can use rdd.getNumPartitions().
Refer to: Show partitions on a pyspark RDD

First of all, as you asked about the number of RDDs: that depends on how you write your application code. There can be one or more RDDs in your application.
You can, however, find the number of partitions in an RDD.
For Scala:
someRDD.partitions.size
For PySpark:
someRDD.getNumPartitions()
If there is more than one RDD in your application, you can count the partitions of each RDD and sum them; that total is the overall number of partitions, as sketched below.
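For example, here is a minimal Scala sketch that sums partition counts over several RDDs. The RDD definitions and the input path are placeholders of mine, not something from the question:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object TotalPartitions {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("total-partitions").getOrCreate()
    val sc = spark.sparkContext

    // Two example RDDs; in a real application these would be your own RDDs.
    val rdd1 = sc.parallelize(1 to 1000, 8)
    val rdd2 = sc.textFile("hdfs:///some/input/path") // placeholder path

    // Partition count of a single RDD.
    println(s"rdd1 partitions: ${rdd1.getNumPartitions}")

    // Sum the partition counts of every RDD you care about.
    val allRdds: Seq[RDD[_]] = Seq(rdd1, rdd2)
    println(s"total partitions: ${allRdds.map(_.getNumPartitions).sum}")

    spark.stop()
  }
}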

Related

Glue Spark: some tasks have 0 records for shuffle but some tasks have disk spill

I have a Spark job where some tasks have zero records output and zero shuffle read size, while other tasks have memory and disk spill. Can someone help me understand what I can do to optimize the execution?
Execution info: repartition_cnt=3500 (the datasets are in S3 and execution is through Glue G2X with 298 DPUs)
Code:
fct_ate_df.repartition(expr(s"pmod(hash(mae_id, rowsin, dep), $repartition_cnt)"))
.write
.mode("overwrite")
.format("parquet")
.bucketBy(repartition_cnt, "rowsin", "dep")
.sortBy("rowsin","dep")
.option("path", s"s3://b222-id/data22te=$dat22et_date")
.saveAsTable(s"btemp.intte_${table_name}_${regd}")
Summary Metrics (screenshots): some tasks show no record output / shuffle read, while others show spill records.
You are using repartition by expression, and I think this is the reason why you see those empty partitions. In this case Spark internally uses hash partitioning, and that partitioner does not guarantee that partitions will be equal in size.
Due to the hash algorithm you can be sure that records with the same expression value end up in the same partition, but you may also end up with empty partitions, or with partitions which have, for example, 5 keys inside.
In this case numPartitions does not change anything: when many keys land in the same bucket (and therefore the same partition), fewer partitions than numPartitions actually receive data, and Spark generates empty partitions, as you can see in your example.
I think that if you want to have equal partitions you may remove the expression in which you are calculating the hash and leave only $repartition_cnt.
Thanks to that, Spark will use round-robin partitioning instead, and that will generate equal partitions.
If you want to dig deeper you may take a look at the source code; I think these are nice starting points:
Here you can find the logic connected to repartition without an expression: Spark source code
Here you can find the logic used for partitioning by expression: Spark source code
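As a rough sketch of the suggested change (this reuses the DataFrame, the column names and repartition_cnt from the question; the source table I read from is a placeholder of mine):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

val spark = SparkSession.builder().appName("repartition-sketch").getOrCreate()

val repartition_cnt = 3500                        // value from the question
val fct_ate_df = spark.table("btemp.some_source") // placeholder for the real DataFrame

// What the question does: repartition by a hash expression. Internally this is
// hash partitioning, so several keys can land in the same bucket and some of
// the 3500 output partitions stay empty.
val byExpression = fct_ate_df.repartition(
  expr(s"pmod(hash(mae_id, rowsin, dep), $repartition_cnt)"))

// The suggestion above: pass only the partition count. Internally this is
// round-robin partitioning, which spreads rows evenly over exactly
// repartition_cnt partitions, so none of them stay empty.
val roundRobin = fct_ate_df.repartition(repartition_cnt)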

What is the initial number of partitions created for a dataframe?

I am new to Spark. I am trying to understand the number of partitions produced by default by a hiveContext.sql("query") statement. I know that we can repartition the dataframe after it has been created using df.repartition. But, what is the number of partitions produced by default when the dataframe is initially created?
I understand that sc.parallelize and some other transformations produce the number of partitions according to spark.default.parallelism. But what about a dataframe? I saw some answers saying that the setting spark.sql.shuffle.partitions determines the number of partitions for shuffle operations like join. Does this give the initial number of partitions when a dataframe is created?
Then I also saw some answers explaining that the number of partitions is determined by:
mapred.min.split.size,
mapred.max.split.size and
the Hadoop block size.
Then, when I tried it practically, I read 10 million records into a dataframe in a spark-shell launched with 2 executors and 4 cores per executor. When I did df.rdd.getNumPartitions, I got the value 1. How am I getting 1 for the number of partitions? Isn't 2 the minimum number of partitions?
When I do a count on the dataframe, I see that 200 tasks are being launched. Is this due to the spark.sql.shuffle.partitions setting?
I am totally confused! Can someone please answer my questions? Any help would be appreciated. Thank you!
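One quick way to see which of these settings is in play (a sketch for a Spark 2.x spark-shell, where spark and sc already exist; the query is just a stand-in for your own):

// Stand-in for the query that read the 10 million records.
val df = spark.sql("SELECT 1 AS x")

// Partitions of the DataFrame as it was initially created.
println(df.rdd.getNumPartitions)

// Parallelism used by sc.parallelize and similar RDD operations.
println(sc.defaultParallelism)

// Partitions produced by Spark SQL shuffles (joins, aggregations, ...).
// This defaults to 200, which is where the 200 tasks come from.
println(spark.conf.get("spark.sql.shuffle.partitions"))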

Number of Partitions of Spark Dataframe

Can anyone explain the number of partitions that will be created for a Spark Dataframe?
I know that for an RDD we can specify the number of partitions while creating it, like below.
val RDD1 = sc.textFile("path" , 6)
But for a Spark dataframe it looks like we do not have an option to specify the number of partitions at creation time, as we do for an RDD.
The only possibility I can think of is to use the repartition API after creating the dataframe.
df.repartition(4)
So can anyone please let me know if we can specify the number of partitions while creating a dataframe?
You cannot, or at least not in the general case, but it is not that different compared to an RDD. For example, the textFile code you've provided sets only a lower limit on the number of partitions.
In general:
Datasets generated locally using methods like range or toDF on a local collection will use spark.default.parallelism.
Datasets created from an RDD inherit the number of partitions from the parent RDD.
Datasets created using the data source API:
In Spark 1.x this typically depends on the Hadoop configuration (min / max split size).
In Spark 2.x there is a Spark SQL specific configuration in use.
Some data sources may provide additional options which give more control over partitioning. For example, the JDBC source allows you to set the partitioning column, the range of values and the desired number of partitions (see the sketch below).
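Here is a hedged sketch of those JDBC options, assuming a spark session named spark (e.g. in spark-shell); the URL, table, credentials and column names are all placeholders of mine:

val jdbcDf = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db-host:5432/mydb")
  .option("dbtable", "public.events")
  .option("user", "reader")
  .option("password", "secret")
  .option("partitionColumn", "event_id") // must be a numeric, date or timestamp column
  .option("lowerBound", "1")
  .option("upperBound", "1000000")
  .option("numPartitions", "16")         // desired number of partitions
  .load()

println(jdbcDf.rdd.getNumPartitions)     // 16, one per generated predicate range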
Default number of shuffle partitions for a Spark dataframe: 200 (spark.sql.shuffle.partitions).
Default number of partitions for an RDD: 10.

Spark SQL (Hive query through HiveContext) always creating 31 partitions

I am running Hive queries using HiveContext from my Spark code. No matter which query I run and how much data it is, it always generates 31 partitions. Does anybody know the reason? Is there a predefined/configurable setting for it? I essentially need more partitions.
I am using this code snippet to execute the Hive query:
var pairedRDD = hqlContext.sql(hql).rdd.map(...)
I am using Spark 1.3.1
The number of partitions in an RDD is the same as the number of partitions in the RDD on which it depends, with a couple exceptions: the coalesce transformation allows creating an RDD with fewer partitions than its parent RDD, the union transformation creates an RDD with the sum of its parents’ number of partitions, and cartesian creates an RDD with their product.
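A quick way to see those rules in action, assuming the sc of a spark-shell session (the RDDs below are throwaway examples):

val a = sc.parallelize(1 to 100, 4)       // 4 partitions
val b = sc.parallelize(1 to 100, 3)       // 3 partitions

println(a.coalesce(2).getNumPartitions)   // 2  -- fewer than the parent's 4
println(a.union(b).getNumPartitions)      // 7  -- the sum, 4 + 3
println(a.cartesian(b).getNumPartitions)  // 12 -- the product, 4 * 3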
To increase the number of partitions:
Use the repartition transformation, which will trigger a shuffle (see the sketch below).
Configure your InputFormat to create more splits.
Write the input data out to HDFS with a smaller block size.
This link has a good explanation of how the number of partitions is defined and how to increase it.
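For the first option, a minimal sketch that reuses the hqlContext from the question; the query and the target count of 128 are assumptions of mine:

// hqlContext is the HiveContext already created in the question's code.
val rows = hqlContext.sql("SELECT key, value FROM some_table").rdd

println(rows.partitions.size)            // e.g. 31, as observed in the question

// repartition triggers a full shuffle and produces exactly 128 partitions.
val morePartitions = rows.repartition(128)
println(morePartitions.partitions.size)  // 128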

RDD and partition in Apache Spark

So, in Spark when an application is started then an RDD containing the dataset for the application (e.g. words dataset for WordCount) is created.
So far, what I understand is that an RDD is a collection of those words (in WordCount) together with the operations that have been applied to that dataset (e.g. map, reduceByKey, etc.).
However, as far as I know, Spark also has HadoopPartition (or, in general, partitions), which every executor reads from HDFS. And I believe that the RDD in the driver also contains all of these partitions.
So, what is getting divided among executors in Spark? Does every executor get a sub-dataset as its own RDD containing less data than the RDD in the driver, or does every executor only deal with these partitions and read them directly from HDFS? Also, when are the partitions created? On RDD creation?
Partitions are configurable provided the RDD is key-value based.
There are 3 main properties of partitions:
Tuples in the same partition are guaranteed to be on the same machine.
Each node in a cluster can contain more than one partition.
The total number of partitions is configurable; by default it is set to the total number of cores on all the executor nodes.
Spark supports two types of partitioning:
Hash Partitioning
Range Partitioning
When Spark reads a file from HDFS, it creates a single partition for a single input split. The input split is set by the Hadoop InputFormat used to read the file.
When you call rdd.repartition(x), it performs a shuffle of the data from the N partitions you have in rdd to the x partitions you want; the partitioning is done on a round-robin basis.
Please see more details here and here
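A small sketch of both points, assuming the sc of a spark-shell session and a placeholder HDFS path:

// One partition per input split (for an uncompressed text file on HDFS this is
// usually one split per HDFS block).
val rdd = sc.textFile("hdfs:///data/words.txt")
println(rdd.getNumPartitions)

// Shuffle the existing partitions into exactly 10, distributing rows round-robin.
val reshuffled = rdd.repartition(10)
println(reshuffled.getNumPartitions)     // 10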
Your RDD has rows in it. If it is a text file, it has lines separated by \n.
Those rows get divided into partitions across different nodes in the Spark cluster.
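If you want to see which lines ended up in which partition, here is a tiny sketch using glom(), again with a placeholder path and the sc of a spark-shell session:

// Ask for at least 3 partitions, then inspect how the lines were distributed.
val lines = sc.textFile("hdfs:///data/words.txt", 3)
lines.glom().collect().zipWithIndex.foreach { case (part, i) =>
  println(s"partition $i holds ${part.length} lines")
}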
