Compute size of Spark dataframe - SizeEstimator gives unexpected results - apache-spark

I am trying to find a reliable way to compute the size (in bytes) of a Spark dataframe programmatically.
The reason is that I would like to have a method to compute an "optimal" number of partitions ("optimal" could mean different things here: it could mean having an optimal partition size, or resulting in an optimal file size when writing to Parquet tables - but both can be assumed to be some linear function of the dataframe size). In other words, I would like to call coalesce(n) or repartition(n) on the dataframe, where n is not a fixed number but rather a function of the dataframe size.
Other topics on SO suggest using SizeEstimator.estimate from org.apache.spark.util to get the size in bytes of the dataframe, but the results I'm getting are inconsistent.
First of all, I'm persisting my dataframe to memory:
df.cache().count
The Spark UI shows a size of 4.8GB in the Storage tab. Then, I run the following command to get the size from SizeEstimator:
import org.apache.spark.util.SizeEstimator
SizeEstimator.estimate(df)
This gives a result of 115'715'808 bytes =~ 116MB. However, applying SizeEstimator to different objects leads to very different results. For instance, I try computing the size separately for each row in the dataframe and sum them:
df.map(row => SizeEstimator.estimate(row.asInstanceOf[ AnyRef ])).reduce(_+_)
This results in a size of 12'084'698'256 bytes =~ 12GB. Or, I can try to apply SizeEstimator to every partition:
df.mapPartitions(
iterator => Seq(SizeEstimator.estimate(
iterator.toList.map(row => row.asInstanceOf[ AnyRef ]))).toIterator
).reduce(_+_)
which results again in a different size of 10'792'965'376 bytes =~ 10.8GB.
I understand there are memory optimizations / memory overhead involved, but after performing these tests I don't see how SizeEstimator can be used to get a sufficiently good estimate of the dataframe size (and consequently of the partition size, or resulting Parquet file sizes).
What is the appropriate way (if any) to apply SizeEstimator in order to get a good estimate of a dataframe size or of its partitions? If there isn't any, what is the suggested approach here?

Unfortunately, I was not able to get reliable estimates from SizeEstimator, but I could find another strategy - if the dataframe is cached, we can extract its size from queryExecution as follows:
df.cache.foreach(_ => ())
val catalyst_plan = df.queryExecution.logical
val df_size_in_bytes = spark.sessionState.executePlan(
catalyst_plan).optimizedPlan.stats.sizeInBytes
For the example dataframe, this gives exactly 4.8GB (which also corresponds to the file size when writing to an uncompressed Parquet table).
This has the disadvantage that the dataframe needs to be cached, but it is not a problem in my case.
EDIT: Replaced df.cache.foreach(_=>_) by df.cache.foreach(_ => ()), thanks to @DavidBenedeki for pointing it out in the comments.
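Once a size in bytes is available, the partition count asked about in the question can be derived from it. The following is a minimal Python sketch and not part of the original answer: the helper name partitions_for_size and the 128 MiB target are assumptions chosen for illustration, and the resulting n would then be passed to repartition(n) or coalesce(n).

```python
import math

def partitions_for_size(size_in_bytes, target_partition_bytes=128 * 1024 * 1024):
    """Return a partition count so each partition holds roughly the target size.

    Both the function name and the 128 MiB default are illustrative assumptions.
    """
    if target_partition_bytes <= 0:
        raise ValueError("target_partition_bytes must be positive")
    # Always keep at least one partition, even for an empty dataframe.
    return max(1, math.ceil(size_in_bytes / target_partition_bytes))

# Example: the 4.8GB dataframe from the question with the assumed target size.
n = partitions_for_size(4_800_000_000)
print(n)
```

The target size is the knob to tune: a smaller target yields more, smaller partitions (and more output files when writing Parquet).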

SizeEstimator returns the number of bytes an object takes up on the JVM heap. This includes the objects it references, so the actual size of the object itself will almost always be much smaller.
The discrepancies in the sizes you've observed arise because, when you create new objects on the JVM, the references take up memory too, and this is counted as well.
Check out the docs here 🤩
https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.util.SizeEstimator$

Apart from SizeEstimator, which you have already tried (good insight), below is another option:
RDDInfo[] getRDDStorageInfo()
Return information about what RDDs are cached, if they are in mem or on disk, how much space they take, etc.
The Storage tab of the Spark UI actually uses this (see the Spark docs).
Below is the implementation from spark
/**
 * :: DeveloperApi ::
 * Return information about what RDDs are cached, if they are in mem or on disk, how much space
 * they take, etc.
 */
@DeveloperApi
def getRDDStorageInfo: Array[RDDInfo] = {
  getRDDStorageInfo(_ => true)
}
private[spark] def getRDDStorageInfo(filter: RDD[_] => Boolean): Array[RDDInfo] = {
  assertNotStopped()
  val rddInfos = persistentRdds.values.filter(filter).map(RDDInfo.fromRdd).toArray
  rddInfos.foreach { rddInfo =>
    val rddId = rddInfo.id
    val rddStorageInfo = statusStore.asOption(statusStore.rdd(rddId))
    rddInfo.numCachedPartitions = rddStorageInfo.map(_.numCachedPartitions).getOrElse(0)
    rddInfo.memSize = rddStorageInfo.map(_.memoryUsed).getOrElse(0L)
    rddInfo.diskSize = rddStorageInfo.map(_.diskUsed).getOrElse(0L)
  }
  rddInfos.filter(_.isCached)
}
yourRDD.toDebugString also uses this.
General note:
In my opinion, to get an optimal number of records in each partition, and to check that your repartitioning is correct and the records are uniformly distributed, it is more sensible to count the records per partition as shown below, adjust your repartition number, and then measure the size of a partition.
yourdf.rdd.mapPartitionsWithIndex{case (index,rows) => Iterator((index,rows.size))}
.toDF("PartitionNumber","NumberOfRecordsPerPartition")
.show
or use the existing Spark functions (depending on the Spark version):
import org.apache.spark.sql.functions._
df.withColumn("partitionId", spark_partition_id()).groupBy("partitionId").count.show
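The per-partition counts produced above can be condensed into a single uniformity figure. This is a small Python sketch under stated assumptions: skew_summary is a hypothetical helper, and its input is simply the list of record counts collected from one of the snippets above.

```python
def skew_summary(records_per_partition):
    """Summarise how evenly records are spread across partitions."""
    counts = list(records_per_partition)
    if not counts:
        raise ValueError("need at least one partition count")
    mean = sum(counts) / len(counts)
    largest = max(counts)
    return {
        "partitions": len(counts),
        "mean": mean,
        "max": largest,
        # 1.0 means perfectly uniform; larger values mean the biggest
        # partition holds that many times the average. All-empty input
        # is reported as uniform.
        "max_over_mean": largest / mean if mean else 1.0,
    }

print(skew_summary([10, 10, 10, 370]))
```

A ratio close to 1.0 suggests the chosen repartition number spreads the data evenly; a large ratio points at one oversized partition.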

My suggestion is:
from sys import getsizeof
def compare_size_two_object(one, two):
    '''Compare the size of two objects in bytes.'''
    print(getsizeof(one), 'versus', getsizeof(two))

Related

How can I stop this Spark flatmap, which returns massive results, failing on writing?

I'm using a flatmap function to split absolutely huge XML files into (tens of thousands) of smaller XML String fragments which I want to write out to Parquet. This has a high rate of stage failure; exactly where is a bit cryptic, but it seems to be somewhere when the DataFrameWriter is writing that I lose an executor, probably because I'm exceeding some storage boundary.
To give a flavour, here's the class that's used in the flatMap, with some pseudo-code. Note that the class returns an Iterable - which I had hoped would allow Spark to stream the results from the flatMap, rather than (I suspect) holding it all in memory before writing it:
class XmlIterator(filepath: String, split_element: String) extends Iterable[String] {
  // open an XMLEventReader on a FileInputStream on the filepath
  // Implement an Iterable that returns a chunk of the XML file at a time
  def iterator = new Iterator[String] {
    def hasNext = {
      // advance in the input stream and return true if there's something to return
    }
    def next = {
      // return the current chunk as a String
    }
  }
}
And here is how I use it:
var dat = [a one-column DataFrame containing a bunch of paths to giga-files]
dat.repartition(1375) // repartition to the number of rows, as I want the DataFrameWriter
// to write out as soon as each file is processed
.flatMap(rec => new XmlIterator(rec, "bibrecord"))
.write
.parquet("some_path")
This works beautifully for a few files in parallel but for larger batches I suffer stage failure. One part of the stack trace suggests to me that Spark is in fact holding the entire results of each flatMap as an array before writing out:
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
To be honest, I thought that by implementing the flatMap as an Iterable Spark would be able to pull the results out one-by-one and avoid buffering the entire results in memory, but I'm a bit baffled.
Can anyone suggest an alternative, more memory-efficient strategy for saving out the results of the flatMap?
For what it's worth, I've managed to solve this myself by adding an intermediate stage that persists the flatMap output to disk. This lets me repartition the output of the flatmap before passing to a DataFrameWriter. Works seamlessly.
dat.repartition(1375)
.flatMap(rec => new XmlIterator(rec, "bibrecord"))
.persist(StorageLevel.DISK_ONLY)
.repartition(5000)
.write
.parquet("some_path")
I suspect that trying to pass the flatMap output directly to a DataFrameWriter was overwhelming some internal buffer - the output from each flatMap could be as much as 5GB, and I assume Spark was needing to hold this in memory.
If anyone has comments or pointers to the internal workings of the DataFrameWriter that would be super interesting.
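For illustration, the chunk-at-a-time iterator that the question leaves as pseudo-code can be sketched in Python with the standard library's incremental XML parser. This is not the asker's Scala class: iter_fragments is a hypothetical name, and the sketch only shows the lazy splitting itself, not how Spark consumes the result.

```python
import io
import xml.etree.ElementTree as ET

def iter_fragments(source, split_element):
    """Lazily yield each <split_element> subtree of an XML stream as a string."""
    # iterparse reads the input incrementally instead of loading it whole.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == split_element:
            yield ET.tostring(elem, encoding="unicode")
            # Drop the children of the fragment that was just emitted.
            elem.clear()

sample = io.BytesIO(
    b"<root><bibrecord><t>a</t></bibrecord><bibrecord><t>b</t></bibrecord></root>"
)
for fragment in iter_fragments(sample, "bibrecord"):
    print(fragment)
```

Because the function is a generator, a fragment is only built when the consumer asks for the next one.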

What is the performance difference between accumulator and collect() in Spark?

Accumulators are shared variables in Spark that are updated by executors but read only by the driver.
collect() in Spark brings all the data from the executors to the driver.
So in both cases I ultimately end up with the data on the driver. What, then, is the difference in performance between using an accumulator and using collect() to convert a large RDD into a List?
Code to convert a dataframe to a List using an accumulator:
val queryOutput = spark.sql(query)
val acc = spark.sparkContext.collectionAccumulator[Map[String,Any]]("JsonCollector")
val jsonString = queryOutput.foreach(a=>acc.add(convertRowToJSON(a)))
acc.value.asScala.toList
def convertRowToJSON(row: Row): Map[String,Any] = {
val m = row.getValuesMap(row.schema.fieldNames)
println(m)
JSONObject(m).obj
}
Code to convert dataframe to list using collect()
val queryOutput = spark.sql(query)
queryOutput.toJSON.collectAsList()
Convert large RDD to LIST
It is not a good idea. collect will move data from all executors to driver memory. If there is not enough memory, it will throw an Out Of Memory (OOM) exception. If your data fits in the memory of a single machine, then you probably don't need Spark.
Spark natively supports accumulators of numeric types, and programmers can add support for new types. They can be used to implement counters (as in MapReduce) or sums. OUT parameter of accumulator should be a type that can be read atomically (e.g., Int, Long), or thread-safely (e.g., synchronized collections) because it will be read from other threads.
CollectionAccumulator .value returns List (ArrayList implementation) and it will throw OOM if size is greater than driver memory.

What is the best strategy to load huge datasets/data into Hive tables using Spark? [duplicate]

I am trying to move data from a table in PostgreSQL to a Hive table on HDFS. To do that, I came up with the following code:
val conf = new SparkConf().setAppName("Spark-JDBC").set("spark.executor.heartbeatInterval","120s").set("spark.network.timeout","12000s").set("spark.sql.inMemoryColumnarStorage.compressed", "true").set("spark.sql.orc.filterPushdown","true").set("spark.serializer", "org.apache.spark.serializer.KryoSerializer").set("spark.kryoserializer.buffer.max","512m").set("spark.serializer", classOf[org.apache.spark.serializer.KryoSerializer].getName).set("spark.streaming.stopGracefullyOnShutdown","true").set("spark.yarn.driver.memoryOverhead","7168").set("spark.yarn.executor.memoryOverhead","7168").set("spark.sql.shuffle.partitions", "61").set("spark.default.parallelism", "60").set("spark.memory.storageFraction","0.5").set("spark.memory.fraction","0.6").set("spark.memory.offHeap.enabled","true").set("spark.memory.offHeap.size","16g").set("spark.dynamicAllocation.enabled", "false").set("spark.dynamicAllocation.enabled","true").set("spark.shuffle.service.enabled","true")
val spark = SparkSession.builder().config(conf).master("yarn").enableHiveSupport().config("hive.exec.dynamic.partition", "true").config("hive.exec.dynamic.partition.mode", "nonstrict").getOrCreate()
def prepareFinalDF(splitColumns:List[String], textList: ListBuffer[String], allColumns:String, dataMapper:Map[String, String], partition_columns:Array[String], spark:SparkSession): DataFrame = {
val colList = allColumns.split(",").toList
val (partCols, npartCols) = colList.partition(p => partition_columns.contains(p.takeWhile(x => x != ' ')))
val queryCols = npartCols.mkString(",") + ", 0 as " + flagCol + "," + partCols.reverse.mkString(",")
val execQuery = s"select ${allColumns}, 0 as ${flagCol} from schema.tablename where period_year='2017' and period_num='12'"
val yearDF = spark.read.format("jdbc").option("url", connectionUrl).option("dbtable", s"(${execQuery}) as year2017")
.option("user", devUserName).option("password", devPassword)
.option("partitionColumn","cast_id")
.option("lowerBound", 1).option("upperBound", 100000)
.option("numPartitions",70).load()
val totalCols:List[String] = splitColumns ++ textList
val cdt = new ChangeDataTypes(totalCols, dataMapper)
hiveDataTypes = cdt.gpDetails()
val fc = prepareHiveTableSchema(hiveDataTypes, partition_columns)
val allColsOrdered = yearDF.columns.diff(partition_columns) ++ partition_columns
val allCols = allColsOrdered.map(colname => org.apache.spark.sql.functions.col(colname))
val resultDF = yearDF.select(allCols:_*)
val stringColumns = resultDF.schema.fields.filter(x => x.dataType == StringType).map(s => s.name)
val finalDF = stringColumns.foldLeft(resultDF) {
(tempDF, colName) => tempDF.withColumn(colName, regexp_replace(regexp_replace(col(colName), "[\r\n]+", " "), "[\t]+"," "))
}
finalDF
}
val dataDF = prepareFinalDF(splitColumns, textList, allColumns, dataMapper, partition_columns, spark)
val dataDFPart = dataDF.repartition(30)
dataDFPart.createOrReplaceTempView("preparedDF")
spark.sql("set hive.exec.dynamic.partition.mode=nonstrict")
spark.sql("set hive.exec.dynamic.partition=true")
spark.sql(s"INSERT OVERWRITE TABLE schema.hivetable PARTITION(${prtn_String_columns}) select * from preparedDF")
The data is inserted into the hive table dynamically partitioned based on prtn_String_columns: source_system_name, period_year, period_num
Spark-submit used:
SPARK_MAJOR_VERSION=2 spark-submit --conf spark.ui.port=4090 --driver-class-path /home/fdlhdpetl/jars/postgresql-42.1.4.jar --jars /home/fdlhdpetl/jars/postgresql-42.1.4.jar --num-executors 80 --executor-cores 5 --executor-memory 50G --driver-memory 20G --driver-cores 3 --class com.partition.source.YearPartition splinter_2.11-0.1.jar --master=yarn --deploy-mode=cluster --keytab /home/fdlhdpetl/fdlhdpetl.keytab --principal fdlhdpetl@FDLDEV.COM --files /usr/hdp/current/spark2-client/conf/hive-site.xml,testconnection.properties --name Splinter --conf spark.executor.extraClassPath=/home/fdlhdpetl/jars/postgresql-42.1.4.jar
The following error messages are generated in the executor logs:
Container exited with a non-zero exit code 143.
Killed by external signal
18/10/03 15:37:24 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[SIGTERM handler,9,system]
java.lang.OutOfMemoryError: Java heap space
at java.util.zip.InflaterInputStream.<init>(InflaterInputStream.java:88)
at java.util.zip.ZipFile$ZipFileInflaterInputStream.<init>(ZipFile.java:393)
at java.util.zip.ZipFile.getInputStream(ZipFile.java:374)
at java.util.jar.JarFile.getManifestFromReference(JarFile.java:199)
at java.util.jar.JarFile.getManifest(JarFile.java:180)
at sun.misc.URLClassPath$JarLoader$2.getManifest(URLClassPath.java:944)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:450)
at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.util.SignalUtils$ActionHandler.handle(SignalUtils.scala:99)
at sun.misc.Signal$1.run(Signal.java:212)
at java.lang.Thread.run(Thread.java:745)
I see in the logs that the read is being executed properly with the given number of partitions as below:
Scan JDBCRelation((select column_names from schema.tablename where period_year='2017' and period_num='12') as year2017) [numPartitions=50]
Below is the state of executors in stages:
The data is not being partitioned properly. One partition is smaller while the other one becomes huge. There is a skew problem here.
While inserting the data into the Hive table, the job fails at the line spark.sql(s"INSERT OVERWRITE TABLE schema.hivetable PARTITION(${prtn_String_columns}) select * from preparedDF"), but I understand this is happening because of the data skew problem.
I tried increasing the number of executors, the executor memory and the driver memory, and I tried saving the dataframe as a CSV file instead of into a Hive table, but nothing stops the execution from failing with the exception:
java.lang.OutOfMemoryError: GC overhead limit exceeded
Is there anything in the code that I need to correct? Could anyone let me know how I can fix this problem?
Determine how many partitions you need given the amount of input data and your cluster resources. As a rule of thumb, it is better to keep partition input under 1GB unless strictly necessary, and strictly smaller than the block size limit.
You've previously stated that you migrate 1TB of data; the values you use in different posts (5 - 70) are likely way too low to ensure a smooth process.
Try to use a value which won't require further repartitioning.
Know your data.
Analyze the columns available in the dataset to determine if there are any columns with high cardinality and uniform distribution that can be distributed among the desired number of partitions. These are good candidates for an import process. Additionally, you should determine the exact range of values.
Aggregations with different centrality and skewness measure as well as histograms and basic counts-by-key are good exploration tools. For this part it is better to analyze data directly in the database, instead of fetching it to Spark.
Depending on the RDBMS, you might be able to use width_bucket (PostgreSQL, Oracle) or an equivalent function to get a decent idea of how data will be distributed in Spark after loading with partitionColumn, lowerBound, upperBound, numPartitions.
s"""(SELECT width_bucket($partitionColumn, $lowerBound, $upperBound, $numPartitions) AS bucket, COUNT(*)
FROM t
GROUP BY bucket) as tmp"""
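To see what such a bucket histogram reveals, the same computation can be mimicked locally on a sample of the column. This is a Python sketch under assumptions: the two helper functions are hypothetical, they follow the usual SQL width_bucket semantics for numeric values with low < high, and the sample values are made up.

```python
from collections import Counter

def width_bucket(value, low, high, count):
    """Mimic SQL width_bucket for numeric values, assuming low < high."""
    if value < low:
        return 0            # underflow bucket
    if value >= high:
        return count + 1    # overflow bucket
    return int((value - low) * count // (high - low)) + 1

def bucket_histogram(values, low, high, count):
    """Count how many values land in each bucket, like the GROUP BY above."""
    return Counter(width_bucket(v, low, high, count) for v in values)

# Made-up sample: most ids sit between 500 and 749.
sample_ids = list(range(500, 750)) + [10, 300, 900]
print(sorted(bucket_histogram(sample_ids, 0, 1000, 4).items()))
```

A histogram where one bucket dwarfs the others is the sign that the chosen column and bounds will produce skewed partitions.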
If there are no columns which satisfy above criteria consider:
Creating a custom one and exposing it via a view. Hashes over multiple independent columns are usually good candidates. Please consult your database manual to determine functions that can be used here (DBMS_CRYPTO in Oracle, pgcrypto in PostgreSQL)*.
Using a set of independent columns which taken together provide high enough cardinality.
Optionally, if you're going to write to a partitioned Hive table, you should consider including Hive partitioning columns. It might limit the number of files generated later.
Prepare partitioning arguments
If column selected or created in the previous steps is numeric (or date / timestamp in Spark >= 2.4) provide it directly as the partitionColumn and use range values determined before to fill lowerBound and upperBound.
If bound values don't reflect the properties of the data (min(col) for lowerBound, max(col) for upperBound), it can result in significant data skew, so tread carefully. In the worst case scenario, when the bounds don't cover the range of the data, all records will be fetched by a single machine, making it no better than no partitioning at all.
If column selected in the previous steps is categorical or is a set of columns generate a list of mutually exclusive predicates that fully cover the data, in a form that can be used in a SQL where clause.
For example if you have a column A with values {a1, a2, a3} and column B with values {b1, b2, b3}:
val predicates = for {
a <- Seq("a1", "a2", "a3")
b <- Seq("b1", "b2", "b3")
} yield s"A = '$a' AND B = '$b'"
Double check that conditions don't overlap and all combinations are covered. If these conditions are not satisfied you end up with duplicates or missing records respectively.
Pass data as predicates argument to jdbc call. Note that the number of partitions will be equal exactly to the number of predicates.
Put the database in read-only mode (any ongoing writes can cause data inconsistency. If possible, you should lock the database before you start the whole process, but it might not be possible in your organization).
If the number of partitions matches the desired output, load the data without repartitioning and dump it directly to the sink; if not, you can try to repartition following the same rules as in step 1.
If you still experience any problems make sure that you've properly configured Spark memory and GC options.
If none of the above works:
Consider dumping your data to network / distributed storage using tools like COPY TO, and read it directly from there.
Note that for standard database utilities you will typically need a POSIX-compliant file system, so HDFS usually won't do.
The advantage of this approach is that you don't need to worry about the column properties, and there is no need for putting data in a read-only mode, to ensure consistency.
Using dedicated bulk transfer tools, like Apache Sqoop, and reshaping data afterwards.
* Don't use pseudocolumns - Pseudocolumn in Spark JDBC.
In my experience there are 4 kinds of memory settings which make a difference:
A) [1] Memory for storing data for processing reasons VS [2] Heap Space for holding the program stack
B) [1] Executor VS [2] driver memory
Up to now, I was always able to get my Spark jobs running successfully by increasing the appropriate kind of memory:
A2-B1 would therefore be the memory available on the executor to hold the program stack. Etc.
The property names are as follows:
A1-B1) executor-memory
A1-B2) driver-memory
A2-B1) spark.yarn.executor.memoryOverhead
A2-B2) spark.yarn.driver.memoryOverhead
Keep in mind that the sum of all *-B1 must be less than the available memory on your workers and the sum of all *-B2 must be less than the memory on your driver node.
My bet would be, that the culprit is one of the boldly marked heap settings.
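The sum rule above can be checked with simple arithmetic. This is a rough Python sketch under assumptions: the helper names are hypothetical, it treats the memory requested per executor as executor memory plus the configured overhead, and it ignores any rounding the resource manager applies. The numbers are taken from the spark-submit command and configuration in the question.

```python
def executor_request_mb(executor_memory_gb, memory_overhead_mb):
    """Approximate memory requested per executor: heap plus configured overhead."""
    return executor_memory_gb * 1024 + memory_overhead_mb

def total_request_mb(num_executors, executor_memory_gb, memory_overhead_mb):
    """Approximate memory requested for all executors together."""
    return num_executors * executor_request_mb(executor_memory_gb, memory_overhead_mb)

# --executor-memory 50G, memoryOverhead 7168 (MB), --num-executors 80
per_executor = executor_request_mb(50, 7168)
all_executors = total_request_mb(80, 50, 7168)
print(per_executor, all_executors)
```

If the total exceeds what the worker nodes can provide, some of the requested executors cannot be started with those settings.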
Another question of yours was routed here as a duplicate:
'How to avoid data skewing while reading huge datasets or tables into spark?
The data is not being partitioned properly. One partition is smaller while the
other one becomes huge on read.
I observed that one of the partition has nearly 2million rows and
while inserting there is a skew in partition. '
If the problem is dealing with data that is unevenly partitioned in the dataframe after the read, have you played around with increasing the "numPartitions" value?
.option("numPartitions",50)
lowerBound and upperBound form the partition strides for the generated WHERE clause expressions, and numPartitions determines the number of splits.
Say, for example, sometable has a column ID (which we choose as partitionColumn); the value range we see in the table for ID is from 1 to 1000, and we want to get all the records by running select * from sometable.
So we go with lowerBound = 1, upperBound = 1000 and numPartitions = 4.
This will produce a dataframe of 4 partitions, each holding the result of one query built from those settings:
select * from sometable where ID < 250
select * from sometable where ID >= 250 and ID < 500
select * from sometable where ID >= 500 and ID < 750
select * from sometable where ID >= 750
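The way these four queries come about can be sketched as follows. This is a simplified Python illustration, not Spark's actual implementation: stride_predicates is a hypothetical helper, Spark's own boundary arithmetic and NULL handling may differ, and a lower bound of 0 is used so that the boundaries come out as the round numbers shown above.

```python
def stride_predicates(column, lower_bound, upper_bound, num_partitions):
    """Build one WHERE clause per partition from evenly spaced boundaries."""
    if num_partitions < 2:
        raise ValueError("a single partition needs no predicate")
    stride = (upper_bound - lower_bound) // num_partitions
    predicates = []
    for i in range(num_partitions):
        start = lower_bound + i * stride
        end = start + stride
        if i == 0:
            # The first partition is open-ended below.
            predicates.append(f"{column} < {end}")
        elif i == num_partitions - 1:
            # The last partition is open-ended above.
            predicates.append(f"{column} >= {start}")
        else:
            predicates.append(f"{column} >= {start} AND {column} < {end}")
    return predicates

for where in stride_predicates("ID", 0, 1000, 4):
    print(f"select * from sometable where {where}")
```

The boundaries depend only on the bounds and the partition count, never on where the rows actually are, which is why a cluster of ids inside one stride ends up in a single partition.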
What if most of the records in our table fall within the range ID(500,750)? That's the situation you are in.
When we increase numPartitions, the split happens even further, and that reduces the volume of records in the same partition, but this is not a precise fix.
Instead of letting Spark split the partitionColumn based on the boundaries we provide, you can feed the splits yourself so that the data is evenly split. To do so, you need to switch over to another JDBC method where, instead of (lowerBound, upperBound and numPartitions), we can provide predicates directly.
def jdbc(url: String, table: String, predicates: Array[String], connectionProperties: Properties): DataFrame

Avoid repartition costs when filtering and then coalescing

I am implementing a range query on an RDD of (x,y) points in pyspark. I partitioned the xy space into a 16*16 grid (256 cells) and assigned each point in my RDD to one of these cells.
The gridMappedRDD is a PairRDD: (cell_id, Point object)
I partitioned this RDD to 256 partitions, using:
gridMappedRDD.partitionBy(256)
The range query is a rectangular box. I have a method for my Grid object which can return the list of cell ids which overlap with the query range. So, I used this as a filter to prune the unrelated cells:
filteredRDD = gridMappedRDD.filter(lambda x: x[0] in candidateCells)
But the problem is that when running the query and then collecting the results, all the 256 partitions are evaluated; A task is created for each partition.
To avoid this problem, I tried coalescing the filteredRDD to the length of candidateCell list and I hoped this could solve the problem.
filteredRDD.coalesce(len(candidateCells))
In fact the resulting RDD has len(candidateCells) partitions but the partitions are not the same as gridMappedRDD.
As stated in the coalesce documentation, the shuffle parameter is False and no shuffle should be performed among partitions but I can see (with the help of glom()) that this is not the case.
For example after a coalesce(4) with candidateCells=[62, 63, 78, 79] the partitions are like this:
[[(62, P), (62, P) .... , (63, P)],
[(78, P), (78, P) .... , (79, P)],
[], []
]
Actually, by coalescing, I have a shuffle read which equals to the size of my whole dataset for every task, which takes a significant time. What I need is an RDD with only partitions related to cells in candidateCells, without any shuffles.
So, my question is: is it possible to filter only some partitions without reshuffling? For the above example, my filteredRDD would have 4 partitions with exactly the same data as originalRDD's 62nd, 63rd, 78th and 79th partitions. That way, the query could be directed to the affected partitions only.
You made a few incorrect assumptions here:
The shuffle is not related to coalesce (nor is coalesce useful here). It is caused by partitionBy. Partitioning by definition requires a shuffle.
Partitioning cannot be used to optimize filter. Spark knows nothing about the function you use (it is a black box).
Partitioning doesn't uniquely map keys to partitions. Multiple keys can be placed on the same partition - How does HashPartitioner work?
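The last point can be illustrated with a few lines of plain Python. This is a sketch under assumptions: the helper names are hypothetical, keys are small non-negative integers, and Python's built-in hash stands in for whatever hash function the partitioner really uses.

```python
from collections import defaultdict

def partition_for(key, num_partitions):
    """Pick a partition the way a hash partitioner does: hash modulo count."""
    return hash(key) % num_partitions

def keys_by_partition(keys, num_partitions):
    """Group keys by the partition they would be assigned to."""
    grouped = defaultdict(list)
    for key in keys:
        grouped[partition_for(key, num_partitions)].append(key)
    return dict(grouped)

# 256 integer cell ids spread over only 16 partitions.
layout = keys_by_partition(range(256), 16)
print(layout[partition_for(62, 16)])  # all cells sharing a partition with cell 62
```

With fewer partitions than keys, several cells necessarily share a partition, so a partition index alone does not identify a cell.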
What can you do:
If the resulting subset is small, repartition and apply lookup for each key:
from itertools import chain
partitionedRDD = gridMappedRDD.partitionBy(256)
chain.from_iterable(
    ((c, x) for x in partitionedRDD.lookup(c))
    for c in candidateCells
)
If the data is large, you can try to skip scanning partitions (the number of tasks won't change, but some tasks can be short-circuited):
candidatePartitions = [
    partitionedRDD.partitioner.partitionFunc(c) for c in candidateCells
]
partitionedRDD.mapPartitionsWithIndex(
    lambda i, xs: (x for x in xs if x[0] in candidateCells) if i in candidatePartitions else []
)
These two methods make sense only if you perform multiple "lookups". If it is a one-off operation, it is better to perform a linear filter:
It is cheaper than shuffle and repartitioning.
If initial data is uniformly distributed downstream processing will be able to better utilize available resources.

Use of partitioners in Spark

Hi, I have a question about partitioning in Spark. In the Learning Spark book, the authors say that partitioning can be useful, for example during PageRank (page 66), and they write:
since links is a static dataset, we partition it at the start with
partitionBy(), so that it does not need to be shuffled across the
network
Now I'm focused on this example, but my questions are general:
Why doesn't a partitioned RDD need to be shuffled?
partitionBy() is a wide transformation, so it will produce a shuffle anyway, right?
Could someone illustrate a concrete example and what happens on each single node when partitionBy happens?
Thanks in advance
Why a partitioned RDD doesn't need to be shuffled?
When the author does:
val links = sc.objectFile[(String, Seq[String])]("links")
.partitionBy(new HashPartitioner(100))
.persist()
He's partitioning the data set into 100 partitions where each key will be hashed to a given partition (pageId in the given example). This means that the same key will be stored in a single given partition. Then, when he does the join:
val contributions = links.join(ranks)
All chunks of data with the same pageId should already be located on the same executor, avoiding the need for a shuffle between different nodes in the cluster.
PartitionBy() is a wide transformation,so it will produce shuffle
anyway, right?
Yes, partitionBy produces a ShuffledRDD[K, V, V]:
def partitionBy(partitioner: Partitioner): RDD[(K, V)] = self.withScope {
  if (keyClass.isArray && partitioner.isInstanceOf[HashPartitioner]) {
    throw new SparkException("HashPartitioner cannot partition array keys.")
  }
  if (self.partitioner == Some(partitioner)) {
    self
  } else {
    new ShuffledRDD[K, V, V](self, partitioner)
  }
}
Could someone illustrate a concrete example and what happen into each
single node when partitionBy happens?
Basically, partitionBy will do the following:
It will hash the key modulo the number of partitions (100 in this case), and since it relies on the fact that the same key will always produce the same hashcode, it will package all data for a given id (in our case, pageId) into the same partition, such that when you join, all the data will already be available in that partition, avoiding the need for a shuffle.
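The idea can be mimicked in plain Python to show why no data has to move at join time. This is a toy sketch, not Spark code: partition_by and join_copartitioned are hypothetical helpers, lists stand in for partitions, and the sample links and ranks are made up.

```python
def partition_by(pairs, num_partitions):
    """Place each (key, value) pair in the bucket chosen by hash(key) modulo count."""
    buckets = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        buckets[hash(key) % num_partitions].append((key, value))
    return buckets

def join_copartitioned(left_buckets, right_buckets):
    """Join bucket i of one side only with bucket i of the other side."""
    joined = []
    for left, right in zip(left_buckets, right_buckets):
        right_by_key = {}
        for key, value in right:
            right_by_key.setdefault(key, []).append(value)
        for key, value in left:
            for other in right_by_key.get(key, []):
                joined.append((key, (value, other)))
    return joined

# Both sides use the same function and the same number of partitions.
links = partition_by([("a", ["b", "c"]), ("b", ["a"]), ("c", ["a"])], 4)
ranks = partition_by([("a", 1.0), ("b", 1.0), ("c", 1.0)], 4)
print(sorted(join_copartitioned(links, ranks)))
```

Because a key always hashes to the same bucket index on both sides, each bucket pair can be joined independently, which is the property the PageRank example relies on after its one-time partitionBy.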

Resources