Use of partitioners in Spark - apache-spark

Hi, I have a question about partitioning in Spark. In the Learning Spark book, the authors say that partitioning can be useful, for example during PageRank (page 66), and they write:
since links is a static dataset, we partition it at the start with
partitionBy(), so that it does not need to be shuffled across the
network
I'm focusing on this example, but my questions are general:
Why doesn't a partitioned RDD need to be shuffled?
partitionBy() is a wide transformation, so it will produce a shuffle anyway, right?
Could someone give a concrete example and explain what happens on each node when partitionBy runs?
Thanks in advance.

Why doesn't a partitioned RDD need to be shuffled?
When the author does:
val links = sc.objectFile[(String, Seq[String])]("links")
  .partitionBy(new HashPartitioner(100))
  .persist()
He's partitioning the data set into 100 partitions, where each key (pageId in this example) is hashed to a given partition. This means the same key will always be stored in a single partition. Then, when he does the join:
val contributions = links.join(ranks)
All chunks of data with the same pageId should already be located on the same executor, avoiding the need for a shuffle between different nodes in the cluster.
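For context, here is a minimal sketch along the lines of the book's PageRank loop (the iteration count and the 0.15/0.85 constants are from the standard formulation; `sc` is assumed to be in scope). The key point is that ranks is derived from links via mapValues, which preserves the HashPartitioner, so the two sides of the join are already co-partitioned:
import org.apache.spark.HashPartitioner

val links = sc.objectFile[(String, Seq[String])]("links")
  .partitionBy(new HashPartitioner(100))
  .persist()

// ranks inherits the partitioner of links, because mapValues preserves partitioning
var ranks = links.mapValues(_ => 1.0)

for (_ <- 0 until 10) {
  // links and ranks share the same partitioner, so this join does not shuffle links
  val contributions = links.join(ranks).flatMap {
    case (_, (outLinks, rank)) => outLinks.map(dest => (dest, rank / outLinks.size))
  }
  // reduceByKey hash-partitions the result, so the next iteration stays aligned with links
  ranks = contributions.reduceByKey(_ + _).mapValues(v => 0.15 + 0.85 * v)
}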
partitionBy() is a wide transformation, so it will produce a shuffle anyway, right?
Yes, partitionBy produces a ShuffledRDD[K, V, V]:
def partitionBy(partitioner: Partitioner): RDD[(K, V)] = self.withScope {
  if (keyClass.isArray && partitioner.isInstanceOf[HashPartitioner]) {
    throw new SparkException("HashPartitioner cannot partition array keys.")
  }
  if (self.partitioner == Some(partitioner)) {
    self
  } else {
    new ShuffledRDD[K, V, V](self, partitioner)
  }
}
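One consequence of the `self` branch above, shown as a small sketch (the toy `pairs` RDD is just for illustration): calling partitionBy again with an equal partitioner is a no-op and does not add a second shuffle.
import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))

val partitioned = pairs.partitionBy(new HashPartitioner(100)) // first call: ShuffledRDD, one shuffle
val again = partitioned.partitionBy(new HashPartitioner(100)) // equal partitioner: the `self` branch

// HashPartitioner equality is based on the number of partitions,
// so the second call returns the very same RDD and schedules no new shuffle.
assert(again eq partitioned)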
Could someone give a concrete example and explain what happens on each node when partitionBy runs?
Basically, partitionBy will do the following:
It will hash the key modulo the number of partitions (100 in this case), and since it relies on the fact that the same key always produces the same hash code, it will place all data for a given id (in our case, pageId) in the same partition, so that when you join, all the data is already available in that partition, avoiding the need for a shuffle.
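A small illustration of that hashing (the key values are made up): HashPartitioner computes a non-negative hashCode modulo the number of partitions, so the same key always maps to the same partition index.
import org.apache.spark.HashPartitioner

val partitioner = new HashPartitioner(100)

// Same key, same partition index every time; different keys may or may not collide.
println(partitioner.getPartition("pageA")) // some index in [0, 100)
println(partitioner.getPartition("pageA")) // identical to the line above
println(partitioner.getPartition("pageB")) // usually a different index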

Related

Avoid repartition costs when filtering and then coalescing

I am implementing a range query on an RDD of (x,y) points in pyspark. I partitioned the xy space into a 16*16 grid (256 cells) and assigned each point in my RDD to one of these cells.
The gridMappedRDD is a PairRDD: (cell_id, Point object)
I partitioned this RDD into 256 partitions, using:
gridMappedRDD.partitionBy(256)
The range query is a rectangular box. I have a method for my Grid object which can return the list of cell ids which overlap with the query range. So, I used this as a filter to prune the unrelated cells:
filteredRDD = gridMappedRDD.filter(lambda x: x[0] in candidateCells)
But the problem is that when running the query and then collecting the results, all 256 partitions are evaluated; a task is created for each partition.
To avoid this, I tried coalescing the filteredRDD to the length of the candidateCells list, hoping this would solve the problem.
filteredRDD.coalesce(len(candidateCells))
In fact, the resulting RDD has len(candidateCells) partitions, but the partitions are not the same as those of gridMappedRDD.
As stated in the coalesce documentation, the shuffle parameter is False and no shuffle should be performed among partitions, but I can see (with the help of glom()) that this is not the case.
For example after a coalesce(4) with candidateCells=[62, 63, 78, 79] the partitions are like this:
[[(62, P), (62, P) .... , (63, P)],
[(78, P), (78, P) .... , (79, P)],
[], []
]
Actually, by coalescing, I get a shuffle read equal to the size of my whole dataset for every task, which takes a significant amount of time. What I need is an RDD with only the partitions related to the cells in candidateCells, without any shuffles.
So my question is: is it possible to filter only some partitions without reshuffling? For the above example, my filteredRDD would have 4 partitions with exactly the same data as the original RDD's partitions 62, 63, 78, and 79. That way, the query could be directed only to the affected partitions.
You made a few incorrect assumptions here:
The shuffle is not related to coalesce (nor is coalesce useful here). It is caused by partitionBy. Partitioning by definition requires a shuffle.
Partitioning cannot be used to optimize filter. Spark knows nothing about the function you use (it is a black box).
Partitioning doesn't uniquely map keys to partitions. Multiple keys can be placed on the same partition - How does HashPartitioner work?
What can you do:
If the resulting subset is small, repartition and apply lookup for each key:
from itertools import chain

partitionedRDD = gridMappedRDD.partitionBy(256)

chain.from_iterable(
    ((c, x) for x in partitionedRDD.lookup(c))
    for c in candidateCells
)
If the data is large, you can try to skip scanning partitions (the number of tasks won't change, but some tasks can be short-circuited):
candidatePartitions = [
    # partition index = hash of the key modulo the number of partitions
    partitionedRDD.partitioner.partitionFunc(c) % partitionedRDD.partitioner.numPartitions
    for c in candidateCells
]

partitionedRDD.mapPartitionsWithIndex(
    lambda i, xs: (x for x in xs if x[0] in candidateCells) if i in candidatePartitions else []
)
These two methods make sense only if you perform multiple "lookups". If it is a one-off operation, it is better to perform a linear filter:
It is cheaper than shuffling and repartitioning.
If the initial data is uniformly distributed, downstream processing will be able to better utilize the available resources.

Compute size of Spark dataframe - SizeEstimator gives unexpected results

I am trying to find a reliable way to compute the size (in bytes) of a Spark dataframe programmatically.
The reason is that I would like to have a method to compute an "optimal" number of partitions ("optimal" could mean different things here: it could mean having an optimal partition size, or resulting in an optimal file size when writing to Parquet tables - but both can be assumed to be some linear function of the dataframe size). In other words, I would like to call coalesce(n) or repartition(n) on the dataframe, where n is not a fixed number but rather a function of the dataframe size.
Other topics on SO suggest using SizeEstimator.estimate from org.apache.spark.util to get the size in bytes of the dataframe, but the results I'm getting are inconsistent.
First of all, I'm persisting my dataframe to memory:
df.cache().count
The Spark UI shows a size of 4.8GB in the Storage tab. Then, I run the following command to get the size from SizeEstimator:
import org.apache.spark.util.SizeEstimator
SizeEstimator.estimate(df)
This gives a result of 115'715'808 bytes =~ 116MB. However, applying SizeEstimator to different objects leads to very different results. For instance, I try computing the size separately for each row in the dataframe and sum them:
df.map(row => SizeEstimator.estimate(row.asInstanceOf[AnyRef])).reduce(_ + _)
This results in a size of 12'084'698'256 bytes =~ 12GB. Or, I can try to apply SizeEstimator to every partition:
df.mapPartitions(
  iterator => Seq(SizeEstimator.estimate(
    iterator.toList.map(row => row.asInstanceOf[AnyRef]))).toIterator
).reduce(_ + _)
which results again in a different size of 10'792'965'376 bytes =~ 10.8GB.
I understand there are memory optimizations / memory overhead involved, but after performing these tests I don't see how SizeEstimator can be used to get a sufficiently good estimate of the dataframe size (and consequently of the partition size, or resulting Parquet file sizes).
What is the appropriate way (if any) to apply SizeEstimator in order to get a good estimate of a dataframe size or of its partitions? If there isn't any, what is the suggested approach here?
Unfortunately, I was not able to get reliable estimates from SizeEstimator, but I could find another strategy - if the dataframe is cached, we can extract its size from queryExecution as follows:
df.cache.foreach(_ => ())
val catalyst_plan = df.queryExecution.logical
val df_size_in_bytes = spark.sessionState.executePlan(
catalyst_plan).optimizedPlan.stats.sizeInBytes
For the example dataframe, this gives exactly 4.8GB (which also corresponds to the file size when writing to an uncompressed Parquet table).
This has the disadvantage that the dataframe needs to be cached, but it is not a problem in my case.
EDIT: Replaced df.cache.foreach(_=>_) with df.cache.foreach(_ => ()), thanks to @DavidBenedeki for pointing it out in the comments.
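As a follow-up to the original goal (choosing n as a function of the dataframe size), here is a minimal sketch of how this estimate could be plugged into repartition; the 128 MB target and the variable names are illustrative assumptions, not recommendations:
// Assumes `df` and `spark` are in scope and the dataframe has been cached/materialized.
df.cache.foreach(_ => ())

val sizeInBytes = spark.sessionState
  .executePlan(df.queryExecution.logical)
  .optimizedPlan.stats.sizeInBytes // BigInt

val targetPartitionBytes = BigInt(128L * 1024 * 1024) // illustrative target: ~128 MB per partition
val numPartitions = ((sizeInBytes + targetPartitionBytes - 1) / targetPartitionBytes).toInt.max(1)

val repartitioned = df.repartition(numPartitions)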
SizeEstimator returns the number of bytes an object takes up on the JVM heap. This includes objects referenced by the object, so the object's own size will almost always be much smaller than the estimate.
The discrepancies in the sizes you've observed arise because, when you create new objects on the JVM, the references take up memory too, and this is counted as well.
Check out the docs here:
https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.util.SizeEstimator$
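A tiny sketch of what SizeEstimator measures on plain JVM objects (the Point class is made up purely for illustration): the estimate covers the object plus everything it references on the heap.
import org.apache.spark.util.SizeEstimator

// A made-up class purely for illustration.
case class Point(x: Double, y: Double)

val one = Point(1.0, 2.0)
val many = Array.fill(1000)(Point(1.0, 2.0))

println(SizeEstimator.estimate(one))  // size of a single small object on the heap
println(SizeEstimator.estimate(many)) // array header + 1000 references + 1000 referenced objects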
Apart from SizeEstimator, which you have already tried (good insight), below is another option:
RDDInfo[] getRDDStorageInfo()
Return information about what RDDs are cached, whether they are in memory or on disk, how much space they take, etc.
The Spark Storage tab actually uses this. See the Spark docs.
Below is the implementation from Spark:
/**
 * :: DeveloperApi ::
 * Return information about what RDDs are cached, if they are in mem or on disk, how much space
 * they take, etc.
 */
@DeveloperApi
def getRDDStorageInfo: Array[RDDInfo] = {
  getRDDStorageInfo(_ => true)
}

private[spark] def getRDDStorageInfo(filter: RDD[_] => Boolean): Array[RDDInfo] = {
  assertNotStopped()
  val rddInfos = persistentRdds.values.filter(filter).map(RDDInfo.fromRdd).toArray
  rddInfos.foreach { rddInfo =>
    val rddId = rddInfo.id
    val rddStorageInfo = statusStore.asOption(statusStore.rdd(rddId))
    rddInfo.numCachedPartitions = rddStorageInfo.map(_.numCachedPartitions).getOrElse(0)
    rddInfo.memSize = rddStorageInfo.map(_.memoryUsed).getOrElse(0L)
    rddInfo.diskSize = rddStorageInfo.map(_.diskUsed).getOrElse(0L)
  }
  rddInfos.filter(_.isCached)
}
yourRDD.toDebugString also uses this (code here).
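A short usage sketch of the API above, assuming an active SparkContext `sc` and that the RDD of interest has been cached; the RDD name "my_df" is just an illustrative label:
// Name and cache the RDD so it shows up in the storage status.
df.rdd.setName("my_df").cache().count()

sc.getRDDStorageInfo
  .filter(_.name == "my_df")
  .foreach { info =>
    println(s"cached partitions = ${info.numCachedPartitions}, " +
      s"memory = ${info.memSize} bytes, disk = ${info.diskSize} bytes")
  }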
General note:
In my opinion, to get an optimal number of records in each partition, and to check that your repartitioning is correct and the records are uniformly distributed, I would suggest trying something like the code below, adjusting your repartition number and then measuring the partition sizes. That is a more sensible way to address this kind of problem.
yourdf.rdd.mapPartitionsWithIndex { case (index, rows) => Iterator((index, rows.size)) }
  .toDF("PartitionNumber", "NumberOfRecordsPerPartition")
  .show
or use existing Spark functions (depending on your Spark version):
import org.apache.spark.sql.functions._
df.withColumn("partitionId", spark_partition_id()).groupBy("partitionId").count.show
My suggestion is:
from sys import getsizeof

def compare_size_two_object(one, two):
    '''compare the size of two objects in bytes'''
    print(getsizeof(one), 'versus', getsizeof(two))

Use Spark groupByKey to dedup RDD which causes a lot of shuffle overhead

I have a key-value pair RDD. The RDD contains some elements with duplicate keys, and I want to split the original RDD into two RDDs: one stores elements with unique keys, and the other stores the remaining elements. For example,
Input RDD (6 elements in total):
<k1,v1>, <k1,v2>, <k1,v3>, <k2,v4>, <k2,v5>, <k3,v6>
Result:
Unique-keys RDD (stores one element per key; for multiple elements with the same key, any one of them is accepted):
<k1,v1>, <k2, v4>, <k3,v6>
Duplicated-keys RDD (stores the remaining elements with duplicated keys):
<k1,v2>, <k1,v3>, <k2,v5>
In the above example, unique RDD has 3 elements, and the duplicated RDD has 3 elements too.
I tried groupByKey() to group elements with the same key together. For each key, there is a sequence of elements. However, the performance of groupByKey() is not good because the element values are very large, which leads to a very large shuffle write.
So I was wondering if there is any better solution. Or is there a way to reduce the amount of data being shuffled when using groupByKey()?
EDIT: Given the new information in the edit, I would first create the unique RDD, and then the duplicate RDD using the unique and the original one:
val inputRdd: RDD[(K, V)] = ...

val uniqueRdd: RDD[(K, V)] = inputRdd.reduceByKey((x, y) => x) // keep just a single value for each key

val duplicateRdd = inputRdd
  .join(uniqueRdd)
  .filter { case (k, (v1, v2)) => v1 != v2 }
  .map { case (k, (v1, v2)) => (k, v1) } // v2 came from the unique RDD
There is also some room for optimization.
In the solution above there will be 2 shuffles (reduceByKey and join).
If we partition the inputRdd by key from the start, we won't need any additional shuffles.
Using this code should produce much better performance:
val inputRdd2 = inputRdd.partitionBy(new HashPartitioner(partitions = 200))
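A sketch of how the pre-partitioned RDD could then be reused (my addition; the persist call and variable names are illustrative): since inputRdd2 already carries the HashPartitioner, reduceByKey can reuse it without a shuffle, and joining two RDDs that share a partitioner is also shuffle-free.
// Keep the partitioned RDD materialized, since it is used twice below.
val partitionedInput = inputRdd2.persist()

// reduceByKey sees the existing HashPartitioner(200): map-side aggregation only, no shuffle.
val uniqueRdd2 = partitionedInput.reduceByKey((x, _) => x)

// Both sides share the same partitioner, so the join is a narrow dependency: no shuffle.
val duplicateRdd2 = partitionedInput
  .join(uniqueRdd2)
  .filter { case (_, (v1, v2)) => v1 != v2 }
  .map { case (k, (v1, _)) => (k, v1) }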
Original Solution:
You can try the following approach:
first count the number of occurrences of each pair, and then split into the two RDDs.
val inputRdd: RDD[(K, V)] = ...

val countRdd: RDD[((K, V), Int)] = inputRdd
  .map((_, 1))
  .reduceByKey(_ + _)
  .cache

val uniqueRdd = countRdd.map(_._1)

val duplicateRdd = countRdd
  .filter(_._2 > 1)
  .flatMap { case (kv, count) =>
    (1 to count - 1).map(_ => kv)
  }
Please use combineByKey, which results in a combiner being used on the map task and hence reduces the shuffled data.
The combiner logic depends on your business logic (a sketch of one possibility follows the list below).
http://bytepadding.com/big-data/spark/groupby-vs-reducebykey/
There are multiple ways to reduce shuffle data:
1. Write less from the map task by using a combiner.
2. Send aggregated, serialized objects from map to reduce.
3. Use combine input formats to enhance the efficiency of combiners.
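As a concrete illustration of the combiner idea (this sketch is mine, using toy data shaped like the question's example; the actual combiner logic would follow your business rules): combineByKey lets each map task pre-aggregate per key, so the shuffle carries one small record per key per map task instead of every value.
import org.apache.spark.rdd.RDD

// Toy data shaped like the question's example.
val inputRdd: RDD[(String, String)] = sc.parallelize(Seq(
  "k1" -> "v1", "k1" -> "v2", "k1" -> "v3",
  "k2" -> "v4", "k2" -> "v5",
  "k3" -> "v6"))

// Count occurrences per key with map-side combining; only (key, count) pairs are shuffled.
val keyCounts: RDD[(String, Int)] = inputRdd.combineByKey(
  (_: String) => 1,                     // createCombiner: first value seen for a key in a partition
  (count: Int, _: String) => count + 1, // mergeValue: runs map-side, before the shuffle
  (c1: Int, c2: Int) => c1 + c2         // mergeCombiners: merges per-partition counts
)

val duplicatedKeys = keyCounts.filter(_._2 > 1).keys // k1, k2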

Spark RDD do not get processed in multiple nodes

I have a use case wherein I create an RDD from a Hive table. I wrote business logic that operates on every row in the Hive table. My assumption was that when I create the RDD and run a map over it, it utilises all my Spark executors. But what I see in my log is that only one node processes the RDD while the rest of my 5 nodes sit idle. Here is my code:
val flow = hiveContext.sql("select * from humsdb.t_flow")
var x = flow.rdd.map { row =>
< do some computation on each row>
}
Any clue where I went wrong?
As specified here by @jaceklaskowski:
By default, a partition is created for each HDFS partition, which by
default is 64MB (from Spark’s Programming Guide).
If your input data is less than 64MB (and you are using HDFS), then by default only one partition will be created.
Spark will use all nodes once the data is big enough to be split into multiple partitions.
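A quick way to check this, along the lines of the answer above (a sketch; the repartition count of 48 is arbitrary and should be tuned to your cluster):
val flow = hiveContext.sql("select * from humsdb.t_flow")

// If the whole table fits in a single input split, this prints 1,
// and the map will run as a single task on a single executor.
println(s"number of partitions = ${flow.rdd.getNumPartitions}")

// Illustrative fix: spread the rows before the map.
val distributed = flow.repartition(48)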
Could there be a possibility that your data is skewed?
To rule out this possibility, do the following and rerun the code.
val flow = hiveContext.sql("select * from humsdb.t_flow").repartition(200)
var x = flow.rdd.map { row =>
< do some computation on each row>
}
Further, if your map logic depends on a particular column, you can do the following:
val flow = hiveContext.sql("select * from humsdb.t_flow").repartition(col("yourColumnName"))
var x = flow.rdd.map { row =>
< do some computation on each row>
}
A good partition column could be a date column.

How to effectively read millions of rows from Cassandra?

I have the hard task of reading millions of rows from a Cassandra table. This table contains around 40~50 million rows.
The data is internal URLs for our system and we need to fire all of them. To fire them, we are using Akka Streams and it has been working pretty well, applying back pressure as needed. But we still have not found a way to read everything effectively.
What we have tried so far:
Reading the data as a stream using Akka Streams. We are using phantom-dsl, which provides a publisher for a specific table. But it does not read everything, only a small portion; it stops reading after the first 1 million rows.
Reading with Spark by a specific date. Our table is modeled like a time-series table, with year, month, day, minutes... columns. Right now we are selecting by day, so Spark will not fetch too much data to process, but it is a pain to select all those days.
The code is the following:
val cassandraRdd =
sc
.cassandraTable("keyspace", "my_table")
.select("id", "url")
.where("year = ? and month = ? and day = ?", date.getYear, date.getMonthOfYear, date.getDayOfMonth)
Unfortunately I can't iterate over the partitions to fetch less data; I have to use collect because it complains that the actor is not serializable.
val httpPool: Flow[(HttpRequest, String), (Try[HttpResponse], String), HostConnectionPool] = Http().cachedHostConnectionPool[String](host, port).async

val source =
  Source
    .actorRef[CassandraRow](10000000, OverflowStrategy.fail)
    .map(row => makeUrl(row.getString("id"), row.getString("url")))
    .map(url => HttpRequest(uri = url) -> url)

val ref = Flow[(HttpRequest, String)]
  .via(httpPool.withAttributes(ActorAttributes.supervisionStrategy(decider)))
  .to(Sink.actorRef(httpHandlerActor, IsDone))
  .runWith(source)

cassandraRdd.collect().foreach { row =>
  ref ! row
}
I would like to know if any of you have such experience on reading millions of rows for doing anything different from aggregation and so on.
I have also thought about reading everything and sending it to a Kafka topic, where I would receive it using streaming (Spark or Akka), but the problem would be the same: how do I load all that data effectively?
EDIT
For now, I'm running on a cluster with a reasonable amount of memory (100GB), doing a collect and iterating over it.
Also, this is far different from getting big data with Spark and analyzing it with things like reduceByKey, aggregateByKey, etc.
I need to fetch and send everything over HTTP =/
So far it is working the way I did it, but I'm afraid this data will keep growing to a point where fetching everything into memory makes no sense.
Streaming this data, fetching it in chunks, would be the best solution, but I haven't found a good approach for it yet.
In the end, I'm thinking of using Spark to get all that data, generating a CSV file, and using Akka Stream IO to process it; this way I would avoid keeping a lot of things in memory, since it takes hours to process every million rows.
Well, after spending some time reading, talking with other people and doing tests, the result could be achieved with the following code sample:
val sc = new SparkContext(sparkConf)

val cassandraRdd = sc.cassandraTable(config.getString("myKeyspace"), "myTable")
  .select("key", "value")
  .as((key: String, value: String) => (key, value))
  .partitionBy(new HashPartitioner(2 * sc.defaultParallelism))
  .cache()

cassandraRdd
  .groupByKey()
  .foreachPartition { partition =>
    partition.foreach { row =>
      implicit val system = ActorSystem()
      implicit val materializer = ActorMaterializer()

      val myActor = system.actorOf(Props(new MyActor(system)), name = "my-actor")

      val source = Source.fromIterator { () => row._2.toIterator }

      source
        .map { str =>
          myActor ! Count
          str
        }
        .to(Sink.actorRef(myActor, Finish))
        .run()
    }
  }

sc.stop()

class MyActor(system: ActorSystem) extends Actor {
  var count = 0

  def receive = {
    case Count =>
      count = count + 1
    case Finish =>
      println(s"total: $count")
      system.shutdown()
  }
}

case object Count
case object Finish
What I'm doing is the following:
Try to achieve a good number of partitions and a partitioner by using the partitionBy and groupBy methods.
Use cache to avoid repeated data shuffles, which would make Spark move large data across nodes, use high IO, etc.
Create the whole actor system, with its dependencies, as well as the stream, inside the foreachPartition method. Here is a trade-off: you could have only one ActorSystem, but then you would have to misuse .collect as I wrote in the question. By creating everything inside, you still have the ability to run things distributed across your cluster in Spark.
Finish each actor system at the end of the iterator using Sink.actorRef with a kill message (Finish).
Perhaps this code could be improved even further, but so far I'm happy not to use .collect anymore and to work only inside Spark.
