'take' that transforms RDD - apache-spark

take(count) is an action on an RDD which returns an Array with the first count items.
Is there a transformation that returns an RDD with the first count items? (It is OK if the count is approximate.)
The best I can get is
val countPerPartition = (count / rdd.getNumPartitions.toDouble).ceil.toInt
rdd.mapPartitions(_.take(countPerPartition))
Update:
I do not want the data to be transferred to the driver. In my case, count may be quite large and the driver does not have enough memory to hold it. I want the data to remain parallelized for further transformations.

Why not rdd.map(..).take(X)? I.e., transform and then take. Don't be afraid of doing redundant work: until you call take, all the computations in Spark are lazily evaluated, so only about X transformations will actually happen.
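A quick way to see that laziness, as a minimal sketch (assuming a Spark 2.x SparkContext sc; the accumulator exists only to count how many elements actually get mapped):
val mapped = sc.longAccumulator("mapped")
val rdd = sc.parallelize(1 to 1000000, 100)
val firstTen = rdd
  .map { x => mapped.add(1); x * 2 }  // stand-in for an expensive transformation
  .take(10)                           // the action that triggers the lazy pipeline
println(mapped.value)                 // roughly 10, not 1,000,000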

Related

Should we always use rdd.count() instead of rdd.collect().size

rdd.collect().size will first move all the data to the driver; if the dataset is large, it could result in an OutOfMemoryError.
So, should we always use rdd.count() instead?
Or, in other words, in what situation would people prefer rdd.collect().size?
collect causes data to be processed and then fetched to the driver node.
For count you don't need:
Full processing - some columns may not need to be fetched or calculated, e.g. those not included in any filter. You don't need to load, process or transfer the columns that don't affect the count.
Fetch to driver node - each worker node can count its rows and the counts can be summed up.
I see no reason for calling collect().size.
Just for general knowledge, there is another way to get around #2, although for this case it is redundant and won't prevent #1: rdd.mapPartitions(p => Iterator(p.size)).reduce(_ + _)
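To make the difference concrete, a minimal sketch (assuming a SparkContext sc; the size and partition count are arbitrary):
val rdd = sc.parallelize(1 to 10000000, 100)
// count: each executor counts its own partitions; only the per-partition
// counts travel to the driver, where they are summed.
val total = rdd.count()
// collect().size: every element is shipped to the driver first and the local
// Array is measured there; this is what can OOM the driver.
val totalViaCollect = rdd.collect().size
// The per-partition variant mentioned above, for reference:
val totalViaPartitions = rdd.mapPartitions(p => Iterator(p.size)).reduce(_ + _)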
Assuming you're using the Scala size function on the array returned by rdd.collect(), I don't see any advantage in collecting the whole RDD just to get its number of rows.
This is the point of RDDs: to work on chunks of data in parallel so that transformations stay manageable. Usually the result is smaller than the original dataset because the data is somehow transformed/filtered/synthesized.
collect usually comes at the end of data processing, and if you run such an action you might also want to save the data, since it may have required some expensive computations and the collected data is presumably interesting/valuable.

Apache Spark - map and filter and take(1)

I know the usage of the map and filter transformations, but I want to clarify something: map changes the content of every element of an RDD one by one. If I use myrdd.map().filter().take(1), does the map() function stop when the first element passes the filter function? Or does the whole map() execute first, and then the filter takes action?
I'm trying to transform every RDD element, and if an element satisfies a condition, stop and return that element.
The documentation suggests there is no special shortcut for this; take simply scans partitions until it has collected enough elements:
Take the first num elements of the RDD.
It works by first scanning one partition, and use the results from
that partition to estimate the number of additional partitions needed
to satisfy the limit.
Translated from the Scala implementation in RDD#take().
Note this method should only be used if the resulting array is
expected to be small, as all the data is loaded into the driver’s
memory.
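For what the question is trying to do, a minimal sketch of the usual pattern (assuming a SparkContext sc; the transformation and condition are made up for illustration):
val rdd = sc.parallelize(1 to 1000000, 10)
val firstMatch = rdd
  .map(_ * 2)            // transform every element
  .filter(_ > 500000)    // keep only elements satisfying the condition
  .take(1)               // Array with at most one element
  .headOption            // Option[Int]: None if nothing matched
// take(1) runs a job on one partition first and only launches jobs on further
// partitions if that partition did not yield enough matching elements.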

Spark flatMapToPair vs [filter + mapToPair]

What is the performance difference between the blocks of code below?
1. FlatMapToPair: This code block uses a single transformation, but it basically has the filter condition inside it; when the condition holds it returns an empty list, effectively not allowing that element of the RDD to progress.
rdd.flatMapToPair(element -> {
    if (<condition>)
        return Lists.newArrayList();
    return Lists.newArrayList(new Tuple2<>(key, element));
})
2. [Filter + MapToPair]: This code block has two transformations: the first simply filters using the same condition as the block above, and a mapToPair transformation follows the filter.
rdd.filter(
(element) -> <condition>
).mapToPair(
(element) -> new Tuple2<>(key, element)
)
Is Spark intelligent enough to perform the same with both of these blocks of code regardless of the number of transformations, or will it perform worse with the second block because it uses two transformations?
Thanks
Actually Spark will perform worse in the first case because it has to initialize and then garbage collect a new ArrayList for each record. Over a large number of records this can add substantial overhead.
Otherwise Spark is "intelligent enough" to use lazy data structures and to combine multiple transformations which don't require shuffles into a single stage.
There are some situations where explicitly merging different transformations is beneficial (either to reduce the number of initialized objects or to keep the lineage shorter), but this is not one of them.
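For illustration, a hedged Scala sketch of the same two shapes (the condition and key function are made up); the flatMap variant returns an Option instead of allocating a fresh collection per record:
val rdd  = sc.parallelize(1 to 100)
val keep = (x: Int) => x % 2 == 0   // stand-in for <condition>
val key  = (x: Int) => x % 10       // stand-in for the key
// Two chained narrow transformations: still a single stage, no shuffle.
val pairsA = rdd.filter(keep).map(x => (key(x), x))
// One transformation, but no per-record ArrayList: an Option is enough.
val pairsB = rdd.flatMap(x => if (keep(x)) Some((key(x), x)) else None)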

DataFrame orderBy followed by limit in Spark

I have a program that generates a DataFrame on which it will run something like
Select Col1, Col2...
orderBy(ColX) limit(N)
However, when I collect the data at the end, I find that it causes the driver to OOM if I take a large enough top N.
Another observation is that if I do just the sort, or just the top, this problem does not happen. So it happens only when there is a sort and a top at the same time.
I am wondering why this could be happening? In particular, what is really going on underneath this combination of the two transforms? How does Spark evaluate a query with both sorting and a limit, and what is the corresponding execution plan underneath?
Also, just curious: does Spark handle sort and top differently between DataFrames and RDDs?
EDIT:
Sorry, I didn't mean collect;
what I originally meant is that when I call any action to materialize the data, the problem occurs regardless of whether it is collect (or any other action sending data back to the driver) or not. (So the problem is definitely not the output size.)
While it is not clear why this fails in this particular case, there are multiple issues you may encounter:
When you use limit it simply puts all the data on a single partition, no matter how big n is. So while it doesn't explicitly collect, it is almost as bad.
On top of that, orderBy requires a full shuffle with range partitioning, which can result in different issues when the data distribution is skewed.
Finally, when you collect, the results can be larger than the amount of memory available on the driver.
If you collect anyway there is not much you can improve here. At the end of the day driver memory will be the limiting factor, but there are still some possible improvements:
First of all, don't use limit.
Replace collect with toLocalIterator.
Use either orderBy |> rdd |> zipWithIndex |> filter (a sketch follows below), or, if the exact number of values is not a hard requirement, filter the data directly based on an approximated distribution as shown in Saving a spark dataframe in multiple parts without repartitioning (in Spark 2.0.0+ there is a handy approxQuantile method).
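A minimal sketch of the zipWithIndex variant (assuming a SparkSession spark; the DataFrame, column name and n are placeholders for illustration):
import org.apache.spark.sql.functions.col

val df = spark.range(0, 10000000).toDF("colX")   // stand-in for the real DataFrame
val n  = 1000000L

val topN = df.orderBy(col("colX"))               // full sort (range-partitioned shuffle)
  .rdd                                           // drop down to the RDD API
  .zipWithIndex()                                // global index that follows the sort order
  .filter { case (_, idx) => idx < n }           // keep the first n rows, still distributed
  .map(_._1)                                     // back to an RDD[Row], with no single-partition limit
// If rows must reach the driver, iterate incrementally instead of collecting:
// topN.toLocalIterator.foreach(println)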

Why does df.limit keep changing in Pyspark?

I'm creating a data sample from some dataframe df with
rdd = df.limit(10000).rdd
This operation takes quite some time (why, actually? can it not short-circuit after 10000 rows?), so I assume I have a new RDD now.
However, when I now work on rdd, it contains different rows every time I access it, as if it resamples over and over again. Caching the RDD helps a bit, but surely that's not safe?
What is the reason behind it?
Update: Here is a reproduction on Spark 1.5.2
from operator import add
from pyspark.sql import Row
rdd = sc.parallelize([Row(i=i) for i in range(1000000)], 100)
rdd1 = rdd.toDF().limit(1000).rdd
for _ in range(3):
    print(rdd1.map(lambda row: row.i).reduce(add))
The output is
499500
19955500
49651500
I'm surprised that .rdd doesn't fix the data.
EDIT:
To show that it gets trickier than the re-execution issue, here is a single action which produces incorrect results on Spark 2.0.0.2.5.0:
from pyspark.sql import Row
rdd = sc.parallelize([Row(i=i) for i in range(1000000)], 200)
rdd1 = rdd.toDF().limit(12345).rdd
rdd2 = rdd1.map(lambda x: (x, x))
rdd2.join(rdd2).count()
# result is 10240 despite doing a self-join
Basically, whenever you use limit, your results might be potentially wrong. I don't mean "just one of many samples", but really incorrect (since in this case the result should always be 12345).
Because Spark is distributed, in general it's not safe to assume deterministic results. Your example is taking the "first" 10,000 rows of a DataFrame. Here, there's ambiguity (and hence non-determinism) in what "first" means. That will depend on the internals of Spark. For example, it could be the first partition that responds to the driver. That partition could change with networking, data locality, etc.
Even once you cache the data, I still wouldn't rely on getting the same data back every time, though I certainly would expect it to be more consistent than reading from disk.
Spark is lazy, so each action you take recalculates the data returned by limit(). If the underlying data is split across multiple partitions, then every time you evaluate it, limit might be pulling from a different partition (i.e. if your data is stored across 10 Parquet files, the first limit call might pull from file 1, the second from file 7, and so on).
From the Spark docs:
The LIMIT clause is used to constrain the number of rows returned by the SELECT statement. In general, this clause is used in conjunction with ORDER BY to ensure that the results are deterministic.
So you need to sort the rows beforehand if you want the call to .limit() to be deterministic. But there is a catch! If you sort by a column that doesn't have unique values for every row, the so-called "tied" rows (rows with the same sorting key value) will not be deterministically ordered, so .limit() might still be nondeterministic.
You have two options to work around this:
Make sure you include a unique row id in the sorting call.
For example df.orderBy('someCol', 'rowId').limit(n).
You can define the rowId like so:
from pyspark.sql import functions as func
df = df.withColumn('rowId', func.monotonically_increasing_id())
If you only need a deterministic result within a single run, you could simply cache the result of limit, df.limit(n).cache(), so that at least the results of that limit do not change across consecutive action calls that would otherwise recompute the results of limit and mess things up.
