Performing operations only on a subset of an RDD - apache-spark

I would like to perform some transformations on only a subset of an RDD (to make experimenting in the REPL faster).
Is it possible?
RDD has a take(num: Int): Array[T] method. I think I need something similar, but one that returns an RDD[T].

You can use RDD.sample to get an RDD out, not an Array. For example, to sample ~1% without replacement:
val data = ...
data.count
res1: Long = 18066983
val sample = data.sample(false, 0.01, System.currentTimeMillis().toInt)
sample.count
res3: Long = 180190
The third parameter is a seed, and is thankfully optional in the next Spark version.

RDDs are distributed collections that are materialized only on actions. It is not possible to truncate an RDD to a fixed size and still get an RDD back (hence RDD.take(n) returns an Array[T], just like collect).
If you want similarly sized RDDs regardless of the input size, you can truncate the items in each partition. This gives you better control over the absolute number of items in the resulting RDD. The size of the resulting RDD will depend on the number of partitions (Spark parallelism).
An example from spark-shell:
import org.apache.spark.rdd.RDD
val numberOfPartitions = 1000
val millionRdd: RDD[Int] = sc.parallelize(1 to 1000000, numberOfPartitions)
val millionRddTruncated: RDD[Int] = millionRdd.mapPartitions(_.take(10))
val billionRddTruncated: RDD[Int] = sc.parallelize(1 to 1000000000, numberOfPartitions).mapPartitions(_.take(10))
millionRdd.count // 1000000
millionRddTruncated.count // 10000 = 10 items * 1000 partitions
billionRddTruncated.count // 10000 = 10 items * 1000 partitions

Apparently it's possible to create an RDD subset by first calling its take method and then passing the returned array to SparkContext's makeRDD[T](seq: Seq[T], numSlices: Int = defaultParallelism), which returns a new RDD.
This approach seems dodgy to me though. Is there a nicer way?

I always use the parallelize function of SparkContext to distribute an Array[T], but it seems makeRDD does the same thing. Both are correct.


What is the performance difference between accumulator and collect() in Spark?

Accumulators are shared variables in Spark that are updated by executors but can be read only by the driver.
collect() in Spark brings all the data from the executors into the driver.
In both cases the data ultimately ends up in the driver. So what is the difference in performance between using an accumulator and using collect() to convert a large RDD into a List?
Code to convert a DataFrame to a List using an accumulator:
val queryOutput = spark.sql(query)
val acc = spark.sparkContext.collectionAccumulator[Map[String,Any]]("JsonCollector")
// foreach returns Unit, so there is nothing useful to assign here
queryOutput.foreach(a => acc.add(convertRowToJSON(a)))
def convertRowToJSON(row: Row): Map[String,Any] = {
  val m = row.getValuesMap[Any](row.schema.fieldNames)
  m
}
Code to convert dataframe to list using collect()
val queryOutput = spark.sql(query)
Convert large RDD to LIST
It is not a good idea. collect will move data from all executors into driver memory. If there is not enough memory, it will throw an Out Of Memory (OOM) exception. If your data fits in the memory of a single machine, then you probably don't need Spark.
Spark natively supports accumulators of numeric types, and programmers can add support for new types. They can be used to implement counters (as in MapReduce) or sums. OUT parameter of accumulator should be a type that can be read atomically (e.g., Int, Long), or thread-safely (e.g., synchronized collections) because it will be read from other threads.
CollectionAccumulator.value returns a java.util.List (an ArrayList implementation), and it will also throw an OOM if the data is larger than driver memory.
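To make the comparison concrete, here is a minimal sketch that runs both routes on a small RDD. The local master, the app name, the object name AccumulatorVsCollect and the toy data are assumptions for illustration only, not part of the question's code. Either way the full result ends up in driver memory, so neither route avoids the OOM risk described above.

```scala
import org.apache.spark.sql.SparkSession
import scala.collection.JavaConverters._

object AccumulatorVsCollect {
  // Returns (elements gathered through the accumulator, elements gathered through collect).
  def run(): (List[Int], List[Int]) = {
    // Local session purely for illustration; master and app name are assumptions.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("acc-vs-collect")
      .getOrCreate()
    val sc = spark.sparkContext
    val rdd = sc.parallelize(1 to 1000, 4)

    // Route 1: each task adds its elements to a collection accumulator,
    // and the driver reads the merged list afterwards. Order is not guaranteed.
    val acc = sc.collectionAccumulator[Int]("collector")
    rdd.foreach(x => acc.add(x))
    val viaAccumulator = acc.value.asScala.toList

    // Route 2: collect() ships every partition to the driver, in partition order.
    val viaCollect = rdd.collect().toList

    spark.stop()
    (viaAccumulator, viaCollect)
  }
}
```

Both lists hold the same elements; only the collect() result has a defined order.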

Splitting a pipeline in spark?

Assume that I have a Spark pipeline like this (formatted to emphasize the important steps):
val foos1 =
I'm adding a similar pipeline:
val foos2 =
Then I do something with both results.
I'd like to avoid doing someComplicatedProcessing twice (not parsing the file twice is nice, too).
Is there a way to take the stream after the .map(someComplicatedProcessing) step and create two parallel streams feeding off it?
I know that I can store the intermediate result on disk and thus save the CPU time at the cost of more I/O. Is there a better way? What words do I web-search for?
First option - cache intermediate results:
val cached =
val foos1 =
val foos2 =
Second option - use RDD and make single pass:
val foos =
.flatMap(x => Seq(("t1", transform1(x)), ("t2", transform2(x))))
val foos1 = foos("t1")
val foos2 = foos("t2")
The second option may require some type wrangling if transform1 and transform2 have incompatible return types.
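For the first option, a minimal sketch of what caching the shared step might look like. The input data, the bodies of someComplicatedProcessing, transform1 and transform2, the object name and the local master setting are all placeholders assumed for illustration. The accumulator is only there to count how often the expensive step runs.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object SplitPipeline {
  // Hypothetical stand-ins for the functions named in the question.
  def someComplicatedProcessing(x: Int): Int = x * x
  def transform1(x: Int): Int = x + 1
  def transform2(x: Int): String = s"value-$x"

  // Returns (size of foos1, size of foos2, number of calls to the expensive step).
  def run(): (Long, Long, Long) = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("split-pipeline")
      .getOrCreate()
    val sc = spark.sparkContext

    // Counts invocations of the expensive step across all tasks.
    val calls = sc.longAccumulator("calls")

    // Shared prefix of both pipelines, persisted so the second branch can reuse it.
    val cached = sc.parallelize(1 to 100, 4)
      .map { x => calls.add(1); someComplicatedProcessing(x) }
      .persist(StorageLevel.MEMORY_AND_DISK)

    // Two branches feeding off the same cached RDD.
    val foos1 = cached.map(transform1)
    val foos2 = cached.map(transform2)

    val n1 = foos1.count() // first action computes and caches the shared step
    val n2 = foos2.count() // second action should read the cached partitions
    val invoked = calls.sum

    cached.unpersist()
    spark.stop()
    (n1, n2, invoked)
  }
}
```

If the cached partitions stay available, the call counter should match the input size rather than twice that. Task retries or evicted partitions can push it higher, so treat it as a diagnostic, not a guarantee.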

Spark 1.6.2's RDD caching seems to do weird things with filters in some cases

I have an RDD:
avroRecord: org.apache.spark.rdd.RDD[com.rr.eventdata.ViewRecord] = MapPartitionsRDD[75]
I then filter the RDD for a single matching value:
val siteFiltered = avroRecord.filter(_.getSiteId == 1200)
I now count how many distinct values I get for SiteId. Given the filter, it should be "1". Here are two ways I do it, without cache and with cache:
val basic =
val cached =
The result indicates that the cached version isn't filtered at all:
basic: Long = 1
cached: Long = 93
"93" isn't even the expected value if the filter was ignored completely (that answer is "522"). It also isn't a problem with "distinct" as the values are real ones.
It seems like the cached RDD has some odd partial version of the filter.
Anyone know what's going on here?
I suppose the problem is that you have to cache the result of your RDD before performing any action on it.
Spark builds a DAG that represents the execution of your program. Each node is a transformation or an action on your RDD. Without caching the RDD, each action forces Spark to execute the whole DAG from the beginning (or from the last cache invocation).
So, your code should work if you make the following changes:
val siteFiltered =
avroRecord.filter(_.getSiteId == 1200)
val basic = siteFiltered.distinct.count
// Yes, I know, done this way the second count makes no sense at all
val cached = siteFiltered.distinct.count
There is no issue with your code. It should work fine.
I tried the same thing locally and it works without any discrepancies across multiple runs.
I have the following data:
I did the same thing as you; here is my code, which works fine:
scala> var textrdd = sc.textFile("file:///data/pocs/blogs/eventrecords");
textrdd: org.apache.spark.rdd.RDD[String] = file:///data/pocs/blogs/eventrecords MapPartitionsRDD[123] at textFile at <console>:27
scala> var filteredRdd = textrdd.filter(_.split(",")(1).toDouble > 1)
filteredRdd: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[124] at filter at <console>:29
scala> => x.split(",")(1)).distinct.count
res36: Long = 12
scala> => x.split(",")(1)).distinct.count
res37: Long = 12

How does Spark keep track of the splits in randomSplit?

This question explains how Spark's random split works, How does Sparks RDD.randomSplit actually split the RDD, but I don't understand how spark keeps track of what values went to one split so that those same values don't go to the second split.
If we look at the implementation of randomSplit:
def randomSplit(weights: Array[Double], seed: Long): Array[DataFrame] = {
// It is possible that the underlying dataframe doesn't guarantee the ordering of rows in its
// constituent partitions each time a split is materialized which could result in
// overlapping splits. To prevent this, we explicitly sort each input partition to make the
// ordering deterministic.
val sorted = Sort(logicalPlan.output.map(SortOrder(_, Ascending)), global = false, logicalPlan)
val sum = weights.sum
val normalizedCumWeights = weights.map(_ / sum).scanLeft(0.0d)(_ + _)
normalizedCumWeights.sliding(2).map { x =>
new DataFrame(sqlContext, Sample(x(0), x(1), withReplacement = false, seed, sorted))
}.toArray
}
we can see that it creates two DataFrames that share the same sqlContext and with two different Sample(rs).
How are these two DataFrame(s) communicating with each other so that a value that fell in the first one is not included in the second one?
And is the data being fetched twice? (Assume the sqlContext is selecting from a DB, is the select being executed twice?).
It's exactly the same as sampling an RDD.
Assuming you have the weight array (0.6, 0.2, 0.2), Spark will generate one DataFrame for each range (0.0, 0.6), (0.6, 0.8), (0.8, 1.0).
When it's time to read a result DataFrame, Spark just goes over the parent DataFrame. For each item it generates a random number; if that number falls in the specified range, it emits the item. All child DataFrames share the same random number generator (technically, different generators with the same seed), so the sequence of random numbers is deterministic.
For your last question: if you did not cache the parent DataFrame, then the data for the input DataFrame will be re-fetched each time an output DataFrame is computed.
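The idea can be imitated in plain Scala without Spark. This is only a sketch of the mechanism described above, not Spark's actual sampler. The names RandomSplitSketch, sampleRange and randomSplit are made up for illustration, and it treats the whole input as a single ordered partition.

```scala
import scala.util.Random

object RandomSplitSketch {
  // Keep the items whose random draw falls in [lower, upper).
  // A fresh generator is created from the same seed on every call, so each
  // split sees exactly the same draw for the same item position.
  def sampleRange[T](items: Seq[T], lower: Double, upper: Double, seed: Long): Seq[T] = {
    val rng = new Random(seed)
    items.filter { _ =>
      val draw = rng.nextDouble()
      draw >= lower && draw < upper
    }
  }

  // Turn weights into cumulative bounds and build one split per adjacent pair.
  def randomSplit[T](items: Seq[T], weights: Seq[Double], seed: Long): List[Seq[T]] = {
    val sum = weights.sum
    val bounds = weights.map(_ / sum).scanLeft(0.0)(_ + _)
    bounds.sliding(2).map(b => sampleRange(items, b(0), b(1), seed)).toList
  }
}
```

Calling RandomSplitSketch.randomSplit(1 to 1000, Seq(0.6, 0.2, 0.2), seed = 42L) scans the input three times, once per split, yet no item lands in two splits because each scan replays the same sequence of draws. This is also why the input order must be deterministic: if the items arrived in a different order on a later scan, the same draw would be matched to a different item and the splits could overlap.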

How to get an Iterator of Rows using Dataframe in SparkSQL

I have a SparkSQL application that returns a large number of rows, too many to fit in memory, so I cannot use the collect function on the DataFrame. Is there a way to get all these rows as an Iterable instead of as one complete list?
I am executing this SparkSQL application using yarn-client.
Generally speaking, transferring all the data to the driver is a pretty bad idea, and most of the time there is a better solution. If you really want to go this way, you can use the toLocalIterator method on an RDD:
val df: org.apache.spark.sql.DataFrame = ???
df.cache // Optional, to avoid repeated computation, see docs for details
val iter: Iterator[org.apache.spark.sql.Row] = df.rdd.toLocalIterator
Actually, you can just use df.toLocalIterator. Here is the reference in the Spark source code:
/**
 * Return an iterator that contains all of [[Row]]s in this Dataset.
 * The iterator will consume as much memory as the largest partition in this Dataset.
 * Note: this results in multiple Spark jobs, and if the input Dataset is the result
 * of a wide transformation (e.g. join with different partitioners), to avoid
 * recomputing the input Dataset should be cached first.
 *
 * @group action
 * @since 2.0.0
 */
def toLocalIterator(): java.util.Iterator[T] = withCallback("toLocalIterator", toDF()) { _ =>
withNewExecutionId {
