How to create Spark RDD from an iterator? - apache-spark

To make it clear, I am not looking for RDD from an array/list like
List<Integer> list = Arrays.asList(1, 2, 3, 4, 5, 6, 7); // sample
JavaRDD<Integer> rdd = new JavaSparkContext().parallelize(list);
How can I create a spark RDD from a java iterator without completely buffering it in memory?
Iterator<Integer> iterator = Arrays.asList(1, 2, 3, 4).iterator(); //sample iterator for illustration
JavaRDD<Integer> rdd = new JavaSparkContext().what("?", iterator); //the Question
Additional Question:
Is it a requirement for source to be re-readable(or capable to read many times) to offer resilience for RDD? In other words, since iterators are fundamentally read-once, is it even possible to create Resilient Distributed Datasets(RDD) from iterators?

As somebody else said, you could do something with spark streaming, but as for pure spark, you can't, and the reason is that what you're asking goes against spark's model. Let me explain.
To distribute and parallelize work, spark has to divide it in chunks. When reading from HDFS, that 'chunking' is done for Spark by HDFS, since HDFS files are organized in blocks. Spark will generally generate one task per block.
Now, iterators only provide sequential access to your data, so it's impossible for spark to organize it in chunks without reading it all in memory.
It may be possible to build a RDD that has a single iterable partition, but even then, it is impossible to say if the implementation of the Iterable could be sent to workers. When using sc.parallelize() spark creates partitions that implement serializable so each partition can be sent to a different worker. The iterable could be over a network connection, or file in the local FS, so they cannot be sent to the workers unless they are buffered in memory.

Super old question but I would just create the iterators in a flatMap after serialization.
var ranges = Arrays.asList(Pair.of(1,7), Pair.of(0,5));
JavaRDD<Integer> data = sparkContext.parallelize(ranges).flatMap(pair -> Flux.range(pair.left(), pair.right()).toStream().iterator());

Related

Is there a way to write RDD rows to HDFS or S3 from inside a map transformation?

I'm aware that the typical way of writing RDD or Dataframe rows to HDFS or S3 is by using saveAsTextFile or df.write. However, I would like to figure out how to write individual records from inside a map transformation like this:
myRDD.map(row => {
if(row.contains("something")) {
// write record to HDFS or S3
}
row
}
I know that this can be accomplished with the following code,
val newRDD = myRDD.filter(row => row.contains("something"))
newRDD.saveAsTextFile("myFile")
but I want to continue processing the original myRDD after writing to HDFS and that would require caching myRDD and I am low on memory resources.
I want to continue processing the original myRDD after writing to HDFS and that would require caching myRDD and I am low on memory resources.
The above statement is not correct. You can operate on an RDD further without caching if you have low memory.
You can write inside a map() function using the Hadoop API, but it's not a good idea to operate terminal actions inside a map() function. map() operations should be side effect free. However you can use the mappartition() function.
You don't need to cache an RDD for doing subsequent operations on it. Caching helps in avoiding recomputation, but RDDs are immutable. A new RDD will be created (preserving the lineage) on each and every transformation.

How to execute computation per executor in Spark

In my computation, I
first broadcast some data, say bc,
then compute some big data shared by all executor/partition:val shared = f(bc)
then run the distributed computing, using shared data
To avoid computing the shared data on all RDD items, I can use .mapPartitions but I have much more partitions than executors. So it run the computation of shared data more times than necessary.
I found a simple method to compute the shared data only once per executor (which I understood is the JVM actually running the spark tasks): using lazy val on the broadcast data.
// class to be Broadcast
case class BC(input: WhatEver){
lazy val shared = f(input)
}
// in the spark code
val sc = ... // init SparkContext
val bc = sc.broadcast(BC(...))
val initRdd = sc.parallelize(1 to 10000, numSlices = 10000)
initRDD.map{ i =>
val shared = bc.value.shared
... // run computation using shared data
}
I think this does what I want, but
I am not sure, can someone guaranties it?
I am not sure lazy val is the best way to manage concurrent access, especially with respect to the spark internal distribution system. Is there a better way?
if computing shared fails, I think it will be recomputed for all RDD items with possible retries, instead of simply stopping the whole job with a unique error.
So, is there a better way?

spark transform method behaviour in multiple partition

I am using Kafka Streaming to read data from Kafka topic, and i want to join every RDD that i get in the stream to an existing RDD. So i think using "transform" is the best option (Unless any one disagrees, and suggest a better approach)
And, I read following example of "transform" method on DStreams in Spark:
val spamInfoRDD = ssc.sparkContext.newAPIHadoopRDD(...) // RDD containing spam information
val cleanedDStream = wordCounts.transform { rdd =>
rdd.join(spamInfoRDD).filter(...) // join data stream with spam information to do data cleaning
...
}
But lets say, i have 3 partitions in the Kafka topic, and that i invoke 3 consumers to read from those. Now, this transform method will be called in three separate threads in parallel.
I am not sure if joining the RDDs in this case will be Thread-safe and this will not result in data-loss. (considering that RDDs are immutable)
Also, if you say that its thread-safe then wouldn't the performance be low since we are creating so many RDDs and then joining them?
can anybody suggest?

How to use RDD checkpointing to share datasets across Spark applications?

I have a spark application, and checkpoint the rdd in the code, a simple code snippet is as follows(It is very simple, just for illustrating my question.):
#Test
def testCheckpoint1(): Unit = {
val data = List("Hello", "World", "Hello", "One", "Two")
val rdd = sc.parallelize(data)
//sc is initialized in the setup
sc.setCheckpointDir(Utils.getOutputDir())
rdd.checkpoint()
rdd.collect()
}
When the rdd is checkpointed on the file system.I write another Spark application and would pick up the data checkpointed in the above code,
and make it as an RDD as a starting point in this second application
The ReliableCheckpointRDD is exactly the RDD that does the work, but this RDD is private to Spark.
So,since ReliableCheckpointRDD is private, it looks spark doesn't recommend to use ReliableCheckpointRDD outside spark.
I would ask if there is a way to do it.
Quoting the scaladoc of RDD.checkpoint (highlighting mine):
checkpoint(): Unit Mark this RDD for checkpointing. It will be saved to a file inside the checkpoint directory set with SparkContext#setCheckpointDir and all references to its parent RDDs will be removed. This function must be called before any job has been executed on this RDD. It is strongly recommended that this RDD is persisted in memory, otherwise saving it on a file will require recomputation.
So, RDD.checkpoint will cut the RDD lineage and trigger partial computation so you've got something already pre-computed in case your Spark application may fail and stop.
Note that RDD checkpointing is very similar to RDD caching but caching would make the partial datasets private to some Spark application.
Let's read Spark Streaming's Checkpointing (that in some way extends the concept of RDD checkpointing making it closer to your needs to share the results of computations between Spark applications):
Data checkpointing Saving of the generated RDDs to reliable storage. This is necessary in some stateful transformations that combine data across multiple batches. In such transformations, the generated RDDs depend on RDDs of previous batches, which causes the length of the dependency chain to keep increasing with time. To avoid such unbounded increases in recovery time (proportional to dependency chain), intermediate RDDs of stateful transformations are periodically checkpointed to reliable storage (e.g. HDFS) to cut off the dependency chains.
So, yes, in a sense you could share the partial results of computations in a form of RDD checkpointing, but why would you even want to do it if you could save the partial results using the "official" interface using JSON, parquet, CSV, etc.
I doubt using this internal persistence interface could give you more features and flexibility than using the aforementioned formats. Yes, it is indeed technically possible to use RDD checkpointing to share datasets between Spark applications, but it's too much effort for not much gain.

How to broadcast RDD in PySpark?

Is it possible to broadcast an RDD in Python?
I am following the book "Advanced Analytics with Spark: Patterns for Learning from Data at Scale" and on chapter 3 an RDD needs to be broadcasted. I'm trying to follow the examples using Python instead of Scala.
Anyway, even with this simple example I have an error:
my_list = ["a", "d", "c", "b"]
my_list_rdd = sc.parallelize(my_list)
sc.broadcast(my_list_rdd)
The error being:
"It appears that you are attempting to broadcast an RDD or reference an RDD from an "
Exception: It appears that you are attempting to broadcast an RDD or reference an RDD from an
action or transformation. RDD transformations and actions can only be invoked by the driver, n
ot inside of other transformations; for example, rdd1.map(lambda x: rdd2.values.count() * x) i
s invalid because the values transformation and count action cannot be performed inside of the
rdd1.map transformation. For more information, see SPARK-5063.
I don't really understand what "action or transformation" the error is referring to.
I am using spark-2.1.1-hadoop2.7.
Important Edit: the book is correct. I just failed to read that it wasn't an RDD that was being broadcasted but a map version of it obtained with collectAsMap().
Thanks!
Is it possible to broadcast an RDD in Python?
TL;DR No.
When you think what RDD really is you'll find it's simply not possible. There is nothing in an RDD you could broadcast. It's too fragile (so to speak).
RDD is a data structure that describes a distributed computation on some datasets. By the features of RDD you can describe what and how to compute. It's an abstract entity.
Quoting the scaladoc of RDD:
Represents an immutable, partitioned collection of elements that can be operated on in parallel
Internally, each RDD is characterized by five main properties:
A list of partitions
A function for computing each split
A list of dependencies on other RDDs
Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)
There's not much you could broadcast as (quoting SparkContext.broadcast method's scaladoc):
broadcast[T](value: T)(implicit arg0: ClassTag[T]): Broadcast[T] Broadcast a read-only variable to the cluster, returning a org.apache.spark.broadcast.Broadcast object for reading it in distributed functions. The variable will be sent to each cluster only once.
You can only broadcast a real value, but an RDD is just a container of values that are only available when executors process its data.
From Broadcast Variables:
Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner.
And later in the same document:
This means that explicitly creating broadcast variables is only useful when tasks across multiple stages need the same data or when caching the data in deserialized form is important.
You could however collect the dataset an RDD holds and broadcast it as follows:
my_list = ["a", "d", "c", "b"]
my_list_rdd = sc.parallelize(my_list)
sc.broadcast(my_list_rdd.collect) // <-- collect the dataset
At "collect the dataset" step, the dataset leaves an RDD space and becomes a locally-available collection, a Python value, that can be then broadcast.
you cannot broadcast an RDD. you broadcast values to all your executors nodes that is used multiple times while process your RDD. So in your code you should collect your RDD before broadcasting it. The collect converts a RDD into a local python object which can be broadcasted without issues.
sc.broadcast(my_list_rdd.collect())
When you broadcast a value, the value is serialized and sent over the network to all the executor nodes. your my_list_rdd is just a reference to an RDD that is distributed across multiple nodes. serializing this reference and broadcasting this reference to all worker nodes wouldn't mean anything in the worker node. so you should collect the values of your RDD and broadcast the value instead.
more information on Spark Broadcast can be found here
Note: If your RDD is too large, the application might run into a OutOfMemory error. The collect method pull all the data the driver's memory which usually isn't large enough.

Resources