I want to cache some data (ndarrays) locally on the worker nodes so I can compare it with the ndarrays arriving in RDDs from Spark Streaming. What is the best way to do this?
I want to compare the ndarrays stored in my files with each single ndarray passed in from Spark Streaming. It doesn't seem like I can load that data into an RDD, since I cannot iterate over one RDD inside the map function of another RDD. I also tried loading the arrays into a list on the master node and broadcasting it to the worker nodes, but I got an error that the broadcast variable is not iterable when I tried to loop over it and compare it with the incoming data.
The issue here is that you need to read the actual value of the broadcast variable through its value attribute. Following the example in the comment by #user9613318:
bd_array = sc.broadcast(np.arange(100))
This creates a NumPy array for that range and broadcasts it to all workers. If you use the variable as just 'bd_array', you get a Broadcast object, which has methods such as persist and destroy but is not iterable. If you read it with 'bd_array.value', you get back the broadcast NumPy array, which can be iterated over (some docs here).
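For example, one way the comparison could look (a minimal sketch; the file names, shapes, and distance metric are made up, not from the question):
import numpy as np

# Hypothetical: reference arrays loaded from local files on the driver
reference_arrays = [np.load(p) for p in ["ref_a.npy", "ref_b.npy"]]
bd_refs = sc.broadcast(reference_arrays)

def compare_with_refs(incoming):
    # incoming is one ndarray from the streamed RDD;
    # bd_refs.value is the plain Python list of reference arrays
    return [float(np.linalg.norm(incoming - ref)) for ref in bd_refs.value]

# e.g. stream.map(compare_with_refs) inside the streaming job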
Related
I want to use PySpark to read in hundreds of csv files and create a single dataframe that is (roughly) the concatenation of all the csvs. Each csv can fit in memory, but not more than one or two at a time, so this seems like a good fit for PySpark. My strategy is not working, and I think it is because I am trying to make a PySpark dataframe inside the kernel function of my map call, which results in an error:
from pyspark.sql import SparkSession

# initiate spark session and other variables
sc = SparkSession.builder.master("local").appName("Test").config(
    "spark.driver.bindAddress", "127.0.0.1").getOrCreate()

file_path_list = [path1, path2]  # list of string path variables

# make an rdd object so i can use .map:
rdd = sc.sparkContext.parallelize(file_path_list)

# make a kernel function for my future .map() application
def kernel_f(path):
    df = sc.read.options(delimiter=",", header=True).csv(path)
    return df

# apply .map
rdd2 = rdd.map(kernel_f)

# see first dataframe (so excited)
rdd2.take(2)[0].show(3)
This throws an error:
PicklingError: Could not serialize object: RuntimeError: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
My next step (supposing no error had appeared) was to use a reduce step to concatenate all the members (dataframes with the same schema) of that rdd2.
It seems related to this post but I don't understand the answer.
Questions:
I think this means that since my kernel_f calls sc. methods, it is against the rules. Is that right?
I (think I) could use a plain-old Python (not PySpark) map to apply kernel_f to my file_path_list, then use plain-old functools.reduce to concatenate all of these into a single PySpark dataframe (see the sketch after these questions), but that doesn't seem to leverage PySpark much. Does this seem like a good route?
Can you teach me a good, ideally a "tied-for-best" way to do this?
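For reference, here is roughly what I mean in the second question (a sketch only; it reuses the sc and file_path_list defined above, and I have not verified it scales):
from functools import reduce

# read each path on the driver (lazily), then union the dataframes
dfs = [sc.read.options(delimiter=",", header=True).csv(p) for p in file_path_list]
big_df = reduce(lambda a, b: a.unionByName(b), dfs)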
I don't have a definitive answer, just some comments that might help. First off, I think the easiest way to do this is to read the CSVs with a wildcard, as shown here.
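Something along these lines (a minimal sketch; the directory path is illustrative, and it assumes all the CSVs share a schema; sc is the SparkSession from your question):
df = sc.read.options(delimiter=",", header=True).csv("/path/to/csv_dir/*.csv")
# or pass the explicit list of paths instead of a wildcard:
# df = sc.read.options(delimiter=",", header=True).csv(file_path_list)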
A Spark cluster is composed of the scheduler and the workers. You use the SparkSession to pass work to the scheduler. Workers are apparently not allowed to send work to the scheduler, presumably because in many use cases that would be an anti-pattern.
The design pattern is also odd here because you are not actually passing a DataFrame back. Spark operations are lazy, unlike Pandas, so that read does not happen immediately. I suspect that if it worked, it would pass back a DAG, not data.
That route doesn't sound good, because you want the file loading to stay lazy. Given that you can't use Spark to read on a worker, you'd have to use Pandas/plain Python, which evaluate immediately, so you would run out of memory even sooner.
Speaking of memory, Spark lets you perform out-of-core computation, but there are limits to how much larger than the available memory the data can be. You will inevitably run into errors if you fall short of the required memory by a considerable margin.
I think you should use the wildcard as shown above.
So I need to broadcast some content derived from an RDD to all worker nodes, and I am trying to find an efficient way to do it.
More specifically, an RDD is created dynamically in the middle of the execution. To broadcast some of its content to all the worker nodes, an obvious solution would be to traverse its elements one by one, build a list/vector/hashmap holding the needed content along the way, and then broadcast this data structure to the cluster.
This does not seem like a good solution at all: the RDD can be huge and is already distributed, so traversing it and building an array/list from the traversal result would be very slow.
So what would be a better solution, or best practice, for this case? Would it be a good idea to run a SQL query on the RDD (after converting it to a DataFrame) to get the needed content, and then broadcast the query result to all the worker nodes?
Thank you in advance for your help!
The following is added after reading Varslavans' answer:
An RDD is created dynamically and has the following content:
[(1,1), (2,5), (3,5), (4,7), (5,1), (6,3), (7,2), (8,2), (9,3), (10,3), (11,3), ...... ]
So this RDD contains key-value pairs. We want to collect all the pairs whose value is > 3, so pairs (2,5), (3,5), (4,7), ... will be collected. Once we have collected all these pairs, we would like to broadcast them so that all the worker nodes have them.
It sounds like we should use collect() on the RDD and then broadcast... at least that is the best solution at this point.
Thanks again!
First of all, you don't need to traverse the RDD to get all its data. There is an API for that: collect().
Second: broadcast is not the same as distributed.
With broadcast, you have all the data on each node.
With distributed, you have a different part of the whole on each node.
An RDD is distributed by its nature.
Third: to get the needed content you can either use the RDD API or convert the RDD to a DataFrame and use SQL queries; it depends on the data you have. Either way, the result will be an RDD or DataFrame and it will also be distributed, so if you need the data locally, you collect() it.
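For the example pairs in your question, that could look roughly like this (a sketch; the names are illustrative):
# filter with the RDD API, collect the (small) result to the driver, then broadcast it
pairs_rdd = sc.parallelize([(1, 1), (2, 5), (3, 5), (4, 7), (5, 1), (6, 3)])
selected = pairs_rdd.filter(lambda kv: kv[1] > 3).collect()  # local Python list
bd_selected = sc.broadcast(selected)
# workers then read bd_selected.value inside their map/filter functions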
By the way, from your question it's not possible to tell exactly what you want to do, and it looks like you need to read up on Spark basics. That will answer a lot of your questions :)
Is it possible to broadcast an RDD in Python?
I am following the book "Advanced Analytics with Spark: Patterns for Learning from Data at Scale", and in chapter 3 an RDD needs to be broadcast. I'm trying to follow the examples using Python instead of Scala.
Anyway, even with this simple example I get an error:
my_list = ["a", "d", "c", "b"]
my_list_rdd = sc.parallelize(my_list)
sc.broadcast(my_list_rdd)
The error being:
"It appears that you are attempting to broadcast an RDD or reference an RDD from an "
Exception: It appears that you are attempting to broadcast an RDD or reference an RDD from an
action or transformation. RDD transformations and actions can only be invoked by the driver, n
ot inside of other transformations; for example, rdd1.map(lambda x: rdd2.values.count() * x) i
s invalid because the values transformation and count action cannot be performed inside of the
rdd1.map transformation. For more information, see SPARK-5063.
I don't really understand what "action or transformation" the error is referring to.
I am using spark-2.1.1-hadoop2.7.
Important Edit: the book is correct. I just failed to notice that it wasn't an RDD being broadcast, but a map version of it obtained with collectAsMap().
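In other words, something like this (a sketch; pair_rdd stands in for the book's pair RDD):
bc_map = sc.broadcast(pair_rdd.collectAsMap())  # broadcast a plain dict, not the RDD
# tasks then look values up via bc_map.value[key]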
Thanks!
Is it possible to broadcast an RDD in Python?
TL;DR No.
When you think about what an RDD really is, you'll see it's simply not possible. There is nothing in an RDD you could broadcast. It's too fragile (so to speak).
An RDD is a data structure that describes a distributed computation over some dataset. Through an RDD you describe what to compute and how to compute it. It's an abstract entity.
Quoting the scaladoc of RDD:
Represents an immutable, partitioned collection of elements that can be operated on in parallel
Internally, each RDD is characterized by five main properties:
A list of partitions
A function for computing each split
A list of dependencies on other RDDs
Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)
There's not much you could broadcast as (quoting SparkContext.broadcast method's scaladoc):
broadcast[T](value: T)(implicit arg0: ClassTag[T]): Broadcast[T]
Broadcast a read-only variable to the cluster, returning a org.apache.spark.broadcast.Broadcast object for reading it in distributed functions. The variable will be sent to each cluster only once.
You can only broadcast a real value, but an RDD is just a container of values that are only available when executors process its data.
From Broadcast Variables:
Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner.
And later in the same document:
This means that explicitly creating broadcast variables is only useful when tasks across multiple stages need the same data or when caching the data in deserialized form is important.
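The programming guide's own minimal example is essentially:
broadcast_var = sc.broadcast([1, 2, 3])
broadcast_var.value  # [1, 2, 3]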
You could however collect the dataset an RDD holds and broadcast it as follows:
my_list = ["a", "d", "c", "b"]
my_list_rdd = sc.parallelize(my_list)
sc.broadcast(my_list_rdd.collect())  # <-- collect the dataset
At "collect the dataset" step, the dataset leaves an RDD space and becomes a locally-available collection, a Python value, that can be then broadcast.
You cannot broadcast an RDD. You broadcast values to all your executor nodes so they can be used multiple times while processing your RDD. So in your code you should collect your RDD before broadcasting it. collect() converts an RDD into a local Python object, which can be broadcast without issues.
sc.broadcast(my_list_rdd.collect())
When you broadcast a value, the value is serialized and sent over the network to all the executor nodes. Your my_list_rdd is just a reference to an RDD whose data is distributed across multiple nodes; serializing this reference and broadcasting it would mean nothing on a worker node. So you should collect the values of your RDD and broadcast those values instead.
More information on Spark broadcast can be found here.
Note: if your RDD is too large, the application might run into an OutOfMemory error, since the collect method pulls all the data into the driver's memory, which usually isn't large enough.
I am generating RDDs in two different ways on a local machine. The first way is:
rdd = sc.range(0, 100).sortBy(lambda x: x, numPartitions=10)
rdd.collect()
The second way is:
rdd = sc.parallelize(xrange(100), 10)
rdd.collect()
But the Spark UI shows different data locality for the two, and I don't know why. Below is the result from the first way; it shows the Locality Level (the 5th column) is ANY.
And the result from the second way shows the Locality Level is Process_Local:
And I read from https://spark.apache.org/docs/latest/tuning.html that the Process_Local level is usually faster than the Any level for processing.
Is this because the sortBy operation gives rise to a shuffle, which then influences the data locality? Can someone give me a clearer explanation?
You are correct.
In the first snippet you first create a parallelized collection, meaning your driver tells each worker to create some part of the collection. Then, for the sort, each worker node needs access to data on other nodes; the data needs to be shuffled around, and data locality is lost.
The second code snippet is effectively not even a distributed job.
As Spark uses lazy evaluation, nothing is done until you call an action to materialize the results, in this case the collect method. The steps in your second computation are effectively:
Distribute the object of type list from driver to worker nodes
Do nothing on each worker node
Collect distributed objects from workers to create object of type list on driver.
Spark is smart enough to realize that there is no reason to distribute the list even though parallelize is called. Since the data resides and the computation is done on the same single node, data locality is obviously preserved.
EDIT:
Some additional info on how Spark does sort.
Spark operates on the underlying MapReduce model (the programming model, not the Hadoop implementation) and sort is implemented as a single map and a reduce. Conceptually, on each node in the map phase, the part of the collection that a particular node operates on is sorted and written to memory. The reducers then pull relevant data from the mappers, merge the results and create iterators.
So, for your example, let's say you have a mapper that wrote numbers 21-34 to memory in sorted order. Let's say the same node hosts a reducer that is responsible for numbers 31-40. The reducer gets information from the driver about where the relevant data is. The numbers 31-34 are pulled from the same node, so the data only has to travel between threads. The other numbers, however, can be on arbitrary nodes in the cluster and need to be transferred over the network. Once the reducer has pulled all the relevant data from the nodes, the shuffle phase is over. The reducer then merges the results (as in mergesort) and creates an iterator over its sorted part of the collection.
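If you want to see where the numbers end up after the shuffle, a quick illustrative check in local mode is glom(), which returns each partition's contents as a list:
sorted_rdd = sc.range(0, 100).sortBy(lambda x: x, numPartitions=10)
print(sorted_rdd.glom().collect())  # 10 lists, each a contiguous sorted range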
http://blog.cloudera.com/blog/2015/01/improving-sort-performance-in-apache-spark-its-a-double/
I want to know how I can create an RDD on a worker, say one containing a Map. This Map/RDD will be small, and I want the RDD to reside completely on one machine/executor (I guess repartition(1) can achieve this). Further, I want to be able to cache this Map/RDD on the local executor and use it for lookups in tasks running on that executor.
How can I do this?
No, you cannot create an RDD on a worker node. Only the driver can create RDDs.
A broadcast variable seems to be the solution in your situation. It will send the data to all workers, but if your map is small, that won't be an issue.
You cannot control which node your RDD's partition will be placed on, so you cannot just do repartition(1): you don't know whether the RDD will end up on the same node ;) A broadcast variable will be on every node, so lookups will be very fast.
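For example (a sketch; the map contents and names are made up):
# broadcast a small dict and use it for lookups inside tasks
lookup = sc.broadcast({"a": 1, "b": 2, "c": 3})
keys_rdd = sc.parallelize(["a", "c", "z"])
print(keys_rdd.map(lambda k: (k, lookup.value.get(k))).collect())
# [('a', 1), ('c', 3), ('z', None)]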
You can create an RDD in your driver program using sc.parallelize(data). For storing a Map, it can be split into key and value parts and then stored in an RDD/DataFrame as two separate columns.