How to execute some instructions on selected nodes in a cluster? - apache-spark

I don't have any RDD to use; I just want to execute some of my own functions on some nodes of my cluster with Apache Spark. So I don't have any data to distribute, only code (which depends on the node that is executing it).
Is it possible? Is Spark compatible with this goal?

Is it possible?
I think it is possible, and I've been asked about it a few times already (so I've had time to think about it :))
Is Spark compatible with this goal?
The way Spark could handle it is to launch as many executors as the number of nodes you want to use for the distributed work. It's the cluster manager's job to spread the work across a cluster of nodes, so Spark can only use the nodes it is given.
With the nodes assigned, you simply execute a computation on a fake dataset to build an RDD on top of.
If the computation runs on a node that should not be used, you can check the hostname inside the code, see what node you are on, and decide whether to continue or stop.
You could even read the code to execute from a database (I've seen a solution like this already).
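A rough sketch of that approach, assuming one partition per executor and a hostname-prefix rule for deciding which nodes should do the work (both are placeholders for your own setup):

    import java.net.InetAddress
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("node-selective-work").getOrCreate()
    val sc = spark.sparkContext

    val numExecutors = 4                                        // assumption: the number of executors you requested
    val fake = sc.parallelize(1 to numExecutors, numExecutors)  // fake dataset, one partition per executor

    val results = fake.mapPartitions { _ =>
      val host = InetAddress.getLocalHost.getHostName
      if (host.startsWith("worker-"))                     // assumption: pick nodes by hostname prefix
        Iterator(s"ran the node-specific code on $host")  // placeholder for your real per-node work
      else
        Iterator.empty                                    // this node should not be used, so do nothing
    }.collect()

    results.foreach(println)

Note that Spark does not strictly guarantee one task per executor; this only works well when the partition count matches the executor count and each executor has a single task slot.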

Related

How are the task results being processed on Spark?

I am new to Spark and I am currently trying to understand the architecture of Spark.
As far as I know, the spark cluster manager assigns tasks to worker nodes and sends them partitions of the data. Once there, each worker node performs the transformations (like mapping etc.) on its own specific partition of the data.
What I don't understand is: where do all the results of these transformations from the various workers go? Are they sent back to the cluster manager / driver and reduced there (e.g. the sum of the values for each unique key)? If yes, is there a specific way this happens?
It would be nice if someone were able to enlighten me; neither the Spark docs nor other resources concerning the architecture have been able to do so.
Good question. I think you are asking how a shuffle works...
Here is a good explanation.
When does shuffling occur in Apache Spark?
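As a rough illustration with a toy word count (local master just for the example):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").appName("shuffle-sketch").getOrCreate()
    val sc = spark.sparkContext

    val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"), 3)

    // map results stay on the executors, inside each partition
    val pairs = words.map(w => (w, 1))

    // reduceByKey triggers a shuffle: partial sums per key are combined
    // executor-to-executor, without going through the driver
    val counts = pairs.reduceByKey(_ + _)

    // only an action such as collect() ships the final result back to the driver
    counts.collect().foreach(println)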

run a function on all the spark executors to get licence key

Before beginning to scan a database, all the executors need a licence file. I have a function licenceFileInstaller.install() that downloads the file to the right location on the system, which works fine on the master node. How do I run this on all the executors?
I guess that when you use this method inside an RDD transformation, Spark serializes it and sends it across all the nodes. For example, you can use it within mapPartitions and it will be distributed across all the nodes and run once for every partition:

    rdd.mapPartitions { rows =>
      licenceFileInstaller.install()
      // ...further database code...
      rows
    }

Also, I guess you can add that licence file using sc.addFile(), which will distribute it across all the nodes.
You can refer to Spark closures for further understanding.
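A rough sketch of the sc.addFile() route (the path and file name are placeholders; SparkFiles.get() returns the local copy on each executor):

    import org.apache.spark.SparkFiles
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("licence-distribution").getOrCreate()
    val sc = spark.sparkContext

    // ship a licence file that already exists on the driver to every node
    sc.addFile("/path/on/driver/licence.key")

    val rdd = sc.parallelize(1 to 100, 10)
    rdd.foreachPartition { _ =>
      // where the file landed on this executor; point the scanning code at it
      val localPath = SparkFiles.get("licence.key")
      println(s"licence available at $localPath")
    }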

Spark code is taking a long time to return query. Help on speeding this up

I am currently running some Spark code and I need to query a data frame that is taking a long time (over 1 hour) per query. I need to query multiple times to check if the data frame is in fact correct.
I am relatively new to Spark and I understand that Spark uses lazy evaluation which means that the commands are executed only once I do a call for some action (in my case .show()).
Is there a way to do this process once for the whole DF and then quickly call on the data?
Currently I am saving the DF as a temporary table and then running queries in beeline (Hive). This seems a little bit overkill, as I have to save the table in a database first, which seems like a waste of time.
I have looked into the functions .persist and .collect, but I am confused about how to use them and how to query the cached data.
I would really like to learn the correct way of doing this.
Many thanks for the help in advance!!
Yes, you can keep your RDD in memory using rddName.cache() (or persist()). More information about RDD persistence can be found here.
Using a temporary table (registerTempTable (Spark 1.6) or createOrReplaceTempView (Spark 2.x)) does not "save" any data. It only creates a view with the lifetime of your Spark session. If you wish to save the table, you should use .saveAsTable, but I assume that this is not what you are looking for.
Using .cache is equivalent to .persist(StorageLevel.MEMORY_ONLY). If your table is large and thus can't fit in memory, you should use .persist(StorageLevel.MEMORY_AND_DISK).
Also, it is possible that you simply need more nodes in your cluster. In case you are running locally, make sure you deploy with --master local[*] to use all available cores on your machine. If you are running on a standalone cluster or with a cluster manager like YARN or Mesos, you should make sure that all necessary/available resources are assigned to your job.
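A minimal sketch of the caching workflow, with a toy DataFrame standing in for the real one:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    val spark = SparkSession.builder().appName("cache-sketch").getOrCreate()
    import spark.implicits._

    val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "value")  // stand-in for your DF

    df.persist(StorageLevel.MEMORY_AND_DISK)  // spills to disk if it does not fit in memory
    df.count()                                // the first action pays the full computation cost

    // later queries read from the cache instead of recomputing the whole lineage
    df.filter($"id" > 1).show()
    df.groupBy($"value").count().show()

    df.unpersist()                            // release the cache when you are done checking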

Cache RDD and add executer

Scenario:
start a cluster with 1 node
start the application with spark.executor.instances=10
create an RDD, repartition it to 100 parts, and cache it
add 9 more nodes to the cluster
You will see that all 10 executors have been accepted into the application, but only the first node actually does any work.
Does anybody know if this is the way it was intended to work?
Yes, this is the way it is intended to work.
Partitions of an RDD are cached on a specific node.
The DAG, and as a result the inputs, outputs and number of partitions, won't change unless everything is recomputed, which brings us back to the previous point.
So when you create an RDD checkpoint that needs caching, in this case you have to call rdd.unpersist() before using it again.
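A rough sketch of that step, using a toy RDD in place of the one from the scenario:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("recache-sketch").getOrCreate()
    val sc = spark.sparkContext

    val rdd = sc.parallelize(1 to 1000000).repartition(100).cache()
    rdd.count()                     // partitions are cached on whichever executors exist right now

    // ...9 more nodes join the cluster...

    rdd.unpersist(blocking = true)  // drop the blocks pinned on the original node
    rdd.cache()
    rdd.count()                     // recomputed; the partitions can now land on all executors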

How does mllib code run on spark?

I am new to distributed computing, and I'm trying to run Kmeans on EC2 using Spark's mllib kmeans. As I was reading through the tutorial I found the following code snippet on
http://spark.apache.org/docs/latest/mllib-clustering.html#k-means
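For reference, the snippet there is (approximately) the standard RDD-based k-means example from that page:

    import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
    import org.apache.spark.mllib.linalg.Vectors

    // Load and parse the data
    val data = sc.textFile("data/mllib/kmeans_data.txt")
    val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()

    // Cluster the data into two classes using KMeans
    val numClusters = 2
    val numIterations = 20
    val clusters = KMeans.train(parsedData, numClusters, numIterations)

    // Evaluate clustering by computing Within Set Sum of Squared Errors
    val WSSSE = clusters.computeCost(parsedData)
    println(s"Within Set Sum of Squared Errors = $WSSSE")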
I am having trouble understanding how this code runs inside the cluster. Specifically, I'm having trouble understanding the following:
After submitting the code to the master node, how does Spark know how to parallelize the job? There seems to be no part of the code that deals with this.
Is the code copied to all nodes and executed on each node? Does the master node do any computation?
How do nodes communicate the partial results of each iteration? Is this dealt with inside the kmeans.train code, or does Spark core take care of it automatically?
Spark divides data into many partitions. For example, if you read a file from HDFS, the partitions will match the partitioning of the data in HDFS (roughly one partition per block). You can manually specify the number of partitions with repartition(numberOfPartitions). Each partition can be processed on a separate node, thread, etc. Sometimes data is partitioned by, e.g., a HashPartitioner, which looks at the hash of the data.
The number and size of the partitions generally tell you whether the data is distributed/parallelized correctly. How the partitions are created is hidden in the RDD.getPartitions methods.
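For example (assuming an HDFS path that exists in your cluster):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("partition-sketch").getOrCreate()
    val sc = spark.sparkContext

    val fromHdfs = sc.textFile("hdfs:///data/points.txt")  // assumption: replace with a real path
    println(fromHdfs.getNumPartitions)                     // follows the HDFS block layout of the file

    val repartitioned = fromHdfs.repartition(200)
    println(repartitioned.getNumPartitions)                // now 200, spread across the executors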
Resource scheduling depends on the cluster manager. One could write a very long post about them ;) I think that in this question, the partitioning is the most important part. If not, please let me know and I will edit the answer.
Spark serializes the closures that are given as arguments to transformations and actions. Spark creates a DAG, which is turned into tasks that are sent to the executors, and the executors run those closures on each partition of the data.
Currently, after each iteration the data is returned to the driver and then the next job is scheduled. In the Drizzle project, AMPLab/RISELab is adding the possibility of scheduling multiple jobs at one time, so the data won't be sent back to the driver: the DAG is created once and, e.g., a job with 10 iterations is scheduled, and the shuffle between iterations will be limited or will not exist at all. Currently, the DAG is created in each iteration and a job is scheduled to the executors.
There is a very helpful presentation about resource scheduling in Spark and Spark Drizzle.
