Running Spark on local machine with master = local[*] and invoking .collect method - apache-spark

I need some help in understanding this documentation on Spark website:
Another common idiom is attempting to print out the elements of an RDD using rdd.foreach(println) or rdd.map(println). [1st category] On a single machine, this will generate the expected output and print all the RDD’s elements. [2nd category] However, in cluster mode, the output to stdout being called by the executors is now writing to the executor’s stdout instead...
I running spark locally (with local[*] inside Eclipse IDE) that connects to staging Cassandra (which is running on multiple nodes) falls in the first category or second?
Any help is appreciated.

You're not submitting your code to a cluster, therefore your code is the first category

Related

run a function on all the spark executors to get licence key

before beginning to scan a database, all the executors need a licence file. I have a function licenceFileInstaller.install() that downloads the file at the right location on the system, which works fine on the master node. How do I run this on all the executors ?
I guess when you use this method in rdd transformations, it should serialize this method and send it across all the nodes. Like you can use this method within mapPartitions and it will be distributed across all the nodes and run once for every partition.
rdd.mapPartitions( x => ( licenceFileInstaller.install() ...further database code ))
Also, I guess you can add that license file using sc.addFile() which will be distributed across all the nodes.
You can refer to Spark closures for further understanding

Notebook vs spark-submit

I'm very new to PySpark.
I am running a script (mainly creating a tfidf and predicting 9 categorical columns with it) in Jupyter Notebook. It is taking some 5 mins when manually executing all cells. When running the same script from spark-submit it is taking some 45 mins. What is happening?
Also the same thing happens (the excess time) if I run the code using python from terminal.
I am also setting the configuration in the script as
conf = SparkConf().set('spark.executor.memory', '45G').set('spark.driver.memory', '80G').set('spark.driver.maxResultSize', '20G')
Any help is appreciated. Thanks in advance.
There are various ways to run your Spark code like you have mentioned few Notebook, Pyspark and Spark-submit.
Regarding Jupyter Notebook or pyspark shell.
While you are running your code in Jupyter notebook or pyspark shell it might have set some default values for executor memory, driver memory, executor cores etc.
Regarding spark-submit.
However, when you use Spark-submit these values could be different by default. So the best way would be to pass these values as flags while submitting the pyspark application using "spark-submit" utility.
Regarding the configuration object which you have created can pe be passes while creating the Spark Context (sc).
sc = SparkContext(conf=conf)
Hope this helps.
Regards,
Neeraj
I had the same problem, but to initialize my spark variable I was using this line :
spark = SparkSession.builder.master("local[1]").appName("Test").getOrCreate()
The problem is that "local[X]", is equivalent to say that spark will do the operations on the local machine, on X cores. So you have to optimize X with the number of cores available on your machine.
To use it with a yarn cluster, you have to put "yarn".
There is many others possibilities listed here : https://spark.apache.org/docs/latest/submitting-applications.html

Apache Spark Correlation only runs on driver

I am new to Spark and learn that transformations happen on workers and action on the driver but the intermediate action can happen(if the operation is commutative and associative) at the workers also which gives the actual parallelism.
I looked into the correlation and covariance code: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/PearsonCorrelation.scala
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala
How could I find what part of the correlation has happened at the driver and what at executor?
Update 1: The setup I'm talking about to run the correlation is the cluster setup consisting of multiple VM's.
Look here for the images from the SparK web UI: Distributed cross correlation matrix computation
Update 2
I setup my cluster in standalone mode like It was a 3 Node cluster, 1 master/driver(actual machine: workstation) and 2 VM slaves/executor.
submitting the job like this
./bin/spark-submit --master spark://192.168.0.11:7077 examples/src/main/python/mllib/correlations_example.py
from master node
My correlation sample file is correlations_example.py:
data = sc.parallelize(np.array([range(10000000), range(10000000, 20000000),range(20000000, 30000000)]).transpose())
print(Statistics.corr(data, method="pearson"))
sc.stop()
I always get a sequential timeline as :
Doesn't it mean that it not happening in parallel based on timeline of events ? Am I doing something wrong with the job submission or correlation computation in Spark is not parallel?
Update 3:
I tried even adding another executor, still the same seqquential treeAggreagate.
I set the spark cluster as mentioned here:
http://paxcel.net/blog/how-to-setup-apache-spark-standalone-cluster-on-multiple-machine/
Your statement is not entirely accurate. The container[executor] for the driver is launched on the client/edge node or on the cluster, depending on the spark submit mode e.g. client or yarn. The actions are executed by the workers and the results are sent back to the driver (e.g. collect)
This has been answered already. See link below for more details.
When does an action not run on the driver in Apache Spark?

How to execute some instructions on selected nodes in a cluster?

I don't have any RDD to use, I just want to execute some of my own functions on some nodes of my cluster, with Apache Spark. So I don't have any data to distribute, but only code (which depends on the node that is executing it).
Is it possible ? Is Spark compatible with this goal ?
Is it possible?
I think it is possible and I've been asked about it few times already (so had time to think about it :))
Is Spark compatible with this goal?
The way Spark could handle it is to launch as many executors as you want to use nodes for the distributed work. That's the job of a cluster manager to spread the work across a cluster of nodes and so Spark can only use what nodes are given.
With the nodes assigned you simply execute a computation on fake dataset to build a RDD on top of.
If the computation runs on a node that should not be used, you can hostname inside the code and see what node you are on and decide on whether to continue or stop.
You could even read the code to execute from a database (seen a solution like this already).

Where does spark look for text files?

I thought that loading text files is done only from workers / within the cluster (you just need to make sure all workers have access to the same path, either by having that text file available on all nodes, or by use some shared folder mapped to the same path)
e.g. spark-submit / spark-shell can be launched from anywhere, and connect to a spark master, and the machine where you launched spark-submit / spark-shell (which is also where our driver runs, unless you are in "cluster" deploy mode) has nothing to do with the cluster. Therefore any data loading should be done only from the workers, not on the driver machine, right? e.g. there should be no way that sc.textFile("file:///somePath") will cause spark to look for a file on the driver machine (again, the driver is external to the cluster, e.g. in "client" deploy mode / standalone mode), right?
Well, this is what I thought too...
Our cast
machine A: where the driver runs
machine B: where both spark master and one of the workers run
Act I - The Hope
When I start a spark-shell from machine B to spark master on B I get this:
scala> sc.master
res3: String = spark://machinB:7077
scala> sc.textFile("/tmp/data/myfile.csv").count()
res4: Long = 976
Act II - The Conflict
But when I start a spark-shell from machine A, pointing to spark master on B I get this:
scala> sc.master
res2: String = spark://machineB:7077
scala> sc.textFile("/tmp/data/myfile.csv").count()
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/tmp/data/myfile.csv
And indeed /tmp/data/myfile.csv does not exist on machine A, but machine A is not on the cluster, it's just where the driver runs
Act III - The Amazement
What’s even weirder is that if I make this file available on machine A, it doesn’t throw this error anymore. (Instead it creates a job, but no tasks, and just fails due to a timeout, which is another issue that deserves a separate question)
Is there something in the way that Spark behaves that I’m missing? I thought that spark shell when connected to a remote, has nothing to do with the machine you are running on. So why does the error stops when I put that file available on machine A? It means that the location of sc.textFile includes the location of where spark-shell or spark-submit were initiated (in my case also where the driver runs)? This makes zero sense to me. but again, I'm open to learn new things.
Epilogue
tl;dr - sc.textFile("file:/somePath") running form a driver on machine A to a cluster on machines B,C,D... (driver not part of cluster)
It seems like it's looking for path file:/somePath also on the driver, is that true (or is it just me)? is that known? is that as designed?
I have a feeling that this is some weird network / VPN topology issue unique to my workplace network, but still this is what happens to me, and I'm utterly confused whether it is just me or a known behavior. (or I'm simply not getting how Spark works, which is always an option)
So the really short version of it the answer is, if you reference "file://..." it should be accessible on all nodes in your cluster including the dirver program. Sometimes some bits of work happen on the worker. Generally the way around this is just not using local files, and instead using something like S3, HDFS, or another network filesystem. There is the sc.addFile method which can be used to distribute a file from the driver to all of the other nodes (and then you use SparkFiles.get to resolve the download location).
Spark can look for files both locally or on HDFS.
If you'd like to read in a file using sc.textFile() and take advantage of its RDD format, then the file should sit on HDFS. If you just want to read in a file the normal way, it is the same as you do depending on the API (Scala, Java, Python).
If you submit a local file with your driver, then addFile() distributes the file to each node and SparkFiles.get() downloads the file to a local temporary file.

Resources