Suppose I want to run two spark-submit jobs, and to do that I currently have to run them in two different terminals. What I want now is to run those two jobs using a single spark-submit command. Is it possible? And if it is, how can I achieve that?
I'm very new to PySpark.
I am running a script (mainly creating a TF-IDF and predicting 9 categorical columns with it) in a Jupyter notebook. It takes about 5 minutes when I manually execute all the cells. When I run the same script with spark-submit it takes about 45 minutes. What is happening?
The same excess time also occurs if I run the code with python from the terminal.
I am also setting the configuration in the script as
conf = SparkConf().set('spark.executor.memory', '45G').set('spark.driver.memory', '80G').set('spark.driver.maxResultSize', '20G')
Any help is appreciated. Thanks in advance.
There are various ways to run your Spark code, and you have mentioned a few of them: a notebook, the pyspark shell, and spark-submit.
Regarding the Jupyter notebook or pyspark shell:
When you run your code in a Jupyter notebook or the pyspark shell, it may have picked up default values for executor memory, driver memory, executor cores, etc.
Regarding spark-submit:
However, when you use spark-submit, these defaults can be different. So the best approach is to pass these values explicitly as flags when submitting the PySpark application with the spark-submit utility.
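For example, assuming the script is saved as a file such as my_job.py (the file name is just illustrative), the values from the question could be passed on the command line like this:
spark-submit --executor-memory 45G --driver-memory 80G --conf spark.driver.maxResultSize=20G my_job.py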
The configuration object you have created can also be passed when creating the SparkContext (sc):
sc = SparkContext(conf=conf)
Hope this helps.
Regards,
Neeraj
I had the same problem, but to initialize my spark variable I was using this line:
spark = SparkSession.builder.master("local[1]").appName("Test").getOrCreate()
The problem is that "local[X]" tells Spark to do the operations on the local machine using X cores, so you should tune X to the number of cores available on your machine.
To use it with a YARN cluster, you have to put "yarn" as the master.
There are many other possibilities listed here: https://spark.apache.org/docs/latest/submitting-applications.html
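As a rough sketch of both variants (the app name is the one from the question; everything else is an assumption about your environment):
from pyspark.sql import SparkSession
# use every core available on the local machine instead of a single one
spark = SparkSession.builder.master("local[*]").appName("Test").getOrCreate()
# or, on a YARN cluster (assuming HADOOP_CONF_DIR points at your cluster configuration):
# spark = SparkSession.builder.master("yarn").appName("Test").getOrCreate()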
I have a PySpark code/application. What is the best way to run it (to utilize the maximum power of PySpark): using the python interpreter or using spark-submit?
The SO answer here was almost similar but did not explain it in great detail. Would love to know why.
Any help is appreciated. Thanks in advance.
I am assuming that when you say python interpreter you are referring to the pyspark shell.
You can run your Spark code in several ways: with the pyspark interpreter, with spark-submit, or even from one of the available notebooks (Jupyter/Zeppelin).
When to use the pyspark interpreter:
Generally, we use the pyspark interpreter when we are learning or doing very basic operations for understanding or exploration purposes.
Spark-submit:
This is usually used when you have written your entire application in PySpark and packaged it into .py files, so that you can submit your whole code to the Spark cluster for execution.
A little analogy may help here. Take Unix shell commands as an example: we can execute shell commands directly at the prompt, or we can put a bunch of them into a shell script (.sh) and execute them at once. You can think of the pyspark interpreter and the spark-submit utility in the same way: in the pyspark interpreter you execute individual commands, whereas with spark-submit you package your Spark application into .py files and execute it as a whole.
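As a minimal sketch of such a packaged application (the file name, paths and app name are illustrative, not from the question):
# my_job.py -- run with: spark-submit my_job.py
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName("MyBatchJob").getOrCreate()
    df = spark.read.csv("input/data.csv", header=True)   # read the input data
    df.write.mode("overwrite").parquet("output/data")     # write the results as Parquet
    spark.stop()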
Hope this helps.
Regards,
Neeraj
Running your job in the pyspark shell will always use client mode, whereas with spark-submit you can execute it in either mode, i.e. client or cluster.
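In client mode the driver runs on the machine you submit from, while in cluster mode it runs inside the cluster. For example (the script name is illustrative):
spark-submit --master yarn --deploy-mode client my_job.py
spark-submit --master yarn --deploy-mode cluster my_job.py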
I am new to spark and oozie technologies.
I am trying to get a few variables from Spark and use them in the next Oozie action.
In "Decision" node spark submit will be called and few processing is done and a counter variable is generated
Eg: var counter = 8 from spark
So now I need to use this variable in next oozie action which is "take decision"
node.
take decision
[Decision ][counter]
When I googled I was able to find a few solutions:
1. Write to HDFS
2. Wrap spark-submit in a shell action and use <capture-output>
(I am not able to use option 2 because I use the Oozie spark action node.)
Any other ways to do the same?
The best approach is to store the values in either HDFS (Hive) or HBase/Cassandra and have your decision action read those values.
If you wrap spark-submit in a shell action, there will be a problem when you submit the job in cluster mode, because spark-submit hands the job to the YARN cluster and the driver can run on any node, so you cannot capture the output.
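As a rough sketch of the HDFS approach (the path and variable name are illustrative):
# inside the Spark action: write the counter to a well-known HDFS location
sc.parallelize([str(counter)]).coalesce(1).saveAsTextFile("hdfs:///tmp/workflow/counter")
# the next action in the workflow can then read hdfs:///tmp/workflow/counter and branch on its value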
We are running a batch process using Spark, and we use spark-submit to submit our jobs with the options:
--deploy-mode cluster \
--master yarn-cluster \
We basically take CSV files, do some processing on them, and create Parquet files from them. We run multiple files in the same spark-submit command using a config file. Now let's say we are processing 10 files and the process fails on, say, file 6: Spark tries to re-run the whole application, processes all the files up to file 6 again, and writes duplicate records for the first 5 files before failing again. Since we are creating Parquet files we don't have control over how Spark names the output files, but it always creates unique names.
Is there a Spark property I can change so that a failed application is not re-executed?
The property spark.yarn.maxAppAttempts worked in my case. I set its value to 1 in my spark-submit command, like below:
--conf "spark.yarn.maxAppAttempts=1"
I need to run lots of jobs (a pipeline) on a Condor cluster, but the whole pipeline has to run on a single node. So I need to do two things:
How do I ask Condor for an available node?
How do I tell Condor to run a job on that node?
I imagine this is very simple, but I'm deep in the docs with no luck.
Simply set a job requirement so that the job runs on a specific node:
requirements = $(requirements) && (TARGET.Machine == "somenode")
Selecting that node is up to you. If you use a DAG, you can have a "node selection" job and then rewrite the submit files, as I outline here: https://stackoverflow.com/a/27590992/174430
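A minimal submit-file sketch along those lines (the executable name and hostname are illustrative):
# run the whole pipeline script on one specific machine
universe     = vanilla
executable   = run_pipeline.sh
requirements = (TARGET.Machine == "somenode.example.com")
log          = pipeline.log
queue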