PySpark distributed processing on a YARN cluster - apache-spark

I have Spark running on a Cloudera CDH5.3 cluster, using YARN as the resource manager. I am developing Spark apps in Python (PySpark).
I can submit jobs and they run successfully; however, they never seem to run on more than one machine (the local machine I submit from).
I have tried a variety of options, like setting --deploy-mode to cluster and --master to yarn-client and yarn-cluster, yet it never seems to run on more than one server.
I can get it to run on more than one core by passing something like --master local[8], but that obviously doesn't distribute the processing over multiple nodes.
I have a very simple Python script processing data from HDFS, like so:
import simplejson as json
from pyspark import SparkContext
sc = SparkContext("", "Joe Counter")
rrd = sc.textFile("hdfs:///tmp/twitter/json/data/")
data = rrd.map(lambda line: json.loads(line))
joes = data.filter(lambda tweet: "Joe" in tweet.get("text",""))
print joes.count()
And I am running a submit command like:
spark-submit atest.py --deploy-mode client --master yarn-client
What can I do to ensure the job runs in parallel across the cluster?

Can you swap the arguments for the command?
spark-submit --deploy-mode client --master yarn-client atest.py
If you see the help text for the command:
spark-submit
Usage: spark-submit [options] <app jar | python file>

I believe @MrChristine is correct -- the option flags you specify are being passed to your Python script, not to spark-submit. In addition, you'll want to specify --executor-cores and --num-executors, since by default it will run on a single core and use two executors.
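For example (a sketch; the executor counts are illustrative rather than taken from the original post), the corrected invocation would put all flags before the script:
spark-submit --master yarn-client --num-executors 4 --executor-cores 2 atest.py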

It's not true that a Python script doesn't run in cluster mode. I am not sure about previous versions, but this executes on Spark 2.2 on a Hortonworks cluster.
Command: spark-submit --master yarn --num-executors 10 --executor-cores 1 --driver-memory 5g /pyspark-example.py
Python code:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
# Run against YARN; executor count, cores and memory come from spark-submit.
conf = (SparkConf()
        .setMaster("yarn")
        .setAppName("retrieve data"))
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
# Read Parquet from HDFS, register a temp view, query it, and append the result as CSV.
parquetFile = sqlContext.read.parquet("/<hdfs-path>/*.parquet")
parquetFile.createOrReplaceTempView("temp")
df1 = sqlContext.sql("select * from temp limit 5")
df1.show()
df1.write.save('/<hdfs-path>/test.csv', format='csv', mode='append')
sc.stop()
Output: it's big, so I am not pasting it, but it runs perfectly.

It seems that PySpark does not run in distributed mode using Spark/YARN - you need to use stand-alone Spark with a Spark Master server. In that case, my PySpark script ran very well across the cluster with a Python process per core/node.

Related

spark-submit works for yarn-cluster mode but SparkLauncher doesn't, with same params

I'm able to submit a Spark job through spark-submit; however, when I try to do the same programmatically using SparkLauncher, it gives me nothing (I don't even see a Spark job on the UI).
Below is the scenario:
I have a server (say hostname: cr-hdbc101.dev.local:7123) which hosts the HDFS cluster. I push a fat jar to the server, which I'm trying to exec.
The following spark-submit works as expected, and a Spark job is submitted in yarn-cluster mode:
spark-submit \
--verbose \
--class com.digital.StartSparkJob \
--master yarn \
--deploy-mode cluster \
--num-executors 2 \
--driver-memory 2g \
--executor-memory 3g \
--executor-cores 4 \
/usr/share/Deployments/Consolidateservice.jar "<arg_to_main>"
However, the following piece of SparkLauncher code doesn't work:
val sparkLauncher = new SparkLauncher()
sparkLauncher
.setSparkHome("/opt/cloudera/parcels/CDH-5.7.2-1.cdh5.7.2.p0.18/lib/spark")
.setAppResource("/usr/share/Deployments/Consolidateservice.jar")
.setMaster("yarn-cluster")
.setVerbose(true)
.setMainClass("com.digital.StartSparkJob")
.setDeployMode("cluster")
.setConf("spark.driver.cores", "2")
.setConf("spark.driver.memory", "2g")
.setConf("spark.executor.cores", "4")
.setConf("spark.executor.memory", "3g")
.addAppArgs(<arg_to_main>)
.startApplication()
I thought maybe SparkLauncher was not getting the correct env variables to work with, so I sent the following to SparkLauncher, but to no avail (basically I pass everything in spark-env.sh to SparkLauncher):
val env: java.util.Map[String, String] = new java.util.HashMap[String, String]
env.put("SPARK_CONF_DIR", "/etc/spark/conf.cloudera.spark_on_yarn")
env.put("HADOOP_HOME", "/opt/cloudera/parcels/CDH-5.7.2-1.cdh5.7.2.p0.18/lib/hadoop")
env.put("YARN_CONF_DIR", "/etc/spark/conf.cloudera.spark_on_yarn/yarn-conf")
env.put("SPARK_LIBRARY_PATH", "/opt/cloudera/parcels/CDH-5.7.2-1.cdh5.7.2.p0.18/lib/spark/lib")
env.put("SCALA_LIBRARY_PATH", "/opt/cloudera/parcels/CDH-5.7.2-1.cdh5.7.2.p0.18/lib/spark/lib")
env.put("LD_LIBRARY_PATH", "/opt/cloudera/parcels/CDH-5.7.2-1.cdh5.7.2.p0.18/lib/hadoop/lib/native")
env.put("SPARK_DIST_CLASSPATH", "/etc/spark/conf.cloudera.spark_on_yarn/classpath.txt")
val sparkLauncher = new SparkLauncher(env)
sparkLauncher
.setSparkHome("/opt/cloudera/parcels/CDH-5.7.2-1.cdh5.7.2.p0.18/lib/spark")...
What adds to the frustration is that when I use the same SparkLauncher code for yarn-client mode, it works perfectly fine.
Can someone please point out what I am missing? I just feel I'm staring at the issue without recognizing it.
NOTE: Both the main class (com.digital.StartSparkJob) and the SparkLauncher code are part of the fat jar I'm pushing to the server. I just call the SparkLauncher code with an external API, which in turn should open a driver JVM on the cluster.
Spark version: 1.6.0, Scala version: 2.10.5
I wasn't even getting logs on the Spark UI; the Spark app wasn't even running. Therefore I ran the SparkLauncher as a process (using .launch().waitFor()) so that I could capture the error logs.
I captured the logs using .getInputStream and .getErrorStream and found out that the user being passed to the cluster was wrong. My cluster only works for the user "abcd".
I did set System.setProperty("HADOOP_USER_NAME", "abcd"), as well as adding "spark.yarn.appMasterEnv.HADOOP_USER_NAME=abcd" to spark-defaults.conf, before launching SparkLauncher. However, it looks like they don't get ported over to the cluster.
I therefore passed HADOOP_USER_NAME as a childArg to the SparkLauncher:
val env: java.util.Map[String, String] = new java.util.HashMap[String, String]
env.put("SPARK_CONF_DIR", "/etc/spark/conf.cloudera.spark_on_yarn")
env.put("YARN_CONF_DIR", "/etc/spark/conf.cloudera.spark_on_yarn/yarn-conf")
env.put("HADOOP_USER_NAME", "abcd")
try {
val sparkLauncher = new SparkLauncher(env)...

Python job submission to Spark from a remote machine

I have a Python script with PySpark code on my local system. I am trying to submit a PySpark job from my local machine to a remote Spark cluster.
Please let me know how it can be done.
Do I need Spark installed locally to submit a Spark job?
You need to have the Spark master URL in the Spark conf, like below:
SparkSession spark = SparkSession.builder().appName("CDX JSON Merge Job").master("spark://ip-address:7077")
.getOrCreate();
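Since the question is about a PySpark script, a minimal Python equivalent might look like the sketch below (the master URL is a placeholder for your actual cluster address):
from pyspark.sql import SparkSession
# Sketch: point the session at the remote Spark master (placeholder URL).
spark = (SparkSession.builder
         .appName("CDX JSON Merge Job")
         .master("spark://ip-address:7077")
         .getOrCreate())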
You have to install the Spark client on your local host and then execute the jar using spark-submit:
spark-submit --num-executors 50 --executor-memory 4G --executor-cores 4 --master spark://ip-address:7077 --deploy-mode client --class fully-qualified-class-name artifact.jar
You can also have the master as YARN if you are running Spark on YARN and deploy-mode as cluster.
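For instance, the equivalent YARN submission (a sketch based on the command above; adjust resources to your cluster) would be:
spark-submit --num-executors 50 --executor-memory 4G --executor-cores 4 --master yarn --deploy-mode cluster --class fully-qualified-class-name artifact.jar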

"A master URL must be set in your configuration" gives a lot of confusion

I have compiled my Spark Scala code in Eclipse.
I am trying to run my jar on EMR (5.9.0, Spark 2.2.0) using spark-submit.
But when I run it I get an error:
Details: Exception in thread "main" org.apache.spark.SparkException: A master URL must be set in your configuration
After reading lots of Stack Overflow solutions I got confused and did not find a correct explanation of how and why to set the app master.
This is how I run my jar. I have tried all of the options below:
spark-submit --class financialLineItem.FinancialLineItem s3://trfsmallfffile/AJAR/SparkJob-0.1-jar-with-dependencies.jar
spark-submit --master yarn --class financialLineItem.FinancialLineItem s3://trfsmallfffile/AJAR/SparkJob-0.1-jar-with-dependencies.jar
spark-submit --master yarn-client --class financialLineItem.FinancialLineItem s3://trfsmallfffile/AJAR/SparkJob-0.1-jar-with-dependencies.jar
spark-submit --deploy-mode cluster --master yarn --class financialLineItem.FinancialLineItem s3://trfsmallfffile/AJAR/SparkJob-0.1-jar-with-dependencies.jar
spark-submit --deploy-mode cluster --master yarn-client --class financialLineItem.FinancialLineItem s3://trfsmallfffile/AJAR/SparkJob-0.1-jar-with-dependencies.jar
spark-submit --master local[*] --deploy-mode cluster --class financialLineItem.FinancialLineItem s3://trfsmallfffile/AJAR/SparkJob-0.1-jar-with-dependencies.jar
spark-submit --master local[1] --deploy-mode cluster --class financialLineItem.FinancialLineItem s3://trfsmallfffile/AJAR/SparkJob-0.1-jar-with-dependencies.jar
spark-submit --master local[2] --deploy-mode cluster --class financialLineItem.FinancialLineItem s3://trfsmallfffile/AJAR/SparkJob-0.1-jar-with-dependencies.jar
spark-submit --master local[3] --deploy-mode cluster --class financialLineItem.FinancialLineItem s3://trfsmallfffile/AJAR/SparkJob-0.1-jar-with-dependencies.jar
spark-submit --master local[4] --deploy-mode cluster --class financialLineItem.FinancialLineItem s3://trfsmallfffile/AJAR/SparkJob-0.1-jar-with-dependencies.jar
spark-submit --master local[5] --deploy-mode cluster --class financialLineItem.FinancialLineItem s3://trfsmallfffile/AJAR/SparkJob-0.1-jar-with-dependencies.jar
I am not setting any app master in my Scala code:
package financialLineItem
import org.apache.spark.SparkConf
import org.apache.spark._
import org.apache.spark.sql.SQLContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql._
import org.apache.spark.sql.functions.rank
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import org.apache.spark.{ SparkConf, SparkContext }
import java.sql.{ Date, Timestamp }
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.input_file_name
import org.apache.spark.sql.functions.regexp_extract
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.expressions._
object FinancialLineItem {
def main(args: Array[String]) {
println("Enterin In to Spark Mode ")
val conf = new SparkConf().setAppName("FinanicalLineItem");
println("After conf")
val sc = new SparkContext(conf); //Creating spark context
println("After SC")
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
val get_cus_val = sqlContext.udf.register("get_cus_val", (filePath: String) => filePath.split("\\.")(3))
val rdd = sc.textFile("s3://path/FinancialLineItem/MAIN")
val header = rdd.filter(_.contains("LineItem.organizationId")).map(line => line.split("\\|\\^\\|")).first()
val schema = StructType(header.map(cols => StructField(cols.replace(".", "_"), StringType)).toSeq)
val data = sqlContext.createDataFrame(rdd.filter(!_.contains("LineItem.organizationId")).map(line => Row.fromSeq(line.split("\\|\\^\\|").toSeq)), schema)
val schemaHeader = StructType(header.map(cols => StructField(cols.replace(".", "."), StringType)).toSeq)
val dataHeader = sqlContext.createDataFrame(rdd.filter(!_.contains("LineItem.organizationId")).map(line => Row.fromSeq(line.split("\\|\\^\\|").toSeq)), schemaHeader)
val df1resultFinal = data.withColumn("DataPartition", get_cus_val(input_file_name))
val rdd1 = sc.textFile("s3://path/FinancialLineItem/INCR")
val header1 = rdd1.filter(_.contains("LineItem.organizationId")).map(line => line.split("\\|\\^\\|")).first()
val schema1 = StructType(header1.map(cols => StructField(cols.replace(".", "_"), StringType)).toSeq)
val data1 = sqlContext.createDataFrame(rdd1.filter(!_.contains("LineItem.organizationId")).map(line => Row.fromSeq(line.split("\\|\\^\\|").toSeq)), schema1)
val windowSpec = Window.partitionBy("LineItem_organizationId", "LineItem_lineItemId").orderBy($"TimeStamp".cast(LongType).desc)
val latestForEachKey = data1.withColumn("rank", rank().over(windowSpec)).filter($"rank" === 1).drop("rank", "TimeStamp")
val dfMainOutput = df1resultFinal.join(latestForEachKey, Seq("LineItem_organizationId", "LineItem_lineItemId"), "outer")
.select($"LineItem_organizationId", $"LineItem_lineItemId",
when($"DataPartition_1".isNotNull, $"DataPartition_1").otherwise($"DataPartition").as("DataPartition"),
when($"FinancialConceptCodeGlobalSecondaryId_1".isNotNull, $"FinancialConceptCodeGlobalSecondaryId_1").otherwise($"FinancialConceptCodeGlobalSecondaryId").as("FinancialConceptCodeGlobalSecondaryId"),
when($"FFAction_1".isNotNull, $"FFAction_1").otherwise($"FFAction|!|").as("FFAction|!|"))
.filter(!$"FFAction|!|".contains("D|!|"))
val dfMainOutputFinal = dfMainOutput.na.fill("").select($"DataPartition", $"StatementTypeCode", concat_ws("|^|", dfMainOutput.schema.fieldNames.filter(_ != "DataPartition").map(c => col(c)): _*).as("concatenated"))
val headerColumn = dataHeader.columns.toSeq
val headerLast = headerColumn.mkString("", "|^|", "|!|").dropRight(3)
val dfMainOutputFinalWithoutNull = dfMainOutputFinal.withColumn("concatenated", regexp_replace(col("concatenated"), "|^|null", "")).withColumnRenamed("concatenated", headerLast)
dfMainOutputFinalWithoutNull.repartition(1).write.partitionBy("DataPartition", "StatementTypeCode")
.format("csv")
.option("nullValue", "")
.option("delimiter", "\t")
.option("quote", "\u0000")
.option("header", "true")
.option("codec", "gzip")
.save("s3://path/FinancialLineItem/output")
I even tried setting the master URL in my Spark Scala code.
This EMR example for Spark works:
spark-submit --deploy-mode cluster --class org.apache.spark.examples.JavaSparkPi /usr/lib/spark/examples/jars/spark-examples.jar 5
If this works, then why is my jar not working?
I tried a print statement in my Scala class before creating the Spark context, and it prints, so there is no issue with the jar file creation.
I don't know what I am missing.
I am also updating my Eclipse IDE setup.
I followed the docs below:
AWS add steps document
This is my observation:
A master URL like "spark://..." is for Spark Standalone, but EMR uses Spark on YARN, so the master URL should be just "yarn". This is already configured for you in spark-defaults.conf.
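For reference (an assumption based on a default EMR setup, not taken from the original post), the EMR-provided /etc/spark/conf/spark-defaults.conf typically contains an entry along these lines:
spark.master    yarn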
More findings:
When I tried to submit from spark-shell I got the error below:
User class threw exception: java.lang.UnsupportedOperationException: empty collection.
I think there might be some issue with my code as well.
But I am getting the correct result when I run it from Zeppelin.
There's a lot of confusion going on here in the question and in the first answer. If you're running on EMR, which runs Spark on YARN, you do not need to set a master URL at all. It automatically defaults to "yarn", which is the correct value when running Spark on YARN (as opposed to Spark Standalone, which would have a master URL like spark://<host>:7077).
As mentioned in one of the other answers, "--master local" and "--deploy-mode cluster" also don't make sense together. "--master local" should only be used for local development and testing purposes and doesn't make sense to use on a cluster of machines such as on EMR. All it does is run your entire application in a single JVM; it won't run on YARN, it won't be distributed across the cluster, and there won't even be a separation between your driver code and the tasks.
As for "--deploy-mode cluster", as also stated in the other answer, this means that your driver runs in a YARN container on the cluster along with the executors, as opposed to the default of "--deploy-mode client", where the driver runs on the master node outside of YARN.
For more information, please see the Spark documentation, mainly https://spark.apache.org/docs/latest/submitting-applications.html and https://spark.apache.org/docs/latest/running-on-yarn.html.
As explained in the documentation, --deploy-mode cluster asks spark-submit to run the driver on one of the executors.
That, however, isn't applicable to your execution, as you're running locally. You should be using the client deploy mode. For that, just remove the --deploy-mode parameter altogether.
You have to choose one of the following calls, depending on how you want to run the driver program (or executors, for the last option). It's important to understand the differences as they are consequential.
If you want to run the driver program on the cluster (cluster mode, master chooses where on the cluster):
spark-submit --master master.address.com:7077 --deploy-mode cluster #other options
If you want to run the driver program on the computer that is calling spark-submit (client mode, executors remain on the cluster):
spark-submit --master master.address.com:7077 --deploy-mode client #other options
If you are running all locally (driver and executors), then your local master is appropriate:
spark-submit --master local[*] #other options

SparkConf settings not used when running Spark app in cluster mode on YARN

I wrote a Spark application which sets some configuration via a SparkConf instance, like this:
SparkConf conf = new SparkConf().setAppName("Test App Name");
conf.set("spark.driver.cores", "1");
conf.set("spark.driver.memory", "1800m");
conf.set("spark.yarn.am.cores", "1");
conf.set("spark.yarn.am.memory", "1800m");
conf.set("spark.executor.instances", "30");
conf.set("spark.executor.cores", "3");
conf.set("spark.executor.memory", "2048m");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> inputRDD = sc.textFile(...);
...
When I run this application with the command (master=yarn & deploy-mode=client)
spark-submit --class spark.MyApp --master yarn --deploy-mode client /home/myuser/application.jar
everything seems to work fine: the Spark History UI shows the correct executor information.
But when running it with (master=yarn & deploy-mode=cluster),
my Spark UI shows the wrong executor information (~512 MB instead of ~1400 MB).
Also, my app name equals Test App Name when running in client mode, but is spark.MyApp when running in cluster mode. It seems that some default settings are taken when running in cluster mode. What am I doing wrong here? How can I make these settings work for cluster mode?
I'm using Spark 1.6.2 on a HDP 2.5 cluster, managed by YARN.
OK, I think I found the problem! In short: there's a difference between how Spark settings are handled in standalone mode and in YARN-managed mode!
So when you run Spark applications in standalone mode, you can focus on the Configuration documentation of Spark; see http://spark.apache.org/docs/1.6.2/configuration.html
You can use the following settings for Driver & Executor CPU/RAM (just as explained in the documentation):
spark.executor.cores
spark.executor.memory
spark.driver.cores
spark.driver.memory
BUT: When running Spark inside a YARN-managed Hadoop environment, you have to be careful with the following settings and consider the following points:
Refer to the "Spark on YARN" documentation rather than the Configuration documentation linked above: http://spark.apache.org/docs/1.6.2/running-on-yarn.html (the properties explained there have a higher priority than the ones in the Configuration documentation, which seems to describe only Standalone cluster vs. client mode, not YARN cluster vs. client mode).
you can't use SparkConf to set properties in yarn-cluster mode! Instead use the corresponding spark-submit parameters:
--executor-cores 5
--executor-memory 5g
--driver-cores 3
--driver-memory 3g
In yarn-client mode you can't use the spark.driver.cores and spark.driver.memory properties! You have to use the corresponding AM properties in a SparkConf instance:
spark.yarn.am.cores
spark.yarn.am.memory
You can't set these AM properties via spark-submit parameters!
To set executor resources in yarn-client mode you can use
spark.executor.cores and spark.executor.memory in SparkConf
--executor-cores and --executor-memory parameters in spark-submit
If you set both, the SparkConf settings overwrite the spark-submit parameter values! (See the sketch after this list.)
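As a minimal sketch of the yarn-client case (written in PySpark for brevity, even though the question uses the Java API; the resource values are copied from the question and are illustrative only):
# Assumed launch command: spark-submit --master yarn --deploy-mode client app.py
from pyspark import SparkConf, SparkContext
conf = (SparkConf()
        .setAppName("Test App Name")
        # In yarn-client mode, driver-side resources go through the AM properties...
        .set("spark.yarn.am.cores", "1")
        .set("spark.yarn.am.memory", "1800m")
        # ...while executor resources can be set here or via spark-submit flags.
        .set("spark.executor.instances", "30")
        .set("spark.executor.cores", "3")
        .set("spark.executor.memory", "2048m"))
sc = SparkContext(conf=conf)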
Hope I can help anybody else with these findings.
Just to add on to D. Müller's answer:
The same issue happened to me and I tried the settings in some different combinations. I am running PySpark 2.0.0 on a YARN cluster.
I found that driver memory must be set during spark-submit, but executor memory can be set in the script (i.e. via SparkConf) and the application will still work.
My application will die if the driver memory is less than 2g. The error is:
ERROR yarn.ApplicationMaster: RECEIVED SIGNAL TERM
ERROR yarn.ApplicationMaster: User application exited with status 143
CASE 1:
driver & executor both written in SparkConf
spark = (SparkSession
.builder
.appName("driver_executor_inside")
.enableHiveSupport()
.config("spark.executor.memory","4g")
.config("spark.executor.cores","2")
.config("spark.yarn.executor.memoryOverhead","1024")
.config("spark.driver.memory","2g")
.getOrCreate())
spark-submit --master yarn --deploy-mode cluster myscript.py
CASE 2:
- driver in spark submit
- executor in SparkConf in script
spark = (SparkSession
.builder
.appName("executor_inside")
.enableHiveSupport()
.config("spark.executor.memory","4g")
.config("spark.executor.cores","2")
.config("spark.yarn.executor.memoryOverhead","1024")
.getOrCreate())
spark-submit --master yarn --deploy-mode cluster --conf spark.driver.memory=2g myscript.py
The job finished with a SUCCEEDED status, and the executor memory was correct.
CASE 3:
- driver in spark submit
- executor not written
spark = (SparkSession
.builder
.appName("executor_not_written")
.enableHiveSupport()
.config("spark.executor.cores","2")
.config("spark.yarn.executor.memoryOverhead","1024")
.getOrCreate())
spark-submit --master yarn --deploy-mode cluster --conf spark.driver.memory=2g myscript.py
Apparently the executor memory is not set here, meaning CASE 2 actually captured the executor memory settings despite setting them inside SparkConf.

Running spark application in local mode

I'm trying to start my Spark application in local mode using spark-submit. I am using Spark 2.0.2, Hadoop 2.6 & Scala 2.11.8 on Windows. The application runs fine from within my IDE (IntelliJ), and I can also start it on a cluster with actual, physical executors.
The command I'm running is
spark-submit --class [MyClassName] --master local[*] target/[MyApp]-jar-with-dependencies.jar [Params]
Spark starts up as usual, but then terminates with:
java.io.IOException: Failed to connect to /192.168.88.1:56370
What am I missing here?
Check which port you are using: if on a cluster, log in to the master node and include:
--master spark://XXXX:7077
You can always find it in the Spark UI under port 8080.
Also check your Spark builder config, in case you have set the master already, as it takes priority when launching, e.g.:
val spark = SparkSession
  .builder
  .appName("myapp")
  .master("local[*]")
  .getOrCreate()
