Spark-submit fails with return code 13 for example of wordCount - apache-spark

My spark-submit command is :
spark-submit --class com.sundogsoftware.spark.WordCountBetterDataset --master yarn --deploy-mode cluster SparkCourse.jar
And for defining the sparkSession, i use this :
val spark = SparkSession
.builder
.master("spark://youness:7077")
.appName("WordCount")
.getOrCreate()
but at the end, my job fails with return code 13.

You need to let the master unset in the code. It is preferable to set it later when you issue spark-submit (spark-submit --master yarn-client ...) and you are already doing that above. Just remove .master("spark://youness:7077") from your code.

Related

Spark Kafka Streaming not displaying data on spark-submit on EMR

I am trying to stream the data from a Kafka topic, it is working in spark shell. But if i create a .py file and use spark-submit for the same, it is failing:
Code:
spark_session = SparkSession.builder.appName("TestApp").enableHiveSupport().getOrCreate()
kafka_bootstrap_server = BOOTSTRAP_SERVERS
topic = 'ota-impactreportsync'
starting_offsets = 'earliest'
df = spark_session.readStream.format("kafka").option("kafka.bootstrap.servers", kafka_bootstrap_server).option(
"subscribe", topic).option("startingOffsets", starting_offsets).option("failOnDataLoss", "false").load()
df.writeStream.format("console").outputMode("append").start()
Commands used:
pyspark --master local --packages io.delta:delta-core_2.12:2.1.1,org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.1,org.apache.spark:spark-avro_2.12:3.3.1 --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
spark-submit --master local --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.1,io.delta:delta-core_2.12:2.1.1,org.apache.spark:spark-avro_2.12:3.3.1 --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog" test.py
If i use batch streaming with spark-submit it is working
The job fails in like 10 seconds everytime, the logs are not helpful either. No errors.

spark-submit command with --py-files fails if the driver class path or executor class path is not set

I have a main script as below
from pyspark.sql.session import SparkSession
..............
..............
..............
import callmodule as cm <<<--- This is imported from another pyspark script which is in callmod.zip file
..............
..............
..............
when I submit the spark command as below it fails with Error: No module named Callmodule
spark-submit --master local --py-files C:\pyspark\scripts\callmod.zip mainscript.py
when I submit the spark command with driver class path(without executor class path) as below it runs successfully.
spark-submit --master local --py-files C:\pyspark\scripts\callmod.zip --driver-class-path C:\pyspark\scripts\callmod.zip mainscript.py
(or)
spark-submit --master local --py-files C:\pyspark\scripts\callmod.zip --conf spark.driver.extraClassPath=C:\pyspark\scripts\callmod.zip mainscript.py
(or)
spark-submit --master local --py-files C:\pyspark\scripts\callmod.zip --conf spark.driver.extraLibraryPath=C:\pyspark\scripts\callmod.zip mainscript.py
when I submit the spark command with executor class path (without driver classpath) also it runs successfully.
spark-submit --master local --py-files C:\pyspark\scripts\callmod.zip --conf spark.executor.extraClassPath=C:\pyspark\scripts\callmod.zip mainscript.py
(or)
spark-submit --master local --py-files C:\pyspark\scripts\callmod.zip --conf spark.executor.extraLibraryPath=C:\pyspark\scripts\callmod.zip mainscript.py
Can you explain me where does the below import statement work? on driver or executor?
import callmodule as cm
Why is the code not failing with Error: No Module Named callmodule when only the driver classpath is set or only the executor classpath is set?
You are using --master local, so the driver is the same as the executor. Therefore, setting classpath on either driver or executor produces the same behaviour, and neither would cause an error.

A master URL must be set in your configuration gives lot of confusion

I have compiled my spark-scala code in eclipse.
I am trying to run my jar in EMR (5.9.0 Spark 2.2.0)using spark-submit option.
But when I run I get an error:
Details : Exception in thread "main" org.apache.spark.SparkException: A master URL must be set in your configuration
After reading lots of StackOverflow solution I get confused and did not find a correct explanation of how and why to set app master.
This is how I run my jar.I have tried all below option
spark-submit --class financialLineItem.FinancialLineItem s3://trfsmallfffile/AJAR/SparkJob-0.1-jar-with-dependencies.jar
spark-submit --master yarn --class financialLineItem.FinancialLineItem s3://trfsmallfffile/AJAR/SparkJob-0.1-jar-with-dependencies.jar
spark-submit --master yarn-client --class financialLineItem.FinancialLineItem s3://trfsmallfffile/AJAR/SparkJob-0.1-jar-with-dependencies.jar
spark-submit --deploy-mode cluster --master yarn --class financialLineItem.FinancialLineItem s3://trfsmallfffile/AJAR/SparkJob-0.1-jar-with-dependencies.jar
spark-submit --deploy-mode cluster --master yarn-client --class financialLineItem.FinancialLineItem s3://trfsmallfffile/AJAR/SparkJob-0.1-jar-with-dependencies.jar
spark-submit --master local[*] --deploy-mode cluster --class financialLineItem.FinancialLineItem s3://trfsmallfffile/AJAR/SparkJob-0.1-jar-with-dependencies.jar
spark-submit --master local[1] --deploy-mode cluster --class financialLineItem.FinancialLineItem s3://trfsmallfffile/AJAR/SparkJob-0.1-jar-with-dependencies.jar
spark-submit --master local[2] --deploy-mode cluster --class financialLineItem.FinancialLineItem s3://trfsmallfffile/AJAR/SparkJob-0.1-jar-with-dependencies.jar
spark-submit --master local[3] --deploy-mode cluster --class financialLineItem.FinancialLineItem s3://trfsmallfffile/AJAR/SparkJob-0.1-jar-with-dependencies.jar
spark-submit --master local[4] --deploy-mode cluster --class financialLineItem.FinancialLineItem s3://trfsmallfffile/AJAR/SparkJob-0.1-jar-with-dependencies.jar
spark-submit --master local[5] --deploy-mode cluster --class financialLineItem.FinancialLineItem s3://trfsmallfffile/AJAR/SparkJob-0.1-jar-with-dependencies.jar
I am not setting any app master in my Scala code .
package financialLineItem
import org.apache.spark.SparkConf
import org.apache.spark._
import org.apache.spark.sql.SQLContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql._
import org.apache.spark.sql.functions.rank
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import org.apache.spark.{ SparkConf, SparkContext }
import java.sql.{ Date, Timestamp }
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.input_file_name
import org.apache.spark.sql.functions.regexp_extract
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.expressions._
object FinancialLineItem {
def main(args: Array[String]) {
println("Enterin In to Spark Mode ")
val conf = new SparkConf().setAppName("FinanicalLineItem");
println("After conf")
val sc = new SparkContext(conf); //Creating spark context
println("After SC")
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
val get_cus_val = sqlContext.udf.register("get_cus_val", (filePath: String) => filePath.split("\\.")(3))
val rdd = sc.textFile("s3://path/FinancialLineItem/MAIN")
val header = rdd.filter(_.contains("LineItem.organizationId")).map(line => line.split("\\|\\^\\|")).first()
val schema = StructType(header.map(cols => StructField(cols.replace(".", "_"), StringType)).toSeq)
val data = sqlContext.createDataFrame(rdd.filter(!_.contains("LineItem.organizationId")).map(line => Row.fromSeq(line.split("\\|\\^\\|").toSeq)), schema)
val schemaHeader = StructType(header.map(cols => StructField(cols.replace(".", "."), StringType)).toSeq)
val dataHeader = sqlContext.createDataFrame(rdd.filter(!_.contains("LineItem.organizationId")).map(line => Row.fromSeq(line.split("\\|\\^\\|").toSeq)), schemaHeader)
val df1resultFinal = data.withColumn("DataPartition", get_cus_val(input_file_name))
val rdd1 = sc.textFile("s3://path/FinancialLineItem/INCR")
val header1 = rdd1.filter(_.contains("LineItem.organizationId")).map(line => line.split("\\|\\^\\|")).first()
val schema1 = StructType(header1.map(cols => StructField(cols.replace(".", "_"), StringType)).toSeq)
val data1 = sqlContext.createDataFrame(rdd1.filter(!_.contains("LineItem.organizationId")).map(line => Row.fromSeq(line.split("\\|\\^\\|").toSeq)), schema1)
val windowSpec = Window.partitionBy("LineItem_organizationId", "LineItem_lineItemId").orderBy($"TimeStamp".cast(LongType).desc)
val latestForEachKey = data1.withColumn("rank", rank().over(windowSpec)).filter($"rank" === 1).drop("rank", "TimeStamp")
val dfMainOutput = df1resultFinal.join(latestForEachKey, Seq("LineItem_organizationId", "LineItem_lineItemId"), "outer")
.select($"LineItem_organizationId", $"LineItem_lineItemId",
when($"DataPartition_1".isNotNull, $"DataPartition_1").otherwise($"DataPartition").as("DataPartition"),
when($"FinancialConceptCodeGlobalSecondaryId_1".isNotNull, $"FinancialConceptCodeGlobalSecondaryId_1").otherwise($"FinancialConceptCodeGlobalSecondaryId").as("FinancialConceptCodeGlobalSecondaryId"),
when($"FFAction_1".isNotNull, $"FFAction_1").otherwise($"FFAction|!|").as("FFAction|!|"))
.filter(!$"FFAction|!|".contains("D|!|"))
val dfMainOutputFinal = dfMainOutput.na.fill("").select($"DataPartition", $"StatementTypeCode", concat_ws("|^|", dfMainOutput.schema.fieldNames.filter(_ != "DataPartition").map(c => col(c)): _*).as("concatenated"))
val headerColumn = dataHeader.columns.toSeq
val headerLast = headerColumn.mkString("", "|^|", "|!|").dropRight(3)
val dfMainOutputFinalWithoutNull = dfMainOutputFinal.withColumn("concatenated", regexp_replace(col("concatenated"), "|^|null", "")).withColumnRenamed("concatenated", headerLast)
dfMainOutputFinalWithoutNull.repartition(1).write.partitionBy("DataPartition", "StatementTypeCode")
.format("csv")
.option("nullValue", "")
.option("delimiter", "\t")
.option("quote", "\u0000")
.option("header", "true")
.option("codec", "gzip")
.save("s3://path/FinancialLineItem/output")
Even i tried setting master url in spark-scala code.
This is working in EMR example for spark
spark-submit --deploy-mode cluster --class org.apache.spark.examples.JavaSparkPi /usr/lib/spark/examples/jars/spark-examples.jar 5
If this working then why my jar is not working ?
I tried printing statement in my scala class before creating spark context and it is printing ,so there is no issue in jar file creation .
I don't know what am i missing ?
Updating my eclipse IDE setup also .
Followed below docs
AWS add steps document
This is what my observation
A master URL like "spark://..." is for Spark Standalone, but EMR uses Spark on YARN, so the master URL should be just "yarn". This is already configured for you in spark-defaults.conf,
More findings .
When i tried to submit from spark-shell i got below error
User class threw exception: java.lang.UnsupportedOperationException: empty collection.
I think there might some issue with my code also .
But i am getting correct result when i run it from Zeppelin .
There's a lot of confusion going on here in the question and in the first answer. If you're running on EMR, which runs Spark on YARN, you do not need to set a master URL at all. It automatically defaults to "yarn", which is the correct value when running Spark on YARN (as opposed to Spark Standalone, which would have a master URL like spark://:7077).
As mentioned in one of the other answers, "--master local" and "--deploy-mode cluster" also don't make sense together. "--master local" should only be used for local development and testing purposes and doesn't make sense to use on a cluster of machines such as on EMR. All it does is run your entire application in a single JVM; it won't run on YARN, it won't be distributed across the cluster, and there won't even be a separation between your driver code and the tasks.
As for "--deploy-mode cluster", as also stated in the other answer, this means that your driver runs in a YARN container on the cluster along with the executors, as opposed to the default of "--deploy-mode client", where the driver runs on the master node outside of YARN.
For more information, please see the Spark documentation, mainly https://spark.apache.org/docs/latest/submitting-applications.html and https://spark.apache.org/docs/latest/running-on-yarn.html.
As explained in the documentation, --deploy-mode cluster asks spark-submit to run the driver on one of the executors.
That, however, isn't applicable to your execution. as you're running locally. You should be using the client deploy mode. For that, just remove the --deploy-mode parameter altogether.
You have to choose one of the following calls, depending on how you want to run the driver program (or executors, for the last option). It's important to understand the differences as they are consequential.
If you want to run the driver program on the cluster (cluster mode, master chooses where on the cluster):
spark-submit --master master.address.com:7077 --deploy-mode cluster #other options
If you want to run the driver program on the compute that is calling spark-submit (client mode, executors remain on the cluster):
spark-submit --master master.address.com:7077 --deploy-mode client #other options
If you are running all locally (driver and executors), then your local master is appropriate:
spark-submit --master local[*] #other options

Spark-submit throwing error from yarn-cluster mode

How do we pass a config file to executor when we submit a spark job on yarn-cluster?
If I change my below spark-submit command as --master yarn-client then it works fine , I get respective output
spark-submit\
--files /home/cloudera/conf/omega.config \
--class com.mdm.InitProcess \
--master yarn-cluster \
--num-executors 7 \
--executor-memory 1024M \
/home/cloudera/Omega.jar \
/home/cloudera/conf/omega.config
My Spark Code:
object InitProcess
{
def main(args: Array[String]): Unit = {
val config_loc = args(0)
val config = ConfigFactory.parseFile(new File(config_loc ))
val jobName =config.getString("job_name")
.....
}
}
I am getting the below error
17/04/05 12:01:39 ERROR yarn.ApplicationMaster: User class threw exception: com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'job_name'
com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'job_name'
Could someone help me on running this command in --master yarn-cluster ?
The different between yarn-client and yarn-cluster is that in yarn-client the driver location is on the machine running the spark-submit command.
In your case, the location of the config file is /home/cloudera/conf/omega.config which can be found when you running as yarn-client as the driver is running from the current machine which holds this full/path/to/file.
But can't be access in yarn-cluster mode, as the driver is running on other host, which doesn't holds this full/path/to/file.
I'd suggest execution the command in the following format:
spark-submit\
--master yarn-cluster \
--num-executors 7 \
--executor-memory 1024M \
--class com.mdm.InitProcess \
--files /home/cloudera/conf/omega.config \
--jar /home/cloudera/Omega.jar omega.config
Sending the config file using --files with its full-path-name, and providing it as parameter to the jar as it filename (not with a full path) as the file will be downloaded to unknown location on the workers.
In your code, you can use SparkFiles.get(filename) in order to get the actual full-path-name of the downloaded file on the worker
The change in your code should be something similar to:
val config_loc = SparkFiles.get(args(0))
SparkFiles docs
public class SparkFiles
Resolves paths to files added through SparkContext.addFile().

How to spark-submit a python file in spark 2.1.0?

I am currently running spark 2.1.0. I have worked most of the time in PYSPARK shell, but I need to spark-submit a python file(similar to spark-submit jar in java) . How do you do that in python?
pythonfile.py
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("appName").getOrCreate()
sc = spark.sparkContext
rdd = sc.parallelize([1,2,3,4,5,6,7])
print(rdd.count())
Run the above program with configurations you want : eg :
YOUR_SPARK_HOME/bin/spark-submit --master yourSparkMaster --num-executors 20 \
--executor-memory 1G --executor-cores 2 --driver-memory 1G \
pythonfile.py
These options are not mandatory. You can even run like
YOUR_SPARK_HOME/bin/spark-submit --master sparkMaster/local pythonfile.py

Resources