Directory expansion does not work in standalone deployment mode : Apache Spark

I'm trying to deploy a Spark Streaming job that consumes a Kafka topic to a standalone Spark cluster using the following command:
./bin/spark-submit --class MaxwellCdc.MaxwellSreaming
~/cdc/cdc_2.11-0.1.jar --jars ~/cdc/kafka_2.11-0.10.0.1.jar,
~/cdc/kafka-clients-0.10.0.1.jar,~/cdc/mysql-connector-java-5.1.12.jar,
~/cdc/spark-streaming-kafka-0-10_2.11-2.2.1.jar
and I'm getting this exception:
Exception in thread "main" java.lang.NoClassDefFoundError:
org/apache/kafka/common/serialization/StringDeserializer
at MaxwellCdc.MaxwellSreaming$.main(MaxwellSreaming.scala:30)
at MaxwellCdc.MaxwellSreaming.main(MaxwellSreaming.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:775)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException:
org.apache.kafka.common.serialization.StringDeserializer
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
Any help would be appreciated.

Quoting from the documentation:
When using spark-submit, the application jar along with any jars
included with the --jars option will be automatically transferred to
the cluster. URLs supplied after --jars must be separated by commas.
That list is included in the driver and executor classpaths.
Directory expansion does not work with --jars.
What is directory expansion?
Expanding a file name means converting a relative file name to an absolute one. Since this is done relative to a default directory, you must specify the default directory name as well as the file name to be expanded. It also involves expanding abbreviations like ~/.
Therefore, try providing the absolute path for every jar that is passed with the --jars option. I hope this helps.
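For example, something along these lines should work; the /home/youruser/cdc/ prefix below is just an assumption standing in for the expanded form of ~/cdc/, so substitute your own absolute paths. (Also note that spark-submit options such as --jars must come before the application jar; anything placed after the jar is passed to the application as arguments.)
# all paths are illustrative absolute paths
./bin/spark-submit --class MaxwellCdc.MaxwellSreaming \
--jars /home/youruser/cdc/kafka_2.11-0.10.0.1.jar,/home/youruser/cdc/kafka-clients-0.10.0.1.jar,/home/youruser/cdc/mysql-connector-java-5.1.12.jar,/home/youruser/cdc/spark-streaming-kafka-0-10_2.11-2.2.1.jar \
/home/youruser/cdc/cdc_2.11-0.1.jar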

Related

Databricks PySpark with PEX: how can I configure a PySpark job on Databricks using PEX for dependencies?

I am attempting to create a PySpark job via the Databricks UI (with spark-submit) using the spark-submit parameters below (the dependencies are in the PEX file), but I am getting an exception that the PEX file does not exist. It's my understanding that the --files option puts the file in the working directory of the driver and of every executor, so I am confused as to why I am encountering this issue.
Config
[
"--files","s3://some_path/my_pex.pex",
"--conf","spark.pyspark.python=./my_pex.pex",
"s3://some_path/main.py",
"--some_arg","2022-08-01"
]
Standard Error
OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=512m; support was removed in 8.0
Warning: Ignoring non-Spark config property: libraryDownload.sleepIntervalSeconds
Warning: Ignoring non-Spark config property: libraryDownload.timeoutSeconds
Warning: Ignoring non-Spark config property: eventLog.rolloverIntervalSeconds
Exception in thread "main" java.io.IOException: Cannot run program "./my_pex.pex": error=2, No such file or directory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
at org.apache.spark.deploy.PythonRunner$.main(PythonRunner.scala:97)
at org.apache.spark.deploy.PythonRunner.main(PythonRunner.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:951)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1039)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1048)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.IOException: error=2, No such file or directory
at java.lang.UNIXProcess.forkAndExec(Native Method)
at java.lang.UNIXProcess.<init>(UNIXProcess.java:247)
at java.lang.ProcessImpl.start(ProcessImpl.java:134)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
... 14 more
What I have tried
Given that the PEX file doesn't seem to be visible, I have tried adding it in the following ways:
Adding the PEX via the --files option in Spark submit
Adding the PEX via the spark.files config when starting up the actual cluster
Putting the PEX in DBFS (as opposed to s3)
Playing around with the configs (e.g. using spark.pyspark.driver.python instead of spark.pyspark.python)
Note: given the instructions at the bottom of this page, I believe PEX should work on Databricks; I'm just not sure about the right configs: https://www.databricks.com/blog/2020/12/22/how-to-manage-python-dependencies-in-pyspark.html
Note also that the following spark-submit step works on AWS EMR:
'HadoopJarStep': {
'Jar': 'command-runner.jar',
'Args': [
"spark-submit",
"--deploy-mode", "cluster",
"--master", "yarn",
"--files", "s3://some_path/my_pex.pex",
"--conf", "spark.pyspark.driver.python=./my_pex.pex",
"--conf", "spark.executorEnv.PEX_ROOT=./tmp",
"--conf", "spark.yarn.appMasterEnv.PEX_ROOT=./tmp",
"s3://some_path/main.py",
"--some_arg", "some-val"
]
}
Any help would be much appreciated, thanks.

Spark job not running when jar is in HDFS

I am trying to run a Spark job in standalone mode, but the command is not picking up the jar from HDFS. The jar is present in the HDFS location, and it works fine when I run it in local mode.
Below is the command I am using
spark-submit --deploy-mode client --master yarn --class com.main.WordCount /spark/wc.jar
Below is my program:
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("WordCount").setMaster("yarn")
val spark = new SparkContext(conf)
// read the input file passed as the first program argument
val file = spark.textFile(args(0))
// classic word count: split each line on spaces, pair every word with 1, sum the counts
val count = file.flatMap(f => f.split(" ")).map(word => (word, 1)).reduceByKey(_ + _).collect
count.foreach(println)
And I am getting the error below:
Warning: Local jar /spark/wc.jar does not exist, skipping.
java.lang.ClassNotFoundException: com.main.WordCount
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:228)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:693)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
But if I use deploy mode cluster, I get the error below:
Exception in thread "main" java.io.FileNotFoundException: File file:/spark/wc.jar does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:337)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:289)
at org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:340)
at org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:433)
at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$10.apply(Client.scala:530)
at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$10.apply(Client.scala:529)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:529)
at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:834)
at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:167)
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1119)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1178)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:736)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Could you please clarify what you mean by local mode? There are only two deploy modes, client and cluster. The only difference is that in client mode the driver program runs on the machine that submits the job, while in cluster mode the driver program runs on an arbitrary node in the cluster.
For the spark-submit command:
When you execute spark-submit, Spark uploads all the local resources defined with the --files and --py-files arguments, as well as the application's main jar, to a temporary HDFS directory created by that particular Spark application under the application name. When you give an HDFS location instead, it fails to locate the jar on the local machine. It is mandatory to keep the jar on the local file system.
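A minimal sketch of that advice, assuming the jar is copied to a local path such as /home/youruser/wc.jar and the input lives at an illustrative HDFS path:
# pull the jar out of HDFS onto the machine running spark-submit (paths are illustrative)
hdfs dfs -get /spark/wc.jar /home/youruser/wc.jar
# submit the local copy; the input path is forwarded to the program as args(0)
spark-submit --deploy-mode client --master yarn --class com.main.WordCount /home/youruser/wc.jar hdfs:///data/input.txt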

Convert Excel file to csv in Spark 1.X

Is there a tool to convert Excel files into CSV using Spark 1.x?
I got this exception when following this tutorial:
https://github.com/ZuInnoTe/hadoopoffice/wiki/Read-Excel-document-using-Spark-1.x
Exception in thread "main" java.lang.NoClassDefFoundError: org/zuinnote/hadoop/office/format/mapreduce/ExcelFileInputFormat
at org.zuinnote.spark.office.example.excel.SparkScalaExcelIn$.convertToCSV(SparkScalaExcelIn.scala:63)
at org.zuinnote.spark.office.example.excel.SparkScalaExcelIn$.main(SparkScalaExcelIn.scala:56)
at org.zuinnote.spark.office.example.excel.SparkScalaExcelIn.main(SparkScalaExcelIn.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.zuinnote.hadoop.office.format.mapreduce.ExcelFileInputFormat
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
Spark is unable to find the org.zuinnote.hadoop.office.format.mapreduce.ExcelFileInputFormat file-format class on the classpath.
Supply the jar for the dependency below to spark-submit using the --jars parameter:
<!-- https://mvnrepository.com/artifact/com.github.zuinnote/hadoopoffice-fileformat -->
<dependency>
<groupId>com.github.zuinnote</groupId>
<artifactId>hadoopoffice-fileformat</artifactId>
<version>1.0.4</version>
</dependency>
Command:
spark-submit --jars hadoopoffice-fileformat-1.0.4.jar \
#rest of the command arguments
You have to build a fat jar that contains all the necessary dependencies. The example project on the HadoopOffice page shows how to build one. Once you build the fat/uber jar, you simply use it with spark-submit.
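As a rough sketch of that workflow, assuming the example project is built with sbt-assembly (the assembly jar name and the input/output arguments below are illustrative, not the project's actual values):
# build the fat/uber jar that bundles hadoopoffice-fileformat and the other dependencies
sbt clean assembly
# submit the fat jar itself; no extra --jars are needed because everything is bundled
spark-submit --class org.zuinnote.spark.office.example.excel.SparkScalaExcelIn \
target/scala-2.10/example-excel-assembly.jar /path/to/input.xlsx /path/to/csv-output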

running spark on yarn as client

I'm trying to run a Spark job on YARN using:
./bin/spark-submit --class "KafkaToMaprfs" --master yarn --deploy-mode client /home/mapr/kafkaToMaprfs/target/scala-2.10/KafkaToMaprfs.jar
But I'm facing this error:
/opt/mapr/hadoop/hadoop-2.7.0 17/01/03 11:19:26 WARN NativeCodeLoader:
Unable to load native-hadoop library for your platform... using
builtin-java classes where applicable 17/01/03 11:19:38 ERROR
SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: Yarn application has already ended!
It might have been killed or unable to launch application master.
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:124)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:64)
at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:144)
at org.apache.spark.SparkContext.(SparkContext.scala:530)
at KafkaToMaprfs$.main(KafkaToMaprfs.scala:61)
at KafkaToMaprfs.main(KafkaToMaprfs.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:752)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) 17/01/03 11:19:39 WARN MetricsSystem: Stopping a MetricsSystem that is
not running Exception in thread "main"
org.apache.spark.SparkException: Yarn application has already ended!
It might have been killed or unable to launch application master.
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:124)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:64)
at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:144)
at org.apache.spark.SparkContext.(SparkContext.scala:530)
at KafkaToMaprfs$.main(KafkaToMaprfs.scala:61)
at KafkaToMaprfs.main(KafkaToMaprfs.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:752)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
I have a multi-node cluster, and I'm deploying the application from a remote node.
I'm using Spark 1.6.1 and Hadoop 2.7.x.
I didn't set up the cluster, so I can't tell where the mistake lies.
Can anyone please help me fix this?
In my case I'm using the MapR distribution, and I hadn't configured the environment. So I dug through all the conf folders and made some changes in the files below.
1. In spark-env.sh, make sure these values are set correctly (see the filled-in sketch at the end of this answer):
export SPARK_LOG_DIR=
export SPARK_PID_DIR=
export HADOOP_HOME=
export HADOOP_CONF_DIR=
export JAVA_HOME=
export SPARK_SUBMIT_OPTIONS=
2. In yarn-env.sh, make sure YARN_CONF_DIR and JAVA_HOME are set to the correct values.
3. In spark-defaults.conf, set spark.driver.extraClassPath and the value for HADOOP_CONF_DIR.
4. HADOOP_CONF_DIR and JAVA_HOME in $SPARK_HOME/conf/spark-env.sh:
export HADOOP_CONF_DIR=/opt/mapr/hadoop/hadoop-2.7.0/etc/hadoop
export JAVA_HOME=
5. Spark assembly jar: copy the following JAR file from the local file system to a world-readable location on MapR-FS, substituting your Spark version and the specific JAR file name:
/opt/mapr/spark/spark-/lib/spark-assembly--hadoop-mapr-.jar
Now I'm able to run my Spark application in yarn-client mode smoothly using spark-submit.
These are the basic essentials to make Spark connect with YARN.
Correct me if I missed anything.
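For illustration, a filled-in spark-env.sh for a MapR/YARN setup like this one might look roughly as follows. HADOOP_HOME and HADOOP_CONF_DIR come from the paths mentioned above; JAVA_HOME and the log/pid directories are assumptions you must adjust to your own installation:
# $SPARK_HOME/conf/spark-env.sh -- illustrative values only
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64    # assumption: point at your JDK
export HADOOP_HOME=/opt/mapr/hadoop/hadoop-2.7.0
export HADOOP_CONF_DIR=/opt/mapr/hadoop/hadoop-2.7.0/etc/hadoop
export SPARK_LOG_DIR=/opt/mapr/spark/logs             # assumption
export SPARK_PID_DIR=/opt/mapr/spark/pid              # assumption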

IllegalArgumentException with Spark 1.6

I'm running Spark 1.6.0 on CDH 5.7 and I've upgraded my original application from 1.4.1 to 1.6.0. When I try to run my application (which previously worked fine) I get the following error:
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed
at scala.Predef$.require(Predef.scala:221)
at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$6$$anonfun$apply$3.apply(Client.scala:473)
at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$6$$anonfun$apply$3.apply(Client.scala:471)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$6.apply(Client.scala:471)
at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$6.apply(Client.scala:469)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:469)
at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:725)
at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:143)
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1023)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1083)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
I submit the application with:
--jars is a comma-separated list of jars (with absolute paths)
--files is a comma-separated list of files (with absolute paths)
--driver-class-path is a colon-separated list of resources (without the full path, just the file names)
I have tried it with full paths for the driver (and executor) class paths, but that gives me the same issue. All files and jars submitted with the app exist, I checked.
Could this be related to the issue with duplicates in the distributed cache or is this another issue?
From the source code I see that the only calls to require without a custom message (as in the stack trace) are related to the distribute() method. If so, how can I run applications without upgrading Spark?
This is the exception that results from having the same path/URI appear twice in the list passed to the --files option; removing the duplicate entry lets the require check in Client.prepareLocalResources pass.
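As an illustration (the file names are hypothetical), a submit command that trips this check and a corrected version:
# fails on Spark 1.6.0: app.conf is listed twice in --files
spark-submit --master yarn --files /etc/myapp/app.conf,/etc/myapp/app.conf,/etc/myapp/log4j.properties ...
# works: each path appears exactly once
spark-submit --master yarn --files /etc/myapp/app.conf,/etc/myapp/log4j.properties ...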
