Databricks PySpark with PEX: how can I configure a PySpark job on Databricks using PEX for dependencies?

I am attempting to create a PySpark job via the Databricks UI (with spark-submit), using the spark-submit parameters below (the dependencies are packaged in the PEX file), but I am getting an exception saying that the PEX file does not exist. My understanding is that the --files option puts the file in the working directory of the driver and of every executor, so I am confused about why I am running into this issue.
Config
[
"--files","s3://some_path/my_pex.pex",
"--conf","spark.pyspark.python=./my_pex.pex",
"s3://some_path/main.py",
"--some_arg","2022-08-01"
]
Standard Error
OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=512m; support was removed in 8.0
Warning: Ignoring non-Spark config property: libraryDownload.sleepIntervalSeconds
Warning: Ignoring non-Spark config property: libraryDownload.timeoutSeconds
Warning: Ignoring non-Spark config property: eventLog.rolloverIntervalSeconds
Exception in thread "main" java.io.IOException: Cannot run program "./my_pex.pex": error=2, No such file or directory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
at org.apache.spark.deploy.PythonRunner$.main(PythonRunner.scala:97)
at org.apache.spark.deploy.PythonRunner.main(PythonRunner.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:951)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1039)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1048)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.IOException: error=2, No such file or directory
at java.lang.UNIXProcess.forkAndExec(Native Method)
at java.lang.UNIXProcess.<init>(UNIXProcess.java:247)
at java.lang.ProcessImpl.start(ProcessImpl.java:134)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
... 14 more
What I have tried
Given that the PEX file doesn't seem to be visible, I have tried adding it in the following ways:
Adding the PEX via the --files option in Spark submit
Adding the PEX via the spark.files config when starting up the actual cluster
Putting the PEX in DBFS (as opposed to s3)
Playing around with the configs (e.g. using spark.pyspark.driver.python instead of spark.pyspark.python)
Note: given the instructions at the bottom of this page, I believe PEX should work on Databricks; I'm just not sure about the right configs: https://www.databricks.com/blog/2020/12/22/how-to-manage-python-dependencies-in-pyspark.html
Note also that the following spark-submit step works on AWS EMR:
'HadoopJarStep': {
'Jar': 'command-runner.jar',
'Args': [
"spark-submit",
"--deploy-mode", "cluster",
"--master", "yarn",
"--files", "s3://some_path/my_pex.pex",
"--conf", "spark.pyspark.driver.python=./my_pex.pex",
"--conf", "spark.executorEnv.PEX_ROOT=./tmp",
"--conf", "spark.yarn.appMasterEnv.PEX_ROOT=./tmp",
"s3://some_path/main.py",
"--some_arg", "some-val"
]
}
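For reference, a direct translation of those EMR parameters into the Databricks spark-submit parameter format would presumably look something like the following sketch (the PEX_ROOT executor environment variable is carried over from the EMR job; the spark.yarn.appMasterEnv property is omitted since it only applies on YARN):
[
"--files", "s3://some_path/my_pex.pex",
"--conf", "spark.pyspark.driver.python=./my_pex.pex",
"--conf", "spark.pyspark.python=./my_pex.pex",
"--conf", "spark.executorEnv.PEX_ROOT=./tmp",
"s3://some_path/main.py",
"--some_arg", "2022-08-01"
]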
Any help would be much appreciated, thanks.

Related

Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.shaded.javax.ws.rs.core.NoContentException

I've set up a 3-node Hadoop cluster (1 master and 2 workers) with YARN, along with Spark.
My PySpark scripts need org.elasticsearch.spark in order to write to Elasticsearch. I'm providing this with the parameter --packages org.elasticsearch:elasticsearch-spark-30_2.12:8.4.1 when executing my PySpark script, that is, when executing it with spark-submit.
I'm stuck with this error:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/shaded/javax/ws/rs/core/NoContentException
at org.apache.hadoop.yarn.util.timeline.TimelineUtils.<clinit>(TimelineUtils.java:60)
at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:200)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:191)
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1327)
at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1764)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:958)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1046)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1055)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.shaded.javax.ws.rs.core.NoContentException
at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
... 13 more
What I have tried:
I have tried adding all the paths listed in this answer - https://stackoverflow.com/a/25393369/6490744 - it doesn't work.
I had Hadoop 3.1.1; after checking https://github.com/apache/incubator-kyuubi/issues/2904 (which mentions that the issue is resolved in Hadoop 3.3.3), I upgraded to 3.3.3, but the issue still persists.
I have also tried manually downloading the jar into my spark/jars directory using wget -U "Any User Agent" https://repo1.maven.org/maven2/org/elasticsearch/elasticsearch-spark-30_2.12/8.4.1/elasticsearch-spark-30_2.12-8.4.1.jar, and then running spark-submit without passing --packages (since the jar is already on the path).
All of this gives me the same error.
After two hours of struggle, I got the clue from https://github.com/apache/incubator-kyuubi/issues/2904#issuecomment-1158643036:
I had yarn.timeline-service.enabled set to true in my /etc/hadoop/yarn-site.xml; after updating it to false, the error is gone.
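For reference, the corresponding entry in /etc/hadoop/yarn-site.xml after the change looks like this:
<property>
  <name>yarn.timeline-service.enabled</name>
  <value>false</value>
</property>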
Now I wonder how to set up the YARN timeline server.

How do you use Target's data-validator in Azure Databricks?

I'm trying to run data-validator, the data validation framework created by Target, to validate data from a Parquet file in Azure Databricks.
I have created a Spark job that uses the data-validator fat jar.
If I pass the --help parameter, I get the usage help for data-validator, but when I pass --config test_config.yaml, data-validator can't find the file.
OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=512m; support was removed in 8.0
Warning: Ignoring non-Spark config property: libraryDownload.sleepIntervalSeconds
Warning: Ignoring non-Spark config property: libraryDownload.timeoutSeconds
Warning: Ignoring non-Spark config property: eventLog.rolloverIntervalSeconds
21/12/30 06:17:29 INFO Main$: Logging configured!
21/12/30 06:17:29 INFO Main$: Data Validator
21/12/30 06:17:30 INFO ConfigParser$: Parsing `dbfs:/FileStore/shared_uploads/jyoti/test_config.yaml`
21/12/30 06:17:30 INFO ConfigParser$: Attempting to load `dbfs:/FileStore/shared_uploads/jyoti/test_config.yaml` from file system
21/12/30 06:17:30 ERROR Main$: Failed to parse config file 'dbfs:/FileStore/shared_uploads/jyoti/test_config.yaml, {}
DecodingFailure(java.io.FileNotFoundException: dbfs:/FileStore/shared_uploads/jyoti/test_config.yaml (No such file or directory)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.<init>(FileInputStream.java:138)
at scala.io.Source$.fromFile(Source.scala:94)
at scala.io.Source$.fromFile(Source.scala:79)
at scala.io.Source$.fromFile(Source.scala:57)
at com.target.data_validator.ConfigParser$.com$target$data_validator$ConfigParser$$loadFromFile(ConfigParser.scala:39)
at com.target.data_validator.ConfigParser$$anonfun$6.apply(ConfigParser.scala:57)
at com.target.data_validator.ConfigParser$$anonfun$6.apply(ConfigParser.scala:54)
at scala.util.Try$.apply(Try.scala:213)
at com.target.data_validator.ConfigParser$.parseFile(ConfigParser.scala:53)
at com.target.data_validator.Main$.loadConfigRun(Main.scala:23)
at com.target.data_validator.Main$.main(Main.scala:171)
at com.target.data_validator.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:951)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1039)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1048)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
, List())
21/12/30 06:17:30 ERROR Main$: data-validator failed!
I have stored the YAML file in DBFS.
Please let me know how to pass the YAML config file to data-validator when using a spark-submit job in Databricks.
You need to pass the file name formatted according to the DBFS local file API, because the ConfigParser library most probably only works with local files. To do that, replace the dbfs: prefix with /dbfs - in your example, change dbfs:/FileStore/shared_uploads/jyoti/test_config.yaml to /dbfs/FileStore/shared_uploads/jyoti/test_config.yaml.
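For example, if the spark-submit parameters for the job look something like the sketch below (the jar path is just illustrative), only the --config value needs to use the /dbfs form:
[
"--class", "com.target.data_validator.Main",
"dbfs:/FileStore/jars/data-validator-assembly.jar",
"--config", "/dbfs/FileStore/shared_uploads/jyoti/test_config.yaml"
]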

Spark Submit error when running a JAR from Azure Databricks

I'm trying to issue spark-submit from the Azure Databricks jobs scheduler and am currently stuck with the error below. The error says: File file:/tmp/spark-events does not exist. I need some pointers to understand whether this directory needs to be created in the Azure Blob location (which is my storage layer) or in the Azure DBFS location.
The link below does not make it clear where to create the directory when running spark-submit from the Azure Databricks jobs scheduler.
SparkContext Error - File not found /tmp/spark-events does not exist
Error:
OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=512m; support was removed in 8.0
Warning: Ignoring non-Spark config property: eventLog.rolloverIntervalSeconds
Exception in thread "main" java.lang.ExceptionInInitializerError
at com.dta.dl.ct.qm.hbase.reverse.pipeline.HBaseVehicleMasterLoad.main(HBaseVehicleMasterLoad.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.FileNotFoundException: File file:/tmp/spark-events does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
at org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:97)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:580)
at com.dta.dl.ct.qm.hbase.reverse.pipeline.HBaseVehicleMasterLoad$.<init>(HBaseVehicleMasterLoad.scala:32)
at com.dta.dl.ct.qm.hbase.reverse.pipeline.HBaseVehicleMasterLoad$.<clinit>(HBaseVehicleMasterLoad.scala)
... 13 more
You need to create this folder on the driver node before collecting event logs (that's by design).
To do so, one option is to set the property spark.history.fs.logDirectory (found in the spark-defaults.conf file) via a global init script, as described here.
Please make sure that the folder defined in that property exists and can be accessed from the driver node.
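As a minimal sketch (assuming the default event log location from the error above), a global init script that simply pre-creates the directory on the driver could be as small as:
#!/bin/bash
# Pre-create the local event log directory expected by EventLoggingListener
mkdir -p /tmp/spark-events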

Directory expansion does not work in standalone deployment mode: Apache Spark

I'm trying to deploy a Spark Streaming job that consumes a Kafka topic to a standalone Spark cluster, using the following command:
./bin/spark-submit --class MaxwellCdc.MaxwellSreaming
~/cdc/cdc_2.11-0.1.jar --jars ~/cdc/kafka_2.11-0.10.0.1.jar,
~/cdc/kafka-clients-0.10.0.1.jar,~/cdc/mysql-connector-java-5.1.12.jar,
~/cdc/spark-streaming-kafka-0-10_2.11-2.2.1.jar
and getting this exception:
Exception in thread "main" java.lang.NoClassDefFoundError:
org/apache/kafka/common/serialization/StringDeserializer
at MaxwellCdc.MaxwellSreaming$.main(MaxwellSreaming.scala:30)
at MaxwellCdc.MaxwellSreaming.main(MaxwellSreaming.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:775)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException:
org.apache.kafka.common.serialization.StringDeserializer
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
Any help would be appreciated.
Quoting from the documentation:
When using spark-submit, the application jar along with any jars
included with the --jars option will be automatically transferred to
the cluster. URLs supplied after --jars must be separated by commas.
That list is included in the driver and executor classpaths.
Directory expansion does not work with --jars.
What is directory expansion?
Expanding a file name means converting a relative file name to an absolute one. Since this is done relative to a default directory, you must specify the default directory name as well as the file name to be expanded. It also involves expanding abbreviations like ~/.
Therefore, try providing absolute paths for all the jars supplied with the --jars option. I hope this helps.
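For example, with the home directory written out explicitly (the /home/myuser prefix below is just a placeholder for your actual absolute path), the command would become something like the following; note also that spark-submit options such as --jars must appear before the application jar, since anything after the application jar is passed to the application itself:
./bin/spark-submit --class MaxwellCdc.MaxwellSreaming \
  --jars /home/myuser/cdc/kafka_2.11-0.10.0.1.jar,/home/myuser/cdc/kafka-clients-0.10.0.1.jar,/home/myuser/cdc/mysql-connector-java-5.1.12.jar,/home/myuser/cdc/spark-streaming-kafka-0-10_2.11-2.2.1.jar \
  /home/myuser/cdc/cdc_2.11-0.1.jar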

IllegalArgumentException with Spark 1.6

I'm running Spark 1.6.0 on CDH 5.7 and I've upgraded my original application from 1.4.1 to 1.6.0. When I try to run my application (which previously worked fine) I get the following error:
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed
at scala.Predef$.require(Predef.scala:221)
at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$6$$anonfun$apply$3.apply(Client.scala:473)
at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$6$$anonfun$apply$3.apply(Client.scala:471)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$6.apply(Client.scala:471)
at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$6.apply(Client.scala:469)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:469)
at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:725)
at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:143)
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1023)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1083)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
I submit the application with:
--jars is a comma-separated list of jars (with absolute paths)
--files is a comma-separated list of files (with absolute paths)
--driver-class-path is a colon-separated list of resources (without the full path, just the file names)
I have tried it with full paths for the driver (and executor) class paths, but that gives me the same issue. All files and jars submitted with the app exist, I checked.
Could this be related to the issue with duplicates in the distributed cache or is this another issue?
From the source code I see that the only calls to require without a custom message (as in the stack trace) are related to the distribute() method. If so, how can I run applications without upgrading Spark?
This is the exception which results from having the same path/URI appear twice in the argument to the --files option.
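For illustration (the file paths here are hypothetical), a submission like the first command below trips that require check in Client.prepareLocalResources, while de-duplicating the list should pass it:
spark-submit --master yarn --files /path/app.conf,/path/app.conf /path/to/app.jar    # fails with "requirement failed"
spark-submit --master yarn --files /path/app.conf /path/to/app.jar                   # same submission, de-duplicated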
