PySpark - Failed to get main class in JAR with error 'File file:/home/xpto/spark/, does not exist'

I'm using PySpark to write to Kafka.
When I run the command:
bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-10-assembly_2.12:3.0.1,org.apache.spark:spark-sql-kafka-0-10_2.11:2.0.2 --jars /home/xpto/spark/jars/spark-streaming-kafka-0-10-assembly_2.12-3.0.1.jar , /home/xpto/spark/jars/spark-sql-kafka-0-10_2.11-2.0.2.jar , /home/xpto/spark/jars/kafka-clients-2.6.0.jar --verbose --master local[2] /home/xavy/Documents/PersonalProjects/Covid19Analysis/pyspark_job_to_write_data_to_kafkatopic.py
I'm receiving this error:
:: retrieving :: org.apache.spark#spark-submit-parent-ad9bf9ab-6d6d-4edd-bd1f-4b3145c2457f
confs: [default]
0 artifacts copied, 7 already retrieved (0kB/3ms)
20/11/22 18:35:02 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Exception in thread "main" org.apache.spark.SparkException: Failed to get main class in JAR with error 'File file:/home/xpto/spark/, does not exist'. Please specify one with --class.
at org.apache.spark.deploy.SparkSubmit.error(SparkSubmit.scala:936)
at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:457)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:871)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
I don't know which class Spark is asking for...
I'm running this locally on my PC; I'm not sure if this is the right way to do it.
Can someone help and point me in the right direction?

So, spacing matters - make sure you don't put spaces between the comma-separated file paths.
For example, you've put this path (with a space in front of it) in your --jars list:
, /home/xpto/spark/jars/spark-sql-kafka-0-10_2.11-2.0.2.jar
It's not clear why you give local file paths at all, when fetching the dependencies from Maven with --packages should work fine. However, you need to use consistent Spark versions: you've mixed 3.x and 2.x, as well as Scala 2.12 and 2.11.
You also don't need both spark-streaming-kafka and spark-sql-kafka.
Regarding the error, the syntax spark-submit thinks you've tried to use is the one for Java/Scala applications:
spark-submit [options] --class MainClass application.jar
For Python applications there is no --class; the .py file itself is the application, and you might want to use --py-files for any additional Python dependencies.
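For example, a cleaned-up command for the question above might look like this (a sketch, assuming a local Spark 3.0.1 installation built against Scala 2.12; --packages pulls in kafka-clients transitively, so no --jars are needed):
bin/spark-submit \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1 \
  --master local[2] \
  /home/xavy/Documents/PersonalProjects/Covid19Analysis/pyspark_job_to_write_data_to_kafkatopic.py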

Related

How to specify --jars in spark-submit?

I want to add a couple of GCS connector jars to spark-submit; these will allow my Spark application to read files from Google Cloud Storage.
I can do the following in Python:
from pyspark.sql import SparkSession
from pyspark import pandas as pd
spark = SparkSession.builder.config("spark.jars", "/usr/local/share/google/dataproc/lib/gcs-connector.jar,/usr/local/share/google/dataproc/lib/gcs-connector-hadoop3-2.2.3.jar").getOrCreate()
pd.read_csv("gs://monsoon-credittech.appspot.com/mar19/training_data.csv", nrows=2)
But I cannot achieve the same with the spark-submit utility:
spark-submit --master yarn --jars /usr/local/share/google/dataproc/lib/gcs-connector-hadoop3-2.2.3.jar, /usr/local/share/google/dataproc/lib/gcs-connector.jar testing_dep.py --num 2 --path gs://monsoon-credittech.appspot.com/mar19/training_data.csv
output:
2022-01-08 16:34:43,605 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Exception in thread "main" org.apache.spark.SparkException: No main class set in JAR; please specify one with --class.
at org.apache.spark.deploy.SparkSubmit.error(SparkSubmit.scala:972)
at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:492)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:898)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Why is spark-submit asking me to provide --class? My entry point is testing_dep.py; there is no class to specify. How can I add those jars?
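As in the first question above, the space after the comma splits the --jars list: spark-submit treats the second jar as the application JAR and then complains that it has no main class. A likely fix is simply removing the space (a sketch of the same command):
spark-submit --master yarn \
  --jars /usr/local/share/google/dataproc/lib/gcs-connector-hadoop3-2.2.3.jar,/usr/local/share/google/dataproc/lib/gcs-connector.jar \
  testing_dep.py --num 2 --path gs://monsoon-credittech.appspot.com/mar19/training_data.csv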

spark2-submit throwing error with multiple packages (--packages)

I'm trying to submit the following Spark 2 job on a CDH 5.16 cluster, and it's only taking the first parameter of the --packages option and throwing an error for the second parameter:
spark2-submit --packages com.databricks:spark-xml_2.11:0.4.1, com.databricks:spark-csv_2.11:1.5.0 /path/to/python-script
Exception in thread "main" org.apache.spark.SparkException: Cannot load main class from JAR com.databricks:spark-csv_2.11:1.5.0 with URI com.databricks. Please specify a class through --class.
at org.apache.spark.deploy.SparkSubmitArguments.error(SparkSubmitArguments.scala:657)
at org.apache.spark.deploy.SparkSubmitArguments.loadEnvironmentArguments(SparkSubmitArguments.scala:224)
at org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:116)
at org.apache.spark.deploy.SparkSubmit$$anon$2$$anon$1.<init>(SparkSubmit.scala:911)
at org.apache.spark.deploy.SparkSubmit$$anon$2.parseArguments(SparkSubmit.scala:911)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:81)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
I'm running this job on a CDH 5.16 cluster with Spark installed via the Spark2 CSD.
Thanks in advance.
Don't put a space between the packages:
spark2-submit --packages com.databricks:spark-xml_2.11:0.4.1,com.databricks:spark-csv_2.11:1.5.0 /path/to/python-script

When running a Spark job in a Hadoop cluster I am getting java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/HBaseConfiguration

My Scala code, which connects to an HBase database, works perfectly in my local IDE. But when I run the same code in the Hadoop cluster, I get an "Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/HBaseConfiguration" error.
Please help me with this.
Add all the HBase library jars to HADOOP_CLASSPATH -
export HBASE_HOME="YOUR_HBASE_HOME_PATH"
export HADOOP_CLASSPATH="$HADOOP_CLASSPATH:$HBASE_HOME/lib/*"
You can append any external jars you need to HADOOP_CLASSPATH, so you don't have to set them explicitly in the spark-submit command. All the dependent jars will be loaded and made available to your Spark application.
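Alternatively, you can pass the HBase jars explicitly on the spark-submit command line instead of relying on HADOOP_CLASSPATH. A rough sketch, assuming HBASE_HOME is set as above; the class and application jar names are placeholders:
# Build a comma-separated list of the HBase library jars and pass it via --jars
HBASE_JARS=$(echo "$HBASE_HOME"/lib/*.jar | tr ' ' ',')
spark-submit --class your.main.Class --jars "$HBASE_JARS" your-application.jar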

spark-submit classpath issue with --repositories --packages options

I'm running Spark in a standalone cluster where the Spark master, worker, and submit processes each run in their own Docker container.
When I spark-submit my Java app with the --repositories and --packages options, I can see that it successfully downloads the app's required dependencies. However, the stderr logs on the Spark worker web UI report a java.lang.ClassNotFoundException: kafka.serializer.StringDecoder. This class is available in one of the dependencies downloaded by spark-submit, but it doesn't appear to be on the worker classpath.
16/02/22 16:17:09 INFO SparkDeploySchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:58)
at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
Caused by: java.lang.NoClassDefFoundError: kafka/serializer/StringDecoder
at com.my.spark.app.JavaDirectKafkaWordCount.main(JavaDirectKafkaWordCount.java:71)
... 6 more
Caused by: java.lang.ClassNotFoundException: kafka.serializer.StringDecoder
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 7 more
The spark-submit call:
${SPARK_HOME}/bin/spark-submit --deploy-mode cluster \
--master spark://spark-master:7077 \
--repositories https://oss.sonatype.org/content/groups/public/ \
--packages org.apache.spark:spark-streaming-kafka_2.10:1.6.0,org.elasticsearch:elasticsearch-spark_2.10:2.2.0 \
--class com.my.spark.app.JavaDirectKafkaWordCount \
/app/spark-app.jar kafka-server:9092 mytopic
I was working with Spark 2.4.0 when I ran into this problem. I don't have a solution yet, just some observations based on experimentation and reading around for solutions. I'm noting them down here in case they help someone in their investigation; I will update this answer if I find more information later.
The --repositories option is required only if some custom repository has to be referenced.
By default, Maven Central is used if the --repositories option is not provided.
When the --packages option is specified, the submit operation looks for the packages and their dependencies in the ~/.ivy2/cache, ~/.ivy2/jars, and ~/.m2/repository directories.
If they are not found there, they are downloaded from Maven Central using Ivy and stored under the ~/.ivy2 directory.
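A quick way to see what has been resolved locally is to list those directories (assuming the default Ivy locations):
ls ~/.ivy2/jars
ls ~/.ivy2/cache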
In my case, I observed the following:
spark-shell worked perfectly with the --packages option
spark-submit would fail to do the same. It would download the dependencies correctly but fail to pass on the jars to the driver and worker nodes
spark-submit worked with the --packages option if I ran the driver locally using --deploy-mode client instead of cluster.
This ran the driver locally in the command shell where I issued spark-submit, while the work ran on the cluster with the appropriate dependency jars.
I found the following discussion useful but I still have to nail down this problem.
https://github.com/databricks/spark-redshift/issues/244#issuecomment-347082455
Most people just use an uber JAR to avoid this problem, and also to avoid conflicting jar versions when the platform provides a different version of the same dependency.
But I don't like that idea as anything more than a stopgap, and I'm still looking for a proper solution.
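For what it's worth, applying the --deploy-mode client observation above to the command in the question would look like this (a sketch; only the deploy mode changes):
${SPARK_HOME}/bin/spark-submit --deploy-mode client \
  --master spark://spark-master:7077 \
  --repositories https://oss.sonatype.org/content/groups/public/ \
  --packages org.apache.spark:spark-streaming-kafka_2.10:1.6.0,org.elasticsearch:elasticsearch-spark_2.10:2.2.0 \
  --class com.my.spark.app.JavaDirectKafkaWordCount \
  /app/spark-app.jar kafka-server:9092 mytopic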

Error starting pyspark with options (without Spark packages)

Can anyone tell me why I'm getting the errors below? According to the README for the pyspark-cassandra connector, what I am trying below should work (without Spark packages): https://github.com/TargetHolding/pyspark-cassandra
$ pyspark_jar="$HOME/devel/sandbox/Learning/Spark/pyspark-cassandra/target/scala-2.10/pyspark-cassandra-assembly-0.2.2.jar"
$ pyspark_egg="$HOME/devel/sandbox/Learning/Spark/pyspark-cassandra/target/pyspark_cassandra-0.2.2-py2.7.egg"
$ pyspark --jars $pyspark_jar --py_files $pyspark_egg --conf spark.cassandra.connection.host=localhost
This results in:
Exception in thread "main" java.lang.IllegalArgumentException: pyspark does not support any application options.
at org.apache.spark.launcher.CommandBuilderUtils.checkArgument(CommandBuilderUtils.java:222)
at org.apache.spark.launcher.SparkSubmitCommandBuilder.buildPySparkShellCommand(SparkSubmitCommandBuilder.java:239)
at org.apache.spark.launcher.SparkSubmitCommandBuilder.buildCommand(SparkSubmitCommandBuilder.java:113)
at org.apache.spark.launcher.Main.main(Main.java:74)
Figured out the problem. I needed to use
--py-files
instead of
--py_files
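With that change, the invocation from the question becomes:
$ pyspark --jars $pyspark_jar --py-files $pyspark_egg --conf spark.cassandra.connection.host=localhost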
