Fail to submit spark job - apache-spark

I am trying to run the Spark-solr Twitter example with spark-solr-3.4.4-shaded.jar,
bin/spark-submit --master local[2] \ --conf "spark.driver.extraJavaOptions=-Dtwitter4j.oauth.consumerKey=?
-Dtwitter4j.oauth.consumerSecret=? -Dtwitter4j.oauth.accessToken=? -Dtwitter4j.oauth.accessTokenSecret=?" \ --class com.lucidworks.spark.SparkApp \ ./target/spark-solr-3.1.1-shaded.jar \ twitter-to-solr -zkHost localhost:9983 -collection socialdata
but it is failed and the following message is shown
INFO ContextHandler: Started o.e.j.s.ServletContextHandler#29182679{/metrics/json,null,AVAILABLE,#Spark}
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.SparkContext.jobProgressListener()Lorg/apache/spark/ui/jobs/JobProgressListener;
I can confirm the path for ./target/spark-solr-3.1.1-shaded.jar is correct.
I suspect there is something wrong in --class com.lucidworks.spark.SparkApp (ClassPath), but I am not sure.
I am running on local mode and I change the parameters as instructed in the example.
Version:
Spark 2.1.1
Spark-solr 3.1.1
Solr 6.6.0

Related

Submitting a spark job to a kubernetes cluster using bitnami spark docker image

I have a local setup with minikube and I'm trying to use spark-submit to submit a job to a local Kubernetes. The idea here is to use my local machine's spark-submit to submit to the kubernetes master which will handle creating a spark cluster and taking it down when the work is finished.
I'm using the image bitnami/spark:3.2.1 and the following command:
./bin/spark-submit --master k8s://https://127.0.0.1:52388 \
--deploy-mode cluster \
--conf spark.executor.instances=1 \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.kubernetes.container.image=bitnami/spark:3.2.1 \
--class org.apache.spark.examples.JavaSparkPi \
--name spark-pi \
local:///opt/bitnami/spark/examples/jars/spark-examples_2.12-3.2.1.jar
This does not seem to work and the logs in the spark driver are:
[...]
Caused by: java.io.IOException: Failed to connect to spark-master:7077
[...]
and
[...]
Caused by: java.net.UnknownHostException: spark-master
[...]
If I use the docker-image-tool.sh to build a custom spark docker image with the python bindings and use that, it works perfectly. How is bitnami's image special and why doesn't it recognise that the master in this case is kubernetes?
I also tried using the option conf spark.kubernetes.driverEnv.SPARK_MASTER_URL=spark://127.0.0.1:7077 when submitting but the error was similar to above.

Load properties file in Spark classpath during spark-submit execution

I'm installing the Spark Atlas Connector in a spark submit script (https://github.com/hortonworks-spark/spark-atlas-connector)
Due to security restrictions, I can't put the atlas-application.properties in the spark/conf repository.
I used the two options in the spark-submit :
--driver-class-path "spark.driver.extraClassPath=hdfs:///directory_to_properties_files" \
--conf "spark.executor.extraClassPath=hdfs:///directory_to_properties_files" \
When I launch the spark-submit, I encounter this issue :
20/07/20 11:32:50 INFO ApplicationProperties: Looking for atlas-application.properties in classpath
20/07/20 11:32:50 INFO ApplicationProperties: Looking for /atlas-application.properties in classpath
20/07/20 11:32:50 INFO ApplicationProperties: Loading atlas-application.properties from null
Please find CDP Atals Configuration article.
https://community.cloudera.com/t5/Community-Articles/How-to-pass-atlas-application-properties-configuration-file/ta-p/322158
Client Mode:
spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode client --driver-java-options="-Datlas.conf=/tmp/" /opt/cloudera/parcels/CDH/jars/spark-examples*.jar 10
Cluster Mode:
sudo -u spark spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster --files /tmp/atlas-application.properties --conf spark.driver.extraJavaOptions="-Datlas.conf=./" /opt/cloudera/parcels/CDH/jars/spark-examples*.jar 10

Warning: Skip remote jar hdfs

I would like to submit a spark job with configuring additional jar on hdfs, however the hadoop gives me a warning on skipping remote jar. Although I can still get my final results on hdfs, I cannot obtain the effect of additional remote jar. I would appreciate if you can give me some suggestions.
Many thanks,
root#cluster-1-m:~# hadoop fs -ls hdfs://10.146.0.4:8020/tmp/jvm-profiler-1.0.0.jar
-rwxr-xr-x 2 hdfs hadoop 7097056 2019-01-23 14:44 hdfs://10.146.0.4:8020/tmp/jvm-profiler-1.0.0.jar
root#cluster-1-m:~#/usr/lib/spark/bin/spark-submit \
--deploy-mode cluster \
--master yarn \
--conf spark.jars=hdfs://10.146.0.4:8020/tmp/jvm-profiler-1.0.0.jar \
--conf spark.driver.extraJavaOptions=-javaagent:jvm-profiler-1.0.0.jar \
--conf spark.executor.extraJavaOptions=-javaagent:jvm-profiler-1.0.0.jar \
--class com.github.ehiggs.spark.terasort.TeraSort \
/root/spark-terasort-master/target/spark-terasort-1.1-SNAPSHOT-jar-with-dependencies.jar /tmp/data/terasort_in /tmp/data/terasort_out
Warning: Skip remote jar hdfs://10.146.0.4:8020/tmp/jvm-profiler-1.0.0.jar.
19/01/24 02:20:31 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at cluster-1-m/10.146.0.4:8032
19/01/24 02:20:31 INFO org.apache.hadoop.yarn.client.AHSProxy: Connecting to Application History server at cluster-1-m/10.146.0.4:10200
19/01/24 02:20:34 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl: Submitted application application_1548293702222_0002

WARN Session: Error creating pool to /xxx.xxx.xxx.xxx:28730

I'm trying to connect to a ScyllaDB database running on IBM Cloud from Spark 2.3 running on IBM Analytics Engine.
I'm starting the spark shell like so ...
$ spark-shell --master local[1] \
--files jaas.conf \
--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.0,datastax:spark-cassandra-connector:2.3.0-s_2.11,commons-configuration:commons-configuration:1.10 \
--conf "spark.driver.extraJavaOptions=-Djava.security.auth.login.config=jaas.conf" \
--conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=jaas.conf" \
--conf spark.cassandra.connection.host=xxx1.composedb.com,xxx2.composedb.com,xxx3.composedb.com \
--conf spark.cassandra.connection.port=28730 \
--conf spark.cassandra.auth.username=scylla \
--conf spark.cassandra.auth.password=SECRET \
--conf spark.cassandra.connection.ssl.enabled=true \
--num-executors 1 \
--executor-cores 1
Then executing the following spark scala code:
import com.datastax.spark.connector._
import org.apache.spark.sql.cassandra._
val stocksRdd = sc.cassandraTable("stocks", "stocks")
stocksRdd.count()
However, I see a bunch of warnings:
18/08/23 10:11:01 WARN Cluster: You listed xxx1.composedb.com/xxx.xxx.xxx.xxx:28730 in your contact points, but it wasn't found in the control host's system.peers at startup
18/08/23 10:11:01 WARN Cluster: You listed xxx1.composedb.com/xxx.xxx.xxx.xxx:28730 in your contact points, but it wasn't found in the control host's system.peers at startup
18/08/23 10:11:06 WARN Session: Error creating pool to /xxx.xxx.xxx.xxx:28730
com.datastax.driver.core.exceptions.ConnectionException: [/xxx.xxx.xxx.xxx:28730] Pool was closed during initialization
...
However, after the stacktrace in the warning, I see the output I am expecting:
res2: Long = 4
If I navigate to the compose UI, I see a map json:
[
{"xxx.xxx.xxx.xxx:9042":"xxx1.composedb.com:28730"},
{"xxx.xxx.xxx.xxx:9042":"xxx2.composedb.com:28730"},
{"xxx.xxx.xxx.xxx:9042":"xxx3.composedb.com:28730"}
]
It seems the warning is related to the map file.
What are the implications of the warning? Can I ignore it?
NOTE: I've seen a similar question, however I believe this question is different because of the map file and I have no control over how the scylladb cluster has been setup by Compose.
This is just warning. The warning is happening because the IPs that spark is trying to reach are not know to Scylla itself. Apparently Spark is connecting to the cluster and retrieving the expected information so you should be fine.

Exception in thread "main" org.apache.spark.SparkException: Must specify the driver container image

I am trying to do spark-submit on minikube(Kubernetes) from local machine CLI with command
spark-submit --master k8s://https://127.0.0.1:8001 --name cfe2
--deploy-mode cluster --class com.yyy.Test --conf spark.executor.instances=2 --conf spark.kubernetes.container.image docker.io/anantpukale/spark_app:1.1 local://spark-0.0.1-SNAPSHOT.jar
I have a simple spark job jar built on verison 2.3.0. I also have containerized it in docker and minikube up and running on virtual box.
Below is exception stack:
Exception in thread "main" org.apache.spark.SparkException: Must specify the driver container image at org.apache.spark.deploy.k8s.submit.steps.BasicDriverConfigurationStep$$anonfun$3.apply(BasicDriverConfigurationStep.scala:51) at org.apache.spark.deploy.k8s.submit.steps.BasicDriverConfigurationStep$$anonfun$3.apply(BasicDriverConfigurationStep.scala:51) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.deploy.k8s.submit.steps.BasicDriverConfigurationStep.<init>(BasicDriverConfigurationStep.scala:51)
at org.apache.spark.deploy.k8s.submit.DriverConfigOrchestrator.getAllConfigurationSteps(DriverConfigOrchestrator.scala:82)
at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication$$anonfun$run$5.apply(KubernetesClientApplication.scala:229)
at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication$$anonfun$run$5.apply(KubernetesClientApplication.scala:227)
at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2585)
at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:227)
at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:192)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:879)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:227)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) 2018-04-06 13:33:52 INFO ShutdownHookManager:54 - Shutdown hook called 2018-04-06 13:33:52 INFO ShutdownHookManager:54 - Deleting directory C:\Users\anant\AppData\Local\Temp\spark-6da93408-88cb-4fc7-a2de-18ed166c3c66
Look like bug with default value for parameters spark.kubernetes.driver.container.image, that must be spark.kubernetes.container.image. So try specify driver/executor container image directly:
spark.kubernetes.driver.container.image
spark.kubernetes.executor.container.image
From the source code, the only available conf options are:
spark.kubernetes.container.image
spark.kubernetes.driver.container.image
spark.kubernetes.executor.container.image
And I noticed that Spark 2.3.0 has changed a lot in terms of k8s implementation compared to 2.2.0. For example, instead of specifying driver and executor separately, the official starter's guide is to use a single image given to spark.kubernetes.container.image.
See if this works:
spark-submit \
--master k8s://http://127.0.0.1:8001 \
--name cfe2 \
--deploy-mode cluster \
--class com.oracle.Test \
--conf spark.executor.instances=2 \
--conf spark.kubernetes.container.image=docker/anantpukale/spark_app:1.1 \
--conf spark.kubernetes.authenticate.submission.oauthToken=YOUR_TOKEN \
--conf spark.kubernetes.authenticate.submission.caCertFile=PATH_TO_YOUR_CERT \
local://spark-0.0.1-SNAPSHOT.jar
The token and cert can be found on k8s dashboard. Follow the instructions to make Spark 2.3.0 compatible docker images.

Resources