gcloud spark submit: Path does not exist: hdfs://cluster-xxxx-m/user/root/--;

I am trying to use gcloud to submit my Spark job from Airflow.
This is my gcloud command:
gcloud dataproc jobs submit spark --cluster=xxx --region=us-central1 --class=com.xxx --jars=gs://xxx/xxx/xxx.jar -- xxx -- xxx -- xxx -- gs://xxx/xxx/xxx
I am getting this exception: Exception in thread "main" org.apache.spark.sql.AnalysisException: Path does not exist: hdfs://cluster-xxxx-m/user/root/--;
Is anything wrong with my command?

This error can be solved by disabling the flat glob algorithm in the GCS connector. Set these Hadoop properties during cluster creation: core:fs.gs.glob.flatlist.enable=false and core:fs.gs.glob.concurrent.enable=false. Additionally, upgrade the GCS connector to the latest version by passing --metadata GCS_CONNECTOR_VERSION=2.2.6 when creating the cluster.
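A sketch of a cluster-creation command that applies both suggestions (the cluster name and region are placeholders, not taken from the question):
$ gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --properties=core:fs.gs.glob.flatlist.enable=false,core:fs.gs.glob.concurrent.enable=false \
    --metadata=GCS_CONNECTOR_VERSION=2.2.6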

Related

How do I add the bigquery package to a Dataproc job (jobs submit spark) through the gcloud shell?

I am trying to load Google patent data using BigQuery with the following code (Python):
df = spark.read \
    .format("bigquery") \
    .option("table", table) \
    .option("filter", "country_code = '{}' AND application_kind = '{}' AND publication_date >= {} AND publication_date < {}".format(country, kind, date_from, date_to)) \
    .load()
Then I got the following error:
Py4JJavaError: An error occurred while calling o144.load. : java.lang.ClassNotFoundException: Failed to find data source: bigquery. Please find packages at http://spark.apache.org/third-party-projects.html
I suspect the cause of this error is that the BigQuery package is missing from my cluster (I checked the configuration). I tried to add it using the following gcloud shell command:
$ gcloud dataproc jobs submit spark \
    --cluster=my-cluster \
    --region=europe-west2 \
    --jar=gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar
But I got another error:
Exception in thread "main" org.apache.spark.SparkException: Failed to get main class in JAR with error 'null'. Please specify one with --class.
How do I install the bigquery package via gcloud dataproc jobs submit spark, ideally on Dataproc? I don't know what to specify for --class.
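Not an authoritative answer, but a sketch of what is likely intended here: the connector jar has no main class of its own, which is why submitting it directly fails with "Failed to get main class". Assuming the goal is simply to make the connector available to the PySpark job (the script name, cluster, and region below are placeholders), it would normally be attached with --jars:
$ gcloud dataproc jobs submit pyspark my_job.py \
    --cluster=my-cluster \
    --region=europe-west2 \
    --jars=gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar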

Spark error: Job in state DEFINE instead of RUNNING

I'm using spark-shell to run a Spark HBase script.
When I run this command:
val job = Job.getInstance(conf)
I get this error:
java.lang.IllegalStateException: Job in state DEFINE instead of RUNNING
The error is due to running the code in spark-shell. Please use spark-submit instead; this should solve your problem.
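As an illustration only (the class, jar, and master below are placeholders, not taken from the question), a packaged job would then be launched roughly like this:
$ spark-submit \
    --class com.example.HBaseJob \
    --master yarn \
    my-hbase-job.jar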

New Spark StreamingContext fails with HDFS errors

I'm using DC/OS installed via Azure ACS, and I installed HDFS and Spark via the dcos tool with default options.
Creating a SparkStreamingContext gives:
16/07/22 01:51:04 WARN DFSUtil: Namenode for hdfs remains unresolved for ID nn1. Check your hdfs-site.xml file to ensure namenodes are configured properly.
16/07/22 01:51:04 WARN DFSUtil: Namenode for hdfs remains unresolved for ID nn2. Check your hdfs-site.xml file to ensure namenodes are configured properly.
Exception in thread "main" java.lang.IllegalArgumentException:
java.net.UnknownHostException: namenode1.hdfs.mesos
I expect I have to redeploy the Spark package with dcos package install --options=, but I can't figure out what the hdfs.config-url should be. The docs at https://docs.mesosphere.com/1.7/usage/service-guides/spark/install/#hdfs seem out of date.
Yes, it is out of date. We'll fix that.
DC/OS HDFS now serves its config on http://hdfs.marathon.mesos:[port]/v1/connect
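As a quick sanity check (a sketch; [port] stays whatever port the HDFS service actually exposes), the endpoint can be fetched directly:
$ curl http://hdfs.marathon.mesos:[port]/v1/connect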

Spark 1.6 Kafka streaming on Dataproc: py4j error

I get the following error:
Py4JError(u'An error occurred while calling o73.createDirectStreamWithoutMessageHandler. Trace:\npy4j.Py4JException: Method createDirectStreamWithoutMessageHandler([class org.apache.spark.streaming.api.java.JavaStreamingContext, class java.util.HashMap, class java.util.HashSet, class java.util.HashMap]) does not exist\n\tat py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:335)\n\tat py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:344)\n\tat py4j.Gateway.invoke(Gateway.java:252)\n\tat py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)\n\tat py4j.commands.CallCommand.execute(CallCommand.java:79)\n\tat py4j.GatewayConnection.run(GatewayConnection.java:209)\n\tat java.lang.Thread.run(Thread.java:745)\n\n',)
I am using spark-streaming-kafka-assembly_2.10-1.6.0.jar (which is present in the /usr/lib/hadoop/lib/ folder on all my nodes + master)
(EDIT)
The actual error was: java.lang.NoSuchMethodError: org.apache.hadoop.yarn.util.Apps.crossPlatformify(Ljava/lang/String;)Ljava/lang/String;
This was due to a wrong Hadoop version. Spark should therefore be compiled against the correct Hadoop version:
mvn -Phadoop-2.6 -Dhadoop.version=2.7.2 -DskipTests clean package
This will result in a jar in the external/kafka-assembly/target folder.
Using image version 1, I've successfully run the PySpark streaming / Kafka wordcount example.
In each of these examples, "ad-kafka-inst" is my test Kafka instance with a 'test' topic.
Using a cluster with no initialization actions:
$ gcloud dataproc jobs submit pyspark --cluster ad-kafka2 --properties spark.jars.packages=org.apache.spark:spark-streaming-kafka_2.10:1.6.0 ./kafka_wordcount.py ad-kafka-inst:2181 test
Using initialization actions with a full kafka assembly:
Download / unpack spark-1.6.0.tgz
Build with:
$ mvn -Phadoop-2.6 -Dhadoop.version=2.7.2 package
Upload spark-streaming-kafka-assembly_2.10-1.6.0.jar to a new GCS bucket (MYBUCKET for example).
Create the following initialization action in the same GCS bucket (e.g., gs://MYBUCKET/install_spark_kafka.sh):
#!/bin/bash
gsutil cp gs://MYBUCKET/spark-streaming-kafka-assembly_2.10-1.6.0.jar /usr/lib/hadoop/lib/
chmod 755 /usr/lib/hadoop/lib/spark-streaming-kafka-assembly_2.10-1.6.0.jar
Start a cluster with the above initialization action:
$ gcloud dataproc clusters create ad-kafka-init --initialization-actions gs://MYBUCKET/install_spark_kafka.sh
Start the streaming word count:
$ gcloud dataproc jobs submit pyspark --cluster ad-kafka-init ./kafka_wordcount.py ad-kafka-inst:2181 test

Error starting pyspark with options (without Spark packages)

Can anyone tell me why I'm getting the errors below? According to the README for the pyspark-cassandra connector (https://github.com/TargetHolding/pyspark-cassandra), what I am trying below should work (without Spark packages):
$ pyspark_jar="$HOME/devel/sandbox/Learning/Spark/pyspark-cassandra/target/scala-2.10/pyspark-cassandra-assembly-0.2.2.jar"
$ pyspark_egg="$HOME/devel/sandbox/Learning/Spark/pyspark-cassandra/target/pyspark_cassandra-0.2.2-py2.7.egg"
$ pyspark --jars $pyspark_jar --py_files $pyspark_egg --conf spark.cassandra.connection.host=localhost
This results in:
Exception in thread "main" java.lang.IllegalArgumentException: pyspark does not support any application options.
at org.apache.spark.launcher.CommandBuilderUtils.checkArgument(CommandBuilderUtils.java:222)
at org.apache.spark.launcher.SparkSubmitCommandBuilder.buildPySparkShellCommand(SparkSubmitCommandBuilder.java:239)
at org.apache.spark.launcher.SparkSubmitCommandBuilder.buildCommand(SparkSubmitCommandBuilder.java:113)
at org.apache.spark.launcher.Main.main(Main.java:74)
Figured out the problem. I needed to use
--py-files
instead of
--py_files
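For reference, the corrected invocation from above is:
$ pyspark --jars $pyspark_jar --py-files $pyspark_egg --conf spark.cassandra.connection.host=localhost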
