Spark error: Job in state DEFINE instead of RUNNING - apache-spark

I'm using spark-shell to run a Spark HBase script.
When I run this command:
val job = Job.getInstance(conf)
I get this error:
java.lang.IllegalStateException: Job in state DEFINE instead of RUNNING

java.lang.IllegalStateException: Job in state DEFINE instead of RUNNING
The error is due to running the code in spark-shell: the shell prints every returned value, and Hadoop's Job.toString requires the job to be in the RUNNING state, so printing a freshly created Job throws this exception. Please use spark-submit instead; this should solve your problem.
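As a minimal sketch, assuming the script has been packaged into a jar (the class name, jar name, and master below are placeholders for your setup):

spark-submit \
  --class com.example.HBaseJob \
  --master yarn \
  hbase-job.jar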

Related

gcloud spark submit: Path does not exist: hdfs://cluster-xxxx-m/user/root/--;

I am trying to use gcloud to submit my Spark job from Airflow.
This is my gcloud command: gcloud dataproc jobs submit spark --cluster=xxx --region=us-central1 --class=com.xxx --jars=gs://xxx/xxx/xxx.jar -- xxx -- xxx -- xxx -- gs://xxx/xxx/xxx
I am getting this exception: Exception in thread "main" org.apache.spark.sql.AnalysisException: Path does not exist: hdfs://cluster-xxxx-m/user/root/--;
Is anything wrong with my command?
This error can be solved by disabling the flat glob algorithm in the GCS connector: set the Hadoop properties core:fs.gs.glob.flatlist.enable=false and core:fs.gs.glob.concurrent.enable=false during cluster creation. Additionally, upgrade the GCS connector to the latest version with --metadata GCS_CONNECTOR_VERSION=2.2.6.
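A sketch of the corresponding cluster-creation command (the cluster name is a placeholder):

gcloud dataproc clusters create my-cluster \
  --region=us-central1 \
  --properties=core:fs.gs.glob.flatlist.enable=false,core:fs.gs.glob.concurrent.enable=false \
  --metadata GCS_CONNECTOR_VERSION=2.2.6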

Error when running spark application with zeppelin

When I run the above Spark application with Zeppelin on a YARN cluster in cluster mode, I get the following error:
Where might the problem be? Thanks

apache_beam Spark runner with Python can't be used on a remote Spark cluster?

I am following the Python guide for the Beam Spark runner. The Beam pipeline can submit a job to a local job server, which is launched by ./gradlew :runners:spark:job-server:runShadow with a local Spark, or, with the additional parameter -PsparkMasterUrl=spark://localhost:7077, to a pre-deployed Spark.
But I have a Spark cluster on YARN. I set the launch command as ./gradlew :runners:spark:job-server:runShadow -PsparkMasterUrl=yarn (also tried yarn-client), but only get org.apache.spark.SparkException: Could not parse Master URL: 'yarn'.
The source code of the Spark runner (beam\sdks\python\apache_beam\runners\portability\spark_runner.py) shows:
parser.add_argument('--spark_master_url',
                    default='local[4]',
                    help='Spark master URL (spark://HOST:PORT). '
                         'Use "local" (single-threaded) or "local[*]" '
                         '(multi-threaded) to start a local cluster for '
                         'the execution.')
It doesn't mention 'yarn', and the provided SparkContext and StreamingListeners are not supported on the Spark portable runner. So does that mean the apache_beam Spark runner with Python can't be used on a remote Spark cluster (YARN mostly) and can only be tested locally? Or maybe I can set the job_endpoint to the remote job server URL of my Spark cluster?
Also, every ./gradlew command blocks at 98%, but the job server starts with info like this:
19/11/28 13:47:48 INFO org.apache.beam.runners.fnexecution.jobsubmission.JobServerDriver: JobService started on localhost:8099
<============-> 98% EXECUTING [16s]
> IDLE
> :runners:spark:job-server:runShadow
> IDLE
So does that mean the apache_beam Spark runner with Python can't be used on a remote Spark cluster (YARN mostly)
We've recently added portable Spark jars, which can be submitted via spark-submit. This feature isn't scheduled to be included in a Beam release until 2.19.0, however.
I created a JIRA ticket to track the status of YARN support, in case there are other related issues that need to be addressed.
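As a rough sketch only (the jar path is a placeholder; check the Beam release artifacts for the actual name and any required options), such a jar would be handed to spark-submit like any other application jar:

spark-submit \
  --master yarn \
  /path/to/beam-portable-spark-pipeline.jar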
and every ./gradlew command blocks at 98%
That's expected behavior. The job server will stay running until canceled.

When running a Spark job in a Hadoop cluster I am getting java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/HBaseConfiguration

When I run my Scala code, which connects to an HBase database, it works perfectly in my local IDE. But when I run the same code on the Hadoop cluster, I get an "Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/HBaseConfiguration" error.
Please help me with this.
Add all the HBase library jars to HADOOP_CLASSPATH:
export HBASE_HOME="YOUR_HBASE_HOME_PATH"
export HADOOP_CLASSPATH="$HADOOP_CLASSPATH:$HBASE_HOME/lib/*"
You can append any external jar needed to HADOOP_CLASSPATH, so that you don't need to explicitly pass it on the spark-submit command line. All dependent jars will be loaded and provided to your Spark application.
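For example, any additional external jar can be appended in the same way (the path below is a hypothetical placeholder):

export HADOOP_CLASSPATH="$HADOOP_CLASSPATH:/path/to/extra-dependency.jar"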

java.net.ConnectException (on port 9000) while submitting a spark job

On running this command:
~/spark/bin/spark-submit --class [class-name] --master [spark-master-url]:7077 [jar-path]
I am getting
java.lang.RuntimeException: java.net.ConnectException: Call to ec2-[ip].compute-1.amazonaws.com/[internal-ip]:9000 failed on connection exception: java.net.ConnectException: Connection refused
Using Spark version 1.3.0.
How do I resolve it?
When Spark is run in cluster mode, all input files are expected to be on HDFS (otherwise, how would workers read the master's local files?). In this case, HDFS wasn't running, which caused this exception.
Starting HDFS resolved this.
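As a quick sketch, assuming a standard Hadoop layout (adjust HADOOP_HOME for your installation):

$HADOOP_HOME/sbin/start-dfs.sh
jps                        # NameNode and DataNode should now be listed
hdfs dfsadmin -report      # reports live DataNodes once HDFS is up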
