"FAILED: Execution Error, return code 3" after setting Hive engine from mr to Spark - apache-spark

I am trying to use the Spark engine for my Hive query.
It is an old query, and I don't want to convert the whole code to a Spark job.
But when I run the query, it gives the following error:
Status: Failed
FAILED: Execution Error, return code 3 from org.apache.hadoop.hive.ql.exec.spark.SparkTask
The only thing I have changed is the execution engine:
set hive.execution.engine=spark;
The above change works for other similar queries, so I don't think it's a configuration issue... or is there something I'm not aware of?
Has anybody faced this issue before?

Check the logs of the job to see the true error. Return codes 1, 2, and 3 are all generic errors in both MR and Spark.

Use beeline in verbose mode to run the query (see the sketch below).
Check the query exception logs, HiveServer2 logs, Spark logs, and the Spark web UI worker logs (the last often has the exact stack trace).
Try running Spark in local mode.
Which versions of Hive, Spark, and Hadoop do you use?
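For example, a minimal sketch of running the query through beeline in verbose mode from Python, so the full stack trace lands in the captured output (the JDBC URL and query file are placeholders, not from the original post):

import subprocess

# Run the failing query through beeline with --verbose=true so the full
# error and stack trace end up in the captured output.
result = subprocess.run(
    [
        "beeline",
        "-u", "jdbc:hive2://your-hiveserver2-host:10000/default",  # placeholder URL
        "--verbose=true",
        "-f", "your_query.hql",  # placeholder query file
    ],
    capture_output=True,
    text=True,
)
print(result.stdout)
print(result.stderr)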

Execute the command below in the Hive client over a HiveServer2 JDBC connection:
set hive.auto.convert.join=false;
It works for me.
Here is the detailed reason: https://www.cnblogs.com/CYan521/p/16716361.html
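As a rough illustration of applying both settings above over a HiveServer2 connection, here is a hedged sketch using the PyHive client (PyHive, the host/port, and the table name are my own assumptions, not from the original post):

from pyhive import hive

# Connect to HiveServer2 (placeholder host/port) and apply the session
# settings discussed above before running the old query unchanged.
conn = hive.connect(host="your-hiveserver2-host", port=10000, username="hive")
cursor = conn.cursor()
cursor.execute("set hive.execution.engine=spark")
cursor.execute("set hive.auto.convert.join=false")
cursor.execute("SELECT COUNT(*) FROM your_table")  # placeholder query
print(cursor.fetchall())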

Related

How to read stderr logs from AWS logs

I am using EMR steps to run my jobs.
Typically, when I want to analyze the performance of a job or understand why it failed, I look at the Spark history server for DAG visualizations, job errors, etc.
For example, if the job failed due to a heap error, FetchFailed, etc., I can see it clearly specified in the Spark history server.
However, I can't seem to be able to find such descriptions when I look at the stderr log files that are written to the LOG URI S3 bucket.
Is there a way to obtain such information?
I use pyspark and set the log level to
sc = spark.sparkContext
sc.setLogLevel('DEBUG')
Any insight as to what I am doing wrong?
I haven't really tested this, but as it's a bit too long to fit in a comment, I'm posting it as an answer.
As pointed out in my comment, the logs you're viewing in the Spark History Server UI aren't the same as the Spark driver logs that EMR saves to S3.
To get the Spark history server logs written to S3, you'll have to add some additional configuration to your cluster. These configuration options are described in the Monitoring and Instrumentation section of the Spark documentation.
In AWS EMR, you could try to add something like this into your cluster configuration:
...
{
    'Classification': 'spark-defaults',
    'Properties': {
        'spark.eventLog.dir': 's3a://your_bucket/spark_logs',
        'spark.history.fs.logDirectory': 's3a://your_bucket/spark_logs',
        'spark.eventLog.enabled': 'true'
    }
}
...
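For context, a hedged sketch of where that classification block could go when creating a cluster with boto3's run_job_flow (the region, release label, instance settings, and bucket names are placeholders, not taken from the question):

import boto3

emr = boto3.client("emr", region_name="us-east-1")  # placeholder region

# Create a cluster whose spark-defaults write event logs to S3 so the
# Spark History Server can read them later.
emr.run_job_flow(
    Name="spark-event-logs-demo",
    ReleaseLabel="emr-6.4.0",  # placeholder release
    Applications=[{"Name": "Spark"}],
    Configurations=[
        {
            "Classification": "spark-defaults",
            "Properties": {
                "spark.eventLog.enabled": "true",
                "spark.eventLog.dir": "s3a://your_bucket/spark_logs",
                "spark.history.fs.logDirectory": "s3a://your_bucket/spark_logs",
            },
        }
    ],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    LogUri="s3://your_bucket/emr_logs/",  # the usual EMR step/stderr logs
)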
I found an interesting post that describes how to set this up for a Kubernetes cluster; you may want to check it for further details.

Instrumenting Spark JDBC with javaagent

I am attempting to instrument JDBC calls using the Kamon JDBC Kanela agent in my Spark app.
I am able to successfully instrument JDBC calls in a non-spark test app by passing in -javaagent:kanela-agent-1.0.1.jar on the command line when I run the app from the JAR. When I do this, I see the Kanela banner display in the console, and can see that my failed statement processor is getting called when there is a SQL error.
From my research, I should be able to inject a javaagent into the executor of a Spark app by passing in the following to spark-submit: --conf "spark.executor.extraJavaOptions=-javaagent:kanela-agent-1.0.1.jar". However, when I do this, although the Kamon banner IS displaying on the console upon my call to Kamon.init(), my failed statement processor is NOT getting called when there is a SQL error.
Things I'm wondering:
Is there something about the way that spark-jdbc makes these JDBC calls that would prevent a javaagent from "seeing" them?
Does my call to Kamon.init() somehow only apply to code in the Spark driver, and not the executor?
Any other reason that you can think of that would be preventing this from working?

How to get spark SUBMISSION_ID with spark-submit?

Many places need the SUBMISSION_ID, like spark-submit --status and the Spark REST API. But how can I get this SUBMISSION_ID when I use the spark-submit command to submit Spark jobs?
P.S.:
I use Python's subprocess.Popen to start a spark-submit job. I want the SUBMISSION_ID so my Python program can monitor the Spark job status via the REST API: <ip>:6066/v1/submissions/status/<SUBMISSION_ID>
Thanks to the clue from @Pandey. The answer https://stackoverflow.com/a/37980813/5634636 helped me a lot.
TL;DR
If you want to submit a Spark job locally, the answer https://stackoverflow.com/a/37980813/5634636 indeed works.
The only catch is that you must submit your job in cluster mode,
i.e., use the parameter --deploy-mode cluster.
If you want to submit a Spark job remotely, use the Spark submission API. It helps a lot. See https://www.nitendragautam.com/spark/submit-apache-spark-job-with-rest-api/ for details.
Detailed description
NOTE: I have only tested my approach on Apache Spark 2.3.1. I can't guarantee that it will work on other versions as well.
Let me clarify my requirements first. There are 3 features I wanted:
submit a Spark job remotely
check the job status at any time (RUNNING, ERROR, FINISHED...)
get the error message if something goes wrong
Submit locally
NOTE: this answer only works in cluster mode
The Spark tool spark-submit will help.
To submit a job, see
https://spark.apache.org/docs/2.4.0/submitting-applications.html#launching-applications-with-spark-submit
To check the status, see https://stackoverflow.com/a/37420931/5634636. For this, you need a SubmissionID. The answer https://stackoverflow.com/a/37980813/5634636 explains how to get a submission ID in cluster mode (a sketch follows this subsection). The submission ID looks like driver-20190315142356-0004.
The error message is included in the job status message.
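A hedged sketch of that step, assuming (as the linked answer describes) that spark-submit in cluster mode prints a driver/submission ID of the form driver-YYYYMMDDHHMMSS-NNNN in its output; the master URL, jar name, and regex are my own guesses, not from the post:

import re
import subprocess

# Launch in cluster mode and capture spark-submit's output, which should
# contain the submission ID.
proc = subprocess.run(
    ["spark-submit", "--master", "spark://your-master:7077",  # placeholders
     "--deploy-mode", "cluster", "your_job.jar"],
    capture_output=True,
    text=True,
)
match = re.search(r"driver-\d{14}-\d{4}", proc.stdout + proc.stderr)
submission_id = match.group(0) if match else None
print(submission_id)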
Submit remotely
The Spark submission API is recommended. There doesn't seem to be any documentation for it on the official Apache Spark website, so some people call it the hidden API. For details, see: https://www.nitendragautam.com/spark/submit-apache-spark-job-with-rest-api/
To submit a Spark job, use the submit API.
To get the status of the job, use the status API: http://<master-ip>:6066/v1/submissions/status/<submission-id>. The submission-id is returned in a JSON response when you submit the job.
The error message is included in the status message.
More about the error message: note the difference between the ERROR and FAILED states. In short, FAILED means something went wrong while executing the Spark job (e.g. an uncaught exception), while ERROR means something went wrong during submission (e.g. an invalid jar path). The error message is included in the status JSON. The reason for a FAILED state can be accessed via http://<driver-ip>:<ui-port>/log/<submission-id>.
Here is an example of an error status (**** is an incorrect jar path, miswritten intentionally):
{
"action" : "SubmissionStatusResponse",
"driverState" : "ERROR",
"message" : "Exception from the cluster:\njava.io.FileNotFoundException: File hdfs:**** does not exist.\n\torg.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:795)\n\torg.apache.hadoop.hdfs.DistributedFileSystem.access$700(DistributedFileSystem.java:106)\n\torg.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:853)\n\torg.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:849)\n\torg.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)\n\torg.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:860)\n\torg.apache.spark.util.Utils$.fetchHcfsFile(Utils.scala:727)\n\torg.apache.spark.util.Utils$.doFetchFile(Utils.scala:695)\n\torg.apache.spark.util.Utils$.fetchFile(Utils.scala:488)\n\torg.apache.spark.deploy.worker.DriverRunner.downloadUserJar(DriverRunner.scala:155)\n\torg.apache.spark.deploy.worker.DriverRunner.prepareAndRunDriver(DriverRunner.scala:173)\n\torg.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:92)",
"serverSparkVersion" : "2.3.1",
"submissionId" : "driver-20190315160943-0005",
"success" : true,
"workerHostPort" : "172.18.0.4:36962",
"workerId" : "worker-20190306214522-172.18.0.4-36962"
}
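A minimal sketch of polling that status endpoint from Python, assuming the requests library (the master address and submission ID are placeholders; the question itself only mentions launching via Popen):

import time
import requests

MASTER_REST = "http://your-master-ip:6066"     # placeholder master address
submission_id = "driver-20190315160943-0005"   # e.g. parsed from the submit response

# Poll the hidden REST API until the driver reaches a terminal state.
while True:
    status = requests.get(
        f"{MASTER_REST}/v1/submissions/status/{submission_id}").json()
    state = status.get("driverState")
    print(state, status.get("message", ""))
    if state in ("FINISHED", "FAILED", "ERROR", "KILLED"):
        break
    time.sleep(10)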

pass custom exitcode from yarn-cluster mode spark to CLI

I started a yarn-cluster mode Spark job through spark-submit.
To indicate partial failure etc., I want to pass an exit code from the driver to the script calling spark-submit.
I tried both System.exit and throwing SparkUserAppException in the driver, but in both cases the CLI only got 1, not the exit code I passed.
I think it is impossible to pass a custom exit code, since any exit code passed by the driver will be converted to a YARN status, and YARN will convert any failed exit code to 1 or FAILED.
By looking at the Spark code, I can conclude this:
It is possible in client mode. Look at the runMain() method of the SparkSubmit class.
Whereas in cluster mode, it is not possible to get the exit status of the driver, because your driver class will be running in one of the executors.
There is an alternate solution that may or may not be suitable for your use case:
Host a REST API with an endpoint that receives status updates from your driver code. In case of any exceptions, let your driver code use this endpoint to update the status.
You can also save the exit code in an output file (on HDFS or the local FS) and make your script wait for this file to appear, read it, and proceed (see the sketch below). This is definitely not an elegant way, but it may help you move forward.
When saving the file, pay attention to the permissions of the location. Your Spark process has to have RW access.
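A rough sketch of that file-based idea, assuming the driver writes its exit code to an agreed HDFS path before exiting (the path, timeout, and job arguments are hypothetical placeholders):

import subprocess
import sys
import time

STATUS_PATH = "hdfs:///tmp/my_job/exit_code"  # hypothetical path agreed with the driver

# Launch the job in yarn cluster mode; its own exit code is not reliable here.
subprocess.run(["spark-submit", "--master", "yarn",
                "--deploy-mode", "cluster", "my_job.py"])

# Wait for the driver-written status file and forward its value as our exit code.
exit_code = None
for _ in range(60):  # poll for up to ~10 minutes
    result = subprocess.run(["hdfs", "dfs", "-cat", STATUS_PATH],
                            capture_output=True, text=True)
    if result.returncode == 0:
        exit_code = int(result.stdout.strip())
        break
    time.sleep(10)

sys.exit(exit_code if exit_code is not None else 1)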

how to use presto to query hive data

I just installed Presto, and when I use the presto-cli to query Hive data, I get the following error:
$ ./presto --server node6:8080 --catalog hive --schema default
presto:default> show tables;
Query 20131113_150006_00002_u8uyp failed: Table hive.information_schema.tables does not exist
The config.properties is:
coordinator=true
datasources=jmx,hive
http-server.http.port=8080
presto-metastore.db.type=h2
presto-metastore.db.filename=/root/h2
task.max-memory=1GB
discovery-server.enabled=true
discovery.uri=http://node6:8080
And the hive.properties is:
connector.name=hive-cdh4
hive.metastore.uri=thrift://node6:9083
The Hadoop distribution I use is CDH 4.4. I believe it's properly installed, and Hive can process queries successfully on its own.
Can anyone help me work it out? Any ideas will be appreciated.
As recommended in Getting Started, I created a coordinator (jmx only) and a separate worker (jmx,hive), each on a separate machine.
What finally solved this for me was to specify the worker's hostname and http-server.http.port as the --server argument to presto. Specifying the coordinator didn't work.
This all makes sense, but I am still wondering what will happen when I have two Presto-Hive workers...
Add one more line to etc/catalog/hive.properties:
hive.config.resources=/etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml
Of course, check the values of the paths before doing it.
presto-metastore.db.filename= <- is this the value for the Hive warehouse directory?
=> No, this is Presto's metastore, not Hive's.
I just figured out what was wrong in my case:
You also have to add the following line to $HIVE_HOME/conf/hive-env.sh to tell Hive to open the Thrift port (the same one specified under the hive.metastore.uris property in the hive-site.xml file). This port is used by the Hive client to connect to the metastore via RPC.
export METASTORE_PORT=9084
This should sync your Hive with Presto.
