Instrumenting Spark JDBC with javaagent - apache-spark

I am attempting to instrument JDBC calls using the Kamon JDBC Kanela agent in my Spark app.
I am able to successfully instrument JDBC calls in a non-spark test app by passing in -javaagent:kanela-agent-1.0.1.jar on the command line when I run the app from the JAR. When I do this, I see the Kanela banner display in the console, and can see that my failed statement processor is getting called when there is a SQL error.
From my research, I should be able to inject a javaagent into the executor of a Spark app by passing in the following to spark-submit: --conf "spark.executor.extraJavaOptions=-javaagent:kanela-agent-1.0.1.jar". However, when I do this, although the Kamon banner IS displaying on the console upon my call to Kamon.init(), my failed statement processor is NOT getting called when there is a SQL error.
Things I'm wondering:
Is there something about the way that spark-jdbc makes these JDBC calls that would prevent a javaagent from "seeing" them?
Does my call to Kamon.init() somehow only apply to code in the Spark driver, and not the executor?
Any other reason that you can think of that would be preventing this from working?
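One way to narrow down the second point is to check, from code that actually runs on an executor, whether that JVM was started with the agent at all. The following is a minimal sketch using only the standard java.lang.management API; the class name AgentCheck and its methods are hypothetical helpers, not part of Kamon, Kanela, or Spark.

```java
import java.lang.management.ManagementFactory;
import java.util.List;
import java.util.stream.Collectors;

class AgentCheck {
    // Returns the -javaagent options found in a list of JVM arguments.
    static List<String> javaAgentArgs(List<String> jvmArgs) {
        return jvmArgs.stream()
                .filter(arg -> arg.startsWith("-javaagent:"))
                .collect(Collectors.toList());
    }

    // Inspects the arguments the current JVM was started with.
    static List<String> currentJvmAgents() {
        return javaAgentArgs(ManagementFactory.getRuntimeMXBean().getInputArguments());
    }

    public static void main(String[] args) {
        System.out.println("javaagent options on this JVM: " + currentJvmAgents());
    }
}
```

Logging the result of AgentCheck.currentJvmAgents() from inside a task (rather than from the driver) would show whether the spark.executor.extraJavaOptions setting reached the executor JVMs. It does not tell you whether the agent instrumented anything, only whether the flag was present.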

Related

spark-submit in cluster deploy mode get application id to console

I am stuck on a problem that I need to resolve quickly. I have gone through many posts and tutorials about Spark cluster deploy mode, but after several days I still have no working approach.
My use case: I have lots of Spark jobs submitted with the 'spark2-submit' command, and I need the application ID printed to the console once they are submitted. The jobs are submitted in cluster deploy mode. (In normal client mode, it is printed.)
Constraints on the solution: I am not supposed to change application code (it would take a long time, because there are many applications running); I can only provide log4j properties or some custom code.
My approach:
1) I have tried changing the log4j levels and various log4j parameters, but the logging still goes to the centralized log directory.
Part of my log4j.properties:
log4j.logger.org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend=ALL,console
log4j.appender.org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend.Target=System.out
log4j.logger.org.apache.spark.deploy.SparkSubmit=ALL
log4j.appender.org.apache.spark.deploy.SparkSubmit=console
log4j.logger.org.apache.spark.deploy.SparkSubmit=TRACE,console
log4j.additivity.org.apache.spark.deploy.SparkSubmit=false
log4j.logger.org.apache.spark.deploy.yarn.Client=ALL
log4j.appender.org.apache.spark.deploy.yarn.Client=console
log4j.logger.org.apache.spark.SparkContext=WARN
log4j.logger.org.apache.spark.scheduler.DAGScheduler=INFO,console
log4j.logger.org.apache.hadoop.ipc.Client=ALL
2) I have also tried adding a custom listener. I am able to get the Spark application ID after the application finishes, but not on the console.
Code logic:
public void onApplicationEnd(SparkListenerApplicationEnd arg0)
{
    for (Thread t : Thread.getAllStackTraces().keySet())
    {
        if (t.getName().equals("main"))
        {
            System.out.println("The current state : "+t.getState());
            Configuration config = new Configuration();
            ApplicationId appId = ConverterUtils.toApplicationId(getjobUId);
            // some logic to write to communicate with the main thread to print the app id to console.
        }
    }
}
3) I have set spark.eventLog.enabled to true and specified an HDFS directory for the event logs in the spark-submit command.
If anyone could help me find an approach to the solution, it would be really helpful. Or, if I am doing something very wrong, any insights would help.
Thanks.
After being stuck at the same place for some days, I finally found a solution to my problem.
After going through the Spark code for cluster deploy mode and some blogs, a few things became clear. It might help someone else looking to achieve the same result.
In cluster deploy mode, the job is submitted via a Client thread on the machine from which the user is submitting. I was passing the log4j configs to the driver and executors, but I had missed that the log4j config for the "Client" was missing.
So we need to use:
SPARK_SUBMIT_OPTS="-Dlog4j.debug=true -Dlog4j.configuration=<location>/log4j.properties" spark-submit <rest of the parameters>
To clarify:
Client mode means the Spark driver runs on the same machine you ran spark-submit from.
Cluster mode means the Spark driver runs out on the cluster somewhere.
You mentioned that the ID is logged when you run the app in client mode and that you can see it in the console. Your output is also logged when you run in cluster mode; you just can't see it because the driver is running on a different machine.
Some ideas:
Aggregate the logs from the worker nodes into one place where you can parse them to get the app ID.
Write the appIDs to some shared location like HDFS or a database. You might be able to use a Log4j appender if you want to keep log4j.
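Whichever place the logs end up in, pulling the ID out of a log line can be sketched as below. This assumes only that YARN application IDs have the form application_<clusterTimestamp>_<sequence>; the class name AppIdParser and the sample log line are made up for illustration.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AppIdParser {
    // YARN application IDs look like application_<clusterTimestamp>_<sequence>.
    private static final Pattern APP_ID = Pattern.compile("application_\\d+_\\d+");

    // Returns the first application ID found in the line, if any.
    static Optional<String> findAppId(String logLine) {
        Matcher m = APP_ID.matcher(logLine);
        return m.find() ? Optional.of(m.group()) : Optional.empty();
    }

    public static void main(String[] args) {
        // Hypothetical log line, for illustration only.
        String line = "INFO Client: Submitted application application_1500000000000_0042";
        System.out.println(findAppId(line).orElse("no application id found"));
    }
}
```

The same helper could be applied line by line to the spark-submit console output once the client-side log4j configuration is in place.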

submit job to remote hazelcast cluster

I'm new to Hazelcast Jet and have a very basic question. I have a 3-node JET cluster set up. I have a sample code to read from Kafka and drain to an IMap. When I run it from command-line (using jet-submit.sh and use JetBootstrap.getInstance() to acquire JET client instance) it works perfectly fine. When I run the same code (using Jet.newJetClient() to acquire the instance and Run As -> Java application on Eclipse), I get:
java.lang.ClassCastException: cannot assign instance of java.lang.invoke.SerializedLambda to field com.hazelcast.jet.core.ProcessorMetaSupplier.
Could you please let me know where I am going wrong?
One of your lambda functions captures an outside variable, probably defined at class level, and that class is not Serializable or not added to the Job config when submitting from client. This is done automatically when submitting via the script.
Please see http://docs.hazelcast.org/docs/jet/0.6.1/manual/#remember-that-a-jet-job-is-distributed
When you use a client instance to submit the job, you have to add all classes that contain the code called by the job to the JobConfig:
JobConfig config = new JobConfig();
config.addClass(...);
config.addJar(...);
...
client.newJob(pipeline, config);
For example, if you use a lambda for stage.map(), the class containing the lambda has to be added.
The jet-submit.sh script makes this easier by automatically adding the entire submitted .jar file.
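The point about a lambda capturing an outside variable can be seen in plain Java, without any Jet API. The sketch below is only an illustration of capture behavior: SerFunction is a hypothetical interface standing in for the serializable function types a distributed framework expects, and LambdaCaptureDemo is deliberately not Serializable.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.function.Function;

class LambdaCaptureDemo {
    // Hypothetical serializable function type.
    interface SerFunction<T, R> extends Function<T, R>, Serializable {}

    private final String prefix; // instance field of a non-Serializable class

    LambdaCaptureDemo(String prefix) {
        this.prefix = prefix;
    }

    // Reads the field inside the lambda, so the lambda captures `this`.
    SerFunction<String, String> capturesThis() {
        return s -> prefix + s;
    }

    // Copies the field to a local first, so only a String is captured.
    SerFunction<String, String> capturesLocal() {
        String localPrefix = prefix;
        return s -> localPrefix + s;
    }

    // Tries Java serialization and reports whether it succeeded.
    static boolean canSerialize(Object o) {
        try {
            new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(o);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        LambdaCaptureDemo demo = new LambdaCaptureDemo("item-");
        System.out.println("captures this:  " + canSerialize(demo.capturesThis()));
        System.out.println("captures local: " + canSerialize(demo.capturesLocal()));
    }
}
```

Copying the needed value into a local variable keeps the enclosing object out of the serialized lambda. It does not replace adding the classes to the JobConfig: the cluster members still need the class that contains the lambda in order to deserialize it.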

pass custom exitcode from yarn-cluster mode spark to CLI

I started a Spark job in yarn-cluster mode through spark-submit.
To indicate partial failure and similar outcomes, I want to pass an exit code from the driver to the script calling spark-submit.
I tried both System.exit and throwing SparkUserAppException in the driver, but in both cases the CLI only got 1, not the exit code I passed.
I think it is impossible to pass a custom exit code, since any exit code passed by the driver is converted to a YARN status, and YARN converts any failed exit code to 1 or FAILED.
By looking at the Spark code, I can conclude this:
It is possible in client mode. Look at the runMain() method of the SparkSubmit class.
In cluster mode, however, it is not possible to get the exit status of the driver, because the driver runs in a container on the cluster (the YARN ApplicationMaster) rather than in the spark-submit process.
There are alternative solutions that may or may not suit your use case:
Host a REST API with an endpoint to receive the status update from your driver code. In the case of any exceptions, let your driver code use this endpoint to update the status.
You can save the exit code in an output file (on HDFS or the local FS) and make your script wait for this file to appear, then read it and proceed. This is definitely not an elegant way, but it may help you to proceed.
When saving the file, pay attention to the permissions on this location. Your Spark process must have read/write access.
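The file-based approach could look roughly like the sketch below. It uses java.nio.file, so it only covers a file system visible to both sides (for example a shared mount); an HDFS location would need the Hadoop FileSystem API instead. The class name ExitCodeFile, its method names, and the timeout values are all assumptions chosen for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.time.Duration;

class ExitCodeFile {
    // Driver side: write to a temp file, then move it into place so that
    // a reader never sees a half-written file.
    static void write(Path target, int exitCode) throws IOException {
        Path tmp = target.resolveSibling(target.getFileName() + ".tmp");
        Files.write(tmp, Integer.toString(exitCode).getBytes(StandardCharsets.UTF_8));
        Files.move(tmp, target, StandardCopyOption.REPLACE_EXISTING);
    }

    // Caller side: poll until the file appears or the timeout expires.
    static int await(Path target, Duration timeout, Duration pollInterval)
            throws IOException, InterruptedException {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (!Files.exists(target)) {
            if (System.nanoTime() >= deadline) {
                throw new IOException("Timed out waiting for " + target);
            }
            Thread.sleep(pollInterval.toMillis());
        }
        String text = new String(Files.readAllBytes(target), StandardCharsets.UTF_8).trim();
        return Integer.parseInt(text);
    }

    public static void main(String[] args) throws Exception {
        // args[0]: path agreed on by the driver and the calling script.
        Path file = Paths.get(args[0]);
        System.exit(await(file, Duration.ofMinutes(30), Duration.ofSeconds(5)));
    }
}
```

The calling script would run spark-submit, then run this class and use its exit status. Deleting any stale file before each submission avoids reading the result of a previous run.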

How to leverage a spark cluster from a web app?

A lot of people have asked this question, but there is no clear answer beyond links and references, and most of those are not recent. The question is this:
I have a web app that needs to leverage a Spark cluster to run a Spark SQL query. My understanding is that the submit-job script is asynchronous, hence it won't work here. How do I leverage Spark in such a setup? Can I just write code in the web app like I do in a self-contained Spark application, i.e. create a context, set the master URL, and do what I need to do? Will this work in a web app? If yes, when would I need the job server that provides REST APIs to submit jobs?
Library for launching Spark applications.
This library allows applications to launch Spark programmatically. There's only one entry point to the library - the SparkLauncher class.
To launch a Spark application, just instantiate a SparkLauncher and configure the application to run. For example:
import org.apache.spark.launcher.SparkLauncher;

public class MyLauncher {
    public static void main(String[] args) throws Exception {
        Process spark = new SparkLauncher()
            .setAppResource("/my/app.jar")
            .setMainClass("my.spark.app.Main")
            .setMaster("local")
            .setConf(SparkLauncher.DRIVER_MEMORY, "2g")
            .launch();
        spark.waitFor();
    }
}
References:
https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/launcher/package-summary.html
I think the options are:
Through a REST API such as Livy (Livy is a new open source Spark REST server for submitting and interacting with your Spark jobs from anywhere) or a Spark server (REST APIs). See how they connect to Spark interactively using a kernel:
https://www.youtube.com/watch?v=TD1J7MzYcFo&feature=youtu.be&t=33m19s
https://developer.ibm.com/open/apache-toree/
Through JDBC (running via the Thrift JDBC/ODBC server).
Through SSH: SSH to the cluster, do a spark-submit through YARN, and wait for the YARN status. YARN gives you an application ID, and you can keep track of the application status with the yarn application -status command.
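For the last option, the web app would need to read the state out of the status report. The sketch below assumes the report contains a line of the form "State : <VALUE>" next to a "Final-State : <VALUE>" line; that layout is an assumption to verify against your Hadoop version, and YarnStatusParser is a hypothetical helper.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class YarnStatusParser {
    // Matches a line such as "    State : RUNNING" but not "Final-State : ...",
    // because the word State must be the first non-blank text on the line.
    private static final Pattern STATE =
            Pattern.compile("^\\s*State\\s*:\\s*(\\S+)", Pattern.MULTILINE);

    // Returns the application state from the report text, if present.
    static Optional<String> parseState(String report) {
        Matcher m = STATE.matcher(report);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Sample text following the assumed report layout.
        String sample = "Application Report :\n"
                + "\tApplication-Id : application_1500000000000_0042\n"
                + "\tState : FINISHED\n"
                + "\tFinal-State : SUCCEEDED\n";
        System.out.println(parseState(sample).orElse("unknown"));
    }
}
```

The report text itself would come from running the status command over SSH and capturing its standard output.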

insert data into Microsoft SQL server using Spark

I am trying to insert data into SQL Server using Spark with the JDBC methods below.
Option 1:
prop.put("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
dataf.write.mode(org.apache.spark.sql.SaveMode.Append).jdbc(url,table_name, prop)
The table is already created and I am appending new data. The job failed with the exception below:
Exception in thread "main" com.microsoft.sqlserver.jdbc.SQLServerException: CREATE TABLE permission denied in database
Question: why is CREATE TABLE permission required for appending data?
Option 2:
prop.put("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils.saveTable(dataf, url, table_name, prop)
The above command works from spark-shell. When the same call is used in Scala code and packaged with dependencies, it gives the exception below:
Exception in thread "main" java.sql.SQLException: No suitable driver
at java.sql.DriverManager.getDriver(DriverManager.java:315)
I tried setting the driver class path and the executor class path, and also --jars, with no luck. I included sqljdbc4.jar in the driver class path and in --jars.
I also copied sqljdbc4.jar to all worker nodes, still with no luck.
Any ideas on this?
After a lot of searching and testing, I found the answer. It might be useful for someone.
Option 1: This is caused by a bug in Spark 1.5.x, which was resolved in 1.6.x and later. Because of the bug, Spark always tries to create a new table.
Option 2: This happens because the driver name on the classpath is given priority over the properties we pass as an argument. The workaround is to create the connection first and then invoke saveTable.
Workaround if you are using Spark 1.5.x or lower:
JdbcUtils.createConnection(url, prop)
JdbcUtils.saveTable(dataf, url, table_name, prop)
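When chasing a "No suitable driver" error, it can help to check whether the JVM running your main class can resolve a driver for the URL at all. The sketch below uses only java.sql.DriverManager and is not Spark's API; DriverCheck is a hypothetical helper and the URL in main is a placeholder.

```java
import java.sql.DriverManager;
import java.sql.SQLException;

class DriverCheck {
    // Optionally loads the driver class by name (which lets it register itself),
    // then asks DriverManager whether any registered driver accepts the URL.
    static boolean hasDriverFor(String driverClass, String url) {
        try {
            if (driverClass != null) {
                Class.forName(driverClass);
            }
            DriverManager.getDriver(url);
            return true;
        } catch (ClassNotFoundException | SQLException e) {
            System.err.println("No usable JDBC driver: " + e);
            return false;
        }
    }

    public static void main(String[] args) {
        // Placeholder URL; substitute the real connection URL.
        boolean ok = hasDriverFor("com.microsoft.sqlserver.jdbc.SQLServerDriver",
                "jdbc:sqlserver://localhost:1433;databaseName=test");
        System.out.println("driver available: " + ok);
    }
}
```

This only reports on the JVM it runs in. A ClassNotFoundException in the message points to the jar missing from that JVM's classpath, while a "No suitable driver" SQLException with the class present points to a registration or URL problem.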

Resources