Using logback.xml in Apache Livy

I am trying to use logback.xml with the Apache Livy REST API and I'm having trouble getting it to work. I've tried submitting the logback.xml path as follows.
data = {
    "file": "<PATH_TO_JAR_FILE>",
    "className": "<CLASS_NAME>",
    "files": ["file:///path/to/log4j.properties", "file:///path/to/logback.xml"],
    "conf": {"driver-java-options": "-Dlogback.configurationFile=file:///path/to/logback.xml"}
}
However, I get the warning Ignoring non-spark config property: driver-java-options=-Dlogback.configurationFile. I can set spark.driver.extraJavaOptions instead of driver-java-options, but it makes no difference since the JVM has already started. So my question is: how do I get Livy to use the logback.xml?

I see that you're trying to configure logging for the Apache Livy server. But by using its REST API (IIUC you are sending POST /batches or /sessions) you are submitting a Spark job and feeding logback.xml to the Spark driver process, which is separate from the Livy server process.
What you need to do is pass the -Dlogback.configurationFile=... option to the Livy server process on startup. Most probably (depending on your Livy version and environment) you should set the environment variable export LIVY_SERVER_JAVA_OPTS=-Dlogback.configurationFile=... and start Livy with $LIVY_HOME/bin/livy-server start. Refer to the livy-server script on GitHub.
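For illustration, a minimal Python sketch of that startup sequence (the logback.xml path is a placeholder, and LIVY_HOME is assumed to be set in the environment):

import os
import subprocess

# Pass the logback configuration to the Livy server JVM via its startup env var
env = dict(os.environ)
env["LIVY_SERVER_JAVA_OPTS"] = "-Dlogback.configurationFile=file:///path/to/logback.xml"

# Equivalent to running: $LIVY_HOME/bin/livy-server start
subprocess.run(
    [os.path.join(env["LIVY_HOME"], "bin", "livy-server"), "start"],
    env=env,
    check=True,
)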

Related

Providing AWS_PROFILE when reading S3 files with Spark

I want my Spark app (Scala) to be able to read S3 files:
spark.read.parquet("s3://my-bucket-name/my-object-key")
On my dev machine, I could access S3 files using awscli with a pre-configured profile in ~/.aws/config or ~/.aws/credentials, like:
aws --profile my-profile s3 ls s3://my-bucket-name/my-object-key
But when trying to read those files from Spark, with the aws_profile provided as an env variable (AWS_PROFILE), I got the following error:
doesBucketExist on my-bucket-name: com.amazonaws.AmazonClientException: No AWS Credentials provided by BasicAWSCredentialsProvider EnvironmentVariableCredentialsProvider SharedInstanceProfileCredentialsProvider : com.amazonaws.SdkClientException: Unable to load credentials from service endpoint
I also tried to provide the profile as a JVM option (-Daws.profile=my-profile), with no luck.
Thanks for reading.
The solution is to provide the Spark property fs.s3a.aws.credentials.provider, setting it to com.amazonaws.auth.profile.ProfileCredentialsProvider.
If you can change the code that builds the SparkSession, then something like:
SparkSession
  .builder()
  .config("fs.s3a.aws.credentials.provider", "com.amazonaws.auth.profile.ProfileCredentialsProvider")
  .getOrCreate()
The other way is to provide the JVM option -Dspark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.profile.ProfileCredentialsProvider. NOTE the spark.hadoop prefix.
If problems still arise after setting fs.s3a.aws.credentials.provider to com.amazonaws.auth.profile.ProfileCredentialsProvider and correctly setting AWS_PROFILE, it might be because you're using Hadoop 2, for which the above configuration is not supported.
Therefore, the only workaround I found was to upgrade to Hadoop 3.
Check this post and the Hadoop docs for more information.
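For PySpark users, a rough equivalent of the builder snippet above might look like this (a sketch assuming hadoop-aws is on the classpath and using the s3a:// scheme; the bucket, key, and profile names are placeholders):

import os
from pyspark.sql import SparkSession

# The profile is still taken from the AWS_PROFILE environment variable
os.environ["AWS_PROFILE"] = "my-profile"

spark = (
    SparkSession.builder
    .config("fs.s3a.aws.credentials.provider",
            "com.amazonaws.auth.profile.ProfileCredentialsProvider")
    .getOrCreate()
)

df = spark.read.parquet("s3a://my-bucket-name/my-object-key")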

How to read stderr logs from AWS logs

I am using EMR steps to run my jobs.
Typically when I want to analyze the performance of a job or to understand why it failed, I look at the Spark history server for DAG visualizations, job errors, etc.
For example, if the job failed due to a heap error, or FetchFailed, etc., I can see it clearly specified in the Spark history server.
However, I can't seem to find such descriptions when I look at the stderr log files that are written to the LOG URI S3 bucket.
Is there a way to obtain such information?
I use pyspark and set the log level to
sc = spark.sparkContext
sc.setLogLevel('DEBUG')
Any insight as to what I am doing wrong?
I haven't really tested this, but as it's a bit too long to fit in a comment, I'm posting it as an answer.
As pointed out in my comment, the logs you're viewing in the Spark History Server UI aren't the same as the Spark driver logs that EMR saves to S3.
To get the Spark history server logs written to S3, you'll have to add some additional configuration to your cluster. These configuration options are described in the Monitoring and Instrumentation section of the Spark documentation.
In AWS EMR, you could try adding something like this to your cluster configuration:
...
{
    'Classification': 'spark-defaults',
    'Properties': {
        'spark.eventLog.dir': 's3a://your_bucket/spark_logs',
        'spark.history.fs.logDirectory': 's3a://your_bucket/spark_logs',
        'spark.eventLog.enabled': 'true'
    }
}
...
I found this interesting post which describes how to set this up for a Kubernetes cluster; you may want to check it for further details.
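For context, a hedged boto3 sketch of where that classification block plugs in (the cluster name, release label, instance settings, and IAM roles below are placeholders, not taken from the question):

import boto3

emr = boto3.client("emr")

# The spark-defaults classification from the snippet above
configurations = [
    {
        "Classification": "spark-defaults",
        "Properties": {
            "spark.eventLog.dir": "s3a://your_bucket/spark_logs",
            "spark.history.fs.logDirectory": "s3a://your_bucket/spark_logs",
            "spark.eventLog.enabled": "true",
        },
    }
]

response = emr.run_job_flow(
    Name="spark-event-logs-demo",        # placeholder cluster name
    ReleaseLabel="emr-6.10.0",           # assumption: any recent EMR release
    Applications=[{"Name": "Spark"}],
    Configurations=configurations,
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    JobFlowRole="EMR_EC2_DefaultRole",   # default EMR roles, adjust to your account
    ServiceRole="EMR_DefaultRole",
)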

SnappyData REST API to Submit Job

I am trying to submit a Snappy job using the REST API.
We have been able to submit Snappy jobs using the snappy-job submit command-line tool, but I could not find any documentation on how to do the same thing through the REST API.
I found it mentioned somewhere in the forum that SnappyData uses the spark-jobserver REST API.
Could you point me to the documentation / user guide on how to do that?
SnappyData internally uses spark-jobserver for submitting jobs. Hence, all the spark-jobserver REST APIs are accessible on SnappyData's lead node.
You can refer to all the spark-jobserver APIs here: https://github.com/SnappyDataInc/spark-jobserver#api
Here are some useful curl commands to clarify it further:
deploy the application jar on the job server:
curl --data-binary @/path/to/application.jar localhost:8090/jars/testApp
testApp is the name of the job-server app which will be used to submit the job.
create context:
curl -X POST "localhost:8090/contexts/testSnappyContext?context-factory=org.apache.spark.sql.SnappySessionFactory"
testSnappyContext is the name of the context which will be used to submit the job.
Also, note that we are passing a custom context-factory argument here, which is necessary for submitting a Snappy job.
submit the job:
curl -d "configKey1=configValue1,configKey2=configValue2" "localhost:8090/jobs?appName=testApp&classPath=com.package.Main&context=testSnappyContext"
com.package.Main is the fully qualified name of the class extending org.apache.spark.sql.SnappySQLJob.
stop the job:
curl -X DELETE localhost:8090/jobs/bfed84a1-0b06-47ca-81a7-9b8defb51e38
bfed84a1-0b06-47ca-81a7-9b8defb51e38 is the job-id which you will get in the response of the job submit request.
stop the context:
curl -X DELETE localhost:8090/contexts/testSnappyContext
undeploying the application jar:
The version of job-server used by SnappyData doesn't expose a REST API for undeploying a jar. However, deploying any jar with the same app name (testApp in our example) will override the previously deployed jar for that app.
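If you prefer Python over curl, the same flow could be sketched with the requests library (host, port, and names exactly as in the curl examples above; the job-id parsing assumes the usual spark-jobserver response shape):

import requests

BASE = "http://localhost:8090"

# Deploy the application jar (mirrors the --data-binary curl command)
with open("/path/to/application.jar", "rb") as jar:
    requests.post(f"{BASE}/jars/testApp", data=jar)

# Create a context backed by the Snappy session factory
requests.post(
    f"{BASE}/contexts/testSnappyContext",
    params={"context-factory": "org.apache.spark.sql.SnappySessionFactory"},
)

# Submit the job and keep the job-id for status checks or deletion
resp = requests.post(
    f"{BASE}/jobs",
    params={
        "appName": "testApp",
        "classPath": "com.package.Main",
        "context": "testSnappyContext",
    },
    data="configKey1=configValue1,configKey2=configValue2",
)
job_id = resp.json()["result"]["jobId"]

# Poll the job, then clean up
print(requests.get(f"{BASE}/jobs/{job_id}").json())
requests.delete(f"{BASE}/jobs/{job_id}")
requests.delete(f"{BASE}/contexts/testSnappyContext")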

PySpark logging from the executor in a standalone cluster

This question has answers explaining how to do this on a YARN cluster. But what if I am running a standalone Spark cluster? How can I log from the executors? Logging from the driver is easy using the log4j logger that we can derive from the Spark context.
But how can I log from within an RDD's foreach or foreachPartition? Is there any way I can collect these logs and print them?
The answer is to import Python's logging module and write the messages with it; the logged messages will end up in the work directory, which is created under the Spark installation location.
Nothing else is needed.
I went crazy modifying the log4j.properties file and adding driver-java-options and spark.executor.extraJavaOptions.
In your Spark program, import logging and add log messages straightaway, e.g.:
logging.warning(<your message and the variable values you want to check>)
Then navigate to the work directory: if I have installed Spark at /home/vagrant/spark, then we are talking about the /home/vagrant/spark/work directory.
There will be a directory for each application, and the executors used for the application are numbered 0, 1, 2, 3, etc. You have to check each one; in the stderr of whichever executor was created to execute your task, you will see the logging messages.
Hope this helps you see user-logged messages on the executors when using Spark standalone cluster mode.
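To make that concrete, a minimal sketch of logging from inside foreachPartition on a standalone cluster (the app name and data are placeholders):

import logging

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("executor-logging-demo").getOrCreate()

def process_partition(rows):
    # This function runs on the executor; its messages land in that
    # executor's stderr under $SPARK_HOME/work/<app-id>/<executor-id>/
    logging.basicConfig(level=logging.WARNING)
    for row in rows:
        logging.warning("processing value: %s", row)

spark.sparkContext.parallelize(range(10), numSlices=2).foreachPartition(process_partition)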

How to leverage a spark cluster from a web app?

A lot of people have asked this question, but there is no clear answer except links and references, and most of them are not recent. The question is this:
I have a web app that needs to leverage a Spark cluster to run a Spark SQL query. My understanding is that the spark-submit script is asynchronous, hence this won't work here. How do I leverage Spark in such a setup? Can I just write code in the web app like I do in a self-contained Spark application, i.e. create a context, set the master URL, and do what I need to do? Will this work in a web app? If yes, then when would I need the job server that provides REST APIs to submit jobs?
Library for launching Spark applications.
This library allows applications to launch Spark programmatically. There's only one entry point to the library - the SparkLauncher class.
To launch a Spark application, just instantiate a SparkLauncher and configure the application to run. For example:
import org.apache.spark.launcher.SparkLauncher;

public class MyLauncher {
    public static void main(String[] args) throws Exception {
        Process spark = new SparkLauncher()
            .setAppResource("/my/app.jar")
            .setMainClass("my.spark.app.Main")
            .setMaster("local")
            .setConf(SparkLauncher.DRIVER_MEMORY, "2g")
            .launch();
        spark.waitFor();
    }
}
References:
https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/launcher/package-summary.html
I think the options are:
1. Through a REST API like Livy (Livy is a new open source Spark REST server for submitting and interacting with your Spark jobs from anywhere) or Spark Job Server (REST APIs). See how they connect to Spark interactively using a kernel:
https://www.youtube.com/watch?v=TD1J7MzYcFo&feature=youtu.be&t=33m19s
https://developer.ibm.com/open/apache-toree/
A sketch of a Livy submission follows this list.
2. Through JDBC (running via the Thrift JDBC/ODBC server).
3. Through SSH: connect to the cluster and do a spark-submit through YARN; YARN gives you an application ID, and you can keep track of the application's status with the yarn application -status command.
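To illustrate the first option, a hedged Python sketch of submitting a batch through Livy's REST API (the host is a placeholder, 8998 is Livy's default port, and the jar path and class name are placeholders):

import json
import requests

payload = {
    "file": "hdfs:///path/to/my-spark-app.jar",   # placeholder application jar
    "className": "my.spark.app.Main",             # placeholder main class
}

resp = requests.post(
    "http://livy-host:8998/batches",
    data=json.dumps(payload),
    headers={"Content-Type": "application/json"},
)

# The response contains an id you can poll at /batches/<id> for the job state
print(resp.json())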
