PySpark batch job's configuration submitted through Apache Livy have no effect - apache-spark

I submitted spark batch job through Livy to the remote cluster with the following request body.
REQUEST_BODY = {
'file': '/spark/batch/job.py',
'conf': {
'spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation': 'true',
'spark.driver.cores': 1,
'spark.driver.memory': f'12g',
'spark.executor.cores': 1,
'spark.executor.memory': f'8g',
'spark.dynamicAllocation.maxExecutors': 4,
},
}
And in the python file containing application to be run, the SparkSession is created using the following command:
-- inside /spark/batch/job.py --
spark = SparkSession.builder.getOrCreate()
// spark application after this point using the SparkSession created above
The application work just fine, but the spark acquire all of the resources in the cluster neglecting the configuration set in the request body 😕.
I am suspicious that the /spark/batch/job.py create another SparkSession apart from the one specifying in the Livy request body. But I am not sure how to use the SparkSession provided by the Livy though. The document about this topic is so less.
Does anyone facing the same issue? How can I solved this problem?
Thanks in advance everyone!

Related

how can spark read / write from azurite

I am trying to read (and eventually write) from azurite (version 3.18.0) using spark (3.1.1)
i can't understand what spark configurations and file uri i need to set to make this work properly
for example these are the containers and files i have inside azurite
/devstoreaccount1/container1/file1.avro
/devstoreaccount1/container2/file2.avro
This is the code that im running - the uri val is one of the values below
val uri = ...
val spark = SparkSession.builder()
.appName(appName)
.master("local")
.config("spark.driver.host", "127.0.0.1").getOrCreate()
spark.conf.set("spark.hadoop.fs.wasbs.impl", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
spark.conf.set(s"spark.hadoop.fs.azure.account.auth.type.devstoreaccount1.blob.core.windows.net", "SharedKey")
spark.conf.set(s"spark.hadoop.fs.azure.account.key.devstoreaccount1.blob.core.windows.net", <azurite account key>)
spark.read.format("avro").load(uri)
uri value - what is the correct one?
http://127.0.0.1:10000/container1/file1.avro
I get UnsupportedOperationException when i perform the spark.read.format("avro").load(uri) because spark will use the HttpFileSystem implementation and it doesn't support listStatus
wasb://container1#devstoreaccount1.blob.core.windows.net/file1.avro
Spark will try to authenticate against azure servers (and will fail for obvious reasons)
I have tried to follow this stackoverflow post without success.
I have also tried to remove the blob.core.windows.net configuration postfix but then i don't how to give spark the endpoint for the azurite container?
So my question is what are the correct configurations to give spark so it will be able to read from azurite, and what are the correct file path formats to pass as the URI?

Spark NiFi site to site connection

I am new with NiFi, I am trying to send data from NiFi to Spark or to establish a stream from NiFi output port to Spark according to this tutorial.
Nifi is running on Kubernetes and I am using Spark operator on the same cluster to submit my applications.
It seems like Spark is able to access the web NiFi and it starts a streaming receiver. However, data is not coming to the Spark app through output and I have empty rdds. I have not seen any warnings or errors in Spark logs
Any Idea or information which could help me to solve this issue is appreciated.
My code:
val conf = new SiteToSiteClient.Builder()
.keystoreFilename("..")
.keystorePass("...")
.keystoreType(...)
.truststoreFilename("..")
.truststorePass("..")
.truststoreType(...)
.url("https://...../nifi")
.portName("spark")
.buildConfig()
val lines = ssc.receiverStream(new NiFiReceiver(conf, StorageLevel.MEMORY_ONLY))

How to pull Spark jobs client logs submitted using Apache Livy batches POST method using AirFlow

I am working on submitting Spark job using Apache Livy batches POST method.
This HTTP request is send using AirFlow. After submitting job, I am tracking status using batch Id.
I want to show driver ( client logs) logs on Air Flow logs to avoid going to multiple places AirFLow and Apache Livy/Resource Manager.
Is this possible to do using Apache Livy REST API?
Livy has an endpoint to get logs /sessions/{sessionId}/log & /batches/{batchId}/log.
Documentation:
https://livy.incubator.apache.org/docs/latest/rest-api.html#get-sessionssessionidlog
https://livy.incubator.apache.org/docs/latest/rest-api.html#get-batchesbatchidlog
You can create python functions like the one shown below to get logs:
http = HttpHook("GET", http_conn_id=http_conn_id)
def _http_rest_call(self, method, endpoint, data=None, headers=None, extra_options=None):
if not extra_options:
extra_options = {}
self.http.method = method
response = http.run(endpoint, json.dumps(data), headers, extra_options=extra_options)
return response
def _get_batch_session_logs(self, batch_id):
method = "GET"
endpoint = "batches/" + str(batch_id) + "/log"
response = self._http_rest_call(method=method, endpoint=endpoint)
# return response.json()
return response
Livy exposes REST API in 2 ways: session and batch. In your case, since we assume you are not using session, you are submitting using batches. You can post your batch using the curl command:
curl http://livy-server-IP:8998/batches
Once you have submitted the job, you would get the batch ID in return. Then you can curl using the command:
curl http://livy-server-IP:8998/batches/{batchId}/log
You can find the documentation at:
https://livy.incubator.apache.org/docs/latest/rest-api.html
If you want to avoid the above steps, you can use a ready-made AMI (namely, LightningFLow) from AWS Marketplace which provides Airflow with a custom Livy operator. Livy operator submits and tracks the status of the job every 30 sec (configurable), and it also provides spark logs at the end of the spark job in Airflow UI logs.
Note: LightningFlow comes pre-integrated with all required libraries, Livy, custom operators, and local Spark cluster.
Link for AWS Marketplace:
https://aws.amazon.com/marketplace/pp/Lightning-Analytics-Inc-LightningFlow-Integrated-o/B084BSD66V
This will enable you to view consolidated logs at one place, instead of shuffling between Airflow and EMR/Spark logs (Ambari/Resource Manager).

How to print out Spark connection of Spark session ?

Suppose I run pyspark command and got global variable spark of type SparkSession. As I understand, this spark holds a connection to the Spark master. Can I print out the details of this connection including the hostname of this Spark master ?
For basic information you can use master property:
spark.sparkContext.master
To get details on YARN you might have to dig through hadoopConfiguration:
hadoopConfiguration = spark.sparkContext._jsc.hadoopConfiguration()
hadoopConfiguration.get("yarn.resourcemanager.hostname")
or
hadoopConfiguration.get("yarn.resourcemanager.address")
When submitted to YARN Spark uses Hadoop configuration to determine the resource manger so these values should match ones present in configuration placed in HADOOP_CONF_DIR or YARN_CONF_DIR.

Run spark-submit in Scala code

Is it possible to execute below spark-submit script within code and then get application ID that'll assign by YARN?
bin/spark-submit
--class com.my.application.XApp
--master yarn-cluster --executor-memory 100m
--num-executors 50 hdfs://name.node.server:8020/user/root/x-service-1.0.0-201512141101-assembly.jar
1000
This is to enable user to start and stop the job via REST API.
I found,
https://spark.apache.org/docs/latest/api/java/org/apache/spark/launcher/SparkLauncher.html
import org.apache.spark.launcher.SparkLauncher;
public class MyLauncher {
public static void main(String[] args) throws Exception {
Process spark = new SparkLauncher()
.setAppResource("/my/app.jar")
.setMainClass("my.spark.app.Main")
.setMaster("local")
.setConf(SparkLauncher.DRIVER_MEMORY, "2g")
.launch();
spark.waitFor();
}
}
But I couldn't find a method to get application ID , also seems like app.jar has to be pre built before executing above code ?
Yes your application jar does need to be prebuilt in those cases. It seems like something like the Spark Job Server or IBM Spark Kernel may be closer to what you want (although they reuse a Spark Context).
SparkLauncher will only submit your built application. To get the application ID, you need to access the SparkContext within your application jar.
In your example, you could access the application ID in "/my/app.jar" (perhaps in "my.spark.app.Main") with:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
...
val sc = new SparkContext(new SparkConf())
sc.applicationId
This application ID will be the YARN application ID when the application is built and submitted in yarn-cluster mode.
See the Spark Scala API docs.
Support for accessing launched applications seems to be coming in Spark 1.6 (SPARK-8673). A Scala example derived from this test suite is below.
val handle = new SparkLauncher()
... // application configuration
.setMaster("yarn-client")
.startApplication()
try {
handle.getAppId() should startWith ("application_")
handle.stop()
} finally {
handle.kill()
}
Handlers may be added to launched applications, but a listener API is exposed and is the recommended way for monitoring launched applications. See this pull request for details.
Scala has SparkContext.applicationId, which is a unique identifier for the Spark application. Its format depends on the scheduler implementation. (i.e. in case of local spark app something like 'local-1433865536131' in case of YARN something like 'application_1433865536131_34483' )
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext

Resources