Apache Spark Submit Job API Driver State ERROR Null Pointer Exception - apache-spark

I have a Spark cluster deployed on Windows. I'm trying to submit a simple Spark job using the REST API. The job is just Python code that prints a simple hello world message, as follows:
from pyspark.sql import SparkSession

def main(args):
    print('hello world')
    return 0

if __name__ == '__main__':
    main(None)
The URL I'm using to submit the job is:
http://:6066/v1/submissions/create
With the following POST body:
{
  "appResource": "file:../../helloworld.py",
  "sparkProperties": {
    "spark.executor.memory": "2g",
    "spark.master": "spark://<Master IP>:7077",
    "spark.app.name": "Spark REST API - Hello world",
    "spark.driver.memory": "2g",
    "spark.eventLog.enabled": "false",
    "spark.driver.cores": "2",
    "spark.submit.deployMode": "cluster",
    "spark.driver.supervise": "true"
  },
  "clientSparkVersion": "3.3.1",
  "mainClass": "org.apache.spark.deploy.SparkSubmit",
  "environmentVariables": {
    "SPARK_ENV_LOADED": "1"
  },
  "action": "CreateSubmissionRequest",
  "appArgs": [
    "../../helloworld.py", "80"
  ]
}
After I run this POST using Postman, I get the following response:
{
  "action": "CreateSubmissionResponse",
  "message": "Driver successfully submitted as driver-20221216112633-0005",
  "serverSparkVersion": "3.3.1",
  "submissionId": "driver-20221216112633-0005",
  "success": true
}
However, when I try to get the job status using:
http://:6066/v1/submissions/status/driver-20221216112633-0005
I get driverState: ERROR with a NullPointerException, as follows:
{
  "action": "SubmissionStatusResponse",
  "driverState": "ERROR",
  "message": "Exception from the cluster:\njava.lang.NullPointerException\n\torg.apache.spark.deploy.worker.DriverRunner.downloadUserJar(DriverRunner.scala:158)\n\torg.apache.spark.deploy.worker.DriverRunner.prepareAndRunDriver(DriverRunner.scala:179)\n\torg.apache.spark.deploy.worker.DriverRunner$$anon$2.run(DriverRunner.scala:99)",
  "serverSparkVersion": "3.3.1",
  "submissionId": "driver-20221216112633-0005",
  "success": true,
  "workerHostPort": "10.9.8.120:56060",
  "workerId": "worker-20221216093629-<IP>-56060"
}
 
Not sure why I'm getting this error or what it means. Can someone please point me in the right direction, or at least help me troubleshoot this further? Thanks.
 
I tried to submit the API request using Postman and was expecting a FINISHED or RUNNING state, since it's just a simple hello world application.
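For reference, the same two REST calls can be scripted instead of issued from Postman. Below is a minimal sketch using Python's requests library, assuming the standalone master's REST endpoint is reachable on port 6066; <master-ip> is a placeholder and the body only repeats the request shown above.

import time
import requests

MASTER_REST = "http://<master-ip>:6066"  # placeholder for the standalone master's REST endpoint

# The CreateSubmissionRequest body shown above (trimmed; fill in the full sparkProperties as needed).
body = {
    "appResource": "file:../../helloworld.py",
    "sparkProperties": {
        "spark.master": "spark://<master-ip>:7077",
        "spark.app.name": "Spark REST API - Hello world",
        "spark.submit.deployMode": "cluster",
    },
    "clientSparkVersion": "3.3.1",
    "mainClass": "org.apache.spark.deploy.SparkSubmit",
    "environmentVariables": {"SPARK_ENV_LOADED": "1"},
    "action": "CreateSubmissionRequest",
    "appArgs": ["../../helloworld.py", "80"],
}

# Submit the driver.
resp = requests.post(f"{MASTER_REST}/v1/submissions/create", json=body).json()
submission_id = resp["submissionId"]
print(resp["message"])

# Poll the status endpoint until the driver leaves the SUBMITTED/RUNNING states.
while True:
    status = requests.get(f"{MASTER_REST}/v1/submissions/status/{submission_id}").json()
    print(status["driverState"])
    if status["driverState"] not in ("SUBMITTED", "RUNNING"):
        break
    time.sleep(5)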
EDIT 2022-01-12:
I was finally able to figure out the problem. To resolve this issue, the py/jar file specified in "appResource" and "spark.jars" needs to be accessible by all nodes in the cluster. For example, if you have a network path, you can specify it in both attributes as follows:
"appResource": "file:////Servername/somefolder/HelloWorld.jar",
...
"spark.jars": "file:////Servername/someFolder/HelloWorld.jar",
...
Hope that helps and saves someone the time and agony I had to go through.
If anybody knows why the py/jar file has to be accessible by all nodes, even though we are submitting the job to the master, please help me understand. Thanks.
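As a quick sanity check (my own sketch, not part of the original post), you can verify from each machine that the shared path really is readable before submitting; the UNC path below is the one from the example above.

import os

# Path used in both "appResource" and "spark.jars" above; every worker (and the master)
# must be able to read it, so run this check on each node.
shared_path = r"\\Servername\somefolder\HelloWorld.jar"

print(shared_path, "readable:", os.path.isfile(shared_path) and os.access(shared_path, os.R_OK))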

You should set the property deployMode to client, since you are launching the driver locally:
"spark.submit.deployMode": "client",

Related

How to do a Spark Submit using spark rest and add my arguments

I would like to use the Spark REST API to do a spark-submit through Postman. Using the command prompt, this is how I would execute a spark-submit with arguments:
C:\Spark\spark-3.3.0-bin-hadoop3\bin\spark-submit --class com.example.spark.cobol.app.CopyBookParser --master local[*] C:\App\spark.jar C:\App\MONHAD.CPY C:\App\data\ json C:\App\output
So now I want to use the Spark REST API to execute the same command. This is the example that I see being used on the internet:
{
  "appResource": "/home/hduser/sparkbatchapp.jar",
  "sparkProperties": {
    "spark.executor.memory": "8g",
    "spark.master": "spark://192.168.1.1:7077",
    "spark.driver.memory": "8g",
    "spark.driver.cores": "2",
    "spark.eventLog.enabled": "false",
    "spark.app.name": "Spark REST API - PI",
    "spark.submit.deployMode": "cluster",
    "spark.jars": "/home/user/spark-examples_versionxx.jar",
    "spark.driver.supervise": "true"
  },
  "clientSparkVersion": "2.4.0",
  "mainClass": "org.apache.spark.examples.SparkPi",
  "environmentVariables": {
    "SPARK_ENV_LOADED": "1"
  },
  "action": "CreateSubmissionRequest",
  "appArgs": [
    "80"
  ]
}
But what values do I put in appResource and spark.jars, and how do I add my own arguments like the ones above?
I really appreciate any assistance, thanks.
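Not an authoritative answer, but here is a sketch of how the spark-submit command above might map onto the REST body: appResource and spark.jars both point at the application jar (which every node must be able to read, as noted in the question above), mainClass is the --class value, and appArgs carries the positional arguments in the same order as on the command line. Note that the REST submission endpoint belongs to a standalone master, so spark.master can't stay local[*]; the master URL and app name below are placeholders.

# Hypothetical mapping of the spark-submit command above to a CreateSubmissionRequest body.
submission_body = {
    "appResource": "file:///C:/App/spark.jar",
    "sparkProperties": {
        "spark.master": "spark://<standalone-master>:7077",  # placeholder
        "spark.app.name": "CopyBookParser",                   # placeholder
        "spark.jars": "file:///C:/App/spark.jar",
        "spark.submit.deployMode": "cluster",
    },
    "clientSparkVersion": "3.3.0",
    "mainClass": "com.example.spark.cobol.app.CopyBookParser",
    "environmentVariables": {"SPARK_ENV_LOADED": "1"},
    "action": "CreateSubmissionRequest",
    # Same positional arguments as on the command line.
    "appArgs": ["C:/App/MONHAD.CPY", "C:/App/data/", "json", "C:/App/output"],
}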

How to provide cluster name in Azure Databricks Notebook Run Now JSON

I am able to use the below JSON through POSTMAN to run my Databricks notebook.
I want to be able to give a name to the cluster that is created through the "new_cluster" options.
Is there any such option available?
{
  "tasks": [
    {
      "task_key": "Job_Run_Api",
      "description": "To see how the run and trigger api works",
      "new_cluster": {
        "spark_version": "9.0.x-scala2.12",
        "node_type_id": "Standard_E8as_v4",
        "num_workers": "1",
        "custom_tags": {
          "Workload": "Job Run Api"
        }
      },
      "libraries": [
        {
          "maven": {
            "coordinates": "net.sourceforge.jtds:jtds:1.3.1"
          }
        }
      ],
      "notebook_task": {
        "notebook_path": "/Shared/POC/Job_Run_Api_POC",
        "base_parameters": {
          "name": "Junaid Khan"
        }
      },
      "timeout_seconds": 2100,
      "max_retries": 0
    }
  ],
  "job_clusters": null,
  "run_name": "RUN_API_TEST",
  "timeout_seconds": 2100
}
When the above API call is done, the cluster created has a name like "job-5975-run-2" and that is not super explanatory.
I tried to use a "cluster_name" field inside the "new_cluster" block, but I got an error saying I can't do that:
{
  "error_code": "INVALID_PARAMETER_VALUE",
  "message": "Cluster name should not be provided for jobs."
}
Appreciate any help here
Cluster names for jobs are generated automatically and can't be changed. If you want to somehow track specific jobs, use tags.
P.S. If you want more "advanced" tracking capability, look into the Overwatch project.
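To illustrate the tag-based approach, here is a minimal sketch that submits a one-time run like the one in the question with an extra custom tag, and then looks the job cluster up by that tag via the Clusters API. The workspace URL, token, and the "RunLabel" tag are placeholders of mine, and the payload is abbreviated.

import requests

HOST = "https://<workspace-url>"                                   # placeholder
HEADERS = {"Authorization": "Bearer <personal-access-token>"}      # placeholder

# Abbreviated version of the run payload from the question, with an extra custom tag
# so the job cluster can be identified later even though its name is auto-generated.
payload = {
    "run_name": "RUN_API_TEST",
    "tasks": [{
        "task_key": "Job_Run_Api",
        "new_cluster": {
            "spark_version": "9.0.x-scala2.12",
            "node_type_id": "Standard_E8as_v4",
            "num_workers": 1,
            "custom_tags": {"Workload": "Job Run Api", "RunLabel": "run-api-test"},
        },
        "notebook_task": {"notebook_path": "/Shared/POC/Job_Run_Api_POC"},
    }],
}
run = requests.post(f"{HOST}/api/2.1/jobs/runs/submit", headers=HEADERS, json=payload).json()
print("run_id:", run["run_id"])

# Later, find the job cluster by tag instead of by name.
clusters = requests.get(f"{HOST}/api/2.0/clusters/list", headers=HEADERS).json().get("clusters", [])
tagged = [c for c in clusters if c.get("custom_tags", {}).get("RunLabel") == "run-api-test"]
print([c["cluster_name"] for c in tagged])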

Attach aws emr cluster to remote jupyter notebook using sparkmagic

I am trying to connect and attach an AWS EMR cluster (emr-5.29.0) to a Jupyter notebook that I am working on from my local Windows machine. I have started a cluster with Hive 2.3.6, Pig 0.17.0, Hue 4.4.0, Livy 0.6.0, Spark 2.4.4, and the subnets are public. I found that this can be done with Azure HDInsight, so I was hoping something similar could be done using EMR. The issue I am having is with passing the correct values in the config.json file. How should I attach an EMR cluster?
I could work with the EMR notebooks native to AWS, but thought I would go the develop-locally route, and I have hit a roadblock.
{
  "kernel_python_credentials" : {
    "username": "{IAM ACCESS KEY ID}",  # not sure about the username for the cluster
    "password": "{IAM SECRET ACCESS KEY}",  # I use putty to ssh into the cluster with the pem key, so again not sure about the password for the cluster
    "url": "ec2-xx-xxx-x-xxx.us-west-2.compute.amazonaws.com",  # as per the AWS blog, when Amazon EMR is launched with Livy installed, the EMR master node becomes the endpoint for Livy
    "auth": "None"
  },
  "kernel_scala_credentials" : {
    "username": "{IAM ACCESS KEY ID}",
    "password": "{IAM SECRET ACCESS KEY}",
    "url": "{Master public DNS}",
    "auth": "None"
  },
  "kernel_r_credentials": {
    "username": "{}",
    "password": "{}",
    "url": "{}"
  },
Update 1/4/2021
On 4/1, I got sparkmagic to work on my local Jupyter notebook. Used these documents as references (ref-1, ref-2 & ref-3) to set up local port forwarding (if possible, avoid using sudo).
sudo ssh -i ~/aws-key/my-pem-file.pem -N -L 8998:ec2-xx-xxx-xxx-xxx.us-west-2.compute.amazonaws.com:8998 hadoop@ec2-xx-xxx-xxx-xxx.us-west-2.compute.amazonaws.com
Configuration details
Release label: emr-5.32.0
Hadoop distribution: Amazon 2.10.1
Applications: Hive 2.3.7, Livy 0.7.0, JupyterHub 1.1.0, Spark 2.4.7, Zeppelin 0.8.2
Updated config file
{
  "kernel_python_credentials" : {
    "username": "",
    "password": "",
    "url": "http://localhost:8998"
  },
  "kernel_scala_credentials" : {
    "username": "",
    "password": "",
    "url": "http://localhost:8998",
    "auth": "None"
  },
  "kernel_r_credentials": {
    "username": "",
    "password": "",
    "url": "http://localhost:8998"
  },
  "logging_config": {
    "version": 1,
    "formatters": {
      "magicsFormatter": {
        "format": "%(asctime)s\t%(levelname)s\t%(message)s",
        "datefmt": ""
      }
    },
    "handlers": {
      "magicsHandler": {
        "class": "hdijupyterutils.filehandler.MagicsFileHandler",
        "formatter": "magicsFormatter",
        "home_path": "~/.sparkmagic"
      }
    },
    "loggers": {
      "magicsLogger": {
        "handlers": ["magicsHandler"],
        "level": "DEBUG",
        "propagate": 0
      }
    }
  },
  "authenticators": {
    "Kerberos": "sparkmagic.auth.kerberos.Kerberos",
    "None": "sparkmagic.auth.customauth.Authenticator",
    "Basic_Access": "sparkmagic.auth.basic.Basic"
  },
  "wait_for_idle_timeout_seconds": 15,
  "livy_session_startup_timeout_seconds": 60,
  "fatal_error_suggestion": "The code failed because of a fatal error:\n\t{}.\n\nSome things to try:\na) Make sure Spark has enough available resources for Jupyter to create a Spark context.\nb) Contact your Jupyter administrator to make sure the Spark magics library is configured correctly.\nc) Restart the kernel.",
  "ignore_ssl_errors": false,
  "session_configs": {
    "driverMemory": "1000M",
    "executorCores": 2
  },
  "use_auto_viz": true,
  "coerce_dataframe": true,
  "max_results_sql": 2500,
  "pyspark_dataframe_encoding": "utf-8",
  "heartbeat_refresh_seconds": 5,
  "livy_server_heartbeat_timeout_seconds": 60,
  "heartbeat_retry_seconds": 1,
  "server_extension_default_kernel_name": "pysparkkernel",
  "custom_headers": {},
  "retry_policy": "configurable",
  "retry_seconds_to_sleep_list": [0.2, 0.5, 1, 3, 5],
  "configurable_retry_policy_max_retries": 8
}
Second update 1/9
Back to square one. I keep getting this error and have spent days trying to debug it. Not sure what I did previously to get things going. I also checked my security group config and it looks fine (SSH on port 22).
An error was encountered:
Error sending http request and maximum retry encountered.
Created local port forwarding (SSH tunneling) to the Livy server on port 8998 and it works like magic.
sudo ssh -i ~/aws-key/my-pem-file.pem -N -L 8998:ec2-xx-xxx-xxx-xxx.us-west-2.compute.amazonaws.com:8998 hadoop@ec2-xx-xxx-xxx-xxx.us-west-2.compute.amazonaws.com
Did not change my config.json file from the 1/4 update.
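As a quick sanity check that the tunnel is actually forwarding to Livy (and that sparkmagic's url of http://localhost:8998 is reachable), something like the following sketch can be run locally; it only assumes the requests library and the SSH tunnel from the command above.

import requests

# With the SSH tunnel up, Livy on the EMR master should answer on localhost:8998.
resp = requests.get("http://localhost:8998/sessions", timeout=10)
resp.raise_for_status()
print(resp.json())  # e.g. {'from': 0, 'total': 0, 'sessions': []} when no sessions exist yet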

Spark Streaming Kinesis on EMR throws "Error while storing block into Spark"

We have a Spark (2.3.1) Streaming app running on EMR (5.16) that has been consuming from a single shard of AWS Kinesis for the past 2 years. Recently (for the past 2 months), we have been getting random errors in the app. We tried updating to the latest versions of Spark and EMR, and the error was not solved.
The problem
Out of the blue, at any moment and at any uptime, and with apparently no indicators in RAM, CPU, or networking, the Receiver stops sending messages to Spark and we get this error in the logs:
18/09/11 18:20:00 ERROR ReceiverTracker: Deregistered receiver for stream 0: Error while storing block into Spark - java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:201)
at org.apache.spark.streaming.receiver.WriteAheadLogBasedBlockHandler.storeBlock(ReceivedBlockHandler.scala:210)
at org.apache.spark.streaming.receiver.ReceiverSupervisorImpl.pushAndReportBlock(ReceiverSupervisorImpl.scala:158)
at org.apache.spark.streaming.receiver.ReceiverSupervisorImpl.pushArrayBuffer(ReceiverSupervisorImpl.scala:129)
at org.apache.spark.streaming.receiver.Receiver.store(Receiver.scala:133)
at org.apache.spark.streaming.kinesis.KinesisReceiver.org$apache$spark$streaming$kinesis$KinesisReceiver$$storeBlockWithRanges(KinesisReceiver.scala:306)
at org.apache.spark.streaming.kinesis.KinesisReceiver$GeneratedBlockHandler.onPushBlock(KinesisReceiver.scala:357)
at org.apache.spark.streaming.receiver.BlockGenerator.pushBlock(BlockGenerator.scala:297)
at org.apache.spark.streaming.receiver.BlockGenerator.org$apache$spark$streaming$receiver$BlockGenerator$$keepPushingBlocks(BlockGenerator.scala:269)
at org.apache.spark.streaming.receiver.BlockGenerator$$anon$1.run(BlockGenerator.scala:110)
The Spark app continues working without problems, but it doesn't have records to process.
The only related thing I found on the internet was this GitHub issue: https://github.com/awslabs/amazon-kinesis-client/issues/185
What I've tried
The final version of the configuration looks like this:
[
  {
    "Classification": "capacity-scheduler",
    "Properties": {
      "yarn.scheduler.capacity.resource-calculator": "org.apache.hadoop.yarn.util.resource.DominantResourceCalculator"
    }
  },
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.yarn.am.attemptFailuresValidityInterval": "1h",
      "spark.yarn.maxAppAttempts": "4",
      "spark.yarn.executor.failuresValidityInterval": "1h",
      "spark.task.maxFailures": "8",
      "spark.task.reaper.enabled": "true",
      "spark.task.reaper.killTimeout": "120s",
      "spark.metrics.conf": "metrics.properties",
      "spark.metrics.namespace": "spark",
      "spark.streaming.stopGracefullyOnShutdown": "true",
      "spark.streaming.receiver.writeAheadLog.enable": "true",
      "spark.streaming.receiver.writeAheadLog.closeFileAfterWrite": "true",
      "spark.streaming.driver.writeAheadLog.closeFileAfterWrite": "true",
      "spark.streaming.backpressure.enabled": "true",
      "spark.streaming.backpressure.initialRate": "500",
      "spark.dynamicAllocation.enabled": "false",
      "spark.default.parallelism": "18",
      "spark.driver.cores": "3",
      "spark.driver.memory": "4G",
      "spark.driver.memoryOverhead": "1024",
      "spark.driver.extraJavaOptions": "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+UseCompressedOops",
      "spark.executor.instances": "3",
      "spark.executor.cores": "3",
      "spark.executor.memory": "4G",
      "spark.executor.memoryOverhead": "1024",
      "spark.executor.heartbeatInterval": "20s",
      "spark.io.compression.codec": "zstd",
      "spark.rdd.compress": "true",
      "spark.executor.extraJavaOptions": "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+UseCompressedOops",
      "spark.python.worker.memory": "640m",
      "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version": "2",
      "spark.hadoop.mapreduce.fileoutputcommitter.cleanup-failures.ignored": "true",
      "spark.network.timeout": "240s",
      "spark.locality.wait": "0s",
      "spark.eventLog.enabled": "false"
    }
  },
  {
    "Classification": "emrfs-site",
    "Properties": {
      "fs.s3.consistent.metadata.tableName": "HiddenName",
      "fs.s3.consistent": "true",
      "fs.s3.consistent.retryCount": "1",
      "fs.s3.consistent.retryPeriodSeconds": "2",
      "fs.s3.consistent.retryPolicyType": "fixed",
      "fs.s3.consistent.throwExceptionOnInconsistency": "false"
    }
  }
]
Nothing there helped.
Workarounds
To get the Receiver sending messages again, I have to go to the Spark UI, Jobs tab, and kill the active job (I can't upload images because I don't have enough reputation). That triggers a new app and the Receiver starts sending records to Spark again. Of course, this is not a solution for a production system.
Programmatic workaround
I wrote a StreamingListener that captures onReceiverStopped and onReceiverError. When either fires, it creates a signal (an external file) that is checked every time ssc.awaitTerminationOrTimeout(check_interval) returns. If the signal was created, the app raises an exception and is killed. This generates a new appattempt and a new streaming app runs. The problems with this are, first, that it doesn't fix the original problem, and second, that after the first appattempt the StreamingListener no longer seems to be registered, so I can't recover the app.
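For illustration only, here is a rough PySpark sketch of the workaround described above; the signal-file path, check interval, and restart-by-exception behaviour are my assumptions, not the poster's exact code.

import os
from pyspark.streaming import StreamingContext
from pyspark.streaming.listener import StreamingListener

SIGNAL_FILE = "/tmp/receiver_dead.signal"   # hypothetical path
CHECK_INTERVAL = 30                         # seconds between checks (assumption)

class ReceiverWatchdog(StreamingListener):
    """Drops a signal file whenever the Kinesis receiver stops or errors."""

    def onReceiverStopped(self, receiverStopped):
        open(SIGNAL_FILE, "a").close()

    def onReceiverError(self, receiverError):
        open(SIGNAL_FILE, "a").close()

def run(ssc: StreamingContext):
    ssc.addStreamingListener(ReceiverWatchdog())
    ssc.start()
    # Poll for the signal instead of blocking forever in awaitTermination().
    while not ssc.awaitTerminationOrTimeout(CHECK_INTERVAL):
        if os.path.exists(SIGNAL_FILE):
            os.remove(SIGNAL_FILE)
            # Raising here kills the app so YARN starts a new application attempt
            # (spark.yarn.maxAppAttempts is set to 4 in the configuration above).
            raise RuntimeError("Receiver stopped or errored; failing the app to force a restart")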

In iOS, how to sync CBL with the server through the terminal without using the CouchDB dmg?

I am trying to run the Sync Gateway config from the terminal, but I don't understand how it works and the response I get isn't helpful. See the config below:
{
  "log": ["HTTP+"],
  "databases": {
    "grocery-sync": {
      "server": "http://localhost:8091",
      "bucket": "grocery-sync",
      "users": {
        "GUEST": {"disabled": false, "admin_channels": ["*"]}
      }
    }
  }
}
But I'm getting the response below, and I can't understand what exactly I need to do for automatic replication.
