I'm submitting a job to a Spark Standalone cluster from my local IntelliJ IDEA environment using the SparkLauncher API, but the submission fails with an error reporting that the command line is too long.
I changed the IDEA run configuration's "Shorten command line" setting as suggested online, but it has no effect.
While running a pipeline creation Python script I am facing the following error:
"AzureMLCompute job failed. JobConfigurationMaxSizeExceeded: The specified job configuration exceeds the max allowed size of 32768 characters. Please reduce the size of the job's command line arguments and environment settings"
I haven't seen that error before! My guess is that you're passing data as a string argument to a downstream pipeline step when you should be using PipelineData or OutputFileDatasetConfig.
I strongly suggest you read more about moving data between steps of an AML pipeline
We hit this when we tried to pass quite lengthy content as an argument value to a pipeline. You can instead upload the file to blob storage, optionally create a dataset from it, and then pass the dataset name or file path to the AML pipeline as the parameter; the pipeline step then reads the content of the file from the blob.
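For illustration, here is a minimal sketch of wiring two steps together with PipelineData so that only a mount path, not the content itself, travels on the command line. It assumes the v1 azureml-sdk; the compute target name, script names and source directories are placeholders:

import os
from azureml.core import Workspace
from azureml.pipeline.core import Pipeline, PipelineData
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()
datastore = ws.get_default_datastore()

# Large intermediate content travels as data, not as a command-line argument.
prepared = PipelineData("prepared", datastore=datastore)

prep_step = PythonScriptStep(
    name="prepare",
    script_name="prepare.py",          # placeholder script
    arguments=["--output", prepared],  # the step writes its (possibly large) output here
    outputs=[prepared],
    compute_target="cpu-cluster",      # placeholder compute target
    source_directory="./prepare",
)

train_step = PythonScriptStep(
    name="train",
    script_name="train.py",            # placeholder script
    arguments=["--input", prepared],   # only a path is passed on the command line
    inputs=[prepared],
    compute_target="cpu-cluster",
    source_directory="./train",
)

pipeline = Pipeline(workspace=ws, steps=[prep_step, train_step])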
I am having some problems with Apache Zeppelin and I am not sure what I am missing. Basically, I am trying to trigger a paragraph from another paragraph in Apache Zeppelin using z.run. When I run the paragraph with z.run I get no error, but the paragraph I specified is not run either. On the other hand, if I enter a wrong paragraph ID on purpose, I get an error that this paragraph doesn't exist.
I have tried z.runParagraph, z.run with %spark, and z.z.run with %pyspark. However, when I copy the code from the angular_frontend.html example here, I am able to run another paragraph by filling in the paragraph ID and clicking "run paragraph".
What I am trying is the following:
%spark
z.run("paragraphID")
%pyspark
print("Hello World")
I expect "Hello World" to be printed, but it is not.
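For reference, the same effect can also be triggered from outside the interpreter through Zeppelin's notebook REST API. A rough sketch with the requests library, assuming anonymous access and with the server address, note ID and paragraph ID as placeholders:

import requests

ZEPPELIN = "http://localhost:8080"                        # placeholder Zeppelin server
NOTE_ID = "2ABCDEFGH"                                     # placeholder note ID
PARAGRAPH_ID = "paragraph_1234567890_123456789"           # placeholder paragraph ID

# Run one paragraph synchronously (POST /api/notebook/run/{noteId}/{paragraphId});
# use /api/notebook/job/{noteId}/{paragraphId} instead to run it asynchronously.
resp = requests.post(f"{ZEPPELIN}/api/notebook/run/{NOTE_ID}/{PARAGRAPH_ID}")
print(resp.status_code, resp.text)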
I have a Spark 2.1 standalone cluster installed on 2 hosts.
There are two Zeppelin (0.7.1) notebooks:
the first one prepares data, makes aggregations and saves the output to files with:
data.write.option("header", "false").csv(file)
the second one is a notebook with shell paragraphs that merge all part* files from the Spark output into one file.
I would like to ask about 2 cases:
How to configure Spark to write the output to one file (see the sketch after the notebook outline below)
After notebook 1 completes, how to add a dependency so that all paragraphs in notebook 2 are run, e.g.:
NOTEBOOK 1:
data.write.option("header", "false").csv(file)
"run notebook2"
NOTEBOOK2:
shell code
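Regarding the first case, a minimal sketch (assuming data is a pyspark DataFrame, as in the paragraph above): coalescing to a single partition before writing makes Spark emit a single part file, at the cost of funnelling all data through one task.

# Collapse the output to one partition so the csv writer produces a single part-* file.
data.coalesce(1).write.option("header", "false").csv(file)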
Have you tried adding a paragraph at the end of note 1 that executes note 2 through the Zeppelin API? You can optionally add a loop that checks whether all paragraphs have finished execution, also through the API.
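As a rough illustration of that suggestion (assuming the requests library, anonymous access, placeholder server address and note ID, and Zeppelin's notebook REST API; the exact response shape may differ between Zeppelin versions), the last paragraph of note 1 could do something like:

import time
import requests

ZEPPELIN = "http://localhost:8080"   # placeholder Zeppelin server
NOTE2_ID = "2ABCDEFGH"               # placeholder ID of notebook 2

# Ask Zeppelin to run all paragraphs of note 2 asynchronously.
requests.post(f"{ZEPPELIN}/api/notebook/job/{NOTE2_ID}")

# Poll the paragraph statuses until everything reports FINISHED (or something fails).
while True:
    statuses = requests.get(f"{ZEPPELIN}/api/notebook/job/{NOTE2_ID}").json()["body"]
    if any(p.get("status") == "ERROR" for p in statuses):
        raise RuntimeError("a paragraph in note 2 failed")
    if all(p.get("status") == "FINISHED" for p in statuses):
        break
    time.sleep(5)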
When we submit an application to Spark and perform an operation, the Spark web UI displays jobs and stages like count at MyJob.scala:15. But in my application there are multiple count and save operations, so it is very difficult to understand the UI. Instead of count at MyJob.scala:15, can we add a custom description to give more detailed information about the job?
While googling I found https://issues.apache.org/jira/browse/SPARK-3468 and https://github.com/apache/spark/pull/2342; the author attached an image with detailed descriptions like 'Count', 'Cache and Count', 'Job with delays'. Can we achieve the same? I am using Spark 2.0.0.
Use sc.setJobGroup:
Examples:
python:
In [28]: sc.setJobGroup("my job group id", "job description goes here")
In [29]: lines = sc.parallelize([1,2,3,4])
In [30]: lines.count()
Out[30]: 4
Scala:
scala> sc.setJobGroup("my job group id", "job description goes here")
scala> val lines = sc.parallelize(List(1,2,3,4))
scala> lines.count()
res3: Long = 4
Spark UI: (screenshot)
I hope this is what you are looking for.
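As an illustration of how this helps when an application has several count and save phases (a pyspark sketch; the paths, group IDs and descriptions are placeholders):

# Give each logical phase its own group id and description, so the jobs in the web UI
# read "Load and cache input" / "Aggregate and save" instead of "count at MyJob.scala:15".
sc.setJobGroup("load", "Load and cache input")
lines = sc.textFile("hdfs:///data/input.txt").cache()    # placeholder path
lines.count()

sc.setJobGroup("save", "Aggregate and save")
counts = lines.map(lambda l: (l, 1)).reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs:///data/output")             # placeholder path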
Note that the new Zeppelin 0.8 loses its tracking hook if you change the job group name and can't display its job progress bar (the job keeps working; there is no effect on the job itself).
You can use
sc.setLocalProperty("callSite.short","my job description")
sc.setLocalProperty("callSite.long","my job details long description")
instead
See How to change job/stage description in web UI? for some screen captures and Scala syntax.
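For example, a minimal pyspark sketch (assuming an existing SparkContext sc and a DataFrame df); the properties are thread-local and apply to every job triggered afterwards on that thread until they are changed:

# The short form becomes the job/stage description in the web UI,
# the long form shows up in the detail pages.
sc.setLocalProperty("callSite.short", "my job description")
sc.setLocalProperty("callSite.long", "my job details long description")
df.count()   # this job is now listed under the custom description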
I am submitting a Spark job in cluster deploy mode and I get the submission ID in my code. In order to use the Spark REST API we need the application ID. So how can we get the application ID from the submission ID programmatically?
To answer this question, it is assumed that you know the application ID can be obtained inside the application with:
scala> spark.sparkContext.applicationId
Getting it from outside the running application is a little different. Over the REST API it is possible to send commands, for example as in the following useful tutorial about the Apache Spark hidden REST API: http://arturmkrtchyan.com/apache-spark-hidden-rest-api
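For instance, a sketch of querying the standalone master's REST submission server described in that tutorial, assuming it is reachable on the default port 6066 and using placeholder values for the master host and the submission ID; the response reports the state of the submitted driver:

import requests

MASTER_REST = "http://my-cluster:6066"            # placeholder master REST endpoint
SUBMISSION_ID = "driver-20170101000000-0000"      # placeholder submission ID

resp = requests.get(f"{MASTER_REST}/v1/submissions/status/{SUBMISSION_ID}")
print(resp.json())   # contains fields such as driverState and success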
Another solution is to program it yourself. You can send a spark-submit in the following style:
$SPARK_HOME/bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://my-cluster:7077 \
--deploy-mode cluster \
/path/to/examples.jar 1000
In the terminal you should see the log output being produced; the application ID (App_ID) appears in that log. $SPARK_HOME is the environment variable pointing to the folder where Spark is installed.
With Python, for example, it is possible to obtain the App_ID with the code described below. First we build the command as a list and run it with Python's subprocess module. subprocess can create a PIPE from which you can read the log output instead of letting Spark print it to the terminal. Make sure to call communicate() after Popen to avoid deadlocking on the OS pipe buffer. Then split the output into lines and scan through them to find the App_ID. The example can be found below:
import os
import subprocess

# Each flag and its value must be a separate list element when shell=False,
# and environment variables are not expanded, so resolve SPARK_HOME ourselves.
submitSparkList = [os.path.join(os.environ['SPARK_HOME'], 'bin', 'spark-submit'),
                   '--class', 'org.apache.spark.examples.SparkPi',
                   '--master', 'spark://my-cluster:7077',
                   '--deploy-mode', 'cluster',
                   '/path/to/examples.jar', '1000']
sparkCommand = subprocess.Popen(submitSparkList, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
stdout, stderr = sparkCommand.communicate()  # read both pipes to completion to avoid deadlock
for line in stderr.decode().splitlines():
    if "Connected to Spark cluster" in line:  # this is the first log line that contains the ID
        app_ID_index = line.find('app-')
        app_ID = line[app_ID_index:]          # this gives the app_ID
        print('The app ID is ' + app_ID)
The Python documentation contains the following warning about not using the communicate() function:
https://docs.python.org/2/library/subprocess.html
Warning This will deadlock when using stdout=PIPE and/or stderr=PIPE and the child process generates enough output to a pipe such that it blocks waiting for the OS pipe buffer to accept more data. Use communicate() to avoid that.