Zeppelin - run paragraphs in order - apache-spark

I have a Spark 2.1 standalone cluster installed on 2 hosts.
There are two Zeppelin (0.7.1) notebooks:
first one: prepares data, makes aggregations and saves the output to files with:
data.write.option("header", "false").csv(file)
second one: a notebook with shell paragraphs that merges all part* files from the Spark output into one file
I would like to ask about two things:
How to configure Spark to write the output to a single file
After notebook 1 has completed, how to add a dependency so that all paragraphs in notebook 2 are run, e.g.:
NOTEBOOK 1:
data.write.option("header", "false").csv(file)
"run notebook2"
NOTEBOOK2:
shell code

Have you tried adding a paragraph at the end of note1 that executes note2 through the Zeppelin API? You can optionally add a loop that checks, also through the API, whether all of its paragraphs have finished executing.
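A minimal sketch of such a final paragraph, using the Zeppelin REST API. The host/port and the note ID 2C1234567 are placeholders you would replace with your own values, and the response layout follows the Zeppelin 0.7 REST docs, so verify it against your instance:

%python
import time
import requests

zeppelin = "http://localhost:8080"   # placeholder Zeppelin server address
note2_id = "2C1234567"               # placeholder: note 2's ID, visible in its URL

# Run all paragraphs of notebook 2 asynchronously.
requests.post(zeppelin + "/api/notebook/job/" + note2_id)

# Poll the job status until no paragraph is PENDING or RUNNING any more.
while True:
    statuses = requests.get(zeppelin + "/api/notebook/job/" + note2_id).json()["body"]
    if all(p["status"] not in ("PENDING", "RUNNING") for p in statuses):
        break
    time.sleep(5)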

Related

Is it possible to install a Databricks notebook into a cluster similarly to a library?

I want to be able to have the outputs/functions/definitions of a notebook available to other notebooks in the same cluster without always having to run the original one over and over...
For instance, I want to avoid:
definitions_file: has multiple commands, functions, etc.
notebook_1
#invoking definitions file
%run ../../0_utilities/definitions_file
notebook_2
#invoking definitions file
%run ../../0_utilities/definitions_file
.....
Therefore I want definitions_file to be available to all other notebooks running on the same cluster.
I am using Azure Databricks.
Thank you!
No, there is no such thing as a "shared notebook" that is implicitly imported. The closest you can get is to package your code as a Python library or put it into a Python file inside Repos, but you will still need to write from my_cool_package import * in all notebooks.
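A minimal sketch of the Repos variant, assuming a hypothetical file utils.py at the repo root and a Databricks Runtime recent enough to put the repo root on sys.path for notebooks inside the Repo:

# utils.py (a plain Python file in the repo, next to the notebooks)
def clean_columns(df):
    """Lower-case and strip all column names of a DataFrame."""
    return df.toDF(*[c.strip().lower() for c in df.columns])

# notebook_1, notebook_2, ... each still need one import line:
from utils import clean_columns
cleaned = clean_columns(raw_df)   # raw_df is whatever DataFrame the notebook already has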

double percent spark sql in jupyter notebook

I'm using a Jupyter Notebook on a Spark EMR cluster and want to learn more about a certain command, but I don't know what the right technology stack is to search. Is that Spark? Python? Jupyter special syntax? PySpark?
When I try to google it, I get only a couple of results and none of them actually include the content I quoted. It's like it ignores the %%.
What does "%%spark_sql" do, where does it originate from, and what arguments can you pass to it, like -s and -n?
An example might look like
%%spark_sql -s true
select
*
from df
These are called magic commands/functions. Try running %pinfo %%spark_sql or %pinfo2 %%spark_sql in a Jupyter cell and see if it gives you detailed information about %%spark_sql.
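If that does not resolve the name, a couple of other standard IPython introspection commands are worth trying (a sketch; it assumes an ordinary IPython kernel, and the output depends on which package registered the magic):

%lsmagic          # run in its own cell: lists every line (%) and cell (%%) magic registered in the kernel
%%spark_sql?      # run in its own cell: IPython's "?" help, which usually reveals the package defining the magic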

Triggering a paragraph in Apache Zeppelin using z.run()

I am having some problems with Apache Zeppelin and I am not sure what I am missing. Basically,
I am trying to trigger a paragraph from another paragraph in Apache Zeppelin using z.run. When I run the paragraph with z.run I get no error, but the paragraph I specified is not run either. On the other hand, if I enter a wrong paragraph ID on purpose, I get an error that this paragraph doesn't exist.
I have tried z.runParagraph, z.run with Spark, and z.z.run with PySpark. However, when I copy the code from here (angular_frontend.html),
I am able to run another paragraph by filling in the paragraph ID and clicking "run paragraph".
What I am trying is the following:
%spark
z.run("paragraphID")
%pyspark
print("Hello World")
I expect "Hello World" to be printed, but it is not.

How to export data from a dataframe to a file databricks

I'm currently doing the Introduction to Spark course at edX.
Is there a possibility to save dataframes from Databricks to my computer?
I'm asking because this course provides Databricks notebooks which probably won't work after the course.
In the notebook data is imported using command:
log_file_path = 'dbfs:/' + os.path.join('databricks-datasets',
                                        'cs100', 'lab2', 'data-001', 'apache.access.log.PROJECT')
I found this solution but it doesn't work:
df.select('year','model').write.format('com.databricks.spark.csv').save('newcars.csv')
Databricks runs on a cloud VM and has no idea where your local machine is located. If you want to save the CSV results of a DataFrame, you can run display(df), and there is an option to download the results.
You can also save it to the file store and download it via its handle, e.g.
df.coalesce(1).write.format("com.databricks.spark.csv").option("header", "true").save("dbfs:/FileStore/df/df.csv")
You can find the handle in the Databricks GUI by going to Data > Add Data > DBFS > FileStore > your_subdirectory > part-00000-...
The download URL in this case (for a Databricks West Europe instance) is:
https://westeurope.azuredatabricks.net/files/df/df.csv/part-00000-tid-437462250085757671-965891ca-ac1f-4789-85b0-akq7bc6a8780-3597-1-c000.csv
I haven't tested it, but I would assume that the 1 million row limit that applies when downloading via the answer mentioned by @MrChristine does not apply here.
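To locate the exact part file name without clicking through the GUI, here is a small sketch using the dbutils.fs utilities available in Databricks notebooks (it assumes the default mapping of dbfs:/FileStore/... to the /files/... download path):

# List the output directory written above and print the download path of its part file.
for f in dbutils.fs.ls("dbfs:/FileStore/df/df.csv"):
    if f.name.startswith("part-"):
        print(f.path.replace("dbfs:/FileStore/", "/files/"))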
Try this.
df.write.format("com.databricks.spark.csv").save("file:///home/yphani/datacsv")
This will save the file on the Unix server.
If you give only /home/yphani/datacsv, it looks for the path on HDFS.

how to get Application ID from Submission ID or Driver ID programmatically

I am submitting a Spark job in cluster deploy mode and I get the submission ID in my code. In order to use the Spark REST API we need the application ID, so how can we get the application ID from the submission ID programmatically?
To answer this question, it is assumed to be known that, inside the application, the application ID can be obtained with:
scala> spark.sparkContext.applicationId
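The same attribute exists in PySpark, so inside a running application you can print it directly:

# Inside the running application (PySpark): the SparkContext exposes the application ID.
app_id = spark.sparkContext.applicationId
print(app_id)   # e.g. app-20171123091230-0001 on a standalone cluster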
If you want to get it from outside the application, it is a little different. Over the REST API it is possible to send commands, as in the following useful tutorial about the hidden Apache Spark REST API: http://arturmkrtchyan.com/apache-spark-hidden-rest-api
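A sketch of what querying that hidden REST API looks like. It assumes a standalone master with the REST server listening on port 6066; the submission ID driver-20170101000000-0000 is a placeholder, and the exact response fields should be checked against the tutorial above for your Spark version:

import requests

# Ask the standalone master for the status of a submission (driver).
resp = requests.get(
    "http://my-cluster:6066/v1/submissions/status/driver-20170101000000-0000"
).json()
print(resp["driverState"])   # e.g. RUNNING or FINISHED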
Another solution is to program it yourself. You can send a spark-submit in the following style:
$SPARK_HOME/bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://my-cluster:7077 \
--deploy-mode cluster \
/path/to/examples.jar 1000
In the background of the terminal you should then see the log output being produced, and the application ID (App_ID) appears in that log. $SPARK_HOME is the environment variable pointing to the folder where Spark is installed.
With Python, for example, it is possible to obtain the App_ID with the code described below. First we build a list to send the command with Python's subprocess module. subprocess can create a PIPE from which you can extract the log information, instead of letting Spark print it to the terminal as it does by default. Make sure to call communicate() after the Popen, to avoid deadlocking on the OS pipe buffer. Then split the output into lines and scan through them to find the App_ID. The example can be found below:
import os
import subprocess

# Build the spark-submit command as a list: each flag and its value must be a separate item.
spark_submit = os.path.join(os.environ['SPARK_HOME'], 'bin', 'spark-submit')
submitSparkList = [spark_submit, '--class', 'org.apache.spark.examples.SparkPi',
                   '--master', 'spark://my-cluster:7077', '--deploy-mode', 'cluster',
                   '/path/to/examples.jar', '1000']

# PIPE the output instead of letting spark-submit write it to the terminal.
sparkCommand = subprocess.Popen(submitSparkList, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
stdout, stderr = sparkCommand.communicate()   # communicate() avoids the pipe-buffer deadlock

# spark-submit logs to stderr; scan it for the line that contains the application ID.
for line in stderr.decode().splitlines():
    if "Connected to Spark cluster" in line:   # this is the first log line that contains the ID
        app_ID = line[line.find('app-'):]      # e.g. "app-20171123091230-0001"
        print('The app ID is ' + app_ID)
The Python documentation contains a warning about what happens when the communicate() function is not used:
https://docs.python.org/2/library/subprocess.html
Warning This will deadlock when using stdout=PIPE and/or stderr=PIPE and the child process generates enough output to a pipe such that it blocks waiting for the OS pipe buffer to accept more data. Use communicate() to avoid that.
