How do I read task parameters in a Databricks Job?

The job is running a Python file, not a notebook. All the examples I can find online use the notebook utility, which Databricks says is unreliable inside executors.

Figured it out - when calling from the Jobs API, you have to include
"spark_python_task":{"python_file":"dbfs:/FileStore/my_script.py", "parameters": ["param1", "param2"]}
and when calling from the UI (run with new parameters) you need to include
{"spark_python_task": ["param1","param2"]}
In both cases, you read the parameters with sys.argv (sys.argv[1] and sys.argv[2] in this example).
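For example, a minimal sketch of what the submitted script (the hypothetical my_script.py referenced above) could look like when it reads those two positional parameters:

# my_script.py - sketch of a spark_python_task entry point that reads its
# parameters from sys.argv (sys.argv[0] is the script path itself).
import sys

from pyspark.sql import SparkSession

def main() -> None:
    # Parameters passed as "parameters": ["param1", "param2"] arrive as
    # plain strings, exactly like a normal command-line invocation.
    param1 = sys.argv[1]
    param2 = sys.argv[2]

    spark = SparkSession.builder.appName("read-task-parameters").getOrCreate()
    print(f"Received parameters: {param1!r}, {param2!r}")
    # ... use the parameters, e.g. as an input path and a filter value ...

if __name__ == "__main__":
    main()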

Related

pyspark using Pytest is not showing spark UI

I have written a pytest case using PySpark (Spark 3.0) for reading a file and getting the count of a data frame, but I am unable to see the Spark UI and I am getting an OOM error. What is the solution, and how can I debug without seeing the Spark UI?
Thanks,
Xi
Wrap all tests inside a shell script, call that shell script from a Python script,
then run it using spark-submit python.py.
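As a rough sketch of that setup (the file names run_tests.sh and run_tests.py are hypothetical): the shell script runs pytest, and the Python wrapper below is what you hand to spark-submit, e.g. spark-submit run_tests.py.

# run_tests.py - hypothetical wrapper that invokes the shell script
# containing the pytest run, so the whole suite can be launched with
# spark-submit as suggested above.
import subprocess
import sys

def main() -> None:
    # run_tests.sh is assumed to contain something like: pytest tests/ -v
    result = subprocess.run(["bash", "run_tests.sh"], check=False)
    sys.exit(result.returncode)

if __name__ == "__main__":
    main()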

Is it possible to run "spark-submit" in Databricks without creating jobs? If yes, what are the possibilities?

I am trying to execute spark-submit in a Databricks workspace notebook without creating jobs. Please help!
No, that is not possible in the way you would with /bin/spark-submit; it does not fit with Databricks' notebook-centric approach, which aims to make things easier for less technical users.
The closest you can get is as stated here: https://docs.databricks.com/dev-tools/api/latest/examples.html#create-a-spark-submit-job
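For reference, a rough sketch of creating such a spark-submit job through the Jobs API, modelled on the linked example; the workspace URL, token, cluster spec, and application path are placeholders:

# Sketch of creating a spark-submit job via the Databricks Jobs API
# (POST /api/2.0/jobs/create), following the linked documentation example.
# Workspace URL, token, node type and jar path below are placeholders.
import requests

DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"

job_spec = {
    "name": "spark-submit-example",
    "new_cluster": {
        "spark_version": "7.3.x-scala2.12",
        "node_type_id": "i3.xlarge",
        "num_workers": 2,
    },
    "spark_submit_task": {
        "parameters": [
            "--class", "org.apache.spark.examples.SparkPi",
            "dbfs:/FileStore/spark-examples.jar", "10",
        ]
    },
}

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=job_spec,
)
print(resp.json())  # should contain the new job_id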

How to measure read and write time on HDFS using a Spark job?

I just started working on the qualification of a big data platform, and I would like proposals on how to test the performance of reading and writing on HDFS.
If you are running Spark jobs for the read and write operations, you can see the job time in the application manager (localhost:50070); if you are using spark-shell, you have to measure the time manually, or you can use a timing function.
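As a rough illustration of the manual-timing approach mentioned above, something like the following (the HDFS path and the toy data set are placeholders) wraps a write and a read in timers:

# Sketch: manually timing an HDFS write and read from PySpark.
import time

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-io-timing").getOrCreate()
df = spark.range(10_000_000)  # toy data set

start = time.perf_counter()
df.write.mode("overwrite").parquet("hdfs:///tmp/io_benchmark")
write_seconds = time.perf_counter() - start

start = time.perf_counter()
row_count = spark.read.parquet("hdfs:///tmp/io_benchmark").count()  # count() forces the read
read_seconds = time.perf_counter() - start

print(f"write: {write_seconds:.2f}s, read: {read_seconds:.2f}s, rows: {row_count}")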

Running a PySpark code in python vs spark-submit

I have a PySpark code/application. What is the best way to run it (to utilize the maximum power of PySpark): using the Python interpreter or using spark-submit?
The SO answer here was almost similar but did not explain it in great detail. I would love to know why.
Any help is appreciated. Thanks in advance.
I am assuming that when you say python interpreter you are referring to the pyspark shell.
You can run your Spark code either way, using the pySpark interpreter or spark-submit, or even with one of the available notebooks (Jupyter/Zeppelin).
When to use PySpark Interpreter.
Generally, when we are learning or doing some very basic operations for understanding or exploration purposes, we use the pySpark interpreter.
Spark Submit.
This is usually used when you have written your entire application in pySpark and packaged it into .py files, so that you can submit your entire code to the Spark cluster for execution.
A little analogy may help here. Take the example of Unix shell commands: we can execute shell commands directly at the command prompt, or we can create a shell script (.sh) to execute a bunch of instructions at once. You can think of the pySpark interpreter and the spark-submit utility similarly: in the pySpark interpreter you execute individual commands, whereas with spark-submit you package your Spark application into .py files and submit the whole thing for execution.
Hope this helps.
Regards,
Neeraj
Running your job in the pyspark shell will always be in client mode, whereas using spark-submit you can execute it in either mode, i.e. client or cluster.
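To make the distinction concrete, here is a minimal sketch of a packaged PySpark application and the spark-submit invocation that would run it (the file name, master and deploy mode are illustrative):

# app.py - a self-contained PySpark application that can be handed to
# spark-submit, e.g.:
#   spark-submit --master yarn --deploy-mode cluster app.py
from pyspark.sql import SparkSession

def main() -> None:
    spark = SparkSession.builder.appName("word-count-example").getOrCreate()
    # Toy workload: count words in an in-memory list of lines.
    lines = spark.sparkContext.parallelize(["hello spark", "hello pyspark"])
    counts = lines.flatMap(str.split).map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
    for word, count in counts.collect():
        print(word, count)
    spark.stop()

if __name__ == "__main__":
    main()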

SparkUI for pyspark - corresponding line of code for each stage?

I have a PySpark program running on an AWS cluster. I am monitoring the job through the Spark UI (see attached). However, I noticed that unlike a Scala or Java Spark program, where the UI shows which line of code each stage corresponds to, I can't find which line of the PySpark code each stage corresponds to.
Is there a way to figure out which stage corresponds to which line of the PySpark code?
Thanks!
When you run a toPandas call, the corresponding line in the Python code is shown in the SQL tab. Other collect-style commands, such as count or a parquet write, do not show the line number. I'm not sure why that is, but I find it can be very handy.
Is there a way to figure out which stage corresponds to which line of the PySpark code?
Yes. The Spark UI shows the Scala methods called by the PySpark actions in your Python code. Armed with the PySpark codebase, you can readily identify the calling PySpark method. In your example, cache is self-explanatory, and a quick search for javaToPython reveals that it is called by the PySpark DataFrame.rdd method.
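For illustration, a tiny sketch of the kind of PySpark calls whose stages would surface those Scala entry points (cache and javaToPython) in the UI:

# Sketch: PySpark calls whose stages map to the Scala methods named above.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stage-name-demo").getOrCreate()
df = spark.range(1_000_000)

df.cache().count()  # materialises the cache; the resulting stage references cache
rdd = df.rdd        # DataFrame.rdd goes through javaToPython under the hood
print(rdd.take(3))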
