Notebook to write Java jobs for Spark

I am writing my first Spark job using Java API.
I want to run it using a notebook.
I am looking into Zeppelin and Jupyter.
In the Zeppelin documentation I see support for Scala, IPySpark, and SparkR. It is not clear to me whether using two interpreters (%spark.sql and %java) will allow me to work with the Java API of Spark SQL.
Jupyter has an "IJava" kernel, but I see no support for Spark with Java.
Are there other options?

@Victoriia Zeppelin 0.9.0 has a %java interpreter, with an example here:
zeppelin.apache.org
I tried to start with it on Google Cloud, but had some problems...

Use the magic command %jars path/to/spark.jar in an IJava cell, as suggested by IJava's author,
then take a look at import org.apache.spark.sql.*, for example.

Related

PySpark using pytest is not showing the Spark UI

I have written a pytest test case using PySpark (Spark 3.0) that reads a file and gets a DataFrame count, but I am unable to see the Spark UI and I am getting an OOM error. What is the solution, and how can I debug without seeing the Spark UI?
Thanks,
Xi
Wrap all tests inside a shell script, call that shell script from a Python file (python.py),
and then run it using spark-submit python.py
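A different, commonly used setup is to build the SparkSession inside a session-scoped pytest fixture, so the Spark UI (http://localhost:4040 by default) stays up for the whole test run and driver memory can be raised while chasing the OOM. A minimal sketch, with made-up file paths and settings:

```python
# conftest.py-style sketch (illustrative only): a session-scoped SparkSession
# fixture so the Spark UI stays available while the tests run.
import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    session = (
        SparkSession.builder
        .master("local[2]")                   # local mode with 2 threads
        .appName("pytest-pyspark")
        .config("spark.ui.enabled", "true")   # UI is on by default; shown for clarity
        .config("spark.driver.memory", "2g")  # bump driver memory while debugging the OOM
        .getOrCreate()
    )
    yield session
    session.stop()

def test_row_count(spark):
    # Placeholder path; replace with the real test file.
    df = spark.read.csv("tests/data/sample.csv", header=True)
    assert df.count() > 0
```

Note that depending on how the tests are launched, driver memory may need to be set before the JVM starts (e.g. via spark-submit) rather than in the builder.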

Kedro + Airflow on Spark

I'm looking for a Kedro + Airflow implementation on Spark. Is the plugin now available for Spark?
I looked at PipelineX but couldn't find relevant examples for Spark.
I haven't prepared or seen an example of using Spark with PipelineX or Airflow, but it should be possible to use kedro-airflow to run tasks on Spark.
The following document and DataEngineerOne's video might be helpful.
https://kedro.readthedocs.io/en/stable/10_tools_integration/01_pyspark.html?highlight=Spark
https://www.youtube.com/watch?v=vYBMpPZep6E
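The linked PySpark guide essentially comes down to creating one SparkSession up front and writing nodes as plain functions over Spark DataFrames, which kedro-airflow can then schedule like any other node. A rough, hypothetical sketch (the Kedro hook / Airflow operator wiring is version-specific and omitted; all names and paths are illustrative):

```python
# Illustrative sketch only: a SparkSession initialised once, plus a node-style
# function over Spark DataFrames. Hook/operator wiring is left out on purpose.
from pyspark.sql import DataFrame, SparkSession

def get_spark() -> SparkSession:
    # In Kedro this would typically live in a project hook; in Airflow, in the
    # task that runs the pipeline.
    return (
        SparkSession.builder
        .appName("kedro-pipeline")
        .master("local[*]")   # or yarn/k8s, depending on the cluster
        .getOrCreate()
    )

def count_by_column(df: DataFrame, column: str) -> DataFrame:
    # A plain function like this can be registered as a Kedro node and then
    # scheduled through kedro-airflow like any other node.
    return df.groupBy(column).count()

if __name__ == "__main__":
    spark = get_spark()
    df = spark.read.csv("data/01_raw/example.csv", header=True)  # placeholder path
    count_by_column(df, "some_column").show()
```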

Running PySpark code in Python vs spark-submit

I have a PySpark application. What is the best way to run it (to utilize the maximum power of PySpark): using the Python interpreter or using spark-submit?
The SO answer here was almost similar but did not explain it in great detail. I would love to know why.
Any help is appreciated. Thanks in advance.
I am assuming that when you say Python interpreter you are referring to the pyspark shell.
You can run your Spark code both ways: using the pyspark interpreter, using spark-submit, or even with the various available notebooks (Jupyter/Zeppelin).
When to use the pyspark interpreter:
Generally, we use the pyspark interpreter when we are learning or doing some very basic operations for understanding or exploration purposes.
When to use spark-submit:
This is usually used when you have written your entire application in PySpark and packaged it into .py files, so that you can submit your entire code to the Spark cluster for execution.
A little analogy may help here. Take Unix shell commands: we can execute shell commands directly at the command prompt, or we can create a shell script (.sh) to execute a bunch of instructions at once. You can think of the pyspark interpreter and the spark-submit utility the same way: in the pyspark interpreter you execute individual commands, whereas with spark-submit you package your Spark application into .py files and submit it for execution.
Hope this helps.
Regards,
Neeraj
Running your job in the pyspark shell will always be in client mode, whereas using spark-submit you can execute it in either mode, i.e. client or cluster.
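To make the packaging point concrete, here is a minimal sketch of the kind of .py application you would hand to spark-submit (the file name, input path, and master are placeholders):

```python
# my_app.py -- a minimal, self-contained PySpark application (illustrative only).
#
# Interactive exploration:  pyspark
#                           (the driver always runs in client mode)
# Packaged execution:       spark-submit --master yarn --deploy-mode cluster my_app.py
#                           (client or cluster mode is your choice with spark-submit)
from pyspark.sql import SparkSession

def main():
    spark = SparkSession.builder.appName("my_app").getOrCreate()
    # Placeholder input path; replace with your real data.
    df = spark.read.json("hdfs:///data/events.json")
    df.groupBy("event_type").count().show()
    spark.stop()

if __name__ == "__main__":
    main()
```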

What is the difference between sqlContext.read.load and sqlContext.read.text?

I am only trying to read a text file into a PySpark RDD, and I am noticing huge differences between sqlContext.read.load and sqlContext.read.text.
s3_single_file_inpath='s3a://bucket-name/file_name'
indata = sqlContext.read.load(s3_single_file_inpath, format='com.databricks.spark.csv', header='true', inferSchema='false', sep=',')
indata = sqlContext.read.text(s3_single_file_inpath)
The sqlContext.read.load command above fails with
Py4JJavaError: An error occurred while calling o227.load.
: java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.csv. Please find packages at http://spark-packages.org
But the second one succeeds?
Now, I am confused by this because all of the resources I see online say to use sqlContext.read.load including this one: https://spark.apache.org/docs/1.6.1/sql-programming-guide.html.
It is not clear to me when to use which of these. Is there a clear distinction between them?
What is the difference between sqlContext.read.load and sqlContext.read.text?
sqlContext.read.load assumes parquet as the data source format while sqlContext.read.text assumes text format.
With sqlContext.read.load you can define the data source format using format parameter.
Depending on the version of Spark (1.6 vs 2.x), you may or may not need to load an external Spark package to have support for the CSV format.
As of Spark 2.0 you no longer have to load spark-csv Spark package since (quoting the official documentation):
NOTE: This functionality has been inlined in Apache Spark 2.x. This package is in maintenance mode and we only accept critical bug fixes.
That would explain your confusion: you may have been using Spark 1.6.x and had not loaded the Spark package for CSV support.
Now, I am confused by this because all of the resources I see online say to use sqlContext.read.load including this one: https://spark.apache.org/docs/1.6.1/sql-programming-guide.html.
https://spark.apache.org/docs/1.6.1/sql-programming-guide.html is for Spark 1.6.1, when the spark-csv package was not part of Spark. That happened in Spark 2.0.
It is not clear to me when to use which of these. Is there a clear distinction between them?
There's actually none if you use Spark 2.x.
If, however, you use Spark 1.6.x, spark-csv has to be loaded separately using the --packages option (as described in Using with Spark shell):
This package can be added to Spark using the --packages command line option. For example, to include it when starting the spark shell
As a matter of fact, you can still use com.databricks.spark.csv format explicitly in Spark 2.x as it's recognized internally.
The difference is:
text is a built-in input format in Spark 1.6
com.databricks.spark.csv is a third party package in Spark 1.6
To use the third-party spark-csv package (no longer needed in Spark 2.0) you have to follow the instructions on the spark-csv site, for example providing the
--packages com.databricks:spark-csv_2.10:1.5.0
argument with spark-submit / pyspark commands.
Beyond that, sqlContext.read.formatName(...) is syntactic sugar for sqlContext.read.format("formatName").load(...) and sqlContext.read.load(..., format="formatName").
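A short sketch of those equivalences in PySpark (the paths are placeholders; the explicit csv variants assume Spark 2.x, where csv is built in):

```python
# Minimal, illustrative sketch (placeholder paths). In Spark 2.x, spark.read and
# sqlContext.read expose the same DataFrameReader API.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-examples").getOrCreate()
path = "s3a://bucket-name/file_name"

# load() assumes parquet unless a format is given:
df_parquet = spark.read.load("s3a://bucket-name/data.parquet")

# Three equivalent ways of naming the csv format explicitly (built in since 2.0):
df_csv1 = spark.read.load(path, format="csv", header="true")
df_csv2 = spark.read.format("csv").option("header", "true").load(path)
df_csv3 = spark.read.csv(path, header=True)   # the read.formatName(...) sugar

# text() reads each line into a single string column named "value":
df_text = spark.read.text(path)
```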

Streaming in SparkR?

I have been using Spark in Scala for a while. I am now looking into PySpark and SparkR. I don't see streaming mentioned for PySpark or SparkR. Does anyone know if you can do Spark Streaming when using Python and R?
Spark now supports PySpark streaming as of 1.3, and an implementation of SparkR streaming can be found at https://github.com/hlin09/spark/tree/SparkR-streaming.
Currently (as of Spark 1.1), Spark Streaming is only supported in Scala and Java. If you have a specific R or Python program you want to use, you can take a look at the pipe interface on RDDs along with the transform function on DStreams. This is a bit awkward, but it's probably the easiest way to use Python or R code in Spark Streaming currently.
SparkR streaming is not available as of the latest version of Apache Spark (2.1.1),
but we can use SparkR streaming from GitHub:
https://github.com/hlin09/spark/tree/SparkR-streaming
Build Spark using mvn and then you will be able to do SparkR streaming.
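For reference, the PySpark streaming support mentioned above (available since Spark 1.3) follows the classic DStream word-count pattern; the host and port below are placeholders:

```python
# Classic PySpark DStream example (Spark >= 1.3); host/port are placeholders.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="network_wordcount")
ssc = StreamingContext(sc, batchDuration=1)   # 1-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()
```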
