Reading csv files in zeppelin using spark-csv - apache-spark

I wanna read csv files in Zeppelin and would like to use databricks'
spark-csv package: https://github.com/databricks/spark-csv
In the spark-shell, I can use spark-csv with
spark-shell --packages com.databricks:spark-csv_2.11:1.2.0
But how do I tell Zeppelin to use that package?
Thanks in advance!

You need to add the Spark Packages repository to Zeppelin before you can use %dep on spark packages.
%dep
z.reset()
z.addRepo("Spark Packages Repo").url("http://dl.bintray.com/spark-packages/maven")
z.load("com.databricks:spark-csv_2.10:1.2.0")
Alternatively, if this is something you want available in all your notebooks, you can add the --packages option to the spark-submit command setting in the interpreters config in Zeppelin, and then restart the interpreter. This should start a context with the package already loaded as per the spark-shell method.

Go to the Interpreter tab, click Repository Information, add a repo and set the URL to http://dl.bintray.com/spark-packages/maven
Scroll down to the spark interpreter paragraph and click edit, scroll down a bit to the artifact field and add "com.databricks:spark-csv_2.10:1.2.0" or a newer version. Then restart the interpreter when asked.
In the notebook, use something like:
import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true") // Use first line of all files as header
.option("inferSchema", "true") // Automatically infer data types
.load("my_data.txt")
Update:
In the Zeppelin user mailing list, it is now (Nov. 2016) stated by Moon Soo Lee (creator of Apache Zeppelin) that users prefer to keep %dep as it allows for:
self-documenting library requirements in the notebook;
per Note (and possible per User) library loading.
The tendency is now to keep %dep, so it should not be considered depreciated at this time.

BEGIN-EDIT
%dep is deprecated in Zeppelin 0.6.0. Please refer Paul-Armand Verhaegen's answer.
Please read further in this answer, if you are using zeppelin older than 0.6.0
END-EDIT
You can load the spark-csv package using %dep interpreter.
like,
%dep
z.reset()
// Add spark-csv package
z.load("com.databricks:spark-csv_2.10:1.2.0")
See Dependency Loading section in https://zeppelin.incubator.apache.org/docs/interpreter/spark.html
If you've already initialized Spark Context, quick solution is to restart zeppelin and execute zeppelin paragraph with above code first and then execute your spark code to read the CSV file

You can add jar files under Spark Interpreter dependencies:
Click 'Interpreter' menu in navigation bar.
Click 'edit' button for Spark interpreter.
Fill artifact and exclude fields.
Press 'Save'

if you define in conf/zeppelin-env.sh
export SPARK_HOME=<PATH_TO_SPARK_DIST>
Zeppelin will then look in $SPARK_HOME/conf/spark-defaults.conf and you can define jars there:
spark.jars.packages com.databricks:spark-csv_2.10:1.4.0,org.postgresql:postgresql:9.3-1102-jdbc41
then look at
http://zepplin_url:4040/environment/ for the following:
spark.jars file:/root/.ivy2/jars/com.databricks_spark-csv_2.10-1.4.0.jar,file:/root/.ivy2/jars/org.postgresql_postgresql-9.3-1102-jdbc41.jar
spark.jars.packages com.databricks:spark-csv_2.10:1.4.0,org.postgresql:postgresql:9.3-1102-jdbc41
For more reference: https://zeppelin.incubator.apache.org/docs/0.5.6-incubating/interpreter/spark.html

Another solution:
In conf/zeppelin-env.sh (located in /etc/zeppelin for me) add the line:
export SPARK_SUBMIT_OPTIONS="--packages com.databricks:spark-csv_2.10:1.2.0"
Then start the service.

Related

Add additional jars using PYSPARK_SUBMIT_ARGS

I have code to start a spark session
spark_session = SparkSession.builder.appName(app_name)
spark_session = spark_session.getOrCreate()
sc = spark_session.sparkContext
Now I want to dynamically be able to add jars and packages using PYSPARK_SUBMIT_ARGS so I added an environment variable with the following value before I hit the code
--jars /usr/share/aws/redshift/jdbc/RedshiftJDBC4.jar --packages com.databricks:spark-redshift_2.10:2.0.0,org.apache.spark:spark-avro_2.11:2.4.0,com.eclipsesource.minimal-json:minimal-json:0.9.4
But I get the following error:
Error: Missing application resource.
From looking online I know its because of the fact that I am explicitly passing jars and packages so I need to provide the path to my main jar file. But I am confused as to what that will be. I am just tring to run some code by starting a pyspark shell. I know the other way of passing these while starting the shell but my use case is such that I want it to be able to do it using the env variable and I have not been able to find the answers to this issue online

Sharing a spark session

Lets say I have a python file my_python.py in which I have created a SparkSession 'spark' . I have a jar say my_jar.jar in which some spark logic is written. I am not creating SparkSession in my jar , rather I want to use the same session created in my_python.py. How to write a spark-submit command which take my python file , my jar and my sparksession 'spark' as an argument to my jar file.
Is it possible ?
If not , please share the alternative to do so.
So I feel there are two questions -
Q1. How in scala file you can reuse already created spark session?
Ans: Inside your scala code, you should use builder to get an existing session:
SparkSession.builder().getOrCreate()
Please check the Spark doc
https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/sql/SparkSession.html
Q2: How you do spark-submit with a .py file as driver and scala jar(s) as supporting jars?
And: It should be in something like this
./spark-submit --jars myjar.jar,otherjar.jar --py-files path/to/myegg.egg path/to/my_python.py arg1 arg2 arg3
So if you notice the method name, it is getOrCreate() - that means if a spark session is already created, no new session will be created rather existing session will be used.
Check this link for full implementation example:
https://www.crowdstrike.com/blog/spark-hot-potato-passing-dataframes-between-scala-spark-and-pyspark/

How to start sparksession in pyspark

I want to change the default memory, executor and core settings of a spark session.
The first code in my pyspark notebook on HDInsight cluster in Jupyter looks like this:
from pyspark.sql import SparkSession
spark = SparkSession\
.builder\
.appName("Juanita_Smith")\
.config("spark.executor.instances", "2")\
.config("spark.executor.cores", "2")\
.config("spark.executor.memory", "2g")\
.config("spark.driver.memory", "2g")\
.getOrCreate()
On completion, I read the parameters back, which looks like the statement worked
However if I look in yarn, the setting have indeed not worked.
Which settings or commands do I need to make to let the session configuration take effect ?
Thank you for help in advance
By the time your notebook kernel has started, the SparkSession is already created with parameters defined in a kernel configuration file. To change this, you will need to update or replace the kernel configuration file, which I believe is usually somewhere like <jupyter home>/kernels/<kernel name>/kernel.json.
Update
If you have access to the machine hosting your Jupyter server, you can find the location of the current kernel configurations using jupyter kernelspec list. You can then either edit one of the pyspark kernel configurations, or copy it to a new file and edit that. For your purposes, you will need to add the following arguments to the PYSPARK_SUBMIT_ARGS:
"PYSPARK_SUBMIT_ARGS": "--conf spark.executor.instances=2 --conf spark.executor.cores=2 --conf spark.executor.memory=2g --conf spark.driver.memory=2g"

spark connecting to Phoenix NoSuchMethod Exception

I am trying to connect to Phoenix through Spark/Scala to read and write data as a DataFrame. I am following the example on GitHub however when I try the very first example Load as a DataFrame using the Data Source API I get the below exception.
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.hadoop.hbase.client.Put.setWriteToWAL(Z)Lorg/apache/hadoop/hbase/client/Put;
There are couple of things that are driving me crazy from those examples:
1)The import statement import org.apache.phoenix.spark._ gives me below exception in my code:
cannot resolve symbol phoenix
I have included below jars in my sbt
"org.apache.phoenix" % "phoenix-spark" % "4.4.0.2.4.3.0-227" % Provided,
"org.apache.phoenix" % "phoenix-core" % "4.4.0.2.4.3.0-227" % Provided,
2) I get the deprecated warning for symbol load.
I googled about that warnign but didn't got any reference and I was not able to find any example of the suggested method. I am not able to find any other good resource which guides on how to connect to Phoenix. Thanks for your time.
please use .read instead of load as shown below
val df = sparkSession.sqlContext.read
.format("org.apache.phoenix.spark")
.option("zkUrl", "localhost:2181")
.option("table", "TABLE1").load()
Its late to answer but here's what i did to solve a similar problem(Different method not found and deprecation warning):
1.) About the NoSuchMethodError: I took all the jars from hbase installation lib folder and add it to your project .Also add pheonix spark jars .Make sure to use compatible versions of spark and pheonix spark.Spark 2.0+ is compatible with pheonix-spark-4.10+
maven-central-link.This resolved the NoSuchMethodError
2.) About the load - The load method has long since been deprecated .Use sqlContext.phoenixTableAsDataFrame.For reference see this Load as a DataFrame directly using a Configuration object

NoClassDefFoundError when using avro in spark-shell

I keep getting
java.lang.NoClassDefFoundError: org/apache/avro/mapred/AvroWrapper
when calling show() on a DataFrame object. I'm attempting to do this through the shell (spark-shell --master yarn). I can see that the shell recognizes the schema when creating the DataFrame object, but if I execute any actions on the data it will always throw the NoClassDefFoundError when trying to instantiate the AvroWrapper. I've tried adding avro-mapred-1.8.0.jar in my $HDFS_USER/lib directory on the cluster and even included it using the --jar option when launching the shell. Neither of these options worked. Any advice would be greatly appreciated. Below is example code:
scala> import org.apache.spark.sql._
scala> import com.databricks.spark.avro._
scala> val sqc = new SQLContext(sc)
scala> val df = sqc.read.avro("my_avro_file") // recognizes the schema and creates the DataFrame object
scala> df.show // this is where I get NoClassDefFoundError
The DataFrame object itself is created at the val df =... line, but data is not read yet. Spark only starts reading and processing the data, when you ask for some kind of output (like a df.count(), or df.show()).
So the original issue is that the avro-mapred package is missing.
Try launching your Spark Shell like this:
spark-shell --packages org.apache.avro:avro-mapred:1.7.7,com.databricks:spark-avro_2.10:2.0.1
The Spark Avro package marks the Avro Mapred package as provided, but it is not available on your system (or classpath) for one or other reason.
If anyone else runs into this problem, I finally solved it. I removed the CDH spark package and downloaded it from http://spark.apache.org/downloads.html. After that everything worked fine. Not sure what the issues was with the CDH version, but I'm not going to waste anymore time trying to figure it out.

Resources