Installing Apache Spark Packages to Run Locally

I am looking for a clear guide, or steps, for installing Spark packages (specifically spark-avro) to run locally, and for using them correctly with the spark-submit command.
I've spent a lot of time reading many posts and guides, but I'm still not able to get spark-submit to use the locally deployed spark-avro package. So if someone has already accomplished this with spark-avro or another package, please share your wisdom :)
All the existing documentation I found is a bit unclear.
Clear steps and examples would be much appreciated! P.S. I know Python/PySpark/SQL, but not much Java (yet) ...
Michael

You can pass the Avro package details in the spark-submit command itself (make sure the spark-avro and Spark versions are compatible):
spark-submit --packages org.apache.spark:spark-avro_<scala_version>:<spark_version>
For example:
spark-submit --packages org.apache.spark:spark-avro_2.11:2.4.0
You can pass the same option to spark-shell as well to work with Avro files.
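To make this concrete, here is a minimal PySpark sketch assuming the job was launched with the --packages option above; the file paths are just placeholders (with spark-avro 2.4.x, the short format name "avro" is available):

# Launch with: spark-submit --packages org.apache.spark:spark-avro_2.11:2.4.0 avro_example.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("avro-example").getOrCreate()

# Read an Avro file (placeholder path)
df = spark.read.format("avro").load("/tmp/example.avro")
df.show()

# Write it back out as Avro (placeholder path)
df.write.format("avro").save("/tmp/example_out.avro")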

Related

Importing vs Installing Spark

I am new to the Spark world and, to some extent, to coding.
This question might seem too basic, but please clear up my confusion.
I know that we have to import the Spark libraries to write a Spark application. I use IntelliJ and sbt.
After writing the application, I can also run it and see the output on "run".
My question is: why should I install Spark separately on my (local) machine if I can just import it as a library and run my application?
Also, what is the need for it to be installed on the cluster, since we can just submit the jar file and the JVM is already present on all the machines of the cluster?
Thank you for the help!
I understand your confusion.
Actually, you don't really need to install Spark on your machine if you are, for example, writing your application in Scala/Java: you can just import spark-core (or any other dependencies) into your project, and once you start your Spark job from the main class with the master set to local[*], it will create a standalone Spark runner on your machine and run your job there.
There are still several reasons for having Spark on your local machine.
One of them is running a Spark job with PySpark, which requires the Spark/Python libraries and a runner (local or remote).
Another reason is if you want to run your job on premises.
It might be easier to create a cluster in your local data center, appoint your machine as the master, and connect the other machines to it as workers. (This solution might be a bit naive, but you asked for the basics, so it might spark your curiosity to read more about the infrastructure design of data processing systems.)
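To illustrate the same idea from Python: assuming pyspark was installed with pip (which bundles the Spark runtime), a job can run entirely in local mode without a separately installed Spark distribution. This is just a sketch, not the only way to set it up:

# Runs an embedded local Spark; no standalone installation or cluster needed
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")          # use all local cores
         .appName("embedded-example")
         .getOrCreate())

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
print(df.count())

spark.stop()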

How to install a PostgreSQL JDBC driver in PySpark

I use PySpark with Spark 2.2.0 on Lubuntu 16.04, and I want to write a DataFrame to my PostgreSQL database. As far as I understand it, I have to install a JDBC driver on the Spark master for this. I downloaded the PostgreSQL JDBC driver from their website and tried to follow this post. I added spark.jars.packages /path/to/driver/postgresql-42.2.1.jar to spark-defaults.conf, with the only result that pyspark no longer launches.
I'm kind of lost in Java land. For one, I don't know if this is the right format. The documentation tells me I should add a list, but I don't know what a path list is supposed to look like. Then I don't know whether I also have to specify spark.jars and/or spark.driver.extraClassPath, or whether spark.jars.packages is enough. And if I have to add them, what format should they be in?
spark.jars.packages is for dependencies that can be pulled from Maven (think of it as pip for Java, although the analogy is admittedly loose).
You can submit your job with the option --jars /path/to/driver/postgresql-42.2.1.jar, so that the submission also ships the library, which the cluster manager will distribute to all worker nodes on your behalf.
If you want to set this as a configuration instead, use the spark.jars key rather than spark.jars.packages. The latter expects Maven coordinates, not a path (which is probably the reason your job is failing).
You can read more about the configuration keys I introduced in the official documentation.
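Putting it together, here is a minimal PySpark sketch of writing a DataFrame over JDBC, assuming the driver jar is supplied via --jars (or spark.jars); the connection URL, table name, and credentials below are placeholders:

# Launch with e.g.:
#   spark-submit --jars /path/to/driver/postgresql-42.2.1.jar write_to_pg.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pg-write-example").getOrCreate()
df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])

(df.write
   .format("jdbc")
   .option("url", "jdbc:postgresql://localhost:5432/mydb")   # placeholder URL
   .option("dbtable", "public.people")                       # placeholder table
   .option("user", "myuser")                                 # placeholder credentials
   .option("password", "mypassword")
   .option("driver", "org.postgresql.Driver")
   .mode("append")
   .save())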

How to modify Spark source code and build it

I have just started learning Spark. I have imported the Spark source code into IDEA and made some small changes to it (just adding some println() calls). What should I do to see these updates? Should I recompile Spark? Thanks!
At the bare minimum, you will need Maven 3.3.3 and Java 7+.
You can follow the steps at http://spark.apache.org/docs/latest/building-spark.html
The "make-distribution.sh" script is quite handy which comes within the spark source code root directory. This script will produce a distributable tar.gz which you can simply extract and launch spark-shell or spark-submit. After making the source code changes in spark, you can run this script with the right options (mainly passing the desired hadoop version, yarn or hive support options but these are required if you want to run on top of hadoop distro, or want to connect to existing hive).
BTW, inserting println() will not be a good idea as it can severely slow down the performance of the job. You should use a logger instead.

Connecting to Cassandra using pyspark

I am a beginner learning to work with Spark and Cassandra.
I am trying to connect to Cassandra using PySpark. I am running Cassandra 2.1 and Spark 1.3.
I have cloned this repo https://github.com/TargetHolding/pyspark-cassandra and followed the instructions to get it working with the Spark shell as well as with spark-submit.
This is the command I am using:
./bin/spark-submit --packages pyspark-cassandra:1.3 --conf spark.cassandra.connection.host=127.0.0.1:9042 cassandra_test.py
and similarly with pyspark in place of spark-submit (without the script at the end).
I am getting this error:
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Provided Maven Coordinates must be in the form 'groupId:artifactId:version'. The coordinate provided is: pyspark-cassandra:1.3
I have tried looking up this error and going through related questions, but I'm not able to get the connector working.
Any help will be greatly appreciated.
Thanks in advance.
I haven't tried it, but the Spark Packages page is here: http://spark-packages.org/package/TargetHolding/pyspark-cassandra
It seems to suggest:
$SPARK_HOME/bin/spark-shell --packages TargetHolding:pyspark-cassandra:0.1.5
Note the TargetHolding: bit. That might be it.
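For reference, a rough Python sketch of what the connection could look like, assuming the pyspark_cassandra module from the TargetHolding package and a hypothetical keyspace/table:

# Launch with e.g.:
#   $SPARK_HOME/bin/spark-submit --packages TargetHolding:pyspark-cassandra:0.1.5 cassandra_test.py
from pyspark import SparkConf
from pyspark_cassandra import CassandraSparkContext

conf = (SparkConf()
        .setAppName("cassandra-example")
        .set("spark.cassandra.connection.host", "127.0.0.1"))  # host only; the port is a separate setting
sc = CassandraSparkContext(conf=conf)

# Read a table as an RDD of rows (placeholder keyspace/table names)
rows = sc.cassandraTable("my_keyspace", "my_table")
print(rows.first())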

Running a Spark application on YARN, without spark-submit

I know that Spark applications can be executed on YARN using spark-submit --master yarn.
The question is:
is it possible to run a Spark application on YARN using the yarn command?
If so, the YARN REST API could be used as an interface for running Spark and MapReduce applications in a uniform way.
I see this question is a year old, but for anyone else who stumbles across it, it looks like this should be possible now. I've been trying to do something similar and have been following the Starting Spark jobs directly via YARN REST API tutorial from Hortonworks.
Essentially, what you need to do is upload your jar to HDFS, create a Spark job JSON file per the YARN REST API documentation, and then use a curl command to start the application. An example of that command is:
curl -s -i -X POST -H "Content-Type: application/json" ${HADOOP_RM}/ws/v1/cluster/apps \
--data-binary @spark-yarn.json
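A rough Python equivalent of that curl call, for illustration only; it assumes the spark-yarn.json payload has already been prepared per the YARN REST API documentation, and the Resource Manager address is a placeholder:

import json
import requests

RM = "http://resourcemanager:8088"  # placeholder for ${HADOOP_RM}

# Load the prepared application-submission payload
with open("spark-yarn.json") as f:
    payload = json.load(f)

# POST it to the YARN REST API to start the application
resp = requests.post(f"{RM}/ws/v1/cluster/apps", json=payload)
print(resp.status_code, resp.text)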
Just like all YARN applications, Spark implements a Client and an ApplicationMaster when deploying on YARN. If you look at the implementation in the Spark repository, you'll get a clue as to how to create your own Client/ApplicationMaster:
https://github.com/apache/spark/tree/master/yarn/src/main/scala/org/apache/spark/deploy/yarn . But out of the box it does not seem possible.
I have not seen the latest package, but a few months back such a thing was not possible "out of the box" (this is info straight from Cloudera support). I know it's not what you were hoping for, but that's what I know.
Thanks for the question.
As suggested above, writing your own ApplicationMaster is a good route for submitting your application without invoking spark-submit.
The community has built up around the spark-submit command for YARN, adding flags that ease the inclusion of jars and/or configs, etc., needed to get the application to execute successfully. See Submitting Applications.
An alternate solution (worth trying): you could run the Spark job as an action in an Oozie workflow. See the Oozie Spark Extension.
Depending on what you wish to achieve, either route looks good.
Hope it helps.
