Is my code distributed in Spark when using a Jupyter notebook kernel? - apache-spark

I need help because I don't know whether a Jupyter notebook kernel can be used with a Spark cluster.
With my local Spark installation I use this kernel and have no problems.
I am using this kernel for PySpark: https://github.com/Anchormen/pyspark-jupyter-kernels
I am using a standalone Spark cluster with three nodes, without YARN.
Best regards.

You can connect to your standalone Spark cluster from the Python kernel by pointing the SparkContext at the master's IP:
import pyspark

# Point the context at the standalone master; 7077 is the default master port
sc = pyspark.SparkContext(master='spark://<public-ip>:7077', appName='<your_app_name>')
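To check that the work is really being distributed rather than executed locally, a minimal sketch (reusing the sc created above; the partition count and range size are arbitrary) is:

import socket

# Total cores granted by the cluster, not just the local machine
print(sc.defaultParallelism)

# Report which hosts actually ran the tasks; with a working cluster this
# should include the worker nodes, not only the machine running the notebook
hosts = sc.parallelize(range(100), 8).map(lambda _: socket.gethostname()).distinct().collect()
print(hosts)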
References
How to connect Jupyter Notebook to remote spark clusters
Set up an Apache Spark cluster and integrate with Jupyter Notebook
Deploy Application from Jupyter Lab to a Spark Standalone Cluster

Related

Not able to connect to Snowflake from EMR Cluster using Pyspark

I am trying to connect to Snowflake from an EMR cluster using PySpark.
I am using these two jars in spark-submit:
snowflake-jdbc-3.5.2.jar
spark-snowflake_2.11-2.7.0-spark_2.4.jar
But it is failing with a connection timeout error.
I have the correct proxy configured for the EMR cluster. From the same EC2 instance (the EMR master)
I am able to connect to Snowflake using snowsql and the Python connector.
I am not sure why it is timing out for PySpark.
You can use our SnowCD tool to run connectivity diagnostics; this kind of timeout is usually a network issue.
https://docs.snowflake.com/en/user-guide/snowcd.html
Below were my commands when I tried running through the EMR shell:
pyspark --packages net.snowflake:snowflake-jdbc:3.6.27,net.snowflake:spark-snowflake_2.12:2.4.14-spark_2.4
spark-submit --packages net.snowflake:snowflake-jdbc:3.8.0,net.snowflake:spark-snowflake_2.11:2.4.14-spark_2.4 SparkConnPythonWithCert.py
spark-shell --packages net.snowflake:snowflake-jdbc:3.8.0,net.snowflake:spark-snowflake_2.11:2.4.14-spark_2.4
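Once connectivity is confirmed, a minimal PySpark read through the Spark connector looks roughly like the sketch below (account URL, credentials, and table name are placeholders; the options shown are the standard Snowflake connector options):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("snowflake-connectivity-test").getOrCreate()

# Placeholder connection options for the Snowflake Spark connector
sf_options = {
    "sfURL": "<account>.snowflakecomputing.com",
    "sfUser": "<user>",
    "sfPassword": "<password>",
    "sfDatabase": "<database>",
    "sfSchema": "<schema>",
    "sfWarehouse": "<warehouse>",
}

# A timeout here while snowsql works from the same host often means the
# proxy settings are not reaching the Spark JVMs (driver and executors)
df = (spark.read
      .format("net.snowflake.spark.snowflake")
      .options(**sf_options)
      .option("dbtable", "<table>")
      .load())
df.show()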

Spark running in local machine instead of standalone cluster

I've created a Spark cluster on EC2, installed Jupyter on the master node, started Jupyter, and then created a SparkContext using:
import findspark
findspark.init(spark_home='/home/ubuntu/spark')
import pyspark
from functools import partial
sc = pyspark.SparkContext(appName="Pi")
When I try to run any job, Spark only utilizes the cores of the master machine. All the slaves are running and connected to the master, but I am still not able to use the cores of any of the slave machines. Can anybody please help?
You need to set the master URL to spark://... when creating your SparkContext.
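For example, a minimal sketch (the master address is a placeholder; 7077 is the default standalone master port):

import findspark
findspark.init(spark_home='/home/ubuntu/spark')

import pyspark

# Point the context at the standalone master instead of letting it run locally
conf = pyspark.SparkConf() \
    .setMaster('spark://<master-ip>:7077') \
    .setAppName('Pi')
sc = pyspark.SparkContext(conf=conf)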

How notebook sends code to Spark?

I am using a notebook environment to try out some commands against Spark. Can someone explain how the overall flow works when we run a cell from the notebook? In a notebook environment, which component acts as the driver?
Also, can we call the code snippets we run from a notebook a "Spark application", or do we only call a code snippet a "Spark application" when we use spark-submit to submit it to Spark? Basically, I am trying to find out what qualifies as a "Spark application".
A notebook environment like Zeppelin creates a SparkContext during the first execution of a cell. Once a SparkContext is created, all further cell executions are submitted to that same SparkContext.
Where the driver program starts depends on the deploy mode used when submitting to your cluster's resource manager. In client mode (the usual setup for notebooks) the driver program starts on the host where the notebook is running; in cluster mode it starts on one of the nodes of the cluster.
You can consider each running SparkContext on the cluster as a separate application. Notebooks like Zeppelin can share the same SparkContext across all cells of all notebooks, or can be configured to create one per notebook.
Internally, most notebooks just call spark-submit.
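As a rough illustration of the reuse behaviour, each notebook cell effectively does something like the sketch below: the first getOrCreate() builds the driver-side session, and later calls return the one that is already running.

from pyspark.sql import SparkSession

# First cell: builds the SparkContext/SparkSession on the driver
spark = SparkSession.builder.appName("notebook-session").getOrCreate()

# Later cells: the same session (and therefore the same application and
# executors) is returned instead of a new application being started
spark2 = SparkSession.builder.getOrCreate()
assert spark2 is spark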

Force H2O Sparkling Water cluster to start on a specific machine in YARN mode

Tools used:
Spark 2
Sparkling Water (H2O)
Zeppelin notebook
PySpark code
I'm starting H2O in INTERNAL mode from my Zeppelin notebook, since my environment is YARN. I'm using the basic command:
from pysparkling import *
hc = H2OContext.getOrCreate(spark)
import h2o
My problem is that I have the Zeppelin server installed on a weak machine, and when I run my code FROM ZEPPELIN the H2O cluster automatically starts on that machine using its IP. The driver runs there, and I'm limited by the driver memory that H2O consumes. I have 4 strong worker machines with 100 GB and many cores each, and the cluster uses them while I run my models, but I would like the H2O cluster to start the driver on one of these worker machines. I didn't find a way to force H2O to do that.
I wonder if there is a solution, or if I must install the Zeppelin server on a worker machine.
Help will be appreciated if a solution is possible.
Start your job in yarn-cluster mode. This will make the driver run as another YARN container.
Here is another stackoverflow post describing the difference:
Spark yarn cluster vs client - how to choose which one to use?
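For a job launched with spark-submit this is just the deploy-mode flag; whether your Zeppelin release can run its Spark interpreter in cluster mode depends on the version, so treat the following as the spark-submit equivalent rather than a Zeppelin recipe (script name and driver memory are placeholders):

spark-submit --master yarn --deploy-mode cluster --driver-memory 16g your_sparkling_water_job.py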

Apache Zeppelin & Spark Streaming: Twitter Example only works local

I just added the example project from http://zeppelin-project.org/docs/tutorial/tutorial.html (section "Tutorial with Streaming Data") to my Zeppelin notebook. The problem I now have is that the application only seems to work locally. If I change the Spark interpreter setting "master" from "local[*]" to "spark://master:7077", the application no longer returns any results for the same SQL statement. Am I doing anything wrong? I already restarted the Zeppelin interpreter, the whole Zeppelin daemon, and the Spark cluster, and nothing solved the issue. Can someone help?
I use the following installation:
Spark 1.5.1 (prebuilt for Hadoop 2.6+), Master + 2x Slaves
Zeppelin 0.5.5 (installed on Spark's master node)
EDIT
Also the following installation won't work for me:
Spark 1.5.0 (prebuilt for Hadoop 2.6+), Master + 2x Slaves
Zeppelin 0.5.5 (installed on Spark's master node)
Screenshot: local setting (works!)
Screenshot: cluster setting (won't work!)
Screenshot: the job seems to run correctly in cluster mode
I got it after 2 days of trying things out!
The difference between the local Zeppelin Spark interpreter and the Spark cluster seems to be that the local one includes the Twitter utilities (spark-streaming-twitter) needed to run the Twitter streaming example, while the Spark cluster doesn't ship this library by default.
Therefore you have to add the dependency manually in the Zeppelin notebook before starting the application with the Spark cluster as master. So the first paragraph of the notebook must be:
%dep
z.reset
z.load("org.apache.spark:spark-streaming-twitter_2.10:1.5.1")
If an error occurs when running this paragraph, just try to restart the Zeppelin server via ./bin/zeppelin-daemon.sh stop (& start)!
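Depending on your Zeppelin version, an alternative to %dep (assuming your build reads SPARK_SUBMIT_OPTIONS from conf/zeppelin-env.sh) is to let spark-submit pull the artifact itself:

export SPARK_SUBMIT_OPTIONS="--packages org.apache.spark:spark-streaming-twitter_2.10:1.5.1"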
