I've created a Spark cluster on EC2. I then installed Jupyter on the master node, started it, and created a SparkContext using:
findspark.init(spark_home='/home/ubuntu/spark')
import pyspark
from functools import partial
sc = pyspark.SparkContext(appName="Pi")
When I try to run any job, Spark only uses the cores of the master machine. All the slaves are running and connected to the master, but I still can't use the cores of any of the slave machines. Can anybody please help?
You need to set the master URL to spark://... when creating your SparkContext. Without it, the context falls back to local mode and only uses the machine that hosts the notebook.
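A minimal sketch of that fix, assuming the standalone master listens on the default port 7077. The helper names `standalone_master_url` and `create_context` are made up for illustration, and `<master-host>` stands for whatever address the Spark master UI reports.

```python
def standalone_master_url(host, port=7077):
    """Build a Spark Standalone master URL, e.g. spark://10.0.0.5:7077."""
    return f"spark://{host}:{port}"


def create_context(master_host, app_name="Pi", spark_home="/home/ubuntu/spark"):
    """Create a SparkContext that is bound to the standalone master."""
    # Imported lazily so the URL helper can be used without Spark installed.
    import findspark
    findspark.init(spark_home=spark_home)
    import pyspark

    conf = (
        pyspark.SparkConf()
        .setAppName(app_name)
        .setMaster(standalone_master_url(master_host))
    )
    return pyspark.SparkContext(conf=conf)
```

After `sc = create_context("<master-host>")`, `sc.master` should show the spark:// URL rather than a local[...] value, and the application should appear in the master's web UI.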
I need help because I don't know whether Jupyter notebook kernels can be used with a Spark cluster.
With my local Spark installation I use this kernel and have no problems.
I am using this kernel for PySpark: https://github.com/Anchormen/pyspark-jupyter-kernels
I am using a standalone Spark cluster with three nodes, without YARN.
Best regards.
You can connect to your standalone Spark cluster from the Python kernel using the master's IP address.
import pyspark
sc = pyspark.SparkContext(master='spark://<public-ip>:7077', appName='<your_app_name>')
References
How to connect Jupyter Notebook to remote spark clusters
Set up an Apache Spark cluster and integrate with Jupyter Notebook
Deploy Application from Jupyter Lab to a Spark Standalone Cluster
This is probably a really silly question, but I can't find the answer with Google. I've written a simple pyspark ETL script that reads in a CSV and writes it to Parquet, something like this:
from pyspark.sql import SparkSession, SQLContext
spark = SparkSession.builder.getOrCreate()
sqlContext = SQLContext(spark.sparkContext)
df = sqlContext.read.csv(input_filename)
df.write.parquet(output_path)
To run it, I start up a local Spark cluster in Docker:
$ docker run --network=host jupyter/pyspark-notebook
I run the Python script and it connects to this local Spark cluster and all works as expected.
Now I'd like to run the same script on a remote Spark cluster (AWS EMR). Can I just specify a remote IP address somewhere when initialising the Spark context? Or am I misunderstanding how Spark works?
You can create a Spark session by specifying the address of the remote master:
spark = SparkSession.builder.master("spark://<ip>:<port>").getOrCreate()
In the case of AWS EMR, standalone mode is not supported. You need to use YARN in either client or cluster mode, and point HADOOP_CONF_DIR to a directory on your local server that contains all the files from /etc/hadoop/conf on the cluster. Then set up dynamic port forwarding to connect to the EMR cluster. Create a Spark session like this:
spark = SparkSession.builder.master('yarn').config('spark.submit.deployMode', 'cluster').getOrCreate()
Refer to https://aws.amazon.com/premiumsupport/knowledge-center/emr-submit-spark-job-remote-cluster/
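The HADOOP_CONF_DIR step can be sketched as follows. This is only an illustration under assumptions: the helper names are hypothetical, and the check for `core-site.xml` and `yarn-site.xml` is an assumed sanity check, not an official list of required files. Client deploy mode is used here because the driver lives in the calling Python process.

```python
import os

# Assumption: these two files are the minimum worth checking for.
EXPECTED_FILES = ("core-site.xml", "yarn-site.xml")


def use_hadoop_conf(conf_dir):
    """Point HADOOP_CONF_DIR at a local copy of the cluster's /etc/hadoop/conf."""
    missing = [
        name for name in EXPECTED_FILES
        if not os.path.isfile(os.path.join(conf_dir, name))
    ]
    if missing:
        raise FileNotFoundError(f"{conf_dir} is missing: {', '.join(missing)}")
    os.environ["HADOOP_CONF_DIR"] = conf_dir
    return conf_dir


def create_yarn_session(conf_dir, app_name="remote-etl"):
    """Create a YARN-backed SparkSession; the env var must be set first."""
    use_hadoop_conf(conf_dir)
    from pyspark.sql import SparkSession  # lazy import, needs pyspark installed

    return (
        SparkSession.builder
        .master("yarn")
        .appName(app_name)
        .config("spark.submit.deployMode", "client")
        .getOrCreate()
    )
```

The environment variable has to be set before the session is created, because the JVM that Spark launches reads it at startup.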
Tools used:
Spark 2
Sparkling Water (H2O)
Zeppelin notebook
Pyspark Code
I'm starting H2O in INTERNAL mode from my Zeppelin notebook, since my environment is YARN. I'm using the basic command:
from pysparkling import *
hc = H2OContext.getOrCreate(spark)
import h2o
My problem is that the Zeppelin server is installed on a weak machine, and when I run my code from Zeppelin the H2O cluster automatically starts on that machine, using its IP. The driver runs there, so I'm limited by the driver memory that H2O consumes. I have 4 strong worker machines with 100GB of memory and many cores each, and the cluster uses them while I run my models. I would like the H2O cluster to start on one of these worker machines and run the driver there, but I haven't found a way to force H2O to do that.
I wonder if there is a solution, or if I must install the Zeppelin server on a worker machine.
Any help would be appreciated, if a solution is possible.
Start your job in yarn-cluster mode. This will make the driver run as another YARN container.
Here is another stackoverflow post describing the difference:
Spark yarn cluster vs client - how to choose which one to use?
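For a job launched with spark-submit, cluster deploy mode comes down to two flags. The sketch below only builds the command line. The helper name and the script path `etl_job.py` are hypothetical, and how a notebook server passes these settings to its interpreter is outside its scope.

```python
import shlex


def spark_submit_command(app_path, app_args=(), executable="spark-submit"):
    """Return the argv list for a YARN cluster-mode submission."""
    cmd = [executable, "--master", "yarn", "--deploy-mode", "cluster", app_path]
    cmd.extend(app_args)
    return cmd


def as_shell_line(cmd):
    """Render the argv list as a copy-pasteable shell command."""
    return " ".join(shlex.quote(part) for part in cmd)


if __name__ == "__main__":
    # "etl_job.py" is a placeholder path, not a file from the question.
    print(as_shell_line(spark_submit_command("etl_job.py", ["--date", "2020-01-01"])))
```

With `--deploy-mode cluster` the driver is started inside a YARN container on the cluster, so its memory is no longer limited by the machine that submitted the job.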
EMR supports Spark natively. When using the EMR web interface to create a new cluster, it is possible to add a custom step that executes a Spark application when the cluster starts, basically an automated spark-submit after cluster startup.
I've been wondering how to specify the master node in the SparkConf within the application when starting the EMR cluster and submitting the jar file through the designated EMR step.
It is not possible to know the IP of the cluster master beforehand, as would be the case if I started the cluster manually and then used the information to build into my application before calling spark-submit.
Code snippet:
SparkConf conf = new SparkConf().setAppName("myApp").setMaster("spark://???:7077");
JavaSparkContext sparkContext = new JavaSparkContext(conf);
Note that I am asking about the "cluster" execution mode, so the driver program runs on the cluster as well.
Short answer: don't.
Longer answer: A master URL like "spark://..." is for Spark Standalone, but EMR uses Spark on YARN, so the master URL should be just "yarn". This is already configured for you in spark-defaults.conf, so when you run spark-submit, you don't even have to include "--master ...".
However, since you are asking about cluster execution mode (actually, it's called "deploy mode"), you may specify either "--master yarn-cluster" (deprecated) or "--deploy-mode cluster" (preferred). This will make the Spark driver run on a random cluster node rather than on the EMR master.
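On the application side, this means leaving the master out of the code so that spark-submit and spark-defaults.conf decide it. A PySpark sketch of the idea follows (the question uses Java, but the same principle applies). The environment variable `APP_SPARK_MASTER` and both helper names are hypothetical, added only so the same code can still be pointed at a local master during development.

```python
import os


def apply_optional_master(builder, env=None):
    """Set a master only when an explicit override is present."""
    env = os.environ if env is None else env
    master = env.get("APP_SPARK_MASTER")  # hypothetical variable name
    return builder.master(master) if master else builder


def create_session(app_name="myApp"):
    """Build a session whose master normally comes from spark-submit."""
    from pyspark.sql import SparkSession  # lazy import, needs pyspark installed

    builder = SparkSession.builder.appName(app_name)
    return apply_optional_master(builder).getOrCreate()
```

When the variable is unset, no master is written into the application, so nothing in the code needs to know the cluster's address in advance.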
I don't know if this has already been answered on SO, but I couldn't find a solution to my problem.
I have an IPython notebook running in a Docker container in Google Container Engine. The container is based on the jupyter/all-spark-notebook image.
I also have a Spark cluster created with Google Cloud Dataproc.
The Spark master and the notebook are running in different VMs, but in the same region and zone.
My problem is that I'm trying to connect to the Spark master from the IPython notebook, but without success. I use this snippet of code in my Python notebook:
import pyspark
conf = pyspark.SparkConf()
conf.setMaster("spark://<spark-master-ip or spark-master-hostname>:7077")
I just started working with Spark, so I'm sure I'm missing something (authentication, security ...).
All I have found so far is how to connect a local browser over an SSH tunnel.
Has anybody already done this kind of setup?
Thank you in advance.
Dataproc runs Spark on YARN, so you need to set master to 'yarn-client'. You also need to point Spark at your YARN ResourceManager, which requires an under-documented SparkConf -> Hadoop Configuration conversion. You also have to tell Spark about HDFS on the cluster, so it can stage resources for YARN. You could use Google Cloud Storage instead of HDFS if you baked the Google Cloud Storage Connector for Hadoop into your image.
Try:
import pyspark
conf = pyspark.SparkConf()
conf.setMaster('yarn-client')
conf.setAppName('My Jupyter Notebook')
# 'spark.hadoop.foo.bar' sets key 'foo.bar' in the Hadoop Configuration.
conf.set('spark.hadoop.yarn.resourcemanager.address', '<spark-master-hostname>')
conf.set('spark.hadoop.fs.default.name', 'hdfs://<spark-master-hostname>/')
sc = pyspark.SparkContext(conf=conf)
For a more permanent config, you could bake these into a local file 'core-site.xml' as described here, place that in a local directory, and set HADOOP_CONF_DIR to that directory in your environment.
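A sketch of generating such a file, assuming the same two keys as in the snippet above with the `spark.hadoop.` prefix removed. The helper name `write_core_site` and the directory path are hypothetical.

```python
import os
import xml.etree.ElementTree as ET


def write_core_site(conf_dir, properties):
    """Write a Hadoop-style core-site.xml with one <property> per entry."""
    os.makedirs(conf_dir, exist_ok=True)
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    path = os.path.join(conf_dir, "core-site.xml")
    ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)
    return path


if __name__ == "__main__":
    # Placeholder directory and hostnames; replace them with real values.
    conf_dir = os.path.expanduser("~/hadoop-conf")
    write_core_site(conf_dir, {
        "yarn.resourcemanager.address": "<spark-master-hostname>",
        "fs.default.name": "hdfs://<spark-master-hostname>/",
    })
    os.environ["HADOOP_CONF_DIR"] = conf_dir
```

Setting HADOOP_CONF_DIR inside the notebook process only helps if it happens before the SparkContext is created; exporting it in the environment that starts Jupyter is the more permanent option.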
It's also worth noting that while being in the same zone is important for performance, it is being in the same network, and allowing TCP between internal IP addresses in that network, that lets your VMs communicate. If you are using the default network, then the default-allow-internal firewall rule should be sufficient.
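A quick way to check that the firewall allows traffic is to attempt a plain TCP connection from the notebook VM. This is a minimal sketch. The helper name is hypothetical, and 8032 is the stock default port for yarn.resourcemanager.address, which a given cluster may override.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    # Replace the placeholder with the master's internal hostname or IP.
    print(can_connect("<spark-master-hostname>", 8032))
```

A False result points at networking (firewall rule, wrong network, wrong address) rather than at the Spark configuration.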
Hope that helps.