PySpark too slow in Google Cloud Dataproc - apache-spark

I deployed a PySpark ML model into a Google Cloud Dataproc cluster and it was running for over an hour, but my data is about 800 MB.
Is there anything being needed to declare as master on my SparkSession? I set the default option 'local'.

When you pass a local deploy mode option to SparkContext it executes your application locally on a single VM, to avoid this you should not pass any options in the SparkContext constructor - it will use pre-configured properties by Dataproc and run your application on YARN utilizing all cluster resources/nodes.

Related

Azuresynapse Spark, run jupyter notebook with localhost running on e.g. localhost

I have an Apache Spark pool running on web.azuresynapse.net and can e.g. use it to run spark jobs from my "Synapse Analytics workspace"
I can develop and run python notebooks using the spark pool from there.
From what I see when I run a job, livy is supported (though) the link provided from azure synapse is for some reason inaccessible.
How could I connect to that pool using e.g. livy from e.g. my local jupyter notebook and use the pool? Or the only way to run spark code is to use a pipeline?

How to run a Spark Standalone master on Kubernetes that will use the Kubernetes Cluser Manager to start workers

I have an application that currently uses Standalone Mode locally to use spark functionality via the SparkContext. We are not using spark-submit to upload our jobs, we are running our application in a container on kubernetes so we would like to take advantage of the dynamic scheduling that kubernetes provides to run the jobs.
We started out looking for a helm chart to create stand alone cluster running on kubernetes similar to how you would have run a standalone cluster on machines ( vms or actual machines ) a few years ago and came across the following
https://github.com/helm/charts/tree/master/stable/spark
Issues:
very old instances of spark
not using the containers provided by spark
this setup wastes a bunch of resources if you need to have large worker nodes reserved and running all the time regardless of your need
Next we started looking at the spark-operator approach here https://github.com/GoogleCloudPlatform/spark-on-k8s-operator
Issues:
Doesn't support the way we interact with spark, takes the approach that all the apps are standalone apps that are pushed to the cluster to run
No longstanding master that allows us to take advantage of cached resources in the cluster
Along this journey we discovered that spark now supports a kubernetes cluster manager ( similar to the way it does with yarn, mesos ) so we are looking that this might be the best approach, but this still does not provide a standalone master that would allow for the in memory caching. I have looked to see if there was a way that I could get the org.apache.spark.deploy.master.Master to start and use the
org.apache.spark.scheduler.cluster.k8s.KubernetesClusterManager
So I guess what I'm trying to ask is does anyone have any experience in trying to run a Standalone Master, that would use the kubernetes backend such as "KubernetesClusterManager" in order to have the worker nodes dynamically created as pods and running executors while having a permanent Standalone Master that would allow a SparkContext to connect to it remotely in client mode.

Rest API for Spark2.3 submit on kubernetes(version 1.8.*) cluster

Im using kubernetes cluster on AWS to run spark jobs ,im using spark 2.3 ,now i want to run spark-submit from AWS lambda function to k8s master,would like to know if there is any REST interface to run Spark submit on k8s Master?
Unfortunately, it is not possible for Spark 2.3, in case you are using native Kubernetes support.
Based on description from deployment instruction, submission process contains several steps:
Spark creates a Spark driver running within a Kubernetes pod.
The driver creates executors which are also running within Kubernetes pods
The driver connects to them, and executes application code
When the application completes, executor pods terminate and are cleaned up, but the driver pod persists its logs and remains in “completed” state in the Kubernetes API until it’s eventually garbage collected or manually cleaned up.
So, in fact, you have no place to submit a job until you start a submission process, which will launch the first Spark's pod (driver) for you. Only once application completes, everything is terminated.
Please also see similar answer for this question under the link

How to setup the cores of driver in spark?

I am using the client deploy model with spark standalone from Jupyter notebook. My configure in the application is like:
conf.setMaster('spark://hadoop-master:7077')
conf.set("spark.driver.cores","4")
conf.set("spark.driver.memory","8g")
conf.set("spark.executor.memory","16g")
But, I speculated that spark.driver.cores was not working, the Spark WEB UI shown:
Why is the cores zero? How to modify this value?
Additionally, the Environment of Spark WEB UI showed, like:
So, the config seemly has uploaded to the server.

Connecting IPython notebook to spark master running in different machines

I don't know if this is already answered in SO but I couldn't find a solution to my problem.
I have an IPython notebook running in a docker container in Google Container Engine, the container is based on this image jupyter/all-spark-notebook
I have also a spark cluster created with google cloud dataproc
Spark master and the notebook are running in different VMs but in the same region and zone.
My problem is that I'm trying to connect to the spark master from the IPython notebook but without success. I use this snippet of code in my python notebook
import pyspark
conf = pyspark.SparkConf()
conf.setMaster("spark://<spark-master-ip or spark-master-hostname>:7077")
I just started working with spark, so I'm sure I'm missing something (authentication, security ...),
What I found over there is connecting a local browser over an SSH tunnel
Somebody already did this kind of set up?
Thank you in advance
Dataproc runs Spark on YARN, so you need to set master to 'yarn-client'. You also need to point Spark at your YARN ResourceManager, which requires a under-documented SparkConf -> Hadoop Configuration conversion. You also have to tell Spark about HDFS on the cluster, so it can stage resources for YARN. You could use Google Cloud Storage instead of HDFS, if you baked The Google Cloud Storage Connector for Hadoop into your image.
Try:
import pyspark
conf = pyspark.SparkConf()
conf.setMaster('yarn-client')
conf.setAppName('My Jupyter Notebook')
# 'spark.hadoop.foo.bar' sets key 'foo.bar' in the Hadoop Configuaration.
conf.set('spark.hadoop.yarn.resourcemanager.address', '<spark-master-hostname>')
conf.set('spark.hadoop.fs.default.name', 'hdfs://<spark-master-hostname>/')
sc = pyspark.SparkContext(conf=conf)
For a more permanent config, you could bake these into a local file 'core-site.xml' as described here, place that in a local directory, and set HADOOP_CONF_DIR to that directory in your environment.
It's also worth noting that while being in the same Zone is important for performance, it is being in the same Network and allowing TCP between internal IP addresses in that network that allows your VMs to communicate. If you are using the default network, then the default-allow-internal firewall rule, should be sufficient.
Hope that helps.

Resources