Apache Spark remote cluster on JupyterHub notebooks on k8s - apache-spark

I have:
Apache Spark: 2.4.4
JupyterHub: 1.1.0
Helm chart version: 0.9.0
K8s: 1.15
I built JupyterHub on k8s following the official doc: https://zero-to-jupyterhub.readthedocs.io/
I use the official all-spark notebook image for some local jobs: jupyter/all-spark-notebook:latest
Spark works well in local mode.
But I want to use a JupyterHub notebook to run jobs on a remote (homemade) Apache Spark cluster (with K8s as the orchestrator).
I already tried Apache Zeppelin and it works well, but I want to do the same thing with JupyterHub.
How can I do this?

I understand your pain. I burned a lot of time getting a Spark cluster and a Jupyter server to work together.
Try using my docker-compose.yaml:
docker-compose up -d
To get the Jupyter token, run:
docker-compose logs jupyter
Copy the URL starting with 127.0.0.1 (including the token) into your browser and change the port to 7777.
You will see an empty Jupyter page.
Create a new notebook and run a cell that connects to the Spark master, as sketched below.
Enjoy using Jupyter with Spark. Hope this helps.
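A minimal sketch of such a cell, assuming the compose file exposes the standalone master as spark://spark-master:7077 from the Jupyter container (the service name and port are assumptions; adjust them to your docker-compose.yaml):
# Hypothetical notebook cell: attach the notebook's Python kernel to a remote
# standalone Spark master. The master URL is an assumption from the compose setup.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("jupyter-remote-spark")
    .master("spark://spark-master:7077")  # assumed service name and port
    .getOrCreate()
)

# Smoke test: the computation runs on the remote executors.
print(spark.sparkContext.parallelize(range(10)).map(lambda x: x * x).collect())
Keep in mind that the Python and Spark versions inside the notebook image generally have to match the ones on the cluster, otherwise the executors will reject the job.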

Related

Connection from local machine installed Zeppelin to Docker Spark cluster

I am trying to configure the Spark interpreter in a locally installed Zeppelin 0.10.0 so that I can run scripts on a Spark cluster that is also created locally with Docker. I am using the docker-compose.yml from https://github.com/big-data-europe/docker-spark and Spark version 3.1.2. After docker-compose up, I can see the spark-master UI in the browser on localhost:8080 and the History Server on localhost:18081. After reading the ID of the spark-master container, I can also run a shell and spark-shell on it (docker exec -it xxxxxxxxxxxx /bin/bash). As host OS I am using Ubuntu 20.04; spark.master in Zeppelin is currently set to spark://localhost:7077 and zeppelin.server.port in zeppelin-site.xml to 8070.
There is a lot of information about connecting a container running Zeppelin, or running both Spark and Zeppelin in the same container, but I also use this Zeppelin installation to connect to Hive via JDBC on a VirtualBox Hortonworks cluster (as in one of my previous posts), and I wouldn't want to change that configuration now due to hardware resources. In one of the posts (Running zeppelin on spark cluster mode) I saw that such a connection is possible, but unfortunately all attempts end with the "Fail to open SparkInterpreter" message.
I would be grateful for any tips.
You need to change spark.master in Zeppelin to point to the Spark master inside the Docker container, not the local machine, so spark://localhost:7077 won't work.
The port 7077 is fine because that is the port specified in the docker-compose file you are using. To get the IP address of the Docker container you can follow this answer. Since I suppose your container is named spark-master, you can try the following:
docker inspect -f '{{range.NetworkSettings.Networks}}{{.IPAddress}}{{end}}' spark-master
Then specify this as the spark.master in Zeppelin: spark://docker-ip:7077
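As a quick sanity check outside Zeppelin, the same master URL can be tested from a plain pyspark session on the host. This is only a sketch: it assumes pyspark 3.1.2 is installed locally and that 172.18.0.2 is the container IP returned by docker inspect.
# Hypothetical connectivity check against the standalone master running in Docker.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("zeppelin-connectivity-check")
    .master("spark://172.18.0.2:7077")  # assumed container IP from docker inspect
    .getOrCreate()
)
print(spark.range(5).count())  # should print 5 if the master accepts the job
spark.stop()
If this fails as well, the problem lies in the cluster networking rather than in the Zeppelin interpreter settings.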

How to allow pyspark to run code on emr cluster

We use Python with the pyspark API in order to run simple code on a Spark cluster.
from pyspark import SparkContext, SparkConf
conf = SparkConf().setAppName('appName').setMaster('spark://clusterip:7077')
sc = SparkContext(conf=conf)
rdd = sc.parallelize([1, 2, 3, 4])
rdd.map(lambda x: x**2).collect()
It works when we set up a Spark cluster locally and with Docker.
We would now like to start an EMR cluster and test the same code, but it seems that pyspark can't connect to the Spark cluster on EMR.
We opened ports 8080 and 7077 from our machine to the Spark master.
We are getting past the firewall, but it seems that nothing is listening on port 7077 and we get connection refused.
We found this explanation of how to submit a job using the CLI, but we need to run it directly from the pyspark API on the driver.
What are we missing here?
How can one start an EMR cluster and actually run pyspark code locally in Python against this cluster?
edit: running this code from the master itself works
As opposed to what was suggested, when connecting to the master using ssh and running Python from the terminal, the very same code (with the master IP adjusted, given it's the same machine) works. No issues, no problems.
How does this make sense given the documentation that clearly states otherwise?
You are trying to run pyspark (which calls spark-submit) from a remote computer outside the Spark cluster. This is technically possible, but it is not the intended way of deploying applications. In YARN mode, it will make your computer participate in the Spark protocol as a client, so it would require opening several ports and installing exactly the same Spark jars as on the AWS EMR cluster.
From the spark-submit docs:
A common deployment strategy is to submit your application from a gateway machine that is physically co-located with your worker machines (e.g. Master node in a standalone EC2 cluster)
A simple deploy strategy is
sync code to master node via rsync, scp or git
cd ~/projects/spark-jobs # on local machine
EMR_MASTER_IP='255.16.17.13'
TARGET_DIR=spark_jobs
rsync -avze "ssh -i ~/dataScienceKey.pem" --rsync-path="mkdir -p ${TARGET_DIR} && rsync" --delete ./ hadoop@${EMR_MASTER_IP}:${TARGET_DIR}
ssh to the master node
ssh -i ~/dataScienceKey.pem hadoop@${EMR_HOST}
run spark-submit on the master node
cd spark_jobs
spark-submit --master yarn --deploy-mode cluster my-job.py
# my-job.py
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("my-job-py").getOrCreate()
sc = spark.sparkContext
rdd = sc.parallelize([1, 2, 3, 4])
res = rdd.map(lambda x: x**2).collect()
print(res)
There is a way to submit the job directly to Spark on EMR without syncing. EMR runs Apache Livy on port 8998 by default. It is a REST web service which allows you to submit jobs via a REST API. You can pass the same spark-submit parameters with a curl script from your machine. See the docs.
For interactive development we have also configured locally running Jupyter notebooks which automatically submit cell runs to Livy. This is done via the sparkmagic project.
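As a rough sketch of such a REST submission against Livy's batch endpoint (the master host name and the S3 path of the script are assumptions; Livy must be reachable from your machine, e.g. through an SSH tunnel):
# Hypothetical Livy batch submission: ask the EMR master's Livy service to run
# a PySpark script that is already stored on S3. Host and file path are assumptions.
import requests

livy_url = "http://EMR_MASTER_IP:8998/batches"  # assumed reachable from here
payload = {
    "file": "s3://my-bucket/spark_jobs/my-job.py",  # assumed script location
    "name": "my-job-py",
}
resp = requests.post(livy_url, json=payload)
print(resp.status_code, resp.json())  # Livy returns the batch id and its state
The returned batch id can then be polled under the same /batches endpoint to follow the job's state.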
According to this Amazon Doc, you can't do that:
Common errors
Standalone mode
Amazon EMR doesn't support standalone mode for Spark. It's not possible to submit a Spark application to a remote Amazon EMR cluster with a command like this:
SparkConf conf = new SparkConf().setMaster("spark://master_url:7077").setAppName("WordCount");
Instead, set up your local machine as explained earlier in this article. Then, submit the application using the spark-submit command.
You can follow the above linked resource to configure your local machine in order to submit Spark jobs to the EMR cluster. Or, more simply, use the SSH key you specified when you created your cluster to connect to the master node and submit Spark jobs from there:
ssh -i ~/path/ssh_key hadoop@<master_ip_address>

handling Remote dependencies for spark-submit in spark 2.3 with kubernetes

I'm trying to run spark-submit against a Kubernetes cluster with the Spark 2.3 Docker container image.
The challenge I'm facing is that the application has a mainapplication.jar and other dependency files and jars which are located in a remote location like AWS S3. Per the Spark 2.3 documentation there is something called a Kubernetes init-container to download remote dependencies, but in this case I'm not creating any PodSpec to include init-containers in Kubernetes; per the documentation, Spark 2.3 on Kubernetes internally creates the pods (driver, executors), so I'm not sure how I can use an init-container for spark-submit when there are remote dependencies.
https://spark.apache.org/docs/latest/running-on-kubernetes.html#using-remote-dependencies
Please suggest
It works as it should with s3a:// URLs. Unfortunately, getting s3a running on the stock spark-hadoop2.7.3 build is problematic (authentication mainly), so I opted for building Spark with Hadoop 2.9.1, since S3A has seen significant development there.
I have created a gist with the steps needed to
build spark with new hadoop dependencies
build the docker image for k8s
push image to ECR
The script also creates a second Docker image with the S3A dependencies added and base conf settings for enabling S3A using IAM credentials, so running in AWS doesn't require putting an access/secret key in conf files or args.
I haven't run any production Spark jobs yet using the image, but I have tested that basic saving and loading to s3a URLs does work.
I have yet to experiment with S3Guard which uses DynamoDB to ensure that S3 writes/reads are consistent - similarly to EMRFS
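For reference, a minimal sketch of the kind of check described above, assuming the hadoop-aws jars are on the Spark classpath and credentials come from the instance's IAM role or the default AWS provider chain (bucket name and path are placeholders):
# Hypothetical s3a smoke test from a PySpark session. Requires hadoop-aws and
# its matching AWS SDK on the classpath; bucket name and path are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("s3a-smoke-test")
    # Use the default credential chain (env vars, IAM instance role, ...).
    .config("spark.hadoop.fs.s3a.aws.credentials.provider",
            "com.amazonaws.auth.DefaultAWSCredentialsProviderChain")
    .getOrCreate()
)

df = spark.range(100)
df.write.mode("overwrite").parquet("s3a://my-bucket/tmp/s3a-smoke-test")
print(spark.read.parquet("s3a://my-bucket/tmp/s3a-smoke-test").count())  # expect 100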
The Init container is created automatically for you by Spark.
For example, you can use
kubectl describe pod [name of your driver pod]
and you'll see the init container named spark-init.
You can also access the logs from the init-container via a command like:
kubectl logs [name of your driver pod] -c spark-init
Caveat: I'm not running in AWS, but on a custom K8s cluster. My init-container successfully downloads dependencies from an HTTP server (but, strangely, not from S3).

How to setup Spark Notebook in Hue under Cloudera Quickstart?

I was recently trying to set up the Spark notebook in the Hue UI. I am running Cloudera CDH 5.8 in VirtualBox. The Spark notebook runs on the Livy server, so I installed Livy. I also removed spark from the blacklist in the hue.ini file.
But still, I do not get the Spark notebook in the Hue UI.
Update: now I can access the notebook. However, I cannot submit Spark jobs to the cluster. I have tried several scripts; only Impala and Hive scripts work, but R, PySpark, or Scala scripts do not. I get the following errors.
Can somebody help me figure out the problem? I can provide more information if needed.
Thank you.
Thanks to Romainr, I managed to run the Spark notebook in Hue. Now I am facing issues submitting jobs to Apache Spark, which is running in Cloudera Manager on the same localhost. The errors are shown in the following screenshots. Any help will be much appreciated. Thank you.
Error: Spark session could not be created in cluster: timeout
"Session '-1' not found." (error 404)
If you run a PySpark notebook from Hue, it times out because it cannot access the resources.
In fact, if you try to run the pyspark or scala command from the command line interface, you will see some errors.
When you get the timeout error from the Hue notebook, look into the log and you will find permission-denied issues.
So in order to grant access, do the following (run in a Linux shell):
$ sudo -u hdfs hadoop fs -chmod 777 /user/spark
$ sudo -u spark hadoop fs -chmod 777 /user/spark/applicationHistory
After this, if you restart the Hue and Spark services in CDH and create a PySpark or Scala notebook from Hue, it should run out of the box.
If you still get errors, let me know.

Spark UI on AWS EMR

I am running an AWS EMR cluster with Spark (1.3.1) installed via the EMR console dropdown. Spark is up and processing data, but I am trying to find out which port has been assigned to the Web UI. I've tried port forwarding both 4040 and 8080 with no connection. I'm forwarding like so:
ssh -i ~/KEY.pem -L 8080:localhost:8080 hadoop@EMR_DNS
1) How do I find out what the Spark WebUI's assigned port is?
2) How do I verify the Spark WebUI is running?
Spark on EMR is configured for YARN, so the Spark UI is available via the application URL provided by the YARN Resource Manager (http://spark.apache.org/docs/latest/monitoring.html). The easiest way to get to it is to set up your browser with SOCKS using a port opened by SSH, then from the EMR console open the Resource Manager and click the Application Master URL shown to the right of the running application. The Spark History Server is available at the default port 18080.
Example of socks with EMR at http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-web-interfaces.html
Here is an alternative if you don't want to deal with the browser SOCKS setup suggested in the EMR docs.
Open an SSH tunnel to the master node with port forwarding to the machine running the Spark UI:
ssh -i path/to/aws.pem -L 4040:SPARK_UI_NODE_URL:4040 hadoop@MASTER_URL
MASTER_URL (EMR_DNS in the question) is the URL of the master node, which you can get from the EMR Management Console page for the cluster.
SPARK_UI_NODE_URL can be seen near the top of the stderr log. The log line will look something like:
16/04/28 21:24:46 INFO SparkUI: Started SparkUI at http://10.2.5.197:4040
Point your browser to localhost:4040
Tried this on EMR 4.6 running Spark 2.6.1
Glad to announce that this feature is finally available on AWS. You won't need to run any special commands (or configure an SSH tunnel):
By clicking on the link to the Spark History Server UI, you'll be able to see old application logs or access the running Spark job's UI:
For more details: https://docs.aws.amazon.com/emr/latest/ManagementGuide/app-history-spark-UI.html
I hope it helps!
Just run the following command:
ssh -i /your-path/aws.pem -N -L 20888:ip-172-31-42-70.your-region.compute.internal:20888 hadoop@ec2-xxx.compute.amazonaws.com.cn
There are 3 places you need to change:
your .pem file
your internal master node IP
your public DNS domain.
Finally, in the YARN UI you can click your Spark application's Tracking URL, then just replace the URL:
"http://your-internal-ip:20888/proxy/application_1558059200084_0002/"
->
"http://localhost:20888/proxy/application_1558059200084_0002/"
This worked for EMR 5.x.
Simply use an SSH tunnel.
On your local machine, run:
ssh -i /path/to/pem -L 3000:ec2-xxxxcompute-1.amazonaws.com:8088 hadoop@ec2-xxxxcompute-1.amazonaws.com
Then in your local machine's browser, open:
localhost:3000
