I was recently trying to set up the Spark Notebook in the Hue UI. I am running Cloudera CDH 5.8 in VirtualBox. The Spark Notebook relies on the Livy server, which I installed, and I also removed spark from the blacklist in the hue.ini file.
But I still do not see the Spark Notebook in the Hue UI.
Update: I can now access the notebook. However, I cannot submit Spark jobs to the cluster. I have tried several scripts; only the Impala and Hive scripts work, while the R, PySpark, and Scala scripts do not. I get the following errors.
Can somebody help me figure out the problem? I can provide more information if needed.
Thank you.
Thanks to Romainr, I managed to get the Spark Notebook running in Hue. Now I am facing issues submitting jobs to Apache Spark, which runs under Cloudera Manager on the same localhost. The errors are shown in the following screenshots. Any help will be much appreciated. Thank you.
Error: Spark session could not be created in cluster: timeout
"Session '-1' not found." (error 404)
If you run a PySpark notebook from Hue, it reports a timeout because it cannot access the required resources.
In fact, if you try to run the pyspark or scala command from the command-line interface, you will see some errors.
When you get the timeout error from the Hue notebook, look into the logs and you will find permission-denied issues.
To grant access, run the following in a Linux shell:
$ sudo -u hdfs hadoop fs -chmod 777 /user/spark
$ sudo -u spark hadoop fs -chmod 777 /user/spark/applicationHistory
After this, restart the Hue and Spark services in CDH and create a PySpark or Scala notebook from Hue; it should run out of the box.
If you still get errors, let me know.
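Once the permissions are fixed and the services are restarted, a minimal cell like the one below should run end to end from a Hue PySpark notebook. This is just a sanity check added for illustration, assuming the notebook's Livy session exposes sc as it normally does:
# Quick sanity check: if the Spark session comes up, this tiny job returns 285.
rdd = sc.parallelize(range(10))
print(rdd.map(lambda x: x * x).sum())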
I've created a Spark session to read a CSV file.
from pyspark.sql import SparkSession
spark = SparkSession.builder.master('local').appName('Saurabh').getOrCreate()
books_df = spark.read.csv('hdfs://<VMware-ip-address>:8020/booksData/books.csv', header=True)
While executing, it throws an error like:
62.csv. : java.net.ConnectException: Call From <windows-ip> to <VMware-ip> failed on connection exception
All Hadoop daemons are up and running fine. I used the
start-all.sh
command to bring the Hadoop daemons up, the
jps
command to check which Java services are running,
and the hostname -i command to get the host address.
I'm not sure whether the port I'm using is open for external connections. I've tried several other ports as well, but no luck.
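To rule out a closed port before debugging Spark itself, a quick TCP probe from the Windows side can show whether the NameNode port is reachable at all. This is only an illustrative check; the IP below is a placeholder for the VMware address:
import socket

# Placeholder for the VMware guest IP; 8020 is the NameNode RPC port used above.
NAMENODE_HOST = "192.168.56.101"
NAMENODE_PORT = 8020

try:
    # Open a plain TCP connection to the NameNode port.
    with socket.create_connection((NAMENODE_HOST, NAMENODE_PORT), timeout=5):
        print("NameNode port is reachable")
except OSError as exc:
    # Refused or timed out: HDFS is not listening on an externally visible
    # interface, or the VM network/firewall is blocking the connection.
    print("Cannot reach {}:{} -> {}".format(NAMENODE_HOST, NAMENODE_PORT, exc))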
I have:
Apache Spark : 2.4.4
JupyterHub : 1.1.0
Helm chart version : 0.9.0
K8S : 1.15
I built JupyterHub on K8s with the official doc: https://zero-to-jupyterhub.readthedocs.io/
I use the official Spark image to do some local jobs: jupyter/all-spark-notebook:latest
Spark works well in local mode.
But I want to use a JupyterHub notebook to run jobs on a remote (homemade) Apache Spark cluster (with K8s as the orchestrator).
I already tried Apache Zeppelin and it works well, but I want to do the same thing with JupyterHub.
How can I do this?
I understand your pain.
I burned a lot of time getting a Spark cluster + Jupyter server to work.
Try using my docker-compose.yaml:
docker-compose up -d
To get the Jupyter token, run:
docker-compose logs jupyter
Copy the URL starting with 127.0.0.1 (including the token) and paste it into your browser, changing the port to 7777.
You will see an empty Jupyter page.
Create a new notebook and run a cell like the one in the picture (a sketch of such a cell is below).
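Since the picture isn't reproduced here, the following is only a sketch of what that cell typically looks like; the master URL spark://spark-master:7077 is an assumed docker-compose service name, so adjust it to match the compose file:
from pyspark.sql import SparkSession

# "spark-master" is an assumed service name from the docker-compose setup;
# change it to whatever the compose file actually calls the Spark master.
spark = (SparkSession.builder
         .master("spark://spark-master:7077")
         .appName("jupyter-test")
         .getOrCreate())

# Trivial job to confirm the executors are reachable from the notebook.
print(spark.sparkContext.parallelize(range(100)).sum())  # expected: 4950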
Enjoy using jupyter with spark...
Hope it helps you.
We use Python with the PySpark API to run simple code on a Spark cluster.
from pyspark import SparkContext, SparkConf
conf = SparkConf().setAppName('appName').setMaster('spark://clusterip:7077')
sc = SparkContext(conf=conf)
rdd = sc.parallelize([1, 2, 3, 4])
rdd.map(lambda x: x**2).collect()
It works when we set up a Spark cluster locally and with Docker.
We would now like to start an EMR cluster and test the same code, but it seems that PySpark can't connect to the Spark cluster on EMR.
We opened ports 8080 and 7077 from our machine to the Spark master.
We get past the firewall, but it seems that nothing is listening on port 7077, so we get connection refused.
We found this, which explains how to submit a job using the CLI, but we need to run it directly from the PySpark API on the driver.
What are we missing here?
How can one start an EMR cluster and actually run PySpark code locally in Python against that cluster?
Edit: running this code from the master itself works.
Contrary to what was suggested, when connecting to the master over SSH and running Python from the terminal, the very same code (with the master IP adjusted, since it's the same machine) works. No issues, no problems.
How does this make sense given the documentation that clearly states otherwise?
You are trying to run pyspark (which calls spark-submit) from a remote computer outside the Spark cluster. This is technically possible, but it is not the intended way of deploying applications. In YARN mode, it makes your computer participate in the Spark protocol as a client, so it would require opening several ports and installing exactly the same Spark JARs as on AWS EMR.
From the spark-submit docs:
A common deployment strategy is to submit your application from a gateway machine that is physically co-located with your worker machines (e.g. Master node in a standalone EC2 cluster)
A simple deployment strategy is:
Sync the code to the master node via rsync, scp, or git:
cd ~/projects/spark-jobs # on local machine
EMR_MASTER_IP='255.16.17.13'
TARGET_DIR=spark_jobs
rsync -avze "ssh -i ~/dataScienceKey.pem" --rsync-path="mkdir -p ${TARGET_DIR} && rsync" --delete ./ hadoop@${EMR_MASTER_IP}:${TARGET_DIR}
SSH to the master node:
ssh -i ~/dataScienceKey.pem hadoop@${EMR_HOST}
Run spark-submit on the master node:
cd spark_jobs
spark-submit --master yarn --deploy-mode cluster my-job.py
# my-job.py
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("my-job-py").getOrCreate()
sc = spark.sparkContext
rdd = sc.parallelize([1, 2, 3, 4])
res = rdd.map(lambda x: x**2).collect()
print(res)
There is a way to submit the job directly to Spark on EMR without syncing. EMR runs Apache Livy on port 8998 by default. It is a REST web service that lets you submit jobs via a REST API. You can pass the same spark-submit parameters with a curl request from your machine. See the docs.
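As a rough illustration of that REST call (the host and the S3 path below are placeholders, and port 8998 must be reachable from your machine), a batch submission from Python might look like this:
import json
import requests

# Placeholder EMR master address; Livy listens on port 8998 by default on EMR.
LIVY_URL = "http://<emr-master-dns>:8998"

# The application file must already be readable by the cluster (e.g. on S3 or HDFS).
payload = {"file": "s3://my-bucket/jobs/my-job.py", "name": "my-job-py"}

resp = requests.post(LIVY_URL + "/batches",
                     data=json.dumps(payload),
                     headers={"Content-Type": "application/json"})
batch = resp.json()
print(batch["id"], batch["state"])

# Poll the batch state until Livy reports success or failure.
print(requests.get("{}/batches/{}/state".format(LIVY_URL, batch["id"])).json())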
For interactive development, we have also configured locally running Jupyter notebooks that automatically submit cell runs to Livy. This is done via the sparkmagic project.
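For reference, the notebook side of that setup usually boils down to a few magics; the following is a sketch under the assumption that sparkmagic is installed locally and the Livy endpoint is the EMR master on port 8998:
# Cell 1: load the sparkmagic extension in a local IPython/Jupyter notebook.
%load_ext sparkmagic.magics

# Cell 2: open the endpoint/session manager widget and add the Livy endpoint,
# e.g. http://<emr-master-dns>:8998 (placeholder for the EMR master address).
%manage_spark

# Cell 3: once a session exists, %%spark cells run on the cluster.
%%spark
print(sc.parallelize(range(10)).sum())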
According to this Amazon Doc, you can't do that:
Common errors
Standalone mode
Amazon EMR doesn't support standalone mode for Spark. It's not possible to submit a Spark application to a remote Amazon EMR cluster with a command like this:
SparkConf conf = new SparkConf().setMaster("spark://master_url:7077").setAppName("WordCount");
Instead, set up your local machine as explained earlier in this article. Then, submit the application using the spark-submit command.
You can follow the above-linked resource to configure your local machine to submit Spark jobs to the EMR cluster. Or, more simply, use the SSH key you specified when you created the cluster to connect to the master node and submit Spark jobs from there:
ssh -i ~/path/ssh_key hadoop@<master_ip_address>
I am new to Spark. I want to submit a Spark job from my local machine to a remote EMR cluster.
I am following the link here to set up all the prerequisites: https://aws.amazon.com/premiumsupport/knowledge-center/emr-submit-spark-job-remote-cluster/
Here is the command:
spark-submit --class mymain --deploy-mode client --master yarn myjar.jar
Issue: the SparkSession creation never finishes, and no error is reported. It seems like an access issue.
From the AWS document, we know that when the master is set to yarn, YARN uses the config files I copied from EMR (yarn-site.xml) to find the master and the worker nodes.
As my EMR cluster is located in a VPC, which needs a special SSH config to access, how can I add this information so that YARN can reach the remote cluster and submit the job?
I think the resolution proposed in the AWS link amounts to creating a local Spark setup with all the dependencies.
If you don't want to do a local Spark setup, there are easier options; you can use:
1. Livy: for this, your EMR setup should have Livy installed. Check this, this, this, and you should be able to infer the rest from this.
2. EMR SSH: this requires the aws-cli installed locally, plus the cluster ID and the .pem file used when creating the EMR cluster. Check this.
E.g. aws emr ssh --cluster-id j-3SD91U2E1L2QX --key-pair-file ~/.ssh/mykey.pem --command 'your-spark-submit-command' (this prints the command output on the console, though).
I want to run my Spark application on my Hortonworks Data Platform. Since in this setup I don't have a standalone Spark master, I want to run as a YARN client.
I am trying to create the SparkSession like this:
SparkSession
.builder()
.master("yarn-client")
.appName("my-app")
.getOrCreate()
I know I am missing some properties to tell the Spark client where my YARN server is running, but I can't seem to find those properties.
Currently the app just hangs at initialization with no error or exception.
Any ideas what I am missing?
It looks like you're trying to run your app locally while your Hortonworks HDP is somewhere else.
Unlike Spark standalone and Mesos modes, in which the master’s address is specified in the --master parameter, in YARN mode the ResourceManager’s address is picked up from the Hadoop configuration.
So your app should be run from Hortonworks itself, which has all the Hadoop configuration in place.
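Concretely, "the Hadoop configuration" here means the directory that HADOOP_CONF_DIR / YARN_CONF_DIR points to. As a rough sketch (kept in PySpark for consistency with the rest of this thread, with a placeholder config path), copying the cluster's *-site.xml files and pointing those variables at them is what lets a plain master("yarn") session find the ResourceManager:
import os
from pyspark.sql import SparkSession

# Placeholder path: a directory containing core-site.xml, hdfs-site.xml and
# yarn-site.xml copied from the cluster. Spark reads the ResourceManager
# address from these files, so no explicit master host is needed.
os.environ["HADOOP_CONF_DIR"] = "/path/to/cluster-conf"
os.environ["YARN_CONF_DIR"] = "/path/to/cluster-conf"

spark = (SparkSession.builder
         .master("yarn")          # "yarn-client" is the deprecated spelling of yarn + client deploy mode
         .appName("my-app")
         .getOrCreate())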