Connecting SparkR to the spark cluster - apache-spark

I have a Spark cluster running on 10 machines (1 - 10), with the master on machine 1. All of them run CentOS 6.4.
I am trying to connect a JupyterHub installation (running inside an Ubuntu Docker container because of issues installing it on CentOS) to the cluster using SparkR, and get the Spark context.
The code I am using is:
Sys.setenv(SPARK_HOME="/usr/local/spark-1.4.1-bin-hadoop2.4")
library(SparkR)
sc <- sparkR.init(master="spark://<master-ip>:7077")
The output I get is
attaching package: ‘SparkR’
The following object is masked from ‘package:stats’:
filter
The following objects are masked from ‘package:base’:
intersect, sample, table
Launching java with spark-submit command spark-submit sparkr-shell /tmp/Rtmpzo6esw/backend_port29e74b83c7b3
Error in sparkR.init(master = "spark://10.10.5.51:7077"): JVM is not ready after 10 seconds
Error in sparkRSQL.init(sc): object 'sc' not found
I am using Spark 1.4.1. The Spark cluster is also running CDH 5.
The JupyterHub installation can connect to the cluster via PySpark, and I have Python notebooks which use PySpark.
Can someone tell me what I am doing wrong?

I have a similar problem and have been searching all around, but have found no solution. Can you please tell me what you mean by "jupyterhub installation (which is running inside a ubuntu docker because of issues with installing on CentOS)"?
We also have 4 clusters on CentOS 6.4. Another problem of mine is how to use an IDE like IPython or RStudio to interact with these 4 servers. Do I use my laptop to connect to these servers remotely (if yes, then how?), and if not, what other solution is there?
Now, to answer your question, I can give it a try. I think you have to use the --yarn-cluster option, as stated here. I hope this helps you solve the problem.
Cheers,
Ashish
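A note for anyone debugging the same "JVM is not ready after 10 seconds" error: two things worth ruling out from inside the container are whether SPARK_HOME really contains bin/spark-submit, and whether the master port can be reached at all. The Python sketch below is only a diagnostic under assumptions. The helper names and the MASTER_IP_TO_TEST variable are hypothetical, the path and port 7077 are taken from the question, and passing both checks does not guarantee that sparkR.init will succeed.

```python
import os
import socket


def has_spark_submit(spark_home):
    """Return True if <spark_home>/bin/spark-submit exists as a file."""
    return os.path.isfile(os.path.join(spark_home, "bin", "spark-submit"))


def port_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


if __name__ == "__main__":
    # Default path comes from the question; adjust it to your own setup.
    spark_home = os.environ.get("SPARK_HOME", "/usr/local/spark-1.4.1-bin-hadoop2.4")
    print("spark-submit found:", has_spark_submit(spark_home))

    # MASTER_IP_TO_TEST is a made-up variable name used only by this sketch.
    master_host = os.environ.get("MASTER_IP_TO_TEST")
    if master_host:
        print("master port 7077 reachable:", port_reachable(master_host, 7077))
```

If spark-submit is not found, the R process has nothing to launch, which is one plausible way to end up waiting for a JVM that never starts.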

Related

Pyspark: - Failed to initialise Spark session (Another SparkContext is being constructed)

Hi, I am pretty new to Spark. I want to use PySpark to stream data from Kafka to MongoDB, but I am not able to run pyspark: every time I run it in the terminal it gives the following error. I have deleted and reinstalled Java, Kafka, Scala and PySpark multiple times without resolving it. I found a few suggested fixes and tried them, but none of them worked. If I run spark-shell in the terminal, it works but gives a warning.
And here are the PySpark and Java versions that I have right now:
If you have a solution for this, please help me with it; I have hit a wall with this error.
Hey guys, if you are facing the same issue, you can do what I did.
I removed all of Spark, Scala and Java, and also PySpark.
Then I reinstalled Spark with: brew reinstall apache-spark
After that you can use pyspark or spark-shell to run it again.
It worked for me: my spark-shell was also giving an error, so reinstalling Apache Spark solved it.

Can PySpark work without Spark?

I have installed PySpark standalone/locally (on Windows) using
pip install pyspark
I was a bit surprised that I can already run pyspark from the command line or use it in Jupyter notebooks, and that it does not need a proper Spark installation (e.g. I did not have to do most of the steps in this tutorial: https://medium.com/#GalarnykMichael/install-spark-on-windows-pyspark-4498a5d8d66c ).
Most of the tutorials that I run into say one needs to "install Spark before installing PySpark". That would agree with my view of PySpark being basically a wrapper over Spark. But maybe I am wrong here - can someone explain:
what is the exact connection between these two technologies?
why is installing PySpark enough to make it run? Does it actually install Spark under the hood? If yes, where?
if you install only PySpark, is there something you miss (e.g. I cannot find the sbin folder which contains e.g. script to start history server)
As of v2.2, executing pip install pyspark will install Spark.
If you're going to use Pyspark it's clearly the simplest way to get started.
On my system Spark is installed inside my virtual environment (miniconda) at lib/python3.6/site-packages/pyspark/jars
The PySpark installed by pip corresponds to a subfolder of the full Spark distribution. You can find most of the PySpark Python files in spark-3.0.0-bin-hadoop3.2/python/pyspark. So if you'd like to use the Java or Scala interface, or deploy a distributed system with Hadoop, you must download the full Spark distribution from Apache Spark and install it.
PySpark comes with a Spark installation bundled. If it was installed through pip3, you can find it with pip3 show pyspark. For example, for me it is at ~/.local/lib/python3.8/site-packages/pyspark.
This is a standalone configuration, so it can't be used for managing clusters the way a full Spark installation can.
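To see what a pip install actually put on disk, something like the following Python sketch can help. It only uses the standard library, the function names are hypothetical, and the folder names it checks are assumptions based on this thread (jars and sbin are mentioned above; bin is added as a guess at what else is typically present).

```python
import importlib.util
from pathlib import Path


def find_pyspark_dir():
    """Return the installed pyspark package directory, or None if absent."""
    spec = importlib.util.find_spec("pyspark")
    if spec is None or spec.origin is None:
        return None
    # spec.origin points at pyspark/__init__.py; its parent is the package dir.
    return Path(spec.origin).parent


def spark_layout(root):
    """Report which of the usual Spark folders exist directly under root."""
    root = Path(root)
    return {name: (root / name).is_dir() for name in ("jars", "bin", "sbin")}


if __name__ == "__main__":
    pyspark_dir = find_pyspark_dir()
    if pyspark_dir is None:
        print("pyspark is not installed in this environment")
    else:
        print(pyspark_dir)
        print(spark_layout(pyspark_dir))
```

Running the same spark_layout function against an unpacked full Spark download gives a quick side-by-side view of which folders the pip package leaves out.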

How to set up PySpark with Zeppelin on Windows 10

I have had difficulties installing Zeppelin 0.7.2
Using the version of Spark that comes with Zeppelin 0.7.2, I can run Spark code, but I am unable to run %pyspark code, even after modifying the Python environment variables to point to where Python is installed (Python was installed using Anaconda).
%python code works fine.
If anyone can help resolve this issue I would be grateful. (The odd thing is that I have done the same installation on another Windows 10 laptop, and there pyspark does execute.)
The error I get is: pyspark is not responding

Running R on amazon EMR with spark 1.6 and Zeppelin 0.5.6

I am trying to set up the R interpreter to run in Zeppelin, which is currently running on EMR. Zeppelin is working perfectly and I am able to write scripts in Scala and Python. When I use %r, %sparkR or %knitr I receive an error: "r interpreter not found"
The applications which I have running in my emr-4.7.2 cluster are: Hive 1.0.0, Zeppelin-Sandbox 0.5.6, Spark 1.6.2, Pig 0.14.0
Within the interpreter settings there is no mention of R, so I figure I am missing something, but I do not know what.
Any pointers greatly appreciated.
Zeppelin on Amazon EMR (up to at least emr-5.0.0) does not support the SparkR interpreter.
You should consult the Elastic MapReduce Release Guide / Zeppelin documentation for more information.

Hadoop multi-node cluster manual installation over Ubuntu 14.04

I am a newcomer to Hadoop. For my college project we are given 4 VMs. I need to configure a multi-node Hadoop cluster on them (1 master, 3 slaves) and run my web app on it. I will be using HBase in my project. Usually CentOS is used for installation and deployment of HDP, whereas I was given Ubuntu. I cannot use the Apache Ambari plugin for installation, as it is not supported on Ubuntu. I need to deploy manually, hence I tried looking for tutorials.
I looked for a tutorial on installing HDP multi-node clusters on Ubuntu and found this [http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/]
But it's too outdated (2010).
I also have the official documentation here, and I tried following it, but I am not able to follow it properly:
[http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.1-latest/bk_installing_manually_book/content/rpm-chap2-3.html]
Could someone suggest some up-to-date links, ideally a tutorial with a decent number of screenshots, for installing multi-node clusters on Ubuntu 14.04 (12.04 is also fine)?
Thanks a lot!!
The Michael Noll tutorial is too old, I think. I found this site:
https://www.digitalocean.com/community/tutorials/how-to-install-hadoop-on-ubuntu-13-10
I have a mini cluster (with 5 slaves and a master) in my university lab, running Ubuntu 12.04 and Hadoop 2.5.0. Furthermore, I have a VM cluster on my laptop (2 slaves and a master) with Hadoop 1.2.1, also on Ubuntu 12.04.
But I couldn't install Hadoop (any version) on Ubuntu 14.04. I don't remember the cause, but I think it was some problem with the Java version (I didn't check that).
I hope that helps you!
I came across the same issue when installing HDP 2.2 on Ubuntu 14.04, and found a solution.
I documented everything here: http://www.swiss-scalability.com/2014/12/install-hdp-22-on-ubuntu-1404-trusty.html
In a nutshell, the magic happens here:
sed -e "s/14.04/12.04/g" -i /etc/*-release
And then you can install or restart ambari-agent; it will be able to communicate with ambari-server.
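Since the sed command above edits the release files in place, it may be worth previewing what it would touch first. The Python sketch below is a read-only approximation, not the author's method: the function names are hypothetical, and it matches the literal string 14.04, whereas the unescaped dot in the sed pattern matches any character.

```python
import glob


def spoof_release_text(text, old="14.04", new="12.04"):
    """Replace every literal occurrence of old with new."""
    return text.replace(old, new)


def preview(pattern="/etc/*-release"):
    """Print the lines that would change, without writing anything."""
    for path in sorted(glob.glob(pattern)):
        try:
            with open(path, encoding="utf-8", errors="replace") as handle:
                lines = handle.read().splitlines()
        except OSError:
            # Skip unreadable entries (for example directories).
            continue
        for line in lines:
            changed = spoof_release_text(line)
            if changed != line:
                print(f"{path}: {line} -> {changed}")


if __name__ == "__main__":
    preview()
```

Keeping a backup copy of the matching files before running the real sed command makes the change easy to undo later.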
