Can't run Apache Spark on Mac - apache-spark

I installed Apache Spark via Homebrew, but when I try to use spark-submit I get the error shown in the attached screenshot.

Related

Can PySpark work without Spark?

I have installed PySpark standalone/locally (on Windows) using
pip install pyspark
I was a bit surprised that I can already run pyspark from the command line or use it in Jupyter Notebooks, and that it does not need a proper Spark installation (e.g. I did not have to do most of the steps in this tutorial: https://medium.com/@GalarnykMichael/install-spark-on-windows-pyspark-4498a5d8d66c ).
Most of the tutorials that I run into say one needs to "install Spark before installing PySpark". That would agree with my view of PySpark being basically a wrapper over Spark. But maybe I am wrong here - can someone explain:
what is the exact connection between these two technologies?
why is installing PySpark enough to make it run? Does it actually install Spark under the hood? If yes, where?
if you install only PySpark, is there something you are missing (e.g. I cannot find the sbin folder, which contains e.g. the script to start the history server)?
As of v2.2, executing pip install pyspark will install Spark.
If you're going to use PySpark, it's clearly the simplest way to get started.
On my system Spark is installed inside my virtual environment (miniconda) at lib/python3.6/site-packages/pyspark/jars
The PySpark installed by pip is a subset of the full Spark distribution. You can find most of the PySpark Python files in spark-3.0.0-bin-hadoop3.2/python/pyspark. So if you'd like to use the Java or Scala interfaces, or deploy a distributed system with Hadoop, you must download the full Spark distribution from the Apache Spark site and install it.
PySpark comes with a Spark installation bundled. If installed through pip3, you can find it with pip3 show pyspark. For example, for me it is at ~/.local/lib/python3.8/site-packages/pyspark.
This is a standalone configuration so it can't be used for managing clusters like a full Spark installation.
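To see this for yourself, here is a small sketch (assuming a pip-installed PySpark, version 2.2 or later; the file name is just an example) that prints where the bundled Spark lives and starts a purely local session:

# check_pyspark.py -- minimal check of a pip-installed PySpark
import os
import pyspark
from pyspark.sql import SparkSession

# The pip package bundles the Spark jars under the package directory itself.
pkg_dir = os.path.dirname(pyspark.__file__)
print("PySpark package location:", pkg_dir)
print("Bundled jars:", os.path.join(pkg_dir, "jars"))

# A local[*] master runs everything in-process; no separate Spark download is needed.
spark = SparkSession.builder.master("local[*]").appName("pip-pyspark-check").getOrCreate()
print("Spark version:", spark.version)
spark.stop()

As the answers above note, this local setup is fine for development, but the cluster-management scripts under sbin/ only come with the full download.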

Apache Spark, Error on sbt/assembly

I tried to install Apache Spark on my local Linux Mint machine, but when I run
sbt assembly
I get an error like the one below:
How to fix this?
Please advise
From your picture, that is the Spark pre-built binary folder.
This means you don't need to build anything; just use it by running bin/spark-shell or bin/spark-submit.
If you want to compile Spark yourself, you need to download the source code instead.
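If the goal was just to run a job from the pre-built binaries rather than compile Spark, a tiny PySpark script submitted with bin/spark-submit is enough to confirm the download works. A sketch (the file name pi_estimate.py is just an example):

# pi_estimate.py -- run from the extracted folder with: bin/spark-submit pi_estimate.py
from random import random
from operator import add
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("pi-estimate")
sc = SparkContext(conf=conf)

n = 100000  # number of random samples

def inside(_):
    # Count points that fall inside the unit quarter-circle.
    x, y = random(), random()
    return 1 if x * x + y * y <= 1 else 0

count = sc.parallelize(range(n), 4).map(inside).reduce(add)
print("Pi is roughly %f" % (4.0 * count / n))
sc.stop()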

Connecting SparkR to the spark cluster

I have a spark cluster running on 10 machines (1 - 10) with the master at machine 1. All of these run on CentOS 6.4.
I am trying to connect a JupyterHub installation (which is running inside an Ubuntu Docker container because of issues with installing it on CentOS), using SparkR, to the cluster and get the Spark context.
The code I am using is
Sys.setenv(SPARK_HOME="/usr/local/spark-1.4.1-bin-hadoop2.4")
library(SparkR)
sc <- sparkR.init(master="spark://<master-ip>:7077")
The output I get is
attaching package: ‘SparkR’
The following object is masked from ‘package:stats’:
filter
The following objects are masked from ‘package:base’:
intersect, sample, table
Launching java with spark-submit command spark-submit sparkr-shell /tmp/Rtmpzo6esw/backend_port29e74b83c7b3
Error in sparkR.init(master = "spark://10.10.5.51:7077"): JVM is not ready after 10 seconds
Error in sparkRSQL.init(sc): object 'sc' not found
I am using Spark 1.4.1. The spark cluster is also running CDH 5.
The jupyterhub installation can connect to the cluster via pyspark and I have python notebooks which use pyspark.
Can someone tell me what I am doing wrong?
I have a similar problem and have been searching all around, but found no solutions. Can you please tell me what you mean by "JupyterHub installation (which is running inside an Ubuntu Docker container because of issues with installing it on CentOS)"?
We have 4 clusters too, on CentOS 6.4. Another problem of mine is how to use an IDE like IPython or RStudio to interact with these 4 servers. Do I use my laptop to connect to these servers remotely (if yes, then how?), and if not, what could be another solution?
Now, to answer your question, I can give it a try. I think you have to use the --yarn-cluster option, as stated here. I hope this helps you solve the problem.
Cheers,
Ashish
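Since the question notes that pyspark notebooks on the same JupyterHub already reach the cluster, one way to narrow this down to SparkR (rather than networking or the master URL) is to run a minimal PySpark connection test first. A sketch for the Spark 1.4.x API, with <master-ip> to be filled in as in the SparkR code:

# Minimal PySpark connection test against the standalone master (Spark 1.4.x API)
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("spark://<master-ip>:7077")  # same master URL as in sparkR.init()
        .setAppName("connection-test"))
sc = SparkContext(conf=conf)

# If this prints 10, the driver reached the master and executors started.
print(sc.parallelize([1, 2, 3, 4]).sum())
sc.stop()

If this succeeds but sparkR.init() still times out with "JVM is not ready after 10 seconds", the problem is more likely in how SparkR launches spark-submit inside the container (e.g. SPARK_HOME or the spark-submit path) than in the cluster itself.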

Do I need Hadoop on Windows to connect to HBase running on Linux?

Do I need Hadoop on my Windows machine to connect to HBase running on Ubuntu with Hadoop?
My HBase is running fine on my Ubuntu machine. I am able to connect with Eclipse on the same machine (I am using Kundera to connect to HBase). Now I want to connect to HBase from my Windows 7 Eclipse IDE. Do I need to install Hadoop on Windows to connect to the remote HBase on Ubuntu? When I tried, I got something like this:
Failed to locate the winutils binary in the hadoop binary path
IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
All you need are the Hadoop and HBase jars, and then a Configuration object initialized
with:
1. hbase.zookeeper.quorum (the ZooKeeper quorum, if it is a cluster) and the other connection details
2. hbase.zookeeper.property.clientPort
3. zookeeper.znode.parent
Then get the connection with the above config object.
This problem usually occurs with Hadoop 2.x.x versions. One option is to build a Windows distribution for your Hadoop version.
Refer this link:
http://www.srccodes.com/p/article/38/build-install-configure-run-apache-hadoop-2.2.0-microsoft-windows-os
But before building, try the zip file given in this link:
http://www.srccodes.com/p/article/39/error-util-shell-failed-locate-winutils-binary-hadoop-binary-path
Extract this zip file and copy the files under hadoop-common-2.2.0/bin into the $HADOOP_HOME/bin directory.
Note: for me this works even for Hadoop 2.5.
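If any of this ends up being driven from Python on the same Windows machine (for example a PySpark or other Hadoop-based client), the equivalent of the fix above can also be applied from code by pointing HADOOP_HOME at the folder whose bin subfolder holds winutils.exe, before the Hadoop libraries load. A minimal sketch, where C:\hadoop is a hypothetical extraction path:

# Sketch: make winutils.exe discoverable on Windows before Hadoop-based libraries start.
import os

# Hypothetical path -- use wherever you extracted the zip; bin\winutils.exe must exist under it.
os.environ["HADOOP_HOME"] = r"C:\hadoop"
# Some tools also expect winutils.exe to be found on PATH.
os.environ["PATH"] = os.environ["PATH"] + os.pathsep + r"C:\hadoop\bin"

For the Eclipse/Kundera setup in the question itself, setting HADOOP_HOME as a Windows environment variable achieves the same thing.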

Hadoop multi-node cluster manual installation over Ubuntu 14.04

I am a newcomer to Hadoop. For my college project we were given 4 VMs. I need to configure a multi-node Hadoop cluster on these (1 master, 3 slaves) and run my webapp on it. I will be using HBase in my project. Usually CentOS is used for installation and deployment of HDP, whereas I was given Ubuntu. I cannot use the Apache Ambari plugin for installation, as it is not supported on Ubuntu. I need to deploy everything manually, hence I tried looking for tutorials.
I looked for a tutorial to install HDP multi-node clusters on Ubuntu and found this: [http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/]
But it's too outdated (2010).
I have the official documentation here, but I am not able to follow it properly.
[http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.1-latest/bk_installing_manually_book/content/rpm-chap2-3.html] and I tried following them.
Could someone suggest some up-to-date links, ideally a tutorial with a decent number of screenshots, for installing multi-node clusters on Ubuntu 14.04 (12.04 is also fine)?
Thanks a lot!!
The Michael Noll tutorial is too old, I think. I found this site:
https://www.digitalocean.com/community/tutorials/how-to-install-hadoop-on-ubuntu-13-10
I have a mini cluster (with 5 slaves and a master) in my university lab, running Ubuntu 12.04 and Hadoop 2.5.0. Furthermore, I have a VM cluster on my laptop (2 slaves and a master) with Hadoop 1.2.1, also on Ubuntu 12.04.
But I couldn't install Hadoop (any version) on Ubuntu 14.04. I don't remember the cause, but I think it was some problem with the Java version (I didn't check).
I hope that helps you!
I came across the same issue installing HDP 2.2 on Ubuntu 14.04, and found a solution.
I documented everything here: http://www.swiss-scalability.com/2014/12/install-hdp-22-on-ubuntu-1404-trusty.html
In a nutshell, the magic happens here:
sed -e "s/14.04/12.04/g" -i /etc/*-release
And then you can install or restart ambari-agent; it will be able to communicate with ambari-server.
