Single node standalone Spark? - apache-spark

Is there any way I can install and use Spark without Hadoop/HDFS on a single node? I just want to try some simple examples and I would like to keep it as light as possible.

Yes, it is possible. Spark still needs the Hadoop client libraries underneath, but those are bundled in the prebuilt packages, so just download any prebuilt Spark release, for example:
wget http://www.apache.org/dyn/closer.lua/spark/spark-1.6.0/spark-1.6.0-bin-hadoop2.4.tgz
tar -xvzf spark-1.6.0-bin-hadoop2.4.tgz
cd spark-1.6.0-bin-hadoop2.4/bin
./spark-shell
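spark-shell starts with a local master by default, so no Hadoop/HDFS setup is needed. As a quick sanity check you can also run one of the bundled examples in local mode (a rough sketch, run from the top of the extracted directory):
./bin/run-example SparkPi 10
./bin/spark-shell --master "local[2]"    # then e.g. sc.parallelize(1 to 100).sum()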

Related

Building Spark with provided Hadoop

I've been trying to build a custom Spark build with a custom built Hadoop (I need to apply a patch to Hadoop 2.9.1 that allows me to use S3Guard on paths that start with s3://).
Here is how I build Spark in my Dockerfile, after cloning the repo and checking out Spark 2.3.1:
ARG HADOOP_VER=2.9.1
RUN bash -c \
"MAVEN_OPTS='-Xmx2g -XX:ReservedCodeCacheSize=512m' \
./dev/make-distribution.sh \
--name hadoop${HADOOP_VER} \
--tgz \
-Phadoop-provided \
-Dhadoop.version=${HADOOP_VER} \
-Phive \
-Phive-thriftserver \
-Pkubernetes"
This compiles successfully, but when I try to use Spark with s3:// paths I still get an error from the Hadoop code that I'm sure my patch removed before compiling. So as far as I can tell, that Spark build is not using the Hadoop JARs I provide.
What is the right way to compile Spark so that it does not include the Hadoop JARs and instead uses the ones I provide?
Note: I run in standalone mode and I set SPARK_DIST_CLASSPATH=$(hadoop classpath) so that it points to my Hadoop classpath.
For custom Hadoop versions, you need to get your own artifacts onto the local machines, and into the Spark tar file which is distributed around the cluster (usually in HDFS) and downloaded when the workers are deployed (in YARN; no idea about k8s).
The best way to do this reliably is to locally build a hadoop release with a new version number, and build spark against that.
dev/make-distribution.sh -Phive -Phive-thriftserver -Pyarn -Pkubernetes -Phadoop-3.1 -Phadoop-cloud -Dhadoop.version=2.9.3-SNAPSHOT
That will create a spark distro with the hadoop-aws and matching SDK which you have built up.
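Concretely, a rough sketch of the "local Hadoop with a new version number" step (the version string 2.9.1-patched is just an illustrative choice, and building Hadoop has its own prerequisites such as protoc):
# in the patched Hadoop source tree: give it a private version and install it to ~/.m2
mvn versions:set -DnewVersion=2.9.1-patched
mvn install -DskipTests -Dmaven.javadoc.skip=true
# in the Spark source tree: build the distribution against that private version
./dev/make-distribution.sh --name hadoop2.9.1-patched --tgz \
  -Phive -Phive-thriftserver -Pyarn -Pkubernetes -Phadoop-cloud \
  -Dhadoop.version=2.9.1-patched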
It's pretty slow: run nailgun/zinc if you can for some speedup. If you refer to a version which is also in the public repos, there's a high chance that whatever cached copies sit in your local Maven repo (~/.m2/repository) have crept in instead.
Then: bring up spark shell and test from there, before trying any more complex setups.
Finally, there is an open JIRA for S3Guard to not worry about s3 vs s3a in URLs. Is that your patch? If not, does it work? We could probably get it into future Hadoop releases if the people who need it are happy.

Cloudera Quick Start VM lacks Spark 2.0 or greater

To test and learn Spark features, developers need the latest Spark version, since many APIs and methods from before version 2.0 are obsolete and no longer work in newer releases. This is a real obstacle: developers are forced to install Spark manually, which wastes a considerable amount of development time.
How do I use a later version of Spark on the Quickstart VM?
So that no one else wastes the setup time I did, here is the solution.
SPARK 2.2 Installation Setup on Cloudera VM
Step 1: Download a QuickStart VM from Cloudera's downloads page.
Prefer the VMware image as it is easy to use, though all the platform options are viable.
The entire tar file is around 5.4 GB. You need to provide a business email address, as personal email addresses are not accepted.
Step 2: The virtual machine needs around 8 GB of RAM; allocate enough memory to avoid performance problems.
Step 3: Open a terminal and switch to the root user:
su root
password: cloudera
Step 4: The VM ships with Java 1.7.0_67, which is old and does not meet our needs. To avoid Java-related exceptions, install Java 8 with the following commands:
Downloading Java:
wget -c --header "Cookie: oraclelicense=accept-securebackup-cookie" http://download.oracle.com/otn-pub/java/jdk/8u131-b11/d54c1d3a095b4ff2b6607d096fa80163/jdk-8u131-linux-x64.tar.gz
Switch to the /usr/java/ directory with "cd /usr/java/".
Copy the downloaded Java tar file into /usr/java/.
Untar it with "tar -zxvf jdk-8u131-linux-x64.tar.gz".
Open the profile file with the command “vi ~/.bash_profile”
Export JAVA_HOME to point to the new Java directory:
export JAVA_HOME=/usr/java/jdk1.8.0_131
Save and Exit.
To apply the change in the current shell, run:
source ~/.bash_profile
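If "java -version" still reports 1.7 afterwards, it may also help to put the new JDK's bin directory first on the PATH (an optional extra line in ~/.bash_profile):
export PATH=$JAVA_HOME/bin:$PATH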
The Cloudera VM ships with Spark 1.6 by default. However, the 1.6 APIs are old and do not match most production environments, so we download and manually install Spark 2.2.
Switch to the /opt/ directory with the command:
cd /opt/
Download spark with the command:
wget https://d3kbcqa49mib13.cloudfront.net/spark-2.2.0-bin-hadoop2.7.tgz
Untar the spark tar with the following command:
tar -zxvf spark-2.2.0-bin-hadoop2.7.tgz
We need to define some environment variables as default settings:
Please open a file with the following command:
vi /opt/spark-2.2.0-bin-hadoop2.7/conf/spark-env.sh
Paste the following configurations in the file:
SPARK_MASTER_IP=192.168.50.1
SPARK_EXECUTOR_MEMORY=512m
SPARK_DRIVER_MEMORY=512m
SPARK_WORKER_MEMORY=512m
SPARK_DAEMON_MEMORY=512m
Save and exit
We need to start spark with the following command:
/opt/spark-2.2.0-bin-hadoop2.7/sbin/start-all.sh
Export SPARK_HOME:
export SPARK_HOME=/opt/spark-2.2.0-bin-hadoop2.7/
Change the permissions of the directory:
chmod 777 -R /tmp/hive
Try "spark-shell"; it should now work.
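As an extra sanity check (assuming SPARK_HOME is exported as above), the following should report version 2.2.0 and run a bundled example:
$SPARK_HOME/bin/spark-submit --version
$SPARK_HOME/bin/run-example SparkPi 10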

Is it necessary to install hadoop in /usr/local?

I am trying to build a hadoop cluster with four nodes.
The four machines are from my school's lab, and I found that their /usr/local directories are mounted from the same shared disk, which means every node's /usr/local is identical.
The problem is that I cannot start the DataNodes on the slaves because the Hadoop files (like tmp/dfs/data) are always the same on every node.
I am planning to configure and install Hadoop in another directory such as /opt.
The problem is that almost all the installation tutorials ask us to install it in /usr/local, so I was wondering whether there will be any bad consequences if I install Hadoop somewhere else, like /opt.
Btw, I am using Ubuntu 16.04
As long as HADOOP_HOME points to where you extracted the hadoop binaries, then it shouldn't matter.
You'll also want to update PATH in ~/.bashrc, for example:
export HADOOP_HOME=/path/to/hadoop_x.yy
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
For reference, I have some configuration files inside of /etc/hadoop.
(Note: Apache Ambari makes installation easier)
It is not at all necessary to install Hadoop under /usr/local. That location is typically used for single-node Hadoop clusters (although even then it is not mandatory). As long as you have the following variables set in .bashrc, any location should work.
export HADOOP_HOME=<path-to-hadoop-install-dir>
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
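For example, a minimal sketch of installing under /opt instead (the Hadoop version and tarball name here are placeholders):
sudo tar -xzf hadoop-2.7.3.tar.gz -C /opt
echo 'export HADOOP_HOME=/opt/hadoop-2.7.3' >> ~/.bashrc
echo 'export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin' >> ~/.bashrc
source ~/.bashrc
hadoop version    # should print the version you extracted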

Cassandra service not starting

I'm trying to start Cassandra but don't see any response.
hduser@vagrant:~/cassandra/apache-cassandra-2.1.12-src/bin$ cassandra -f
hduser@vagrant:~/cassandra/apache-cassandra-2.1.12-src/bin$
I have downloaded cassandra in this folder -
/home/hduser/cassandra/apache-cassandra-2.1.12-src
It looks like you downloaded the source distribution of Cassandra; try downloading the binary ('bin') distribution instead from the download page.
Otherwise, you can compile the source distribution using Apache Ant by running ant from the root directory of the extracted archive and then running Cassandra as usual.
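If you do want to build from the source tree, a minimal sketch of that route (assuming Apache Ant and a JDK are already installed):
cd ~/cassandra/apache-cassandra-2.1.12-src
ant                 # default target compiles the jars into build/
bin/cassandra -f    # then start Cassandra in the foreground as before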

How to upgrade Spark to newer version?

I have a virtual machine which has Spark 1.3 on it, but I want to upgrade it to Spark 1.5, primarily due to certain functionality that is not in 1.3. Is it possible to upgrade the Spark version from 1.3 to 1.5, and if yes, how can I do that?
Pre-built Spark distributions, like the one I believe you are using based on another question of yours, are rather straightforward to "upgrade", since Spark is not actually "installed". Actually, all you have to do is:
Download the appropriate Spark distro (pre-built for Hadoop 2.6 and later, in your case)
Unpack the tar file in the appropriate directory (i.e. where the folder spark-1.3.1-bin-hadoop2.6 already is)
Update your SPARK_HOME (and possibly some other environment variables depending on your setup) accordingly
Here is what I just did myself, to go from 1.3.1 to 1.5.2, in a setting similar to yours (vagrant VM running Ubuntu):
1) Download the tar file in the appropriate directory
vagrant@sparkvm2:~$ cd $SPARK_HOME
vagrant@sparkvm2:/usr/local/bin/spark-1.3.1-bin-hadoop2.6$ cd ..
vagrant@sparkvm2:/usr/local/bin$ ls
ipcluster ipcontroller2 iptest ipython2 spark-1.3.1-bin-hadoop2.6
ipcluster2 ipengine iptest2 jsonschema
ipcontroller ipengine2 ipython pygmentize
vagrant@sparkvm2:/usr/local/bin$ sudo wget http://apache.tsl.gr/spark/spark-1.5.2/spark-1.5.2-bin-hadoop2.6.tgz
[...]
vagrant@sparkvm2:/usr/local/bin$ ls
ipcluster ipcontroller2 iptest ipython2 spark-1.3.1-bin-hadoop2.6
ipcluster2 ipengine iptest2 jsonschema spark-1.5.2-bin-hadoop2.6.tgz
ipcontroller ipengine2 ipython pygmentize
Notice that the exact mirror you should use with wget will probably be different from mine, depending on your location; you will get it by clicking the "Download Spark" link on the download page, after you have selected the package type to download.
2) Unpack the tgz file with
vagrant@sparkvm2:/usr/local/bin$ sudo tar -xzf spark-1.*.tgz
vagrant@sparkvm2:/usr/local/bin$ ls
ipcluster ipcontroller2 iptest ipython2 spark-1.3.1-bin-hadoop2.6
ipcluster2 ipengine iptest2 jsonschema spark-1.5.2-bin-hadoop2.6
ipcontroller ipengine2 ipython pygmentize spark-1.5.2-bin-hadoop2.6.tgz
You can see that now you have a new folder, spark-1.5.2-bin-hadoop2.6.
3) Update accordingly SPARK_HOME (and possibly other environment variables you are using) to point to this new directory instead of the previous one.
And you should be done, after restarting your machine.
Notice that:
You don't need to remove the previous Spark distribution, as long as all the relevant environment variables point to the new one. That way, you may even quickly move "back-and-forth" between the old and new version, in case you want to test things (i.e. you just have to change the relevant environment variables).
sudo was necessary in my case; it may be unnecessary for you depending on your settings.
After ensuring that everything works fine, it's a good idea to delete the downloaded tgz file.
You can use the exact same procedure to upgrade to future versions of Spark, as they come out (rather fast). If you do this, either make sure that previous tgz files have been deleted, or modify the tar command above to point to a specific file (i.e. no * wildcards as above).
Set your SPARK_HOME to /opt/spark
Download the latest pre-built binary, e.g. spark-2.2.1-bin-hadoop2.7.tgz (you can use wget)
Create the symlink to the latest download - ln -s /opt/spark-2.2.1 /opt/spark
Edit files in $SPARK_HOME/conf accordingly
For every new version you download just create the symlink to it (step 3)
ln -s /opt/spark-x.x.x /opt/spark
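Putting the symlink approach together, a consolidated sketch (the archive URL and version are examples; adjust the symlink target to whatever directory the tarball actually extracts to):
export SPARK_HOME=/opt/spark                      # e.g. in ~/.bashrc
cd /opt
wget https://archive.apache.org/dist/spark/spark-2.2.1/spark-2.2.1-bin-hadoop2.7.tgz
tar -xzf spark-2.2.1-bin-hadoop2.7.tgz
ln -s /opt/spark-2.2.1-bin-hadoop2.7 /opt/spark   # repeat this step for each new version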
