Setting up a (Linux) Hadoop cluster - linux

Do you need to set up a Linux cluster first in order to setup a Hadoop cluster ?

No. Hadoop has its own software to manage a "cluster". Just install linux and make sure the machines can talk to each other.

Deploying the Hadoop software, along with the appropriate config files, and starting it on each node (which Hadoop can do automatically) creates the cluster from the Linux machines you have. So, no, by that definition you don't need to have a separate linux cluster. If your question is whether or not you need to have a multiple-machine cluster to use Hadoop: no, you can run Hadoop on a single machine for either testing or small-sized jobs, via either local mode (where everything is confined to a single process) or pseudodistributed mode (where you trick Hadoop into thinking it's running on multiple computers).

Related

Is it possible to run ANY application or program with HADOOP YARN?

I'm studying distributed computing recently and found out Hadoop Yarn is one of them.
So thought if I just establish Hadoop Yarn cluster, then every application will run distributed.
But now someone told me that HADOOP Yarn cannot do anything by itself and need other things like mapreduce, spark, and hbase.
If this is correct, then is that mean only limited tasks can be run with Yarn?
Or can I apply Yarn's distributed computing to all applications I want?
Hadoop is the name which refers to the entire system.
HDFS is the actual storage system. Think of it as S3 or a distributed Linux filesystem.
YARN is a framework for scheduling jobs and allocating resources. It handles these things for you, but you don't interact very much with it.
Spark and MapReduce are managed by Yarn. With these two, you can actually write your code/applications and give work to the cluster.
HBase uses the HDFS storage (with is file based) and provides NoSql storage.
Theoretically you can run more than just Spark and MapReduce on Yarn and you can use something else then Yarn (Kubernetes is in works or supported now). You can even write your own processing tool, queue/resource management system, storage... Hadoop has many pieces which you may use or not, depending on your case. But the majority of Hadoop systems use Yarn and Spark.
If you want to deploy Docker containers for example, just a Kubernetes cluster would be a better choice. If you need batch/real time processing with Spark, use Hadoop.
YARN itself is a resource manager. You will need to write code that can be deployed onto those resources, and then that could do anything, given that the nodes running the tasks are themselves capable of running the job. For example, you cannot distribute a Python library without first installing the dependencies for that script. Mesos is a bit more generalized / accessible than YARN, if you want more flexibility for the same affect.
YARN mostly supports running JAR files, shell scripts (at least, from Oozie) or Docker containers can be deployed to it as well (refer Apache docs)
You may also refer to the Apache Slider or Twill projects for more information.

Spark program difference in local mode and cluster

If i write a spark program and run it in stand alone mode and when I want to deploy it in a cluster, do I have to change my program codes or no change in codes needed? Is spark programming independent of number of clusters?
I don't think you need to make any changes. Your program should run the same way as it run in local mode.
Yes, Spark programs are independent of clusters, until and unless you are using something specific to cluster. Normally this is managed by the YARN.
You just need to set option master to yarn or other resource manager, when you want to run it on cluster.
If you want to run it locally just use local[*] by using the number of threads which equal to your machine cores.

hadoop nodes with Linux + windows

I have 4 windows machines, On which i have installed hadoop on 3 out of 4.
One machine having the Harton work Sandbox ( say for 4-th machine) , Now i need to make the 4th machine as my server ( Name node )
and rest of the machine as slaves.
Whether it will work if i update the configuration files in the rest of 3 machines
Or is there any other way to do this ?
Any other suggestions ?
Thanks
finally i got this but i could not find any help
Hadoop cluster configuration with Ubuntu Master and Windows slave
A non-secure cluster will work (non-secure in the sense that you do not enable Hadoop Kerberos based auth and security, ie. hadoop.security.authentication is left as simple). You need to update all nodes config to point to the new 4th node as the master for various services you plan to host on it. You mention namenode, but I assume you really mean to make the 4th node the 'head' node, meaning it will probably also run resourcemanager and historyserver (or the jobtracker for old-style Hadoop). And that is only core, w/o considering higher level components like Hive, Pig, Oozie etc, and not even mentioning Ambari or Hue.
Doing a post-install configuration of existing Windows (or Linux, makes no difference) nodes is possible, editing the various xx-site.xml files. You'll have to know what to change and is not trivial. Probably it would be much easier to just deploy again the windows machines, with an appropriate clusterproperties.txt config file. See Option III - Manual Install One Node At A Time.

For a single CDH (Hadoop) cluster installation, which host should I use?

I started with a Windows 7 computer, and set up an Ubuntu Linux virtual machine which I run using VirtualBox. The Cloudera Manager Free Edition version 4 has been executed, and I have been following the prompts on localhost:7180.
I am now stuck when the prompt asks me to "Specify hosts for your CDH cluster installation." Can I install all of the Hadoop components, as well as run them, in the linux virtual machine alone?
Please help point me in the right direction in which host I should specify.
Yes, you can run cdh in a linux virtual machine alone. You could do it using "standalone" or "pseudo distributed" modes. IMHO, the most effective method for doing it is to use the "pseudo distributed" mode.
In this case, there are multiple java-virtual-machines (JVM) running, so they simulated as they were a cluster with multiples nodes (each thread simulated to be a cluster node).
Cloudera has documented how to get deployed as "pseudo distributed":
https://www.cloudera.com/documentation/enterprise/5-6-x/topics/cdh_qs_cdh5_pseudo.html
Note: 3 ways for deploying cdh:
standalone: using a machine alone, with a unique jvm
pseudo-distributed: using a machine alone, but several jvm's, so
simulated to be a cluster
distributed: using a cluster, so several
nodes with different purposes (workers, namenode, etc).
you can specify hostname of your machine. it will install everything on your machine only.

can HBase , MapReduce and HDFS can work on a single machine having Hadoop installed and running on it?

I am working on a search engine design, which is to be run on cloud.
We have just started, and have not much idea about Hdoop.
Can anyone tell if HBase , MapReduce and HDFS can work on a single machine having Hdoop installed and running on it ?
Yes you can. You can even create a Virtual Machine and run it on there on a single "computer" (which is what I have :) ).
The key is to simply install Hadoop in "Pseudo Distributed Mode" which is even described in the Hadoop Quickstart.
If you use the Cloudera distribution they have even created the configs needed for that in an RPM. Look here for more info in that.
HTH
Yes. In my development environment, I run
NameNode (HDFS)
SecondaryNameNode (HDFS)
DataNode (HDFS)
JobTracker (MapReduce)
TaskTracker (MapReduce)
Master (HBase)
RegionServer (HBase)
QuorumPeer (ZooKeeper - needed for HBase)
In addition, I run my applications, and map and reduce tasks launched by the task tracker.
Running so many processes on the same machine results in a lot of contention for CPU cores, memory, and disk I/O, so it's definitely not great for high performance, but there is no limitation other than the amount of resources available.
same here, I am running hadoop/hbase/hive on a single computer.
If you really really want to see distributed computing on a single computer, grab lots of RAM, some hard disk space and go like this -
make one or two virtual machines (use virtual box)
install hadoop on each of them, make ur real instalation (not any virtual one) as the master, rest slave
configure hadoop for real distributed environment
now when hadoop starts, you should actually have a cluster of multiple computers (one real, rest virtual)
this could just be an experiment, because unless you have a decent multi-cpu or multi-core system, such a configuration will actually consume more on maintaining itself than giving you any performance.
gud luck.
--l4l

Resources