For a single CDH (Hadoop) cluster installation, which host should I use?

I started on a Windows 7 computer and set up an Ubuntu Linux virtual machine, which I run using VirtualBox. I am running Cloudera Manager Free Edition version 4 and have been following the prompts at localhost:7180.
I am now stuck at the prompt that asks me to "Specify hosts for your CDH cluster installation." Can I install and run all of the Hadoop components in the Linux virtual machine alone?
Please point me in the right direction as to which host I should specify.

Yes, you can run CDH in a Linux virtual machine alone. You could do it in either "standalone" or "pseudo distributed" mode. IMHO, the most effective option is "pseudo distributed" mode.
In that mode there are multiple Java virtual machines (JVMs) running, so they simulate a cluster with multiple nodes (each daemon process acts as if it were a separate cluster node).
Cloudera has documented how to deploy CDH in "pseudo distributed" mode:
https://www.cloudera.com/documentation/enterprise/5-6-x/topics/cdh_qs_cdh5_pseudo.html
Note: there are 3 ways to deploy CDH:
standalone: a single machine, with a single JVM
pseudo-distributed: a single machine, but several JVMs, simulating a cluster
distributed: an actual cluster, i.e. several nodes with different roles (workers, namenode, etc.)
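For example, here is a minimal sketch of a pseudo-distributed install on Ubuntu, along the lines of the linked Cloudera doc (it assumes the CDH 5 package repository has already been added; package and init-script names are the CDH ones):

    # Install the pseudo-distributed configuration package; it pulls in all the Hadoop daemons
    sudo apt-get update
    sudo apt-get install hadoop-conf-pseudo
    # Format HDFS once, then start the HDFS daemons (all on this one machine)
    sudo -u hdfs hdfs namenode -format
    for svc in /etc/init.d/hadoop-hdfs-*; do sudo "$svc" start; done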

You can specify the hostname of your machine; Cloudera Manager will then install everything on your machine only.
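If you are not sure what to type into the wizard, the fully qualified hostname the machine knows itself by is usually what you want:

    # Print this machine's fully qualified hostname (enter this in the wizard)
    hostname -f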

Related

How to add slaves to local mode? How to set up a Spark cluster on Windows 7?

I am able to run Apache Spark on Windows with spark-shell --master local[2]. How can we add slaves to the master node?
I think YARN and Mesos are not available on Windows. What are the steps to set up a Spark cluster on Windows 7?
Switching to a Unix-based system is not an option for us as of now.
I finally found the relevant link for setting up a Spark cluster on Windows:
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-start-master-and-workers-on-Windows-td12669.html
tl;dr You cannot add slaves to local mode.
local mode is the only non-cluster mode, where all Spark services run within a single JVM. Read Master URLs for the other options.
If you want a clustered deployment environment you should use Spark Standalone, Hadoop YARN or Apache Mesos, as described in Cluster Mode Overview. I highly recommend using Spark Standalone first before going into the more advanced cluster managers.
I'm on Mac OS, so I can't be sure the cluster managers work reliably on Windows 7, but I did see Spark Standalone working on Windows. You should use spark-class to start the Master and slaves, as the startup scripts are for Unix OSes.
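A minimal sketch of what that looks like with spark-class (run each line in its own terminal from the Spark installation directory; on Windows invoke the bin\spark-class script instead; the master IP and port here are only examples):

    # Start a standalone Master, then point a Worker at it
    bin/spark-class org.apache.spark.deploy.master.Master --host 192.168.1.10 --port 7077
    bin/spark-class org.apache.spark.deploy.worker.Worker spark://192.168.1.10:7077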

Running remote cqlsh to execute commands on Cassandra Cluster

I have a Cassandra cluster of 6 nodes on my Ubuntu machines, and now I have another machine running Windows Server 2008. I have installed DataStax Apache Cassandra on this new Windows machine, and I want to be able to run all the CQL commands from the Windows machine against the Ubuntu machines. So it's like remote command execution.
I tried opening cqlsh in cmd with the IP of one of my nodes and the port, like cqlsh 192.168.4.7 9160
But I can't seem to make it work. Also, I don't want to add the new machine to my existing cluster. Please suggest.
Provided version 3.1.1 is not supported by this server (supported: 2.0.0, 3.0.5)
Any workaround you could suggest?
Basically, you have two options here. The harder one would be to upgrade your cluster (the tough, long-term solution), but there have been many improvements since 1.2.9 that you could take advantage of, not to mention bugs fixed long ago that you may be running into.
The other, quicker option would be to install 1.2.9 on your Windows machine. Probably the easiest way to do this would be to zip up your Cassandra dir on Ubuntu (minus the data, commitlog, and saved caches dirs, of course), copy it to your Windows machine, and expand it. Then the cqlsh versions would match up, and you could solve your immediate problem.
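A sketch of that copy, assuming a tarball install of 1.2.9 under apache-cassandra-1.2.9 (directory names are examples; check where your install keeps data/commitlog/saved_caches):

    # On an Ubuntu node: pack the install, leaving out the runtime state
    tar czf cassandra-1.2.9.tar.gz \
        --exclude='data' --exclude='commitlog' --exclude='saved_caches' \
        apache-cassandra-1.2.9/
    # After unpacking on the Windows machine, the bundled cqlsh matches the cluster:
    bin/cqlsh 192.168.4.7 9160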

Hadoop nodes with Linux + Windows

I have 4 Windows machines, and I have installed Hadoop on 3 of the 4.
The remaining (4th) machine has the Hortonworks Sandbox. Now I need to make that 4th machine my server (namenode)
and the rest of the machines slaves.
Will it work if I update the configuration files on the other 3 machines,
or is there another way to do this?
Any other suggestions ?
Thanks
I finally found this, but I could not find any help there:
Hadoop cluster configuration with Ubuntu Master and Windows slave
A non-secure cluster will work (non-secure in the sense that you do not enable Hadoop Kerberos-based auth and security, i.e. hadoop.security.authentication is left as simple). You need to update every node's config to point to the new 4th node as the master for the various services you plan to host on it. You mention namenode, but I assume you really mean to make the 4th node the 'head' node, meaning it will probably also run the resourcemanager and historyserver (or the jobtracker for old-style Hadoop). And that is only core Hadoop, without considering higher-level components like Hive, Pig, Oozie etc., and not even mentioning Ambari or Hue.
Doing a post-install configuration of existing Windows (or Linux, makes no difference) nodes is possible by editing the various xx-site.xml files. You'll have to know what to change, and it is not trivial. It would probably be much easier to just deploy the Windows machines again with an appropriate clusterproperties.txt config file. See Option III - Manual Install One Node At A Time.
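To give an idea of the kind of xx-site.xml edits involved, a hedged sketch (the hostname and the exact set of properties are examples; what you actually need depends on your distro and services):

    <!-- core-site.xml: point HDFS clients and daemons at the new head node -->
    <property>
      <name>fs.defaultFS</name>
      <value>hdfs://node4.example.com:8020</value>
    </property>
    <!-- yarn-site.xml: point NodeManagers at the new ResourceManager -->
    <property>
      <name>yarn.resourcemanager.hostname</name>
      <value>node4.example.com</value>
    </property>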

Can HBase, MapReduce and HDFS work on a single machine with Hadoop installed and running on it?

I am working on a search engine design, which is to be run on the cloud.
We have just started, and do not have much idea about Hadoop.
Can anyone tell me if HBase, MapReduce and HDFS can work on a single machine with Hadoop installed and running on it?
Yes, you can. You can even create a virtual machine and run it there on a single "computer" (which is what I have :) ).
The key is to simply install Hadoop in "Pseudo Distributed Mode", which is even described in the Hadoop Quickstart.
If you use the Cloudera distribution, they have even created the configs needed for that in an RPM. Look here for more info on that.
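For what it's worth, on an RPM-based distro the Cloudera package shipping those ready-made pseudo-distributed configs is installed roughly like this (package name as in the CDH docs of that era; treat it as an example):

    # Install Hadoop plus the prebuilt pseudo-distributed configuration
    sudo yum install hadoop-0.20-conf-pseudo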
HTH
Yes. In my development environment, I run
NameNode (HDFS)
SecondaryNameNode (HDFS)
DataNode (HDFS)
JobTracker (MapReduce)
TaskTracker (MapReduce)
Master (HBase)
RegionServer (HBase)
QuorumPeer (ZooKeeper - needed for HBase)
In addition, I run my applications, and map and reduce tasks launched by the task tracker.
Running so many processes on the same machine results in a lot of contention for CPU cores, memory, and disk I/O, so it's definitely not great for high performance, but there is no limitation other than the amount of resources available.
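If you want to confirm that all of those daemons are up on the one machine, jps lists the running JVMs by class name:

    # Lists JVM processes; expect NameNode, SecondaryNameNode, DataNode,
    # JobTracker, TaskTracker, HMaster, HRegionServer and HQuorumPeer
    jps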
Same here; I am running Hadoop/HBase/Hive on a single computer.
If you really, really want to see distributed computing on a single computer, grab lots of RAM and some hard disk space, and go like this (a small config sketch follows the list) -
make one or two virtual machines (use VirtualBox)
install Hadoop on each of them; make your real installation (not a virtual one) the master, and the rest slaves
configure Hadoop for a real distributed environment
now when Hadoop starts, you should actually have a cluster of multiple computers (one real, the rest virtual)
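A rough sketch of the "configure for a real distributed environment" step for classic Hadoop 1.x (hostnames are examples; the master is whichever host fs.default.name and mapred.job.tracker point to):

    # conf/slaves lists the worker nodes (here, the two VMs)
    printf "vm1\nvm2\n" > $HADOOP_HOME/conf/slaves
    # Run on the master: starts the HDFS and MapReduce daemons across all listed nodes
    $HADOOP_HOME/bin/start-all.sh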
This could just be an experiment, because unless you have a decent multi-CPU or multi-core system, such a configuration will actually spend more on maintaining itself than on giving you any performance.
Good luck.
--l4l

Setting up a (Linux) Hadoop cluster

Do you need to set up a Linux cluster first in order to setup a Hadoop cluster ?
No. Hadoop has its own software to manage a "cluster". Just install Linux and make sure the machines can talk to each other.
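"Talk to each other" in practice mostly means passwordless SSH from the master to every node, since Hadoop's start scripts use SSH; a typical setup sketch (user and hostname are examples):

    # Generate a key once on the master, push it to each node, then verify
    ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
    ssh-copy-id hadoop@node1
    ssh hadoop@node1 hostname   # should print node1 with no password prompt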
Deploying the Hadoop software, along with the appropriate config files, and starting it on each node (which Hadoop can do automatically) creates the cluster from the Linux machines you have. So, no, by that definition you don't need a separate Linux cluster. If your question is whether or not you need a multiple-machine cluster to use Hadoop: no, you can run Hadoop on a single machine, for either testing or small-sized jobs, via either local mode (where everything is confined to a single process) or pseudo-distributed mode (where you trick Hadoop into thinking it's running on multiple computers).
