Hadoop nodes with Linux + Windows

I have 4 Windows machines, and I have installed Hadoop on 3 of the 4.
The fourth machine is running the Hortonworks Sandbox. Now I need to make that 4th machine my server (NameNode)
and the rest of the machines slaves.
Will it work if I just update the configuration files on the other 3 machines,
or is there some other way to do this?
Any other suggestions?
Thanks
I finally found this, but it did not help:
Hadoop cluster configuration with Ubuntu Master and Windows slave

A non-secure cluster will work (non-secure in the sense that you do not enable Hadoop's Kerberos-based authentication and security, i.e. hadoop.security.authentication is left as simple). You need to update the configuration on all nodes to point to the new 4th node as the master for the various services you plan to host on it. You mention the namenode, but I assume you really mean to make the 4th node the 'head' node, meaning it will probably also run the resourcemanager and historyserver (or the jobtracker for old-style Hadoop). And that is only the core, without considering higher-level components like Hive, Pig, Oozie etc., and not even mentioning Ambari or Hue.
Doing a post-install reconfiguration of the existing Windows (or Linux, it makes no difference) nodes is possible by editing the various xx-site.xml files. You'll have to know what to change, and it is not trivial. It would probably be much easier to just redeploy the Windows machines with an appropriate clusterproperties.txt config file. See Option III - Manual Install One Node At A Time.
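For orientation, here is a minimal sketch of the properties involved; the hostname head-node.example.com and the default ports are illustrative assumptions, not values from the question. These are the same keys you would edit in core-site.xml, yarn-site.xml and mapred-site.xml on every node, shown here through Hadoop's Java Configuration API:

import org.apache.hadoop.conf.Configuration;

public class HeadNodeConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // HDFS: every node and client must point at the NameNode on the new head node
        conf.set("fs.defaultFS", "hdfs://head-node.example.com:8020");

        // YARN: the ResourceManager also lives on the head node
        conf.set("yarn.resourcemanager.hostname", "head-node.example.com");

        // MapReduce: run on YARN and find the JobHistory server on the head node
        conf.set("mapreduce.framework.name", "yarn");
        conf.set("mapreduce.jobhistory.address", "head-node.example.com:10020");

        // Print the effective values to double-check what the cluster would see
        for (String key : new String[] {"fs.defaultFS", "yarn.resourcemanager.hostname",
                "mapreduce.framework.name", "mapreduce.jobhistory.address"}) {
            System.out.println(key + " = " + conf.get(key));
        }
    }
}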

Related

How to sync configuration between Hadoop worker machines

We have a huge Hadoop cluster on which we installed one Presto coordinator node
and 850 Presto worker nodes. Now we want to change some values in the file config.properties, but this has to be done on all the workers.
Under
/opt/DBtasks/presto/presto-server-0.216/etc
the file looks like this:
[root@worker01 etc]# more config.properties
#
coordinator=false
http-server.http.port=8008
query.max-memory=50GB
query.max-memory-per-node=1GB
query.max-total-memory-per-node=2GB
discovery.uri=http://master01.sys76.com:8008
and we want to change it to
coordinator=false
http-server.http.port=8008
query.max-memory=500GB
query.max-memory-per-node=5GB
query.max-total-memory-per-node=20GB
discovery.uri=http://master01.sys76.com:8008
But this was done only on the first node, worker01; we need to do it on all the other workers as well. We could copy the file to all the other workers with scp, but not when root access is restricted. What I want to know is whether Presto already offers a more elegant approach that syncs the configuration across all worker nodes, since, as we all know, after setting the new values we also need to restart the Presto launcher script.
Does Presto have a solution for this?
I should mention that root access on my cluster is restricted, so we can't copy the files via SSH.
Presto does not have the ability to sync the configurations. This is something you would need to manage outside of Presto, e.g. using a tool like Ansible. There is also the command-line tool presto-admin (https://github.com/prestosql/presto-admin) that can assist with deploying the configs across the cluster.
Additionally, if you are using public clouds such as AWS, there are commercial solutions from Starburst (https://www.starburstdata.com/) that can assist with managing the configurations as well.
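Presto will not push the file for you, but as a rough sketch of the kind of automation this answer points at (the values simply mirror the ones from the question, and the staging directory is an arbitrary choice), you could render the updated config.properties once and let Ansible or presto-admin copy it to every worker and restart the launcher:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

public class RenderWorkerConfig {
    public static void main(String[] args) throws IOException {
        // Desired worker configuration, copied from the question;
        // only the three memory settings change.
        List<String> lines = List.of(
                "coordinator=false",
                "http-server.http.port=8008",
                "query.max-memory=500GB",
                "query.max-memory-per-node=5GB",
                "query.max-total-memory-per-node=20GB",
                "discovery.uri=http://master01.sys76.com:8008");

        // Write the rendered file to a local staging directory; a deployment
        // tool then copies it to /opt/DBtasks/presto/presto-server-0.216/etc
        // on each worker and restarts the Presto launcher.
        Path staging = Paths.get("staging");
        Files.createDirectories(staging);
        Path out = staging.resolve("config.properties");
        Files.write(out, lines);
        System.out.println("Wrote " + out.toAbsolutePath());
    }
}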

Typical Hadoop setup for remote job submission

I am still fairly new to Hadoop and am currently in the process of setting up a small test cluster on Amazon AWS. My question is about how to structure the cluster so that it is possible to submit jobs to it from remote machines.
Currently I have 5 machines. 4 of them basically form the Hadoop cluster (NameNode, YARN, etc.), and one machine is used as a manager machine (Cloudera Manager). I am going to describe my thinking on the setup, and if anyone can chime in on the points I am not clear about, that would be great.
I was thinking about the best setup for a small cluster, and decided to expose only the manager machine and submit all jobs through it. The other machines would see each other, but not be accessible from the outside world. I have a conceptual idea of how to do this, but I am not sure how to properly go about it, so if anyone could point me in the right direction that would be great.
Another big point is that I want to be able to submit jobs to the cluster through the exposed machine from a client machine (which might be Windows). I am not so clear on this setup either. Do I need to have Hadoop installed on the client machine in order to use the normal hadoop commands and to write/submit jobs from, say, Eclipse or something similar?
So to sum it up, my questions are:
Is this an OK setup for a small test cluster?
How can I use one exposed machine to submit/route jobs to the cluster without having any of the Hadoop nodes on it?
How do I set up a client machine to submit jobs to a remote cluster, and is there an example of how to do it on Windows? Also, is there any reason not to use Windows as a client machine in this setup?
Thanks, I would greatly appreciate any advice or help on this.
Since this has not been answered, I will attempt to answer it.
1. REST API to submit an application:
Resource 1(Cluster Applications API(Submit Application)): https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Applications_APISubmit_Application
Resource 2: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.5/bk_yarn-resource-management/content/ch_yarn_rest_apis.html
Resource 3: https://hadoop-forum.org/forum/general-hadoop-discussion/miscellaneous/2136-how-can-i-run-mapreduce-job-by-rest-api
Resource 4: Run a MapReduce job via rest api
2. Submitting a Hadoop job from a client machine
Resource 1: https://pravinchavan.wordpress.com/2013/06/18/submitting-hadoop-job-from-client-machine/
3. Sending a program to a remote Hadoop cluster
It is possible to send a program to a remote Hadoop cluster and run it there. All you need to ensure is that you have set the ResourceManager address, fs.defaultFS, the library files, and mapreduce.framework.name correctly before submitting the actual job; a sketch follows the resource link below.
Resource 1: (how to submit mapreduce job with yarn api in java)
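To make point 3 concrete, here is a hedged sketch of a client-side driver; the hostname gateway.example.com, the ports, and the WordCount-style mapper/reducer are placeholders chosen for illustration, not details from the question. With these properties set, the job is shipped to the remote cluster when waitForCompletion() is called:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class RemoteSubmit {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Addresses of the exposed head/gateway node (placeholders).
        conf.set("fs.defaultFS", "hdfs://gateway.example.com:8020");
        conf.set("mapreduce.framework.name", "yarn");
        conf.set("yarn.resourcemanager.address", "gateway.example.com:8032");
        conf.set("yarn.resourcemanager.scheduler.address", "gateway.example.com:8030");

        // Helpful when submitting from a Windows client to a Linux cluster.
        conf.set("mapreduce.app-submission.cross-platform", "true");

        Job job = Job.getInstance(conf, "remote-wordcount");
        // Ship this class's jar to the cluster instead of assuming it is installed there.
        job.setJarByClass(RemoteSubmit.class);
        job.setMapperClass(TokenCounterMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}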

For a single CDH (Hadoop) cluster installation, which host should I use?

I started with a Windows 7 computer and set up an Ubuntu Linux virtual machine, which I run using VirtualBox. I have run Cloudera Manager Free Edition version 4 and have been following the prompts on localhost:7180.
I am now stuck when the prompt asks me to "Specify hosts for your CDH cluster installation." Can I install all of the Hadoop components, as well as run them, in the Linux virtual machine alone?
Please point me in the right direction as to which host I should specify.
Yes, you can run CDH in a Linux virtual machine alone, using either "standalone" or "pseudo-distributed" mode. IMHO, the most effective method is the "pseudo-distributed" mode.
In this case there are multiple Java virtual machines (JVMs) running, simulating a cluster with multiple nodes (each JVM acts as if it were a separate cluster node).
Cloudera has documented how to deploy CDH in "pseudo-distributed" mode:
https://www.cloudera.com/documentation/enterprise/5-6-x/topics/cdh_qs_cdh5_pseudo.html
Note: there are 3 ways to deploy CDH:
standalone: a single machine with a single JVM
pseudo-distributed: a single machine, but several JVMs, so it simulates a cluster
distributed: an actual cluster, i.e. several nodes with different purposes (workers, namenode, etc.)
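As a quick illustration of pseudo-distributed mode (assuming the NameNode in the VM listens on the common default hdfs://localhost:8020; check fs.defaultFS in your core-site.xml), a client running inside the same VM talks to the "cluster" exactly as it would to a real multi-node one:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PseudoDistributedCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address for a pseudo-distributed install.
        conf.set("fs.defaultFS", "hdfs://localhost:8020");

        try (FileSystem fs = FileSystem.get(conf)) {
            // Listing the root directory confirms the HDFS daemons are up,
            // even though every "node" is just another JVM on this one machine.
            for (FileStatus status : fs.listStatus(new Path("/"))) {
                System.out.println(status.getPath());
            }
        }
    }
}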
You can also just specify the hostname of your own machine; it will then install everything on that machine only.

Can HBase, MapReduce and HDFS work on a single machine with Hadoop installed and running on it?

I am working on a search engine design, which is to be run in the cloud.
We have just started, and do not have much experience with Hadoop yet.
Can anyone tell me if HBase, MapReduce and HDFS can work on a single machine that has Hadoop installed and running on it?
Yes, you can. You can even create a virtual machine and run it all on there on a single "computer" (which is what I have :) ).
The key is to simply install Hadoop in "Pseudo Distributed Mode", which is even described in the Hadoop Quickstart.
If you use the Cloudera distribution, they have even created the configs needed for that in an RPM. Look here for more info on that.
HTH
Yes. In my development environment, I run
NameNode (HDFS)
SecondaryNameNode (HDFS)
DataNode (HDFS)
JobTracker (MapReduce)
TaskTracker (MapReduce)
Master (HBase)
RegionServer (HBase)
QuorumPeer (ZooKeeper - needed for HBase)
In addition, I run my applications, and map and reduce tasks launched by the task tracker.
Running so many processes on the same machine results in a lot of contention for CPU cores, memory, and disk I/O, so it's definitely not great for high performance, but there is no limitation other than the amount of resources available.
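For a small sanity check of such a single-machine stack (this uses the newer HBase client API with ConnectionFactory and assumes ZooKeeper is on its default port 2181; adjust if your setup differs), a client only has to point at localhost:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class SingleMachineHBaseCheck {
    public static void main(String[] args) throws Exception {
        // HDFS, MapReduce, HBase and ZooKeeper all run on this host,
        // so the client simply points at localhost.
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "localhost");
        conf.set("hbase.zookeeper.property.clientPort", "2181");

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Admin admin = connection.getAdmin()) {
            // Listing tables confirms the HBase Master and RegionServer are reachable.
            for (TableName table : admin.listTableNames()) {
                System.out.println(table.getNameAsString());
            }
        }
    }
}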
Same here, I am running Hadoop/HBase/Hive on a single computer.
If you really, really want to see distributed computing on a single computer, grab lots of RAM and some hard disk space and go like this:
make one or two virtual machines (use VirtualBox)
install Hadoop on each of them, making your real installation (not a virtual one) the master and the rest slaves
configure Hadoop for a real distributed environment
now when Hadoop starts, you should actually have a cluster of multiple computers (one real, the rest virtual)
This should only be an experiment though, because unless you have a decent multi-CPU or multi-core system, such a configuration will consume more resources maintaining itself than it gives you in performance.
Good luck.
--l4l

Setting up a (Linux) Hadoop cluster

Do you need to set up a Linux cluster first in order to set up a Hadoop cluster?
No. Hadoop has its own software to manage a "cluster". Just install Linux and make sure the machines can talk to each other.
Deploying the Hadoop software, along with the appropriate config files, and starting it on each node (which Hadoop can do automatically) creates the cluster from the Linux machines you have. So, no, by that definition you don't need to have a separate Linux cluster. If your question is whether or not you need a multiple-machine cluster to use Hadoop: no, you can run Hadoop on a single machine for either testing or small-sized jobs, via either local mode (where everything is confined to a single process) or pseudo-distributed mode (where you trick Hadoop into thinking it's running on multiple computers).
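As a hedged sketch of the difference between those two single-machine options (the values below are the usual Hadoop defaults, not something from the question): local mode keeps everything in one JVM on the local filesystem, while pseudo-distributed mode runs the real daemons, all on localhost:

import org.apache.hadoop.conf.Configuration;

public class SingleMachineModes {
    public static void main(String[] args) {
        // Local (standalone) mode: one process, local filesystem, LocalJobRunner.
        Configuration local = new Configuration();
        local.set("fs.defaultFS", "file:///");
        local.set("mapreduce.framework.name", "local");

        // Pseudo-distributed mode: real HDFS/YARN daemons, all on this one machine.
        Configuration pseudo = new Configuration();
        pseudo.set("fs.defaultFS", "hdfs://localhost:8020");
        pseudo.set("mapreduce.framework.name", "yarn");

        System.out.println("local mode:              " + local.get("fs.defaultFS"));
        System.out.println("pseudo-distributed mode: " + pseudo.get("fs.defaultFS"));
    }
}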
