Typical Hadoop setup for remote job submission - Linux

I am still a bit new to Hadoop and am currently setting up a small test cluster on Amazon AWS. My question is about how to structure the cluster so that it is possible to submit jobs from remote machines.
Currently I have 5 machines. 4 of them form the Hadoop cluster proper, with the NameNode, YARN, etc. One machine is used as a manager machine (Cloudera Manager). I am going to describe my thinking on the setup, and if anyone can chime in on the points I am not clear about, that would be great.
I was thinking about the best setup for a small cluster, so I decided to expose only the manager machine and use it to submit all the jobs. The other machines will see each other, but will not be accessible from the outside world. I have a conceptual idea of how to do this, but I am not sure how to properly go about it, so if anyone could point me in the right direction that would be great.
Another big point is that I want to be able to submit jobs to the cluster through the exposed machine from a client machine (which might be Windows). I am not clear on this setup either. Do I need to have Hadoop installed on the client machine in order to use the normal hadoop commands and to write/submit jobs, say from Eclipse or something similar?
So to sum it up, my questions are:
1. Is this an OK setup for a small test cluster?
2. How can I go about using one exposed machine to submit/route jobs to the cluster, without having any of the Hadoop nodes on it?
3. How do I set up a client machine to submit jobs to a remote cluster, and is there an example of how to do it on Windows? Also, is there any reason not to use Windows as a client machine in this setup?
Thanks, I would greatly appreciate any advice or help on this.

Since this has not been answered, I will attempt to answer it.
1. REST API to submit an application:
Resource 1(Cluster Applications API(Submit Application)): https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Applications_APISubmit_Application
Resource 2: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.5/bk_yarn-resource-management/content/ch_yarn_rest_apis.html
Resource 3: https://hadoop-forum.org/forum/general-hadoop-discussion/miscellaneous/2136-how-can-i-run-mapreduce-job-by-rest-api
Resource 4: Run a MapReduce job via rest api
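If it helps, here is a rough Java sketch of the two-step REST flow those resources describe: first ask the ResourceManager for a new application id, then POST the application spec. The ResourceManager host/port, the placeholder application id, and the heavily abbreviated JSON payload are all assumptions you would replace with your own values (the full payload format is documented in Resource 1).

```java
import java.io.*;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Rough sketch of the two-step YARN REST submission flow described above.
// Host name, port and the JSON payload are placeholders for your own cluster.
public class YarnRestSubmitSketch {

    private static final String RM = "http://resourcemanager:8088"; // hypothetical RM address

    public static void main(String[] args) throws IOException {
        // Step 1: ask the ResourceManager for a new application id.
        String newAppResponse = post(RM + "/ws/v1/cluster/apps/new-application", "");
        System.out.println("new-application response: " + newAppResponse);
        // Parse "application-id" out of the JSON response with your favourite JSON library.
        String appId = "application_1234567890123_0001"; // placeholder -- use the real id

        // Step 2: submit the application spec (payload heavily abbreviated).
        String payload = "{"
                + "\"application-id\":\"" + appId + "\","
                + "\"application-name\":\"rest-submit-test\","
                + "\"am-container-spec\":{\"commands\":{\"command\":\"{{JAVA_HOME}}/bin/java MyAppMaster\"}},"
                + "\"application-type\":\"YARN\""
                + "}";
        String submitResponse = post(RM + "/ws/v1/cluster/apps", payload);
        System.out.println("submit response: " + submitResponse);
    }

    private static String post(String url, String body) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body.getBytes(StandardCharsets.UTF_8));
        }
        StringBuilder sb = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) sb.append(line);
        }
        return sb.toString();
    }
}
```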
2. Submitting a Hadoop job from a client machine
Resource 1: https://pravinchavan.wordpress.com/2013/06/18/submitting-hadoop-job-from-client-machine/
3. Sending a program to a remote Hadoop cluster
It is possible to send a program to a remote Hadoop cluster and run it there. All you need to ensure is that the resource manager address, fs.defaultFS, library files, and mapreduce.framework.name are set correctly before submitting the actual job.
Resource 1: (how to submit mapreduce job with yarn api in java)
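To make point 3 concrete, here is a minimal client-side sketch using the standard MapReduce Job API. The host names and ports (myhead.example.com, 8020, 8032) are placeholders for your exposed head node, and mapreduce.app-submission.cross-platform is worth setting when submitting from a Windows client to a Linux cluster.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Sketch of a remote job submission: the client only needs the Hadoop client
// libraries plus a Configuration that points at the remote cluster.
// Host names below are placeholders for your exposed head node.
public class RemoteSubmitSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://myhead.example.com:8020");          // NameNode
        conf.set("yarn.resourcemanager.address", "myhead.example.com:8032"); // ResourceManager
        conf.set("mapreduce.framework.name", "yarn");
        // Useful when the client is Windows and the cluster is Linux:
        conf.set("mapreduce.app-submission.cross-platform", "true");

        Job job = Job.getInstance(conf, "remote-submit-test");
        job.setJarByClass(RemoteSubmitSketch.class); // this jar gets shipped to the cluster
        // job.setMapperClass(MyMapper.class);       // plug in your own mapper/reducer here
        // job.setReducerClass(MyReducer.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The same values can instead live in core-site.xml, yarn-site.xml, and mapred-site.xml on the client/gateway machine, in which case a plain hadoop jar command works without any code.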

Related

Does EMR still have any advantages over EC2 for Spark?

I know this question has been asked before, but those answers seem to revolve around Hadoop. For Spark you don't really need all the extra Hadoop cruft. With the spark-ec2 script (available via GitHub for 2.0) your environment is prepared for Spark. Are there any compelling use cases (other than a far superior boto3 SDK interface) for running on EMR over EC2?
This question boils down to the value of managed services, IMHO.
Running Spark standalone in local mode only requires that you grab the latest Spark release, untar it, cd to its bin directory, and run spark-submit, etc.
However, creating a multi-node cluster that runs in cluster mode requires that you actually do real networking, configuring, tuning, etc. This means you've got to deal with IAM roles, Security groups, and there are subnet considerations within your VPC.
When you use EMR, you get a turnkey cluster in which you can 1-click install many popular applications (Spark included). All of the Security Groups are already configured properly for network communication between nodes, logging is already set up and pointing at S3, you've got easy SSH instructions, an already-installed apparatus for tunneling and viewing the various UIs, and visual usage metrics at the IO level, node level, and job-submission level. You also have the ability to create and run Steps, which are jobs that can be run on the command line of the driver node or as Spark applications that leverage the whole cluster. Then, on top of that, you can export the whole cluster, Steps included, copy-paste the CLI script into a recurring job via Data Pipeline, and literally create an ETL pipeline in 60 seconds flat.
You wouldn't get any of that if you built it yourself in EC2. I know which one I would choose... EMR. But that's just me.

How do you install custom software on worker nodes in Azure HDInsight?

I have created an Azure HDInsight cluster using PowerShell. Now I need to install some custom software on the worker nodes that is required for the mappers I will be running using Hadoop streaming. I haven't found any PowerShell command that could help me with this task. I can prepare a custom job that will setup all the workers, but I'm not convinced that this is the best solution. Are there better options?
edit:
With AWS Elastic MapReduce there is an option to install additional software in a bootstrap action that is defined when you create a cluster. I was looking for something similar.
You can use a bootstrap action to install additional software and to change the configuration of applications on the cluster. Bootstrap actions are scripts that are run on the cluster nodes when Amazon EMR launches the cluster. They run before Hadoop starts and before the node begins processing data.
from: Create Bootstrap Actions to Install Additional Software
The short answer is that you don't. It's not ideal from a caching perspective, but you ought to be able to bundle all your job dependencies into the MapReduce jar, which is distributed across the cluster for you by YARN (part of Hadoop). This is, broadly speaking, transparent to the end user, as it's all handled through the job submission process.
If you need something large that is a shared dependency across many jobs, and you don't want it copied out every time, you can keep it on wasb:// storage and reference it in a classpath, but that might cause you some complexity if you are, for instance, using the .NET Streaming API.
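As a rough illustration of that wasb:// approach (the storage account, container, and file names below are placeholders), the standard Job distributed-cache APIs look like this:

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

// Sketch: keep a large shared dependency on the cluster's default storage
// (wasb:// on HDInsight) and pull it onto each node via the distributed cache,
// instead of re-uploading it inside every job jar. Paths are placeholders.
public class SharedDependencySketch {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "job-with-shared-dependency");
        job.setJarByClass(SharedDependencySketch.class);

        // A jar that should end up on every task's classpath.
        job.addFileToClassPath(new Path(
                "wasb://mycontainer@myaccount.blob.core.windows.net/libs/shared-lib.jar"));

        // An arbitrary data file; the #fragment makes it appear in the task's
        // working directory under that name.
        job.addCacheFile(new URI(
                "wasb://mycontainer@myaccount.blob.core.windows.net/data/lookup.dat#lookup.dat"));

        // ... set mapper/reducer, input and output paths, then submit as usual ...
    }
}
```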
I've just heard from a colleague that I need to update my Azure PowerShell, because a new cmdlet, Add-AzureHDInsightScriptAction, was recently added and it does just that.
Customize HDInsight clusters using Script Action

Hadoop nodes with Linux + Windows

I have 4 Windows machines, and I have installed Hadoop on 3 of the 4.
The remaining machine (say the 4th machine) has the Hortonworks Sandbox. Now I need to make that 4th machine my server (NameNode) and the rest of the machines slaves.
Will it work if I just update the configuration files on the other 3 machines, or is there any other way to do this?
Any other suggestions?
Thanks
Finally I found this, but could not find any help there:
Hadoop cluster configuration with Ubuntu Master and Windows slave
A non-secure cluster will work (non-secure in the sense that you do not enable Hadoop's Kerberos-based auth and security, i.e. hadoop.security.authentication is left as simple). You need to update every node's config to point to the new 4th node as the master for the various services you plan to host on it. You mention the namenode, but I assume you really mean to make the 4th node the 'head' node, meaning it will probably also run the resourcemanager and historyserver (or the jobtracker for old-style Hadoop). And that is only the core, without considering higher-level components like Hive, Pig, Oozie, etc., and not even mentioning Ambari or Hue.
Doing a post-install configuration of existing Windows (or Linux, it makes no difference) nodes is possible by editing the various *-site.xml files. You'll have to know what to change, and it is not trivial. It would probably be much easier to just redeploy the Windows machines with an appropriate clusterproperties.txt config file. See Option III - Manual Install One Node At A Time.
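For reference, these are the core properties that have to be repointed at the new head node. They are shown here as Java Configuration keys purely for brevity, with a comment naming the *-site.xml file each one normally lives in; the host name and ports are placeholders, and the exact set depends on which services you actually run on that node.

```java
import org.apache.hadoop.conf.Configuration;

// Rough map of the properties that must point at the new head node.
// "headnode" and the ports are placeholders; in practice you edit the
// corresponding *-site.xml files on every node rather than setting them in code.
public class HeadNodeConfigSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://headnode:8020");            // core-site.xml   -> NameNode
        conf.set("yarn.resourcemanager.hostname", "headnode");       // yarn-site.xml   -> ResourceManager
        conf.set("mapreduce.jobhistory.address", "headnode:10020");  // mapred-site.xml -> JobHistory server
        conf.set("mapreduce.framework.name", "yarn");                // mapred-site.xml

        for (String key : new String[] {"fs.defaultFS", "yarn.resourcemanager.hostname",
                "mapreduce.jobhistory.address", "mapreduce.framework.name"}) {
            System.out.println(key + " = " + conf.get(key));
        }
    }
}
```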

Where can I find an AMI for Hadoop on EC2?

I am trying to set up Hadoop permanently on Amazon EC2. Currently, every morning I launch EC2 instances and set up Hadoop. Is there any way I can avoid this tedious step? I am looking for a Hadoop image which can be loaded on EC2 and make things easy for me.
I know I can use EMR for Hadoop services, but I don't know how to start an EMR (Hadoop) cluster without submitting a job flow. I mean, I need a Hadoop cluster without any jobs running in it.
Ultimately my aim is to run bioinformatics applications like Distmap and Seal. These applications have many dependencies, so I need an idle Hadoop cluster on which to set up the environment and then run them.
I hope it's clear what I am trying to do.
Thanks.
What you can do is one of the below:
Option 1. Start out with an EBS-backed EC2 instance running your favourite Linux distro and install the Hadoop software that you need. Create one EC2 instance for each type of node you are going to need (master/slaves/etc.). You can then create your own AMIs in the AWS Console (right-click on the EC2 instance and click "Create AMI") and launch as many instances as you need based on those AMIs. You can also create AMIs from instance-store backed instances, but that means dumping everything to S3 and creating the AMI from there. There are a lot of tutorials about this available; please leave a comment if you need directions :)
Option 2. Start out with a Hadoop-based AMI and repeat the steps above after doing your own configuration / adding your dependencies. I went ahead and searched for Hadoop AMIs from the AWS console and there are 48 in eu-west-1 (not sure which region you're working with).
Option 3. Start an EMR cluster in interactive mode. There is an option to keep the cluster alive after finishing the job flows. If you also set the EC2 keys for the EMR instances, you should be able to SSH into them and have a functional Hadoop cluster (not sure about the dependencies, though; you might be better off rolling your own). A rough SDK sketch of this option follows below.
I hope I understood correctly what you're trying to achieve and this helps a little bit.
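For what it's worth, here is a rough sketch of Option 3 using the AWS SDK for Java (v1). It also attaches a bootstrap action, which is the usual place to install dependencies such as the bioinformatics tools mentioned in the question. The release label, instance types, key pair name, and S3 script path are all placeholders, and the default EMR IAM roles are assumed to already exist.

```java
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder;
import com.amazonaws.services.elasticmapreduce.model.Application;
import com.amazonaws.services.elasticmapreduce.model.BootstrapActionConfig;
import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;
import com.amazonaws.services.elasticmapreduce.model.ScriptBootstrapActionConfig;

// Sketch: start an EMR cluster with no steps and keep it alive so you can SSH
// in and set up your own environment. The bootstrap action runs on every node
// before Hadoop starts. Release label, instance types, key pair name and the
// S3 script path are placeholders.
public class KeepAliveEmrSketch {
    public static void main(String[] args) {
        AmazonElasticMapReduce emr = AmazonElasticMapReduceClientBuilder.defaultClient();

        BootstrapActionConfig installDeps = new BootstrapActionConfig()
                .withName("install-dependencies")
                .withScriptBootstrapAction(new ScriptBootstrapActionConfig()
                        .withPath("s3://my-bucket/bootstrap/install-deps.sh")); // your script

        RunJobFlowRequest request = new RunJobFlowRequest()
                .withName("interactive-hadoop-cluster")
                .withReleaseLabel("emr-5.36.0")                       // example release label
                .withApplications(new Application().withName("Hadoop"))
                .withBootstrapActions(installDeps)
                .withServiceRole("EMR_DefaultRole")
                .withJobFlowRole("EMR_EC2_DefaultRole")
                .withInstances(new JobFlowInstancesConfig()
                        .withEc2KeyName("my-key-pair")                // lets you SSH to the nodes
                        .withInstanceCount(4)
                        .withMasterInstanceType("m5.xlarge")
                        .withSlaveInstanceType("m5.xlarge")
                        .withKeepJobFlowAliveWhenNoSteps(true));      // cluster stays up with no jobs

        String clusterId = emr.runJobFlow(request).getJobFlowId();
        System.out.println("Cluster id: " + clusterId);
    }
}
```

Once the cluster reaches the WAITING state you can SSH to the master node with that key pair and treat it like any other Hadoop cluster; because of the keep-alive flag it will never shut down on its own, so remember to terminate it yourself.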
This is more of a configuration management and automation problem. Try configuration management tools like Chef and Puppet to get this done the way you want.

How to configure high availability with Hadoop 1.0 on AWS EC2 virtual machines

I have already configured this setup using heartbeat and a virtual IP mechanism on a non-VM setup.
I am using Hadoop 1.0.3 and a shared directory for sharing the NameNode metadata. The problem is that on the Amazon cloud there is nothing like a virtual IP with which to get high availability using Linux-HA.
Has anyone been able to achieve this? Kindly let me know the steps required.
For now I am using HBase WAL replication; HBase 0.92 and later supports this.
For Hadoop clustering on the cloud, I will wait for the 2.0 release to become stable.
I used the following:
http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/replication/package-summary.html#requirements
On the client side I added logic to have 2 master servers, used alternately to reconnect in case of network disruption.
This worked for a simple setup of 2 machines backing each other up; it is not recommended for a larger number of servers.
Hope it helps.
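That client-side fallback could look roughly like the sketch below (shown with the HDFS client purely for illustration; the same pattern applies to an HBase or JobTracker connection). The master host names are placeholders, and this is only a workaround, not a substitute for the real NameNode HA that arrived with Hadoop 2.x.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch of client-side failover between two masters: try the primary first,
// fall back to the standby if the connection fails. Host names are placeholders.
public class TwoMasterClientSketch {
    private static final String[] MASTERS = {
            "hdfs://master1.example.com:8020",
            "hdfs://master2.example.com:8020"
    };

    public static FileSystem connect() throws IOException {
        IOException lastError = null;
        for (String master : MASTERS) {
            Configuration conf = new Configuration();
            conf.set("fs.default.name", master);   // fs.defaultFS in Hadoop 2.x
            try {
                FileSystem fs = FileSystem.get(conf);
                fs.exists(new Path("/"));          // cheap RPC to verify the connection
                return fs;
            } catch (IOException e) {
                lastError = e;                     // try the next master
            }
        }
        throw new IOException("No master reachable", lastError);
    }
}
```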
Well, there are 2 parts of Hadoop to make highly available. The first and more important is, of course, the NameNode. There is a secondary/checkpoint NameNode that you can start up and configure. This will help keep HDFS up and running in the event that your primary NameNode goes down. Next is the JobTracker, which runs all the jobs. To the best of my (outdated by 10 months) knowledge, there is no backup to the JobTracker that you can configure, so it's up to you to monitor it and start a new one with the correct configuration in the event that it goes down.
