Configuration management for Linux / Hadoop Cluster

I have to set up a small Hadoop cluster on Linux (Ubuntu) machines. For that I have to install the JDK, Python, and some other Linux utilities on all systems, and after that configure Hadoop on each system one by one. Is there any tool available that lets me install all of these from a single system? For example, if I have to install the JDK on some system, the tool should push the install to it. I would prefer a web-based tool.

Apache Ambari and Cloudera Manager are purpose-built to accomplish these tasks for Hadoop.
They also monitor the cluster and provision extra services that communicate with it, like Kafka, HBase, Spark, etc.
That only gets you so far, though, and you'll want something like Ansible to deploy custom configurations (AWX is a web UI for Ansible); see the sketch below. Puppet and Chef are alternatives too.
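For instance, here is a minimal sketch of pushing the JDK and Python to every node with Ansible ad-hoc commands. The inventory file and the "hadoop" group name are assumptions; adjust the package names to your Ubuntu release:

    # Install the JDK on every host in the (assumed) "hadoop" inventory group
    ansible hadoop -i inventory.ini --become \
        -m apt -a "name=openjdk-8-jdk state=present update_cache=yes"
    # Same for Python
    ansible hadoop -i inventory.ini --become \
        -m apt -a "name=python3 state=present"

AWX would then give you the web-based front end the question asks for.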

Installing Hadoop in LinuxMint

I have started a course on Hadoop on Udemy. The instructor is using Windows, installs VirtualBox, and then runs a Hortonworks Sandbox image to use Hadoop.
I am using Linux Mint, and after doing some research on installing Hadoop on Linux I found (click for ref) that we can install the VM on Linux and run the downloaded Hortonworks Sandbox image in it.
I also found another method which does not use a VM (click for ref). I am confused as to which is the best way to install Hadoop.
Should I use the VM or the second method? Which is better for learning and development?
Thanks a lot for the help!
can install the VM on linux
You can use a VM on any host OS... That's the point of a VM.
The last link gives you only Hadoop, whereas the Hortonworks Sandbox has much, much more: Spark, Hive, HBase, Pig, etc., things you'd otherwise need to install and configure yourself.
Which is better for learning and development?
I would strongly suggest using a VM (or containers; see the sketch below) overall:
1) rather than messing up your local OS trying to get Hadoop working, and
2) the Hortonworks documentation has lots of tutorials that can really only be run in the sandbox with the pre-installed datasets.
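If you go the container route, here is a hedged sketch of running the sandbox with Docker; the image name and tag are assumptions, so check the Hortonworks download page for the current ones:

    # Pull and start the (assumed) sandbox image, exposing a couple of
    # its web UIs on the host
    docker pull hortonworks/sandbox-hdp:latest
    docker run -d --name sandbox --hostname sandbox-hdp \
        -p 8080:8080 -p 8888:8888 \
        hortonworks/sandbox-hdp:latest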

What's difference between Apache Mesos, Mesosphere and DCOS?

It looks to me like Apache Mesos is a distributed systems kernel, and Mesosphere is something like a Linux distribution based on Apache Mesos.
For example, it's like the Linux kernel (Apache Mesos) and Ubuntu (Mesosphere).
Am I right about this?
And is DC/OS a free edition of Mesosphere, like Red Hat vs. Red Hat Enterprise?
Let me try ;-)
Apache Mesos - the open-source cluster resource manager and the kernel of DC/OS.
Mesosphere - the company contributing to both Apache Mesos and DC/OS.
DC/OS - the open-source distribution around Apache Mesos, including a UI, networking, and many other pieces. Mesosphere also offers an Enterprise Edition of DC/OS with support and some advanced features.
Hope this helped!
My two cents, drawn from various online sources...
DC/OS is a datacenter operating system and itself a distributed system, built on the Apache Mesos distributed kernel (see Apache Mesos below for more details).
It comprises three main components:
A cluster manager,
A container platform, and
An operating system.
Essentially, DC/OS abstracts away the underlying infrastructure with Mesos and provides powerful tools for running services and applications; more importantly, you will find the complete SMACK stack (Spark, Mesos, Akka, Cassandra, Kafka) pulled together under one OS platform. DC/OS has a built-in self-healing distributed system.
It is agnostic to the infrastructure layer, meaning hosts may consist of either virtual or physical hardware as long as they provide compute, storage, and networking. It is designed to run anywhere: on-premises and/or in the cloud (AWS, Azure, ...). See https://dcos.io/docs/1.10/overview/
Apache Mesos is a distributed kernel and the backbone of DC/OS. It lets you program against your datacenter as a single pool of resources. It abstracts CPU, memory, storage, and other computing resources, and it provides an API for resource management and scheduling across datacenter and cloud environments. It can scale to tens of thousands of nodes, so it can definitely be considered a solution for large production clusters. It supports container orchestration platforms like Kubernetes and, of course, Marathon.
Mesosphere - DC/OS is created and maintained by Mesosphere.
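As a concrete taste of that scheduling API, here is a hedged sketch of launching a trivial app through Marathon's REST endpoint; the host, port, and app definition are assumptions for your environment:

    # POST an app definition to Marathon, which schedules it onto Mesos
    curl -X POST http://marathon.example.com:8080/v2/apps \
        -H "Content-Type: application/json" \
        -d '{
              "id": "/hello",
              "cmd": "while true; do echo hello; sleep 10; done",
              "cpus": 0.1,
              "mem": 32,
              "instances": 2
            }'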
js84 provides an excellent and concise answer above. Just to drive home the point, here is an analogy to the Linux ecosystem:
Mesos is akin to the Linux kernel (identified by a kernel version such as 2.6, found with $ uname -a).
DC/OS is akin to a Linux operating system (identified by a distribution/release in a file such as /etc/redhat-release: RHEL 7.1, CentOS 7.2), with a whole bunch of binaries and utilities in /bin, /usr/bin, ...
Mesosphere is akin to Red Hat, the company which contributes a lot to the open-source Linux kernel and Linux distributions, and which provides paid support and additional features required by enterprise customers.
This is a good overview of what DC/OS is:
https://docs.mesosphere.com/1.11/overview/what-is-dcos/
Apache Mesos is the open-source distributed orchestrator for container as well as non-container workloads. It is a cluster manager that simplifies the complexity of running applications on a shared pool of servers, and it is responsible for sharing resources across application frameworks using schedulers and executors.
DC/OS (Datacenter Operating System) is built on top of Apache Mesos. Open-source DC/OS adds service discovery, the Universe package repository for different frameworks, CLI and GUI support for management, and volume support for persistent storage. DC/OS uses a unified API to manage multiple systems, on cloud or on-premises, such as deploying containers and distributed services. Unlike traditional operating systems, DC/OS spans multiple machines within a network, aggregating their resources to maximize utilization by distributed applications.
The Mesosphere company has products that are built on top of Apache Mesos, and it contributes to both Apache Mesos and open-source DC/OS. Mesosphere offers a layer of software that organizes your machines, VMs, and cloud instances and lets applications draw from a single pool of intelligently and dynamically allocated resources, increasing efficiency and reducing operational complexity.
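For example, installing a framework from Universe is a one-liner with the DC/OS CLI. A minimal sketch, assuming the CLI is configured against your cluster and the package exists in your Universe repository:

    # Search Universe, then install the Spark framework on the cluster
    dcos package search spark
    dcos package install spark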
This is how I understood it; I might be wrong.
DC/OS gives you more features, as js84 said, and Mesos alone gives you less than DC/OS.

Running two versions of Apache Spark in cluster mode

I want to be able to run Spark 2.0 and Spark 1.6.1 in cluster mode on a single cluster so they can share resources. What are the best practices to do this? This is because I want to shield a certain set of applications that rely on 1.6.1 from code changes, while others rely on Spark 2.0.
Basically the cluster could rely on dynamic allocation for Spark 2.0, but maybe not for 1.6.1; this is flexible.
This is possible by using Docker: you can run various versions of Spark applications, since Docker runs each application in isolation.
Docker is an open platform for developing, shipping, and running applications. With Docker you can separate your applications from your infrastructure and treat your infrastructure like a managed application.
Industry is adopting Docker since it provides the flexibility to run various application versions on a single host, and much more; a sketch follows.
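A minimal sketch of the idea, assuming you have built or pulled an image for each version (the my-spark:1.6.1 and my-spark:2.0.0 names are placeholders):

    # Each container ships its own Spark binaries, so the two versions
    # never conflict on the host
    docker run --rm my-spark:1.6.1 spark-submit --version
    docker run --rm my-spark:2.0.0 spark-submit --version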
Mesos also allows you to run Docker containers using Marathon.
For more information, please refer to:
https://www.docker.com/
https://mesosphere.github.io/marathon/docs/native-docker.html
Hope this helps!

Restart tasktracker and job tracker of hadoop CDH4 using Cloudera services

I have made a few entries in mapred-site.xml. To pick up these changes I need to restart the TaskTracker (TT) and JobTracker (JT) running on my cluster nodes.
Is there any way I can restart them using the Cloudera Manager web services from the command line?
That way I can automate those steps: any time I change the Hadoop configuration files, it will restart the TT and JT.
Since version 4.0, Cloudera Manager exposes its functionality through an HTTP API which allows you to do the operations through "curl" from the shell. The API is available in both the Free Edition and the Enterprise Edition.
Their repository hosts a set of client-side utilities for communicating with the Cloudera Manager API. You can find more on the documentation page.
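As a hedged sketch, restarting the MapReduce service (and with it the JT and TT roles) might look like this with curl; the host, credentials, cluster name ("Cluster 1"), and service name ("mapreduce1") are assumptions for your deployment:

    # POST a restart command to the Cloudera Manager API (v1 of the API,
    # default port 7180); URL-encode any spaces in the cluster name
    curl -u admin:admin -X POST \
        "http://cm-host:7180/api/v1/clusters/Cluster%201/services/mapreduce1/commands/restart"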

Setting up a (Linux) Hadoop cluster

Do you need to set up a Linux cluster first in order to set up a Hadoop cluster?
No. Hadoop has its own software to manage a "cluster". Just install Linux and make sure the machines can talk to each other.
Deploying the Hadoop software, along with the appropriate config files, and starting it on each node (which Hadoop can do automatically) creates the cluster from the Linux machines you have. So, no, by that definition you don't need a separate Linux cluster. If your question is whether or not you need a multiple-machine cluster to use Hadoop: no, you can run Hadoop on a single machine for either testing or small-sized jobs, via either local mode (where everything is confined to a single process) or pseudo-distributed mode (where you trick Hadoop into thinking it's running on multiple computers). A minimal pseudo-distributed sketch follows.
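A hedged sketch of pseudo-distributed mode on one machine, assuming HADOOP_HOME points at an unpacked Hadoop release and passwordless ssh to localhost is set up (you would also set dfs.replication to 1 in hdfs-site.xml). First, point the default filesystem at a local HDFS instance in etc/hadoop/core-site.xml:

    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
      </property>
    </configuration>

Then, from $HADOOP_HOME:

    bin/hdfs namenode -format   # initialize HDFS (run once)
    sbin/start-dfs.sh           # start NameNode + DataNode
    bin/hdfs dfs -ls /          # sanity check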
