Linux HA vs Apache Hadoop

I'm using Cloudera (Apache Hadoop), so I have a pretty good idea about it.
However, I just found out about the Linux HA project and
I cannot figure out the difference between Linux HA and Apache Hadoop.
When should we use Apache Hadoop and when should we use Linux HA?
Thank you!

Linux HA is software-based high-availability clustering used to keep services of many kinds up and running with no downtime. It uses a heartbeat mechanism to determine the state of each service in the cluster. For example, if you have a web server running on hostA, it is replicated to run on hostB as well. Whenever hostA goes down, hostB takes over and serves requests, so the service sees no downtime.
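The heartbeat idea can be sketched in a few lines of Python. This is a hypothetical illustration, not the actual Linux-HA/Pacemaker implementation: the standby node takes over only once the primary has missed heartbeats for longer than a configured timeout.

```python
def should_failover(last_heartbeat_ts, now, timeout_s=10.0):
    """Standby-side decision: treat the primary as dead (and take over
    its service) once no heartbeat has arrived within the timeout
    window. Timestamps are in seconds, e.g. from time.time()."""
    return (now - last_heartbeat_ts) > timeout_s

# hostA last sent a heartbeat at t=100; at t=105 it is still considered
# alive, but by t=120 the standby would start serving requests.
assert should_failover(100.0, 105.0) is False
assert should_failover(100.0, 120.0) is True
```

The timeout is the key tuning knob: too short and a brief network hiccup triggers a spurious failover; too long and clients see real downtime before hostB takes over.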
Apache Hadoop, on the other hand, is a framework that solves the problem of storing large amounts of data and processing it.

Related

What's difference between Apache Mesos, Mesosphere and DCOS?

It looks to me like Apache Mesos is a distributed systems kernel, and Mesosphere is something like a Linux distribution based on Apache Mesos.
For example, it's like the Linux kernel (Apache Mesos) and Ubuntu (Mesosphere).
Am I right about this?
And is DCOS a free edition of Mesosphere, like RedHat vs RedHat Enterprise?
Let me try ;-)
Apache Mesos - open-source cluster resource manager, the kernel of DC/OS
Mesosphere - the company contributing to both Apache Mesos and DC/OS
DC/OS - open-source distribution around Apache Mesos, including a UI, networking, and many other pieces. Mesosphere also offers an Enterprise Edition of DC/OS with support and some advanced features.
Hope this helped!
my two cents and from various online sources...
DC/OS is a datacenter operating system and a distributed system in its own right. It is based on the Apache Mesos distributed kernel (see the Apache Mesos details below).
It comprises three main components:
A cluster manager,
A container platform, and
An operating system.
Essentially, DC/OS abstracts away the infrastructure below it through Mesos and provides powerful tools to run services and applications; more importantly, you can find the complete SMACK stack pulled together under one OS platform. DC/OS is a distributed system with built-in self-healing.
It is agnostic to the infrastructure layer, meaning the hosts may be virtual or physical hardware as long as they provide compute, storage, and networking. It is designed to run anywhere, on-premises and/or in the cloud (AWS, Azure, ...). See https://dcos.io/docs/1.10/overview/
Apache Mesos is a distributed kernel and the backbone of DC/OS. You program against your datacenter as if it were a single pool of resources: Mesos abstracts CPU, memory, storage, and other compute resources, and it provides an API for resource management and scheduling across datacenter and cloud environments. It can scale to tens of thousands of nodes, so it can definitely be considered a solution for large production clusters. It supports container orchestration platforms such as Kubernetes and, of course, Marathon.
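The scheduling model behind this can be sketched as a toy version of Mesos's two-level scheduling: Mesos presents each framework's scheduler with resource offers from the agents, and the scheduler decides which tasks to launch where. All agent names, task names, and numbers below are made up for illustration.

```python
def fits(offer, task):
    """Check whether a resource offer covers a task's requirements."""
    return all(offer.get(r, 0) >= need for r, need in task.items())

def schedule(offers, tasks):
    """Greedy sketch of a framework scheduler: place each task on the
    first agent whose remaining offer still covers it, decrementing
    the offer's resources as tasks are placed."""
    placement = {}
    for name, task in tasks.items():
        for agent, offer in offers.items():
            if fits(offer, task):
                placement[name] = agent
                for r, need in task.items():
                    offer[r] -= need
                break
    return placement

offers = {"agent1": {"cpus": 4, "mem": 8192},
          "agent2": {"cpus": 2, "mem": 4096}}
tasks = {"web":   {"cpus": 2, "mem": 2048},
         "db":    {"cpus": 2, "mem": 4096},
         "cache": {"cpus": 1, "mem": 1024}}
print(schedule(offers, tasks))
# {'web': 'agent1', 'db': 'agent1', 'cache': 'agent2'}
```

Real Mesos schedulers also handle declined offers, task failures, and constraints, but the core loop (match offered resources against task requirements) is the same idea.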
Mesosphere - DC/OS is created and maintained by Mesosphere
js84 provides an excellent and concise answer above. Just to drive home the point, here is
an analogy to the Linux ecosystem:
Mesos is akin to the Linux kernel (as identified by a kernel version such as 2.6, found with the command $ uname -a).
DC/OS is akin to a Linux operating system (as identified by a distribution/release in a file such as /etc/redhat-release: RHEL 7.1, CentOS 7.2), with a whole bunch of binaries and utilities in /bin, /usr/bin, ...
Mesosphere is akin to RedHat, the company that contributes a lot to the open-source Linux kernel and a Linux distribution, and that also provides paid support to enterprise customers along with additional features required by enterprises.
This is a good overview of what DC/OS is:
https://docs.mesosphere.com/1.11/overview/what-is-dcos/
Apache Mesos is the open-source distributed orchestrator for container as well as non-container workloads. It is a cluster manager that simplifies the complexity of running applications on a shared pool of servers and is responsible for sharing resources across application frameworks by using a scheduler and executors.
DC/OS (Datacenter Operating System) is built on top of Apache Mesos. Open-source DC/OS adds service discovery, the Universe package repository for different frameworks, CLI and GUI support for management, and volume support for persistent storage. DC/OS uses a unified API to manage multiple systems on cloud or on-premises, such as deploying containers, distributed services, etc. Unlike traditional operating systems, DC/OS spans multiple machines within a network, aggregating their resources to maximize utilization by distributed applications.
The Mesosphere company has products that are built on top of Apache Mesos, and it contributes to both Apache Mesos and open-source DC/OS. Mesosphere offers a layer of software that organizes your machines, VMs, and cloud instances and lets applications draw from a single pool of intelligently and dynamically allocated resources, increasing efficiency and reducing operational complexity.
This is how I understood it; I might be wrong.
DC/OS gives you more features, as js84 said, while Mesos on its own gives you less than DC/OS.
Apologies for the bad writing on the board/diagram.

Physical resources for a Spark cluster

I'm trying to learn Spark to implement one of our algorithms and improve its execution time. I downloaded a pre-compiled version on my local machine to run it in local mode, and I enjoyed creating some toy apps.
The next step is to use cluster mode (standalone, at this level).
I found a lot of amazing tutorials on how to configure the cluster and on the difference between local and cluster modes, and this is really clear (I will be back here if I have trouble with that).
My question for now is:
What physical infrastructure should we use for a Spark cluster?
No downvotes please, I will explain: for now, we have two dedicated servers with 32 GB of RAM and 8 CPUs each.
Now I am asking:
What is the best way to fully exploit these resources with Spark?
Which is better:
Use virtualization (ESXi/Proxmox) to create virtual machines that will be my cluster nodes?
Just use the two servers and create a 2-node cluster?
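As an aside on the sizing arithmetic for nodes like these, a common rule of thumb (not an official Spark recipe; all the reserved amounts and caps below are assumptions) is to leave a core and some memory for the OS, cap cores per executor at around five, and keep ~10% headroom off the executor heap:

```python
def executor_layout(cores_per_node, ram_gb_per_node,
                    cores_per_executor=5, os_reserved_cores=1,
                    os_reserved_gb=1, overhead_fraction=0.10):
    """Rough executor-sizing sketch for one standalone-mode node.
    Reserves resources for the OS/daemons, then splits what's left
    into executors of at most `cores_per_executor` cores each."""
    usable_cores = cores_per_node - os_reserved_cores
    executors = max(1, usable_cores // cores_per_executor)
    usable_ram = ram_gb_per_node - os_reserved_gb
    heap_gb = int((usable_ram / executors) * (1 - overhead_fraction))
    return {"executors_per_node": executors,
            "cores_per_executor": min(cores_per_executor, usable_cores),
            "executor_memory_gb": heap_gb}

# For an 8-CPU / 32 GB node: 7 usable cores -> 1 executor with
# 5 cores and roughly 27 GB of heap.
print(executor_layout(8, 32))
```

Whether to virtualize or not, this arithmetic is the same per node; virtualization mostly adds flexibility (more, smaller nodes) at the cost of some overhead.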

Clustering in Node.js using Mesos

I'm working on a project with Node.js that involves a server. Due to a large number of jobs, I need to perform clustering to divide the jobs between different servers (different physical machines). Note that my jobs have nothing to do with the internet, so I cannot use a stateless connection (or Redis to keep state) with a load balancer in front of the servers to distribute connections.
I have already read about the "cluster" module but, from what I understood, it seems to scale only across multiple processors on the same machine.
My question: is there any suitable distributed module available in Node.js for my work? What about Apache Mesos? I have heard that Mesos can abstract multiple physical machines into a single server; is that correct? If yes, is it possible to use the Node.js cluster module on top of Mesos, since we would then have only one virtual server?
Thanks
My question: is there any suitable distributed module available in Node.js for my work?
Don't know.
I have heard that Mesos can abstract multiple physical machines into a single server; is that correct?
Yes, almost. It allows you to pool resources (CPU, RAM, disk) across multiple machines, and it gives you the ability to allocate resources to your applications and to run and manage those applications. So you can ask Mesos to run X instances of node.js and specify how much resource each instance needs.
http://mesos.apache.org
https://www.cs.berkeley.edu/~alig/papers/mesos.pdf
If yes, is it possible to use the Node.js cluster module on top of Mesos, since we would then have only one virtual server?
Admittedly, I don't know anything about node.js or clustering in node.js. Going by http://nodejs.org/api/cluster.html, it just forks off a bunch of child workers and then round-robins connections between them. You have two options off the top of my head:
Run node.js on Mesos using an existing framework such as Marathon. This will be the fastest way to get something going on Mesos. https://github.com/mesosphere/marathon
Create a Mesos framework for node.js, which essentially does what the node.js cluster module does, but across machines. http://mesos.apache.org/documentation/latest/app-framework-development-guide/
With either of these solutions, you have the option of letting Mesos create as many instances of node.js as you need, or of using Mesos to run the node.js cluster module on each machine and letting it manage all the workers on that machine.
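The Marathon option boils down to describing the app in a small JSON document and POSTing it to Marathon's REST API (/v2/apps), after which Mesos runs the requested instances. The app id, command, and sizes below are hypothetical; id, cmd, cpus, mem, and instances are standard fields of a Marathon app definition.

```python
import json

def marathon_app(app_id, cmd, cpus=0.5, mem_mb=256, instances=4):
    """Build a minimal Marathon app definition. The resulting dict
    would be POSTed as JSON to http://<marathon-host>:8080/v2/apps."""
    return {
        "id": app_id,          # unique app path within Marathon
        "cmd": cmd,            # shell command each instance runs
        "cpus": cpus,          # CPU share per instance
        "mem": mem_mb,         # memory (MB) per instance
        "instances": instances # how many copies Mesos should run
    }

# Ask for 8 node.js workers; Marathon keeps that count alive,
# restarting workers on other agents if a machine dies.
app = marathon_app("/node-workers", "node worker.js", instances=8)
payload = json.dumps(app)
```

Marathon then also gives you the cross-machine supervision the cluster module can't: if an agent dies, the lost instances are restarted elsewhere to keep the instance count at the requested value.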
I didn't google, but there might already be a node.js mesos framework out there!

Apache ZooKeeper: ask a server for its RAM

Is it possible to use ZooKeeper to monitor server resources, or to retrieve a number representing the total RAM available?
Apache ZooKeeper is not a monitoring tool; it's a distributed coordinator for configuration sharing, naming, and synchronization.
Here is a good description of what Apache Zookeeper is and how it can be used - https://stackoverflow.com/a/8864303/2453586
As a workaround, you could write a script that checks the amount of available RAM every N minutes/seconds (via cron or as a daemonized process) on each local machine and writes the value to a ZooKeeper znode, like zookeeper_host:2181/ram/host1/available. But it's not a good idea.
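That workaround can be sketched as below (hostnames and znode layout are hypothetical): parse MemAvailable out of /proc/meminfo and compute the znode path and payload the cron job would write. The actual write would be done with a ZooKeeper client library such as kazoo (e.g. client.ensure_path(path); client.set(path, value)).

```python
def available_ram_kb(meminfo_text):
    """Extract the MemAvailable value (in kB) from the contents of
    /proc/meminfo on Linux."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1])
    raise ValueError("MemAvailable not found")

def znode_update(host, meminfo_text):
    """Return the (path, value) pair the periodic script would push
    to ZooKeeper for this host."""
    path = "/ram/%s/available" % host
    value = str(available_ram_kb(meminfo_text)).encode()
    return path, value

# On a real host you would do:
#   with open("/proc/meminfo") as f:
#       path, value = znode_update("host1", f.read())
```

Note the answer's caveat still applies: ZooKeeper znodes are meant for small, coordination-critical data, not for a steady stream of metrics.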
It's better to use dedicated monitoring tools such as Zabbix, Nagios, or Ganglia for memory-monitoring purposes.
You are probably looking for something like JMX.

How to configure high availability with Hadoop 1.0 on AWS EC2 virtual machines

I have already configured this setup using the heartbeat and virtual IP mechanism in a non-VM setup.
I am using Hadoop 1.0.3 with a shared directory for sharing the NameNode metadata. The problem is that on the Amazon cloud there is nothing like a virtual IP, so I cannot get high availability using Linux-HA.
Has anyone been able to achieve this? Kindly let me know the steps required.
For now I am using HBase WAL replication; HBase later than 0.92 supports this.
For Hadoop clustering on the cloud, I will wait for the 2.0 release to become stable.
I used the following:
http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/replication/package-summary.html#requirements
On the client side I added logic to have two master servers, used alternately to reconnect in case of a network disruption.
This worked for a simple setup of two machines backing each other up; it is not recommended for a larger number of servers.
Hope it helps.
Well, there are two parts of Hadoop to make highly available. The first and more important is, of course, the NameNode. There's a secondary/checkpoint NameNode that you can start up and configure. This will help keep HDFS up and running in the event that your primary NameNode goes down. Next is the JobTracker, which runs all the jobs. To the best of my (10-months-outdated) knowledge, there is no backup JobTracker that you can configure, so it's up to you to monitor it and start a new one with the correct configuration in the event that it goes down.
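Since Hadoop 1.x gives you no standby JobTracker, that monitoring has to be scripted yourself. Here is a generic watchdog sketch with the probe and restart actions injected as placeholders; in practice the probe might hit the JobTracker's web UI port and the restart might shell out to Hadoop's daemon scripts (the specifics are left as assumptions):

```python
def watchdog(is_alive, restart, checks):
    """Poll a service `checks` times and restart it whenever the
    liveness check fails. Returns how many restarts were performed.
    `is_alive` and `restart` are injected callables so the loop stays
    generic and testable."""
    restarts = 0
    for _ in range(checks):
        if not is_alive():
            restart()
            restarts += 1
    return restarts

# Demo with fakes: the service starts dead and the first failed
# check brings it back up, so exactly one restart happens.
state = {"up": False}
def probe():
    return state["up"]
def bring_up():
    state["up"] = True
print(watchdog(probe, bring_up, 3))  # 1
```

A real deployment would also need the restart to reload the same mapred-site.xml configuration, as the answer notes, and some backoff so a crash-looping JobTracker isn't hammered.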
