can HBase , MapReduce and HDFS can work on a single machine having Hadoop installed and running on it? - search

I am working on a search engine design, which is to be run on cloud.
We have just started, and have not much idea about Hdoop.
Can anyone tell if HBase , MapReduce and HDFS can work on a single machine having Hdoop installed and running on it ?

Yes you can. You can even create a Virtual Machine and run it on there on a single "computer" (which is what I have :) ).
The key is to simply install Hadoop in "Pseudo Distributed Mode" which is even described in the Hadoop Quickstart.
If you use the Cloudera distribution they have even created the configs needed for that in an RPM. Look here for more info in that.
HTH

Yes. In my development environment, I run
NameNode (HDFS)
SecondaryNameNode (HDFS)
DataNode (HDFS)
JobTracker (MapReduce)
TaskTracker (MapReduce)
Master (HBase)
RegionServer (HBase)
QuorumPeer (ZooKeeper - needed for HBase)
In addition, I run my applications, and map and reduce tasks launched by the task tracker.
Running so many processes on the same machine results in a lot of contention for CPU cores, memory, and disk I/O, so it's definitely not great for high performance, but there is no limitation other than the amount of resources available.

same here, I am running hadoop/hbase/hive on a single computer.
If you really really want to see distributed computing on a single computer, grab lots of RAM, some hard disk space and go like this -
make one or two virtual machines (use virtual box)
install hadoop on each of them, make ur real instalation (not any virtual one) as the master, rest slave
configure hadoop for real distributed environment
now when hadoop starts, you should actually have a cluster of multiple computers (one real, rest virtual)
this could just be an experiment, because unless you have a decent multi-cpu or multi-core system, such a configuration will actually consume more on maintaining itself than giving you any performance.
gud luck.
--l4l

Related

cassandra 3.11.x mixing vesions

We have a 6 node cassandra 3.11.3 cluster with ubuntu 16.04. These are virtual machines.
We are switching to physical machines on brand (8!) new servers that will have debian 11 and presumably cassandra 3.11.12.
Since the main version is always 3.11.x and ubuntu 16.04 is out of support, the question is: can we just let the new machines join the old cluster and then decommission the outdated?
I hope to get a tips about this becouse intuitively it seems fine but we are not too sure about that.
Thank you.
We have a 6 node cassandra 3.11.3 cluster with ubuntu 16.04. These are virtual machines. We are switching to physical machines on brand (8!)
Quick tip here; but it's a good idea to build your clusters in multiples of your RF. Not sure what your RF is, but if RF=3, I'd either stay with six or get one more and go to nine. It's all about even data distribution.
can we just let the new machines join the old cluster and then decommission the outdated?
In short, no. You'll want to upgrade the existing nodes to 3.11.12, first. I can't recall if 3.11.3 and 3.11.12 are SSTable compatible, but I wouldn't risk it.
Secondly, the best way to do this, is to build your new (physical) nodes in the cluster as their own logical data center. Start them up empty, and then run a nodetool rebuild on each. Once that's complete, then decommission the old nodes.
There is a bit simpler solution - move data from each virtual machine into a physical server, as following:
Prepare Cassandra installation on a physical machine, configure the same cluster name, etc.
1.Stop Cassandra in a virtual machine & make sure that it won't start
Copy all Cassandra data /var/lib/cassandra or something like from VM to the physical server
Start Cassandra process on a physical server
Repeat that process for all VM nodes, at some point, updating seeds, etc. After process is finished, you can add two physical servers that are left. Also, to speedup process, you can do initial copy of the data before stopping Cassandra in the VM, and after it's stopped, re-sync data with rsync or something like. This way you can minimize the downtime.
This approach would be much faster compared to the adding a new node & decommissioning the old one as we won't need to stream data twice. This works because after node is initialized, Cassandra identify nodes by assigned UUID, not by IP address.
Another approach is to follow instructions on replacement of the dead node. In this case streaming of data will happen only once, but it could be a bit slower compared to the direct copy of the data.

Why shouldn't local mode in Spark be used for production?

The official documentation and all sorts of books and articles repeat the recommendation that Spark in local mode should not be used for production purposes. Why not? Why is it a bad idea to run a Spark application on one machine for production purposes? Is it simply because Spark is designed for distributed computing and if you only have one machine there are much easier ways to proceed?
Local mode in Apache Spark is intended for development and testing purposes, and should not be used in production because:
Scalability: Local mode only uses a single machine, so it cannot
handle large data sets or handle the processing needs of a production
environment.
Resource Management: Spark’s standalone cluster manager or a cluster
manager like YARN, Mesos, or Kubernetes provides more advanced
resource management capabilities for production environments compared
to local mode.
Fault Tolerance: Local mode does not have the ability to recover from
failures, while a cluster manager can provide fault tolerance by
automatically restarting failed tasks on other nodes.
Security: Spark’s cluster manager provides built-in security features
such as authentication and authorization, which are not present in
local mode.
Therefore, it is recommended to use a cluster manager for production environments to ensure scalability, resource management, fault tolerance, and security.
I have the same question. I am certainly not an authority on the subject, but because no-one has answered this question, I'll try to list the reasons I've encountered while using Spark local mode in Java. So far:
Spark uses System.exit() calls in certain occassions, such as an out of memory error or when the local dir does not have write permissions, so if such a call is triggered, the entire JVM shuts down (including your own application from within which Spark runs, see e.g., https://github.com/apache/spark/blob/0494dc90af48ce7da0625485a4dc6917a244d580/core/src/main/scala/org/apache/spark/util/SparkUncaughtExceptionHandler.scala#L45, https://github.com/apache/spark/blob/b22946ed8b5f41648a32a4e0c4c40226141a06a0/core/src/main/scala/org/apache/spark/storage/DiskBlockManager.scala#L63). Moreover, it circumvents your own shutdownHook, so there is no way to gracefully handle such system exits in your own application. On a cluster, it is usually fine if a certain machine restarts, but, if all Spark components are contained in a single JVM, the assumption that we can shut down the entire JVM upon a Spark failure does not (always) hold.

Is it possible to run ANY application or program with HADOOP YARN?

I'm studying distributed computing recently and found out Hadoop Yarn is one of them.
So thought if I just establish Hadoop Yarn cluster, then every application will run distributed.
But now someone told me that HADOOP Yarn cannot do anything by itself and need other things like mapreduce, spark, and hbase.
If this is correct, then is that mean only limited tasks can be run with Yarn?
Or can I apply Yarn's distributed computing to all applications I want?
Hadoop is the name which refers to the entire system.
HDFS is the actual storage system. Think of it as S3 or a distributed Linux filesystem.
YARN is a framework for scheduling jobs and allocating resources. It handles these things for you, but you don't interact very much with it.
Spark and MapReduce are managed by Yarn. With these two, you can actually write your code/applications and give work to the cluster.
HBase uses the HDFS storage (with is file based) and provides NoSql storage.
Theoretically you can run more than just Spark and MapReduce on Yarn and you can use something else then Yarn (Kubernetes is in works or supported now). You can even write your own processing tool, queue/resource management system, storage... Hadoop has many pieces which you may use or not, depending on your case. But the majority of Hadoop systems use Yarn and Spark.
If you want to deploy Docker containers for example, just a Kubernetes cluster would be a better choice. If you need batch/real time processing with Spark, use Hadoop.
YARN itself is a resource manager. You will need to write code that can be deployed onto those resources, and then that could do anything, given that the nodes running the tasks are themselves capable of running the job. For example, you cannot distribute a Python library without first installing the dependencies for that script. Mesos is a bit more generalized / accessible than YARN, if you want more flexibility for the same affect.
YARN mostly supports running JAR files, shell scripts (at least, from Oozie) or Docker containers can be deployed to it as well (refer Apache docs)
You may also refer to the Apache Slider or Twill projects for more information.

Phisycal ressources for spark cluster

I trying to learn spark to implement one of our algorithms to increase the execution time. I downloaded a pre-compiled version on my local machine to run it in a local mode and I enjoyed creating some toy apps.
Next step is to use the cluster mode ( standalone one at this level).
I found a lot of amazing tutorials talking about how to configure the cluster and the difference between local and cluster modes and this is really clear ( I will be back here if I had troubles with that ).
My question for now is:
What physical infrastructure to use for spark cluster?
No downvotes please, I will explain: For now, we have 2 dedicated server with 32Go of RAM and 8 CPUS each one.
Now I am asking:
What is the best way to fully exploit this resources with spark?
Which is better:
Use a virtualization ( ESXI / Proxmox ) in order to create virtual machines which will be my cluster nodes?
just use the two servers and create a 2-noded cluster?

Setting up a (Linux) Hadoop cluster

Do you need to set up a Linux cluster first in order to setup a Hadoop cluster ?
No. Hadoop has its own software to manage a "cluster". Just install linux and make sure the machines can talk to each other.
Deploying the Hadoop software, along with the appropriate config files, and starting it on each node (which Hadoop can do automatically) creates the cluster from the Linux machines you have. So, no, by that definition you don't need to have a separate linux cluster. If your question is whether or not you need to have a multiple-machine cluster to use Hadoop: no, you can run Hadoop on a single machine for either testing or small-sized jobs, via either local mode (where everything is confined to a single process) or pseudodistributed mode (where you trick Hadoop into thinking it's running on multiple computers).

Resources