Installing Hadoop on Linux Mint

I have started a Hadoop course on Udemy. The instructor is using Windows; he installs VirtualBox and then runs a Hortonworks Sandbox image to use Hadoop.
I am using Linux Mint, and after doing some research on installing Hadoop on Linux I found (click for ref) that we can install the VM on Linux, download the Hortonworks Sandbox image, and run it.
I also found another method that does not use a VM (click for ref). I am confused as to which is the best way to install Hadoop.
Should I use the VM or the second method? Which is better for learning and development?
Thanks a lot for the help!

can install the VM on Linux
You can use a VM on any host OS... That's the point of a VM.
The last link is only Hadoop, whereas the Hortonworks Sandbox has much, much more, like Spark, Hive, HBase, Pig, etc.: things you'd otherwise need to install and configure yourself.
Which is better for learning and development?
I would strongly suggest using a VM (or containers) overall:
1) You avoid messing up your local OS trying to get Hadoop working.
2) The Hortonworks documentation has lots of tutorials that can really only be run in the sandbox, with its preinstalled datasets (see the sketch below for a quick way to check that the sandbox is up).
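As a quick illustration, the sandbox exposes its services on well-known ports, so you can smoke-test HDFS from the host as soon as the VM boots. This is a minimal sketch, assuming the requests library and that your sandbox forwards the NameNode web port to localhost (50070 on older HDP sandboxes; the host and port are assumptions that vary by sandbox version):

    # Minimal sketch: smoke-test the sandbox's HDFS from the host via WebHDFS.
    # Assumes `pip install requests` and that the sandbox forwards the
    # NameNode web port (50070 on older HDP sandboxes) to localhost;
    # adjust host/port for your sandbox version.
    import requests

    resp = requests.get("http://localhost:50070/webhdfs/v1/?op=LISTSTATUS", timeout=10)
    resp.raise_for_status()

    # WebHDFS answers LISTSTATUS with {"FileStatuses": {"FileStatus": [...]}}
    for entry in resp.json()["FileStatuses"]["FileStatus"]:
        print(entry["type"], entry["pathSuffix"])

If that prints the root directories, HDFS is up without you having installed anything on the host itself.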

Related

Is Apache Spark recommended to run on Windows?

I have a requirement to run Spark on Windows in a production environment. I would like advice on whether Apache Spark on Windows is recommended. If not, I would like to know the reasons.

Configuration management for Linux / Hadoop Cluster

I have to set up a small Hadoop cluster on Linux (Ubuntu) machines. For that I have to install the JDK, Python, and some other Linux utilities on all systems. After that I have to configure Hadoop on each system, one by one. Is there a tool available so that I can install all of these from a single system? For example, if I have to install the JDK on some system, the tool should install it there. I would prefer a web-based tool.
Apache Ambari and Cloudera Manager are purpose-built to accomplish these tasks for Hadoop.
They also monitor the cluster and provision extra services that communicate with it, like Kafka, HBase, Spark, etc.
That only gets you so far, though, and you'll want something like Ansible to deploy custom configurations (AWX is a web UI for Ansible). Puppet and Chef are alternatives too. The sketch below shows the bare-bones idea these tools automate.
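To make the underlying idea concrete: all of these tools boil down to pushing the same commands and files to every node from one controller. A minimal sketch of that idea over SSH, assuming the paramiko library, key-based authentication, passwordless sudo, and the hypothetical hostnames and username below (a real deployment should use Ansible or Ambari instead):

    # Minimal sketch: run the same install command on every node over SSH.
    # Assumes `pip install paramiko`, key-based SSH auth, and passwordless
    # sudo; the hostnames and username are hypothetical placeholders.
    import paramiko

    HOSTS = ["node1.example.com", "node2.example.com"]
    CMD = "sudo apt-get update && sudo apt-get install -y openjdk-8-jdk python3"

    for host in HOSTS:
        client = paramiko.SSHClient()
        client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
        client.connect(host, username="ubuntu")
        _, stdout, _ = client.exec_command(CMD)
        status = stdout.channel.recv_exit_status()  # block until done
        print(f"{host}: exit {status}")
        client.close()

This handles installation but not configuration drift, retries, or idempotence, which is exactly what Ansible playbooks and Ambari blueprints add on top.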

Suggest a similar installer for Apache Spark and a notebook?

I am new to big data analytics. I am trying to install Apache Spark and a notebook to execute code, like IPython. Is there an installer that comes with Spark set up and a good notebook tool built in? I come from a PHP and Apache background, and I am used to tools like XAMPP and WAMP that install multiple services in one click. Can anyone suggest a similar installer for Apache Spark and a notebook? I am on Windows.
If IPython is not a mandatory requirement and you can work with a Zeppelin notebook and Apache Spark, I think you will want Sparklet. It is similar to what you seek: a XAMPP-like installer for the Spark engine and the Zeppelin tool.
You can see details here - Sparklet
It supports Windows. Let me know if it solves your problem.
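If you would rather stay with IPython/Jupyter, another route is to install a Spark distribution and Jupyter separately and connect them with the findspark helper. A minimal sketch, assuming `pip install findspark`, a Jupyter install, and a Spark distribution unpacked at the hypothetical path below (or pointed to by SPARK_HOME):

    # Minimal sketch: use an existing Spark install from a Jupyter notebook.
    # Assumes `pip install findspark`; the Spark path is a hypothetical
    # placeholder (omit the argument if SPARK_HOME is already set).
    import findspark
    findspark.init("C:/spark/spark-2.4.8-bin-hadoop2.7")

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("notebook-test").getOrCreate()
    print(spark.range(10).count())  # quick smoke test: should print 10
    spark.stop()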

How to set up Spark cluster on Windows machines?

I am trying to set up a Spark cluster on Windows machines.
The way to go here is standalone mode, right?
What are the concrete disadvantages of not using Mesos or YARN? How much pain would it be to use either of those? Does anyone have experience here?
FYI, I got an answer in the user group: https://groups.google.com/forum/#!topic/spark-users/SyBJhQXBqIs
Standalone mode is indeed the way to go. Mesos does not work under Windows, and YARN probably doesn't either.
Quick note: YARN should eventually work on Windows via the Hortonworks Data Platform (version 2.0 beta is on YARN, but it is Linux-only at this time). Another potential route is to run against Hadoop 1.1 (Hortonworks Data Platform for Windows 1.1), but your approach of running in standalone mode is definitely the easiest way to get off the ground.
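Once a standalone master is running, pointing an application at the cluster is just a matter of the master URL. A minimal sketch, assuming pyspark is installed and a standalone master is listening at the hypothetical address below (7077 is the standalone master's default port):

    # Minimal sketch: submit work to a standalone master from PySpark.
    # The master address is a hypothetical placeholder; standalone
    # masters listen on port 7077 by default.
    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setMaster("spark://192.168.1.10:7077")
            .setAppName("standalone-smoke-test"))
    sc = SparkContext(conf=conf)
    print(sc.parallelize(range(1000)).sum())  # expect 499500
    sc.stop()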

For a single CDH (Hadoop) cluster installation, which host should I use?

I started with a Windows 7 computer and set up an Ubuntu Linux virtual machine, which I run using VirtualBox. Cloudera Manager Free Edition version 4 is running, and I have been following the prompts on localhost:7180.
I am now stuck where the prompt asks me to "Specify hosts for your CDH cluster installation." Can I install all of the Hadoop components, and run them, in the Linux virtual machine alone?
Please help point me in the right direction as to which host I should specify.
Yes, you can run CDH in a Linux virtual machine alone, using either "standalone" or "pseudo-distributed" mode. IMHO, the most effective method is "pseudo-distributed" mode.
In that case, multiple Java virtual machines (JVMs) are running, simulating a cluster with multiple nodes (each JVM process acts as a cluster node).
Cloudera has documented how to deploy in "pseudo-distributed" mode:
https://www.cloudera.com/documentation/enterprise/5-6-x/topics/cdh_qs_cdh5_pseudo.html
Note: there are three ways to deploy CDH:
1) standalone: a single machine, with a single JVM
2) pseudo-distributed: a single machine, but several JVMs, simulating a cluster (see the sketch below)
3) distributed: an actual cluster, i.e. several nodes with different purposes (workers, namenode, etc.)
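A quick way to confirm that a pseudo-distributed deployment really is running several JVMs is to list the Java processes on the machine. A minimal sketch, assuming the JDK's jps tool is on the PATH; the daemon names checked here are typical HDFS/YARN ones and may differ across CDH versions:

    # Minimal sketch: verify a pseudo-distributed deployment by listing
    # the Java daemons on this machine with the JDK's `jps` tool.
    # Assumes `jps` is on the PATH; expected names vary by version.
    import subprocess

    EXPECTED = {"NameNode", "DataNode", "ResourceManager", "NodeManager"}

    output = subprocess.run(["jps"], capture_output=True, text=True).stdout
    running = {line.split()[-1] for line in output.splitlines() if line.strip()}

    for daemon in sorted(EXPECTED):
        print(daemon, "running" if daemon in running else "MISSING")

Each name that shows up is a separate JVM playing the role of one cluster node.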
You can specify the hostname of your machine; it will install everything on your machine only.
