Installing Hive and Sqoop on Windows (Cygwin)

Can someone help me with the steps to install Hive and Sqoop on Cygwin? I have already installed Hadoop 0.20.2 and the latest stable HBase (0.94.1) on Cygwin, and both are working well.

Typically a Hadoop distribution includes both. Inspect the directories containing the Hadoop binaries and check whether the executables are already there; the Sqoop binary, for example, is simply named sqoop and is executable.
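For example, from the Cygwin shell you can look for the executables directly (a sketch; /usr/local/hadoop is a placeholder for wherever your distribution is unpacked):
# check whether sqoop and hive ship with your Hadoop distribution
ls /usr/local/hadoop/bin
# or search the whole tree if you are not sure where things landed
find / -type f -name sqoop 2>/dev/null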

Related

Can PySpark work without Spark?

I have installed PySpark standalone/locally (on Windows) using
pip install pyspark
I was a bit surprised that I can already run pyspark on the command line or use it in Jupyter notebooks, and that it does not need a proper Spark installation (e.g. I did not have to do most of the steps in this tutorial: https://medium.com/@GalarnykMichael/install-spark-on-windows-pyspark-4498a5d8d66c).
Most of the tutorials that I run into say one needs to "install Spark before installing PySpark". That would agree with my view of PySpark being basically a wrapper over Spark. But maybe I am wrong here - can someone explain:
what is the exact connection between these two technologies?
why is installing PySpark enough to make it run? Does it actually install Spark under the hood? If yes, where?
if you install only PySpark, is there something you miss (e.g. I cannot find the sbin folder, which contains e.g. the script to start the history server)?
As of v2.2, executing pip install pyspark will install Spark.
If you're going to use PySpark, it's clearly the simplest way to get started.
On my system Spark is installed inside my virtual environment (miniconda) at lib/python3.6/site-packages/pyspark/jars
The PySpark installed by pip is a subset of full Spark; you can find most of the PySpark Python files in spark-3.0.0-bin-hadoop3.2/python/pyspark. So if you'd like to use the Java or Scala interfaces, or deploy a distributed system with Hadoop, you must download the full Spark release from Apache Spark and install that.
PySpark bundles a Spark installation. If installed through pip3, you can find it with pip3 show pyspark; for me it is at ~/.local/lib/python3.8/site-packages/pyspark.
This is a standalone configuration, so it can't be used for managing clusters like a full Spark installation.
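A quick way to see what pip actually put on disk (standard pip and Python commands; the jars subfolder is where the bundled Spark JARs live):
# show where pip installed the pyspark package
pip3 show pyspark
# print the package directory and list the bundled Spark JARs
python3 -c "import os, pyspark; print(os.path.dirname(pyspark.__file__))"
ls "$(python3 -c "import os, pyspark; print(os.path.dirname(pyspark.__file__))")/jars"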

cql-import tool not present in sqoop 1.4.6

I am currently stuck on data migration: I want to migrate data from an Oracle database to Cassandra.
I have the following tools installed on Linux:
DSE 4.8
Hadoop 2.7.3
Sqoop 1.4.6
I am not sure why my Sqoop version does not have cql-import or any Cassandra-related commands.
The following are the available commands I can see in the sqoop help output:
Available commands:
codegen
create-hive-table
eval
export
help
import
import-all-tables
import-mainframe
job
list-databases
list-tables
merge
metastore
version
I have searched throughout the net and found the following links to the latest Sqoop releases, but the cql-import tool is missing from all of them:
https://www-eu.apache.org/dist/sqoop/
http://mirrors.ibiblio.org/apache/sqoop/1.4.6/
It would be very helpful if anyone has a link to a Sqoop version that supports Cassandra data-migration commands like cql-import.
Edited:
One more point to add: I have manually configured Hadoop and Sqoop.
Thanks in advance
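(For reference, the stock import tool that does appear in the list above is invoked like this; the Oracle host, service name, table, and credential path below are placeholders:)
sqoop import \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --username scott \
  --password-file /user/me/.oracle-pass \
  --table EMPLOYEES \
  --target-dir /user/me/employees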

Installing Spark 2 on CDH 5.* with RPM?

I have a Cloudera CDH 5.11 cluster installed from RPM packages (we don't want to use Cloudera Manager or parcels). Has anyone found/built Spark 2 RPM packages for CDH? It seems Cloudera only ships Spark 2 as parcels.
You won't find one. For now, the "Spark 2 Known Issues" doc clearly states:
Package Install is not Supported
The Cloudera Distribution of Apache Spark 2 is only installable as a parcel.
https://www.cloudera.com/documentation/spark2/latest/topics/spark2_known_issues.html#ki_package_install
The best way is to use Spark on YARN instead of the Spark Master/Worker deployment. Then you are free to use any Spark version you like, independent of what the vendor ships.
What you need to do is package the Spark History Server yourself, to be able to look at jobs after they finish. And if you want to use Dynamic Allocation, you need the Spark Shuffle Service configured in YARN.
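A minimal sketch of that setup (the Spark path, version, and application jar are placeholders; spark.dynamicAllocation.enabled and spark.shuffle.service.enabled are the standard configuration keys, and the shuffle service itself must additionally be registered in the NodeManagers' yarn-site.xml):
export HADOOP_CONF_DIR=/etc/hadoop/conf
/opt/spark-2.1.0-bin-hadoop2.6/bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  your-app.jar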
Looks like I can't comment on an answer, so excuse this post as an answer.
Is it possible to install the Spark 2 parcel on an RPM-installed cluster using CM?
As of CDH 6.0, Spark 2 is included as RPMs. Problem solved.

Do I need Hadoop on my Windows machine to connect to HBase running on Linux?

Do I need Hadoop on my Windows machine to connect to HBase running on Ubuntu with Hadoop?
My HBase is running fine on my Ubuntu machine, and I am able to connect with Eclipse on the same machine (I am using Kundera to connect to HBase). Now I want to connect to HBase from my Windows 7 Eclipse IDE. Do I need to install Hadoop on Windows to connect to the remote HBase on Ubuntu? When I tried, I got something like this:
Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
All you need are the Hadoop and HBase jars, and a Configuration object initialized with:
1. hbase.zookeeper.quorum (the ZooKeeper quorum of the cluster)
2. hbase.zookeeper.property.clientPort
3. zookeeper.znode.parent
Then get the connection with that config object, as sketched below.
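One way to supply those three settings is a client-side hbase-site.xml on your classpath, which HBaseConfiguration.create() picks up automatically. A minimal sketch (the host name, port, and znode path are placeholders for your cluster's values, and src/main/resources assumes a Maven-style project):
cat > src/main/resources/hbase-site.xml <<'EOF'
<configuration>
  <property><name>hbase.zookeeper.quorum</name><value>ubuntu-host</value></property>
  <property><name>hbase.zookeeper.property.clientPort</name><value>2181</value></property>
  <property><name>zookeeper.znode.parent</name><value>/hbase</value></property>
</configuration>
EOF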
This problem usually occurs with Hadoop 2.x.x versions. One option is to build the Windows distribution for your Hadoop version.
Refer this link:
http://www.srccodes.com/p/article/38/build-install-configure-run-apache-hadoop-2.2.0-microsoft-windows-os
But before building, try the zip file given in this link:
http://www.srccodes.com/p/article/39/error-util-shell-failed-locate-winutils-binary-hadoop-binary-path
Extract this zip file and copy the files under hadoop-common-2.2.0/bin into your $HADOOP_HOME/bin directory.
Note: for me this works even with Hadoop 2.5.
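In shell terms, that copy step is just (a sketch from a Cygwin prompt; the HADOOP_HOME path is a placeholder):
# point HADOOP_HOME at your Hadoop install, then drop in the Windows binaries
export HADOOP_HOME=/cygdrive/c/hadoop
cp hadoop-common-2.2.0/bin/* "$HADOOP_HOME/bin/"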

How to install Hadoop and Hive on Ubuntu Linux in a VM box?

I am using Windows 7, and I would like to learn Hive and Hadoop, so I installed Ubuntu 13.04 in my VM box. When I go to download Hadoop and Hive, the URL below offers multiple files. Could you please help me install Hive on the Ubuntu box, or share the steps to follow?
http://mirror.tcpdiag.net/apache/hadoop/common/hadoop-1.1.2/
hadoop-1.1.2-1.i386.rpm
hadoop-1.1.2-1.i386.rpm.mds
hadoop-1.1.2-1.x86_64.rpm
hadoop-1.1.2-1.x86_64.rpm.mds
hadoop-1.1.2-bin.tar.gz
hadoop-1.1.2-bin.tar.gz.mds
hadoop-1.1.2.tar.gz
hadoop-1.1.2.tar.gz.mds
hadoop_1.1.2-1_i386.deb
hadoop_1.1.2-1_i386.deb.mds
hadoop_1.1.2-1_x86_64.deb
hadoop_1.1.2-1_x86_64.deb.mds
Since you are new to both Hadoop and Hive, you are better off going with their .tar.gz archives, IMHO. In case things don't go smoothly, you don't have to do the entire uninstall-and-reinstall routine again and again. Just download hadoop-1.1.2.tar.gz, unpack it, keep the unpacked folder at some convenient location, and proceed with the configuration, as sketched below. If you want some help with the configuration, you can visit this post, where I have tried to explain the complete procedure in detail.
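For Hadoop, the download-and-unpack step amounts to (the mirror URL is the one from the question; the target directory is just an example):
# fetch the archive, unpack it, and move it somewhere convenient
wget http://mirror.tcpdiag.net/apache/hadoop/common/hadoop-1.1.2/hadoop-1.1.2.tar.gz
tar -xzf hadoop-1.1.2.tar.gz
mv hadoop-1.1.2 ~/hadoop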
Configuring Hive is quite straightforward: download the .tar.gz file and unpack it just like you did with Hadoop, then follow the steps shown here.
i386: Compiled for a 32-bit architecture
x86_64: Compiled for a 64-bit architecture
.rpm: Red Hat Package Manager file
.deb: Debian Package Manager file
.tar.gz: GZipped archive of the source files
bin.tar.gz: GZipped archive of the compiled source files
.mds: Checksum file
A Linux package manager is (sort of) like an installer on Windows: it automatically collects the necessary dependencies. If you download the source files, you have to compile (and/or link) all the dependencies yourself.
Since you're on Ubuntu, which is a Debian-based Linux distribution, and you don't seem to have much experience in a Linux environment, I would recommend downloading the .deb file for your architecture. If I remember correctly, Ubuntu will automatically launch the package manager when you open the .deb file.
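If you prefer the terminal, installing the .deb directly works too (a sketch; pick the file matching your architecture):
# install the 64-bit package; use the i386 .deb on a 32-bit VM
sudo dpkg -i hadoop_1.1.2-1_x86_64.deb
# pull in any dependencies dpkg complains about
sudo apt-get -f install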
1. Install Hadoop as a single-node cluster setup.
2. Install Hive after that; Hive requires Hadoop preinstalled.
Hadoop requires at least Java 1.6, and for a single-node setup you need SSH installed on your machine (a quick check is sketched below). The rest of the steps are easy.
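Checking those prerequisites on Ubuntu is quick (a sketch; the openssh-server package name is from the stock Ubuntu repositories):
# Hadoop 1.x needs Java 1.6 or later
java -version
# a single-node setup manages its daemons over ssh to localhost
sudo apt-get install openssh-server
ssh localhost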
Go to this link and download the hadoop-1.1.2.tar.gz file (59 MB), then install it:
http://mirror.tcpdiag.net/apache/hadoop/common/stable/
Likewise, if you want to install Hive, go to the official site and download the stable version from there.
