How to copy hadoop examples jar from local to hadoop environment? - linux

I'm still a newbie with Hadoop.
I downloaded a Cloudera VM image of Hadoop, and it did not contain hadoop-examples.jar.
I want to manually copy hadoop-examples.jar (which I obtained elsewhere and which currently sits on my local disk) into the Hadoop environment, specifically to /usr/jars,
so that running hadoop jar /usr/jars/hadoop-examples.jar wordcount words.txt out executes the jar properly.
Thanks!

To copy a file from the local filesystem to an HDFS location, use:
hdfs dfs -put /local_disk_path/hadoop-examples.jar /usr/jars/
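A minimal end-to-end sketch, assuming the jar sits at /home/cloudera/hadoop-examples.jar (a hypothetical local path) and the VM's hadoop/hdfs commands are on the PATH:

```shell
# Create the target HDFS directory if it does not exist yet
hdfs dfs -mkdir -p /usr/jars

# Copy the jar from the local disk into HDFS
hdfs dfs -put /home/cloudera/hadoop-examples.jar /usr/jars/

# Verify the upload before running the job
hdfs dfs -ls /usr/jars

# Run the wordcount example (words.txt must already be in HDFS)
hadoop jar /usr/jars/hadoop-examples.jar wordcount words.txt out
```

Note that -put reads from the local filesystem and writes to HDFS, so the destination path here is an HDFS path, not a directory on the VM's disk.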

Related

How to configure Spark 2.4 correctly with user-provided Hadoop

I'd like to use Spark 2.4.5 (the current stable Spark version) and Hadoop 2.10 (the current stable Hadoop version in the 2.x series). Furthermore, I need to access HDFS, Hive, S3, and Kafka.
http://spark.apache.org provides Spark 2.4.5 pre-built and bundled with either Hadoop 2.6 or Hadoop 2.7.
Another option is to use Spark with user-provided Hadoop, so I tried that one.
As a consequence of using the user-provided Hadoop build, Spark does not include the Hive libraries either.
There will be an error, like here: How to create SparkSession with Hive support (fails with "Hive classes are not found")?
When I add the spark-hive dependency to the spark-shell (spark-submit is affected as well) by using
spark.jars.packages=org.apache.spark:spark-hive_2.11:2.4.5
in spark-defaults.conf, I get this error:
20/02/26 11:20:45 ERROR spark.SparkContext:
Failed to add file:/root/.ivy2/jars/org.apache.avro_avro-mapred-1.8.2.jar to Spark environment
java.io.FileNotFoundException: Jar /root/.ivy2/jars/org.apache.avro_avro-mapred-1.8.2.jar not found
at org.apache.spark.SparkContext.addJarFile$1(SparkContext.scala:1838)
at org.apache.spark.SparkContext.addJar(SparkContext.scala:1868)
at org.apache.spark.SparkContext.$anonfun$new$11(SparkContext.scala:458)
at org.apache.spark.SparkContext.$anonfun$new$11$adapted(SparkContext.scala:458)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:458)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2520)
at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$5(SparkSession.scala:935)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:926)
at org.apache.spark.repl.Main$.createSparkSession(Main.scala:106)
because spark-shell cannot handle classifiers together with bundle dependencies, see https://github.com/apache/spark/pull/21339 and https://github.com/apache/spark/pull/17416
A workaround for the classifier problem looks like this:
$ cp .../.ivy2/jars/org.apache.avro_avro-mapred-1.8.2-hadoop2.jar .../.ivy2/jars/org.apache.avro_avro-mapred-1.8.2.jar
but DevOps won't accept this.
The complete list of dependencies looks like this (I have added line breaks for better readability):
root@a5a04d888f85:/opt/spark-2.4.5/conf# cat spark-defaults.conf
spark.jars.packages=com.fasterxml.jackson.datatype:jackson-datatype-jdk8:2.9.10,
com.fasterxml.jackson.datatype:jackson-datatype-jsr310:2.9.10,
org.apache.spark:spark-hive_2.11:2.4.5,
org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.5,
org.apache.hadoop:hadoop-aws:2.10.0,
io.delta:delta-core_2.11:0.5.0,
org.postgresql:postgresql:42.2.5,
mysql:mysql-connector-java:8.0.18,
com.datastax.spark:spark-cassandra-connector_2.11:2.4.3,
io.prestosql:presto-jdbc:307
(everything works - except for Hive)
Is the combination of Spark 2.4.5 and Hadoop 2.10 used anywhere? How?
How to combine Spark 2.4.5 with user-provided Hadoop and Hadoop 2.9 or 2.10?
Is it necessary to build Spark to get around the Hive dependency problem?
There does not seem to be an easy way to configure Spark 2.4.5 with user-provided Hadoop to use Hadoop 2.10.0.
As my actual task was to minimize dependency problems, I chose to compile Spark 2.4.5 against Hadoop 2.10.0:
./dev/make-distribution.sh \
--name hadoop-2.10.0 \
--tgz \
-Phadoop-2.7 -Dhadoop.version=2.10.0 \
-Phive -Phive-thriftserver \
-Pyarn
Now Maven deals with the Hive dependencies/classifiers, and the resulting package is ready to be used.
In my personal opinion, compiling Spark is actually easier than configuring Spark with user-provided Hadoop.
Integration tests so far have not shown any problems, Spark can access both HDFS and S3 (MinIO).
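To sanity-check such a custom build, a sketch like the following can help (the tarball name below is what make-distribution.sh produces with --name hadoop-2.10.0; adjust it if yours differs):

```shell
# Unpack the distribution produced by make-distribution.sh
tar -xzf spark-2.4.5-bin-hadoop-2.10.0.tgz
cd spark-2.4.5-bin-hadoop-2.10.0

# If Hive support was compiled in, this starts a session without the
# "Hive classes are not found" error
./bin/spark-shell --conf spark.sql.catalogImplementation=hive
```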
Update 2021-04-08
If you want to add support for Kubernetes, just add -Pkubernetes to the list of arguments.
Assuming you don't want to run Spark-on-YARN, start from the bundle "Spark 2.4.5 with Hadoop 2.7", then cherry-pick the Hadoop libraries to upgrade from the "Hadoop 2.10.x" bundle:
Discard the spark-yarn / hadoop-yarn-* / hadoop-mapreduce-client-* JARs because you won't need them, except hadoop-mapreduce-client-core, which is referenced by write operations on HDFS and S3 (cf. the "MR commit procedure", V1 or V2)
you may also discard spark-mesos / mesos-* and/or spark-kubernetes / kubernetes-* JARs depending on what you plan to run Spark on
you may also discard spark-hive-thriftserver and hive-* JARS if you don't plan to run a "thrift server" instance, except hive-metastore that is necessary for, as you might guess, managing the Metastore (either a regular Hive Metastore service or an embedded Metastore inside the Spark session)
Discard hadoop-hdfs / hadoop-common / hadoop-auth / hadoop-annotations / htrace-core* / xercesImpl JARs
Replace them with the hadoop-hdfs-client / hadoop-common / hadoop-auth / hadoop-annotations / htrace-core* / xercesImpl / stax2-api JARs from Hadoop 2.10 (under common/ and common/lib/, or hdfs/ and hdfs/lib/)
Add the S3A connector from Hadoop 2.10, i.e. the hadoop-aws / jets3t / woodstox-core JARs (under tools/lib/)
Download aws-java-sdk from Amazon (it cannot be bundled with Hadoop because it's not under the Apache license, I guess)
and finally, run a lot of tests...
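A hedged shell sketch of the cherry-picking described above; the Hadoop install path and every jar glob are assumptions, so double-check each one against your actual distributions before deleting anything:

```shell
cd spark-2.4.5-bin-hadoop2.7/jars
H=/opt/hadoop-2.10.0/share/hadoop   # assumed Hadoop 2.10 install layout

# Drop YARN/MapReduce jars, keeping hadoop-mapreduce-client-core
ls hadoop-yarn-*.jar hadoop-mapreduce-client-*.jar 2>/dev/null \
  | grep -v 'client-core' | xargs -r rm -f
rm -f spark-yarn_*.jar

# Drop the bundled Hadoop 2.7 client jars
rm -f hadoop-hdfs-*.jar hadoop-common-*.jar hadoop-auth-*.jar \
      hadoop-annotations-*.jar htrace-core*.jar xercesImpl-*.jar

# Pull replacements from Hadoop 2.10 (locations vary between common/,
# common/lib/, hdfs/ and hdfs/lib/)
cp "$H"/common/hadoop-common-*.jar \
   "$H"/common/lib/hadoop-auth-*.jar \
   "$H"/common/lib/hadoop-annotations-*.jar \
   "$H"/common/lib/htrace-core*.jar \
   "$H"/common/lib/stax2-api-*.jar \
   "$H"/common/lib/xercesImpl-*.jar \
   "$H"/hdfs/hadoop-hdfs-client-*.jar .

# Add the S3A connector and its companions
cp "$H"/tools/lib/hadoop-aws-*.jar "$H"/tools/lib/jets3t-*.jar \
   "$H"/tools/lib/woodstox-core-*.jar .
```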
That worked for me, after some trial-and-error -- with a caveat: I ran my tests against an S3-compatible storage system, but not against the "real" S3, and not against regular HDFS. And without a "real" Hive Metastore service, just the embedded in-memory & volatile Metastore that Spark runs by default.
For the record, the process is the same with Spark 3.0.0 previews and Hadoop 3.2.1, except that
you also have to upgrade guava
you don't have to upgrade xercesImpl nor htrace-core nor stax2-api
you don't need jets3t any more
you need to retain more hadoop-mapreduce-client-* JARs (probably because of the new "S3 committers")

connecting hive to from spark in intellij

I'm trying to connect to a remote Hive from within my Spark program in IntelliJ, which is installed on my local machine.
I placed the Hadoop cluster config files on the local machine and configured the HADOOP_CONF_DIR environment variable in the IntelliJ run configuration of this Spark program so that it can detect the Hadoop cluster, but IntelliJ is somehow not reading these files, and the Spark program defaults to a local Hive metastore instance.
Is there any way to configure IntelliJ to read the Hadoop config files locally? Any help is highly appreciated.
Please configure the SPARK_CONF_DIR variable and copy hive-site.xml into that directory. Spark will then connect to the specified Hive metastore; make sure that hive-site.xml points to your cluster details.
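For example (a sketch; the directory locations are assumptions, and hive-site.xml must be the one from your cluster):

```shell
# Keep the cluster configs in a dedicated directory and point Spark at it
export SPARK_CONF_DIR="$HOME/cluster-conf"
mkdir -p "$SPARK_CONF_DIR"

# hive-site.xml copied from the cluster (source path is hypothetical)
cp /path/from/cluster/hive-site.xml "$SPARK_CONF_DIR/"
```

In IntelliJ, set the same SPARK_CONF_DIR variable in the run configuration's environment variables so the IDE-launched JVM sees it too.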
Add the Hadoop configuration files folder to the IntelliJ project classpath:
Project Settings -> Libraries -> + -> Java -> select the folder with all the config files -> Classes

Is it necessary to install hadoop in /usr/local?

I am trying to build a Hadoop cluster with four nodes.
The four machines are from my school's lab, and I found that their /usr/local directories are mounted from the same shared disk, which means their /usr/local contents are identical.
The problem is that I cannot start the DataNode on the slaves because the Hadoop files are always the same (like tmp/dfs/data).
I am planning to configure and install Hadoop in another directory, such as /opt.
The problem is that almost all the installation tutorials ask us to install it in /usr/local, so I was wondering: will there be any bad consequences if I install Hadoop somewhere else, like /opt?
Btw, I am using Ubuntu 16.04.
As long as HADOOP_HOME points to where you extracted the Hadoop binaries, it shouldn't matter.
You'll also want to update PATH in ~/.bashrc, for example:
export HADOOP_HOME=/path/to/hadoop_x.yy
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
For reference, I have some configuration files inside of /etc/hadoop.
(Note: Apache Ambari makes installation easier)
It is not at all necessary to install Hadoop under /usr/local. That location is generally used when you install a single-node Hadoop cluster (although it is not mandatory). As long as you have the following variables specified in .bashrc, any location should work:
export HADOOP_HOME=<path-to-hadoop-install-dir>
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
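A runnable sketch, assuming the tarball was extracted to /opt/hadoop-2.7.3 (any prefix works the same way):

```shell
# Point HADOOP_HOME at the extraction directory and extend PATH
export HADOOP_HOME=/opt/hadoop-2.7.3
export PATH="$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin"

# hadoop and the daemon scripts now resolve via PATH,
# regardless of the install prefix
echo "$PATH"
```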

Input/output error while copying from hadoop file system to local

hadoop fs -copyToLocal /paulp /abcd (I want to copy the folder /paulp in the Hadoop file system to the /abcd folder on the local disk.)
But the output of that command shows: copyToLocal: mkdir `/abcd': Input/output error
I use Ubuntu 14.04 and Hadoop 2.7.1.
Can you provide a solution to this?
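One possible cause (an assumption; the thread does not confirm it) is that the local user cannot create /abcd at the root of the local filesystem, or that the underlying disk is read-only or faulty. Copying into a directory the user can write to sidesteps the first case:

```shell
# Create a writable destination in the home directory instead of /
mkdir -p ~/abcd

# Copy the HDFS folder /paulp into it
hadoop fs -copyToLocal /paulp ~/abcd
```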

How to manually deploy 3rd party utility jar for Apache Spark cluster?

I have an Apache Spark cluster (multi-node) and I would like to manually deploy some utility jars to each Spark node. Where should I put these jars?
For example: spark-streaming-twitter_2.10-1.6.0.jar
I know we can use Maven to build a fat jar which includes these jars; however, I would like to deploy these utilities manually. That way, programmers would not have to deploy these utility jars themselves.
Any suggestion?
1. Copy your 3rd-party jars to a reserved HDFS directory,
for example hdfs://xxx-ns/user/xxx/3rd-jars/
2. In spark-submit, specify these jars using their HDFS path;
with hdfs:, executors will pull down files and JARs from the HDFS directory:
--jars hdfs://xxx-ns/user/xxx/3rd-jars/xxx.jar
3. spark-submit will not repeatedly upload these jars:
Client: Source and destination file systems are the same. Not copying hdfs://xxx-ns/user/xxx/3rd-jars/xxx.jar
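Put together, a sketch of this approach (the HDFS namespace and directory come from the answer above; the application class and jar name are hypothetical examples):

```shell
# One-time: upload the utility jar to the reserved HDFS directory
hdfs dfs -mkdir -p /user/xxx/3rd-jars
hdfs dfs -put spark-streaming-twitter_2.10-1.6.0.jar /user/xxx/3rd-jars/

# Every submit: reference the jar by its HDFS path;
# executors fetch it themselves and nothing is re-uploaded
spark-submit \
  --master yarn \
  --jars hdfs://xxx-ns/user/xxx/3rd-jars/spark-streaming-twitter_2.10-1.6.0.jar \
  --class com.example.StreamingApp \
  my-application.jar
```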
spark-submit and spark-shell have a --jars option. This will distribute the jars to all the executors. The spark-submit --help text for --jars reads:
--jars JARS Comma-separated list of local jars to include on the driver
and executor classpaths.
This is taken from the programming guide.
Or, to also add code.jar to its classpath, use:
$ ./bin/spark-shell --master local[4] --jars code.jar
