I've been trying to build a custom Spark distribution against a custom-built Hadoop (I need to apply a patch to Hadoop 2.9.1 that allows me to use S3Guard on paths that start with s3://).
Here is how I build Spark in my Dockerfile, after cloning the repository and checking out Spark 2.3.1:
ARG HADOOP_VER=2.9.1
RUN bash -c \
"MAVEN_OPTS='-Xmx2g -XX:ReservedCodeCacheSize=512m' \
./dev/make-distribution.sh \
--name hadoop${HADOOP_VER} \
--tgz \
-Phadoop-provided \
-Dhadoop.version=${HADOOP_VER} \
-Phive \
-Phive-thriftserver \
-Pkubernetes"
This compiles successfully, but when I try to use Spark with s3:// paths I still get an error from the Hadoop code that I'm sure I removed with my patch when compiling it. So, as far as I can tell, that Spark build is not using the Hadoop JARs I provide.
What is the right way to compile Spark so that it does not include the Hadoop JARs and uses the ones I provide?
Note: I run in standalone mode and I set SPARK_DIST_CLASSPATH=$(hadoop classpath) so that it points to my Hadoop classpath.
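For completeness, a minimal sketch of how that is wired up in conf/spark-env.sh (the HADOOP_HOME path is just a placeholder for wherever my patched Hadoop lives):
# conf/spark-env.sh
export HADOOP_HOME=/opt/hadoop-2.9.1   # placeholder path to the patched Hadoop install
export SPARK_DIST_CLASSPATH=$("${HADOOP_HOME}/bin/hadoop" classpath)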
For custom Hadoop versions, you need to get your own artifacts onto the local machines and into the Spark tar file which is distributed around the cluster (usually in HDFS) and downloaded when the workers are deployed (in YARN; no idea about k8s).
The best way to do this reliably is to locally build a hadoop release with a new version number, and build spark against that.
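A hedged sketch of the Hadoop side of that, using the standard Maven flags and assuming the patched checkout's POM version is set to something that does not exist in the public repos (e.g. 2.9.3-SNAPSHOT):
# in the patched Hadoop source tree: install the artifacts into the local ~/.m2 repo
mvn clean install -DskipTests -Dmaven.javadoc.skip=true
Then build Spark against that version: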
dev/make-distribution.sh -Phive -Phive-thriftserver -Pyarn -Pkubernetes -Phadoop-3.1 -Phadoop-cloud -Dhadoop.version=2.9.3-SNAPSHOT
That will create a spark distro with the hadoop-aws and matching SDK which you have built up.
It's pretty slow: run nailgun/zinc if you can for some speedup. If you refer to a version which is also in the public repos, there's a high chance that cached copies from your local Maven repo (~/.m2/repository) have crept in.
Then: bring up spark shell and test from there, before trying any more complex setups.
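A minimal smoke test could look like this (the bucket and key are placeholders; use s3:// instead of s3a:// if that is what your patch enables):
# from the unpacked distribution
./bin/spark-shell
# then, inside the shell:
#   spark.read.text("s3a://my-bucket/some-key").count()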
Finally, there is an open JIRA for S3Guard to not worry about s3 vs s3a in URLs. Is that your patch? If not, does it work? We could probably get it into future Hadoop releases if the people who need it are happy with it.
Related
I'd like to use Spark 2.4.5 (the current stable Spark version) and Hadoop 2.10 (the current stable Hadoop version in the 2.x series). Furthermore, I need to access HDFS, Hive, S3, and Kafka.
http://spark.apache.org provides Spark 2.4.5 pre-built and bundled with either Hadoop 2.6 or Hadoop 2.7.
Another option is the Spark build with user-provided Hadoop, so I tried that one.
As a consequence of using the build with user-provided Hadoop, Spark does not include the Hive libraries either.
There will be an error, like here: How to create SparkSession with Hive support (fails with "Hive classes are not found")?
When I add the spark-hive dependency to the spark-shell (spark-submit is affected as well) by using
spark.jars.packages=org.apache.spark:spark-hive_2.11:2.4.5
in spark-defaults.conf, I get this error:
20/02/26 11:20:45 ERROR spark.SparkContext:
Failed to add file:/root/.ivy2/jars/org.apache.avro_avro-mapred-1.8.2.jar to Spark environment
java.io.FileNotFoundException: Jar /root/.ivy2/jars/org.apache.avro_avro-mapred-1.8.2.jar not found
at org.apache.spark.SparkContext.addJarFile$1(SparkContext.scala:1838)
at org.apache.spark.SparkContext.addJar(SparkContext.scala:1868)
at org.apache.spark.SparkContext.$anonfun$new$11(SparkContext.scala:458)
at org.apache.spark.SparkContext.$anonfun$new$11$adapted(SparkContext.scala:458)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:458)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2520)
at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$5(SparkSession.scala:935)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:926)
at org.apache.spark.repl.Main$.createSparkSession(Main.scala:106)
because spark-shell cannot handle classifiers together with bundle dependencies, see https://github.com/apache/spark/pull/21339 and https://github.com/apache/spark/pull/17416
A workaround for the classifier problem looks like this:
$ cp .../.ivy2/jars/org.apache.avro_avro-mapred-1.8.2-hadoop2.jar .../.ivy2/jars/org.apache.avro_avro-mapred-1.8.2.jar
but DevOps won't accept this.
The complete list of dependencies looks like this (I have added line breaks for better readability):
root@a5a04d888f85:/opt/spark-2.4.5/conf# cat spark-defaults.conf
spark.jars.packages=com.fasterxml.jackson.datatype:jackson-datatype-jdk8:2.9.10,
com.fasterxml.jackson.datatype:jackson-datatype-jsr310:2.9.10,
org.apache.spark:spark-hive_2.11:2.4.5,
org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.5,
org.apache.hadoop:hadoop-aws:2.10.0,
io.delta:delta-core_2.11:0.5.0,
org.postgresql:postgresql:42.2.5,
mysql:mysql-connector-java:8.0.18,
com.datastax.spark:spark-cassandra-connector_2.11:2.4.3,
io.prestosql:presto-jdbc:307
(everything works - except for Hive)
Is the combination of Spark 2.4.5 and Hadoop 2.10 used anywhere? How?
How can I combine Spark 2.4.5 with user-provided Hadoop and Hadoop 2.9 or 2.10?
Is it necessary to build Spark to get around the Hive dependency problem?
There does not seem to be an easy way to configure Spark 2.4.5 with user-provided Hadoop to use Hadoop 2.10.0.
As my task actually was to minimize dependency problems, I have chosen to compile Spark 2.4.5 against Hadoop 2.10.0.
./dev/make-distribution.sh \
--name hadoop-2.10.0 \
--tgz \
-Phadoop-2.7 -Dhadoop.version=2.10.0 \
-Phive -Phive-thriftserver \
-Pyarn
Now Maven deals with the Hive dependencies/classifiers, and the resulting package is ready to be used.
In my personal opinion compiling Spark is actually easier than configuring the Spark with-user-provided Hadoop.
Integration tests so far have not shown any problems, Spark can access both HDFS and S3 (MinIO).
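For reference, a sketch of the S3A settings used for the MinIO test (endpoint, credentials and bucket are placeholders; the fs.s3a.* keys are standard Hadoop options passed through Spark's spark.hadoop.* prefix):
spark-shell \
  --conf spark.hadoop.fs.s3a.endpoint=http://minio.example.local:9000 \
  --conf spark.hadoop.fs.s3a.path.style.access=true \
  --conf spark.hadoop.fs.s3a.access.key=ACCESS_KEY \
  --conf spark.hadoop.fs.s3a.secret.key=SECRET_KEY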
Update 2021-04-08
If you want to add support for Kubernetes, just add -Pkubernetes to the list of arguments
Assuming you don't want to run Spark-on-YARN -- start from the "Spark 2.4.5 with Hadoop 2.7" bundle, then cherry-pick the Hadoop libraries to upgrade from the "Hadoop 2.10.x" bundle (a shell sketch of the whole procedure follows this list):
Discard the spark-yarn / hadoop-yarn-* / hadoop-mapreduce-client-* JARs because you won't need them, except hadoop-mapreduce-client-core, which is referenced by write operations on HDFS and S3 (cf. the "MR commit procedure", V1 or V2)
You may also discard the spark-mesos / mesos-* and/or spark-kubernetes / kubernetes-* JARs, depending on what you plan to run Spark on
You may also discard the spark-hive-thriftserver and hive-* JARs if you don't plan to run a "thrift server" instance, except hive-metastore, which is necessary for, as you might guess, managing the Metastore (either a regular Hive Metastore service or an embedded Metastore inside the Spark session)
Discard hadoop-hdfs / hadoop-common / hadoop-auth / hadoop-annotations / htrace-core* / xercesImpl JARs
Replace them with the hadoop-hdfs-client / hadoop-common / hadoop-auth / hadoop-annotations / htrace-core* / xercesImpl / stax2-api JARs from Hadoop 2.10 (under common/ and common/lib/, or hdfs/ and hdfs/lib/)
Add the S3A connector from Hadoop 2.10, i.e. the hadoop-aws / jets3t / woodstox-core JARs (under tools/lib/)
Download aws-java-sdk from Amazon (it cannot be bundled with Hadoop because it's not an Apache license, I guess)
And finally, run a lot of tests...
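A rough shell sketch of the cherry-picking described above (all paths are placeholders, and the exact layout inside the Hadoop tarball may differ slightly):
SPARK_HOME=/opt/spark-2.4.5-bin-hadoop2.7    # placeholder
HADOOP_HOME=/opt/hadoop-2.10.0               # placeholder

cd "$SPARK_HOME/jars"

# discard the Hadoop 2.7 client JARs shipped with the Spark bundle
rm -f hadoop-hdfs-*.jar hadoop-common-*.jar hadoop-auth-*.jar \
      hadoop-annotations-*.jar htrace-core*.jar xercesImpl-*.jar

# cherry-pick the Hadoop 2.10 counterparts from common/, common/lib/ and hdfs/
cp "$HADOOP_HOME"/share/hadoop/common/hadoop-common-*.jar .
cp "$HADOOP_HOME"/share/hadoop/common/lib/hadoop-auth-*.jar .
cp "$HADOOP_HOME"/share/hadoop/common/lib/hadoop-annotations-*.jar .
cp "$HADOOP_HOME"/share/hadoop/common/lib/htrace-core*.jar .
cp "$HADOOP_HOME"/share/hadoop/common/lib/stax2-api-*.jar .
cp "$HADOOP_HOME"/share/hadoop/common/lib/xercesImpl-*.jar .
cp "$HADOOP_HOME"/share/hadoop/hdfs/hadoop-hdfs-client-*.jar .

# S3A connector and friends from tools/lib/; the AWS SDK JAR is downloaded separately
cp "$HADOOP_HOME"/share/hadoop/tools/lib/hadoop-aws-*.jar .
cp "$HADOOP_HOME"/share/hadoop/tools/lib/jets3t-*.jar .
cp "$HADOOP_HOME"/share/hadoop/tools/lib/woodstox-core-*.jar .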
That worked for me, after some trial-and-error -- with a caveat: I ran my tests against an S3-compatible storage system, but not against the "real" S3, and not against regular HDFS. And without a "real" Hive Metastore service, just the embedded in-memory & volatile Metastore that Spark runs by default.
For the record, the process is the same with Spark 3.0.0 previews and Hadoop 3.2.1, except that
you also have to upgrade guava
you don't have to upgrade xercesImpl nor htrace-core nor stax2-api
you don't need jets3t any more
you need to retain more hadoop-mapreduce-client-* JARs (probably because of the new "S3 committers")
In order to test and learn Spark functions, developers need the latest Spark version, since the APIs and methods from before version 2.0 are obsolete and no longer work in newer versions. This poses a challenge: developers are forced to install Spark manually, which wastes a considerable amount of development time.
How do I use a later version of Spark on the Quickstart VM?
Nobody should waste the setup time that I have wasted, so here is the solution.
SPARK 2.2 Installation Setup on Cloudera VM
Step 1: Download a quickstart_vm from the link:
Prefer the VMware platform as it is easy to use; in any case, all of the options are viable.
The entire tar file is around 5.4 GB. We need to provide a business email ID, as it won't accept personal email IDs.
Step 2: The virtual environment requires around 8 GB of RAM; please allocate sufficient memory to avoid performance glitches.
Step 3: Please open the terminal and switch to root user as:
su root
password: cloudera
Step 4: Cloudera provides Java version 1.7.0_67, which is old and does not match our needs. To avoid Java-related exceptions, please install Java with the following commands:
Downloading Java:
wget -c --header "Cookie: oraclelicense=accept-securebackup-cookie" http://download.oracle.com/otn-pub/java/jdk/8u131-b11/d54c1d3a095b4ff2b6607d096fa80163/jdk-8u131-linux-x64.tar.gz
Switch to the /usr/java/ directory with the "cd /usr/java/" command.
Copy the Java download tar file to the /usr/java/ directory.
Untar it with "tar -zxvf jdk-8u131-linux-x64.tar.gz".
Open the profile file with the command "vi ~/.bash_profile".
Export JAVA_HOME to the new Java directory:
export JAVA_HOME=/usr/java/jdk1.8.0_131
Save and Exit.
In order to reflect the above change, the following command needs to be executed in the shell:
source ~/.bash_profile
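A quick sanity check that the new JDK is picked up (optional):
echo $JAVA_HOME
"$JAVA_HOME/bin/java" -version    # should report 1.8.0_131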
The Cloudera VM provides Spark 1.6 by default. However, the 1.6 APIs are old and do not match production environments, so we need to download and manually install Spark 2.2.
Switch to /opt/ directory with the command:
cd /opt/
Download spark with the command:
wget https://d3kbcqa49mib13.cloudfront.net/spark-2.2.0-bin-hadoop2.7.tgz
Untar the spark tar with the following command:
tar -zxvf spark-2.2.0-bin-hadoop2.7.tgz
We need to define some environment variables as default settings:
Please open a file with the following command:
vi /opt/spark-2.2.0-bin-hadoop2.7/conf/spark-env.sh
Paste the following configurations in the file:
SPARK_MASTER_IP=192.168.50.1
SPARK_EXECUTOR_MEMORY=512m
SPARK_DRIVER_MEMORY=512m
SPARK_WORKER_MEMORY=512m
SPARK_DAEMON_MEMORY=512m
Save and exit
We need to start spark with the following command:
/opt/spark-2.2.0-bin-hadoop2.7/sbin/start-all.sh
Export SPARK_HOME:
export SPARK_HOME=/opt/spark-2.2.0-bin-hadoop2.7/
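To be able to call spark-shell directly (as in the last step below), also put the Spark bin directory on the PATH, e.g. in ~/.bash_profile:
export PATH=$SPARK_HOME/bin:$PATH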
Change the permissions of the directory:
chmod 777 -R /tmp/hive
Try "spark-shell"; it should work.
I just upgraded from Spark 2.0.2 to Spark 2.1.0 (by downloading the prebuilt version for Hadoop 2.7 and later). No Hive is installed.
Upon launch of the spark-shell, the metastore_db/ folder and derby.log file are created at the launch location, together with a bunch of warning logs (which were not printed in the previous version).
Closer inspection of the debug logs shows that Spark 2.1.0 tries to initialise a HiveMetastoreConnection:
17/01/13 09:14:44 INFO HiveUtils: Initializing HiveMetastoreConnection version 1.2.1 using Spark classes.
Similar debug logs for Spark 2.0.2 do not show any initialisation of HiveMetastoreConnection.
Is this intended behaviour? Could it be related to the fact that spark.sql.warehouse.dir is now a static configuration shared among sessions? How do I avoid this, since I have no Hive installed?
Thanks in advance!
From Spark 2.1.0 documentation pages:
When not configured by the hive-site.xml, the context automatically
creates metastore_db in the current directory and creates a directory
configured by spark.sql.warehouse.dir, which defaults to the directory
spark-warehouse in the current directory that the Spark application is
started. Note that the hive.metastore.warehouse.dir property in
hive-site.xml is deprecated since Spark 2.0.0. Instead, use
spark.sql.warehouse.dir to specify the default location of database in
warehouse.
Since you do not have Hive installed, you will not have a hive-site.xml config file, and this must be defaulting to the current directory.
If you are not planning to use HiveContext in Spark, you could reinstall Spark 2.1.0 from source, rebuilding it with Maven and making sure you omit -Phive -Phive-thriftserver flags which enable Hive support.
For future googlers: the actual underlying reason for the creation of metastore_db and derby.log in every working directory is the default value of derby.system.home.
This can be changed in spark-defaults.conf, see here.
This also happens with Spark 1.6. You can change the path by adding the following to the Spark submit extra options:
-Dderby.system.home=/tmp/derby
(or via derby.properties; there are several ways to change it).
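For example, a sketch of the spark-defaults.conf entry (the /tmp path is just a placeholder):
spark.driver.extraJavaOptions    -Dderby.system.home=/tmp/derby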
Is there any way I can install and use Spark without Hadoop/HDFS on a single node? I just want to try some simple examples and I would like to keep it as light as possible.
Yes, it is possible.
However, it needs the Hadoop libraries underneath, so download any version of prebuilt Spark, for example:
wget http://www.apache.org/dyn/closer.lua/spark/spark-1.6.0/spark-1.6.0-bin-hadoop2.4.tgz
tar -xvzf spark-1.6.0-bin-hadoop2.4.tgz
cd spark-1.6.0-bin-hadoop2.4/bin
./spark-shell
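No HDFS daemon is needed: the shell runs in local mode and can read plain local files. A minimal check (the file path is an arbitrary example):
./spark-shell --master "local[2]"
# inside the shell:
#   sc.textFile("file:///etc/hosts").count()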
I have an Apache Spark cluster (multi-node) and I would like to manually deploy some utility JARs to each Spark node. Where should I put these JARs?
For example: spark-streaming-twitter_2.10-1.6.0.jar
I know we can use Maven to build a fat JAR that includes these JARs; however, I would like to deploy these utilities manually. This way, programmers would not have to deploy these utility JARs themselves.
Any suggestion?
1. Copy your third-party JARs to a reserved HDFS directory;
for example hdfs://xxx-ns/user/xxx/3rd-jars/
2. In spark-submit, specify these JARs using their HDFS paths;
hdfs: - executors will pull down files and JARs from the HDFS directory
--jars hdfs://xxx-ns/user/xxx/3rd-jars/xxx.jar
3. spark-submit will not repeatedly upload these JARs:
Client: Source and destination file systems are the same. Not copying hdfs://xxx-ns/user/xxx/3rd-jars/xxx.jar
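Putting it together, a sketch of the submit command (the master URL, application JAR and class name are placeholders; only the --jars path comes from the steps above):
spark-submit \
  --master spark://master-host:7077 \
  --class com.example.MyApp \
  --jars hdfs://xxx-ns/user/xxx/3rd-jars/spark-streaming-twitter_2.10-1.6.0.jar \
  my-app.jar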
spark-submit and spark-shell have a --jars option. This will distribute the JARs to all the executors. The spark-submit --help output for --jars is as follows:
--jars JARS Comma-separated list of local jars to include on the driver
and executor classpaths.
This is taken from the programming guide:
Or, to also add code.jar to its classpath, use:
$ ./bin/spark-shell --master local[4] --jars code.jar