I'm trying to run two or more jobs in parallel. All jobs append data to the same output path. The problem is that the first job to finish performs cleanup and deletes the _temporary folder, which causes the other jobs to throw an exception.
With hadoop-client 3 there is a configuration flag, mapreduce.fileoutputcommitter.cleanup.skipped, to disable automatic cleanup of this folder.
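If that flag is present in the Hadoop libraries on your classpath, one way to try it from Spark is the generic spark.hadoop.* prefix, which forwards the property into the Hadoop Configuration. A minimal sketch (the jar and class names are placeholders):

    spark-submit \
      --conf spark.hadoop.mapreduce.fileoutputcommitter.cleanup.skipped=true \
      --class com.example.MyJob \
      my-job.jar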
I was able to exclude the dependencies from spark-core and add the new hadoop-client using Maven. This runs fine for master=local, but I'm not convinced it is correct.
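For reference, the Maven arrangement described above looks roughly like this; the versions and Scala suffix are only examples, not a statement that this combination is supported:

    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.11</artifactId>
      <version>2.3.0</version>
      <exclusions>
        <exclusion>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-client</artifactId>
        </exclusion>
      </exclusions>
    </dependency>
    <!-- pull in the newer hadoop-client explicitly -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>3.1.0</version>
    </dependency>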
My questions are:
1. Is it possible to use a different hadoop-client library with Apache Spark (e.g. hadoop-client version 3 with Apache Spark 2.3), and what is the correct approach?
2. Is there a better way to run multiple jobs in parallel writing under the same path?
Related
My main Spark project has dependencies on other utils jars, so the set of combinations could be like:
1. main_spark-1.0.jar will work with utils_spark-1.0.jar (some jobs use this set)
2. main_spark-2.0.jar will work with utils_spark-2.0.jar (and some of the jobs use this set)
The approach that worked for me to handle this scenario is to pass the jars with spark-opts:
oozie spark action job1
    <jar>main_spark-1.0.jar</jar>
    <spark-opts>--jars utils_spark-1.0.jar</spark-opts>
oozie spark action job2
    <jar>main_spark-2.0.jar</jar>
    <spark-opts>--jars utils_spark-2.0.jar</spark-opts>
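For context, those elements sit inside the Oozie Spark action roughly like this; the main class, transitions and schema version are placeholders, not values from the actual workflow:

    <action name="job1">
        <spark xmlns="uri:oozie:spark-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <master>yarn-cluster</master>
            <name>main_spark_job1</name>
            <class>com.example.Main</class>
            <jar>main_spark-1.0.jar</jar>
            <spark-opts>--jars utils_spark-1.0.jar</spark-opts>
        </spark>
        <ok to="end"/>
        <error to="kill"/>
    </action>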
I tested this configuration in two different actions and it works.
The questions I have are:
How is this different from loading the jars in the app lib path (oozie)?
If both jobs/actions run in parallel on the same YARN cluster, is there any possibility of a class loader issue (multiple versions of the same jar)?
In my understanding both applications will be running in their own Spark context, so it should be OK, but is there any expert advice?
If both jobs/actions run in parallel on the same YARN cluster, is there any possibility of a class loader issue (multiple versions of the same jar)?
No (or at least it is not expected; if it happened I'd consider it a bug).
Submitting a Spark application to a YARN cluster always ends up as a separate set of driver and executors that together form an environment isolated from other Spark applications.
I have a number of Spark batch jobs, each of which needs to be run every x hours. I'm sure this must be a common problem, but there seems to be relatively little on the internet about best practice for setting this up. My current setup is as follows:
Build system (sbt) builds a tar.gz containing a fat jar + a script that will invoke spark-submit.
Once tests have passed, CI system (Jenkins) copies the tar.gz to hdfs.
I set up a Chronos job to unpack the tar.gz to the local filesystem and run the script that submits to Spark (roughly like the sketch after this list).
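A rough sketch of the kind of wrapper script step 3 relies on, assuming made-up paths and file names:

    #!/bin/bash
    # Fetch the release from HDFS, unpack it locally, and hand off to the submit script.
    set -e
    rm -rf /tmp/myjob /tmp/myjob.tar.gz
    hdfs dfs -get /deploy/myjob/myjob.tar.gz /tmp/myjob.tar.gz
    mkdir -p /tmp/myjob
    tar -xzf /tmp/myjob.tar.gz -C /tmp/myjob
    /tmp/myjob/submit.sh   # calls spark-submit on the fat jar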
This setup works reasonably well, but there are some aspects of step 3) that I'm not fond of. Specifically:
I need a separate script (executed by Chronos) that copies from HDFS, unpacks, and runs the spark-submit task. As far as I can tell Chronos can't run scripts from HDFS, so I have to keep a copy of this script on every Mesos worker, which makes deployment more complex than it would be if everything just lived on HDFS.
I have a feeling that I have too many moving parts. For example, I was wondering if I could create an executable jar that could submit itself (the args would be the Spark master and the main class), in which case I could do away with at least one of the wrapper scripts. Unfortunately I haven't found a good way of doing this.
As this is a problem that everyone faces, I was wondering if anyone could suggest a better solution.
To download and extract the archive you can use the Mesos fetcher, by setting the uris field in the Chronos job config.
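A trimmed-down Chronos job definition along those lines (name, schedule, command and path are hypothetical; the Mesos fetcher extracts common archive formats such as tar.gz into the task sandbox):

    {
      "name": "my-spark-batch",
      "schedule": "R/2015-06-01T00:00:00Z/PT6H",
      "command": "cd myjob && ./submit.sh",
      "uris": ["hdfs://namenode:8020/deploy/myjob/myjob.tar.gz"]
    }

This assumes the archive unpacks into a myjob/ directory containing the submit script.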
To do the same on the executor side, you can set the spark.executor.uri parameter in the default Spark conf.
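That parameter points Mesos executors at a Spark distribution archive to download, for example in spark-defaults.conf (the path and version are examples):

    spark.executor.uri  hdfs://namenode:8020/frameworks/spark/spark-1.6.2-bin-hadoop2.6.tgz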
I just started learning Spark. I have imported the Spark source code into IDEA and made some small changes (just adding some println()) to the source code. What should I do to see these updates? Should I recompile Spark? Thanks!
At the bare minimum, you will need Maven 3.3.3 and Java 7+.
You can follow the steps at http://spark.apache.org/docs/latest/building-spark.html
The "make-distribution.sh" script is quite handy which comes within the spark source code root directory. This script will produce a distributable tar.gz which you can simply extract and launch spark-shell or spark-submit. After making the source code changes in spark, you can run this script with the right options (mainly passing the desired hadoop version, yarn or hive support options but these are required if you want to run on top of hadoop distro, or want to connect to existing hive).
BTW, inserting println() is not a good idea, as it can severely slow down the job. You should use a logger instead.
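Inside Spark's own sources most classes already mix in Spark's Logging trait, so a log call is the natural replacement for a println. A minimal sketch (the class and message are made up; the trait is org.apache.spark.Logging in Spark 1.x and org.apache.spark.internal.Logging in 2.x):

    import org.apache.spark.Logging

    class MyPatchedComponent extends Logging {
      def doWork(): Unit = {
        // Routed through log4j, so it can be switched on/off by log level instead of editing code.
        logDebug("doWork called")
      }
    }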
I wrote SparkSQL code in Java that accesses Hive tables, and packaged it as a jar file that can be run using spark-submit.
Now I want to run this jar as an Oozie workflow (and coordinator, once I get the workflow to work). When I try to do that, the job fails and I get this in the Oozie job logs:
java.lang.NoClassDefFoundError: org/apache/hadoop/hive/conf/HiveConf
What I did was to look for the jar in $HIVE_HOME/lib that contains that class, copy that jar into the lib directory under my Oozie workflow root path, and add this to the Spark action in workflow.xml:
<spark-opts> --jars lib/*.jar</spark-opts>
But this leads to another java.lang.NoClassDefFoundError pointing to another missing class, so I repeated the process of looking for the jar and copying it, ran the job, and the same thing happened all over again. It looks like it depends on many of the jars in my Hive lib.
What I don't understand is that when I run the jar with spark-submit in the shell, it works fine; I can SELECT and INSERT into my Hive tables. It is only when I use Oozie that this occurs. It looks like Spark can no longer see the Hive libraries when run inside an Oozie workflow job. Can someone explain how this happens?
How do I add or reference the necessary classes / jars to the Oozie path?
I am using Cloudera Quickstart VM CDH 5.4.0, Spark 1.4.0, Oozie 4.1.0.
Usually the "edge node" (the one you can connect to) has a lot of stuff pre-installed and referenced in the default CLASSPATH.
But the Hadoop "worker nodes" are probably barebones, with just core Hadoop libraries pre-installed.
So you can wait a couple of years for Oozie to properly package the Spark dependencies in a ShareLib, and use the "blablah.system.libpath" flag.
[EDIT] if base Spark functionality is OK but you fail on the Hive format interface, then specify a list of ShareLibs including "HCatalog", e.g.
    oozie.action.sharelib.for.spark=spark,hcatalog
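In practice both ShareLib-related properties go into the job.properties used to submit the workflow, for example:

    oozie.use.system.libpath=true
    oozie.action.sharelib.for.spark=spark,hcatalog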
Or, you can find out which JARs and config files are actually used by Spark, upload them to HDFS, and reference them (all of them, one by one) in your Oozie action under <file> so that they are downloaded at run time into the working dir of the YARN container.
[EDIT] Maybe the ShareLibs contain the JARs but not the config files; then all you have to upload/download is the list of valid config files (Hive, Spark, whatever).
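Following that approach, the action would end up looking roughly like the sketch below, with one <file> entry per JAR or config file uploaded to HDFS. The paths are placeholders and the other required action elements are omitted; check that your spark-action schema version accepts <file>:

    <spark xmlns="uri:oozie:spark-action:0.1">
        ...
        <jar>my-sparksql-app.jar</jar>
        <spark-opts>--jars lib/*.jar</spark-opts>
        <file>hdfs:///user/me/oozie/conf/hive-site.xml#hive-site.xml</file>
        <file>hdfs:///user/me/oozie/lib/hive-exec.jar#hive-exec.jar</file>
    </spark>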
The better way to avoid the ClassPath-not-found exception in Oozie is to install the Oozie ShareLib in the cluster and update the Hive/Pig jars in the shared location (sometimes the existing jars in the Oozie shared location get mismatched with the product jars):
hdfs://hadoop:50070/user/oozie/share/lib/
Once that has been updated, pass the parameter
    oozie.use.system.libpath=true
This will tell Oozie to read the jars from the Hadoop shared location. Once you have set this parameter to true, you no longer need to list each and every jar one by one in workflow.xml.
I'm using Apache Spark to build an application. To make the RDDs available from other applications I'm trying two approaches:
Using tachyon
Using a spark-jobserver
I'm new to Tachyon. I completed the tasks given in the Running Tachyon on a Cluster guide.
I'm able to access the UI from master:19999 URL.
From the tachyon directory I successfully created a directory:
    ./bin/tachyon tfs mkdir /Test
But while trying to run the copyFromLocal command I'm getting the following error:
FailedToCheckpointException(message:Failed to rename hdfs://master:54310/tmp/tachyon/workers/1421840000001/8/93 to hdfs://master:54310/tmp/tachyon/data/93)
You are most likely running tachyon and spark-jobserver under different users, and have HDFS as your underFS.
Check out https://tachyon.atlassian.net/browse/TACHYON-1339 and the related patch.
The easy way out is running tachyon and your spark job server as the same user.
The (slightly) harder way is to port the patch and recompile Spark, and then spark-jobserver, with the patched client.