The difference between similar spark configurations - apache-spark

I am confused about some similar spark configurations...
The main references I have surveyed are https://spark.apache.org/docs/latest/configuration.html and https://spark.apache.org/docs/latest/running-on-yarn.html.
But I am still confused about these configurations...
Could anyone help me to figure out the main differences?
Thanks very much!!
1. spark.yarn.jars vs. spark.jars
What is the difference between spark.yarn.jars and spark.jars?
Which configuration is the same as --jars?
spark.yarn.jars: List of libraries containing Spark code to distribute to YARN containers. By default, Spark on YARN will use Spark jars installed locally, but the Spark jars can also be in a world-readable location on HDFS. This allows YARN to cache it on nodes so that it doesn't need to be distributed each time an application runs. To point to jars on HDFS, for example, set this configuration to hdfs:///some/path. Globs are allowed.
spark.jars: Comma-separated list of jars to include on the driver and executor classpaths. Globs are allowed.
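The short answer, as far as the mapping goes: --jars populates spark.jars, while spark.yarn.jars controls which Spark runtime jars YARN ships to containers. A minimal sketch, with hypothetical HDFS paths, class and jar names:

# spark.yarn.jars points YARN at the Spark runtime jars themselves
# (quoted so the shell does not expand the glob locally)
spark-submit \
  --master yarn \
  --conf "spark.yarn.jars=hdfs:///apps/spark/jars/*.jar" \
  --class com.example.Main \
  app.jar

# --jars (i.e. spark.jars) adds your own dependency jars to the driver and executor classpaths
spark-submit \
  --master yarn \
  --jars hdfs:///apps/libs/dep1.jar,hdfs:///apps/libs/dep2.jar \
  --class com.example.Main \
  app.jar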
2. spark.yarn.dist.archives vs. spark.yarn.archive
What is the difference between spark.yarn.dist.archives and spark.yarn.archive?
Which configuration is the same as --archives?
spark.yarn.dist.archives: Comma separated list of archives to be extracted into the working directory of each executor.
spark.yarn.archive: An archive containing needed Spark jars for distribution to the YARN cache. If set, this configuration replaces spark.yarn.jars and the archive is used in all the application's containers. The archive should contain jar files in its root directory. Like with the previous option, the archive can also be hosted on HDFS to speed up file distribution.
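Put differently: spark.yarn.archive ships the Spark runtime itself as a single archive, while --archives (which corresponds to spark.yarn.dist.archives on YARN) ships your own archives and unpacks them in each executor's working directory. A hedged sketch with hypothetical paths and names:

# One archive containing the Spark runtime jars, cached by YARN; if set, it replaces spark.yarn.jars
spark-submit \
  --master yarn \
  --conf spark.yarn.archive=hdfs:///apps/spark/spark-libs.jar \
  --class com.example.Main \
  app.jar

# Application archives, extracted into each executor's working directory;
# the optional #data suffix names the extracted directory
spark-submit \
  --master yarn \
  --archives hdfs:///apps/data/reference.zip#data \
  --class com.example.Main \
  app.jar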
3. spark.yarn.dist.files vs. spark.files
What is the difference between spark.yarn.dist.files and spark.files?
Which configuration is the same as --files?
spark.yarn.dist.files: Comma-separated list of files to be placed in the working directory of each executor.
spark.files: Comma-separated list of files to be placed in the working directory of each executor. Globs are allowed.
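In practice --files sets spark.files, and on YARN spark.yarn.dist.files plays the same role for the YARN client, so the two descriptions above are nearly identical. A small sketch with hypothetical file, class and jar names:

# The listed files are shipped into each executor's working directory,
# where they can be read by bare file name
spark-submit \
  --master yarn \
  --files /etc/app/log4j.properties,/etc/app/app.conf \
  --class com.example.Main \
  app.jar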

Related

Read from resources when running Spark in Yarn

In my Spark job I read some additional data from resource files,
for example Resources.getResource("/more-data").
It works great locally, and when I run with spark-submit master=local[*]
I only need to add --conf=spark.driver.extraClassPath=moredata.
Moving to cluster mode (YARN), it is no longer able to find the folder.
I tried spark.yarn.dist.files without success; maybe I need to add something to it?
Assuming you are running the Spark application in YARN mode and you have some file resources in the more-data folder: instead of distributing the folder, distribute the individual resources.
Depending on the type of resource to be distributed, you have the following options:
spark.yarn.dist.jars
spark.yarn.dist.jars (default: empty) is a collection of additional jars to distribute.
It is used when the Client distributes the additional resources specified with the --jars command-line option of spark-submit.
spark.yarn.dist.files
spark.yarn.dist.files (default: empty) is a collection of additional files to distribute.
It is used when the Client distributes the additional resources specified with the --files command-line option of spark-submit.
spark.yarn.dist.archives
spark.yarn.dist.archives (default: empty) is a collection of additional archives to distribute.
It is used when the Client distributes the additional resources specified with the --archives command-line option of spark-submit.
You can find further information at https://jaceklaskowski.gitbooks.io/mastering-apache-spark/yarn/spark-yarn-settings.html
Be careful about how you access the distributed resources.
For example, with spark-submit --files /folder-name/fileName
the resource should be accessed as fileName in the code, as in the sketch below.
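A minimal sketch of that pattern, assuming the resources live locally under /folder-name/more-data and that the file, class and jar names are hypothetical:

# Ship each resource file individually; in cluster mode the driver cannot see the local project directory
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --files /folder-name/more-data/lookup.csv,/folder-name/more-data/config.json \
  --class com.example.MyApp \
  my-app.jar
# Inside the job, open "lookup.csv" / "config.json" by bare file name from the working
# directory rather than through the classpath resource lookup used locally.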

Prevent Spark from copying JAR dependencies to `work/` folder for each executor node

Is there a way to prevent Spark from automatically copying the JAR files specified via --jars in the spark-submit command to the work/ folder for each executor node?
My spark-submit command specifies all the JAR dependencies for the job like so
spark-submit \
--master <master> \
--jars local:/<jar1-path>,local:/<jar2-path>... \
<application-jar> \
<arguments>
These JAR paths live on a distributed filesystem that is available in the same location on all the cluster nodes.
Now, according to the documentation:
Note that JARs and files are copied to the working directory for each SparkContext on the executor nodes. This can use up a significant amount of space over time and will need to be cleaned up.
The last sentence is absolutely true. My JAR dependencies need to include some multi-gigabyte model files, and when I deploy my Spark job over 100 nodes, you can imagine that having 100 copies of these files wastes huge amounts of disk space, not to mention the time it takes to copy them.
Is there a way to prevent Spark from copying the dependencies? I'm not sure I understand why it needs to copy them in the first place, given that the JARS are accessible from each cluster node via the same path. There should not be a need to keep distinct copies of each JAR in each node's working directory.
That same Spark documentation mentions that
local: - a URI starting with local:/ is expected to exist as a local file on each worker node. This means that no network IO will be incurred, and works well for large files/JARs that are pushed to each worker, or shared via NFS, GlusterFS, etc.
...which is exactly how I'm referencing the JARS in the spark-submit command.
So, can Spark be prevented from copying all JARS specified via local:/... to the working directory of each cluster node? If so, how? If not, is there a reason why this copying must happen?
Edit: clarified that copies are per-node (not per-executor)
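For reference, the two URL schemes discussed above behave differently at submit time. A hedged sketch, with hypothetical master URL, paths and jar names:

# hdfs:// (or file://) jars are fetched by the workers and copied into their work/ directories
spark-submit \
  --master spark://master:7077 \
  --jars hdfs:///libs/big-model.jar \
  application.jar

# local:/ jars are expected to already exist at this exact path on every worker, so no
# network IO is incurred for the transfer (per the documentation quoted above)
spark-submit \
  --master spark://master:7077 \
  --jars local:/shared-fs/libs/big-model.jar \
  application.jar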

Passing multiple jar files in dcos spark-submit, jars with comma separated not suitable

Suggestions needed: I need to pass lots of jar files to dcos spark-submit, and listing the jars comma-separated is not suitable.
I tried the options below:
dcos spark run --submit-args='--class com.gre.music.inn.orrd.SpaneBasicApp --jars /spark_submit_jobs/new1/unzip_new/* 30'
dcos spark run --submit-args='--class com.gre.music.inn.orrd.SpaneBasicApp --jars local:* 30'
dcos spark run --submit-args='--class com.gre.music.inn.orrd.SpaneBasicApp --jars https://s3-us-west-2.amazonaws.com/gmu_jars/* 30'
The last one won't work because, I guess, wildcards are not allowed with http.
Update from DC/OS:
--jars isn't supported via dcos spark run (Spark cluster mode). We'll have support for it around DC/OS 1.10, when we move Spark over to Marathon instead of the Spark dispatcher. In the meantime, if you want to use --jars, you'll have to submit your job in client mode via spark-submit through Metronome or Marathon.
As far as I know you can't use wildcards, and you need to put the JARs somewhere where Spark can access them in a distributed manner (S3, http, hdfs, etc.).
See
http://spark.apache.org/docs/latest/submitting-applications.html#launching-applications-with-spark-submit
application-jar: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an hdfs:// path or a file:// path that is present on all nodes.
You can't use wildcards with the --jars argument in spark-submit. Here's the feature request for that (it's still open).
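Since wildcards are not accepted, each jar has to be listed explicitly. A hedged sketch in client mode (in line with the DC/OS note above that cluster mode does not support --jars); the master URL is omitted, and dep1.jar, dep2.jar and spane-basic-app.jar are hypothetical names:

spark-submit \
  --deploy-mode client \
  --class com.gre.music.inn.orrd.SpaneBasicApp \
  --jars https://s3-us-west-2.amazonaws.com/gmu_jars/dep1.jar,https://s3-us-west-2.amazonaws.com/gmu_jars/dep2.jar \
  spane-basic-app.jar 30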

SparkContext.addFile vs spark-submit --files

I am using Spark 1.6.0. I want to pass some properties files like log4j.properties and some other custom properties files. I see that we can use --files, but I also saw that there is an addFile method on SparkContext. I would prefer to use --files instead of adding the files programmatically, assuming both options are the same?
I did not find much documentation about --files, so are --files and SparkContext.addFile the same?
References I found about --files and about SparkContext.addFile.
It depends on whether your Spark application is running in client or cluster mode.
In client mode the driver (application master) runs locally and can access those files from your project, because they are available on the local file system. SparkContext.addFile should find your local files and work as expected.
If your application runs in cluster mode, it is submitted via spark-submit. This means that your whole application is transferred to the Spark master or YARN, which starts the driver (application master) within the cluster on a specific node and in a separate environment. This environment has no access to your local project directory, so all necessary files have to be transferred as well. This can be achieved with the --files option. The same concept applies to jar files (the dependencies of your Spark application): in cluster mode they need to be added with the --jars option to be available on the classpath of the application master. If you use PySpark, there is a --py-files option.
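A small sketch of the cluster-mode route described above; the file paths, class and jar names are hypothetical:

# Ship the properties files (and any extra jars) along with the application in cluster mode
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --files /local/conf/log4j.properties,/local/conf/app.properties \
  --jars /local/libs/extra-dep.jar \
  --class com.example.Main \
  app.jar
# The driver and executors can then read log4j.properties / app.properties by bare file
# name from their working directories; SparkContext.addFile("...") distributes files the
# same way, but only for paths the driver process itself can reach.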

Add CLASSPATH to Oozie workflow job

I wrote a SparkSQL application in Java that accesses Hive tables, and packaged it as a jar file that can be run using spark-submit.
Now I want to run this jar as an Oozie workflow (and a coordinator, once I get the workflow working). When I try to do that, the job fails and I get this in the Oozie job logs:
java.lang.NoClassDefFoundError: org/apache/hadoop/hive/conf/HiveConf
What I did was look for the jar in $HIVE_HOME/lib that contains that class, copy it into the lib directory under my Oozie workflow root path, and add this to workflow.xml in the Spark Action:
<spark-opts> --jars lib/*.jar</spark-opts>
But this leads to another java.lang.NoClassDefFoundError pointing to another missing class, so I repeated the process of looking for the jar and copying it, ran the job, and the same thing happened all over again. It looks like it needs many of the jars in my Hive lib as dependencies.
What I don't understand is that when I use spark-submit in the shell with the jar, it runs fine; I can SELECT and INSERT into my Hive tables. It is only when I use Oozie that this occurs. It looks like Spark can no longer see the Hive libraries when run as an Oozie workflow job. Can someone explain how this happens?
How do I add or reference the necessary classes / jars to the Oozie path?
I am using Cloudera Quickstart VM CDH 5.4.0, Spark 1.4.0, Oozie 4.1.0.
Usually the "edge node" (the one you can connect to) has a lot of stuff pre-installed and referenced in the default CLASSPATH.
But the Hadoop "worker nodes" are probably barebones, with just core Hadoop libraries pre-installed.
So you can wait a couple of years for Oozie to properly package the Spark dependencies in a ShareLib, and use the "blablah.system.libpath" flag.
[EDIT] if base Spark functionality is OK but you fail on the Hive format interface, then specify a list of ShareLibs including "HCatalog" e.g.
action.sharelib.for.spark=spark,hcatalog
Or, you can find out which JARs and config files are actually used by Spark, upload them to HDFS, and reference them (all of them, one by one) in your Oozie Action under <file> so that they are downloaded at run time in the working dir of the YARN container.
[EDIT] Maybe the ShareLibs contain the JARs but not the config files; then all you have to upload/download is a list of valid config files (Hive, Spark, whatever)
The better way to avoid the ClassPath-not-found exception in Oozie is to install the Oozie ShareLib in the cluster and update the Hive/Pig jars in the shared location (sometimes an existing jar in the Oozie shared location gets mismatched with the product jar):
hdfs://hadoop:50070/user/oozie/share/lib/
Once that has been updated, pass the parameter
"oozie.use.system.libpath = true"
This tells Oozie to read the jars from the Hadoop shared location.
Once you have pointed to the shared location by setting that parameter to "true", you no longer need to list each and every jar one by one in workflow.xml.
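Taken together, the two answers mostly come down to a couple of job properties. A minimal sketch, assuming the Oozie server URL, name node, job tracker and workflow path are placeholders for your cluster, and noting that the sharelib property is usually written with the oozie. prefix:

# Hypothetical endpoints and workflow path
cat > job.properties <<'EOF'
nameNode=hdfs://hadoop:8020
jobTracker=hadoop:8032
oozie.wf.application.path=${nameNode}/user/cloudera/workflows/spark-hive-app
# Pick up jars from the Oozie ShareLib instead of listing them one by one in workflow.xml
oozie.use.system.libpath=true
# Include the HCatalog ShareLib so Hive classes such as HiveConf are on the classpath
oozie.action.sharelib.for.spark=spark,hcatalog
EOF

# Submit the workflow
oozie job -oozie http://localhost:11000/oozie -config job.properties -run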
