In my Spark job I read some additional data from resource files.
For example: Resources.getResource("/more-data")
It works great locally, and also when I run it via spark-submit with master=local[*];
I only need to add --conf=spark.driver.extraClassPath=moredata.
Moving to cluster mode (YARN), it is no longer able to find the folder.
I tried spark.yarn.dist.files without success; maybe I need to add something to that?
Assuming you are running the Spark application in YARN mode and you have some file resources in the more-data folder: instead of distributing the folder, distribute the individual resources.
Depending on the type of resource to be distributed, the following options are available:
spark.yarn.dist.jars
spark.yarn.dist.jars (default: empty) is a collection of additional jars to distribute.
It is used when Client distributes additional resources as specified using --jars command-line option for spark-submit.
spark.yarn.dist.files
spark.yarn.dist.files (default: empty) is a collection of additional files to distribute.
It is used when Client distributes additional resources as specified using --files command-line option for spark-submit.
spark.yarn.dist.archives
spark.yarn.dist.archives (default: empty) is a collection of additional archives to distribute.
It is used when Client distributes additional resources as specified using --archives command-line option for spark-submit.
You can find further information at https://jaceklaskowski.gitbooks.io/mastering-apache-spark/yarn/spark-yarn-settings.html
Be careful about how you access the distributed resources.
Example: spark-submit --files /folder-name/fileName
The distributed resource should then be accessed simply as fileName in the code.
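For instance, a minimal Scala sketch, assuming a hypothetical resource file more-data.txt shipped with --files /folder-name/more-data.txt; SparkFiles.get resolves the locally materialized copy of a distributed file on each node:

import org.apache.spark.SparkFiles
import scala.io.Source

// Resolve the local copy of the file that was shipped with --files
val localPath = SparkFiles.get("more-data.txt")
val moreData = Source.fromFile(localPath).getLines().toList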
I am confused about some similar Spark configurations...
The major references I have surveyed are https://spark.apache.org/docs/latest/configuration.html and https://spark.apache.org/docs/latest/running-on-yarn.html.
But I am still confused about these configurations...
Could anyone help me to figure out the main differences?
Thanks very much!!
1. spark.yarn.jars vs. spark.jars
What is the difference between spark.yarn.jars and spark.jars?
Which configuration is equivalent to --jars?
spark.yarn.jars: List of libraries containing Spark code to distribute to YARN containers. By default, Spark on YARN will use Spark jars installed locally, but the Spark jars can also be in a world-readable location on HDFS. This allows YARN to cache it on nodes so that it doesn't need to be distributed each time an application runs. To point to jars on HDFS, for example, set this configuration to hdfs:///some/path. Globs are allowed.
spark.jars: Comma-separated list of jars to include on the driver and executor classpaths. Globs are allowed.
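For illustration, a hedged sketch with placeholder paths and jar names: spark.yarn.jars points YARN at the Spark runtime jars themselves, while --jars (whose configuration counterpart is spark.jars, or spark.yarn.dist.jars on the YARN side, as quoted in the first answer) adds your own application dependencies:

spark-submit --conf spark.yarn.jars=hdfs:///spark/jars/*.jar --class com.example.MyApp my-app.jar
spark-submit --jars /libs/extra-lib.jar --class com.example.MyApp my-app.jar
spark-submit --conf spark.jars=/libs/extra-lib.jar --class com.example.MyApp my-app.jar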
2. spark.yarn.dist.archives vs. spark.yarn.archive
What is the difference between spark.yarn.dist.archives and spark.yarn.archive?
Which configuration is equivalent to --archives?
spark.yarn.dist.archives: Comma separated list of archives to be extracted into the working directory of each executor.
spark.yarn.archive: An archive containing needed Spark jars for distribution to the YARN cache. If set, this configuration replaces spark.yarn.jars and the archive is used in all the application's containers. The archive should contain jar files in its root directory. Like with the previous option, the archive can also be hosted on HDFS to speed up file distribution.
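A hedged illustration with placeholder archive names: spark.yarn.archive replaces spark.yarn.jars with a single archive of the Spark runtime jars, whereas --archives (spark.yarn.dist.archives in configuration form) ships application archives that are extracted into each executor's working directory:

spark-submit --conf spark.yarn.archive=hdfs:///spark/spark-libs.zip --class com.example.MyApp my-app.jar
spark-submit --archives /data/lookup-tables.zip --class com.example.MyApp my-app.jar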
3. spark.yarn.dist.files vs. spark.files
What is the difference between spark.yarn.dist.files and spark.files?
Which configuration is equivalent to --files?
spark.yarn.dist.files: Comma-separated list of files to be placed in the working directory of each executor.
spark.files: Comma-separated list of files to be placed in the working directory of each executor. Globs are allowed.
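Again as a hedged sketch with hypothetical file paths: --files corresponds to spark.files in general and to spark.yarn.dist.files on YARN (as the first answer notes); the listed files end up in the working directory of each executor:

spark-submit --files /conf/log4j.properties,/data/lookup.csv --class com.example.MyApp my-app.jar
spark-submit --conf spark.files=/conf/log4j.properties,/data/lookup.csv --class com.example.MyApp my-app.jar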
I have locally installed Spark 2.3.0 and am using pyspark. I'm able to process local files without any problem.
But if I have to read from HDFS, I'm not able to.
I'm confused about how Spark accesses Hadoop files. While installing Spark, I was asked to copy winutils. I don't understand what the role of winutils is.
Should we bring up the Hadoop services first to work with Spark?
I get java.lang.UnsatisfiedLinkError errors if I use the externally installed Hadoop and try to use it with Spark. Any pointer to the right documentation would be a great help.
Thanks,
Kiran
If you're using spark-submit to run the application in cluster mode, it can take a --files flag, which is used to pass files from the driver node down to the workers. I believe the reason you were able to run in local mode is that your driver and worker are on the same machine; in cluster mode, however, the driver and workers may be on separate machines, and Spark needs to know which files to send over to the worker nodes. The following flags are available, as described in the book Learning Spark by Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia (a combined example command follows the list):
--master
Indicates the cluster manager to connect to. The options for this flag are described in Table 7-1.
--deploy-mode
Whether to launch the driver program locally (“client”) or on one of the worker machines inside the cluster (“cluster”). In client mode spark-submit will run your driver on the same machine where spark-submit is itself being invoked. In cluster mode, the driver will be shipped to execute on a worker node in the cluster. The default is client mode.
--class
The “main” class of your application if you’re running a Java or Scala program.
--name
A human-readable name for your application. This will be displayed in Spark’s web UI.
--jars
A list of JAR files to upload and place on the classpath of your application. If your application depends on a small number of third-party JARs, you can add them here.
--files
A list of files to be placed in the working directory of your application. This can be used for data files that you want to distribute to each node.
--py-files
A list of files to be added to the PYTHONPATH of your application. This can contain .py, .egg, or .zip files.
--executor-memory
The amount of memory to use for executors, in bytes. Suffixes can be used to specify larger quantities such as “512m” (512 megabytes) or “15g” (15 gigabytes).
--driver-memory
The amount of memory to use for the driver process, in bytes. Suffixes can be used to specify larger quantities such as “512m” (512 megabytes) or “15g” (15 gigabytes).
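Putting those flags together, a hedged example command; the class name, jar paths, and memory sizes below are placeholders rather than values from your setup:

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.MyApp \
  --name my-app \
  --jars /libs/dep1.jar,/libs/dep2.jar \
  --files /conf/app.properties \
  --executor-memory 2g \
  --driver-memory 1g \
  /path/to/my-app.jar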
Update
I assumed that Kiran has a Hadoop setup (as he mentioned it is installed externally) and was not able to make the program read from HDFS programmatically. If that was not the case, please ignore the answer.
Suggestions needed: I need to pass lots of jar files to dcos spark submit, and passing the jars comma-separated is not suitable.
I tried the options below:
dcos spark run --submit-args='--class com.gre.music.inn.orrd.SpaneBasicApp --jars /spark_submit_jobs/new1/unzip_new/* 30'
dcos spark run --submit-args='--class com.gre.music.inn.orrd.SpaneBasicApp --jars local:* 30'
dcos spark run --submit-args='--class com.gre.music.inn.orrd.SpaneBasicApp --jars https://s3-us-west-2.amazonaws.com/gmu_jars/* 30'
The last one won't work because, I guess, wildcards are not allowed with http.
Update from DC/OS:
--jars isn't supported via dcos spark run (Spark cluster mode). We'll have support for it around ~ DC/OS 1.10 when we move Spark over to Marathon instead of the Spark dispatcher. In the meantime, if you want to use --jars, you'll have to submit your job in client mode via spark-submit through Metronome or Marathon.
As far as I know you can't use wildcards, and you need to put the JARs somewhere where Spark can access them in a distributed manner (S3, http, hdfs, etc.).
See
http://spark.apache.org/docs/latest/submitting-applications.html#launching-applications-with-spark-submit
application-jar: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an hdfs:// path or a file:// path that is present on all nodes.
You can't use wildcards with the --jars argument in spark-submit. Here's the feature request for that (it's still open).
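Since wildcards are out, one workaround sketch is to list every jar explicitly and submit in client mode, as the DC/OS note above suggests; the dependency and application jar names below are hypothetical, only the bucket URL and class name come from the question:

spark-submit --deploy-mode client --class com.gre.music.inn.orrd.SpaneBasicApp \
  --jars https://s3-us-west-2.amazonaws.com/gmu_jars/dep1.jar,https://s3-us-west-2.amazonaws.com/gmu_jars/dep2.jar \
  https://s3-us-west-2.amazonaws.com/gmu_jars/app.jar 30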
I am using Spark 1.6.0. I want to pass some properties files, like log4j.properties and some other custom properties files. I see that we can use --files, but I also saw that there is a method addFile in SparkContext. I would prefer to use --files instead of programmatically adding the files, assuming both options are the same?
I did not find much documentation about --files, so are --files and SparkContext.addFile the same?
The references I found are about --files and about SparkContext.addFile.
It depends on whether your Spark application is running in client or cluster mode.
In client mode the driver is running locally and can access those files from your project, because they are available on the local file system. SparkContext.addFile should find your local files and work as expected.
If your application is running in cluster mode, the application is submitted via spark-submit. This means that your whole application is transferred to the Spark master or to YARN, which starts the driver (application master) within the cluster on a specific node and within a separate environment. This environment has no access to your local project directory, so all necessary files have to be transferred as well. This can be achieved with the --files option. The same concept applies to jar files (the dependencies of your Spark application): in cluster mode, they need to be added with the --jars option to be available on the classpath of the application master. If you use PySpark, there is a --py-files option.
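As a rough Scala sketch of the two approaches (the properties file path is hypothetical): SparkContext.addFile works when the driver can see the file locally, i.e. in client mode, while in cluster mode the same file would be shipped up front with --files; either way, SparkFiles.get resolves the local copy by its bare file name:

import org.apache.spark.{SparkConf, SparkContext, SparkFiles}

val sc = new SparkContext(new SparkConf().setAppName("files-vs-addfile-demo"))
// Client mode: the driver sees the local file system, so it can register the file itself.
sc.addFile("/local/conf/app.properties")
// Cluster mode: ship the same file up front with  spark-submit --files /local/conf/app.properties
// In both cases the distributed copy is then resolved by file name only:
val localPath = SparkFiles.get("app.properties")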
I'm trying to submit a Spark app from my local machine's terminal to my cluster.
I'm using --master yarn-cluster. I need to run the driver program on my cluster too, not on the machine from which I submit the application, i.e. my local machine.
When I provide the path to the application jar, which is on my local machine, will spark-submit automatically upload it to my cluster?
I'm using
bin/spark-submit
--class com.my.application.XApp
--master yarn-cluster --executor-memory 100m
--num-executors 50 /Users/nish1013/proj1/target/x-service-1.0.0-201512141101-assembly.jar
1000
and I get the error
Diagnostics: java.io.FileNotFoundException: File file:/Users/nish1013/proj1/target/x-service-1.0.0-201512141101- does not exist
In the documentation, http://spark.apache.org/docs/latest/submitting-applications.html#launching-applications-with-spark-submit, it says:
Advanced Dependency Management: When using spark-submit, the
application jar along with any jars included with the --jars option
will be automatically transferred to the cluster.
But it seems like it does not!
I see you are quoting the spark-submit page from the Spark docs, but I would spend a lot more time on the Running Spark on YARN page. Bottom line, look at:
There are two deploy modes that can be used to launch Spark
applications on YARN. In yarn-cluster mode, the Spark driver runs
inside an application master process which is managed by YARN on the
cluster, and the client can go away after initiating the application.
In yarn-client mode, the driver runs in the client process, and the
application master is only used for requesting resources from YARN.
Further you note, "I need to run the driver program on my Cluster too, not on the machine I do submit the application i.e my local machine"
So I agree with you that you are right to run --master yarn-cluster instead of --master yarn-client
(and one comment notes what might just be a syntax error where you dropped "assembly.jar" but I think this will apply as well...)
Some of the basic assumptions about non-YARN implementations change a lot when YARN is introduced, mostly related to Classpaths and the need to push jars to the workers.
From an email on the Apache Spark User list:
YARN cluster mode. Spark submit does upload your jars to the cluster.
In particular, it puts the jars in HDFS so your driver can just read
from there. As in other deployments, the executors pull the jars from
the driver.
So finally, from the Apache Spark YARN doc:
Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory
which contains the (client side) configuration files for the Hadoop
cluster. These configs are used to write to HDFS and connect to the
YARN ResourceManager.
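For example, before launching, point the client at those configs; the path below is a placeholder for wherever your client-side Hadoop configuration actually lives:

export HADOOP_CONF_DIR=/etc/hadoop/conf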
NOTE: I see you are only adding a single JAR; if there's a need to add other JARs, there's a special note about doing that with YARN:
In yarn-cluster mode, the driver runs on a different machine than the
client, so SparkContext.addJar won’t work out of the box with files
that are local to the client. To make files on the client available to
SparkContext.addJar, include them with the --jars option in the launch
command.
That page in the link has some examples.
And of course you downloaded or built the YARN-specific version of Spark.
Background: in a standalone cluster deployment using spark-submit and the option --deploy-mode cluster, yes, you do need to make sure every worker node has access to all the dependencies; Spark will not push them to the cluster. This is because in "standalone cluster" mode, with Spark as the job manager, you don't know which node the driver will run on! But that doesn't apply to your case.
But if I could, depending on the size of the jars you are uploading, I would still explicitly put the jars on each node, or make them "globally available" via HDFS, for another reason from the docs.
Advanced Dependency Management seems to present the best of both worlds, but it is also a great reason for manually pushing your jars out to all nodes:
local: - a URI starting with local:/ is expected to exist as a local
file on each worker node. This means that no network IO will be
incurred, and works well for large files/JARs that are pushed to each
worker, or shared via NFS, GlusterFS, etc.
But I assume that local:/... would change to hdfs://...; not sure on that one.
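To make that concrete, a hedged sketch that reuses the class and assembly jar from the question; the two dependency jars are hypothetical, one staged on HDFS and one preinstalled on every node:

bin/spark-submit --class com.my.application.XApp --master yarn-cluster \
  --jars hdfs:///libs/shared-dep.jar,local:/opt/libs/node-local-dep.jar \
  --executor-memory 100m --num-executors 50 \
  /Users/nish1013/proj1/target/x-service-1.0.0-201512141101-assembly.jar 1000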
Yes and no. It depends on what you mean. Spark deploys the .jar to the nodes in the cluster. However, it won't upload your .jar file from your local machine to the cluster.
You can find more info in the Submitting Applications page. As you can see, in the arguments you pass to spark-submit, there is one that needs to be globally visible: the application-jar.
application-jar: Path to a bundled jar including your application and
all dependencies. The URL must be globally visible inside of your
cluster, for instance, an hdfs:// path or a file:// path that is
present on all nodes.
As far as I understand, what you want is to use yarn-client, not yarn-cluster. This will run the driver in the client (e.g., the machine on which you are calling spark-submit, for example your laptop), without the need to copy the .jar file onto the cluster. More about this here.
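In that case, a yarn-client variant of the command from the question would look like this (unchanged apart from the master setting):

bin/spark-submit --class com.my.application.XApp --master yarn-client \
  --executor-memory 100m --num-executors 50 \
  /Users/nish1013/proj1/target/x-service-1.0.0-201512141101-assembly.jar 1000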
Try adding the --jars option before your /path/to/jar/file:
spark-submit --jars /tmp/test.jar