Apache Spark: how to read from an HDFS file


I have Spark 2.3.0 installed locally and am using PySpark. I can process local files without any problem.
But when I have to read from HDFS, I can't.
I'm confused about how Spark accesses Hadoop files. While installing Spark, I was asked to copy winutils. I don't understand what the role of winutils is.
Do we need to bring up the Hadoop services first in order to work with Spark?
I get java.lang.UnsatisfiedLinkError errors if I point Spark at an externally installed Hadoop. Any pointer to the right documentation would be a great help.
Thanks,
Kiran

If you're using spark-submit to run the application in cluster mode, it accepts a --files flag, which passes files down from the driver node to the workers. I believe you were able to run in local mode because your driver and worker are on the same machine, whereas in cluster mode the driver and workers may be on separate machines. In that case Spark needs to know which files to send to the worker nodes. The following flags are available, as described in the book Learning Spark by Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia:
--master
Indicates the cluster manager to connect to. The options for this flag are described in Table 7-1 of the book.
--deploy-mode
Whether to launch the driver program locally (“client”) or on one of the worker machines inside the cluster (“cluster”). In client mode spark-submit will run your driver on the same machine where spark-submit is itself being invoked. In cluster mode, the driver will be shipped to execute on a worker node in the cluster. The default is client mode.
--class
The “main” class of your application if you’re running a Java or Scala program.
--name
A human-readable name for your application. This will be displayed in Spark’s web UI.
--jars
A list of JAR files to upload and place on the classpath of your application. If your application depends on a small number of third-party JARs, you can add them here.
--files
A list of files to be placed in the working directory of your application. This can be used for data files that you want to distribute to each node.
--py-files
A list of files to be added to the PYTHONPATH of your application. This can contain .py, .egg, or .zip files.
--executor-memory
The amount of memory to use for executors, in bytes. Suffixes can be used to specify larger quantities such as “512m” (512 megabytes) or “15g” (15 gigabytes).
--driver-memory
The amount of memory to use for the driver process, in bytes. Suffixes can be used to specify larger quantities such as “512m” (512 megabytes) or “15g” (15 gigabytes).
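As a rough illustration of how the flags above combine, here is a minimal Python sketch that assembles a spark-submit argument list. The helper name build_spark_submit and the file names job.py and lookup.csv are made up for this example; only the flags themselves come from the list above.

```python
import shlex


def build_spark_submit(app, master, deploy_mode="client", files=(),
                       py_files=(), executor_memory=None, app_args=()):
    """Assemble a spark-submit argument list (hypothetical helper)."""
    cmd = ["spark-submit", "--master", master, "--deploy-mode", deploy_mode]
    if files:
        # --files takes a single comma-separated list
        cmd += ["--files", ",".join(files)]
    if py_files:
        cmd += ["--py-files", ",".join(py_files)]
    if executor_memory:
        cmd += ["--executor-memory", executor_memory]
    # The application itself comes after all spark-submit options,
    # followed by the application's own arguments.
    cmd.append(app)
    cmd += list(app_args)
    return cmd


cmd = build_spark_submit("job.py", "yarn", deploy_mode="cluster",
                         files=["lookup.csv"], executor_memory="2g")
print(shlex.join(cmd))
# spark-submit --master yarn --deploy-mode cluster --files lookup.csv --executor-memory 2g job.py
```

The resulting list can be handed to subprocess.run on a machine where spark-submit is on the PATH.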
Update
I assumed that Kiran has Hadoop set up (externally, as he mentioned) and was not able to make the program read from HDFS programmatically. If that is not the case, please ignore this answer.
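For the programmatic read itself, a minimal PySpark sketch might look like the following. It assumes a running HDFS NameNode; the host name, the path, and port 8020 are placeholders and must match fs.defaultFS in your core-site.xml. The helper names hdfs_uri and count_lines are invented for this example.

```python
def hdfs_uri(namenode_host, path, port=8020):
    """Build a fully qualified hdfs:// URI (host and port are assumptions)."""
    if not path.startswith("/"):
        raise ValueError("HDFS path must be absolute")
    return f"hdfs://{namenode_host}:{port}{path}"


def count_lines(uri):
    """Read a text file from the given URI and return its line count."""
    # Imported here so hdfs_uri() stays usable without PySpark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hdfs-read-sketch").getOrCreate()
    try:
        return spark.read.text(uri).count()
    finally:
        spark.stop()


# Example call (needs a reachable NameNode, so it is left commented out):
# count_lines(hdfs_uri("namenode.example.com", "/user/kiran/input.txt"))
```

If HADOOP_CONF_DIR points at the cluster's client configuration, a path without host and port (for example hdfs:///user/kiran/input.txt) should resolve against the configured default filesystem instead.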

Related

Prevent Spark from copying JAR dependencies to `work/` folder for each executor node

Is there a way to prevent Spark from automatically copying the JAR files specified via --jars in the spark-submit command to the work/ folder for each executor node?
My spark-submit command specifies all the JAR dependencies for the job like so:
spark-submit \
--master <master> \
--jars local:/<jar1-path>,local:/<jar2-path>... \
<application-jar> \
<arguments>
These JAR paths live on a distributed filesystem that is available in the same location on all the cluster nodes.
Now, according to the documentation:
Note that JARs and files are copied to the working directory for each SparkContext on the executor nodes. This can use up a significant amount of space over time and will need to be cleaned up.
The last sentence is absolutely true. My JAR dependencies need to include some multi-gigabyte model files, and when I deploy my Spark job over 100 nodes, you can imagine that having 100 copies of these files wastes huge amounts of disk space, not to mention the time it takes to copy them.
Is there a way to prevent Spark from copying the dependencies? I'm not sure I understand why it needs to copy them in the first place, given that the JARS are accessible from each cluster node via the same path. There should not be a need to keep distinct copies of each JAR in each node's working directory.
That same Spark documentation mentions that
local: - a URI starting with local:/ is expected to exist as a local file on each worker node. This means that no network IO will be incurred, and works well for large files/JARs that are pushed to each worker, or shared via NFS, GlusterFS, etc.
...which is exactly how I'm referencing the JARS in the spark-submit command.
So, can Spark be prevented from copying all JARS specified via local:/... to the working directory of each cluster node? If so, how? If not, is there a reason why this copying must happen?
Edit: clarified that copies are per-node (not per-executor)

Specify spark driver for spark-submit

I'm submitting a Spark job from a shell script that has a bunch of env vars and parameters to pass to Spark. Strangely, the driver host is not one of these parameters (there are driver cores and memory, however). So if I have 3 machines in the cluster, the driver is chosen at random. I don't want this behaviour because 1) the jar I'm submitting is only on one of the machines and 2) the driver machine should often be smaller than the other machines, which can't be guaranteed if the choice is random.
So far, I have found no way to specify this param on the command line to spark-submit. I've tried --conf SPARK_DRIVER_HOST="172.30.1.123", --conf spark.driver.host="172.30.1.123" and many other things, but nothing has any effect. I'm using Spark 2.1.0. Thanks.

I assume you are running on a YARN cluster. In brief, YARN uses containers to launch and run tasks, and the ResourceManager decides where to run each container based on the availability of resources. In Spark's case, the driver and executors are also launched as containers, each with a separate JVM. The driver is dedicated to splitting tasks among the executors and collecting the results from them. If the node from which you launch your application is part of the cluster, it will also be used as a shared resource for launching the driver or executors.

From the documentation: http://spark.apache.org/docs/latest/running-on-yarn.html
When running on a standalone or Mesos cluster, the master to connect to is given with:
--master <master-url> #e.g. spark://23.195.26.187:7077
When using YARN it works a little differently. Here the parameter is yarn:
--master yarn
With yarn, the ResourceManager is specified in Hadoop's configuration. For how to set this up, see this interesting guide: https://dqydj.com/raspberry-pi-hadoop-cluster-apache-spark-yarn/ . Basically, HDFS is configured in hdfs-site.xml and YARN in yarn-site.xml.

SparkContext.addFile vs spark-submit --files

I am using Spark 1.6.0. I want to pass some properties files, such as log4j.properties and some other customer properties file. I see that we can use --files, but I also saw that SparkContext has an addFile method. I would prefer to use --files rather than adding the files programmatically, assuming both options are the same?
I did not find much documentation about --files, so are --files and SparkContext.addFile the same?
References I found about --files and for SparkContext.addFile.

It depends on whether your Spark application is running in client or cluster mode.
In client mode the driver is running locally and can access those files from your project, because they are available within the local file system. SparkContext.addFile should find your local files and work as expected.
If your application is running in cluster mode, it is submitted via spark-submit. This means that your whole application is transferred to the Spark master or YARN, which starts the driver (application master) within the cluster on a specific node and within a separate environment. This environment has no access to your local project directory, so all necessary files have to be transferred as well. This can be achieved with the --files option. The same concept applies to jar files (dependencies of your Spark application): in cluster mode, they need to be added with the --jars option to be available on the classpath of the application master. If you use PySpark, there is a --py-files option.
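For the programmatic route, a minimal PySpark sketch of SparkContext.addFile together with SparkFiles.get could look like this. The helper names and the simple key=value parser are made up for illustration, and the sketch assumes local_path is readable from wherever the driver runs (which is exactly what fails in cluster mode without --files).

```python
import os


def load_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore lines without an '=' separator
            props[key.strip()] = value.strip()
    return props


def read_distributed_properties(sc, local_path):
    """Ship a file with addFile, then read back Spark's local copy."""
    # Imported here so load_properties() works without PySpark installed.
    from pyspark import SparkFiles

    sc.addFile(local_path)
    # SparkFiles.get resolves the file name to the node-local copy.
    with open(SparkFiles.get(os.path.basename(local_path))) as fh:
        return load_properties(fh.read())
```

Inside a task, the same SparkFiles.get(name) call can be used to locate the copy on the executor.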

Does spark-submit automatically upload the jar to the cluster?

I'm trying to submit a Spark app from a terminal on my local machine to my cluster.
I'm using --master yarn-cluster. I need to run the driver program on my Cluster too, not on the machine I do submit the application i.e my local machine
When I provide the path to the application jar, which is on my local machine, will spark-submit automatically upload it to my cluster?
I'm using
bin/spark-submit
--class com.my.application.XApp
--master yarn-cluster --executor-memory 100m
--num-executors 50 /Users/nish1013/proj1/target/x-service-1.0.0-201512141101-assembly.jar
1000
and getting error
Diagnostics: java.io.FileNotFoundException: File file:/Users/nish1013/proj1/target/x-service-1.0.0-201512141101- does not exist
The documentation (http://spark.apache.org/docs/latest/submitting-applications.html#launching-applications-with-spark-submit) says, under Advanced Dependency Management:
When using spark-submit, the
application jar along with any jars included with the --jars option
will be automatically transferred to the cluster.
But it seems that it does not!

I see you are quoting the spark-submit page from the Spark docs, but I would spend a lot more time on the Running Spark on YARN page. Bottom line, look at:
There are two deploy modes that can be used to launch Spark
applications on YARN. In yarn-cluster mode, the Spark driver runs
inside an application master process which is managed by YARN on the
cluster, and the client can go away after initiating the application.
In yarn-client mode, the driver runs in the client process, and the
application master is only used for requesting resources from YARN.
Further you note, "I need to run the driver program on my Cluster too, not on the machine I do submit the application i.e my local machine"
So I agree with you that you are right to run --master yarn-cluster instead of --master yarn-client
(and one comment notes what might just be a syntax error where you dropped "assembly.jar" but I think this will apply as well...)
Some of the basic assumptions about non-YARN implementations change a lot when YARN is introduced, mostly related to Classpaths and the need to push jars to the workers.
From an email on the Apache Spark User list:
YARN cluster mode. Spark submit does upload your jars to the cluster.
In particular, it puts the jars in HDFS so your driver can just read
from there. As in other deployments, the executors pull the jars from
the driver.
So finally, from the Apache Spark YARN doc:
Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory
which contains the (client side) configuration files for the Hadoop
cluster. These configs are used to write to HDFS and connect to the
YARN ResourceManager.
NOTE: I only see you adding a single JAR, if there's a need to add other JARs there's a special note about doing that with YARN:
In yarn-cluster mode, the driver runs on a different machine than the
client, so SparkContext.addJar won’t work out of the box with files
that are local to the client. To make files on the client available to
SparkContext.addJar, include them with the --jars option in the launch
command.
That page in the link has some examples.
And of course you downloaded or built the YARN-specific version of Spark.
Background: in a standalone cluster deployment using spark-submit and the option --deploy-mode cluster, yes, you do need to make sure every worker node has access to all the dependencies; Spark will not push them to the cluster. This is because in "standalone cluster" mode with Spark as the job manager, you don't know which node the driver will run on! But that doesn't apply to your case.
But if I could, depending on the size of the jars you are uploading, I would still explicitly put the jars on each node, or "globally available" via HDFS, for another reason from the docs:
From Advanced Dependency Management, it seems to present the best of both worlds, but also a great reason for manually pushing your jars out to all nodes:
local: - a URI starting with local:/ is expected to exist as a local
file on each worker node. This means that no network IO will be
incurred, and works well for large files/JARs that are pushed to each
worker, or shared via NFS, GlusterFS, etc.
But I assume that local:/... would change to hdfs:/ ... not sure on that one.
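To keep the schemes apart, here is a small Python sketch that summarizes how the Advanced Dependency Management section of the Spark docs describes each URI scheme. The function name and the wording of the summaries are my own; the point is that local:/ and hdfs:/ are distinct schemes with different behaviour, not interchangeable spellings.

```python
from urllib.parse import urlparse

# Paraphrased from the Advanced Dependency Management section of the
# Spark docs; the dictionary itself is only an illustration.
_SCHEME_HANDLING = {
    "file": "served by the driver; every executor pulls it from the driver",
    "hdfs": "pulled down from the URI",
    "http": "pulled down from the URI",
    "https": "pulled down from the URI",
    "ftp": "pulled down from the URI",
    "local": "expected to already exist as a local file on each worker node; no network IO",
}


def dependency_handling(uri):
    """Return a short description of how a --jars/--files URI is treated."""
    # Bare absolute paths are treated like file: URIs.
    scheme = urlparse(uri).scheme or "file"
    return _SCHEME_HANDLING.get(scheme, "unrecognized scheme")


print(dependency_handling("local:/opt/jars/model.jar"))
print(dependency_handling("hdfs:/shared/jars/model.jar"))
```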

Yes and no. It depends on what you mean. Spark deploys the .jar to the nodes in the cluster. However, it won't upload your .jar file from your local machine to the cluster.
You can find more info in the Submitting Applications page. As you can see, in the arguments you pass to spark-submit, there is one that needs to be globally visible: the application-jar.
application-jar: Path to a bundled jar including your application and
all dependencies. The URL must be globally visible inside of your
cluster, for instance, an hdfs:// path or a file:// path that is
present on all nodes.
As far as I understand, what you want is to use yarn-client, not yarn-cluster. This will run the driver in the client (e.g., the machine which you are trying to call spark-submit on, for example your laptop), without the need of copying the .jar file on the cluster. More about this here.

Try adding --jars option before your /path/to/jar/file
spark-submit --jars /tmp/test.jar

Using Spark Shell (CLI) in standalone mode on distributed files

I am using Spark 1.3.1 in standalone mode (No YARN/HDFS involved - Only Spark) on a cluster with 3 machines. I have a dedicated node for master (no workers running on it) and 2 separate worker nodes.
The cluster starts healthy, and I am just trying to test my installation by running some simple examples via spark-shell (the CLI, which I started on the master machine): I put a file on the local filesystem of the master node (the workers do NOT have a copy of this file) and run:
$SPARKHOME/bin/spark-shell
...
scala> val f = sc.textFile("file:///PATH/TO/LOCAL/FILE/ON/MASTER/FS/file.txt")
scala> f.count()
and it returns the count correctly.
My Questions are:
1) This contradicts what the Spark documentation (on using External Datasets) says:
"If using a path on the local filesystem, the file must also be accessible at the same path on worker nodes. Either copy the file to all workers or use a network-mounted shared file system."
I am not using NFS and I did not copy the file to the workers, so how does it work? Is it because spark-shell does NOT really launch jobs on the cluster and does the computation locally? (That would be weird, as I do NOT have a worker running on the node I started the shell on.)
2) If I want to run SQL scripts (in standalone mode) against some large data files (which do not fit on one machine) through Spark's Thrift server (the way beeline or hiveserver2 is used with Hive), do I need to put the files on NFS so each worker can see the whole file, or can I split the files into chunks, put each smaller chunk (which would fit on a single machine) on a worker, and then pass them all to the submitted queries as multiple comma-separated paths?

The problem is that you are running the spark-shell locally. By default spark-shell runs with --master local[*], which runs your code locally on as many cores as you have. If you want to run against your workers, you need to pass the --master parameter specifying the master's entry point. To see the options you can use with spark-shell, just type spark-shell --help.
As to whether you need to put the file on each server, the short answer is yes. Something like HDFS will split it up across the nodes, and the manager will handle the fetching as appropriate. I am not as familiar with NFS, though, or whether it has this capability.
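One way to check which case you are in is to look at the master string the shell actually started with (sc.master in the shell). The following Python sketch classifies such a string; the function name master_kind is invented for this example, and the prefixes are the master URL forms mentioned on this page.

```python
def master_kind(master):
    """Classify a Spark master string (hypothetical helper)."""
    if master.startswith("local"):
        return "local"        # e.g. local, local[4], local[*]
    if master.startswith("spark://"):
        return "standalone"   # e.g. spark://host:7077
    if master.startswith("mesos://"):
        return "mesos"
    if master.startswith("yarn"):
        return "yarn"         # yarn, yarn-client, yarn-cluster
    return "unknown"


for m in ("local[*]", "spark://23.195.26.187:7077", "yarn-cluster"):
    print(m, "->", master_kind(m))
```

If the result is "local", the job never left the machine the shell was started on, which would explain why a file present only on that machine could be read.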
