Using Spark Shell (CLI) in standalone mode on distributed files - apache-spark

I am using Spark 1.3.1 in standalone mode (no YARN/HDFS involved, only Spark) on a cluster with 3 machines. I have a dedicated master node (no workers running on it) and 2 separate worker nodes.
The cluster starts healthy, and I am just trying to test my installation by running some simple examples via spark-shell (CLI, which I started on the master machine): I put a file on the local filesystem of the master node (the workers do NOT have a copy of this file) and I simply run:
$SPARKHOME/bin/spark-shell
...
scala> val f = sc.textFile("file:///PATH/TO/LOCAL/FILE/ON/MASTER/FS/file.txt")
scala> f.count()
and it returns the count correctly.
My Questions are:
1) This contradicts what the Spark documentation (on using external datasets) says:
"If using a path on the local filesystem, the file must also be accessible at the same path on worker nodes. Either copy the file to all workers or use a network-mounted shared file system."
I am not using NFS and I did not copy the file to the workers, so how does this work? Is it because spark-shell does NOT really launch jobs on the cluster and does the computation locally? (That would be strange, as I do NOT have a worker running on the node I started the shell on.)
2) If I want to run SQL scripts (in standalone mode) against some large data files (which do not fit on one machine) through Spark's Thrift server (the way beeline or HiveServer2 is used in Hive), do I need to put the files on NFS so each worker can see the whole file, or can I split the files into chunks (each small enough to fit on a single machine), put one chunk on each worker, and then pass all the chunks to the submitted queries as multiple comma-separated paths?

The problem is that you are running spark-shell locally. By default spark-shell runs as --master local[*], which runs your code locally on as many cores as you have. If you want to run against your workers, you need to pass the --master parameter specifying the master's entry point, as in the sketch below. To see all the options you can use with spark-shell, just type spark-shell --help
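For example, to attach the shell to the standalone master instead of running locally (a sketch; substitute your own master host, 7077 being the standalone default port):
$SPARKHOME/bin/spark-shell --master spark://MASTER_HOST:7077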
As to whether you need to put the file on each server, the short answer is yes. Something like HDFS will split it up across the nodes and the manager will handle fetching the pieces as appropriate. I am not familiar enough with NFS to say whether it has this capability, though.
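On your second question: sc.textFile does accept a comma-separated list of paths (and wildcards), but with file:// URIs every listed path must be readable at the same location on every worker, so splitting the data into chunks only helps if the chunks sit on shared storage or are copied to all workers. A sketch assuming an NFS mount at /mnt/nfs visible from every node (the paths are placeholders):
scala> val f = sc.textFile("file:///mnt/nfs/data/chunk1.txt,file:///mnt/nfs/data/chunk2.txt")
scala> f.count()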

Related

How do I set up a HDFS file system to run a Spark job with HDFS?

I am interested in running Spark in standalone mode with Minio/HDFS.
This question asked exactly what I want to know: "I require a HDFS, is it thus enough to just use the file-system part of Hadoop?" -- but the accepted answer was not helpful, as it did not mention how to use HDFS with Spark.
I have downloaded Spark 2.4.3 pre-built for Apache Hadoop 2.7 and later.
I have followed the Apache Spark tutorials and successfully deployed one master (my local machine) and one worker (my RPi4 on the same local network). I was able to run a simple word count (counting words in /opt/spark/README.md).
Now I want to count words of a file that exists only on the master. I understand that I will need to use HDFS for this to share files across the local network. However, I don't have any idea how to do this, despite perusing both the Apache Spark and Hadoop documentation.
I am confused about the interplay between Spark and Hadoop. I don't know if I should be setting up a Hadoop cluster in addition to a Spark cluster. This tutorial on hadoop.apache.org doesn't seem to help, as it says that "you will need to start both the HDFS and YARN cluster". I want to run Spark in standalone mode, not YARN.
What do I need to do in order for me to run
val textFile = spark.read.textFile("file_that_exists_only_on_my_master")
and have the file be propagated to the worker nodes, i.e. not get a "File does not exist" error on the worker nodes?
I set up MinIO instead, and wrote up the instructions in the GitHub Gist linked below.
The trick is to set up core-site.xml to point to the MinIO server.
GitHub Gist: https://gist.github.com/lieuzhenghong/c062aa2c5544d6b1a0fa5139e10441ad
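For the Spark side, a minimal sketch of pointing the S3A connector at a MinIO server and reading from it (the endpoint, credentials, and bucket are placeholders, and the hadoop-aws / AWS SDK jars must be on the classpath):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("minio-read-example")
  // point the S3A filesystem at the MinIO endpoint instead of AWS
  .config("spark.hadoop.fs.s3a.endpoint", "http://MINIO_HOST:9000")
  .config("spark.hadoop.fs.s3a.access.key", "ACCESS_KEY")
  .config("spark.hadoop.fs.s3a.secret.key", "SECRET_KEY")
  // MinIO is normally addressed with path-style URLs
  .config("spark.hadoop.fs.s3a.path.style.access", "true")
  .getOrCreate()

// every worker fetches the file from MinIO, so it no longer needs to exist on the master's local disk
val textFile = spark.read.textFile("s3a://my-bucket/README.md")
println(textFile.count())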

Apache Spark : how to read from hdfs file

I have installed Spark 2.3.0 locally and am using pyspark. I'm able to process local files without any problem.
But if I have to read from HDFS, I'm not able to.
I'm confused about how Spark accesses Hadoop files. While installing Spark, I was asked to copy winutils; I don't understand what the role of winutils is.
Should we bring up the Hadoop services first to work with Spark?
I'm getting java.lang.UnsatisfiedLinkError errors if I use the externally installed Hadoop and try to use it with Spark. Any pointer to the right documentation would be a great help.
Thanks,
Kiran
If you're using spark-submit to run the application in cluster mode, it can take a --files flag, which is used to pass files from the driver node down to the workers. I believe the reason you were able to run in local mode is that your driver and worker are on the same machine, whereas in cluster mode the driver and workers may be on separate machines, and Spark then needs to know which files to send over to the worker nodes. The following flags are available, as described in the book Learning Spark by Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia (a short spark-submit sketch follows the list):
--master
Indicates the cluster manager to connect to. The options for this flag are described in Table 7-1.
--deploy-mode
Whether to launch the driver program locally (“client”) or on one of the worker machines inside the cluster (“cluster”). In client mode spark-submit will run your driver on the same machine where spark-submit is itself being invoked. In cluster mode, the driver will be shipped to execute on a worker node in the cluster. The default is client mode.
--class
The “main” class of your application if you’re running a Java or Scala program.
--name
A human-readable name for your application. This will be displayed in Spark’s web UI.
--jars
A list of JAR files to upload and place on the classpath of your application. If your application depends on a small number of third-party JARs, you can add them here.
--files
A list of files to be placed in the working directory of your application. This can be used for data files that you want to distribute to each node.
--py-files
A list of files to be added to the PYTHONPATH of your application. This can contain .py, .egg, or .zip files.
--executor-memory
The amount of memory to use for executors, in bytes. Suffixes can be used to specify larger quantities such as “512m” (512 megabytes) or “15g” (15 gigabytes).
--driver-memory
The amount of memory to use for the driver process, in bytes. Suffixes can be used to specify larger quantities such as “512m” (512 megabytes) or “15g” (15 gigabytes).
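Putting a few of these together, a hedged sketch of a submission against a standalone master that ships a data file along to the executors (the master URL, file path, and script name are placeholders):

spark-submit \
  --master spark://MASTER_HOST:7077 \
  --deploy-mode client \
  --files /local/path/lookup.csv \
  --executor-memory 2g \
  my_app.py

The file passed with --files ends up in each executor's working directory, as described above.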
Update
I assumed that Kiran has a Hadoop setup (as he mentioned it was installed externally) and was not able to make the program read from HDFS programmatically. If that is not the case, please ignore the answer.

Spark: hdfs cluster mode

I'm just getting started with Apache Spark. I'm using cluster mode (master, slave1, slave2) and I want to process a big file which is kept in Hadoop (HDFS). I am using the textFile method from SparkContext; while the file is being processed I monitor the nodes and I can see that only slave2 is working. After processing, slave2 has tasks but slave1 has none.
If I use a local file instead of HDFS, then both slaves work simultaneously.
I don't understand this behaviour. Please, can anybody give me a clue?
The main reason for that behavior is the concept of data locality. When Spark's Application Master asks for new executors, it tries to allocate them on the same nodes where the data resides.
I.e. in your case, HDFS has likely written all the blocks of the file onto the same node, so Spark instantiates the executors on that node. If you use a local file instead, it is present on all nodes, so data locality is no longer an issue.
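If the blocks really do all sit on one datanode, one workaround (a sketch, not part of the original answer; the namenode address and path are placeholders) is to ask for more partitions, or repartition, so the work can be shuffled onto the other executor as well:
scala> val lines = sc.textFile("hdfs://NAMENODE_HOST:8020/data/big_file.txt", 16) // request at least 16 input partitions
scala> lines.repartition(16).count() // the shuffle spreads partitions across all available executors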

csv data processing in spark standalone mode

I have two nodes; let's call them A (192.168.2.100) and B (192.168.2.200).
A is for a master and a worker.
On node A:
./bin/spark-class org.apache.spark.deploy.master.Master
./bin/spark-class org.apache.spark.deploy.worker.Worker spark://192.168.2.100:7077
B is for a worker.
./bin/spark-class org.apache.spark.deploy.worker.Worker spark://192.168.2.100:7077
My app needs to load a CSV file to process.
On node A, I run:
./spark-submit --class "myApp" --master spark://192.168.2.100:7077 /spark/app.jar
But it fails with an error saying the CSV file is needed on B.
Is there any way to share this file with node B?
Do I really need YARN or Mesos to do this?
As the Spark cluster-overview diagram shows, all the data files you want to process should be accessible from all of your workers (and be sure that your driver is reachable from your workers).
So you need to put your data files somewhere all the workers can read them; in most situations, that means putting the data files into HDFS.
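A minimal sketch of that route (the namenode address and paths are placeholders): copy the file into HDFS once,
hdfs dfs -put /local/path/app_data.csv /data/
and then read it in the application so every worker can fetch its blocks:
val rdd = sc.textFile("hdfs://NAMENODE_HOST:9000/data/app_data.csv")
rdd.count()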
As stated before, the file has to be available on every node. So you either keep multiple copies, one per node, or you use an external data source (HDFS, Cassandra, Amazon S3). There is another, easier solution: you can use NFS and mount a remote drive/partition/location on every node. That way you don't need multiple copies and you don't have to learn about an external storage system. You can even use sshfs if you want a secure mount point over SSH.
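With the NFS or sshfs approach, once the share is mounted at the same path on every node, the application simply reads from that path (a sketch; /mnt/shared is a placeholder for your mount point):
val rdd = sc.textFile("file:///mnt/shared/app_data.csv")
rdd.count()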

How can I verify that DSE Spark Shell is distributing across the cluster

Is it possible to verify from within the Spark shell which nodes the shell is connected to, or whether it is running just in local mode? I'm hoping to use that to investigate the following problem:
I've used DSE to setup a small 3 node Cassandra Analytics cluster. I can log onto any of the 3 servers and run dse spark and bring up the Spark shell. I have also verified that all 3 servers have the Spark master configured by running dsetool sparkmaster.
However, when I run any task using the Spark shell, it appears that it is only running locally. I ran a small test command:
val rdd = sc.cassandraTable("test", "test_table")
rdd.count
When I check the Spark Master webpage, I see that only one server is running the job.
I suspect that when I run dse spark it's running the shell in local mode. I looked up how to specify a master for the Spark 0.9.1 shell, and even when I use MASTER=<sparkmaster> dse spark (from the Programming Guide) it still runs only in local mode.
Here's a walkthrough once you've started a DSE 4.5.1 cluster with 3 nodes, all set for Analytics Spark mode.
Once the cluster is up and running, you can determine which node is the Spark Master with command dsetool sparkmaster. This command just prints the current master; it does not affect which node is the master and does not start/stop it.
Point a web browser to the Spark Master web UI at the given IP address and port 7080. You should see 3 workers in the ALIVE state, and no Running Applications. (You may have some DEAD workers or Completed Applications if previous Spark jobs had happened on this cluster.)
Now on one node bring up the Spark shell with dse spark. If you check the Spark Master web UI, you should see one Running Application named "Spark shell". It will probably show 1 core allocated (the default).
If you click on the application ID link ("app-2014...") you'll see the details for that app, including one executor (worker). Any commands you give the Spark shell will run on this worker.
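If you want to confirm from inside the shell which executors the application actually has, one quick check (a sketch; the exact API available depends on your Spark version) is:
scala> sc.getExecutorMemoryStatus.keys.foreach(println) // prints the host:port of each executor known to the driver (including the driver itself)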
The default configuration limits the Spark master so that each application may only use 1 core, therefore the work will only be given to a single node.
To change this, login to the Spark master node and sudo edit the file /etc/dse/spark/spark-env.sh. Find the line that sets SPARK_MASTER_OPTS and remove the portion -Dspark.deploy.defaultCores=1. Then restart DSE on this node (sudo service dse restart).
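For reference, a hedged sketch of what that edit may look like (the exact contents of spark-env.sh vary by DSE version):
# before: each application is capped at one core
export SPARK_MASTER_OPTS="$SPARK_MASTER_OPTS -Dspark.deploy.defaultCores=1"
# after: the -Dspark.deploy.defaultCores=1 portion removed so applications can take more cores
export SPARK_MASTER_OPTS="$SPARK_MASTER_OPTS"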
Once it comes up, check the Spark master web UI and repeat the test with the Spark shell. You should see that it's been allocated more cores, and any jobs it performs will happen on multiple nodes.
In a production environment you'd want to set the number of cores more carefully so that a single job doesn't take all the resources.
