Spark: hdfs cluster mode - apache-spark

I'm just getting started with Apache Spark. I'm using cluster mode (master, slave1, slave2) and I want to process a big file that is stored in Hadoop (HDFS). I am using the textFile method from SparkContext; while the file is being processed I monitor the nodes and I can see that only slave2 is working. After processing, slave2 has tasks but slave1 has none.
If I use a local file instead of HDFS, both slaves work simultaneously.
I don't understand this behaviour. Can anybody give me a clue?

The main reason for this behavior is data locality. When Spark's Application Master asks for new executors, it tries to allocate them on the same nodes where the data resides.
In your case, HDFS has likely written all of the file's blocks to the same node, so Spark instantiates its executors on that node. If you use a local file instead, it is present on every node, so data locality is no longer an issue.
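If you want to verify this, a rough sketch (the path below is a placeholder) is to ask the RDD where Spark would prefer to run each partition, which reflects the block placement reported by the HDFS namenode:

val rdd = sc.textFile("hdfs:///data/big-file.txt")  // placeholder path
rdd.partitions.foreach { p =>
  // preferredLocations lists the hosts holding the HDFS block behind this partition
  val hosts = rdd.preferredLocations(p).mkString(",")
  println(s"partition ${p.index} -> $hosts")
}

If every partition reports the same host, the behaviour you observed is what you would expect.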

Related

How to set path to files in Apache Spark Standalone Cluster?

I need some hints about defining a path to a directory with lots of files in Spark. I have set up a Standalone Cluster with one machine as Worker and another machine as Master, and the driver is my local machine. I develop my code on the local machine in Python. I have copied all the files to the Master and the Worker; the path is identical on both machines (like: /data/test/). I have set up a SparkSession, but now I do not know how to define the path to the directory in my script. So my problem is how to tell Spark that it can find the data on both machines in the directory above.
Another question is how to deal with file formats like .mal; how can I read in such files? Thanks for any hints!
When a Spark job is submitted to the driver (master), a few things happen:
The driver program creates an execution plan: it creates multiple stages, and each stage contains multiple tasks.
The cluster manager allocates resources and launches executors on the workers, based on the arguments passed when submitting the job.
The tasks are given to the executors for execution, and the driver monitors each task. Resources are deallocated and the executors are terminated when the SparkContext is closed or the application program finishes.
The driver (the machine where the Spark job is submitted) needs a data path it can access, since it controls the whole execution plan. The driver program and the cluster manager take care of everything needed to run the different kinds of operations on the workers. Since the Spark job is submitted on the master, it is enough to provide a data path that Spark can access from the master machine.
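As a rough sketch in Scala (the question uses Python, but the calls are analogous; /data/test/ is taken from the question, and the files are read as plain text since I'm not sure what the .mal format is):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("path-test").getOrCreate()
// A file:// URI reads from the local file system of whichever machine executes the task,
// so /data/test/ must exist at the same path there, as it does in this setup.
val df = spark.read.text("file:///data/test/")
df.show(5)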

How many connections will be built between Spark and HDFS when sc.textFile("hdfs://.....") is called?

How many connections will be built between Spark and HDFS when sc.textFile("hdfs://.....") is called? The file on HDFS is very large (100 GB).
Actually, the main idea behind distributed systems, and the one designed and implemented in Hadoop and Spark, is to send the processing to the data. In other words, imagine that some data is located on HDFS data nodes in our cluster, and we have a job that uses that data on the same workers. On each machine you would have a data node that is also a Spark worker, and perhaps other processes such as an HBase region server. When an executor executes one of the scheduled tasks, it retrieves the data it needs from the underlying data node. So each individual task retrieves its own data, which you can describe as one connection to HDFS, on its local data node.
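As a rough illustration (the path and the 128 MB block size are assumptions), the number of HDFS reads is tied to the number of tasks, which by default is one per HDFS block, rather than a single long-lived connection for the whole file:

// ~100 GB split into 128 MB blocks gives roughly 800 partitions, hence roughly 800
// task-level reads, each opened against a (preferably local) data node.
val rdd = sc.textFile("hdfs:///data/100g-file.txt")
println(rdd.partitions.length)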

How YARN knows data locality in Apache Spark in cluster mode

Assume there is a Spark job that is going to read a file named records.txt from HDFS, do some transformations, and perform one action (write the processed output to HDFS). The job will be submitted in YARN cluster mode.
Assume also that records.txt is a 128 MB file and that one of its replicated HDFS blocks is on NODE 1.
Let's say YARN allocates an executor on NODE 1.
How does YARN allocate an executor exactly on a node where the input data is located?
Who tells YARN that one of the replicated HDFS blocks of records.txt is available on NODE 1?
How is data locality discovered by the Spark application? Is it done by the driver, which runs inside the Application Master?
Does YARN know about data locality?
The fundamental question here is:
Does YARN know about data locality?
YARN "knows" what the application tells it, and it understands the structure (topology) of the cluster. When an application makes a resource request, it can include specific locality constraints, which may or may not be satisfied when resources are allocated.
If the constraints cannot be satisfied, YARN (or any other cluster manager) will attempt to provide the best alternative match, based on its knowledge of the cluster topology.
So how does the application "know"?
If the application uses an input source (a file system or otherwise) that supports some form of data locality, it can query the corresponding catalog (the namenode in the case of HDFS) to get the locations of the blocks of data it wants to access.
In a broader sense, a Spark RDD can define preferredLocations, depending on the specific RDD implementation, which can later be translated into resource constraints for the cluster manager (not necessarily YARN).
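As a sketch of the kind of namenode query described above (the path is a placeholder), you can ask HDFS directly which hosts hold each block:

import org.apache.hadoop.fs.Path

val path = new Path("hdfs:///data/records.txt")  // placeholder path
val fs = path.getFileSystem(sc.hadoopConfiguration)
val status = fs.getFileStatus(path)
// One BlockLocation per block, each listing the data nodes holding a replica.
val blocks = fs.getFileBlockLocations(status, 0, status.getLen)
blocks.foreach(b => println(b.getHosts.mkString(", ")))

Spark performs essentially this kind of lookup when computing an RDD's preferredLocations, and the driver can then pass those hosts to the cluster manager as locality preferences in its resource requests.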

Spark without HDFS in cluster mode: Which data is stored where?

I am using Spark 1.5 without HDFS in cluster mode to build an application. I was wondering: when there is a save operation, e.g.,
df.write.parquet("...")
which data is stored where? Is all the data stored on the master, or does each worker store its data locally?
Generally speaking, all worker nodes will write to their local file systems, with the driver writing only a _SUCCESS file.
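A sketch of what that implies, assuming a local (non-shared) output path:

// Without a shared file system, each executor writes its own part-* files under this
// directory on its own local disk, and the driver writes only the _SUCCESS marker,
// so no single machine ends up holding the complete dataset.
df.write.parquet("file:///tmp/output")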

Using Spark Shell (CLI) in standalone mode on distributed files

I am using Spark 1.3.1 in standalone mode (No YARN/HDFS involved - Only Spark) on a cluster with 3 machines. I have a dedicated node for master (no workers running on it) and 2 separate worker nodes.
The cluster starts up healthy, and I am just trying to test my installation by running some simple examples via spark-shell (the CLI, which I started on the master machine): I put a file on the local file system of the master node (the workers do NOT have a copy of this file) and I simply run:
$SPARKHOME/bin/spark-shell
...
scala> val f = sc.textFile("file:///PATH/TO/LOCAL/FILE/ON/MASTER/FS/file.txt")
scala> f.count()
and it returns the word count results correctly.
My Questions are:
1) This contradicts what the Spark documentation (on using External Datasets) says:
"If using a path on the local filesystem, the file must also be accessible at the same path on worker nodes. Either copy the file to all workers or use a network-mounted shared file system."
I am not using NFS and I did not copy the file to the workers, so how does it work? (Is it because spark-shell does NOT really launch jobs on the cluster and does the computation locally? That would be weird, as I do NOT have a worker running on the node I started the shell on.)
2) If I want to run SQL scripts (in standalone mode) against some large data files (which do not fit on one machine) through Spark's Thrift Server (the way beeline or HiveServer2 is used in Hive), do I need to put the files on NFS so each worker can see the whole file, or can I split the files into chunks, put each smaller chunk (which fits on a single machine) on each worker, and then pass all the chunks to the submitted queries as multiple comma-separated paths?
The problem is that you are running the spark-shell locally. The default for spark-shell is --master local[*], which runs your code locally on as many cores as you have. If you want to run against your workers, you will need to pass the --master parameter specifying the master's entry point. To see the possible options you can use with spark-shell, just type spark-shell --help.
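For example (the host name is a placeholder; 7077 is the standalone master's default port), starting the shell like this makes the tasks run on the workers:

$SPARKHOME/bin/spark-shell --master spark://MASTER_HOST:7077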
As to whether you need to put the file on each server, the short answer is yes. Something like HDFS will split it up across the nodes, and the manager will handle the fetching as appropriate. I am not as familiar with NFS and whether it has this capability, though.

Resources