How Spark and S3 interact - apache-spark

I'm wondering how data is loaded into Spark in the scenario below:
There is 10 GB of transaction data stored in S3 in Parquet format, and I'm going to run a Spark program to categorize every record in that 10 GB Parquet file (e.g. Income, Shopping, Dining).
I have the following questions:
How would this 10 GB be distributed across the workers in the Spark cluster? Is the 10 GB file loaded onto the Spark master, which then splits the data and sends it to the executors?
Does all of this happen in memory? If one of the executors crashes during a job run, will the master load the 10 GB file from S3 again, extract the subset of data that was supposed to be processed by the crashed executor, and send it to another executor?

How would this 10 GB be distributed across the workers in the Spark cluster? Is the 10 GB file loaded onto the Spark master, which then splits the data and sends it to the executors?
Answer:
Spark follows a master-slave architecture: one master (driver/coordinator) and multiple distributed worker nodes. The driver process runs on the master node, and the main method of the program runs inside it. The driver creates the SparkSession or SparkContext, and it converts the user code into tasks based on the transformations and actions in the code, using the lineage graph. It builds the logical and physical plans, and once the physical plan is ready it coordinates with the cluster manager to get executors to run the tasks. The driver only keeps track of the state of the data (metadata) for each of the executors.
So the 10 GB file does not get loaded onto the master node. S3 is distributed storage, and Spark reads from it in splits. The driver just decides how the data is split and what each executor needs to work on. Even if you cache the data, it is cached on the executor nodes only, based on the partitions/data each executor is working on. Also, nothing is triggered until you call an action such as count or collect; Spark builds a lineage graph and a DAG to keep track of this information.
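To make the lazy-evaluation point concrete, here is a minimal sketch of the scenario in the question (the s3a:// paths and column names are made up); nothing touches S3 until the write at the end, and the driver never holds the 10 GB:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("categorize-transactions").getOrCreate()

// Lazy: only a logical plan is built, no data is pulled anywhere yet.
val txns = spark.read.parquet("s3a://my-bucket/transactions/")

// Lazy: another transformation appended to the lineage.
val categorized = txns.withColumn("category",
  when(col("merchant") === "PAYROLL", "Income")
    .when(col("merchant").isin("GROCERY", "RETAIL"), "Shopping")
    .otherwise("Dining"))

// Action: only now do the executors fetch their own splits of the Parquet data from S3.
categorized.write.mode("overwrite").parquet("s3a://my-bucket/transactions-categorized/")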
Does all of this happen in memory? If one of the executors crashes during a job run, will the master load the 10 GB file from S3 again, extract the subset of data that was supposed to be processed by the crashed executor, and send it to another executor?
Answer:
As answered in the first question, anything gets loaded into memory only when an action is performed. "Loaded into memory" does not mean loaded into the driver's memory; depending on the action, the data is loaded into the memory of the driver or of the executors. If you use the collect operation, everything is pulled into the driver's memory, but for an operation like count on a cached DataFrame, the data is loaded into memory on each of the executor nodes.
Now, if one of the executors crashes during the job run, the driver has the lineage graph and the metadata about the data the crashed executor held, so it re-runs the same lineage on another executor to complete the task. This is what makes Spark resilient and fault tolerant.
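A short sketch of the collect-vs-count distinction described above (the path is illustrative, and `spark` is an existing SparkSession, as in spark-shell):

val df = spark.read.parquet("s3a://my-bucket/transactions/")

df.cache()               // marks the data to be kept in executor memory, partition by partition
val n = df.count()       // action: executors read and cache their partitions; only the counts reach the driver

val rows = df.collect()  // action: every row is pulled back into the driver's memory -- risky for 10 GB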

Each worker will issue one or more GET requests for the byte ranges of the Parquet file it has been assigned, and more as it seeks around the file; the whole 10 GB file is never loaded anywhere.
Each worker does its own read of its own split; this counts against the overall IO capacity of the store/shard.
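If you want to see how the file gets split, a quick check like the following works (the path is illustrative; `spark` is an existing SparkSession, as in spark-shell):

val txns = spark.read.parquet("s3a://my-bucket/transactions/")

// Each input partition becomes a task, and each task issues its own ranged GET requests to S3.
println(txns.rdd.getNumPartitions)

// The target split size is controlled by this setting (128 MB by default):
println(spark.conf.get("spark.sql.files.maxPartitionBytes"))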

Related

How is data accessed by worker nodes in a Spark Cluster?

I'm trying to understand how Spark works. I know the cluster manager allocates the resources (workers) to the driver program.
I want to know how the cluster manager sends the tasks (which transformations) to the worker nodes, and how the worker nodes access the data (assume my data is in S3).
Do the worker nodes read only a part of the data, apply all the transformations to it, and return the results of the actions to the driver program? Or do the worker nodes read the entire file but apply only specific transformations and return the result to the driver program?
Follow-up questions:
How, and who, decides how much data needs to be sent to the worker nodes? We have established that partial data is present on each worker node. E.g. I have two worker nodes with 4 cores each and one 1 TB CSV file to read, a few transformations to perform, and an action. Assume the CSV is on S3 and on the master node's local storage.
It's going to be a long answer, but I will try to simplify it as best I can:
Typically a Spark cluster contains multiple nodes; each node has multiple CPUs, a bunch of memory, and storage. Each node holds some chunks of data, which is why they are sometimes also referred to as data nodes.
When Spark applications are started, they create multiple workers or executors. Those workers/executors take their resources (CPU, RAM) from the cluster's nodes described above. In other words, the nodes in a Spark cluster play both roles: data storage and computation.
But as you might have guessed, the data on a node is (sometimes) incomplete, so workers have to pull data across the network to do a partial computation. The results are then sent back to the driver, which just does the "collection work" and combines them all to get the final result.
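For illustration only, this is roughly how an application asks for those CPU/RAM slices from the cluster's nodes when its SparkSession is built (the numbers are made up; spark.executor.instances is honoured on YARN/Kubernetes, while standalone mode sizes executors from the cores/memory settings):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("executors-take-resources-from-nodes")
  .config("spark.executor.instances", "4")   // how many executors to ask the cluster manager for
  .config("spark.executor.cores", "4")       // CPUs taken from a node per executor
  .config("spark.executor.memory", "8g")     // RAM taken from a node per executor
  .getOrCreate()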
Edit #1:
How, and who, decides how much data needs to be sent to the worker nodes?
Each task decides which data is "local" and which is not. This article explains pretty well how data locality works.
I have two worker nodes with 4 cores each and one 1 TB CSV file to read, a few transformations to perform, and an action
This situation is different from the question above: here you have only one file, and most likely your worker is exactly the same machine as your data node. The executor(s) sitting on that worker, however, will read the file piece by piece (as tasks), in parallel, in order to increase parallelism.
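As a hedged sketch of that "piece by piece" reading (the path and numbers are made up; `spark` is an existing SparkSession): an uncompressed 1 TB CSV is split into many byte-range tasks, so the cores can work on different parts of the same file in parallel.

spark.conf.set("spark.sql.files.maxPartitionBytes", 128 * 1024 * 1024)  // target split size (the default)

val big = spark.read.option("header", "true").csv("s3a://my-bucket/big.csv")
println(big.rdd.getNumPartitions)  // roughly 1 TB / 128 MB tasks for an uncompressed, splittable file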

SparkDataframe.load(): when I execute a load command, where does my data actually get stored?

If I am loading one table from Cassandra using the Spark DataFrame load(), where does my data get loaded? Is it in Spark memory, or in data node blocks, if I am using the YARN resource manager?
It will try to store the data in memory, per the number of partitions, on the worker nodes, which in this context is a slightly better term than data nodes.
It will spill to disk if there is not enough memory on the worker nodes.
Processing occurs according to the number of cores/executors. E.g. if you have, say, 20 executors with 1 core each, your processing concurrency is 20, and spilling will occur via eviction. If you run out of disk, an error will result.
Worker node is a better term here than data node, unless you have HDFS and local processing, in which case the worker node is the same as the data node. Although you could argue what's in a name?
Of course, an action needs to have been initiated.
And repartition, join, or union later in the data pipeline affect things, but that goes without saying.
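A minimal sketch of that memory-then-spill behaviour, assuming the Spark Cassandra connector is on the classpath (the keyspace and table names are made up):

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("cassandra-load-sketch").getOrCreate()

val df = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "my_ks", "table" -> "my_table"))
  .load()

// Partitions live in executor memory and spill to executor-local disk when they do not fit.
df.persist(StorageLevel.MEMORY_AND_DISK)
df.count()  // nothing is materialized until an action runs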

unable to launch more tasks in spark cluster

I have a 6-node cluster with 8 cores and 32 GB of RAM each. I am reading a simple CSV file from Azure Blob Storage and writing it to a Hive table.
When the job runs I see only a single task launched and a single executor working, with all the other executors and instances sitting idle/dead.
How can I increase the number of tasks so the job runs faster?
Any help appreciated.
I'm guessing that your CSV file is in one block. Therefore your data is on only one partition, and since Spark creates only one task per partition, you only have one.
You can call repartition(X) on your DataFrame/RDD just after reading it to increase the number of partitions. The read won't be faster, but all your transformations and the write will be parallelized.
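A rough sketch of that suggestion (the wasbs:// path, partition count, and table name are made up; `spark` is an existing SparkSession with Hive support):

val df = spark.read
  .option("header", "true")
  .csv("wasbs://container@myaccount.blob.core.windows.net/transactions.csv")

// One input block would otherwise mean one partition and therefore one task.
val parallel = df.repartition(48)  // e.g. 6 nodes x 8 cores

parallel.write.mode("overwrite").saveAsTable("my_hive_table")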

Where does Spark schedule the .textFile() task?

Say I want to read data from an external HDFS cluster, and I have 3 workers in my cluster (one may be a bit closer to external_host, but not on the same host).
sc.textFile("hdfs://external_host/file.txt")
I understand that Spark schedules tasks based on the locality of the underlying RDD. But on which worker (i.e. executor) is .textFile(..) scheduled, since we do not have an executor running on external_host?
I imagine it loads the HDFS blocks as partitions into worker memory, but how does Spark decide which worker is best? (I would imagine it chooses the closest based on latency or something else; is this correct?)
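One way to see what Spark thinks about locality here (a sketch; the path is taken from the question and `sc` is an existing SparkContext, as in spark-shell):

val rdd = sc.textFile("hdfs://external_host/file.txt")

// For a remote HDFS with no co-located executors, the preferred locations
// won't match any executor host, so tasks fall back to ANY locality and each
// executor streams its assigned blocks over the network.
rdd.partitions.foreach { p =>
  println(rdd.preferredLocations(p))
}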

Spark Streaming Job Keeps growing memory

I am running Spark v1.6.1 on a single machine in standalone mode, with 64 GB of RAM and 16 cores.
I have created five worker instances to get five executors, since in standalone mode there cannot be more than one executor per worker node.
Configuration:
SPARK_WORKER_INSTANCES 5
SPARK_WORKER_CORES 1
SPARK_MASTER_OPTS "-Dspark.deploy.defaultCores=5"
All other configurations are left at their defaults in spark-env.sh.
I am running a Spark Streaming direct Kafka job at an interval of 1 minute, which takes data from Kafka and, after some aggregation, writes it to Mongo.
Problems:
When I start the master and slaves, it starts one master process and five worker processes, each consuming only about 212 MB of RAM. When I submit the job, it creates 5 executor processes and 1 job process, and memory usage grows to about 8 GB in total and keeps growing (slowly) over time, even when there is no data to process.
We are unpersisting the cached RDDs at the end and have also set spark.cleaner.ttl to 600, but the memory is still growing.
One more thing: I have seen that SPARK-1706 was merged, so why am I unable to create multiple executors within a worker? Also, in the spark-env.sh file, any configuration related to executors applies under YARN mode only.
Any help would be greatly appreciated,
Thanks
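For reference, the "unpersist at the end of each batch" pattern described in the question might look roughly like this with the Spark 1.6 direct Kafka API (the broker, topic, and aggregation are placeholders; `sc` is an existing SparkContext):

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Batch interval of 1 minute, as in the question.
val ssc = new StreamingContext(sc, Seconds(60))

val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, Map("metadata.broker.list" -> "broker:9092"), Set("my-topic"))

stream.foreachRDD { rdd =>
  val batch = rdd.cache()
  // ... aggregate `batch` and write the result to Mongo ...
  batch.unpersist()
}

ssc.start()
ssc.awaitTermination()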
