I have a question that I have not been able to find an answer to anywhere.
I am using the following lines to load data within a PySpark application:
loadFile = self.tableName+".csv"
dfInput = self.sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load(loadFile)
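For reference, the number of partitions Spark creates from this file (and will later distribute to the executors) can be checked right after loading:
print(dfInput.rdd.getNumPartitions())  # partition count of the loaded DataFrame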
My cluster configuration is as follows:
I am using a Spark cluster with 3 nodes: 1 node is used to start the master, and the other 2 nodes each run 1 worker.
I submit the application from outside the cluster on a login node, with a script.
The script submits the Spark application in cluster deploy mode, which, I think, means the driver runs on any one of the 3 nodes I am using.
The input CSV files are stored in a globally visible temporary file system (Lustre).
In Apache Spark Standalone, how does the process of loading partitions into RAM work?
Is it that each executor accesses the driver node's RAM and loads partitions into its own RAM from there? (Storage --> driver's RAM --> executor's RAM)
Is it that each executor accesses the storage directly and loads partitions into its own RAM? (Storage --> executor's RAM)
Is it none of these and I am missing something here? How can I witness this process myself (a monitoring tool, a Unix command, somewhere in Spark)?
Any comment or resource in which I can get deep into this would be very helpful. Thanks in advance.
The second scenario is correct:
each executor accesses the storage directly and loads partitions into its own RAM (Storage --> executor's RAM)
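One way to witness this yourself is Spark's monitoring REST API: each executor reports how many bytes it read itself, so you can see the data going from storage straight into the executors. A minimal sketch, assuming the application UI is reachable on the driver host at the default port 4040 (adjust the host to your setup):
import json
import urllib.request

base = "http://driver-host:4040/api/v1"  # assumed driver host; 4040 is Spark's default UI port

# Look up the running application, then list its executors.
apps = json.load(urllib.request.urlopen(base + "/applications"))
app_id = apps[0]["id"]
executors = json.load(urllib.request.urlopen(base + "/applications/" + app_id + "/executors"))

# totalInputBytes is the amount of input data each executor read into its own memory.
for ex in executors:
    print(ex["id"], ex["totalInputBytes"])
The Executors tab of the Spark web UI shows the same information in its Input column.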
Related
I have a computer cluster consisting of a master node and a bunch of worker nodes, each having 8GB of physical RAM. I would like to run some Spark jobs on this cluster, but limit the memory that is used by each node for performing this job. Is this possible within Spark?
For example, can I run a job so that each worker will only be allowed to use at most 1GB of physical RAM to do its job?
If not, is there some other way to artificially "limit" the RAM that each worker has?
If it is possible to limit RAM in some way or the other, what actually happens when Spark "runs out" of physical RAM? Does it page to disk or does the Spark job just kill itself?
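A hedged sketch of the knobs that usually control this on a standalone cluster (my_job.py and master-host are placeholders): SPARK_WORKER_MEMORY in conf/spark-env.sh caps the total memory a worker node will hand out, and the per-application limits go on spark-submit:
SPARK_WORKER_MEMORY=1g    # in conf/spark-env.sh on each worker node
$ spark-submit --master spark://master-host:7077 --executor-memory 1g --driver-memory 1g my_job.py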
I have a 4-node cluster with Spark and Cassandra installed on each node.
Spark version 3.1.2 and Cassandra v3.11
Let's say that each node has 4GB of RAM and I want to run my "spark+cassandra" program across the whole cluster.
How can I assign 2GB of RAM for Cassandra execution and 2GB for Spark execution?
I have noticed the following:
If my Cassandra cluster is up and I run the start-worker.sh command on a worker node to bring my Spark cluster up, the Cassandra service suddenly stops, but Spark keeps working. Basically, Spark steals RAM resources from Cassandra. How can I avoid this as well?
On Cassandra logs of the crashed node I read the message:
There is insufficient memory for the Java Runtime Environment to continue.
In fact, by typing top -c and then Shift+M, I can see the Spark service at the top of the memory column.
Thanks for any suggestions.
By default, a Spark worker takes up the machine's total RAM minus 1GB. On a 4GB machine, the worker JVM therefore consumes 3GB of memory. This is the reason the machine runs out of memory.
You'll need to set SPARK_WORKER_MEMORY to 1GB to leave enough memory for the operating system. For details, see Starting a Spark cluster manually.
It's also very important to note, as Alex Ott already pointed out, that a machine with only 4GB of RAM is not going to be able to do much, so expect to run into performance issues. Cheers!
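A minimal sketch of that change, assuming the default conf/spark-env.sh location on each worker node (the worker must be restarted for it to take effect):
# conf/spark-env.sh on each worker node
SPARK_WORKER_MEMORY=1g    # total memory this worker may grant to executors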
I had a question about memory configuration. I am running a 3-node cluster with Spark, Cassandra, Hadoop, Thrift, and YARN. I want to store my files in HDFS, so I loaded Hadoop. I am finding that I am running out of memory when running my queries. I was able to figure out how to restrict Cassandra to run in less than 4GB. Is there such a setting for Hadoop? How about YARN? As I only use Hadoop to load my flat files, I think setting it to 1 or 2GB should be fine. My boxes have 32GB of RAM and 16 cores each.
It is hard to say without the error message you are facing. But if you want to adjust the memory allocated to your workers, you can set these two configurations in your yarn-site.xml:
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>40960</value> <!-- 40 GB -->
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name> <!-- max RAM per container -->
  <value>8192</value>
</property>
You can see more details here in this question.
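To give a rough sense of how the second value constrains Spark: each executor runs in a YARN container sized as spark.executor.memory plus a small overhead, and that total must stay under yarn.scheduler.maximum-allocation-mb. With the 8192 MB cap above, something like the following would fit (your_app.py is a placeholder):
$ spark-submit --master yarn --executor-memory 7g your_app.py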
I have a DataProc Spark cluster. Initially, the master and the 2 worker nodes were of type n1-standard-4 (4 vCPUs, 15.0 GB memory); then I resized all of them to n1-highmem-8 (8 vCPUs, 52 GB memory) via the web console.
I noticed that the two worker nodes are not being fully used. In particular, there are only 2 executors on the first worker node and 1 executor on the second worker node, with
spark.executor.cores 2
spark.executor.memory 4655m
in /usr/lib/spark/conf/spark-defaults.conf. I thought that with spark.dynamicAllocation.enabled true, the number of executors would be increased automatically.
Also, the information on the DataProc page of the web console doesn't get updated automatically either. It seems that DataProc still thinks that all nodes are n1-standard-4.
My questions are:
Why are there more executors on the first worker node than on the second?
Why aren't more executors added to each node?
Ideally, I want the whole cluster to be fully utilized. If the Spark configuration needs to be updated, how do I do that?
As you've found, a cluster's configuration is set when the cluster is first created and does not adjust to manual resizing.
To answer your questions:
The Spark ApplicationMaster takes a container in YARN on a worker node, usually the first worker if only a single spark application is running.
When a cluster is started, Dataproc attempts to fit two YARN containers per machine.
The YARN NodeManager configuration on each machine determines how much of the machine's resources should be dedicated to YARN. This can be changed on each VM under /etc/hadoop/conf/yarn-site.xml, followed by a sudo service hadoop-yarn-nodemanager restart. Once machines are advertising more resources to the ResourceManager, Spark can start more containers. After adding more resources to YARN, you may want to modify the size of containers requested by Spark by modifying spark.executor.memory and spark.executor.cores.
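A hedged sketch of those edits on one worker VM, with illustrative numbers for an n1-highmem-8 (not prescriptive). In /etc/hadoop/conf/yarn-site.xml, advertise more of the machine to YARN:
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>43008</value> <!-- e.g. ~42 GB of the 52 GB machine -->
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>8</value>
</property>
Then restart the NodeManager so the ResourceManager picks up the new capacity:
$ sudo service hadoop-yarn-nodemanager restart
and resize the executors in /usr/lib/spark/conf/spark-defaults.conf to match, for example:
spark.executor.cores 4
spark.executor.memory 10g
(If the new executor size plus its overhead exceeds yarn.scheduler.maximum-allocation-mb, that value needs to be raised as well.)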
Instead of resizing cluster nodes and manually editing configuration files afterwards, consider starting a new cluster with the new machine sizes and copying any data from your old cluster to the new one. In general, the simplest way to move data is to use Hadoop's built-in distcp utility. An example usage would be something along the lines of:
$ hadoop distcp hdfs:///some_directory hdfs://other-cluster-m:8020/
Or if you can use Cloud Storage:
$ hadoop distcp hdfs:///some_directory gs://<your_bucket>/some_directory
Alternatively, consider always storing data in Cloud Storage and treating each cluster as an ephemeral resource that can be torn down and recreated at any time. In general, any time you would save data to HDFS, you can also save it as:
gs://<your_bucket>/path/to/file
Saving to GCS has the nice benefit of allowing you to delete your cluster (and data in HDFS, on persistent disks) when not in use.
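From Spark this is just a matter of pointing the writer at the bucket (a quick sketch; df stands for whatever DataFrame you are saving, and Dataproc's built-in GCS connector is assumed):
df.write.parquet("gs://<your_bucket>/path/to/file")  # instead of an hdfs:// path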
I'm trying to benchmark a program on an Azure cluster using Spark. We previously ran this on EC2 and know that 150 GB of RAM is sufficient. I have tried multiple setups for the executors and given them 160-180 GB of RAM, but regardless of what I do, the program dies because the executors keep requesting more memory.
What can I do? Are there more launch options I should consider? I have tried every conceivable executor setup and nothing seems to work. I'm at a total loss.
In your command you specified 7 executors, each with 40g of memory. That's 280GB of memory in total, but you said your cluster has only 160-180GB of memory. If only 150GB of memory is needed, why is the spark-submit configured that way?
What is your HDI cluster node type, and how many nodes did you create?
Were you using YARN previously on EC2 as well? If so, is the configuration the same?
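To make the arithmetic concrete: roughly num-executors x (executor-memory + overhead) has to stay below the memory YARN actually advertises. A hedged sketch that would fit a ~150GB budget (flags only; your_app.py is a placeholder, and the numbers should be tuned to your node sizes):
$ spark-submit --master yarn --num-executors 6 --executor-memory 20g --executor-cores 4 your_app.py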