What is spark.driver.maxResultSize?

The documentation says:
Limit of total size of serialized results of all partitions for each
Spark action (e.g. collect). Should be at least 1M, or 0 for
unlimited. Jobs will be aborted if the total size is above this limit.
Having a high limit may cause out-of-memory errors in driver (depends
on spark.driver.memory and memory overhead of objects in JVM). Setting
a proper limit can protect the driver from out-of-memory errors.
What does this attribute do exactly? I mean, at first (since I am not battling a job that fails due to out-of-memory errors) I thought I should increase it.
On second thought, it seems that this attribute defines the max size of the result a worker can send to the driver, so leaving it at the default (1G) would be the best approach to protect the driver.
But what will happen in this case? Will the worker just have to send more messages, so the only overhead is that the job will be slower?
If I understand correctly, assuming that a worker wants to send 4G of data to the driver, then having spark.driver.maxResultSize=1G will cause the worker to send 4 messages (instead of 1 with unlimited spark.driver.maxResultSize). If so, then increasing that attribute to protect my driver from being killed by YARN should be wrong.
But still the question above remains... I mean, what if I set it to 1M (the minimum), will it be the most protective approach?

assuming that a worker wants to send 4G of data to the driver, then having spark.driver.maxResultSize=1G will cause the worker to send 4 messages (instead of 1 with unlimited spark.driver.maxResultSize).
No. If the estimated size of the data is larger than maxResultSize, the given job will be aborted. The goal here is to protect your application from driver loss, nothing more.
if I set it to 1M (the minimum), will it be the most protective approach?
In a sense yes, but obviously it is not useful in practice. A good value should allow the application to proceed normally while protecting it from unexpected conditions.
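For completeness, a minimal sketch of where this setting is usually applied; the application name and the 2g value are only illustrative, not a recommendation:

    import org.apache.spark.sql.SparkSession

    // Raise the cap on serialized results returned to the driver from the
    // 1g default to 2g ("2g" is just an example value for this sketch).
    val spark = SparkSession.builder()
      .appName("max-result-size-example")
      .config("spark.driver.maxResultSize", "2g")
      .getOrCreate()

    // If the serialized result of a collect() exceeds the limit, the job is
    // aborted rather than the data being streamed in smaller pieces.
    val rows = spark.range(0, 1000).collect()

The same property can equally be passed with --conf spark.driver.maxResultSize=2g on spark-submit.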

Related

Hazelcast causing heavy traffic between nodes

NOTE: The root cause was found in application code using Hazelcast which started to execute after 15 minutes; that code retrieved almost the entire data set, so the issue is NOT in Hazelcast. Leaving the question here in case anyone sees the same side effect or the same kind of wrong code.
What can cause heavy traffic between 2 Hazelcast nodes (v3.12.12, also tried 4.1.1)?
They hold maps with a lot of data; no new entries are added or removed within that time, only map values are updated.
Java 11, memory usage 1.5GB out of 12GB, no full GCs identified.
According to JFR, the high IO comes from:
com.hazelcast.internal.networking.nio.NioThread.processTaskQueue()
Below is a chart of network IO: 15 minutes after start, traffic jumps from 15 to 60 MB. From the application's perspective nothing changed after these 15 minutes.
This smells like garbage collection; you are most likely running into long GC pauses. Check your GC logs, which you can enable using verbose GC settings on all members. If there are back-to-back GCs then you should do various things:
increase the heap size
tune your GC; I'd look into G1 (with -XX:MaxGCPauseMillis set to a reasonable number) and/or ZGC (example flags below).
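For example, on Java 11 (which the question uses) the relevant JVM flags look roughly like this; the log path and the pause target are placeholders to adapt, not recommended values:

    Verbose GC logging (Java 11 unified logging):
      -Xlog:gc*:file=/var/log/hazelcast-gc.log:time,uptime,level,tags
    G1 with a pause-time goal:
      -XX:+UseG1GC -XX:MaxGCPauseMillis=200
    ZGC (still experimental on Java 11):
      -XX:+UnlockExperimentalVMOptions -XX:+UseZGC

With the GC log in hand you can check whether the traffic jump at the 15-minute mark lines up with back-to-back collections.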

Cassandra 3.10 debug.log contains frequent "FailureDetector.java:457 - Ignoring interval time of..."

The debug.log files for one of our Cassandra 3.10 clusters have frequent messages similar to “FailureDetector.java:457 - Ignoring interval time of…”
The messages appear even if the cluster is idle. I see the messages at a rate of about 1 per second on each node of this 6 node cluster (3 nodes each in two data centers).
Can someone tell me what causes the messages and if they are something to be concerned about?
We have a couple of other small clusters supporting the same application (different environments) and I see this message much less often (days apart).
The FailureDetector is responsible for deciding whether a node is considered UP or DOWN.
The gossip process tracks state from other nodes both directly (nodes
gossiping directly to it) and indirectly (nodes communicated about
secondhand, third-hand, and so on). Rather than have a fixed threshold
for marking failing nodes, Cassandra uses an accrual detection
mechanism to calculate a per-node threshold that takes into account
network performance, workload, and historical conditions. During
gossip exchanges, every node maintains a sliding window of
inter-arrival times of gossip messages from other nodes in the
cluster.
The log message comes from the FailureDetector source code (FailureDetector.java, referenced in your log line). It is logged at DEBUG level because such messages may be helpful in tracking down the actual issue causing the latency, but they don't indicate a problem on their own.
In other words: your node measures the acknowledgement latency for each gossip message sent to the other nodes, e.g. X nanoseconds for IP address 1, Z nanoseconds for IP address 2, etc. If either X or Z is above the expected 2-second threshold defined by MAX_INTERVAL_IN_NANO, it will get reported.
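As a rough sketch of that mechanism (a simplification for illustration, not Cassandra's actual implementation): intervals above the 2-second cutoff are logged at DEBUG and left out of the per-endpoint sliding window that feeds the accrual (phi) calculation:

    import scala.collection.mutable

    // Simplified sketch of the interval check described above.
    object GossipIntervalSketch {
      val MaxIntervalInNano: Long = 2L * 1000 * 1000 * 1000 // 2 seconds

      // One sliding window of gossip inter-arrival times per endpoint.
      private val windows = mutable.Map.empty[String, mutable.Queue[Long]]

      def report(endpoint: String, intervalNanos: Long): Unit = {
        if (intervalNanos > MaxIntervalInNano) {
          // The condition behind "Ignoring interval time of ..." in debug.log.
          println(s"DEBUG Ignoring interval time of $intervalNanos for $endpoint")
        } else {
          val window = windows.getOrElseUpdate(endpoint, mutable.Queue.empty[Long])
          window.enqueue(intervalNanos)
          if (window.size > 1000) window.dequeue() // keep the window bounded
        }
      }
    }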
Problems which can cause this log message:
Huge load on the node(s): e.g. too many large partitions
High pressure: e.g. too many queries in a short period of time
Bad network connection
The extra FailureDetector logging was added with this:
Expose phi values from failure detector via JMX and tweak debug
and trace logging (CASSANDRA-9526)
I also found this open issue, which might be related to your problem:
The failure detector becomes more sensitive when the network is flakey(CASSANDRA-9536)
I also found this article about Gossiping and Failure Detection very useful.

High Native-Transport-Requests All time Blocked

After running tpstats on all nodes, I see a lot of nodes having a high number of ALL TIME BLOCKED NTR. We have a 4 node cluster and the values for NTR ALL TIME BLOCKED are:
NODE 1: 23953
NODE 2: 2935
NODE 3: 15229
NODE 4: 5951
I know ALL TIME BLOCKED is bad and hence worried as to what I am doing wrong.
This pool handles CQL requests, so the limit is the number of active CQL requests allowed. It's limited to prevent too many active ones from OOMing your system (i.e. each returning large blobs). This effectively applies backpressure to your client application, telling it to slow down. Unfortunately, if you have small requests this isn't ideal and hurts your throughput, so in CASSANDRA-11363 they added a setting to make the space tradeoff for small bursty workloads.
If you upgrade to 2.2.8+ you can set the max queue size of that thread pool with -Dcassandra.max_queued_native_transport_requests=4096
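One common place to put that flag (assuming a typical package-style install; the path and the 4096 value are just examples) is cassandra-env.sh, so it survives restarts:

    JVM_OPTS="$JVM_OPTS -Dcassandra.max_queued_native_transport_requests=4096"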

High disk I/O on Cassandra nodes

Setup:
We have a 3-node Cassandra cluster with around 850G of data on each node. We have an LVM setup for the Cassandra data directory (currently consisting of 3 drives: 800G + 100G + 100G) and a separate volume (non-LVM) for cassandra_logs.
Versions:
Cassandra v2.0.14.425
DSE v4.6.6-1
Issue:
After adding the 3rd (100G) volume to the LVM on each node, all the nodes show very high disk I/O and go down quite often. The servers also become inaccessible and we need to reboot them; they don't stay stable and we need to reboot every 10-15 minutes.
Other Info:
We have the DSE-recommended server settings (vm.max_map_count, file descriptors) configured on all nodes
RAM on each node : 24G
CPU on each node : 6 cores / 2600MHz
Disk on each node : 1000G (Data dir) / 8G (Logs)
As I suspected, you are having throughput problems on your disk. Here's what I looked at to give you background. The nodetool tpstats output from your three nodes had these lines:
Pool Name      Active   Pending   Completed   Blocked   All time blocked
FlushWriter         0         0          22         0                  8
FlushWriter         0         0          80         0                  6
FlushWriter         0         0          38         0                  9
The column I'm concerned about is the All Time Blocked. As a ratio to completed, you have a lot of blocking. The flushwriter is responsible for flushing memtables to the disk to keep the JVM from running out of memory or creating massive GC problems. The memtable is an in-memory representation of your tables. As your nodes take more writes, they start to fill and need to be flushed. That operation is a long sequential write to disk. Bookmark that. I'll come back to it.
When flushwriters are blocked, the heap starts to fill. If they stay blocked, you will see the requests starting to queue up and eventually the node will OOM.
Compaction might be running as well. Compaction is a long sequential read of SSTables into memory and then a long sequential flush of the merge sorted results. More sequential IO.
So all these operations on disk are sequential. Not random IOPs. If your disk is not able to handle simultaneous sequential read and write, IOWait shoots up, requests get blocked and then Cassandra has a really bad day.
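If you want to see this happening on a node, watching extended disk statistics while the node struggles is usually enough (a sketch; iostat comes from the sysstat package and device names will differ):

    iostat -x 5

Watch the r_await/w_await and %util columns climb while flushes and compactions run.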
You mentioned you are using Ceph. I haven't seen a successful deployment of Cassandra on Ceph yet. It will hold up for a while and then tip over on sequential load. Your easiest solution in the short term is to add more nodes to spread out the load. The medium term is to find some ways to optimize your stack for sequential disk loads, but that will eventually fail. Long term is get your data on real disks and off shared storage.
I have told consulting clients for years when using Cassandra: "If your storage has an Ethernet plug, you are doing it wrong." It's a good rule of thumb.

NODE_LOCAL vs RACK_LOCAL task read time

I am studying how locality affects task read time in a Spark SQL job.
THE TEST:
To facilitate the analysis I run a simple SQL query which performs a table scan and returns no data; each task spends its time reading a block and then processing it.
The Query: "CREATE TABLE target_table AS SELECT * FROM source_table WHERE column_name>1000".
Selectivity is equal to 0 (i.e. column_name is never greater than 1000).
The Spark context has been created with only one executor so as to observe both NODE_LOCAL and RACK_LOCAL tasks.
My cluster is composed of 7 nodes with 8 cores each, in a single rack, linked together with a gigabit switch (1 gigabit point-to-point).
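For reference, a minimal sketch of how this setup could be reproduced (the table and column names come from the query above; the executor settings are assumptions matching the description):

    import org.apache.spark.sql.SparkSession

    // One executor, so that some tasks must read blocks stored on other nodes
    // (RACK_LOCAL); vary spark.executor.cores across runs to change the number
    // of VCores competing for the local disk.
    val spark = SparkSession.builder()
      .appName("locality-read-time-test")
      .config("spark.executor.instances", "1")
      .config("spark.executor.cores", "8")
      .enableHiveSupport()
      .getOrCreate()

    // Selectivity-zero scan: column_name is never greater than 1000, so no rows are written.
    spark.sql("CREATE TABLE target_table AS SELECT * FROM source_table WHERE column_name > 1000")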
Before getting to the point of my question I would like to state a few hypotheses:
Each task processes a single block
As data locality is preferred, the driver allocates NODE_LOCAL tasks first and then RACK_LOCAL ones (see the locality wait sketch after this list)
When more than one VCore is allocated, tasks initially compete for the local hard drive to fetch their blocks, and later blocks are fetched remotely from other nodes
The network throughput outperforms the hard drive throughput, therefore under stress the hard drives are the bottleneck
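Regarding the locality preference in the second hypothesis, the fallback from NODE_LOCAL to RACK_LOCAL is governed by Spark's locality wait settings; a small sketch with the documented default values, shown only for illustration:

    import org.apache.spark.sql.SparkSession

    // How long the scheduler waits for a NODE_LOCAL slot before falling back
    // to RACK_LOCAL, and then to ANY; 3s is the documented default.
    val sparkWithLocalityWait = SparkSession.builder()
      .appName("locality-wait-example")
      .config("spark.locality.wait", "3s")
      .config("spark.locality.wait.rack", "3s")
      .getOrCreate()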
Finally the question :)
When many VCores are allocated (e.g. 8) in a single executor, given the hypotheses stated above, I would expect the RACK_LOCAL tasks' read time to be faster than the NODE_LOCAL one.
Instead, according to my tests, RACK_LOCAL read time is on average a few percentage points slower than NODE_LOCAL, as shown in the linked figure. Obviously I am missing something, but I dug around without coming up with a reason. What is this something?
The linked figure shows NODE_LOCAL and RACK_LOCAL average task duration time for increasing numbers of VCores.
Thanks,
Lorenzo
Actually I found out that one of my hypotheses is not correct: "The network throughput outperforms the hard drive throughput, therefore under stress the hard drives are the bottleneck."
A gigabit switch performs on average at about 0.8 of its nominal speed; since 1 Gbit/s is about 125 MB/s, two nodes are effectively linked with a network throughput of roughly 100 MB/s. HDDs can normally read at 150 MB/s.
As the remote read and the network transfer are performed in a pipeline, the small difference between NODE_LOCAL and RACK_LOCAL is due to the remote buffering time which occurs between reading and transmitting.
RACK_LOCAL means that a block is read from an HDD on a remote node and then passed over the network. NODE_LOCAL means that a block is read on the same node, so the "network" part is omitted; therefore NODE_LOCAL should in general be faster.

Resources