I read in various forums and posts that it is recommended to change it, but I do wonder why it is set to 200 by default and not to some function of the total number of executor cores.
We have 8 Presto workers.
The Presto dashboard shows several exceptions about "Query exceeded maximum time limit of", such as the following:
io.prestosql.spi.PrestoException: Query exceeded maximum time limit of 15.00s
at io.prestosql.execution.QueryTracker.enforceTimeLimits(QueryTracker.java:187)
at io.prestosql.execution.QueryTracker.lambda$start$0(QueryTracker.java:88)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
According to the Presto documentation:
query.max-run-time
Type: String (duration)
Default value: 100 d
Description: Used as the default for the session property query_max_run_time. If Presto runs in an environment where there are mostly very long queries (over 100 days), then it may be a good idea to increase this value to avoid dropping clients that didn't set their session property correctly. On the other hand, if Presto runs in an environment with only very short queries, setting this to a small value may be used to detect user errors in queries. It may also be decreased on a poorly configured Presto cluster with mostly short queries, to increase garbage-collection efficiency and thereby lower memory usage in the cluster.
And indeed this value is set to 100 days in our Presto cluster.
So why are queries hitting the limit "Query exceeded maximum time limit of 15.00s"?
From the Presto dashboard session view (session properties screenshot not reproduced here).
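As an aside, and hedged since we cannot see the cluster from here: the config property above only supplies the default for the query_max_run_time session property, and the QueryTracker in the stack trace also enforces the related query_max_execution_time session property, so a 15-second limit usually comes from a session- or client-level override rather than from the cluster config. The effective values can be inspected and changed per session:

-- list the current session properties and their effective values;
-- look for query_max_run_time and query_max_execution_time
SHOW SESSION;

-- raise the limit for the current session only
SET SESSION query_max_execution_time = '1h';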
We are running Grafana/Prometheus to monitor our CPU metrics and want the aggregated CPU usage across all CPUs. The problem is that we have hyperthreading enabled, and when we stress the CPU the percentage goes above 100%. My question is how to cap the displayed CPU usage at 100%, even when the CPU is highly utilized.
P.S. I have tried setting the max and min limits in Grafana, but the graph spikes still go above that limit.
Kindly give me the right query for this problem.
The queries I have tried are given below.
sum(irate(node_cpu_seconds_total{instance="localhost",job="node", mode!="idle"}[5m]))*100
100 - avg(irate(node_cpu_seconds_total{instance="localhost",job="node", mode!="idle"}[5m]))*100
and other similar queries we have tried.
If all you want is to "cap" a variable or expression result at a maximum value (here, 100), you can simply use the Prometheus function clamp_max.
Thus, you could do:
clamp_max(<expr>, 100)
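For example, combining clamp_max with an idle-based average (labels assumed from the question) keeps the panel at or below 100% even under hyperthreading-induced spikes:

clamp_max((1 - avg(irate(node_cpu_seconds_total{instance="localhost",job="node",mode="idle"}[5m]))) * 100, 100)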
This is probably the most helpful query.
(1 - avg(irate(node_cpu_seconds_total{instance="$instance",job="$job",mode="idle"}[5m])))*100
Replace $instance and $job with your instance IP and your node exporter job name.
The Spark configuration reference for spark.driver.maxResultSize says:
Limit of total size of serialized results of all partitions for each
Spark action (e.g. collect). Should be at least 1M, or 0 for
unlimited. Jobs will be aborted if the total size is above this limit.
Having a high limit may cause out-of-memory errors in driver (depends
on spark.driver.memory and memory overhead of objects in JVM). Setting
a proper limit can protect the driver from out-of-memory errors.
What does this attribute do, exactly? At first (since I am not battling a job that fails due to out-of-memory errors), I thought I should increase it.
On second thought, it seems that this attribute defines the maximum size of the result a worker can send to the driver, so leaving it at the default (1G) would be the best approach to protect the driver.
But what happens in that case? Will the worker just have to send more messages, so the only overhead is that the job runs slower?
If I understand correctly, assuming that a worker wants to send 4G of data to the driver, then having spark.driver.maxResultSize=1G will cause the worker to send 4 messages (instead of 1 with unlimited spark.driver.maxResultSize). If so, then increasing that attribute to protect my driver from being killed by YARN should be wrong.
But the question above still remains: if I set it to 1M (the minimum), would that be the most protective approach?
assuming that a worker wants to send 4G of data to the driver, then having spark.driver.maxResultSize=1G will cause the worker to send 4 messages (instead of 1 with unlimited spark.driver.maxResultSize).
No. If the estimated size of the data is larger than maxResultSize, the given job will be aborted. The goal here is to protect your application from driver loss, nothing more.
if I set it to 1M (the minimum), would that be the most protective approach?
In a sense, yes, but it is obviously not useful in practice. A good value should allow the application to proceed normally while protecting it from unexpected conditions.
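For what it's worth, a minimal sketch of where the setting lives, assuming PySpark (the app name and values are illustrative):

from pyspark.sql import SparkSession

# Raise the cap on serialized results collected to the driver
# (the default is 1g; "0" would disable the check entirely).
spark = (
    SparkSession.builder
    .appName("max-result-size-demo")  # hypothetical name
    .config("spark.driver.maxResultSize", "2g")
    .getOrCreate()
)

# If the estimated serialized size of all partitions for this action
# exceeds spark.driver.maxResultSize, the job is aborted outright;
# the data is never chunked into multiple smaller transfers.
rows = spark.range(0, 1_000_000).collect()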
I am using JMeter (I started using it a few days ago) as a tool to simulate a load of 30 threads, using a CSV data file that contains login credentials for 3 system users.
The objective I set out to achieve was to measure 30 users (threads) logging in and navigating to a page via the menu over a time span of 30 seconds.
I have set my thread group as:
Number of threads: 30
Ramp-up Period: 30
Loop Count: 10
I ran the test successfully. Now I'd like to understand what the results mean, what counts as a good or bad measurement, and what can be done to improve the results. Below is a table of the results collated in JMeter's Summary Report.
I have done some research, only to find blogs/sites telling me the same information as what is defined on the jmeter.apache.org site. One blog (Nicolas Vahlas) that I came across gave me some very useful information, but it still hasn't helped me understand what to do next with my results.
Can anyone help me understand these results and what I could do next following the execution of this test plan? Or point me in the right direction of an informative blog/site that will help me understand what to do next.
Many thanks.
In my opinion, the deviation is high.
You know your application better than any of us.
You should focus on whether the average response time you got, and the maximum response time and how often it occurs, are acceptable to you and your users. The same applies to throughput.
Your report shows that the average response time is below 0.5 seconds and the maximum response time is below 1 second, which are generally acceptable figures, but acceptability should be defined by you (is it acceptable to your users?). If the answer is yes, try with more load to check scaling.
Your requirement mentions 30 concurrent users performing different actions. The response time of your requests is low and you have a ramp-up of 30 seconds. Can you please check the total number of active threads during the test? I believe the time during which there are actually 30 concurrent users in the system is quite short, so the average response time you are seeing may be misleading. I would suggest running the test for longer, so that there really are 30 concurrent users in the system; that would give a correct reading for your requirements.
You can use the Aggregate Report instead of the Summary Report. In performance testing, the following are typically used for analysis:
Throughput - requests/second
Response Time - 90th percentile
Target application resource utilization (CPU, processor queue length, and memory)
A common SLA for websites is 3 seconds, but this requirement changes from application to application.
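As a rough illustration, here is a small sketch (Python, with made-up sample data, not the numbers from the test above) of how those two headline metrics fall out of raw results:

# Hypothetical response times (ms) collected from a test run.
response_times_ms = [120, 180, 150, 450, 200, 900, 170, 160, 140, 300]
test_duration_s = 30.0

# Throughput: completed requests divided by the test duration.
throughput = len(response_times_ms) / test_duration_s

# 90th percentile (nearest-rank): 90% of samples completed at or below this value.
p90 = sorted(response_times_ms)[int(0.9 * len(response_times_ms)) - 1]

print(f"Throughput: {throughput:.2f} req/s, 90th percentile: {p90} ms")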
Your test results are good, assuming the users are actually logging into the system/portal.
Samples: the number of requests sent for a particular module.
Average: the average response time, across the 300 samples.
Min: the minimum response time among the 300 samples (the fastest).
Max: the maximum response time among the 300 samples (the slowest).
Standard Deviation: a measure of the variation across the 300 samples.
Error: the failure percentage.
Throughput: the number of requests processed per second.
Hope this will help.
I used CouchDB to create a private npm mirror, but I found that beam.smp keeps my CPU usage at 100%. Is there any way to make it lower, say 50%?
Thank you very much.
You cannot directly limit CPU/memory usage for CouchDB, but you may tweak the Replicator options to reduce it. The options you're interested in are:
http_connections
Defines the maximum number of HTTP connections per replication. Keeping this lower reduces transfer bandwidth.
[replicator]
http_connections = 20
worker_batch_size
With lower batch sizes, checkpoints are done more frequently. Lower batch sizes also reduce the total amount of RAM used.
[replicator]
worker_batch_size = 500
worker_processes
The number of replication worker processes. Keeping this lower reduces the amount of replication data handled in parallel, which reduces CPU usage because less data is being processed at once.
[replicator]
worker_processes = 4
Play with these options to find the right combination that fits your limits.
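Putting the three together, a sketch of what the [replicator] section of local.ini might look like (the values are illustrative starting points, not tuned recommendations):

[replicator]
; fewer parallel HTTP connections per replication
http_connections = 10
; smaller batches: more frequent checkpoints and less RAM
worker_batch_size = 250
; fewer worker processes: less data in flight, lower CPU
worker_processes = 2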