There are 4 major actions(jdbc write) with respect to application and few counts which in total takes around 4-5 minutes for completion.
But the total uptime of Application is around 12-13minutes.
I see there are certain jobs by name run at ThreadPoolExecutor.java : 1149. Just before this job being reflected on Spark UI, the invisible long delays occur.
I want to know what are the possible causes for these delays.
My application is reading 8-10 CSV files, 5-6 VIEWs from table. Number of joins are around 59, few groupBy with agg(sum) are there and 3 unions are there.
I am not able to reproduce the issue in DEV/UAT env since the data is not that much.
It's in the production where I get the app. executed run by my Manager.
If anyone has come across such delays in their job, please share your experience what could be the potential cause for this, currently I am working around the unions, i.e. caching the associated dataframes and calling count so as to get the benefit of cache in the coming union(yet to test, if union is the reason for delays)
Similarly, I tried the break the long chain of transformations with cache and count in between to break the long lineage.
The time reduced from initial 18 minutes to 12 minutes but the issue with invisible delays still persist.
Thanks in advance
I assume you don't have a CPU or IO heavy code between your spark jobs.
So it really sparks, 99% it is QueryPlaning delay.
You can use
spark.listenerManager.register(QueryExecutionListener) to check different metrics of query planing performance.
We want to cache some entries (i.e. depending upon predicate) in continuous query cache on client for IMap. But we want to send update to CQC only after some delay seconds (i.e. 30 sec) even if the entries receives like 100 updates per sec. This we can achieve by setting delay seconds to 30 seconds & coalescing to true.
QueryCacheConfig cqc = new QueryCacheConfig();
cqc.setDelaySeconds(30);
cqc.setCoalesce(true);
cqc.setBatchSize(30)
CQC fits perfectly well for the above use case.
But we are noticing CQC is not receiving updates after delay seconds until batch size capacity is not reached. Is this is the expected behavior?
We thought CQC will receive the latest updated value for entries after delay seconds or batch size reached its capacity.
delaySeconds and batchSize 'OR' relation. Updates are pushed to caches either when batchSize is reached or delaySeconds are passed. If coalesce is true, then only latest update of a key is pushed to cache.
We have noticed some issues when testing with intellij. please try using another IDE if you are using intellij
I'm currently using 1-node cluster with DataStax Opscenter 5.2.1 (Cassandra 2.2.3) installed on Windows.
There is not too much data is sent to the cluster, and here is the graph (last 20 minutes) of write requests that I can see in Opscenter. The graph looks normal and expected for me:
write_requests(20min)
However, when I've switched the data range to last 1 hour, as turns out there were much more write requests (according to cluste(max) line):
write_requests(1h)
I'm confused, could someone clarify what cluster(max) means in my case? Why these values are so big in comparison with cluster(total) or cluster(min)?
The first graph (20 minute) uses an average. The 1h graph will have 3 lines - min per sample, average, and max per sample.
What you're likely seeing is that something (perhaps opscenter itself) is doing a flood of writes, about 700/second for a few seconds, and on the 20 minute graph it gets averaged out, but with the min/max lines, you'll see the outliers.
Unable to add a new node to a Cassandra 1.2 ring because streaming times out. We have streaming_socket_timeout_in_ms=30000. What should I change it to? Why is streaming not retried ad infinitum until successful?
In the end it worked, after a dozen attempts. The difference was:
Raise streaming_socket_timeout_in_ms=600000 (10 min), to be longer than the longest imaginable GC pause (Maybe 0ms would have also worked.)
Before starting the bootstrap of the new node, do a rolling restart of the nodes in the same datacenter as the new node, so that their heap starts fresh, from a low value
Start the bootstrap at the end of the work day and leave it on overnight
We have a metric that we increment every time a user performs a certain action on our website, but the graphs don't seem to be accurate.
So going off this hunch, we invested the updates.log of carbon and discovered that the action had happened over 4 thousand times today(using grep and wc), but according the Integral result of the graph it returned only 220ish.
What could be the cause of this? Data is being reported to statsd using the statsd php library, and calling statsd::increment('metric'); and as stated above, the log confirms that 4,000+ updates to this key happened today.
We are using:
graphite 0.9.6 with statsD (etsy)
After some research through the documentation, and some conversations with others, I've found the problem - and the solution.
The way the whisper file format is designed, it expect you (or your application) to publish updates no faster than the minimum interval in your storage-schemas.conf file. This file is used to configure how much data retention you have at different time interval resolutions.
My storage-schemas.conf file was set with a minimum retention time of 1 minute. The default StatsD daemon (from etsy) is designed to update to carbon (the graphite daemon) every 10 seconds. The reason this is a problem is: over a 60 second period StatsD reports 6 times, each write overwrites the last one (in that 60 second interval, because you're updating faster than once per minute). This produces really weird results on your graph because the last 10 seconds in a minute could be completely dead and report a 0 for the activity during that period, which results in completely nuking all of the data you had written for that minute.
To fix this, I had to re-configure my storage-schemas.conf file to store data at a maximum resolution of 10 seconds, so every update from StatsD would be saved in the whisper database without being overwritten.
Etsy published the storage-schemas.conf configuration that they were using for their installation of carbon, which looks like this:
[stats]
priority = 110
pattern = ^stats\..*
retentions = 10:2160,60:10080,600:262974
This has a 10 second minimum retention time, and stores 6 hours worth of them. However, due to my next problem, I extended the retention periods significantly.
As I let this data collect for a few days, I noticed that it still looked off (and was under reporting). This was due to 2 problems.
StatsD (older versions) only reported an average number of events per second for each 10 second reporting period. This means, if you incremented a key 100 times in 1 second and 0 times for the next 9 seconds, at the end of the 10th second statsD would report 10 to graphite, instead of 100. (100/10 = 10). This failed to report the total number of events for a 10 second period (obviously).Newer versions of statsD fix this problem, as they introduced the stats_counts bucket, which logs the total # of events per metric for each 10 second period (so instead of reporting 10 in the previous example, it reports 100).After I upgraded StatsD, I noticed that the last 6 hours of data looked great, but as I looked beyond the last 6 hours - things looked weird, and the next reason is why:
As graphite stores data, it moves data from high precision retention to lower precision retention. This means, using the etsy storage-schemas.conf example, after 6 hours of 10 second precision, data was moved to 60 second (1 minute) precision. In order to move 6 data points from 10s to 60s precision, graphite does an average of the 6 data points. So it'd take the total value of the oldest 6 data points, and divide it by 6. This gives an average # of events per 10 seconds for that 60 second period (and not the total # of events, which is what we care about specifically).This is just how graphite is designed, and for some cases it might be useful, but in our case, it's not what we wanted. To "fix" this problem, I increased our 10 second precision retention time to 60 days. Beyond 60 days, I store the minutely and 10-minutely precisions, but they're essentially there for no reason, as that data isn't as useful to us.
I hope this helps someone, I know it annoyed me for a few days - and I know there isn't a huge community of people that are using this stack of software for this purpose, so it took a bit of research to really figure out what was going on and how to get a result that I wanted.
After posting my comment above I found Graphite 0.9.9 has a (new?) configuration file, storage-aggregation.conf, in which one can control the aggregation method per pattern. The available options are average, sum, min, max, and last.
http://readthedocs.org/docs/graphite/en/latest/config-carbon.html#storage-aggregation-conf