Hazelcast Continuous Query Cache - Batch Size Capacity & Coalescing - hazelcast

We want to cache some entries (i.e. depending upon predicate) in continuous query cache on client for IMap. But we want to send update to CQC only after some delay seconds (i.e. 30 sec) even if the entries receives like 100 updates per sec. This we can achieve by setting delay seconds to 30 seconds & coalescing to true.
QueryCacheConfig cqc = new QueryCacheConfig();
cqc.setDelaySeconds(30);
cqc.setCoalesce(true);
cqc.setBatchSize(30)
CQC fits perfectly well for the above use case.
But we are noticing CQC is not receiving updates after delay seconds until batch size capacity is not reached. Is this is the expected behavior?
We thought CQC will receive the latest updated value for entries after delay seconds or batch size reached its capacity.

delaySeconds and batchSize 'OR' relation. Updates are pushed to caches either when batchSize is reached or delaySeconds are passed. If coalesce is true, then only latest update of a key is pushed to cache.
We have noticed some issues when testing with intellij. please try using another IDE if you are using intellij

Related

Bulls Queue Performance and Scalability: Queue.add(), Queue.getJob(jobId), Job.remove()

My use case is to create dynamic delayed job. (I am Using Bulls Queue which can be used to create delayed Jobs.)
Based on some event add some more delay to the delayed interval (further delay the job).
Since I could not find any function to update the Delayed Interval for a Job I came up with the following steps:
onEvent(jobId):
// queue is of Type Bull.Queue
// job is of type bull.Job
job = queue.getJob(jobId)
data = job.data
delay = job.toJSON().delay
job.remove()
queue.add("jobName", {value: 1}, {jobId: jobId, delayed: delay + someValue})
This pretty much solves my problem.
But I am worried about the SCALE at which these operations will happen.
I am expecting nearly 50K events per minute or even more in near future.
My Queue size is expected to grow based on unique JobId.
I am expecting more than:
1 million daily entry
around 4-5 million weekly entry
10-12 million monthly entry.
Also, after 60-70 days delayed interval for jobs will reach, and older jobs will be removed one by one.
I can run multiple processor to handle these delayed job which is not an issue.
My queue size will be stabilise after 60-70 days and more or less my queue will have around 10 million jobs.
I can vertically scale my REDIS as required.
But I want to understand the time complexity for below queries:
queue.getJob(jobId) // Get Job By Id
job.remove() // remove job from queue
queue.add(name, data, opts) // add a delayed job to this queue
If any of these operations are O(N) OR the QUEUE can keep some max number of Jobs which is less than 10 million.
Then I might have to discard this design and come up with something entirely different.
Need advice from experienced folks who can guide me on how solve this problem.
Any kind of help is appreciated.
Taking reference from the source code:
queue.getJob(jobId)
This should be O(1) since it's mostly using hash based solutions using hmget. You're only requesting one job and according to official redis docs, the time complexity is O(N) where N is the requested number of keys which will be in the order of O(1) since I'm expecting bull is storing few number of fields at the hash key.
job.remove()
Considering that a considerable number of your jobs is going to be delayed and a fraction of them are moved to waiting or active queue. This should be O(logN) on an amortized level as it's mostly using zrem for these operations.
queue.add(name, data, opts)
For job addition in a delayed queue, bull is using zadd so this is again O(logN).

CouchDB: difference between max_replication_retry_count and retries_per_request

I'm currently exploring CouchDB replication and trying to figure out the difference between max_replication_retry_count and retries_per_request configuration options in [replicator] section of configuration file.
Basically I want to configure continuous replication of local couchdb to the remote instance that would never stop replication attempts, considering potentially continuous periods of being offline(days or even weeks). So, I'd like to have infinite replication attempts with maximum retry interval of 5 minutes or so. Can I do this? Do I need to change default configuration to achieve this?
Here's the replies I've got at CouchDB mailing lists:
If we are talking Couch 1.6, the attribute retries_per_request
controls a number of attempts a current replication is going to do to
read _changes feed before giving up. The attribute
max_replication_retry_count controls a number of attempts the whole replication job is going to be retried by a replication manager.
Setting this attribute to “infinity” should make the replicaton
manager to never give up.
I don’t think the interval between those attempts is configurable. As
far as I understand it’s going to start from 2.5 sec between the
retries and then double until reached 10 minutes, which is going to be
hard upper limit.
Extended answer:
The answer is slightly different depending if you're using 1.x/2.0
releases or current master.
If you're using 1.x or 2.0 release: Set "max_replication_retry_count =
infinity" so it will always retry failed replications. That setting
controls how the whole replication job restarts if there is any error.
Then "retries_per_request" can be used to handle errors for individual
replicator HTTP requests. Basically the case where a quick immediate
retry succeeds. The default value for "retries_per_request" is 10.
After the first failure, there is a 0.25 second wait. Then on next
failure it doubles to 0.5 and so on. Max wait interval is 5 minutes.
But If you expect to be offline routinely, maybe it's not worth
retrying individual requests for too long so reduce the
"retries_per_request" to 6 or 7. So individual requests would retry a
few times for about 10 - 20 seconds then the whole replication job
will crash and retry.
If you're using current master, which has the new scheduling
replicator: No need to set "max_replication_retry_count", that setting
is gone and all replication jobs will always retry for as long as
replication document exists. But "retries_per_request" works the same
as above. Replication scheduler also does exponential backoffs when
replication jobs fail consecutively. First backoff is 30 seconds. Then
it doubles to 1 minute, 2 minutes, and so on. Max backoff wait is
about 8 hours. But if you don't want to wait 4 hours on average for
the replication to restart when network connectivity is restored, and
want to it be about 5 minutes or so, set "max_history = 8" in the
"replicator" config section. max_history controls how much history of
past events are retained for each replication job. If there is less
history of consecutive crashes, that backoff wait interval will also
be shorter.
So to summarize, for 1.x/2.0 releases:
[replicator] max_replication_retry_count = infinity
retries_per_request = 6
For current master:
[replicator] max_history = 8 retries_per_request = 6

Mongodb count performance issues with Node js

I am having issues with doing counts on a single table with up to 1million records. I have a 32 core 244gb ram box that I am running my test on so hardware should not be an issue.
I have indexes set up on all of my queries that I am using to perform counts. I have enabled node max_old_space_size to 15gb.
The process I am following is basically looping through a huge array, creating 1000 promises, within each promise I am performing 12 counts, waiting for the promises to all resolve, and then continuing with the next one thousand batch.
As part of my test, I am doing inserts, updates, and reads as well. All of those, are showing great performance up to 20000/sec on each. However, when I get to the portion of my code doing the counts(), I can see via mongostat that there are only 20-30 commands being executed per second. I have not determined at this point, if my node code is only sending that many, or if mongo is queuing it up.
Meanwhile, in my node.js code, all 1000 promises are started and waiting to evaluate. I know this is a lot of info, so please let me know what more granular details I should provide to get some more insight into why the count performance is so slow.
So basically, for a batch of 1000 records, doing lets say 12 counts each, for a total of 12,000 counts, it is taking close to 10 minutes, on a table of 1million records.
MongoDB Native Client v2.2.1
Node v4.2.1
What I'd like to add is that I have tried changing the maxPoolSize on the driver from 100-1000 with no change in performance. I've tried changing my queries that I perform from yield/generator/promise to callbacks wrapped in promise, which has helped somewhat.
The strange thing is, when my program starts, even if i use just the default number of connections which I see as 7 when running mongostat, I can get around 2500 count() queries per second throughout. However, after a few seconds this goes back down to about 300-400. This leads me to believe that mongo can handle that many all the time, but my code is not able to send that many requests, even though I set maxPoolSize to 300 and start 10000 simultaneous promises resolving in parallel. So what gives, any ideas from anyone ?

What is spark.streaming.receiver.maxRate? How does it work with batch interval

I am working with spark 1.5.2. I understand what a batch interval is, essentially the interval after which the processing part should start on the data received from the receiver.
But I do not understand what is spark.streaming.receiver.maxRate. From some research it is apparently an important parameter.
Lets consider a scenario. my batch interval is set to 60s. And spark.streaming.receiver.maxRate is set to 60*1000. What if I get 60*2000 records in 60s due to some temporary load. What would happen? Will the additional 60*1000 records be dropped? Or would the processing happen twice during that batch interval?
Property spark.streaming.receiver.maxRate applies to number of records per second.
The receiver max rate is applied when receiving data from the stream - that means even before batch interval applies. In other words you will never get more records per second than set in spark.streaming.receiver.maxRate. The additional records will just "stay" in the stream (e.g. Kafka, network buffer, ...) and get processed in the next batch.

Tracking metrics using StatsD (via etsy) and Graphite, graphite graph doesn't seem to be graphing all the data

We have a metric that we increment every time a user performs a certain action on our website, but the graphs don't seem to be accurate.
So going off this hunch, we invested the updates.log of carbon and discovered that the action had happened over 4 thousand times today(using grep and wc), but according the Integral result of the graph it returned only 220ish.
What could be the cause of this? Data is being reported to statsd using the statsd php library, and calling statsd::increment('metric'); and as stated above, the log confirms that 4,000+ updates to this key happened today.
We are using:
graphite 0.9.6 with statsD (etsy)
After some research through the documentation, and some conversations with others, I've found the problem - and the solution.
The way the whisper file format is designed, it expect you (or your application) to publish updates no faster than the minimum interval in your storage-schemas.conf file. This file is used to configure how much data retention you have at different time interval resolutions.
My storage-schemas.conf file was set with a minimum retention time of 1 minute. The default StatsD daemon (from etsy) is designed to update to carbon (the graphite daemon) every 10 seconds. The reason this is a problem is: over a 60 second period StatsD reports 6 times, each write overwrites the last one (in that 60 second interval, because you're updating faster than once per minute). This produces really weird results on your graph because the last 10 seconds in a minute could be completely dead and report a 0 for the activity during that period, which results in completely nuking all of the data you had written for that minute.
To fix this, I had to re-configure my storage-schemas.conf file to store data at a maximum resolution of 10 seconds, so every update from StatsD would be saved in the whisper database without being overwritten.
Etsy published the storage-schemas.conf configuration that they were using for their installation of carbon, which looks like this:
[stats]
priority = 110
pattern = ^stats\..*
retentions = 10:2160,60:10080,600:262974
This has a 10 second minimum retention time, and stores 6 hours worth of them. However, due to my next problem, I extended the retention periods significantly.
As I let this data collect for a few days, I noticed that it still looked off (and was under reporting). This was due to 2 problems.
StatsD (older versions) only reported an average number of events per second for each 10 second reporting period. This means, if you incremented a key 100 times in 1 second and 0 times for the next 9 seconds, at the end of the 10th second statsD would report 10 to graphite, instead of 100. (100/10 = 10). This failed to report the total number of events for a 10 second period (obviously).Newer versions of statsD fix this problem, as they introduced the stats_counts bucket, which logs the total # of events per metric for each 10 second period (so instead of reporting 10 in the previous example, it reports 100).After I upgraded StatsD, I noticed that the last 6 hours of data looked great, but as I looked beyond the last 6 hours - things looked weird, and the next reason is why:
As graphite stores data, it moves data from high precision retention to lower precision retention. This means, using the etsy storage-schemas.conf example, after 6 hours of 10 second precision, data was moved to 60 second (1 minute) precision. In order to move 6 data points from 10s to 60s precision, graphite does an average of the 6 data points. So it'd take the total value of the oldest 6 data points, and divide it by 6. This gives an average # of events per 10 seconds for that 60 second period (and not the total # of events, which is what we care about specifically).This is just how graphite is designed, and for some cases it might be useful, but in our case, it's not what we wanted. To "fix" this problem, I increased our 10 second precision retention time to 60 days. Beyond 60 days, I store the minutely and 10-minutely precisions, but they're essentially there for no reason, as that data isn't as useful to us.
I hope this helps someone, I know it annoyed me for a few days - and I know there isn't a huge community of people that are using this stack of software for this purpose, so it took a bit of research to really figure out what was going on and how to get a result that I wanted.
After posting my comment above I found Graphite 0.9.9 has a (new?) configuration file, storage-aggregation.conf, in which one can control the aggregation method per pattern. The available options are average, sum, min, max, and last.
http://readthedocs.org/docs/graphite/en/latest/config-carbon.html#storage-aggregation-conf

Resources