Native Transport Requests All time blocked - Cassandra

Trying to understand more about Native-Transport-Requests!
As I understand it, these are CQL requests, and if the limit is exceeded the excess shows up as all-time-blocked NTRs.
My question is: how do I monitor these requests in real time and get some kind of report on them?
I see settings like max_queued_native_transport_requests and native_transport_max_threads. How do these settings affect the all-time-blocked count?

Have a look at CASSANDRA-11363.
Also check this discussion for more info.
The recommendation is to start with the default values and tune from there. The default values are:
max_queued_native_transport_requests: 1024
native_transport_max_threads: 128
Monitor your nodes, and if you see an increasing number of blocked Native-Transport-Requests, increase max_queued_native_transport_requests.
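For real-time visibility, nodetool tpstats reports the Native-Transport-Requests pool, including its "All time blocked" column, and the same counter is exposed over JMX, so you can poll it yourself and build a report. Here is a minimal Java sketch, assuming the standard thread-pool metric ObjectName (verify the exact name against your Cassandra version):
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class NtrBlockedMonitor {
    public static void main(String[] args) throws Exception {
        // Cassandra's JMX port defaults to 7199.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://127.0.0.1:7199/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            // Assumed metric name: the counter behind tpstats' "All time blocked" column.
            ObjectName blocked = new ObjectName(
                    "org.apache.cassandra.metrics:type=ThreadPools,path=transport,"
                            + "scope=Native-Transport-Requests,name=TotalBlockedTasks");
            while (true) {
                Object count = mbs.getAttribute(blocked, "Count");
                System.out.println("All time blocked NTR: " + count);
                Thread.sleep(5000); // poll every 5 seconds
            }
        }
    }
}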
Also, I think it's worth checking these discussions: 1, 2


Set network interface counters (rx, tx, packets, bytes)

Is it possible to SET the interface statistics in Linux after an interface has been brought up? I'm dealing with rrdtool (mrtg) that gets upset by a daily ifdown and ifup, which brings the interface counters back to zero. Ideally I would like to continue counting from where I left off, and setting the interface values to what they were before the interface went down seems to be the easiest path.
I checked writing to /sys/class/net/ax0/statistics/rx_packets but that gives a Permission Denied error.
netstat, ifup, ifconfig and friends don't seem to support changing these values either.
Anything else I can try?
You can't set the kernel counters, no - but do you really need to?
MRTG will usually graph a rate, based on the difference between samples. So your MRTG/RRD will store packets-per-second values every cycle (usually 5 min, but maybe 1 min). When your device resets the counters, MRTG will see the value apparently go backwards, which will be discarded as out of range: one failed sample. But the next sample will work, and a new rate will be given.
If you're getting a big spike in the MRTG graph at the point of the reset, this will be due to an incorrect 'counter rollover' detection. You can prevent this by either setting the MRTG AbsMax setting (to prevent this high value from being valid) or (better) by using SNMPv2 counters (where a reset is more obvious).
If you set your RRD file to have a large enough heartbeat and XFF, then this one missing sample will be interpolated, and so your graphs (which, remember, show the rate rather than the total) will continue to look fine.
Should you need the total, it can be derived by sum(rate x interval) which is automatically done by the Routers2 frontend for MRTG/RRD.
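To make the arithmetic concrete, here is a minimal sketch of the counter-to-rate logic such a poller applies (purely illustrative, not MRTG's actual implementation):
public class CounterRate {
    private long lastValue = -1;
    private long lastTime = -1;

    /** Returns packets/sec for this sample, or -1 if the sample must be discarded. */
    public double sample(long counter, long timeSec) {
        double rate = -1;
        if (lastValue >= 0) {
            long delta = counter - lastValue;
            // A reset (ifdown/ifup) makes the counter appear to go backwards:
            // delta < 0 is out of range, so the sample is dropped, not graphed.
            if (delta >= 0) {
                rate = (double) delta / (timeSec - lastTime);
            }
        }
        lastValue = counter;
        lastTime = timeSec;
        return rate;
    }
}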

Hazelcast High Response Times

We have a Java 1.6 application that uses Hazelcast 3.7.4,
with a topology of two nodes. The application operates mainly on 3 maps.
In normal operation, response times when querying the maps are
generally in the tens of milliseconds.
I have observed that in some circumstances, such as network
cuts, the response time increases to huge values, for example 20 or 30 seconds!
And this is impacting the application performance.
I would like to know if this kind of situation with network micro-cuts can
increase query response times in this manner. I don't know if some concrete configuration can be done to minimize this, and also which other elements can provoke such high times.
I provide some examples of executed queries:
Example 1:
// Predicate query over the agents map
String sqlPredicate = "acui='" + acui + "'";
Collection<Agent> agents =
        (Collection<Agent>) data.getMapAgents().values(new SqlPredicate(sqlPredicate));
Example 2:
// Simple key lookup
boolean exist = data.getMapAgents().containsKey(agent);
Thanks so much for your help.
Best Regards,
Jorge
The Map operations are all TCP socket based and thus are subject to your operating system's TCP driver implementation.
See TCP_NODELAY.
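If you want to rule the socket configuration in or out, Hazelcast lets you set socket options as instance properties. A minimal sketch, assuming the hazelcast.socket.no.delay property name from the Hazelcast 3.x documentation (verify against your version; it defaults to true in recent releases):
import com.hazelcast.config.Config;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

public class NoDelayExample {
    public static void main(String[] args) {
        Config config = new Config();
        // Ask Hazelcast to set TCP_NODELAY on its sockets, disabling Nagle's
        // algorithm so small request/response packets are not held for coalescing.
        config.setProperty("hazelcast.socket.no.delay", "true");
        HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
        System.out.println("Cluster members: " + hz.getCluster().getMembers());
    }
}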

Give reads priority over writes in Elasticsearch

I have an EC2 server running Elasticsearch 0.9 with an nginx server for read/write access. My index has about 750k small-to-medium documents. I have a fairly continuous stream of minimal writes (mainly updates) to the content. The speed/consistency I get with search is fine with me, but I have some sporadic timeout issues with multi-get (/_mget).
On some pages in my app, our server will request a multi-get of a dozen to a few thousand documents (this usually takes less than 1-2 seconds). The requests that fail do so with a 30,000 millisecond timeout from the nginx server. I am assuming this happens because the index was temporarily locked for writing/optimizing purposes. Does anyone have any ideas on what I can do here?
A temporary solution would be to lower the timeout and return a user-friendly message saying the documents couldn't be retrieved (however, users would still have to wait ~10 seconds to see an error message).
One of my other thoughts was to give reads priority over writes: any time someone is trying to read a part of the index, don't allow any writes/locks to that section. I don't think this would be scalable, and it may not even be possible?
Finally, I was thinking I could have a read-only alias and a write-only alias. I can figure out how to set this up through the documentation, but I am not sure if it will actually work like I expect it to (and I'm not sure how I can reliably test it in a local environment). If I set up aliases like this, would the read-only alias still have moments where the index was locked due to information being written through the write-only alias?
I'm sure someone else has come across this before; what is the typical solution to make sure a user can always read data from the index with a higher priority over writes? I would consider increasing our server power, if required. Currently we have two m2.xlarge EC2 instances; one holds the primary shards and the other the replicas, with 4 shards each.
An example dump of cURL info from a failed request (with an error of Operation timed out after 30000 milliseconds with 0 bytes received):
{
    "url": "127.0.0.1:9200\/_mget",
    "content_type": null,
    "http_code": 100,
    "header_size": 25,
    "request_size": 221,
    "filetime": -1,
    "ssl_verify_result": 0,
    "redirect_count": 0,
    "total_time": 30.391506,
    "namelookup_time": 7.5e-5,
    "connect_time": 0.0593,
    "pretransfer_time": 0.059303,
    "size_upload": 167002,
    "size_download": 0,
    "speed_download": 0,
    "speed_upload": 5495,
    "download_content_length": -1,
    "upload_content_length": 167002,
    "starttransfer_time": 0.119166,
    "redirect_time": 0,
    "certinfo": [],
    "primary_ip": "127.0.0.1",
    "redirect_url": ""
}
After more monitoring using the Paramedic plugin, I noticed that I would get timeouts when my CPU would hit ~80-98% (no obvious spikes in indexing/searching traffic). I finally stumbled across a helpful thread on the Elasticsearch forum. It seems this happens when the index is doing a refresh and large merges are occurring.
Merges can be throttled at the cluster or index level, and I've lowered indices.store.throttle.max_bytes_per_sec from the default 20mb to 5mb. This can be done at runtime with the cluster update settings API.
PUT /_cluster/settings HTTP/1.1
Host: 127.0.0.1:9200

{
    "persistent" : {
        "indices.store.throttle.max_bytes_per_sec" : "5mb"
    }
}
So far Paramedic is showing a decrease in CPU usage, from an average of ~5-25% down to an average of ~1-5%. Hopefully this will help me avoid the 90%+ spikes that were locking up my queries before; I'll report back by selecting this answer if I don't have any more problems.
As a side note, I guess I could have opted for more balanced EC2 instances (rather than memory-optimized). I think I'm happy with my current choice, but my next purchase will also take more CPU into account.

Getting UDP traffic statistic between particular hosts on Linux

I need to gather some network statistics to test my server application. I've tried many Linux tools, but nothing I've found suits my needs.
Basically I want to gather some UDP statistics (bytes/time_interval, packets/time_interval, packet_loss), but regarding only two particular hosts; for example, I want to get UDP statistics for traffic going from IP_A:PORT_A to IP_B:PORT_B.
Tools like tcpdump/wireshark can easily dump such traffic, but I have problems getting statistics like instantaneous speed (to see throughput peaks), and the Linux system statistics give me numbers for all traffic.
It would be better to get text output so that it's possible to parse.
Does anyone have any idea how I can achieve this?
Thanks in advance
Harnen
Here's a tutorial for the libpcap library:
http://www.systhread.net/texts/200805lpcap1.php
To determine packets lost, your program will probably want to work on a pair of logs and make sure UDP messages seen at the source are also found at the destination. A good method is to maintain a window of packets spanning the amount of time your timeout is set to: load all the packets into the window, sort them, then search for all the packets in the desired time frame, marking them as found as you go. Once you've exhausted a minute, remove half of that minute from the buffer, load the next thirty seconds, and re-sort.
If you have lots (millions? probably should profile it) of packets, it may be faster to use what's called a Counting Bloom Filter, so you can determine if your packet is "probably" in there very quickly.
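For reference, a counting Bloom filter is only a few dozen lines. The sketch below is illustrative (the class name, CRC32-based hashing, and byte-sized counters are all arbitrary choices, not a specific library), but it shows the core idea: counters instead of bits, so entries can be removed once matched.
import java.util.zip.CRC32;

public class CountingBloomFilter {
    private final byte[] counters;
    private final int numHashes;

    public CountingBloomFilter(int size, int numHashes) {
        this.counters = new byte[size];
        this.numHashes = numHashes;
    }

    // Derive k hash positions by seeding CRC32 with the hash index.
    private int index(byte[] key, int seed) {
        CRC32 crc = new CRC32();
        crc.update(seed);
        crc.update(key);
        return (int) (crc.getValue() % counters.length);
    }

    public void add(byte[] packetKey) {
        for (int i = 0; i < numHashes; i++) {
            int idx = index(packetKey, i);
            if (counters[idx] < Byte.MAX_VALUE) counters[idx]++; // saturate, don't overflow
        }
    }

    /** True if the packet is *probably* present; false means definitely absent. */
    public boolean mightContain(byte[] packetKey) {
        for (int i = 0; i < numHashes; i++) {
            if (counters[index(packetKey, i)] == 0) return false;
        }
        return true;
    }

    /** Decrement counters when a packet has been matched or has expired. */
    public void remove(byte[] packetKey) {
        for (int i = 0; i < numHashes; i++) {
            int idx = index(packetKey, i);
            if (counters[idx] > 0) counters[idx]--;
        }
    }
}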
If you weren't looking for programming advice, take your question to serverfault.

Ideal timeout period for dns lookup

In my Rails app I do an nslookup using the Ruby library resolv. If a site like dgdfgdfgdfg.com is entered, it takes too long to resolve, in some instances around 20 seconds (mostly for non-existent sites), and this causes the application to slow down.
So I thought of introducing a timeout period for the DNS lookup.
What would be an ideal timeout period for the DNS lookup so that resolution of actual sites doesn't fail? Would something like 10 seconds be fine?
There's no IETF-mandated value, although §6.1.3.3 of RFC 1123 suggests a value not less than 5 seconds.
Perl's Net::DNS and the command line dig utility do default to 5 seconds between retries. Some versions of the Microsoft resolver appear to default to 3 seconds.
You can run some tests among your users to find the right number, balancing responsiveness against performance.
You can also adjust that timeout dynamically depending on network conditions.
For example, for every successful resolve, save how much time it took. Every hour (for example), calculate the average and set double that value as the timeout (remember that the average is, roughly speaking, "the middle"). This way, if your latency is high at some point, the timeout period auto-adjusts upward.
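That scheme is only a few lines in any language. A minimal sketch in Java (the question's context is Ruby, but the logic is identical; all names here are illustrative):
import java.util.ArrayDeque;
import java.util.Deque;

public class AdaptiveDnsTimeout {
    private final Deque<Long> samplesMillis = new ArrayDeque<>();
    private volatile long timeoutMillis = 5000; // start at the RFC 1123 floor

    /** Record the duration of each successful lookup. */
    public synchronized void recordSuccess(long tookMillis) {
        samplesMillis.addLast(tookMillis);
        if (samplesMillis.size() > 1000) samplesMillis.removeFirst(); // keep recent samples only
    }

    /** Call periodically (e.g. hourly) to re-derive the timeout. */
    public synchronized void recalibrate() {
        if (samplesMillis.isEmpty()) return;
        long sum = 0;
        for (long s : samplesMillis) sum += s;
        long avg = sum / samplesMillis.size();
        // Double the average so typical lookups comfortably fit under the timeout.
        timeoutMillis = Math.max(2 * avg, 1000); // arbitrary 1s floor
    }

    public long currentTimeoutMillis() {
        return timeoutMillis;
    }
}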
