I have been using JmDNS for a while now, and it does what my application needs. Everything works fine for me (I have "announcer" machines and a "listening" one, and the listening machine can see the other devices and discover their information).
It is true that I've managed to work with the JmDNS jar file, but I did it without fully understanding what is going on inside it. Now I want to know how using JmDNS affects network traffic. I have consulted the documentation but couldn't work out the meaning of the constants, like QUERY_WAIT_INTERVAL, PROBE_THROTTLE_COUNT, etc.
I want to know the default frequency with which the announcer machine sends service announcements.
I also noticed DNS_TTL that was described as follows: "The default TTL is set to 1 hour by the standard, so a record is going to stay in the cache of any listening machine for an hour without need to ping the server again".
I understand that it is the Time To Live of the service record in the DNS cache, but I couldn't understand what is meant by "ping the server". Does it mean that the listener has to ask the announcer about a service when the DNS_TTL expires? If so, why do we need to have the announcer announce its service every 1s (ANNOUNCE_WAIT_INTERVAL = 1000 milliseconds)?
I am so confused.
The way that the Domain Name System works is basically very simple. Fundamentally it's a tree-like system which starts with the root nameservers. These then delegate name space out to the next level. That level in turn delegates out the next level and so on. For example . is the root, which delegates to .com., which can then delegate out example.com.. (Yes, that trailing . is actually part of the domain name, though you almost never have to use it or see it.)
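To make the delegation idea concrete, here is a minimal sketch that lists the names a resolver would walk through from the root down. The function name delegation_chain is made up for this illustration, and it assumes every label boundary is a delegation point, which is not always true in real zones.

```python
def delegation_chain(name):
    """List the names from the root down to `name`, one per label.

    Assumption for illustration: each label boundary is treated as a
    possible delegation point. Real zones do not always cut at every label.
    """
    labels = [label for label in name.rstrip(".").split(".") if label]
    chain = ["."]  # the root
    # Walk from the rightmost label (closest to the root) to the leftmost.
    for i in range(len(labels) - 1, -1, -1):
        chain.append(".".join(labels[i:]) + ".")
    return chain


print(delegation_chain("example.com"))
```

Calling it with "www.example.com." simply adds one more level at the end of the list.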
When you load a web page there are usually hundreds of elements that load. This is every image, every JS file, every CSS file, etc. To have your computer request that same domain to IP resolution that many times for one page would make load time unbearable and also create massive unnecessary traffic on the nameserver. Therefore DNS caches. The TTL is how long it caches for. If it's set to 24 hours then when you get an answer for that resolution, that's how long you can hold on to it for before you make another request.
The announcing that you're talking about is the nameserver basically announcing that it's responsible for those domains. You want it constantly stating that so other nameservers know where to go to get the correct (authoritative) data.
Throttling is a term used in many fields and applications and means you're limiting your traffic flow so it doesn't get overloaded.
DNS is actually quite simple to understand once you get the basics down.
Here are a few links that could help you get a better grip of it all:
Few paragraphs of basic DNS info
About.com guide
A few definitions
Relatively simple and informative PDF from IETF
I am hosting a RESTful API and my problem is that every first inbound request after a certain time will take about three seconds, compared to the normal ~100ms.
What I find most interesting is that it always takes between 3100 and around 3250 milliseconds, not more and not less. So it seems pretty intentional to me.
I've already debugged the API and everything runs pretty much instantly except for one thing and that is this three second delay before my API even starts to receive the request.
My best guess is that something went wrong either in Apache or the DNS resolution but I don't know what exactly causes it (that's why I'm asking this question).
I am using the Apache ProxyPass like this:
ProxyRequests off
Timeout 54
ProxyTimeout 5400
ProxyPass /jokeapi http://localhost:8079
ProxyPassReverse /jokeapi http://localhost:8079
I'm using the Cloudflare/APNIC DNS gateway servers 1.1.1.1 and 1.0.0.1
Additionally, all my requests get routed through a Cloudflare SSL proxy before even reaching my network.
I've even partially rewritten the API so it responds with ReadStreams instead of loading the files into RAM and serving it at once but that didn't fix the problem.
My question is how I can fully debug the route a request takes and see precisely where this 3 second delay comes from.
Thanks!
PS: the server runs on NodeJS
I think the key is not related to network activity, but in the note that after a period of idle activity the first response to the API in a while requires slightly over 3 seconds. I am assuming that follow up actions are back to the 100ms window.
As you are using localhost, this is not a routing issue. If you want, you can just as easily use loopback, 127.0.0.1, to avoid a name resolution hit, but such a hit on a reserved hostname would be microseconds.
I suspect that the compiled version of your RESTful function has aged out of the cache for your system. The first hit after a period of non-use then requires a recompile, and so long as the compiled instructions are exercised regularly they will remain in cache and continue to respond in the 100ms range. We observe this condition quite often in multiuser performance testing after cold boots of systems (setting initial conditions). Ramp-ups of the test users take the hit for the recompiles of common code before hitting the time under full load.
Another item to rule out on the network side of the house: DNS timeouts and BIND cache entries tend to be quite long, usually significant portions of a day or even longer. Even so, the odds are that a DNS lookup for an item which has aged out of the BIND cache would not add three seconds to your initial connection time.
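If you want numbers rather than guesses, a rough sketch like the one below times name resolution, TCP connect, and the wait for the first response byte separately. It is only an illustration: the function name time_request_phases is made up, it speaks plain HTTP over a raw socket (no TLS, so it will not go through the Cloudflare proxy), and the host, port and path you pass in are your own assumptions about where the API listens.

```python
import socket
import time


def time_request_phases(host, port, path="/", timeout=10.0):
    """Return seconds spent resolving, connecting, and waiting for the first byte."""
    t0 = time.perf_counter()
    # Resolve the name on its own so a slow lookup is visible separately.
    family, socktype, proto, _, sockaddr = socket.getaddrinfo(
        host, port, type=socket.SOCK_STREAM)[0]
    t1 = time.perf_counter()

    sock = socket.socket(family, socktype, proto)
    sock.settimeout(timeout)
    try:
        sock.connect(sockaddr)
        t2 = time.perf_counter()
        request = f"GET {path} HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n"
        sock.sendall(request.encode("ascii"))
        first = sock.recv(1)  # blocks until the server starts answering
        t3 = time.perf_counter()
    finally:
        sock.close()

    return {
        "resolve": t1 - t0,
        "connect": t2 - t1,
        "first_byte": t3 - t2,
        "got_data": bool(first),
    }
```

Running it once against Apache and once directly against the Node port (8079 in the question) after an idle period should show whether the three seconds sit in front of the proxy or behind it.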
I have some questions to better understand DNS mechanism:
1) I know between clients and authoritative DNS server there are some intermediate DNS servers like ISP's one. What and where are the other types of them?
2) After the TTL of an NS record is expired in intermediate DNS servers, when do they refresh the addresses of names? Clients request? or right after expiration, they refresh records?
Thanks.
Your question is off topic here as not related to programming.
But:
I know between clients and authoritative DNS server there are some intermediate DNS servers like ISP's one. What and where are the other types of them?
There are only two types of DNS servers (we will put aside the stub case for now): it is either an authoritative nameserver (holding information about some domains and being the trusted source of it) or a recursive one, attached to a cache, that basically starts with no data and then progressively, based on the queries it gets, does various queries of its own to get information.
Technically, a single server could do both, but it is a bad idea for at least the reason of the cache, and the different population of clients: an authoritative nameserver is normally open to any client as it needs to "broadcast" its data everywhere while a recursive nameserver is normally only for a selected list of clients (like the ISP clients).
There exist open public recursive nameservers today, run by big organizations: CloudFlare, Google, Quad9, etc. However, they have the hardware, links, and manpower to handle all the issues that come with public recursive nameservers, like DDoS with amplification.
Technically you can have a farm of recursive nameservers, as big ISPs (or the big public ones above) need to do because no single instance could sustain all client queries. They can either share a single cache or work in a hierarchy, the bottom ones sending their queries to another upstream recursive nameserver, etc.
After the TTL of an NS record is expired in intermediate DNS servers, when do they refresh the addresses of names? Clients request? or right after expiration, they refresh records?
The historic naïve way could be summarized as: a request arrives, do I have it in my cache? If no, query outside for it and cache it. If yes, is it expired in my cache? If no, ship it to the client, but if yes we need to remove it from the cache and then proceed as if it had never been in the cache.
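That naïve flow can be sketched in a few lines. This is only an illustration of the logic described above, not how any particular resolver is written: the class name NaiveDnsCache and the resolve_upstream callable (returning a value and a TTL in seconds) are hypothetical.

```python
import time


class NaiveDnsCache:
    """Cache answers until their TTL runs out, then ask upstream again."""

    def __init__(self, resolve_upstream, clock=time.monotonic):
        # resolve_upstream is a hypothetical callable: name -> (value, ttl_seconds)
        self._resolve = resolve_upstream
        self._clock = clock
        self._entries = {}  # name -> (value, expires_at)

    def lookup(self, name):
        entry = self._entries.get(name)
        if entry is not None:
            value, expires_at = entry
            if self._clock() < expires_at:
                return value  # in cache and still fresh
            del self._entries[name]  # expired: treat as if never cached
        value, ttl = self._resolve(name)  # query outside
        self._entries[name] = (value, self._clock() + ttl)
        return value
```

Note that nothing happens at the moment of expiry here. The refresh only happens when the next client request for that name arrives.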
You then have various variations:
some caches do not exactly honor the TTLs: some clamp values that are too low or too high, based on their own local policies. The most widely agreed reading of the specification is that the TTL is an indication of the maximum amount of time to keep the record in cache, which means the client is free to ditch it earlier. However, it should not rewrite it to a higher value if it thinks it is too low.
caches can be kept across reboots/restarts, and can be prefetched, especially for "popular" records; in a way, the list of root NS is prefetched at boot and compared to the internal hardcoded list, in order to update it
caches, especially in RAM, may need to be trimmed, typically on an "oldest removed" basis, in order to make room for new records coming along the way.
so depending on how the cache is managed and which features it is requested to have, there may be a background task that monitors expirations and refreshes records.
I recommend you have a look at unbound as a recursive nameserver as it has various settings around TTL handling, so you could learn things, and then read the code itself (which brings us back on-topic, kind of).
You can also read this document: https://www.ietf.org/archive/id/draft-wkumari-dnsop-hammer-03.txt an IETF Internet-Draft about:
The principle is that popular RRset in the cache are fetched, that is
to say resolved before their TTL expires and flushed. By fetching
RRset before they are being queried by an end user, that is to say
prefetched, HAMMER is expected to improve the quality of experience
of the end users as well as to optimize the resources involved in
large DNSSEC resolving platforms.
Make sure to read Appendix A with a lot of useful examples, such as:
Unbound already does this (they use a percentage of TTL, instead of a number
of seconds).
OpenDNS that they also implement something similar.
BIND as of 9.10, around Feb
2014 now implements something like this
(https://deepthought.isc.org/article/AA-01122/0/Early-refresh-of-cache-records-cache-prefetch-in-BIND-9.10.html), and enables it by
default.
A number of recursive resolvers implement techniques similar to the
techniques described in this document. This section documents some
of these and tradeoffs they make in picking their techniques.
And to take one example, the Bind one, you can read:
BIND 9.10 prefetch works as follows. There are two numbers that control it. The first number is the "eligibility". Only records that arrive with TTL values bigger than the configured eligibility will be considered for prefetch. The second number is the "trigger". If a query arrives asking for data that is cached with fewer than "trigger" seconds left before it expires, then in addition to returning that data as the reply to the query, BIND will also ask the authoritative server for a fresh copy. The intention is that the fresh copy would arrive before the existing copy expires, which ensures a uniform response time.
BIND 9.10 prefetch values are global options. You cannot ask for different prefetch behavior in different domains. Prefetch is enabled by default. To turn it off, specify a trigger value of 0. The following command specifies a trigger value of 2 seconds and an eligibility value of 9 seconds, which are the defaults.
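The decision described in that quote can be modelled in a few lines. This is only a sketch of the rule as worded above, not BIND's actual code: the function name prefetch_decision and its parameters are made up for illustration, with the quoted defaults of 2 and 9 seconds.

```python
def prefetch_decision(original_ttl, remaining_ttl, trigger=2, eligibility=9):
    """Return True if a cache hit should also start a background refresh.

    original_ttl: TTL in seconds the record had when it arrived in the cache.
    remaining_ttl: seconds left before the cached copy expires.
    """
    if trigger == 0:
        return False  # a trigger of 0 turns prefetch off
    if original_ttl <= eligibility:
        return False  # only records that arrived with a TTL above eligibility qualify
    return remaining_ttl < trigger  # close enough to expiry: refresh in the background


print(prefetch_decision(original_ttl=300, remaining_ttl=1))
print(prefetch_decision(original_ttl=300, remaining_ttl=120))
```

The first call is a hit on a record that is about to expire, so it asks for a refresh; the second still has plenty of time left and does not.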
As simple as that. I went through quite a lot of articles on the internet and all of them just go on about how updated/modified DNS records take time to propagate and so on. I may be stupid (most likely I am), but the whole situation is not very clear. Especially the following:
Do new (absolutely new records) propagate?
Example: we have an old domain, with propagated nameservers, IP, etc and add a TXT record to it. No TXT records existed previously. Is it applied immediately, after some time or after TTL?
Is there any influence on this from local DNS, cache, ISP or anything else?
Thank you.
There are at least two things being mixed under the term "propagation" here.
One is various caches of local resolvers and recursing name servers remembering information for a set amount of time before they go out and ask an authoritative server again. This has no relevance to your question, but it is what many of those articles you read were talking about.
The other is data moving from a master name server to its secondary name servers. This is relevant to your question. A master name server is where data gets injected into DNS from outside, so that's where your new records begin their lives. Secondary servers check with the master server for new data when they think enough time has passed or when they get prodded to do so (usually, the master server is set to prod them when its information is updated). The way they tell if they need to re-fetch a zone from the master or not is by comparing the serial number in the zone's SOA record between what they have stored locally and what the server has. If the number at the master is higher, the secondary will fetch the whole zone again (usually, other options exist). If the number at the master is not higher, the secondary will assume the information it has is up to date, and do nothing.
The most common reason, by far, for new records not propagating to secondaries is that whoever added the new records forgot to increase the serial number in the SOA record.
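The serial check itself is simple. Below is a minimal sketch, assuming 32-bit serial numbers compared with the wraparound rule from RFC 1982; the function name secondary_should_transfer is made up, and real servers layer refresh timers and notifications on top of this.

```python
def secondary_should_transfer(local_serial, master_serial):
    """Return True if the master's SOA serial is newer than the local one.

    Uses RFC 1982 style comparison so that a serial wrapping around
    from 4294967295 to 0 still counts as an increase.
    """
    half = 2 ** 31
    local = local_serial % 2 ** 32
    master = master_serial % 2 ** 32
    if local == master:
        return False  # same serial: assume the local copy is up to date
    return (local < master and master - local < half) or (
        local > master and local - master > half
    )


# Records were added but the serial was not bumped: nothing is fetched.
print(secondary_should_transfer(2024010101, 2024010101))
# Serial increased by one: the secondary fetches the zone.
print(secondary_should_transfer(2024010101, 2024010102))
```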
In my company we experienced a serious problem today: our production server went down. Most people accessing our software via a browser were unable to get a connection, however people who had already been using the software were able to continue using it. Even our hot standby server was unable to communicate with the production server, which it does using HTTP, not even going out to the broader internet. The whole time the server was accessible via ping and ssh, and in fact was quite underloaded - it's normally running at 5% CPU load and it was even lower at this time. We do almost no disk i/o.
A few days after the problem started we have a new variation: port 443 (HTTPS) is responding but port 80 stopped responding. The server load is very low. Immediately after restarting tomcat, port 80 started responding again.
We're using tomcat7, with maxThreads="200", and using maxConnections=10000. We serve all data out of main memory, so each HTTP request completes very quickly, but we have a large number of users doing very simple interactions (this is high school subject selection). But it seems very unlikely we would have 10,000 users all with their browser open on our page at the same time.
My question has several parts:
Is it likely that the "maxConnections" parameter is the cause of our woes?
Is there any reason not to set "maxConnections" to a ridiculously high value e.g. 100,000? (i.e. what's the cost of doing so?)
Does tomcat output a warning message anywhere once it hits the "maxConnections" limit? (We didn't notice anything).
Is it possible there's an OS limit we're hitting? We're using CentOS 6.4 (Linux) and "ulimit -f" says "unlimited". (Do firewalls understand the concept of TCP/IP connections? Could there be a limit elsewhere?)
What happens when tomcat hits the "maxConnections" limit? Does it try to close down some inactive connections? If not, why not? I don't like the idea that our server can be held to ransom by people having their browsers on it, sending keep-alives to keep the connection open.
But the main question is, "How do we fix our server?"
More info as requested by Stefan and Sharpy:
Our clients communicate directly with this server
TCP connections were in some cases immediately refused and in other cases timed out
The problem is evident even connecting my browser to the server within the network, or with the hot standby server - also in the same network - unable to do database replication messages which normally happens over HTTP
IPTables - yes, IPTables6 - I don't think so. Anyway, there's nothing between my browser and the server when I test after noticing the problem.
More info:
It really looked like we had solved the problem when we realised we were using the default Tomcat7 setting of BIO, which has one thread per connection, and we had maxThreads=200. In fact 'netstat -an' showed about 297 connections, which matches 200 + queue of 100. So we changed this to NIO and restarted tomcat. Unfortunately the same problem occurred the following day. It's possible we misconfigured the server.xml.
The server.xml and extract from catalina.out is available here:
https://www.dropbox.com/sh/sxgd0fbzyvuldy7/AACZWoBKXNKfXjsSmkgkVgW_a?dl=0
More info:
I did a load test. I'm able to create 500 connections from my development laptop, and do an HTTP GET 3 times on each, without any problem. Unless my load test is invalid (the Java class is also in the above link).
It's hard to tell for sure without hands-on debugging, but one of the first things I would check is the file descriptor limit (that's ulimit -n). TCP connections consume file descriptors, and depending on which implementation is in use, NIO connections that do polling using SelectableChannel may eat several file descriptors per open socket.
To check if this is the cause:
Find Tomcat PIDs using ps
Check the ulimit the process runs with: cat /proc/<PID>/limits | fgrep 'open files'
Check how many descriptors are actually in use: ls /proc/<PID>/fd | wc -l
If the number of used descriptors is significantly lower than the limit, something else is the cause of your problem. But if it is equal or very close to the limit, it's this limit which is causing issues. In this case you should increase the limit in /etc/security/limits.conf for the user with whose account Tomcat is running and restart the process from a newly opened shell, check using /proc/<PID>/limits if the new limit is actually used, and see if Tomcat's behavior is improved.
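If you want to repeat that check regularly, the same two reads from /proc can be scripted. This is a rough sketch for Linux only; the function names are made up, and you need permission to read the Tomcat process's /proc entries (same user or root).

```python
from pathlib import Path


def parse_open_files_limit(limits_text):
    """Return the soft 'Max open files' limit from /proc/<PID>/limits text.

    Returns None when the limit is reported as 'unlimited'.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            soft = line.split()[3]  # columns: Max open files <soft> <hard> files
            return None if soft == "unlimited" else int(soft)
    raise ValueError("no 'Max open files' line found")


def descriptor_usage(pid):
    """Return (descriptors in use, soft limit) for the given process."""
    used = len(list(Path(f"/proc/{pid}/fd").iterdir()))
    limit = parse_open_files_limit(Path(f"/proc/{pid}/limits").read_text())
    return used, limit
```

Calling descriptor_usage with the Tomcat PID from ps gives the same two numbers as the commands above, so you can log them over time and see whether usage creeps towards the limit before an outage.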
While I don't have a direct answer to solve your problem, I'd like to offer my methods to find what's wrong.
Intuitively there are 3 assumptions:
If your clients hold their connections and never release them, it is quite possible your server hits the max connection limit even if there is no communication.
The non-responding state can also be reached via various ways such as bugs in the server-side code.
The hardware conditions should not be ignored.
To locate the cause of this problem, you'd better try to reproduce the scenario in a testing environment. Perform more comprehensive tests and record more detailed logs, including but not limited to:
Unit tests, esp. logic blocks using transactions, threading and synchronizations.
Stress-oriented tests. Try to simulate all the user behaviors you can come up with and their combinations and test them in a massive batch mode. (ref)
More specific logging. Trace client behaviors and analyse what happened exactly before the server stopped responding.
Replace a server machine and see if it will still happen.
The short answer:
Use the NIO connector instead of the default BIO connector
Set "maxConnections" to something suitable e.g. 10,000
Encourage users to use HTTPS so that intermediate proxy servers can't turn 100 page requests into 100 tcp connections.
Check for threads hanging due to deadlock problems, e.g. with a stack dump (kill -3)
(If applicable and if you're not already doing this, write your client app to use the one connection for multiple page requests).
The long answer:
We were using the BIO connector instead of NIO connector. The difference between the two is that BIO is "one thread per connection" and NIO is "one thread can service many connections". So increasing "maxConnections" was irrelevant if we didn't also increase "maxThreads", which we didn't, because we didn't understand the BIO/NIO difference.
To change it to NIO, put this in the <Connector> element in server.xml:
protocol="org.apache.coyote.http11.Http11NioProtocol"
From what I've read, there's no benefit to using BIO so I don't know why it's the default. We were only using it because it was the default and we assumed the default settings were reasonable and we didn't want to become experts in tomcat tuning to the extent that we now have.
HOWEVER: Even after making this change, we had a similar occurrence: on the same day, HTTPS became unresponsive even while HTTP was working, and then a little later the opposite occurred. Which was a bit depressing. We checked in 'catalina.out' that in fact the NIO connector was being used, and it was. So we began a long period of analysing 'netstat' and wireshark. We noticed some periods of high spikes in the number of connections - in one case up to 900 connections when the baseline was around 70. These spikes occurred when we synchronised our databases between the main production server and the "appliances" we install at each customer site (schools). The more we did the synchronisation, the more we caused outages, which caused us to do even more synchronisations in a downward spiral.
What seems to be happening is that the NSW Education Department proxy server splits our database synchronisation traffic into multiple connections so that 1000 page requests become 1000 connections, and furthermore they are not closed properly until the TCP 4 minute timeout. The proxy server was only able to do this because we were using HTTP. The reason they do this is presumably load balancing: they thought that by splitting the page requests across their 4 servers, they'd get better load balancing. Now that we have switched to HTTPS, they are unable to do this and are forced to use just one connection. So that particular problem is eliminated and we no longer see a burst in the number of connections.
People have suggested increasing "maxThreads". In fact this would have improved things but this is not the 'proper' solution - we had the default of 200, but at any given time, hardly any of these were doing anything, in fact hardly any of these were even allocated to page requests.
I think you need to load-test the application with Apache JMeter to see how many connections it handles, and use JConsole or Zabbix to watch heap space and take thread dumps of the Tomcat server.
The NIO connector of Apache Tomcat can have a maximum of 10000 connections, but I don't think it's a good idea to give that many connections to one instance of Tomcat. A better way to do this is to run multiple instances of Tomcat.
In my view the best way for a production server is to run Apache HTTP Server in front and connect your Tomcat instance to it using the AJP connector.
Hope this helps.
Are you absolutely sure you're not hitting the maxThreads limit? Have you tried changing it?
These days browsers limit simultaneous connections to a max of 4 per hostname/ip, so if you have 50 simultaneous browsers, you could easily hit that limit. Although hopefully your webapp responds quickly enough to handle this. Long polling has become popular these days (until websockets are more prevalent), so you may have 200 long polls.
Another cause could be if you use HTTP[S] for app-to-app communication (that is, no browser involved). Sometimes app writers are sloppy and create new connections for performing multiple tasks in parallel, causing TCP and HTTP overhead. Double check that you are not getting a flood of requests. Log files can usually help you with this, or you can use wireshark to count the number of HTTP requests or HTTP[S] connections. If possible, modify your API to handle multiple API calls in one HTTP request.
Related to the last one, if you have many HTTP/1.1 requests going across one connection, an intermediate proxy may be splitting them into multiple connections for load balancing purposes. Sounds crazy I know, but I've seen it happen.
Lastly, some crawl bots ignore the crawl delay set in robots.txt. Again, log files and/or wireshark can help you determine this.
Overall, run more experiments with more changes. maxThreads, https, etc. before jumping to conclusions with maxConnections.
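As a starting point for the log file check, here is a small sketch that counts requests per client. It assumes your access log has the client address as the first whitespace-separated field, as in the common log format; the function name requests_per_client and the sample lines are made up for illustration.

```python
from collections import Counter


def requests_per_client(log_lines):
    """Count requests per client, taking the first field of each line as the client."""
    counts = Counter()
    for line in log_lines:
        fields = line.split()
        if fields:  # skip blank lines
            counts[fields[0]] += 1
    return counts


sample = [
    '203.0.113.5 - - [01/Jan/2015:10:00:00 +0000] "GET /a HTTP/1.1" 200 512',
    '203.0.113.5 - - [01/Jan/2015:10:00:01 +0000] "GET /b HTTP/1.1" 200 128',
    '198.51.100.7 - - [01/Jan/2015:10:00:02 +0000] "GET /a HTTP/1.1" 200 512',
]
for client, count in requests_per_client(sample).most_common():
    print(client, count)
```

A single address with far more requests than the rest points at a proxy, a sloppy client app or a crawler rather than at ordinary browser users.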
In every paper I have read about crawler proposals, I see that one important component is the DNS Resolver.
My question is:
Why is it necessary? Can't we just make a request to http://www.some-domain.com/?
DNS resolution is a well-known bottleneck in web crawling. Due to the
distributed nature of the Domain Name Service, DNS resolution may
entail multiple requests and round-trips across the internet,
requiring seconds and sometimes even longer. Right away, this puts in
jeopardy our goal of fetching several hundred documents a second.
There is another important difficulty in DNS resolution; the lookup
implementations in standard libraries (likely to be used by anyone
developing a crawler) are generally synchronous. This means that once
a request is made to the Domain Name Service, other crawler threads at
that node are blocked until the first request is completed. To
circumvent this, most web crawlers implement their own DNS resolver as
a component of the crawler.
http://nlp.stanford.edu/IR-book/html/htmledition/dns-resolution-1.html
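To illustrate what such a component might look like, here is a minimal sketch that runs blocking lookups on a thread pool and remembers the results, so that fetcher threads are not stalled behind one slow lookup. It is not taken from any real crawler: the class name CrawlerResolver is made up, and it never expires entries, which a real resolver would have to do according to the TTL.

```python
import socket
import threading
from concurrent.futures import ThreadPoolExecutor


class CrawlerResolver:
    """Resolve host names in background threads and reuse earlier answers."""

    def __init__(self, lookup=socket.gethostbyname, workers=32):
        self._lookup = lookup  # blocking lookup function, replaceable for testing
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._futures = {}  # host -> Future holding the address
        self._lock = threading.Lock()

    def resolve(self, host):
        """Return a Future for the address; the lookup runs at most once per host."""
        with self._lock:
            future = self._futures.get(host)
            if future is None:
                future = self._pool.submit(self._lookup, host)
                self._futures[host] = future
        return future

    def close(self):
        self._pool.shutdown(wait=True)
```

The crawler can call resolve() for every host in its queue up front and only call result() on the returned Future when it is about to fetch from that host, by which time the answer is usually already there.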