How to deal with millions of queries to a DNS server?

I'm wondering how modern DNS servers deal with millions of queries per second, given that the txnid field is a uint16.
Let me explain. There is an intermediate server: on one side, clients send DNS requests to it, and on the other side the server itself sends requests to an upstream DNS server (8.8.8.8, for example). According to the DNS protocol, the header contains a txnid field, which must be identical in a request and its response. Obviously, an intermediate DNS server with multiple clients replaces this value with its own txnid (a counter), sends the request to the external DNS server, and after resolution restores the client's original value. All of this works fine for up to 65536 simultaneous requests because the field is a uint16. But what if we have hundreds of millions of them, like Google's DNS servers?

Going from your Google DNS server example:
In mid-2018 their servers were handling 1.2 trillion queries per day. Extrapolating that growth suggests their service is currently handling ~20 million queries per second
They say that successful resolution of a cache-miss takes ~130ms, but taking timeouts into account pushes the average time up to ~400ms
I can't find any numbers on what their cache-hit rates are like, but I'd assume it's more than 90%. And presumably it increases with the popularity of their service
Putting the above together (2e7 * 0.4 * (1-0.9)), we get roughly 1M transactions active at any one time, so you have to find at least 20 bits of state somewhere. 16 bits come for free from the txnid field. As Steffen points out, you can also use port numbers, which might give you another ~15 bits of state. These two sources alone give you more than enough state to run something orders of magnitude bigger than Google's DNS system.
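The estimate above can be reproduced in a few lines. This is only a sketch of the arithmetic: the inputs are the rough figures quoted in this answer (the 90% cache-hit rate in particular is a guess), and the function names are made up for illustration.

```python
import math


def in_flight_queries(qps, avg_miss_latency_s, cache_hit_rate):
    """Average number of upstream transactions outstanding at any instant."""
    return qps * avg_miss_latency_s * (1.0 - cache_hit_rate)


def bits_needed(n):
    """Smallest number of bits that can label n distinct transactions."""
    return math.ceil(math.log2(n))


# Assumed inputs: ~20M queries/s, ~400ms average miss latency, 90% hit rate.
active = in_flight_queries(2e7, 0.4, 0.9)
needed = bits_needed(active)
available = 16 + 15  # txnid bits plus the ~15 bits assumed for source ports

print(round(active), needed, available)
```

With these inputs the result is about 800,000 in-flight transactions, which needs 20 bits, against roughly 31 bits available from txnid plus source port.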
That said, you could also just relegate transaction IDs to preventing any cache-poisoning attacks, i.e. reject any answers where the txnid doesn't match the inflight query for that question. If this check passes, then add the answer to the cache and resume any waiting clients.
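A minimal sketch of that bookkeeping, assuming a single-threaded resolver and a hypothetical `InFlightTable` class (none of these names come from a real DNS implementation). Each upstream query gets a random source port and txnid, duplicate questions share one upstream query, and an answer is accepted only if port, txnid, and question all match.

```python
import secrets


class InFlightTable:
    """Tracks upstream queries by (source port, txnid) and groups waiting clients."""

    def __init__(self):
        self._by_key = {}   # (port, txnid) -> question
        self._waiters = {}  # question -> list of waiting clients

    def start(self, question, client):
        """Return the (port, txnid) for a new upstream query, or None if the
        question is already in flight and the client was simply queued."""
        if question in self._waiters:
            self._waiters[question].append(client)
            return None
        while True:
            # Assumed port range 32768-65535, i.e. 15 bits of extra state.
            key = (32768 + secrets.randbelow(32768), secrets.randbelow(1 << 16))
            if key not in self._by_key:
                break
        self._by_key[key] = question
        self._waiters[question] = [client]
        return key

    def finish(self, port, txnid, question):
        """Return the clients to resume if the answer matches an in-flight
        query; return an empty list (reject the answer) otherwise."""
        if self._by_key.get((port, txnid)) != question:
            return []
        del self._by_key[(port, txnid)]
        return self._waiters.pop(question)
```

A caller would cache the answer only when `finish` returns a non-empty list, then reply to each returned client with that client's own original txnid.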

Related

Trace a request going through the clearnet / Cloudflare / Apache to precisely find out performance issues

I am hosting a RESTful API and my problem is that every first inbound request after a certain time will take about three seconds, compared to the normal ~100ms.
What I find most interesting is that it always takes between 3100 and around 3250 milliseconds, no more and no less, so it seems pretty intentional to me.
I've already debugged the API and everything runs pretty much instantly except for one thing and that is this three second delay before my API even starts to receive the request.
My best guess is that something went wrong either in Apache or the DNS resolution but I don't know what exactly causes it (that's why I'm asking this question).
I am using the Apache ProxyPass like this:
ProxyRequests off
Timeout 54
ProxyTimeout 5400
ProxyPass /jokeapi http://localhost:8079
ProxyPassReverse /jokeapi http://localhost:8079
I'm using the Cloudflare/APNIC DNS gateway servers 1.1.1.1 and 0.0.0.0
Additionally, all my requests get routed through a Cloudflare SSL proxy before even reaching my network.
I've even partially rewritten the API so it responds with ReadStreams instead of loading the files into RAM and serving it at once but that didn't fix the problem.
My question is how I can fully debug the route a request takes and see precisely where this 3 second delay comes from.
Thanks!
PS: the server runs on NodeJS
I think the key is not network activity but your note that, after an idle period, the first response from the API takes slightly over 3 seconds. I am assuming that follow-up requests are back in the 100ms window.
As you are using localhost, this is not a routing issue. If you want, you can just as easily use loopback, 127.0.0.1, to avoid a name resolution hit, but such a hit on a reserved hostname would be microseconds.
I suspect that the compiled version of your RESTful function has aged out of the cache for your system. The first hit after a period of non-use then requires a recompile, and as long as the compiled instructions keep being exercised they will remain in cache and continue to respond in the 100ms range. We observe this condition quite often in multiuser performance testing after cold boots of systems (setting initial conditions). Ramp-ups of the test users take the hit for the recompiles of common code before hitting the time under full load.
One more point on the network side of the house: DNS timeouts and bind cache entries tend to be quite long, usually significant portions of a day or even longer. Even so, it is unlikely that a DNS lookup for an item that has aged out of the bind cache would add three seconds to your initial connection time.
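One way to check where the three seconds go is to time each phase of a request separately from a client. The following is a rough sketch in Python (used here for illustration, although the API itself runs on NodeJS); `time_request_phases` is a hypothetical helper, it speaks plain HTTP only, and it does not measure a TLS handshake.

```python
import socket
import time


def time_request_phases(host, port, path="/", timeout=10.0):
    """Return seconds spent on name lookup, TCP connect, and waiting for the
    first response byte of a plain-HTTP GET request."""
    t0 = time.perf_counter()
    family, socktype, proto, _, addr = socket.getaddrinfo(
        host, port, type=socket.SOCK_STREAM
    )[0]
    t1 = time.perf_counter()

    sock = socket.socket(family, socktype, proto)
    sock.settimeout(timeout)
    try:
        sock.connect(addr)
        t2 = time.perf_counter()
        request = f"GET {path} HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n"
        sock.sendall(request.encode("ascii"))
        first = sock.recv(1)  # blocks until the server starts answering
        t3 = time.perf_counter()
    finally:
        sock.close()

    return {
        "dns": t1 - t0,
        "connect": t2 - t1,
        "first_byte": t3 - t2,
        "got_data": bool(first),
    }


# Example (not executed here): compare the backend with the Apache front end.
# print(time_request_phases("localhost", 8079, "/"))
# print(time_request_phases("localhost", 80, "/jokeapi"))
```

If the delay shows up in `first_byte` when talking to the backend port directly, the time is being spent inside the application rather than in DNS or the proxy.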

What happens after the DNS TTL expires in an intermediate name server?

I have some questions to better understand the DNS mechanism:
1) I know that between clients and authoritative DNS servers there are some intermediate DNS servers, such as the ISP's. What are the other types, and where are they?
2) After the TTL of an NS record expires in intermediate DNS servers, when do they refresh the addresses of names? On a client request, or do they refresh records right after expiration?
Thanks.
Your question is off topic here as not related to programming.
But:
I know that between clients and authoritative DNS servers there are some intermediate DNS servers, such as the ISP's. What are the other types, and where are they?
There are only two types of DNS servers (we will put aside the stub case for now): a server is either an authoritative nameserver (holding information about some domains and being the trusted source for it) or a recursive one, attached to a cache, which starts with no data and progressively performs queries to obtain information, driven by the queries it receives.
Technically, a single server could do both, but that is a bad idea, if only because of the cache and the different populations of clients: an authoritative nameserver is normally open to any client, as it needs to "broadcast" its data everywhere, while a recursive nameserver is normally only for a selected list of clients (like the ISP's customers).
Open public recursive nameservers exist today, run by big organizations: Cloudflare, Google, Quad9, etc. However, they have the hardware, links, and manpower to handle all the issues that come with public recursive nameservers, like DDoS with amplification.
Technically you can have a farm of recursive nameservers, as big ISPs (or the big public ones above) need to, because no single instance could sustain all client queries. The instances can either share a single cache or work in a hierarchy, with the bottom ones forwarding their queries to an upstream recursive nameserver, and so on.
After the TTL of an NS record expires in intermediate DNS servers, when do they refresh the addresses of names? On a client request, or do they refresh records right after expiration?
The historic naive way could be summarized as: a request arrives; do I have it in my cache? If not, query outside for it and cache it. If yes, has it expired in my cache? If not, ship it to the client; if it has, remove it from the cache and proceed as if it had never been cached.
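That lookup logic can be sketched as follows. This is an illustration only, not how any particular resolver is written: `TtlCache` and the `resolve` callback (which stands in for "query outside") are hypothetical names.

```python
import time


class TtlCache:
    """Naive resolver cache: expired entries are refreshed only on demand."""

    def __init__(self, resolve, clock=time.monotonic):
        self._resolve = resolve  # callable(name) -> (value, ttl_seconds)
        self._clock = clock
        self._entries = {}       # name -> (value, expires_at)

    def lookup(self, name):
        entry = self._entries.get(name)
        if entry is not None:
            value, expires_at = entry
            if self._clock() < expires_at:
                return value         # cached and still fresh
            del self._entries[name]  # expired: treat as never cached
        value, ttl = self._resolve(name)  # cache miss: query outside
        self._entries[name] = (value, self._clock() + ttl)
        return value
```

Note that nothing happens at the moment of expiry itself; the refresh is triggered by the next client request for that name.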
There are then several variations:
some caches do not exactly honor the TTLs: some clamp values that are too low or too high, based on their own local policies. The most widely agreed reading of the specification is that the TTL indicates the maximum amount of time to keep the record in cache, which means the client is free to ditch it earlier. However, it should not rewrite it to a higher value if it thinks it is too low.
caches can be kept across reboots/restarts, and can be prefetched, especially for "popular" records; in a way, the list of root NS is prefetched at boot and compared to the internal hardcoded list, in order to update it
caches, especially in RAM, may need to be trimmed, typically on an "oldest removed" basis, in order to make room for new records coming along the way.
so depending on how the cache is managed and which features it is required to have, there may be a background task that monitors expirations and refreshes records.
I recommend having a look at unbound as a recursive nameserver, as it has various settings around TTL handling that you can learn from, and then reading the code itself (which brings us back on-topic, kind of).
You can also read this document: https://www.ietf.org/archive/id/draft-wkumari-dnsop-hammer-03.txt an IETF Internet-Draft about:
The principle is that popular RRset in the cache are fetched, that is to say resolved before their TTL expires and flushed. By fetching RRset before they are being queried by an end user, that is to say prefetched, HAMMER is expected to improve the quality of experience of the end users as well as to optimize the resources involved in large DNSSEC resolving platforms.
Make sure to read Appendix A with a lot of useful examples, such as:
Unbound already does this (they use a percentage of TTL, instead of a number of seconds).
OpenDNS that they also implement something similar.
BIND as of 9.10, around Feb 2014 now implements something like this (https://deepthought.isc.org/article/AA-01122/0/Early-refresh-of-cache-records-cache-prefetch-in-BIND-9.10.html), and enables it by default.
A number of recursive resolvers implement techniques similar to the techniques described in this document. This section documents some of these and tradeoffs they make in picking their techniques.
And to take one example, the BIND one, you can read:
BIND 9.10 prefetch works as follows. There are two numbers that control it. The first number is the "eligibility". Only records that arrive with TTL values bigger than the configured eligibility will be considered for prefetch. The second number is the "trigger". If a query arrives asking for data that is cached with fewer than "trigger" seconds left before it expires, then in addition to returning that data as the reply to the query, BIND will also ask the authoritative server for a fresh copy. The intention is that the fresh copy would arrive before the existing copy expires, which ensures a uniform response time.
BIND 9.10 prefetch values are global options. You cannot ask for different prefetch behavior in different domains. Prefetch is enabled by default. To turn it off, specify a trigger value of 0. The following command specifies a trigger value of 2 seconds and an eligibility value of 9 seconds, which are the defaults.
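The decision rule described in that quote can be written down as a small predicate. This is a sketch of the rule as worded above, not BIND's actual code; `should_prefetch` is a hypothetical name, and the exact boundary comparisons in BIND may differ.

```python
def should_prefetch(original_ttl, remaining_ttl, trigger=2, eligibility=9):
    """Decide whether a cache hit should also start a background refresh.

    original_ttl  -- TTL (seconds) the record had when it arrived
    remaining_ttl -- seconds left before the cached copy expires
    trigger       -- refresh when fewer than this many seconds remain; 0 disables
    eligibility   -- only records that arrived with a larger TTL are considered
    """
    if trigger == 0:
        return False
    return original_ttl > eligibility and remaining_ttl < trigger
```

For example, with the defaults a record that arrived with a 300 second TTL is refreshed in the background when it is queried with 1 second left, while a record that arrived with a 5 second TTL is never prefetched.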

Expected performance with getstream.io

The getstream.io documentation says that one should expect retrieving a feed in approximately 60ms. When I retrieve my feeds they contain a field named 'duration' which I take is the calculated server side processing time. This value is steadily around 10-40ms, with an average around 15ms.
The problem is, I seldom get my feeds in less than 150ms; the average time is around 200-250ms, and sometimes it is up to 300-400ms. This is the time for getting the feed alone, with no enrichment etc. I have verified with tcpdump that the network roundtrip is low (around 25ms) and that the time is actually spent waiting for the server to respond.
I've tried to move around my application (eu-west and eu-central) but that doesn't seem to affect things much (again, network roundtrip is steadily around 25ms).
My question is - should I really expect 60ms and continue investigating, or is 200-400ms normal? On the getstream.io site it is explained that developer accounts receive "Low Priority Processing" - what does this mean in practise? How much difference could I expect with another plan?
I'm using the node js low level API.
Stream APIs use SSL to encrypt traffic. Unfortunately SSL introduces additional network I/O. Usually you need to pay for the increased latency only once, because Stream's HTTP APIs support HTTP persistent connections (aka keep-alive).
Here's a Wireshark screenshot of the TCP traffic of 2 sequential API requests with keep alive disabled client side:
The 4 lines in red highlight that the TCP connection is getting closed each time. Another interesting thing is that the handshake takes almost 100ms and is done twice (the first bunch of lines).
After some investigation, it turns out that the library used to make API requests to Stream's APIs (request) does not have keep-alive enabled by default. This change will be part of the library soon and is available on a development branch.
Here's a screenshot of the same two requests with keep-alive enabled (using the code from that branch):
This time there is no connection reset, and the second HTTP request does not repeat the SSL handshake.
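The effect of keep-alive can be illustrated independently of the Stream client library. The sketch below uses Python's standard `http.client` (chosen only for illustration, since the question uses Node.js); `fetch_all` is a hypothetical helper that sends several requests over one connection, so the TCP and TLS handshakes are paid for once as long as the server keeps the connection open.

```python
import http.client


def fetch_all(host, port, paths, use_tls=False):
    """Fetch every path over a single persistent connection.

    Returns a list of (status, body) tuples in request order.
    """
    conn_cls = http.client.HTTPSConnection if use_tls else http.client.HTTPConnection
    conn = conn_cls(host, port, timeout=10)
    results = []
    try:
        for path in paths:
            conn.request("GET", path)
            resp = conn.getresponse()
            body = resp.read()  # drain the body before reusing the connection
            results.append((resp.status, body))
    finally:
        conn.close()
    return results
```

Opening a new connection object per request instead would reproduce the pattern in the first capture, with a full handshake before every call.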

Can bittorrent peers handle seeding large numbers of idle torrents

I'm considering using bittorrent for a large data dissemination problem where the data source is petascale and users will want up to several terabytes. Some details:
Number of torrents potentially in the millions
torrent sizes ranging from 100Mb to 100Gb
A stable set of clusters around the world capable of acting as seeders each holding a large subset of the total torrents (say 60% on average)
A relatively small number of simultaneous users (less than 100) wanting to download on average a few terabytes of data.
I expect the number of active torrents to be small compared to the total available but quality of service is important so there must be several seeders for each torrent or some mechanism for launching new seeders.
My question is can bittorrent clients handle seeding huge numbers of torrents, most of which are idle? Would I need to stripe torrents across the seeders in a cluster or could each node be seeding all torrents it has access to? Which client would do the best job? Are there any tools for managing clusters of seeders?
I am assuming that trackers can be made to scale to this level.
There are 2 main problems:
Each torrent (typically) needs to announce to a tracker periodically; this might end up using a significant amount of bandwidth.
The bittorrent client itself needs to be written in a way that scales with a large number of torrents.
As for the tracker traffic, let's assume you have 1 million torrents. The typical re-announce interval is 30 minutes, but some trackers have it set to 1 hour. Let's be conservative and assume your tracker uses 1 hour announce intervals. You will have to make 1 million GET requests per hour. Let's say each request is 400 bytes up and 100 bytes down (assuming most responses will not contain any peers); that's about 111 kB/s up and 28 kB/s down, constantly. That's not so bad, but keep in mind that TCP requires an extra round-trip for establishing connections, so that's another 40 bytes down and 40 bytes up.
This can be mitigated by only using UDP trackers. Then you would only need a single connect-message, and you can reuse the connection ID for each announce. Each announce message would then be 100 bytes, and the returned message would be a bit more compact as well, let's assume 60 bytes. That would get you 28 kB/s up and 16kB/s down, just to keep the torrents announced. For this you would need a client with decent udp tracker support (one that caches the connection ID for instance).
Not too bad, assuming that's insignificant compared to the actual data your seeds would send.
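The bandwidth figures above follow from a simple rate calculation, sketched here with the same assumed message sizes (400/100 bytes for HTTP, 100/60 bytes for UDP). The function name is made up for illustration, and TCP handshake overhead is ignored.

```python
def announce_bandwidth(torrents, interval_s, bytes_up, bytes_down):
    """Return (bytes/s up, bytes/s down) needed to keep every torrent announced."""
    announces_per_second = torrents / interval_s
    return announces_per_second * bytes_up, announces_per_second * bytes_down


# 1 million torrents, one announce per torrent per hour.
http_up, http_down = announce_bandwidth(1_000_000, 3600, 400, 100)
udp_up, udp_down = announce_bandwidth(1_000_000, 3600, 100, 60)

print(f"HTTP: {http_up / 1000:.1f} kB/s up, {http_down / 1000:.1f} kB/s down")
print(f"UDP:  {udp_up / 1000:.1f} kB/s up, {udp_down / 1000:.1f} kB/s down")
```

This gives roughly 111 and 28 kB/s for HTTP and roughly 28 and 16.7 kB/s for UDP; halving the interval to 30 minutes doubles every figure.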
However, you don't necessarily need to stripe your torrents across separate data centers, you could also use an HTTP server to seed the torrents. All major bittorrent clients support http seeding, and you wouldn't have to worry about announcing to the tracker (the URL is burned into the .torrent itself).
As for a client that scales well with torrents, I don't know for sure; I haven't done any measurements. It should be fairly straightforward to just generate a million random torrents and try to load them up.
I have done some optimization work in libtorrent rasterbar to make it scale well with many torrents. I haven't tried millions, though.
I've written a blog post on this topic, here.
You may be looking for Hekate
It's in, at best, pre-alpha right now, but it's quite nearly what you're describing.
To avoid collapsing under the overhead of useless tracker announces and scrapes in the millions (in every announce interval), you have to restrict your seeding clusters to load only the current working set of items that are being requested right now. Downloaders need to get (download) the .torrent file from a central place anyway, and that could trigger loading it into the seeding clusters. Alternatively, determine activity for a particular info-hash by recognizing announces that do NOT originate from one of your seed clusters.
rTorrent has fast-resume (meaning no hashing happens when an appropriately prepared .torrent is loaded), and is controllable via xmlrpc so you can decommission idle items. That way, a .torrent download can trigger the actual data to be available for the next 24 hours, or as long as there's activity in the swarm.
The protocol allows for this, but I do not know which clients would scale to millions of torrents. In the worst case, you would have to write your own seed-only client.
The protocol feature most relevant to your use case is that, when a peer connects to another, the connecting peer is supposed to send the torrent's info-hash first. This means that a single listening TCP port could be used to seed an unlimited number of torrents, with almost zero resources used when idle.
This can be found in The BitTorrent Protocol Specification:
If both sides don't send the same value, they sever the connection. The one possible exception is if a downloader wants to do multiple downloads over a single port, they may wait for incoming connections to give a download hash first, and respond with the same one if it's in their list.
I also found the same on this Bittorrent Protocol Specification v1.0:
The initiator of a connection is expected to transmit their handshake immediately. The recipient may wait for the initiator's handshake, if it is capable of serving multiple torrents simultaneously (torrents are uniquely identified by their info_hash).
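As a sketch of how a seed-only client could use this, the code below parses the 68-byte handshake an incoming peer sends first and looks up the requested torrent by its info_hash. The handshake layout (length byte, the string "BitTorrent protocol", 8 reserved bytes, 20-byte info_hash, 20-byte peer id) is the standard one; the function names and the `torrents` dictionary are hypothetical.

```python
PSTR = b"BitTorrent protocol"
HANDSHAKE_LEN = 1 + len(PSTR) + 8 + 20 + 20  # 68 bytes in total


def parse_handshake(data):
    """Return (info_hash, peer_id) from a peer handshake, or None if invalid."""
    if len(data) < HANDSHAKE_LEN:
        return None
    if data[0] != len(PSTR) or data[1:20] != PSTR:
        return None
    # data[20:28] holds the 8 reserved bytes, which are ignored here.
    info_hash = data[28:48]
    peer_id = data[48:68]
    return info_hash, peer_id


def select_torrent(data, torrents):
    """Pick the torrent an incoming connection asks for.

    torrents maps 20-byte info_hash values to whatever object represents a
    seedable torrent; None means the connection should be dropped.
    """
    parsed = parse_handshake(data)
    if parsed is None:
        return None
    return torrents.get(parsed[0])
```

Because the lookup happens only when a peer actually connects, idle torrents cost one dictionary entry each and no open sockets.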
However, there is one thing that would increase your load, and it is the tracker. With the normal tracker protocol, each client has to periodically announce to the tracker each torrent it has, together with information like how much it has uploaded. With millions of torrents, this would present a somewhat high load. If you were writing your own mass-seed-only client, a separate protocol to announce your seeders to the tracker would be a good idea.

Does a caching-nameserver usually cache the negative DNS response SERVFAIL

Does a caching-nameserver usually cache the negative DNS response SERVFAIL?
EDIT:
To clarify the question, I can see the caching nameserver caching negative responses NXDOMAIN, NODATA. But it does not do this for SERVFAIL responses. Is this intentional?
SERVFAIL is covered by §7.1 of RFC2308:
Server failures fall into two major classes. The first is where a server can determine that it has been misconfigured for a zone. This may be where it has been listed as a server, but not configured to be a server for the zone, or where it has been configured to be a server for the zone, but cannot obtain the zone data for some reason. This can occur either because the zone file does not exist or contains errors, or because another server from which the zone should have been available either did not respond or was unable or unwilling to supply the zone.
The second class is where the server needs to obtain an answer from elsewhere, but is unable to do so, due to network failures, other servers that don't reply, or return server failure errors, or similar.
In either case a resolver MAY cache a server failure response. If it does so it MUST NOT cache it for longer than five (5) minutes, and it MUST be cached against the specific query tuple <query name, type, class, server IP address>.
So basically, it's dependent on the implementation of your name server.
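For an implementation that does choose to cache SERVFAIL, the two constraints in the quoted text (at most five minutes, keyed by the full query tuple) could look like the sketch below. `ServfailCache` and its method names are hypothetical and not taken from any real nameserver.

```python
import time

MAX_SERVFAIL_TTL = 300  # five minutes, the upper bound in the quoted RFC text


class ServfailCache:
    """Remembers server failures per <query name, type, class, server IP>."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._expires = {}  # (qname, qtype, qclass, server_ip) -> expiry time

    def remember(self, qname, qtype, qclass, server_ip, ttl=MAX_SERVFAIL_TTL):
        ttl = min(ttl, MAX_SERVFAIL_TTL)  # never keep a failure longer than 5 min
        key = (qname.lower(), qtype, qclass, server_ip)
        self._expires[key] = self._clock() + ttl

    def is_failed(self, qname, qtype, qclass, server_ip):
        key = (qname.lower(), qtype, qclass, server_ip)
        expires_at = self._expires.get(key)
        if expires_at is None:
            return False
        if self._clock() >= expires_at:
            del self._expires[key]  # entry aged out
            return False
        return True
```

Keying on the server IP means a failure from one upstream server does not stop the resolver from trying a different server for the same question.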
RFC 1034 describes how to cache negative responses but did not define a mechanism for returning those cache results to peer resolvers. RFC 2308 defines these attributes.
Negative caching was an optional part of the DNS Specifications...
One of the timeout fields in the SOA is a "negative timeout". It is usually set to a short time, such as 30 or 60 seconds. So, yes, but for a shorter time than a "positive" response.
