Linux: Compare large files

I am downloading the .COM zone file every day. It's a list of all .COM domains in the world with their primary nameserver.
Sample of the zone file:
DAYTONOHIOJOBS NS NS1.HOSTINGNET
DAYTONOHIOJOBS NS NS2.HOSTINGNET
DAYTONOHIOMAP NS NS1.HOSTINGNET
DAYTONOHIOMAP NS NS2.HOSTINGNET
DAYTONOHIONEWS NS NS1.HOSTINGNET
DAYTONOHIONEWS NS NS2.HOSTINGNET
To save disk space, .COM has been removed from the domain name (it's all .COM anyway).
The same goes for the nameserver (if it ends in .COM it has been removed).
This zone file is around 270,000,000 lines and about 9 GB.
My goal is to monitor a specific nameserver. Every day I want a list of all domains with that specific nameserver, but also a list of all new domains with that nameserver (new as in: yesterday this domain didn't have that nameserver yet).
I wrote a Perl script that opens and loads yesterday's database, then opens today's database and loops over it to compare. But this takes hours and a lot of memory.
What would be the best way to do this?

Here is how I would do it, judging by what I know:
Have the script read the first file. For each line that corresponds to the nameserver of interest, add the entry to a hashmap.
Have the script read the second file. For each line that corresponds to the nameserver of interest, check if the entry is in the hashmap. If it isn't, it is new. If it is, it is unchanged - remove it from the hashmap.
At the end, all entries still left in the hashmap have been removed.
This does assume that the hashmap with this particular nameserver's domains fits into memory, but on a reasonable machine and for a reasonable nameserver, this seems a reasonable assumption.
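The steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are made up for this example, each line is assumed to have the three-column shape shown in the question's sample, and set differences stand in for the "remove it from the hashmap" step (the result is the same).

```python
from typing import Iterable, Set, Tuple


def domains_for_nameserver(lines: Iterable[str], nameserver: str) -> Set[str]:
    """Collect the domains whose NS record names the given nameserver."""
    wanted = nameserver.upper()
    domains = set()
    for line in lines:
        parts = line.split()
        # Expected shape: <domain> NS <nameserver>
        if len(parts) == 3 and parts[1] == "NS" and parts[2].upper() == wanted:
            domains.add(parts[0])
    return domains


def compare_days(yesterday: Iterable[str], today: Iterable[str],
                 nameserver: str) -> Tuple[Set[str], Set[str], Set[str]]:
    """Return (current, added, removed) domain sets for one nameserver."""
    old = domains_for_nameserver(yesterday, nameserver)
    new = domains_for_nameserver(today, nameserver)
    return new, new - old, old - new
```

Both arguments can be open file objects, so each zone file is streamed line by line and only the matching domains are held in memory.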

You can grep today's and yesterday's files for lines containing the nameserver and compare the two results (grep is a command-line Unix tool).
You can also keep the files compressed (gzip) and use zgrep for the initial grep.

Related

DNS: Switch an A Record to a CNAME Without Impacting Consumers

Say we have an A record that points to IP x of our load balancer for one of our services. It has a TTL of 3600s. But what it should have been is a CNAME that points to an A record for a VIP. It's already in production, and about 10 services, comprising ~100 machines, call this A record. If the A record is deleted and a new CNAME is created with the same name, pointing to a new A record, will the consumers notice this change? Is there a chance that the callers would time out?
I'd assume that with this many machines some are bound to be impacted. If I set the TTL to 5 hours, would there be a better chance of no one noticing?
So my question is: how do I swap an A record for a CNAME without consumers of our service noticing?
Would it matter if the record is for use inside the network only vs available to the public?
I ask because we will need to load balance across data centers soon, and we have some records that are stuck pointing to an IP.
It would be nice to have an explanation of how the DNS system would behave in this scenario. Thanks.
Let's assume that you have a name foo.example.org that has nothing except an A record with the IPv4 address 192.0.2.1 and a 3600 second TTL. Anyone who looks up foo.example.org will get that A record, and remember it for an hour before they go and ask your name server for fresher information.
Then assume you change things so that foo.example.org has a CNAME record pointing at bar.example.net, which in turn has an A record holding the address 192.0.2.1. Anyone who looks up the name foo.example.org for the first time will get the CNAME, proceed to look up bar.example.net, and get the A record from there.
The only complication is that anyone who looked up foo.example.org during the 3600 seconds immediately before your change to the CNAME chain took effect will remember the direct lookup, and thus not see the new information until the TTL expires. So for up to an hour after you make the change, some people may still see the old information. To keep the change transparent to users, make sure that the old information (the old IP address) still works for at least one full TTL period after you make the change.
This is not in any way special for changing from A to CNAME. No matter what you change, there will be a full TTL period during which clients can legitimately get the old info. That's just how DNS works.
On top of that, of course, there are clients and caching servers that don't pay as much attention to the TTL value as they should, but that's a whole different thing.
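To make the timing concrete, here is a toy Python model of a well-behaved caching client. It is a sketch, not a real resolver: the class name and the dictionary standing in for the authoritative zone are invented for this example, and time is passed in explicitly as seconds.

```python
class CachingResolver:
    """Toy cache: remembers an answer until its TTL has elapsed."""

    def __init__(self, authoritative):
        self.authoritative = authoritative  # dict: name -> (answer, ttl_seconds)
        self.cache = {}                     # dict: name -> (answer, expires_at)

    def lookup(self, name, now):
        cached = self.cache.get(name)
        if cached and now < cached[1]:
            return cached[0]                # still fresh: the zone is not consulted
        answer, ttl = self.authoritative[name]
        self.cache[name] = (answer, now + ttl)
        return answer


zone = {"foo.example.org": ("A 192.0.2.1", 3600)}
resolver = CachingResolver(zone)
resolver.lookup("foo.example.org", now=0)                    # cached until t=3600
zone["foo.example.org"] = ("CNAME bar.example.net", 3600)    # the change, made later
print(resolver.lookup("foo.example.org", now=3599))          # still the old A record
print(resolver.lookup("foo.example.org", now=3600))          # now the CNAME
```

The client that cached the A record just before the change keeps using it until the full TTL has passed, which is why the old address has to stay reachable for that long.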

Knot Resolver: How to observe and modify a resolved answer at the right time

Goal
I would like to stitch up a GNU GPL licensed Knot Resolver module either in C or in CGO that would examine the client's query and the corresponding resolved answer with the goal of querying an external API offering a knowledge base of malware infected hostnames and ip addresses (e.g. GNU AGPL v3 IntelMQ).
If there is a match with the resolved A's (AAAA's) IP address it is to be logged, likewise a match with the queried hostname should be logged or (optionally) it could result in sending the client an IP address of a sinkhole instead of the resolved one.
Means
I studied the layers and I came to the conclusion that the phase I'm interested in is consume. I don't want to affect the resolution process, I just want to step in at the last moment and check the results and possibly modify them.
I ventured to register a consume function
with
static knot_layer_api_t _layer = {
.consume = &consume,
};
but I'm not sure it is the right place to do the deed.
Furthermore, I also looked into module hints.c, especially its query method
and module stats.c for its _to_wire function usage.
Question(s)
Phase (Layer?)
When is the right time to step in and read/write the answer to the query before it's sent to the client? Am I at the right spot in the consume layer?
Answer sections
If the following attempt at getting the resolved IP address gives me the Name Server's address:
char addr_str[INET6_ADDRSTRLEN];
memset(addr_str, 0, sizeof(addr_str));
const struct sockaddr *src = &(req->answer->sections);
inet_ntop(qry->ns.addr[0].ip.sa_family, kr_inaddr(src), addr_str, sizeof(addr_str));
DEBUG_MSG(NULL, "ADDR: %s\n", addr_str);
how do I get the resolved (A, AAAA) IP address for the query's hostname? I would like to iterate over A/AAAA IP addresses and CNAMEs in the answer and look at the IP addresses they were resolved to.
Modifying the answer
If the module setting demands it, I would like to be able to "ditch" the resolved answer and provide a new one comprising an A record pointed at a sinkhole.
How do I prepare the record so that it can be translated from char* to Knot's wire format and the proper structure, in the right context and at the right phase?
I guess it might go along functions such as knot_rrset_init and knot_rrset_add_rdata, but I wasn't able to arrive at any successful result.
THX for pointers and suggestions.
If you want to step in at the last moment, when the response is finalised but not yet sent to the requestor, the right place is finish. You can do it in consume as well, but there you'll be overwriting responses from authoritative servers, not the assembled response to the requestor (which means the DNSSEC validator is likely to stop your rewritten answers).
Disclaimer: the Go interface is rough and requires a lot of CGO code to access internal structures. You'd probably be better served by a LuaJIT module; there is another module doing something similar that you could take as an example, and it also has wrappers for creating records from text, etc. If you still want to do it in Go, that's awesome, and improvements to the Go interface are welcome. Read on.
What you need to do is roughly this (as CGO).
That will walk you through RR sets in the packet (C.knot_rrset_t),
where you can match type (rr.type) and contents (rr.rdata).
The contents are stored in DNS wire format; for address records this is the address in network byte order, e.g. {0x7f, 0, 0, 1}.
You will have to compare that to the address/subnet you're looking for - example in C code.
When you find a match, you want to clear the whole packet and insert a sinkhole record (you cannot selectively remove records, because the packet is append-only for performance reasons). This is relatively easy, as there is a helper for that. Here's code in LuaJIT from the policy module; you'd have to rewrite it in Go, using all the functions mentioned above and an A/AAAA sinkhole record instead of SOA. Good luck!
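The address comparison itself is independent of Knot Resolver, so here is a small Python sketch of just that step. It does not use any Knot API; the function name is hypothetical, and it assumes you already have the raw rdata bytes of an A or AAAA record in hand.

```python
import ipaddress


def rdata_in_subnet(rdata: bytes, subnet: str) -> bool:
    """Check whether A/AAAA rdata (raw address bytes) falls inside a subnet."""
    if len(rdata) not in (4, 16):
        return False  # not an A (4 bytes) or AAAA (16 bytes) payload
    address = ipaddress.ip_address(rdata)   # bytes are read in network byte order
    network = ipaddress.ip_network(subnet)
    # Compare versions first so an IPv4 address is never tested against IPv6.
    return address.version == network.version and address in network
```

For example, `rdata_in_subnet(bytes([0x7f, 0, 0, 1]), "127.0.0.0/8")` is true. In a C or CGO module the same check would be a masked byte comparison over `rr.rdata`.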

Subdomain DNS seems to only be partially propagating

I own a domain, and its DNS resolution is clearly fine; everywhere seems to point to the right server: https://dnschecker.org/#A/e-bis.fr
I created a wildcard for subdomains, and it seems to point to the right server only in some random places in the world, and the results change every once in a while (a server will say the name resolves, and an hour later it won't anymore): https://dnschecker.org/#A/whatever.e-bis.fr
At first I thought it was a propagation issue, but it's been a week now so clearly it's me messing up the config at some point.
Here's the zone file used by bind9 for this domain :
# IN SOA ns3032550.ip-91-121-79.eu. postmaster.e-bis.fr. (
2014070501 ; Serial
8H ; Refresh
30M ; Retry
4W ; Expire
8H ; Minimum TTL
)
IN NS ns3032550.ip-91-121-79.eu.
IN NS ns.kimsufi.com.
e-bis.fr. IN A 91.121.79.161
*.e-bis.fr. IN A 91.121.79.161
ownercheck IN TXT "28834a04"
I do a service bind9 reload every time I update it, so the only thing I can see is the issue being in the zone file. I'm terrible with them, so it wouldn't surprise me if it was a beginner mistake.
Thanks in advance to anyone who can help,
Éric B.
Turns out I had just forgotten to update the serial (I think?).
For anyone running into the same problem, it was this line 2014070501 ; Serial which I had not updated. Incrementing it then restarting the service is enough.
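Secondary nameservers compare serials to decide whether to transfer the zone again, so forgetting to raise the serial is easy to do and hard to spot. A minimal Python sketch of a helper that always produces a larger serial is shown below. It assumes the common YYYYMMDDnn convention, which the serial above appears to follow; the function name is made up for this example.

```python
from datetime import date


def next_serial(current: int, today: date) -> int:
    """Return a serial greater than `current`, using YYYYMMDDnn when possible."""
    todays_first = int(today.strftime("%Y%m%d")) * 100 + 1   # e.g. 2014070501
    # If the zone was already edited today (or the serial is ahead), just add one.
    return max(todays_first, current + 1)
```

With the serial from the zone file above, a second edit on 5 July 2014 gives 2014070502, and an edit a week later gives 2014071201.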

How to find the position of Central Directory in a Zip file?

I am trying to find the position of the first Central Directory file header in a Zip file.
I'm reading these:
http://en.wikipedia.org/wiki/Zip_(file_format)
http://www.pkware.com/documents/casestudies/APPNOTE.TXT
As I see it, I can only scan through the Zip data, identify by the header what kind of section I am at, and then do that until I hit the Central Directory header. I would obviously read the File Headers before that and use the "compressed size" to skip the actual data, and not for-loop through every byte in the file...
If I do it like that, then I practically already know all the files and folders inside the Zip file in which case I don't see much use for the Central Directory anymore.
To my understanding the purpose of Central Directory is to list file metadata, and the position of the actual data in the Zip file so you wouldn't need to scan the whole file?
After reading about End Of Central Directory record, Wikipedia says:
This ordering allows a zip file to be created in one pass, but it is
usually decompressed by first reading the central directory at the
end.
How would I find End of Central Directory record easily? We need to remember that it can have an arbitrary sized comment there, so I may not know how many bytes from the end of the data stream it is located at. Do I just scan it?
P.S. I'm writing a Zip file reader.
Start at the end and scan towards the beginning, looking for the end of central directory signature and counting the number of bytes you have scanned. When you find a candidate, read the comment length (L) stored at byte offset 20 of the record. Check that L + 22 matches your current count, since the fixed part of the record is 22 bytes long. Then check that the start of the central directory (pointed to by the 4-byte field at offset 16) has an appropriate signature.
If you assume the bits are essentially random whenever the signature check lands on a wild guess (e.g. inside a data segment), the probability of all the signature bits matching is pretty low. You could refine this by working out the chance of landing in a data segment and the chance of hitting a legitimate header (as a function of the number of such headers), but this already sounds like a low likelihood to me. You can increase your confidence by then checking the signature of the first file record listed, but be sure to handle the boundary case of an empty zip file.
I ended up looping through the bytes starting from the end. The loop stops if it finds a matching byte sequence, the index is below zero or if it already went through 64k bytes.
Just cross your fingers and hope that there isn't an entry with the CRC, timestamp or datestamp as 06054B50, or any other sequence of four bytes that happens to be 06054B50.
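A backward scan with the extra sanity checks can be sketched in Python like this. It is a minimal illustration under stated assumptions: the whole archive is already in memory as `bytes`, ZIP64 and multi-disk archives are ignored, and the function and constant names are made up for this example.

```python
import struct
from typing import Optional, Tuple

EOCD_SIGNATURE = b"PK\x05\x06"            # 0x06054b50 stored little-endian
CENTRAL_HEADER_SIGNATURE = b"PK\x01\x02"  # 0x02014b50 stored little-endian
EOCD_FIXED_SIZE = 22                      # record size without the comment
MAX_COMMENT = 0xFFFF                      # comment length is a 16-bit field


def find_central_directory(data: bytes) -> Optional[Tuple[int, int]]:
    """Return (offset, entry_count) of the central directory, or None."""
    lowest = max(0, len(data) - EOCD_FIXED_SIZE - MAX_COMMENT)
    pos = len(data) - EOCD_FIXED_SIZE     # last place a record could start
    while pos >= lowest:
        # Find the closest signature that starts at or before `pos`.
        pos = data.rfind(EOCD_SIGNATURE, lowest, pos + 4)
        if pos < 0:
            break
        # Fields at offsets 10, 12, 16 and 20 of the candidate record.
        entries, _cd_size, cd_offset, comment_len = struct.unpack_from(
            "<HIIH", data, pos + 10)
        comment_fits = pos + EOCD_FIXED_SIZE + comment_len == len(data)
        # An empty archive has no central header to check.
        directory_ok = (entries == 0 or
                        data[cd_offset:cd_offset + 4] == CENTRAL_HEADER_SIGNATURE)
        if comment_fits and directory_ok:
            return cd_offset, entries
        pos -= 1                          # false positive, keep scanning
    return None
```

For large archives it is enough to read only the last 65,557 bytes (22 + 65,535) for the scan and then seek to the reported offset to verify the central header.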

DNS Response Packets

I'm trying to code my own DNS server. I'm reading through RFC 1035 on DNS, but I have a few queries:
1) I want my server to respond with a CNAME for a particular request, but no A records - can I do this? for example, receive request for 'server1.com', response 'CNAME server2.com', and then the client queries another DNS server to get the A record for 'server2.com'.
I've currently set the header flags to '\x84\x00' to say that this is the authoritative server but recursion is not available. Is this right?
2) I want my server to respond with no records for any other request, so that the client then queries a different DNS server for the records. I've currently set the header flags to '\x83\x03' to signal a NAME ERROR reply code. Is this right? Then what do I follow this with: zeros in all the other fields, or do I just end the packet there? I don't want to respond with 'this name doesn't exist', but rather 'I don't know this name, try someone else' - how do I do this?
Many Thanks :)
Sounds about right - in fact, CNAME with A records is incorrect (RFC1034 section 3.6.2: "If a CNAME RR is present at a node, no other data should be present").
This would be very unusual behaviour from an authoritative nameserver - I'd suggest rethinking it or at least testing with some real-life resolvers to ensure they do what you want. RCODE #3 ("name error" or NXDOMAIN) is positive confirmation that the name doesn't exist. This would cause resolvers to terminate resolution and possibly cache the nonexistence of the name, which doesn't sound like what you're after. If you want the resolver to query one of the other nameservers that was delegated to for that zone, I guess SERVFAIL (RCODE #2) is the most appropriate/likely to have the desired effect.
By the way, for debugging the exact format of your DNS packets I can highly recommend Wireshark for its decoding accuracy compared with pasting hex codes into Stack Overflow ;)
In the CNAME case, your (authoritative) server should just return the CNAME in the answer section unless it is also authoritative for the domain that the CNAME points to, in which case it should also include the result of following the CNAME.
For your second case you should return RCODE 5 ("REFUSED") - this is the preferred error that an authoritative server should give when asked a question for a domain for which it is not configured.
Following that, you still need to send the four 16-bit count fields and a copy of the question from the original request. In this case the four counts would be (1, 0, 0, 0) - one question, no answer, no ns records, no additional records.
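Putting the header, the four counts and the copied question together, a REFUSED reply could be built along these lines in Python. This is a sketch under assumptions: the function names are made up, the query is assumed to carry exactly one question with an uncompressed name, and only the opcode and RD bits are echoed back from the query's flags.

```python
import struct

FLAG_QR = 0x8000        # this message is a response
OPCODE_MASK = 0x7800    # opcode, copied from the query
FLAG_RD = 0x0100        # recursion desired, copied from the query
RCODE_REFUSED = 5


def question_end(packet: bytes, offset: int = 12) -> int:
    """Return the index just past the first question (QNAME, QTYPE, QCLASS)."""
    while packet[offset] != 0:            # each label is length-prefixed
        offset += 1 + packet[offset]
    return offset + 1 + 4                 # root label, then QTYPE and QCLASS


def build_refused(query: bytes) -> bytes:
    """Build a REFUSED reply that echoes the query's ID and question."""
    query_id, query_flags = struct.unpack_from("!HH", query, 0)
    flags = FLAG_QR | (query_flags & (OPCODE_MASK | FLAG_RD)) | RCODE_REFUSED
    # Counts: one question, no answer, no authority, no additional records.
    header = struct.pack("!HHHHHH", query_id, flags, 1, 0, 0, 0)
    return header + query[12:question_end(query)]
```

For a standard query with RD set, the two flag bytes of the reply come out as '\x81\x05', followed by the counts (1, 0, 0, 0) and the question copied from the request.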
