I am considering writing an application to monitor DNS requests for approximately 200,000 developer and test machines. Libpcap sounds like a good starting point, but before I dive in I was hoping for feedback.
This is what the application needs to do:
Inspect all DNS packets.
Keep aggregate statistics, including:
DNS name.
DNS record type.
Associated IP(s).
Requesting IP.
Count.
If the number of requesting IPs for one DNS name is > 10, then stop keeping the client IPs.
The stats would hopefully be kept in memory; writes would only occur when a new, "suspicious" DNS exchange is seen, or every few hours to flush the stats to disk for consumption by other processes.
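To make this concrete, here is a rough sketch of the kind of libpcap capture loop I have in mind; it is only a starting point, not working code for the whole application (the interface name is a placeholder, and the DNS parsing/aggregation step is only a comment):

```c
/* Rough capture-loop sketch -- "eth0" and the TODO are placeholders. */
#include <pcap.h>
#include <stdio.h>
#include <stdlib.h>

static void on_packet(u_char *user, const struct pcap_pkthdr *hdr,
                      const u_char *bytes)
{
    (void)user;
    (void)bytes;
    /* Ethernet (14 bytes) + IP header (variable) + UDP (8 bytes) precede
       the DNS payload; a real parser must read the IP header length and
       handle IPv6 / VLAN tags.  TODO: extract the DNS name, record type,
       answer IPs, and requesting IP, then update the in-memory stats. */
    printf("captured %u bytes\n", hdr->caplen);
}

int main(void)
{
    char errbuf[PCAP_ERRBUF_SIZE];
    pcap_t *handle = pcap_open_live("eth0", 65535, 1, 1000, errbuf);
    if (handle == NULL) {
        fprintf(stderr, "pcap_open_live: %s\n", errbuf);
        return EXIT_FAILURE;
    }

    /* Filter in the kernel so only DNS traffic reaches the callback. */
    struct bpf_program prog;
    if (pcap_compile(handle, &prog, "udp port 53", 1, PCAP_NETMASK_UNKNOWN) == -1 ||
        pcap_setfilter(handle, &prog) == -1) {
        fprintf(stderr, "filter: %s\n", pcap_geterr(handle));
        return EXIT_FAILURE;
    }

    pcap_loop(handle, -1, on_packet, NULL);  /* run until interrupted */
    pcap_close(handle);
    return EXIT_SUCCESS;
}
```

The BPF filter at least keeps non-DNS traffic from ever reaching userspace; the TODO is where the in-memory hash keyed by DNS name would live.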
My questions are:
1. Do any applications exist that can do this? The link will be either 100 Mbps or 1 Gbps.
2. Performance is the #1 consideration by a large margin. I have experience writing C for other one-off security applications, but I am not an expert. Do any tips come to mind?
3. How much of an effort would this be for a good C developer, in man-hours?
Thanks!
Jason
I suggest you try something like DNSCAP or even Snort for capturing DNS traffic.
BTW, I think this is more of a superuser.com question than a Stack Overflow one.
My organization is spinning down its private cloud and preparing to deploy a complete analytics and data warehousing solution on Azure VMs. Naturally, we are doing performance testing so we can identify and address any unforeseen issues before fully decommissioning our datacenter servers.
So far, the only significant issue I've noticed in Azure VMs is that my download speeds don't seem to change at all no matter which VM size I test. The speed is always right around 60 Mbps downstream. This is despite guidance such as this which indicates I should see improvements in ingress based on VM size.
I have done significant and painstaking research into the issue, but everything I've read so far only really addresses "intra-cloud" optimizations (e.g., ExpressRoute, Accelerated Networking, VNet peering) or communications to specific resources. My requirement is for fast download speeds on the open internet (secure and trusted sites only). To preempt any attempts to question this requirement, I will not specifically address the reasons why I can't consider alternatives such as establishing private/dedicated connections, but suffice it to say, I have no choice but to rule out those options.
That said, I'm open to any other advice! What, if anything, can I do to improve download speeds to Azure VMs when data needs to be transferred to the VM from the open web?
Edit: Corrected comment about Azure guidance.
I finally figured out what was going on. #evilSnobu was onto something -- there was something flawed about my tests. Namely, all of them (seemingly by sheer coincidence) "throttled" my data transfers. I was able to confirm this by examining the network throughput carefully. Since my private cloud environment never provisioned enough bandwidth to hit the 50-60 Mbps ceiling that seems to be fairly common among certain hosts, it didn't occur to me that I could eventually be throttled at a higher throughput rate. Real bummer. What this experiment did teach me is that you should NOT assume more bandwidth will solve all your throughput problems. Throttling appears to be exceptionally common, and I would suggest planning for what to do when you encounter it.
I'm looking for the best way to manage the download of my products on the web.
Each of them weighs between 2 and 20 GB.
Each of them is downloaded approximately 1 to 1,000 times a day by our customers.
I've tried to use Amazon S3 but the download speed isn't good and it quickly becomes expensive.
I've tried to use Amazon S3 + CloudFront, but the files are too large and the downloads too infrequent: the files didn't stay in the cache.
Also, I can't create torrent files in S3 because the file sizes are too large.
I guess the cloud solutions (such as S3, Azure, Google Drive...) only work well for small files, such as images / CSS / etc.
Now, I'm using my own servers. It works quite well but it is really more complex to manage...
Is there a better way, a perfect way to manage this sort of downloads?
This is a huge problem and we see it when dealing with folks in the movie or media business: they generate HUGE video files that need to be shared on a tight schedule. Some of them resort to physically shipping hard drives.
When "ordered and guaranteed data delivery" is required (e.g. HTTP, FTP, rsync, NFS, etc.) the network transport is usually performed with TCP. But TCP implementations are quite sensitive to packet loss, round-trip time (RTT), and the size of the pipe between the sender and receiver. Some TCP implementations also have a hard time filling big pipes (limits on the max bandwidth-delay product; BDP = bit rate * propagation delay).
The ideal solution would need to address all of these concerns.
Reducing the RTT usually means reducing the distance between sender and receiver. As a rule of thumb, halving your RTT can double your max throughput (or cut your turnaround time in half). Just for context, I'm seeing an RTT of ~80-85 ms from the US East Coast to the US West Coast.
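To put rough example numbers on that (back-of-the-envelope figures of my own, not measurements): a single TCP connection can move at most about (window size) / RTT, so a 64 KB window over an 80 ms path tops out around 6.5 Mbps no matter how fat the pipe is. Conversely, keeping a 1 Gbps path full at that RTT needs roughly 1 Gbps x 0.08 s = 80 Mbit, i.e. about 10 MB, in flight at once. Halve the RTT and the same window can move data twice as fast.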
Big deployments typically use a content delivery network (CDN) like Akamai or AWS CloudFront to reduce the RTT (e.g. ~5-15ms). Simply stated, the CDN service provider makes arrangements with local/regional telcos to deploy content-caching servers on-premise in many cities, and sells you the right to use them.
But control over a cached resource's time-to-live (TTL) can depend on your service level agreement ($). And cache memory is not infinite, so idle resources might be purged to make room for newly requested data, especially if the cache is shared with others.
In your case, it sounds to me like you want to meaningfully reduce the RTT while retaining full control of the cache behaviour, so you can set a really long cache TTL. The best price/performance solution IMO is to deploy your own cache servers running CentOS 7 + NGINX with proxy_cache turned on and enough disk space, and deploy a cache server for each major region (e.g. west coast and east coast). Your end users could select the region closest to them, or you could add some code to automatically detect the closest regional cache server.
Deploying these cache servers on AWS EC2 is definitely an option. Your end users will probably see much better performance than by connecting to AWS S3 directly, and there are no BW caps.
The current AWS pricing for your volume is about $0.09/GB for BW out to the internet. Assuming ~50 files at an average of 10 GB each, that's about $50/month for BW from the cache servers to your end users - not bad? You could start with a c4.large for low/average-usage regions ($79/month). Higher-usage regions might cost you ~$150/month (c4.xl), ~$300/month (c4.2xl), etc. You can get better pricing with spot instances, and you can tune performance based on your business model (e.g. VIP vs Best-Effort).
In terms of being able to "fill the pipe" and sensitivity to network loss (e.g. congestion control, congestion avoidance), you may want to consider an optimized TCP stack like SuperTCP (full disclaimer, I'm the director of development). The idea here is to have a per-connection auto-tuning TCP stack with a lot of engineering behind it, so it can fill huge pipes like the ones between AWS regions, and not overreact to network loss as regular TCP often does, especially when sending to Wi-Fi endpoints.
Unlike UDP solutions, it's a single-sided install (<5 min), you don't get charged for hardware or storage, you don't need to worry about firewalls, and it won't flood/kill your own network. You'd want to install it on your sending devices: the regional cache servers and the origin server(s) that push new requests to the cache servers.
An optimized TCP stack can increase your throughput by 25%-85% over healthy networks, and I've seen anywhere from 2X to 10X throughput on lousy networks.
Unfortunately I don't think AWS is going to have a solution for you. At this point I would recommend looking into some other CDN providers like Akamai https://www.akamai.com/us/en/solutions/products/media-delivery/download-delivery.jsp that provide services specifically geared toward large file downloads. I don't think any of those services are going to be cheap though.
You may also want to look into file acceleration software, like Signiant Flight or Aspera (disclosure: I'm a product manager for Flight). Large files (multiple GB in size) can be a problem for traditional HTTP transfers, especially over large latencies. File acceleration software goes over UDP instead of TCP, essentially masking the latency and increasing the speed of the file transfer.
One negative to using this approach is that your clients will need to download special software to download their files (since UDP is not supported natively in the browser), but you mentioned they use a download manager already, so that may not be an issue.
Signiant Flight is sold as a service, meaning Signiant will run the required servers in the cloud for you.
With file acceleration solutions you'll generally see network utilization of about 80 - 90%, meaning 80 - 90 Mbps on a 100 Mbps connection, or 800 Mbps on a 1 Gbps network connection.
Suppose that a data source sets a tight IP-based throttle. Would a web scraper have any way to download the data if the throttle starts rejecting its requests after as little as 1% of the data has been downloaded?
The only technique I could think of a hacker using here would be some sort of proxy system. But it seems like the proxies (even if fast) would eventually all reach the throttle.
Update: Some people below have mentioned big proxy networks like Yahoo Pipes and Tor, but couldn't these IP ranges or known exit nodes be blacklisted as well?
A list of thousands of proxies can be compiled for FREE. IPv6 addresses can be rented for pennies. Hell, an attacker could boot up an Amazon EC2 micro instance for 2-7 cents an hour.
And you want to stop people from scraping your site? The internet doesn't work that way, and hopefully it never will.
I have seen IRC servers do a port scan on clients to see if the following ports are open: 8080, 3128, 1080. However, there are proxy servers that use different ports, and there are also legitimate reasons to run a proxy server or to have these ports open, like if you are running Apache Tomcat. You could bump it up a notch by using YAPH to see if a client is running a proxy server. In effect you'd be using an attacker's tool against them ;)
Someone using Tor would be hopping IP addresses every few minutes. I used to run a website where this was a problem, and resorted to blocking the IP addresses of known Tor exit nodes whenever excessive scraping was detected. You can implement this if you can find a regularly updated list of Tor exit nodes, for example, https://www.dan.me.uk/tornodes
You could use a P2P crawling network to accomplish this task. There will be a lot of IPs available, and there will be no problem if one of them becomes throttled. Also, you may combine a lot of client instances using some proxy configuration, as suggested in previous answers.
I think you can use YaCy, an open-source P2P crawling network.
A scraper that wants the information will get the information. Timeouts, changing agent names, proxies, and of course EC2/RackSpace or any other cloud services that have the ability to start and stop servers with new IP addresses for pennies.
I've heard of people using Yahoo Pipes to do such things, essentially using Yahoo as a proxy to pull the data.
Maybe try running your scraper on Amazon EC2 instances. Every time you get throttled, start up a new instance (with a new IP) and kill the old one.
It depends on the time the attacker has for obtaining the data. If most of the data is static, it might be interesting for an attacker to run his scraper for, say, 50 days. If he is on a DSL line where he can request a "new" IP address twice a day, a 1% limit would not harm him that much.
Of course, if you need the data more quickly (because it is outdated quickly), there are better ways (use EC2 instances, set up a BOINC project if there is public interest in the collected data, etc.).
Or have a Pyramid scheme a la "get 10 people to run my crawler and you get PORN, or get 100 people to crawl it and you get LOTS OF PORN", as it was quite common a few years ago with ad-filled websites. Because of the competition involved (who gets the most referrals) you might quickly get a lot of nodes running your crawler for very little money.
We have a dedicated GoDaddy server, and it seemed to grind to a halt when we had users downloading only 3 MB every 2 seconds (this was over about 20 HTTP requests).
I want to look into database locking etc. to see if that is a problem - but first I'm curious as to what a dedicated server ought to be able to serve.
To help diagnose the problem, host a large file and download it. That will give you the transfer rate that the server and your web server can cope with. If the transfer rate is poor, then you know it's the network, the server, or the web server.
If it's acceptable or good, then you know the bottleneck is the means you have of generating those 3MB files.
Check, measure and calculate!
PS: Download the file over a fast link; you don't want the bottleneck to be your 64 kbps modem :)
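If you'd rather script that measurement than eyeball it, here is a small sketch using libcurl (the URL is whatever large test file you host; treat it as an illustration, not a polished tool):

```c
/* Download a file, throw the bytes away, and report the average rate.
   Build with: gcc measure.c -lcurl */
#include <curl/curl.h>
#include <stdio.h>

/* Discard the body; we only care about how fast it arrives. */
static size_t discard(void *ptr, size_t size, size_t nmemb, void *userdata)
{
    (void)ptr;
    (void)userdata;
    return size * nmemb;
}

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <url-of-large-file>\n", argv[0]);
        return 1;
    }

    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL *curl = curl_easy_init();
    if (curl == NULL)
        return 1;

    curl_easy_setopt(curl, CURLOPT_URL, argv[1]);
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, discard);
    curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);

    CURLcode rc = curl_easy_perform(curl);
    if (rc == CURLE_OK) {
        double bytes = 0, speed = 0;
        curl_easy_getinfo(curl, CURLINFO_SIZE_DOWNLOAD, &bytes);   /* bytes */
        curl_easy_getinfo(curl, CURLINFO_SPEED_DOWNLOAD, &speed);  /* bytes/sec */
        printf("%.0f bytes at %.2f MB/s (%.1f Mbps)\n",
               bytes, speed / 1e6, speed * 8 / 1e6);
    } else {
        fprintf(stderr, "download failed: %s\n", curl_easy_strerror(rc));
    }

    curl_easy_cleanup(curl);
    curl_global_cleanup();
    return rc == CURLE_OK ? 0 : 1;
}
```

Run it once against a file hosted on the dedicated server and once against a known-fast host, to separate the server from the path in between.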
A lot depends on what the 3MB is. Serving up 1.5 MB/s of static data is way, way, way within the bounds of even the weakest server.
Perhaps GoDaddy does bandwidth throttling? 60 MB of downloads every 2 seconds might trigger some sort of bandwidth protection (either to protect their service, or to protect you from being overcharged, or both).
Check netspeed.stanford.edu from the dedicated server and see what your inbound and outbound traffic is like.
Also make sure your ISP is not limiting you to 10 Mbps (GoDaddy by default limits to 10 Mbps and will set it to 100 Mbps on request).
When running any kind of server under load there are several resources that one would like to monitor to make sure that the server is healthy. This is specifically true when testing the system under load.
Some examples for this would be CPU utilization, memory usage, and perhaps disk space.
What other resource should I be monitoring, and what tools are available to do so?
As many as you can afford to collect, and can then graph/understand/look at. Monitoring resources is useful not only for capacity planning, but also for anomaly detection, and anomaly detection significantly helps your ability to detect security events.
You have a decent start with your basic graphs. I'd want to also monitor the number of threads, number of connections, network I/O, disk I/O, page faults (arguably this is related to memory usage), context switches.
I really like munin for graphing things related to hosts.
I use Zabbix extensively in production, which comes with a stack of useful defaults. Some examples of the sorts of things we've configured it to monitor:
Network usage
CPU usage (% user,system,nice times)
Load averages (1m, 5m, 15m)
RAM usage (real, swap, shm)
Disc throughput
Active connections (by port number)
Number of processes (by process type)
Ping time from remote location
Time to SSL certificate expiry
MySQL internals (query cache usage, num temporary tables in RAM and on disc, etc)
Anything you can monitor with Zabbix, you can also attach triggers to - so it can restart failed services; or page you to alert about problems.
Collect the data now, before performance becomes an issue. When it does, you'll be glad of the historical baselines, and the fact you'll be able to show what date and time problems started happening for when you need to hunt down and punish exactly which developer made bad changes :)
I ended up using dstat which is vmstat's nicer looking cousin.
This will show most everything you need to know about a machine's health,
including:
CPU
Disk
Memory
Network
Swap
Run "df -h" to make sure that no partition runs full, which can lead to all kinds of funky problems. Watching the syslog is of course also useful; for that I recommend installing "logwatch" (Logwatch Website) on your server, which sends you an email if weird things start showing up in your syslog.
Cacti is a good web-based monitoring/graphing solution. Very complete, very easy to use, with a large userbase including many large Enterprise-level installations.
If you want more 'alerting' and less 'graphing', check out nagios.
As for 'what to monitor', you want to monitor systems at both the system and application level, so yes: network/memory/disk i/o, interrupts and such over the system level. The application level gets more specific, so a webserver might measure hits/second, errors/second (non-200 responses), etc and a database might measure queries/second, average query fulfillment time, etc.
Beware the aforementioned slow query log in MySQL. It should only be used when trying to figure out why some queries are slow. It has the side effect of making ALL your queries slow while it's enabled. :P It's intended for debugging, not logging.
Think 'passive monitoring' whenever possible. For instance, sniff the network traffic rather than monitor it from your server -- have another machine watch the packets fly back and forth and record statistics about them.
(By the way, that's one of my favorites -- if you watch connections being established and note when they end, you can find a lot of data about slow queries or slow anything else, without putting any load on the server you care about.)
In addition to top and auth.log, I often look at mtop, and I enable MySQL's slow query log and watch mysqldumpslow.
I also use Nagios to monitor CPU, Memory, and logged in users (on a VPS or dedicated server). That last lets me know when someone other than me has logged in.
Network, of course :) Use MRTG to get some nice bandwidth graphs; they're just pretty most of the time... until a spammer finds a hole in your security and the traffic suddenly increases.
Nagios is good for alerting, as mentioned, and is easy to get set up. You can then use the MRTG plugin to get alerts for your network traffic too.
I also recommend ntop as it shows where your network traffic is going.
A good link to get you going with Munin and Monit: link text
I typically watch top and tail -f /var/log/auth.log.