Avahi Hostname Resolution: Is it caching somewhere? - linux

I am using Fedora 18 with the avahi command line tools (version 0.6.31)
I use avahi-resolve-host-name to discover the IP address of units on my subnet, for testing purposes during development. I monitor the request and response with Wireshark. After one successful request and response, no further requests show up on Wireshark, but the tool still returns an IP address. Is it possible the computer/avahi daemon/something else is 'caching' the result?
The Question: I wish to send out the request packet with EVERY CALL of avahi-resolve-host-name. Is this possible?
The Reason: I'm getting 'false positives' so to speak. I try to resolve 'test1.local', and I am getting a resulting IP, but the unit is no longer located at this IP. I want the request sent every time so I can avoid seeing units at incorrect IP addresses.

I see that I'm a bit late to answer your question but I'm going to leave a generic answer in case someone else stumbles upon this.
My answer is based on avahi-0.6.32_rc.
Is it possible the computer/avahi daemon/something else is 'caching' the result?
Yes, avahi-daemon is caching lookup results. While this doesn't seem to be explicitly listed in features, the avahi-daemon(8) manpage tips it:
The daemon [...] provides two IPC APIs for local programs to make use of the mDNS record cache the avahi-daemon maintains.
I wish to send out the request packet with EVERY CALL of avahi-resolve-host-name. Is this possible?
Yes, it is. The relevant option is cache-entries-max (from avahi-daemon.conf(5)):
cache-entries-max= Takes an unsigned integer specifying how many resource records are cached per interface. Bigger values allow mDNS work correctly in large LANs but also increase memory consumption.
To achieve the desired effect, you can simply set:
cache-entries-max=0
This will disable the caching entirely and force avahi-daemon to reissue the MDNS packets on every request, therefore making it possible for you to monitor them.
However, I should note here that this will also render avahi pretty much useless for normal use. While avahi-daemon will be issuing lookup packets, it will be unable to store the results and every call of avahi-resolve-host-name (as well as other command-line tools, nss-mdns, D-Bus APIā€¦) will fail.

I just stumbled upon this problem myself and found solution that doesn't require changing the config. It seems that simply killing the daemon (avahi-daemon --kill) flushes the cache. I'm on Ubuntu 18.04 and the daemon is restarted automatically. If on some other distro it isn't running after being killed, it can be restarted with avahi-daemon --daemonize.
Note that root is needed to kill avahi daemon, so this might not be the best option in some cases.

Related

Except from memory and CPU leaks, what will be reasons for Node.js server might go went down?

I have a Node.js (Express.js) server for my React.js website as BFF. I use Node.js for SSR, proxying some request and cache some pages in Redis. In last time I found that my server time to time went down. I suggest an uptime is about 2 days. After restart, all ok, then response time growth from hour to hour. I have resource monitoring at this server, and I see that server don't have problems with RAM or CPU. It used about 30% of RAM and 20% of CPU.
I regret to say it's a big production site and I can't make minimal reproducible example, cause i don't know where is reason of these error :(
Except are memory and CPU leaks, what will be reasons for Node.js server might go went down?
I need at least direction to search.
UPDATE1:
"went down" - its when kubernetes kills container due 3 failed life checks (GET request to a root / of website)
My site don't use any BD connection but call lots of 3rd party API's. About 6 API requests due one GET/ request from browser
UPDATE2:
Thx. To your answers, guys.
To understand what happend inside my GET/ request, i'm add open-telemetry into my server. In longtime and timeout GET/ requests i saw long API requests with very big tcp.connect and tls.connect.
I think it happens due lack of connections or something about that. I think Mostafa Nazari is right.
I create patch and apply them within the next couple of days, and then will say if problem gone
I solve problem.
It really was lack of connections. I add reusing node-fetch connection due keepAlive and a lot of cache for saving connections. And its works.
Thanks for all your answers. They all right, but most helpful thing was added open-telemetry to my server to understand what exactly happens inside request.
For other people with these problems, I'm strongly recommended as first step, add telemetry to your project.
https://opentelemetry.io/
PS: i can't mark two replies as answer. Joe have most detailed and Mostafa Nazari most relevant to my problem. They both may be "best answers".
Tnx for help, guys.
Gradual growth of response time suggest some kind of leak.
If CPU and memory consumption is excluded, another potentially limiting resources include:
File descriptors - when your server forgets to close files. Monitor for number of files in /proc//fd/* to confirm this. See what those files are, find which code misbehaves.
Directory listing - even temporary directory holding a lot of files will take some time to scan, and if your application is not removing some temporary files and lists them - you will be in trouble quickly.
Zombie processes - just monitor total number of processes on the server.
Firewall rules (some docker network magic may in theory cause this on host system) - monitor length of output of "iptables -L" or "iptables-save" or equivalent on modern kernels. Rare condition.
Memory fragmentation - this may happen in languages with garbage collection, but often leaves traces with something like "Can not allocate memory" in logs. Rare condition, hard to fix. Export some health metrics and make your k8s restart your pod preemptively.
Application bugs/implementation problems. This really depends on internal logic - what is going on inside the app. There may be some data structure that gets filled in with data as time goes by in some tricky way, becoming O(N) instead of O(1). Really hard to trace down, unless you have managed to reproduce the condition in lab/test environment.
API calls from frontend shift to shorter, but more CPU-hungry ones. Monitor distribution of API call types over time.
Here are some of the many possibilities of why your server may go down:
Memory leaks The server may eventually fail if a Node.js application is leaking memory, as you stated in your post above. This may occur if the application keeps adding new objects to the memory without appropriately cleaning up.
Unhandled exceptions The server may crash if an exception is thrown in the application code and is not caught. To avoid this from happening, ensure that all exceptions are handled properly.
Third-party libraries If the application uses any third-party libraries, the server may experience problems as a result. Before using them, consider examining their resource usage, versions, or updates.
Network Connection The server's network connection may have issues if the server is sending a lot of queries to third-party APIs or if the connection is unstable. Verify that the server is handling connections, timeouts, and retries appropriately.
Connection to the Database Even though your server doesn't use any BD connections, it's a good idea to look for any stale connections to databases that could be problematic.
High Volumes of Traffic The server may experience performance issues if it is receiving a lot of traffic. Make sure the server is set up appropriately to handle a lot of traffic, making use of load balancing, caching, and other speed enhancement methods. Cloudflare is always a good option ;)
Concurrent Requests Performance problems may arise if the server is managing a lot of concurrent requests. Check to see if the server is set up correctly to handle several requests at once, using tools like a connection pool, a thread pool, or other concurrency management strategies.
(Credit goes to my System Analysis and Design course slides)
With any incoming/outgoing web requests, 2 File Descriptors will be acquired. as there is a limit on number of FDs, OS does not let new Socket to be opened, this situation cause "Timeout Error" on clients. you can easily check number of open FDs by sudo ls -la /proc/_PID_/fd/ | tail -n +4 | wc -l where _PID_ is nodejs PID, if this value is rising, you have connection leak issue.
I guess you need to do the following to prevent Connection Leak:
make sure you are closing outgoing API call Http Connection (it depends on how you are opening them, some libraries manage this and you just need to config them)
cache your outgoing API call (if it is possible) to reduce API call
for your outgoing API call, use Connection pool, this would manage number of open HttpConnection, reuse already-opened connection and ...
review your code, so that you can serve a request faster than now (for example make your API call more parallel instead of await or nested call). anything you do to make your response faster, is good for preventing this situation
I solve problem. It really was lack of connections. I add reusing node-fetch connection due keepAlive and a lot of cache for saving connections. And its works.
Thanks for all your answers. They all right, but most helpful thing was added open-telemetry to my server to understand what exactly happens inside request.
For other people with these problems, I'm strongly recommended as first step, add telemetry to your project.
https://opentelemetry.io/

FreeRadius in combination with a vulnerability scan / software status check

What i have:
I am running a freeradius server fully configured of how i need it to be. Everything works just fine right now.
What i need:
I need the radius to put the devices in a seperate vlan before authentication and to run a vulnerability scan (nessus / openvas etc) on the devices in this vlan to check for software status ( antivirus etc. )
if the device passes the test the authentication should be done normaly.
if it fails it should be put into a third ( fourth if you count the unauth-vid ) vlan.
can someone tell me if this is doable in freeradius ?
thanks in advance for your answers
Yes. But this is a very broad question and is dependent on the networking equipment being used. I'll give you an overview of how I'd design such a system.
In general, you'll have an easier time if you can use the same DHCP server/IP range for your NAC and full access VLAN. That means you don't have to signal the higher networking layers in the client that there's been a state change, you can swap out VLANs behind the scenes to change what they can access.
You'd set up a database with an entry for each client. This doesn't have to be pre-populated, it could be populated during the first auth attempt. Part of each client entry would be a status field detailing when they last completed NAC.
You'd also need an accounting database, to store information about where each client is connected to the network.
If the client had never completed NAC checks before, you'd assign the client to the NAC VLAN, and signal your NAC processes to start interrogating it.
FreeRADIUS can act as both a RADIUS and DHCPv4 server, so you'd probably do signal the NAC process from the DHCPv4 side because then you'd know what IP the client received.
Binding the RADIUS and DHCPv4 sides can be done in a couple of ways. The most obvious is MAC, another common way is NAS/Port ID using the accounting table.
Once the NAC checks had completed, you'd have the NAC process write out a receipt in detail file format, and have that read back in by a detail file listener (there are examples of this in sites-available/ in the 'decoupled-accounting' virtual server files). When reading those entries back in, you'd change the state in the database, and send a CoA packet to the switch using information from the accounting database to identify the client. This would flip the VLAN and allow them to the standard set of networking resources.
I know this is very high level, documenting it properly would probably exceed StackOverflow's character limit. If you need more help with this, I suggest you research what I've described above and then start asking the RADIUS related questions on the FreeRADIUS user's mailing list https://freeradius.org/support/.

Weird Tomcat outage, possibly related to maxConnections

In my company we experienced a serious problem today: our production server went down. Most people accessing our software via a browser were unable to get a connection, however people who had already been using the software were able to continue using it. Even our hot standby server was unable to communicate with the production server, which it does using HTTP, not even going out to the broader internet. The whole time the server was accessible via ping and ssh, and in fact was quite underloaded - it's normally running at 5% CPU load and it was even lower at this time. We do almost no disk i/o.
A few days after the problem started we have a new variation: port 443 (HTTPS) is responding but port 80 stopped responding. The server load is very low. Immediately after restarting tomcat, port 80 started responding again.
We're using tomcat7, with maxThreads="200", and using maxConnections=10000. We serve all data out of main memory, so each HTTP request completes very quickly, but we have a large number of users doing very simple interactions (this is high school subject selection). But it seems very unlikely we would have 10,000 users all with their browser open on our page at the same time.
My question has several parts:
Is it likely that the "maxConnections" parameter is the cause of our woes?
Is there any reason not to set "maxConnections" to a ridiculously high value e.g. 100,000? (i.e. what's the cost of doing so?)
Does tomcat output a warning message anywhere once it hits the "maxConnections" message? (We didn't notice anything).
Is it possible there's an OS limit we're hitting? We're using CentOS 6.4 (Linux) and "ulimit -f" says "unlimited". (Do firewalls understand the concept of Tcp/Ip connections? Could there be a limit elsewhere?)
What happens when tomcat hits the "maxConnections" limit? Does it try to close down some inactive connections? If not, why not? I don't like the idea that our server can be held to ransom by people having their browsers on it, sending the keep-alive's to keep the connection open.
But the main question is, "How do we fix our server?"
More info as requested by Stefan and Sharpy:
Our clients communicate directly with this server
TCP connections were in some cases immediately refused and in other cases timed out
The problem is evident even connecting my browser to the server within the network, or with the hot standby server - also in the same network - unable to do database replication messages which normally happens over HTTP
IPTables - yes, IPTables6 - I don't think so. Anyway, there's nothing between my browser and the server when I test after noticing the problem.
More info:
It really looked like we had solved the problem when we realised we were using the default Tomcat7 setting of BIO, which has one thread per connection, and we had maxThreads=200. In fact 'netstat -an' showed about 297 connections, which matches 200 + queue of 100. So we changed this to NIO and restarted tomcat. Unfortunately the same problem occurred the following day. It's possible we misconfigured the server.xml.
The server.xml and extract from catalina.out is available here:
https://www.dropbox.com/sh/sxgd0fbzyvuldy7/AACZWoBKXNKfXjsSmkgkVgW_a?dl=0
More info:
I did a load test. I'm able to create 500 connections from my development laptop, and do an HTTP GET 3 times on each, without any problem. Unless my load test is invalid (the Java class is also in the above link).
It's hard to tell for sure without hands-on debugging but one of the first things I would check would be the file descriptor limit (that's ulimit -n). TCP connections consume file descriptors, and depending on which implementation is in use, nio connections that do polling using SelectableChannel may eat several file descriptors per open socket.
To check if this is the cause:
Find Tomcat PIDs using ps
Check the ulimit the process runs with: cat /proc/<PID>/limits | fgrep 'open files'
Check how many descriptors are actually in use: ls /proc/<PID>/fd | wc -l
If the number of used descriptors is significantly lower than the limit, something else is the cause of your problem. But if it is equal or very close to the limit, it's this limit which is causing issues. In this case you should increase the limit in /etc/security/limits.conf for the user with whose account Tomcat is running and restart the process from a newly opened shell, check using /proc/<PID>/limits if the new limit is actually used, and see if Tomcat's behavior is improved.
While I don't have a direct answer to solve your problem, I'd like to offer my methods to find what's wrong.
Intuitively there are 3 assumptions:
If your clients hold their connections and never release, it is quite possible your server hits the max connection limit even there is no communications.
The non-responding state can also be reached via various ways such as bugs in the server-side code.
The hardware conditions should not be ignored.
To locate the cause of this problem, you'd better try to replay the scenario in a testing environment. Perform more comprehensive tests and record more detailed logs, including but not limited:
Unit tests, esp. logic blocks using transactions, threading and synchronizations.
Stress-oriented tests. Try to simulate all the user behaviors you can come up with and their combinations and test them in a massive batch mode. (ref)
More specified Logging. Trace client behaviors and analysis what happened exactly before the server stopped responding.
Replace a server machine and see if it will still happen.
The short answer:
Use the NIO connector instead of the default BIO connector
Set "maxConnections" to something suitable e.g. 10,000
Encourage users to use HTTPS so that intermediate proxy servers can't turn 100 page requests into 100 tcp connections.
Check for threads hanging due to deadlock problems, e.g. with a stack dump (kill -3)
(If applicable and if you're not already doing this, write your client app to use the one connection for multiple page requests).
The long answer:
We were using the BIO connector instead of NIO connector. The difference between the two is that BIO is "one thread per connection" and NIO is "one thread can service many connections". So increasing "maxConnections" was irrelevant if we didn't also increase "maxThreads", which we didn't, because we didn't understand the BIO/NIO difference.
To change it to NIO, put this in the element in server.xml:
protocol="org.apache.coyote.http11.Http11NioProtocol"
From what I've read, there's no benefit to using BIO so I don't know why it's the default. We were only using it because it was the default and we assumed the default settings were reasonable and we didn't want to become experts in tomcat tuning to the extent that we now have.
HOWEVER: Even after making this change, we had a similar occurrence: on the same day, HTTPS became unresponsive even while HTTP was working, and then a little later the opposite occurred. Which was a bit depressing. We checked in 'catalina.out' that in fact the NIO connector was being used, and it was. So we began a long period of analysing 'netstat' and wireshark. We noticed some periods of high spikes in the number of connections - in one case up to 900 connections when the baseline was around 70. These spikes occurred when we synchronised our databases between the main production server and the "appliances" we install at each customer site (schools). The more we did the synchronisation, the more we caused outages, which caused us to do even more synchronisations in a downward spiral.
What seems to be happening is that the NSW Education Department proxy server splits our database synchronisation traffic into multiple connections so that 1000 page requests become 1000 connections, and furthermore they are not closed properly until the TCP 4 minute timeout. The proxy server was only able to do this because we were using HTTP. The reason they do this is presumably load balancing - they thought by splitting the page requests across their 4 servers, they'd get better load balancing. When we switched to HTTPS, they are unable to do this and are forced to use just one connection. So that particular problem is eliminated - we no longer see a burst in the number of connections.
People have suggested increasing "maxThreads". In fact this would have improved things but this is not the 'proper' solution - we had the default of 200, but at any given time, hardly any of these were doing anything, in fact hardly any of these were even allocated to page requests.
I think you need to debug the application using Apache JMeter for number of connection and use Jconsole or Zabbix to look for heap space or thread dump for tomcat server.
Nio Connector of Apache tomcat can have maximum connections of 10000 but I don't think thats a good idea to provide that much connection to one instance of tomcat better way to do this is to run multiple instance of tomcat.
In my view best way for Production server: To Run Apache http server in front and point your tomcat instance to that http server using AJP connector.
Hope this helps.
Are you absolutely sure you're not hitting the maxThreads limit? Have you tried changing it?
These days browsers limit simultaneous connections to a max of 4 per hostname/ip, so if you have 50 simultaneous browsers, you could easily hit that limit. Although hopefully your webapp responds quickly enough to handle this. Long polling has become popular these days (until websockets are more prevalent), so you may have 200 long polls.
Another cause could be if you use HTTP[S] for app-to-app communication (that is, no browser involved). Sometimes app writers are sloppy and create new connections for performing multiple tasks in parallel, causing TCP and HTTP overhead. Double check that you are not getting an inflood of requests. Log files can usually help you on this, or you can use wireshark to count the number of HTTP requests or HTTP[S] connections. If possible, modify your API to handle multiple API calls in one HTTP request.
Related to the last one, if you have many HTTP/1.1 requests going across one connection, and intermediate proxy may be splitting them into multiple connections for load balancing purposes. Sounds crazy I know, but I've seen it happen.
Lastly, some crawl bots ignore the crawl delay set in robots.txt. Again, log files and/or wireshark can help you determine this.
Overall, run more experiments with more changes. maxThreads, https, etc. before jumping to conclusions with maxConnections.

Setting timeout on tcp

i was wondering if is there any variable that allows me to modifty the value of tcp's timeout. I'm trying to send packages between two virtual machines.
Thank you.
The setsockopt function allows controlling a number of configuration parameters, including timeouts.
You can try:
Linux increasing or decreasing TCP sockets timeouts
There are a load of parameters that you can modify - see http://man7.org/linux/man-pages/man7/tcp.7.html.
As you question is vague it is difficult to tell you what ones you can modify

jmdns constants

I have been using JmDNSfor a while now. I could use it for the purposes of my application. Every thing works fine for me (I have "announcer" machines and a "listening" one, and this latter machine can see the other devices and discover their information).
It is true that I've managed to work with the JmDNS jar file, but I did it without totally understanding what is going on in this file. Now I want to know about the effect of using JmDNS for the network traffic. I have consulted the documentation but couldn't manage to discover the signification of the constants, like QUERY_WAIT_INTERVAL, PROBE_THROTTLE_COUNT, etc.
I want to know the default frequency with which the announcer machine sends service announcements.
I also noticed DNS_TTL that was described as follows: "The default TTL is set to 1 hour by the standard, so a record is going to stay in the cache of any listening machine for an hour without need to ping the server again".
I understand that it is the Time To Live of the service to stay in the DNS cache, but I couldn't understand what is intended by "purge the server". Does it mean that the listener has to ask the announcer about a service when the DNS_TTL expires? if so, why do need to have the announcer announce its service every 1s (ANNOUNCE_WAIT_INTERVAL = 1000 milliseconds)?
I am so confused.
The way that the Domain Name System works is basically very simple. Fundamentally it's a tree-like system which starts with the root nameservers. These then delegate name space out to the next level. That level in turn delegates out the next level and so on. For example . is the root, which delegates to .com., which can then delegate out example.com.. (Yes, that trailing . is actually part of the domain name, though you almost never have to use it or see it.
When you load a web page there are usually hundreds of elements that load. This is every image, every JS file, every CSS file, etc. To have your computer request that same domain to IP resolution that many times for one page would make load time unbearable and also create massive unnecessary traffic on the nameserver. Therefore DNS caches. The TTL is how long it caches for. If it's set to 24 hours then when you get an answer for that resolution, that's how long you can hold on to it for before you make another request.
The announcing that you're talking about is the nameserver basically announcing that it's responsible for those domains. You want it constantly stating that so other nameservers know where to go to get the correct (authoritative) data.
Throttling is a term used in many fields and applications and means you're limiting your traffic flow so it doesn't get overloaded.
DNS is actually quite simple to understand once you get the basics down.
Here are a few links that could help you get a better grip of it all:
Few paragraphs of basic DNS info
About.com guide
A few definitions
Relatively simple and informative PDF from IETF

Resources