We have a problem with our Qt-based production server for our business application. As the total number of SSL connections grows over time, some clients fail to connect at all.
QSslSocket::waitForEncrypted() starts to fail with no QSslError, regardless of what timeout was set. The problem starts to kick in once there are more than ~100 active connections.
So there are ~170 connections, twice as many threads, and "lsof" reports a little more than 1000 open files (we had to increase the file "ulimit" for that).
It does not look like a client-side problem, since the IPs that fail and reconnect change over time (some "leap in" successfully, but then others don't).
As mentioned, this happens on Ubuntu Server (Zentyal 10.04 and "vanilla" 9.10), but does NOT happen on Ubuntu Desktop 9.10.
Everything runs inside VMware ESX 4.1, and the systems were tested with the same resources attached. System load stays below 1.0. The daemon runs with root permissions.
It looks like it has something to do with "server"/"desktop" kernel or other configuration differences, but I can't tell what exactly could prevent an SSL connection from completing its handshake in the "server editions".
We are using Qt 4.5.3 compiled by ourselves.
EDIT: after all, it's the same on any Linux I tried. It feels like some kind of per-process socket limit, which is about 1016 - other_opened_files. I'll try to create a new question about that.
EDIT 2: It's a select() and FD_SETSIZE limit problem...
The problem is that Qt uses select(), which is limited by the FD_SETSIZE macro in how many sockets/files it can watch. I had to change the FD_SETSIZE value inside /usr/include/bits/typesizes.h before compiling libQtNetwork and libQtCore.
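To make the limit concrete, here is a minimal Python sketch (not Qt code) that reproduces the same ceiling on Linux. It assumes FD_SETSIZE is 1024, the usual glibc value, and the function name probe_select_limit is made up for illustration. It duplicates a pipe onto a descriptor numbered at or above 1024, then shows that select() refuses it while poll() handles it.

```python
import fcntl
import os
import resource
import select

FD_SETSIZE = 1024  # assumption: the usual glibc value; not read from the system


def probe_select_limit():
    """Return (select_rejected, poll_worked), or None if no high fd can be made."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    wanted = FD_SETSIZE + 64
    raised = False
    if soft != resource.RLIM_INFINITY and soft < wanted:
        if hard != resource.RLIM_INFINITY and hard < wanted:
            return None  # hard limit too low to get past FD_SETSIZE
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (wanted, hard))
        except (ValueError, OSError):
            return None
        raised = True

    read_end, write_end = os.pipe()
    high_fd = None
    try:
        # Duplicate the pipe onto the lowest free descriptor >= FD_SETSIZE.
        high_fd = fcntl.fcntl(read_end, fcntl.F_DUPFD, FD_SETSIZE)
        os.write(write_end, b"x")  # make the pipe readable

        try:
            select.select([high_fd], [], [], 0)
            select_rejected = False
        except ValueError:
            # Python refuses descriptors that do not fit into an fd_set.
            select_rejected = True

        poller = select.poll()
        poller.register(high_fd, select.POLLIN)
        poll_worked = any(fd == high_fd for fd, _ in poller.poll(1000))
        return select_rejected, poll_worked
    except OSError:
        return None
    finally:
        for fd in (read_end, write_end, high_fd):
            if fd is not None:
                os.close(fd)
        if raised:
            resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))


print(probe_select_limit())
```

The point of the sketch is that the limit is about descriptor numbers, not about SSL: once a process has enough files and sockets open, new sockets get numbers that a select()-based event loop cannot watch.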
2.5 months ago, I was running a website on a Linux server to do a user study on 3 variations of a tool. All 3 variations ran on the same website. While I was conducting my user study, the website (i.e., the process hosting the website) crashed. In my sleep-deprived state, I unfortunately did not record when the crash happened. However, I now need to know a) when the crash happened, and b) for how long the website was down until I brought it back up. I only have a rough timeframe for when the crash happened and for how long it was down, but I need to pinpoint this information as precisely as possible to do some time-on-task analyses with my user study data.
The server runs Ubuntu 16.04.4 LTS (GNU/Linux 4.4.0-165-generic x86_64) and has been minimally set up to run our website. As such, it is unlikely that any utilities aside from those that came with the OS have been installed. Similarly, no additional setup has likely been done. For example, I tried looking at the command history in the hope that HISTTIMEFORMAT had previously been set, so that I could see timestamps. This was not the case: while I can now see timestamps for commands, setting HISTTIMEFORMAT is not retroactive, meaning I can't get accurate timestamps for the commands I ran 2.5 months ago. All that being said, if you have an idea that you think might work, I'm willing to try it (as long as it doesn't break our server)!
It is also worth mentioning that I currently do not know if it's possible to get a remote desktop or something of the like; I've just been ssh'ing in and using the terminal to interact with the server.
I've been bouncing ideas off friends and colleagues, and we all feel that there must be SOMETHING we could use to pinpoint when the server went down (e.g., network activity logs showing spikes around the time the user study began and when the website was revived, a log of previous/no longer running processes, etc.). Unfortunately, none of us knows enough about Linux logs or commands to really dig deep into this very specific issue.
In summary:
I need a timestamp for either when the website crashed or when it was revived. It would be nice to have both (or otherwise determine how long the website was down), but this is not strictly necessary.
I'm guessing only a "native" Linux command will be useful, since nothing new/special has been installed on our server. Otherwise, any additional command/tool/utility would have to work retroactively.
It may or may not be possible to get a remote desktop working with the server (e.g., to use some GUI tool to help get some information).
My colleagues and I have that sense of "there must be SOMETHING we could use" among the various logs or system information, such as network activity, process start times, etc., but none of us knows enough about Linux to dig deep without some help.
Any ideas for what I can try to help figure out at least when the website crashed (if not also for how long it was down)?
A friend of mine pointed me to the journalctl command, which apparently keeps its own timestamped log entries independently of HISTTIMEFORMAT; on my server they went as far back as October 7. They contained enough information for me to determine both when my Node.js server initially went down and when I revived it.
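For anyone in the same spot, a rough Python sketch of how the outage window could be pulled out of saved journal output. It assumes the lines were captured with journalctl -o short-iso (so each line starts with an ISO timestamp) and that the service logged fairly regularly, so that the longest silence corresponds to the outage. Both are assumptions, and the sample lines below are invented, not taken from my server.

```python
from datetime import datetime


def largest_gap(lines):
    """Return (last_before_gap, first_after_gap, gap) or None if too few lines."""
    stamps = []
    for line in lines:
        token = line.split(" ", 1)[0]
        try:
            stamps.append(datetime.strptime(token, "%Y-%m-%dT%H:%M:%S%z"))
        except ValueError:
            continue  # skip header lines that do not start with a timestamp
    stamps.sort()
    if len(stamps) < 2:
        return None
    # Compare every entry with its successor and keep the widest pair.
    before, after = max(zip(stamps, stamps[1:]), key=lambda p: p[1] - p[0])
    return before, after, after - before


# Hypothetical journal excerpt, for illustration only.
sample = [
    "-- Logs begin at Mon 2019-10-07 13:58:02 UTC. --",
    "2019-10-07T14:00:00+0000 host node[812]: GET /study 200",
    "2019-10-07T14:00:30+0000 host node[812]: GET /study 200",
    "2019-10-07T14:45:30+0000 host node[2211]: server listening",
    "2019-10-07T14:46:00+0000 host node[2211]: GET /study 200",
]

down, up, gap = largest_gap(sample)
print("last entry before gap:", down.isoformat())
print("first entry after gap:", up.isoformat())
print("gap:", gap)
```

The last entry before the gap is only an upper bound on how long the process was still alive, so it is worth reading the surrounding lines by hand as well.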
So, I have a collection of Windows Server 2016 virtual machines that are used to run some tests in pairs. To perform these tests, I copy a selection of scripts and files from the network on to the machine, before performing the tests.
I'm basically using a selection of scripts that have existed around here since before my time, and whilst I would like to use other methods, so much of our infrastructure relies on these scripts that overhauling the system would be a colossal task.
First up, I sort out the mapped drives with
net use X: \\network\location1 /user:domain\user password
net use Y: \\network\location2 /user:domain\user password
and so on
Soon after, I use rsync to copy files from a location in /cygdrive/y/somewhere to /cygdrive/c/somewhere_else
During the rsync, I get errors that "files have vanished" (I'm currently unable to post the exact error; I will edit this later to include it). When I check what's currently in the /cygdrive directory, all I see is /cygdrive/c, and everything else has disappeared.
I've tried making a symbolic link to /cygdrive/y in a different location, I've tried including /persistent:yes on the net use command, and I've changed the power settings on the network card so it does not sleep. None of these work.
I'm now looking into the settings for the virtual machines themselves, but I have some doubts, as we have other Windows virtual machines that do not seem to have this issue.
Has anyone heard of anything similar and/or knows of a decent method to troubleshoot this?
Right, so I've been working on this all day and finally noticed a positive change, but since my systems are in VMware's vCloud, this may not work for everyone. It was simply a matter of turning the VM off and upgrading the Virtual Hardware Version to the latest version. I have noticed, though, that upon a restart, one of the first messages that comes up mentions that the computer is "disabling group policies".
I did a bit of research into this and found out that Windows 8 and 10 (no mention of any Windows Server machines) both automatically update Group Policies in the background, disconnecting and reconnecting mapped drives to recreate them.
It's possible that changing the Group Policy drive action from "recreate" to "update" would fix this issue, and that the Virtual Hardware update happened to resolve it in a similar manner.
I am experiencing some very strange behavior with Oracle. Maybe somebody can help me; let me summarize it real quick:
My OS of choice is Debian Linux, and I am using Oracle XE 11.0.2.0. On Linux startup, I run a script file which is located under /etc/init.d/. I added the following line to make Oracle start on system start:
/etc/init.d/oracle-xe start
Right after this line, I run my application from the script. My application relies heavily on the Oracle DB, so I assumed that once Oracle starts, my application will run OK. Unfortunately, my assumption seems to be wrong. Here's why: I have a similar setup on 3 machines, and on 2 of them I see weird behavior. After system start, the Oracle DB does not respond to connection requests, even though the oracle-xe start command has finished executing.
My observation is the following: if I run my application right after oracle-xe start is executed, I receive ORA-12505 errors for at least a minute: "TNS listener does not currently know of SID". After a minute everything stabilizes, and my application starts working OK. One minute without a DB on system startup is not acceptable for me performance-wise, so I am trying to solve this problem.
Surprisingly, it does not happen on the third Linux box I have here, and I am not quite sure what is different on that box. I compared the ora files but couldn't find any difference; it seems like a wild goose chase...
I would be so grateful if anybody who has experienced and solved this problem before would share that valuable solution with me.
I think I found the problem. It looks like I am starting the oracle-xe instance before the network interfaces are assigned an IP address, and in that case it takes some time before Oracle accepts connections. Avoiding that would require me to set a static IP on the Linux boxes, which is something I don't want. Is there a solution that still lets me assign IP addresses later on?
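Independently of the boot ordering, the application could tolerate the ORA-12505 window by retrying its first connection instead of failing immediately. Below is a minimal Python sketch of that idea. It is not tied to any Oracle driver: connect is a hypothetical callable that either returns a connection or raises, and the timeout and interval values are placeholders.

```python
import time


def wait_until_ready(connect, timeout=120.0, interval=2.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Call connect() until it succeeds or `timeout` seconds have passed.

    Returns (connection, attempts). `connect` is a hypothetical callable
    supplied by the caller; real code should catch the driver's specific
    "listener does not know of SID" error instead of every Exception.
    """
    deadline = clock() + timeout
    attempts = 0
    while True:
        attempts += 1
        try:
            return connect(), attempts
        except Exception as exc:
            # Give up if another sleep would take us past the deadline.
            if clock() + interval > deadline:
                raise TimeoutError(
                    f"database not ready after {attempts} attempts") from exc
            sleep(interval)


# Illustration with a fake connector that fails twice, then succeeds.
_calls = {"n": 0}


def _fake_connect():
    _calls["n"] += 1
    if _calls["n"] < 3:
        raise ConnectionError("ORA-12505 (simulated)")
    return "connection"


print(wait_until_ready(_fake_connect, sleep=lambda seconds: None))
```

This does not make Oracle register its SID any faster; it only keeps the application from dying during the first minute.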
I have a pretty strange problem with Collectd. I'm not new to Collectd and have used it for a long time on CentOS-based boxes, but now we have Ubuntu 12.04 LTS boxes, and I have a really strange issue.
So, I'm using version 5.2 on Ubuntu 12.04 LTS. The two boxes reside on Rackspace (maybe important, but I'm not sure). The network plugin is configured using two local IPs, without any firewall in between and without any security (just to try to set up a simple client/server scenario).
On both servers collectd writes to the configured folders as it should, but on the server machine it doesn't write the data received from the client.
I troubleshot this with tcpdump, and I can clearly see UDP traffic and collectd data, including the hostname and plugin names from my client machine, being received on the server, but they are never flushed to the appropriate folder (configured in collectd). I'm also running everything as the root user, to avoid having to troubleshoot permissions.
Does anyone have any idea or similar experience with this? Or maybe some idea of what I could do to troubleshoot this, besides crawling the internet (I think I've clicked on every sensible link Google gave me in the last two days) and checking the network layer (which looks fine)?
And just a small note: exactly the same happened with the official 4.10.2 version from Ubuntu's repo. After trying to troubleshoot it for hours, I upgraded to version 5.
I'd suggest trying out the quite generic troubleshooting procedure based on the csv and logfile plugins, as described in this answer. As everything seems to be fine locally, follow this procedure on the server, activating only the network plugin (in addition to logfile, csv and possibly rrdtool).
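One more check that might be worth doing alongside that: tcpdump captures inbound packets before the local packet filter sees them, so seeing the datagrams in tcpdump does not prove they reach a user-space socket. The Python sketch below binds a plain UDP socket and reports what arrives. It assumes collectd's default network port 25826 and that collectd is stopped on the server while it runs; the helper names are made up.

```python
import socket

COLLECTD_PORT = 25826  # collectd's default network port; adjust if yours differs


def open_listener(bind_addr="0.0.0.0", port=COLLECTD_PORT):
    """Bind a UDP socket where the collectd server would normally listen."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((bind_addr, port))
    return sock


def drain(sock, count=1, timeout=10.0):
    """Collect up to `count` datagrams as (sender_ip, size) tuples."""
    received = []
    sock.settimeout(timeout)
    try:
        while len(received) < count:
            data, peer = sock.recvfrom(65535)
            received.append((peer[0], len(data)))
    except socket.timeout:
        pass  # nothing (more) arrived within the timeout
    return received


if __name__ == "__main__":
    with open_listener() as listener:
        print(drain(listener, count=5, timeout=30.0))
```

If this prints an empty list while tcpdump shows traffic, the datagrams are being dropped between the interface and the socket (filtering, wrong bind address), which would point away from collectd itself.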
So, having found no way of fixing this, I upgraded my Ubuntu to 12.04.2 LTS (3.2.0-24-virtual) and it just started working fine, without any intervention.
I've installed git on Ubuntu 11.04 and cloned a private repository on GitHub. Whenever I try to push to or pull from the repository, it takes about 30-60 seconds, even if there are no changes in the repository. When using the same repository on Windows 7, pull/push requests only take a few seconds. I can't figure out what is wrong.
I've run ssh -v git@github.com and it hangs right after this line:
debug1: SSH2_MSG_SERVICE_ACCEPT received
The step after the above line takes 30-60 seconds to complete, and then the remaining lines finish within a second. Here is the full output of ssh -vvv git@github.com: http://pastebin.com/LdY0EifW
I've already tried changing "GSSAPIAuthentication" to "no" and "UseDNS" to "no" in /etc/ssh/ssh_config. That didn't make any difference.
Any ideas?
It seems like the system might be running out of entropy.
SSH, like any other cryptographic application, needs some truly random numbers to be able to provide security. The Linux kernel normally gathers some randomness (entropy) from the precise timing of various events and makes it available via /dev/random, which ssh reads when it needs to create the session keys. On desktops there is usually enough entropy gathered, but if some other application also needs it, you might be running low, and then reading /dev/random will take a lot of time, because it waits for enough entropy to be collected.
=> Please verify by running strace ssh git@github.com whether it's actually waiting in a read from /dev/random. If it is, you have this problem.
If it's a server hosting any potentially sensitive data, you should probably equip it with a hardware random number generator (e.g. the "entropy key"). You can also try to change the random number generator settings to less secure ones (I believe there are some options that can be set via /proc), but only if the server does not host customer data or any sensitive company data.
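A quick way to look at the kernel's entropy estimate without strace is to read /proc/sys/kernel/random/entropy_avail. Here is a small Python sketch; the function name is made up, and it returns None on systems where the file is not available.

```python
from pathlib import Path

# Kernel's current entropy estimate, in bits (Linux only).
ENTROPY_FILE = Path("/proc/sys/kernel/random/entropy_avail")


def entropy_available(path=ENTROPY_FILE):
    """Return the kernel's entropy estimate as an int, or None if unreadable."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


print("entropy_avail:", entropy_available())
```

A value that stays very low while ssh hangs would support the entropy theory; a comfortably high value would suggest looking elsewhere.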
Edit: it looks more like a network problem somewhere.
I had a similar problem, although with all internet traffic and not just git push/pull. According to the solutions many people found for the same problem, the cause was either something with IPv6 or something with the driver:
So try these:
Disable IPv6 (see here). Note: only do this if you know you won't use IPv6, and remember that you did it, so that you can re-enable it later if needed.*
Blacklist your driver. Follow this question for more instructions.
For me, the second case was the solution.
* I'm not sure what they shipped with Ubuntu 11.04, but there appears to be something wrong with DNS lookups when the network is on IPv4 but IPv6 is enabled, which makes them take a long time.
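Before disabling anything, it may help to check whether name resolution is really the slow part. The Python sketch below times a lookup restricted to IPv4 and one restricted to IPv6 using socket.getaddrinfo. The function names are made up, and the host passed in at the bottom is just the one from the question.

```python
import socket
import time


def time_lookup(host, family):
    """Resolve `host` for one address family; return (succeeded, seconds)."""
    start = time.monotonic()
    try:
        socket.getaddrinfo(host, 22, family, socket.SOCK_STREAM)
        ok = True
    except socket.gaierror:
        ok = False  # no address of this family, or the resolver failed
    return ok, time.monotonic() - start


def compare_families(host):
    """Time IPv4-only and IPv6-only lookups of the same host."""
    return {
        "IPv4": time_lookup(host, socket.AF_INET),
        "IPv6": time_lookup(host, socket.AF_INET6),
    }


if __name__ == "__main__":
    for name, (ok, seconds) in compare_families("github.com").items():
        print(f"{name}: ok={ok} took {seconds:.2f}s")
```

If the IPv6 lookup takes tens of seconds while the IPv4 one returns at once, that matches the IPv6 explanation; if both are fast, the driver is the more likely suspect.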