Upgrading SLES 15.1 to 15.2 Causes Varnish to Fail

Very recently I ran an Online Migration update through YaST to take SUSE Linux Enterprise Server (SLES) from 15.1 to 15.2, and my main Linux stack ended up with the following versions:
SLES 15.2
Apache 2.4.43
MariaDB 10.4.17
PHP 7.4.6
Varnish 6.2.1
The preliminary tests showed no conflicts or issues prior to the upgrade, and it rebooted and came up just fine when it all completed. Upon checking everything afterwards, I noticed that the varnish.service (varnishd) had failed to start. I've never had an issue with Varnish not starting, whether it was SUSE Linux, CentOS, Ubuntu, etc. I thought at first my custom VCL file was causing issues, so I went with the default configuration file it ships with (/etc/varnish/vcl.conf) just to start fresh with the basics, but to no avail. The exact same issue happened.
Then I decided to take a shot and compile Varnish from source. Through YaST, I removed the varnish package and all of its configuration and service files, and then I downloaded the latest TAR archive (varnish-6.6.0.tgz) directly from https://varnish-cache.org/. After building and installing Varnish this way, ironically, the same issue happens when I try to start it.
With either one, compiled (v6.6.0) or packaged (v6.2.1), I get exactly the same error:
It reports "Child not responding to CLI, killed it.", then "CLI communication error (hdr)", and finally "Child died signal=6".
What's most puzzling is that, either way of setting up Varnish, it fails in the exact same manner. I suppose this would indicate that Varnish isn't the issue per se, but rather something within the server configuration? I've been through every forum on Varnish that I could find and have found nothing this specific. I have even tried to get it to start with different CLI parameters (timeout settings, pool delays, etc.) but it still won't start. Again, this is with the most basic/default configuration file loaded and nothing else:
# Marker to tell the VCL compiler that this VCL has been adapted to the
# new 4.0 format.
vcl 4.0;
# Default backend definition. Set this to point to your content server.
backend default {
    .host = "127.0.0.1";
    .port = "80";
}
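For reference, here's the kind of CLI-parameter variation I tried (the values are examples, not the exact ones I used):
varnishd -f /etc/varnish/vcl.conf -p cli_timeout=120 -p thread_pool_add_delay=2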
Now here's the ultimate kicker... I took another (Development) server, slicked it bare, and installed SLES 15.2 from scratch and everything, including Varnish, works! So something with the in-place upgrade is stopping Varnish somehow. I can't take the main (Production) SLES 15.2 server and start over with it like that, however, because of so many other things that are currently installed and configured on it.
I'm trying to get Varnish back up and started within the current upgraded environment, but nothing seems to be working. Also, there is nothing in the Varnish logs (/var/log/varnish/varnish.log) to give me any clue either.
I'm at a loss as to what to try or where to go next. I've even tried starting Varnish in debug mode (-d) and getting a child to start that way, and it's the exact same error.
And ultimately, I can't check for any panics because Varnish won't even start in the first place.
So to recap, literally all I did was run the in-place upgrade from SLES 15.1 to 15.2, rebooted when it was all done, and now all other services start fine except for Varnish (which worked perfectly on 15.1).
UPDATE #1: I tried to start Varnish with no VCL file and no backend (varnishd -b none), but it errored out. Then I simply substituted "none" with "localhost" and I'm right back to the same error as before.
UPDATE #2: Here is the output of the "strace -f varnishd" command.
StraceOutput.txt

VCL loop
This is a long shot, but can you please change the .port property in your backend to 8080 instead of 80? Just for testing.
Because if you start varnishd without an explicit -a, the standard listening port will be 80. But since your VCL file already connects to port 80 on localhost for its backend, you might end up in a loop.
I'm not saying the assert() that is triggered on your system is caused by this, but it's worth the attempt.
In older versions of Varnish, the standard port was 6081, but this has changed in recent versions.
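Concretely, the test would change the backend in the default vcl.conf shown above to:
backend default {
    .host = "127.0.0.1";
    .port = "8080";
}
Alternatively, passing an explicit listening address, such as varnishd -a :6081, should also rule out the overlap.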
What I am sure of is that the error is caused by a file descriptor that is not available, maybe one that has already been closed.
Please give it a shot, and let me know.
Debug mode
It is also possible to enable debug mode by adding the -d runtime parameter to your varnishd command.
Please give it a try; it increases the verbosity of the debug output.
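A minimal example, assuming your VCL lives at the default path:
varnishd -d -f /etc/varnish/vcl.conf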
Checking panics
Another thing you can do is run the following command to see if any panics occurred:
varnishadm panic.show
Trying out various runtime options
Apparently the error refers to the fact that it cannot load the VCL file.
Let's try running varnishd without a VCL file to see whether or not that's the problem.
Just try starting varnishd using the following command:
varnishd -b none
This command will start Varnish without a VCL file and without a backend. When you then try to access Varnish via HTTP, you should be getting an HTTP 503 error. That's not perfect, but at least we know that Varnish is capable of not crashing all the time.
Once that works, you can remove -b and add your -f parameter that refers to the VCL file.
If that also works, try playing around with the -s setting.
And so on, and so forth.
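The progression would look something like this (a sketch; the VCL path and storage size are only examples):
varnishd -b none
varnishd -f /etc/varnish/vcl.conf
varnishd -f /etc/varnish/vcl.conf -s malloc,256m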
Use packages
Other than that, the only advice I can give you is to install Varnish using the official packages on a supported operating system (Debian, Ubuntu, Fedora, CentOS, RHEL).
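For instance, on Debian or Ubuntu the distribution package installs with a single command (the Varnish version you get depends on the release):
sudo apt-get install varnish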

When checking the output of the requested strace command, I found this:
[pid 1129] mkdir("vcl_boot.1621874391.008263", 0755) = 0
[pid 1129] chown("vcl_boot.1621874391.008263", 465, 463) = 0
[pid 1129] setresuid(-1, 465, -1) = 0
[pid 1129] openat(AT_FDCWD, "vcl_boot.1621874391.008263/vgc.c", O_WRONLY|O_CREAT|O_TRUNC, 0640) = 5
[pid 1129] fchown(5, 0, 0) = -1 EPERM (Operation not permitted)
[pid 1129] geteuid() = 465
[pid 1129] close(5) = 0
[pid 1129] openat(AT_FDCWD, "vcl_boot.1621874391.008263/vgc.so", O_WRONLY|O_CREAT|O_TRUNC, 0640) = 5
[pid 1129] fchown(5, 0, 0) = -1 EPERM (Operation not permitted)
Varnishd tries to change the owner of at least two files, but isn't allowed to do so. I'm not sure about the details, but as a next step you could try to find these files (probably below /var/cache/varnish) and check the current permissions. Maybe they belong to a user which is not the user you're running varnishd with.
AFAIK the daemon is started as user root and then the process switches to an unprivileged user. This assumption brings us back to my previous question: are you running AppArmor or SELinux?
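Two quick checks, as a sketch (the cache path is an assumption based on the guess above):
ls -ln /var/cache/varnish          # who owns the vcl_boot.* directories?
sudo aa-status | grep -i varnish   # is an AppArmor profile confining varnishd?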

Related

nginx uWSGI connection to unix socket failed

I'm trying to connect a uWSGI Flask application on CentOS 7 with nginx. The nginx error log at /var/log/nginx/error.log gives:
2017/10/04 22:35:29 [crit] 24381#0: *54 connect() to unix:/var/www/html/CON29Application1/socket.sock failed
(13: Permission denied) while connecting to upstream, client: 80.44.138.51,
server: 188.226.174.121, request: "GET /favicon.ico HTTP/1.1", upstream: "uwsgi://unix:/var/www/html/CON29Application1/socket.sock:",
host: "188.226.174.121", referrer: "http://188.226.174.121/"
The uWSGI error log shows, I think, that uWSGI is running correctly:
WSGI app 0 (mountpoint='') ready in 1 seconds on interpreter 0x1a1ebd0 pid: 26364 (default app)
This is my first deployment on Linux, but I read another SO answer here: Nginx can't access a uWSGI unix socket on CentOS 7
That guy answered his own question and referred to a blog post on SELinux, http://axilleas.me/en/blog/2013/selinux-policy-for-nginx-and-gitlab-unix-socket-in-fedora-19/, saying SELinux was the problem. I don't really understand what is running where with SELinux, and the solution seems to involve altering "AVC" messages in the nginx audit.log; I'm getting in over my head!
As the blog post describes, I do get AVC messages mentioning denied write and nginx in /var/log/audit/audit.log:
type=AVC msg=audit(1507153878.777:559609): avc: denied { write } for pid=24381
comm="nginx" name="socket.sock" dev="vda1" ino=715975
scontext=system_u:system_r:httpd_t:s0 tcontext=system_u:object_r:var_t:s0 tclass=sock_file
But being a newbie, is there perhaps something simpler I did wrong that I can fix with chmod or chown? Thanks for any ideas.
Socket permissions:
ls -l socket.sock
srwxrwxrwx. 1 will nginx 0 Oct 4 17:02 socket.sock
Well, my SELinux settings did make a difference in the end, and changing this has got my web application actually working! I looked at another tutorial: https://www.digitalocean.com/community/tutorials/an-introduction-to-selinux-on-centos-7-part-1-basic-concepts
I must say, from a Linux newbie's point of view, I have seen a few other posts mentioning how good Digital Ocean's tutorials are (I certainly don't have any affiliation with them whatsoever...).
For other newbies reading this, SELinux stands for Security Enhanced Linux, and is something included with many distributions of Linux now apparently, including CentOS 7. It's there for added security of some kind. I ran the simplest command they list on this page: getenforce
which output
enforcing
As the Digital Ocean tutorial states, "SELinux should currently be disabled". Mine wasn't, and I have no idea why; I hadn't touched anything SELinux-related, as I had no idea what it was until two days ago.
Anyway, trying for the simplest fix, I followed their advice:
vi /etc/sysconfig/selinux
Or actually, I think I didn't have permission to do this as my user and had to do it as root:
sudo vi /etc/sysconfig/selinux
There are actually only two settings in this file. So I set:
SELINUX=permissive
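As far as I know, the same change can also be made at runtime without a reboot (it does not survive a reboot, though):
sudo setenforce 0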
Then I rebooted, as their advice was to restart the server. Apparently SELinux will then start logging some security mumbo jumbo, i.e. I think it records security boo-boos and people hacking into the system rather than stopping them. The reboot asked me for a Cloud-something password, which I thought must be my sudo password; it wasn't, and it crashed after I tried it a couple of times anyway, so I restarted it (I think this is what reboot means, yes?). And my website now works.
As with the other post I mentioned here, I think this means SELinux was doing something to stop nginx from running when it was set to enforcing. But the other post seemed a bit more complex for a newbie than just changing one setting as I have done here, with more potential to create further problems. If I ever develop this or another app further, I think I need to find someone with more Linux experience.
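For anyone wanting a more targeted fix than going permissive, a common approach (a sketch, assuming the audit2allow tool from policycoreutils is installed) is to build a local policy module from the logged AVC denial, which keeps SELinux enforcing while allowing nginx to write to the uWSGI socket:
grep nginx /var/log/audit/audit.log | audit2allow -M nginx_uwsgi
sudo semodule -i nginx_uwsgi.pp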

Installing ColdFusion 9 on Ubuntu 14.04, getting error running connector wizard

The Setup:
I'm trying to install ColdFusion 9 on Ubuntu 14.04 with Apache 2.4.7. Seriously. Don't ask.
Spun up a Vagrant Box (xplore/ubuntu-14.04) that has the LAMP stack installed;
Performed apt-get update and apt-get upgrade;
Installed libstdc++5 (but still got a warning that CF couldn't verify it was installed);
Installed CF from ColdFusion_9_WWEJ_linux64.bin.
I had to create a symlink to /etc/apache2/apache2.conf called /etc/apache2/httpd.conf in order to get CF installed, because CF9 doesn't allow you to specify an apache config filename, but other than that everything went smoothly.
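For reference, that symlink was just:
sudo ln -s /etc/apache2/apache2.conf /etc/apache2/httpd.conf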
The Problem:
When I start CF using ./opt/coldfusion9/bin/coldfusion start I get this message:
There was an error while running the connector wizard
Connector installation was not successful
...which is the result of cf-connectors.sh modifying my apache2.conf, telling it to load the module /opt/coldfusion9/runtime/lib/wsconfig/1/mod_jrun22.so, then attempting to restart Apache and failing due to this error:
apache2: Syntax error on line 223 of /etc/apache2/apache2.conf:
Cannot load /opt/coldfusion9/runtime/lib/wsconfig/1/mod_jrun22.so into server:
/opt/coldfusion9/runtime/lib/wsconfig/1/mod_jrun22.so:
undefined symbol: ap_log_error
Troubleshooting Steps Taken:
I tailed the Apache error log, but that wasn't much help:
[mpm_prefork:notice] [pid 1516] AH00173: SIGHUP received. Attempting to restart
[mpm_prefork:notice] [pid 1516] AH00163: Apache/2.4.7 (Ubuntu) PHP/5.5.9-1ubuntu4.3 configured -- resuming normal operations
[core:notice] [pid 1516] AH00094: Command line: '/usr/sbin/apache2'
The JRun binary does exist, at /opt/coldfusion9/runtime/bin/jrun. However, I've seen tutorials like this one that show it located in /opt/jrun4... which is weird, because my version of CF9 references mod_jrun22.so, leading me to believe there is a version difference.
Running ./opt/coldfusion9/runtime/bin/jrun status, I get this output:
The coldfusion server is running
No jndi.properties file was found in samples's SERVER-INF directory. The JRun kernel requires JNDI information.
The samples server is not running
The admin server is not running
...which tells me that there is a missing jndi.properties file, and that the samples and admin servers are not running. I assume that is a result of cf-connectors.sh failing.
The Question:
How can I get the CF connector wizard to succeed? What am I missing here?
Thanks in advance!
Apache 2.4.x is not supported by ColdFusion 9; see my answer here:
Apache won't start with ColdFusion 10: mod_jk.conf procedure not found
I suggest you install Apache 2.2 and then you should be able to install the Connector.
While Apache 2.4 is not supported by Adobe, it is possible to get it running by recompiling the mod_jrun module against the Apache 2.4 sources (after a small modification to the source code).
There are full instructions on my blog post, if you're still interested.
mod_jrun on Apache 2.4 (Ubuntu 14.04 + ColdFusion 9)

OTRS installation error on openSUSE

I have a fresh, text-only installation of openSUSE 13.1 (a physical server, an old Samsung netbook), and I'm trying to get OTRS up and running. I've installed OTRS using the commands below. I don't think they're all necessary, but someone on the OtterHub forums had a successful installation with the software versions I'm targeting using this sequence, so I was trying to piggyback on that success.
zypper in otrs-3.3.4-01.noarch.rpm gcc make mysql-community-server perl-Crypt-SSLeay perl-JSON-XS perl-YAML-LibYAML
zypper in perl-Text-CSV_XS perl-PDF-API2 perl-GDGraph perl-Encode-HanExtra postfix perl-DBD-mysql
cd ~otrs && bin/otrs.SetPermissions.pl --otrs-user=otrs --web-user=wwwrun --otrs-group=www --web-group=www /opt/otrs
rcmysql start
systemctl start apache2.service
mysqladmin --user=root password password
All of that works fine. I'm able to get to the OTRS web installer, but that's where I get hung up. I get to the part of the web installer that creates the database, and it times out. The script successfully creates the database and updates Config.pm with the new password. I can't tell from installer.pl what it tries to do next.
Here's the error from /var/log/apache2/error_log:
[Tue Jan 28 20:53:23.136306 2014] [cgi:warn] [pid 6856] [client 192.168.1.10:52732] AH01220: Timeout waiting for output from CGI script /opt/otrs/bin/cgi-bin/installer.pl, referer: http://svr-clptest/otrs/installer.pl
[Tue Jan 28 20:53:23.136470 2014] [cgi:error] [pid 6856] [client 192.168.1.10:52732] Script timed out before returning headers: installer.pl, referer: http://svr-clptest/otrs/installer.pl
The browser displays the following:
The gateway did not receive a timely response from the upstream server or application.
This is on a local network at home. I'm accessing the Linux server using PuTTY from a Windows 8 machine. I'm using a wireless connection from the Windows 8 machine, but the server has a hard line connection to the router, if that makes any difference. I don't have any trouble executing anything from PuTTY or accessing the index page through the browser (Firefox 26). I've tried connecting from a computer on my network, and one off of my network. In both cases, I'm able to get to my domain and the web installer. But I can't make a PuTTY connection to the server from outside my network.
I've spent a couple of hours researching the error, and I can't figure out what the next step should be.
Right now, a text-only version of openSUSE and OTRS are the only things running on the machine. I haven't done anything else with it. I'm open to starting the installation from scratch again--OS and all. I'm thinking that the timeout error has something to do with my firewall settings, but I'm not a network guy. Really have no idea how to diagnose this.
UPDATE
I tried reinstalling everything fresh tonight, but then added KDE so I could walk through the web installer on the host. I get exactly the same errors. It's not a problem between server and client. Something's wrong with OTRS... Or maybe with apache?
I eventually just had to follow the steps for manual installation instead of using the web installer. Not sure where the problem was exactly, but no matter what I tried, I couldn't get the database setup to work through the web installer. If you're having a similar problem, once you get to the part of the instructions that tell you to move to the web installer, you can switch over to the instructions to install from source and pick it up from manual installation of the database.
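For reference, the manual database step from the source-install instructions boils down to loading the schema files shipped with OTRS by hand; a sketch, assuming the default database name and the standard /opt/otrs layout (check the exact filenames in your install):
mysql -u root -p -e 'CREATE DATABASE otrs CHARACTER SET utf8'
mysql -u root -p otrs < /opt/otrs/scripts/database/otrs-schema.mysql.sql
mysql -u root -p otrs < /opt/otrs/scripts/database/otrs-initial_insert.mysql.sql
mysql -u root -p otrs < /opt/otrs/scripts/database/otrs-schema-post.mysql.sql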

Error When Starting OmniEvents

I am attempting to install REDHAWK v1.8.2 on a fresh install of CentOS 6.4 32 bit, but I am unable to get omniNames and omniEvents to start.
sudo /sbin/service omniEvents stop
Stopping CORBA event service: omniEvents
sudo /sbin/service omniNames stop
Stopping omniNames [ OK ]
sudo /sbin/service omniNames start
Starting omniNames [ OK ]
sudo /sbin/service omniEvents start
Starting CORBA event service on port 11169: omniEvents: [25848]: Warning - failed to resolve initial reference 'NameService'. Exception: TRANSIENT
omniEvents.
I tried to verify if omniNames was really running by calling the naming client, but got an error (see below), so it seems omniNames is not successfully starting.
nameclt list
Caught a TRANSIENT exception when trying to validate the type of the
NamingContext. Is the naming service running?
As part of the debugging process, I tried to kill the omniNames process and start it a different way (see below).
sudo killall omniNames
omniNames -start
Wed Nov 13 21:08:08 2013:
Starting omniNames for the first time.
Error: cannot create initial log file '/var/omninames/omninames-orion.log':
No such file or directory
You can set the environment variable OMNINAMES_LOGDIR to specify the
directory where the log files are kept.
I'm not sure why omniNames can't create the log file, because I verified that the /var/omninames folder actually exists, and even starting omniNames as root yields the same error. Regardless, I set the log directory to my desktop to circumvent the error (see below).
export OMNINAMES_LOGDIR=/home/$USER/Desktop/logs
mkdir -p /home/$USER/Desktop/logs
omniNames -start
Wed Nov 13 21:09:17 2013:
Starting omniNames for the first time.
Wrote initial log file.
Read log file successfully.
Root context is IOR:010000002b00000049444c3a6f6d672e6f72672f436f734e616d696e672f4e616d696e67436f6e746578744578743a312e30000001000000000000005c000000010102000a00000031302e322e382e333500f90a0b0000004e616d6553657276696365000200000000000000080000000100000000545441010000001c00000001000000010001000100000001000105090101000100000009010100
Checkpointing Phase 1: Prepare.
Checkpointing Phase 2: Commit.
Checkpointing completed.
Even though it looks like omniNames successfully started, when I open another terminal window and call the naming client, I get the same error as before (see below).
nameclt list
Caught a TRANSIENT exception when trying to validate the type of the
NamingContext. Is the naming service running?
The only modification I made in the /etc/omniORB.cfg file is to add the lines for InitRef (see below).
InitRef = NameService=corbaname::localhost
InitRef = EventService=corbaloc::localhost:1169/omniEvents
Also, I am not connected to the internet so my version of CentOS has not been updated from the base version, except for the boost libraries as recommended in Appendix J of the manual (http://sourceforge.net/projects/redhawksdr/files/redhawk-doc/1.9.0/REDHAWK_Manual_v1.9.0.pdf/download).
Looks like the issue is in your configuration: you've got the wrong port in your configuration file. It should be port 11169; however, you've listed port 1169.
See: http://redhawksdr.github.io/Documentation/mainch2.html#x4-120002.6 for details.
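With that fix, the EventService line in /etc/omniORB.cfg would read:
InitRef = EventService=corbaloc::localhost:11169/omniEvents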
A few other observations and tricks regarding omniOrb in case this was not the issue.
Sometimes omninames/omnievents can get into a bad state. The fix is to delete the log files created by omniNames and omniEvents and restart the services. They are located:
/var/lib/omniEvents/*
/var/omniNames/*
You'll need to be root to delete those files. I always forget where they are located and often do a "locate omni | grep -i log" to remind myself, but you must run this as root since they are not visible to standard users. The reset would look like the sketch below.
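A sketch of the reset (adjust service names to your system):
sudo rm -f /var/lib/omniEvents/* /var/omniNames/*
sudo /sbin/service omniNames start
sudo /sbin/service omniEvents start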
While it should not matter, I've personally found that using 127.0.0.1 is more reliable than localhost. For some reason, using localhost within a VM in the configuration file has caused me problems in the past. Consider using 127.0.0.1 instead of localhost. This is what the current version of the Redhawk Manual recommends as well.
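Combined with the port fix above, the InitRef lines would then become:
InitRef = NameService=corbaname::127.0.0.1
InitRef = EventService=corbaloc::127.0.0.1:11169/omniEvents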
You mentioned you are using Redhawk v1.8.2. As an FYI, the latest REDHAWK version in the 1.8 series is currently v1.8.5 and 1.9.0 was also recently released.
Hopefully this gets you up and running!

PHP-FPM and capistrano, "No input file specified"

I use capistrano to deploy new versions of a website to servers that run nginx and php-fpm, and sometimes it seems like php-fpm gets a bit confused after deployment and expects old files to exist, generating the "No input file specified" error. I thought it could have had something to do with APC, which I uninstalled, but I realize the process doesn't get as far as checking stuff with APC.
Is there a permission friendly way to tell php-fpm that after deployment it needs to flush its memory (or similar), that I could use? I don't think I want to do sudo restarts.
rlimit_files isn't set in php-fpm.conf, and ulimit -n is 250000.
Nginx has its own rather aggressive file cache. It's worse when NFS is involved, since that has its own cache as well. Tell Capistrano to restart nginx after deployment.
It can also be an issue with your configuration as Mohammad suggests, but then a restart shouldn't fix the issue, so you can tell the two apart.
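If a plain nginx restart isn't enough, php-fpm itself can be reloaded gracefully after each deploy; a sketch, assuming a pid file at /var/run/php-fpm.pid (the path varies by distro):
kill -USR2 "$(cat /var/run/php-fpm.pid)"
USR2 triggers a graceful reload of the php-fpm workers, which avoids a full sudo service restart, though the deploy user still needs permission to signal the master process (a narrow sudoers entry is a common compromise).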
