Today, my MongoDB database went down after weeks of it being up. After some digging around, I realized that the permissions of my mongodb-27017.sock file were incorrect.
Running chown mongod:mongod mongodb-27017.sock resolved the issue.
My MongoDB instance had been running perfectly fine for weeks. How did the permissions change all of a sudden? How can I prevent myself from running into this issue again?
For context: I'm running an Amazon Linux 2 instance on AWS.
After almost one year of flawless operation, one of our replica set members hit an error related to this: the mongo.sock owner had changed from mongod:mongod to root:root.
I started searching for a way to find out what had changed the file's owner, but unfortunately there's no way to determine that after it has already happened.
So my search led me to auditctl.
According to the man page, it is "used to control the behavior, get status, and add or delete rules".
By setting it up with something like auditctl -w /tmp -p rwxa -k keyname, I started waiting.
I wrote a simple shell script to notify me when auditctl found out what was changing the ownership of the file. After a couple of hours, I got a notification.
With the output of auditctl, you can find information like the pid, the syscall, the uid, etc.
But the most important one is comm, which tells you which program was used and caused the change.
In my situation, by following the audit logs, I found out that a co-worker of mine had just created a cron job that affects the mongo.sock file.
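In case it's useful, here is a rough sketch of the kind of setup I mean (the socket path, key name, polling interval and mail recipient are just examples; on a default install mongodb-27017.sock lives in /tmp):

# Watch the socket file for writes and attribute (ownership/permission) changes
auditctl -w /tmp/mongodb-27017.sock -p wa -k mongosock

# Crude notifier loop: mail root whenever fresh events show up for that key
while sleep 300; do
  out=$(ausearch -k mongosock -i --start recent 2>/dev/null)
  [ -n "$out" ] && echo "$out" | mail -s "mongodb socket changed" root
done

# Read the recorded events back later; look at the comm=, exe=, pid= and uid= fields
ausearch -k mongosock -i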
Related
I have a process that exports data from an AWS RDS MariaDB using mysqldump, which has been running successfully in a Docker image on Concourse for years.
Since two nights ago the process has started failing with the error:
mysqldump: Couldn't execute 'FLUSH TABLES WITH READ LOCK': Access denied for user 'admin'@'%' (using password: YES) (1045)
The official AWS explanation seems to be that, because they do not grant SUPER privileges or a GLOBAL READ LOCK to the master user, mysqldump fails if the --master-data option is set.
I do not have that option set. I'm running with these flags:
mysqldump -h ${SOURCE_DB_HOST} ${SOURCE_CREDENTIALS} ${SOURCE_DB_NAME} --single-transaction --compress | grep -v '^SET .*;$' > /tmp/dump.sql
mysqldump works fine when executed from my local Mac. It fails with the error that it couldn't execute FLUSH TABLES WITH READ LOCK only from the Linux environment.
My question is: does anyone know how to disable the FLUSH TABLES WITH READ LOCK command in mysqldump on Linux?
EDIT: Happy to accept @sergey-payu's answer below as having fixed my problem, but here's a link to the MySQL bug report for the benefit of anyone else coming across this issue: https://bugs.mysql.com/bug.php?id=109685
I faced the same issue a couple of days ago. My mysqldump script had been working fine for years until it started to give me the Access denied; you need (at least one of) the RELOAD privilege(s) for this operation error. My first instinct was to grant this privilege. But after that I started to get the Access denied for user 'user'@'%' (using password: YES) (1045) error, which is documented in the AWS docs. After a couple of hours of investigation it turned out to be a bug in the most recent 5.7.41 version of MySQL (it was released on the 17th of January, exactly when we started to get the errors). Downgrading to 5.7.40 solved the problem. It's interesting that the 5.7.41 changelog doesn't list anything close to FLUSH TABLES WITH READ LOCK or to default values.
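For anyone hitting this, a quick sketch of the check/downgrade I mean (treat the package name and version pinning as assumptions to adapt for your image; this is Debian/Ubuntu-style):

# The bug is in the client, so check which mysqldump version the job actually uses
mysqldump --version

# List the client builds your repo offers, then pin to the 5.7.40 build it shows
apt-cache policy mysql-client-5.7
apt-get install -y --allow-downgrades mysql-client-5.7=<5.7.40 build shown above>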
According to the 5.7.41 changelog,
The data and the GTIDs backed up by mysqldump were inconsistent when the options --single-transaction and --set-gtid-purged=ON were both used. It was because GTID_EXECUTED was fetched at the end of the dump, at which point the GTIDs on the server could have increased already. With this fixed, a FLUSH TABLES WITH READ LOCK is performed at the beginning of the dump and GTID_EXECUTED was fetched right after, to ensure its value is consistent with the snapshot taken by mysqldump.
I have a MongoDB Docker container (the stock one downloaded from the Docker repo). Its log size is unconstrained (/var/lib/docker/containers/'container_id'/'container_id'-json.log).
This recently caused a server to fill up so I discovered I can instruct the docker daemon to limit the max size of a container's log file as well as the number of log files it will keep after splitting. (Please forgive the naiveté. This is a tools environment so things get set up to serve immediate needs with an often painful lack of planning)
Stopping the container is not desirable (though it wouldn't bring about the end of the world) thus doing so is probably a suitable plan G.
Through experimentation I discovered that running a different instance of the same docker image and including --log-opt max-size=1m --log-opt max-file=3 in the docker run command accomplishes what I want nicely.
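For reference, that experimental run command looked roughly like this (image and container names are just examples):

docker run -d --name mongo-logtest \
  --log-opt max-size=1m --log-opt max-file=3 \
  mongo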
I'm given to understand that I can include this in the docker daemon.json file so that it will work globally for all containers. I tried adding the following to the file "/etc/docker/daemon.json"
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}
Then I sent a -SIGHUP to the daemon. I did observe that the daemon's log spit out something about reloading the config and it mentioned the exact filepath at which I made the edit. (Note: This file did not exist previously. I created it and added the content.) This had no effect on the log output of the running Mongo container.
After reloading the daemon I also tried instantiating the different instance of the Mongo container again and it too didn't observe the logging directive that the daemon should have. I saw its log pass the 10m mark and keep going.
My questions are:
Should there be a way for updates to logging via the daemon to affect running containers?
If not, is there a way to tell the container to reload this information while still running? (I see docker update, but this doesn't appear to be one of the config options that can be updated.)
Is there something wrong with my config? I tested including a nonsensical directive to see if mistakes would fail silently, and they did not: a directive not in the schema raised an error in the daemon's log. This indicates that the content I added (displayed above) is, at least, expected, though possibly incomplete or something. The options seem to work in the run command but not in the config. Also, I initially tried including the "3" as a number, and that raised an error too, which disappeared when I stringified it.
I did see, in the file "/var/lib/docker/containers/'container_id'/hostconfig.json" for the other instance of the Mongo container (the one where I included the directives in its run command), that these settings were visible. Would it be effective/safe to manually edit this file for the production instance of the Mongo container to match the proof-of-concept container's config?
Please see below some system details:
Docker version 1.10.3, build 20f81dd
Ubuntu 14.04.1 LTS
My main goal is to understand why the global config didn't seem to work and if there is a way to make this change to a running container without interrupting it.
Thank you, in advance, for your help!
This setting will be the new default for newly created containers, not existing containers even if they are restarted. A newly created container will have a new container id. I stress this because many people (myself included) try to change the log settings on an existing container without first deleting that container (they've likely created a pet), and there is no supported way to do that in docker.
It is not necessary to completely stop the Docker engine; you can simply run a reload command for this change to apply. However, some methods of running Docker, like the desktop environments and Docker-in-Docker based installs, may require a restart of the engine when there is no easy reload option.
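On a systemd-based host (an assumption; older installs like the Ubuntu 14.04 box above use upstart instead) the reload is just:

# Reload the engine's configuration without restarting running containers
# (on most installs this simply sends SIGHUP to the daemon, like the manual -SIGHUP above)
sudo systemctl reload docker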
This setting will limit the JSON log to 3 separate 10 meg files, which is somewhere between 20 and 30 megs of logs depending on how full the newest file happens to be. Once you fill the third file, the oldest log is deleted, taking you back to 20 megs, the other logs are rotated, and a new log file is started. However, JSON has a lot of overhead, approximately 50% in my testing, which means you'll get roughly 10-15 megs of application output.
Note that this setting is just the default, and any container can override it. So if you see no effect, double check how the container is started to verify there are no log options being passed there.
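An easy way to verify what an individual container actually ended up with is to inspect its log configuration (container name is a placeholder):

# Show the log driver and options the container was created with
docker inspect --format '{{ json .HostConfig.LogConfig }}' my_mongo_container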
Changing the daemon.json for running containers did not work for me. Reloading the daemon and restarting Docker after editing /etc/docker/daemon.json worked, but only for new containers.
docker-compose down
sudo systemctl daemon-reload
sudo systemctl restart docker
docker-compose up -d
I have created a web application where users can run Java code in the browser.
I am using chroot to execute the user-submitted code on the web server.
In the chroot script I am mounting and then unmounting some required directories.
This works very well normally, but when I fire those execution requests in a row, like 20-30 requests, then for some responses I get the message /bin/su: user XXX does not exist, where XXX is the username for the Linux system where I am mounting the required directories.
For the others I get the expected output.
My concern is: is there any side effect of mounting and unmounting repeatedly on the Linux box?
Or is there any setting in Linux needed to support this configuration?
In order to use /bin/su you need the user information provided by /etc/passwd. Have you mounted that file (or, as I would recommend, copied it) into the /etc/ of the new root directory?
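If you go the copy route, something along these lines in the chroot setup script should be enough (NEWROOT is a placeholder for your chroot directory):

# Copy the account databases so /bin/su inside the chroot can resolve users
NEWROOT=/path/to/chroot
mkdir -p "$NEWROOT/etc"
cp /etc/passwd /etc/group "$NEWROOT/etc/"
# /etc/shadow is also needed if su must verify passwords inside the chroot
cp /etc/shadow "$NEWROOT/etc/"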
Concerning your mount issues, yes, mounting and unmounting can take some time and is not guaranteed to be instantaneous (especially the unmounting can plainly fail if something is still active on the mounted file system). So maybe you should check if the unmount failed and retry in that case.
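A retry for the unmount could look roughly like this (the mount point and retry count are placeholders):

# Retry the unmount a few times; something may still be holding the mount briefly
MNT=/path/to/chroot/some/mount
for i in 1 2 3 4 5; do
  umount "$MNT" && break
  echo "umount of $MNT failed (attempt $i), retrying..." >&2
  sleep 1
done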
Thanks for the reply... Yes, you are absolutely right, Alfe! It is the problem of mounting/unmounting in a row. I checked this by logging in to my web server over SSH. When I executed 20-30 program commands repeatedly (separated by semicolons), I got the desired output in sequence in my window. Then I opened another SSH window and executed 10 commands from that window and 20 commands from the previous window. When I looked at the output, for some commands in both windows I got that "/bin/su: user XXX does not exist" message. So one conclusion is that when I make web requests concurrently, the execution of the commands (chroot/unchroot) is not in sync. That's why I am getting this message. I am not very good with Linux, and I don't know how to address this issue.
This is my first ever post to stackoverflow, so be gentle please. ;>
Ok, I'm using slightly customized Clonezilla-Live cd's to backup the drives on four PCs. Each cd is for a specific PC, saving an image of its disk(s) to a box-specific backup folder on a samba server. That's all pretty much working. But once in a while, Something Goes Wrong, and the backup isn't completed properly. Things like: the cat bit through a cat5e cable; I forgot to check if the samba server had run out of room; etc. And it is not always readily apparent that a failure happened.
I will admit right now that I am pretty much a noob as far as Linux system administration goes, even though I somehow managed to set up a CentOS 6 box (I wish I'd picked Ubuntu...) with Samba, git, ssh, and bitnami-gitlab back in February.
I've spent days and days and days trying to figure out if clonezilla leaves a simple clue in a backup as to whether it succeeded completely or not, and have come up dry. Looking in the folder for a particular backup job (on the samba server) I see that the last file written is named "clonezilla-img". It seems to be a console dump that covers the backup itself. But it does not seem to include the verification pass.
Regardless of whether the batch backup task succeeded or failed, I can automagically run a post-process bash script that I place on my Clonezilla CDs. I have this set up to run just fine, though it's not doing a whole lot right now. What I would like this post-process script to do is determine whether the backup job succeeded or not, and then rename (mv) the backup job directory to include some word like "SUCCESS" or "FAILURE". I know how to do the renaming part. It's the test for success or failure that I'm at a loss about.
Thanks for any help!
I know this is old, but I've just started on looking into doing something very similar.
For your case I think you could do what you are looking for with ocs_prerun and ocs_postrun scripts.
For my setup I'm using a pen/flash drive for some test systems and also PXE with an NFS mount. PXE and NFS are much easier to test and modify quickly.
I haven't tested this yet, but I was thinking that I might be able to search the logs in /var/log/{clonezilla.log,partclone.log} via an ocs_postrun script to validate success or failure. I haven't seen anything that indicates the result is set in the environment, so I'm thinking the logs might be the quick and easy method over mounting the image or running a CRC check. Clonezilla does have an option to validate the image, the results of which might be in the local logs.
Another option might be to create a custom ocs_live_run script to do something similar. There is an example at this URL http://clonezilla.org/fine-print-live-doc.php?path=./clonezilla-live/doc/07_Customized_script_with_PXE/00_customized_script_with_PXE.doc#00_customized_script_with_PXE.doc
Maybe in the script the exit code of ocs-sr can be checked? As I said I haven't tried any of this, just some thoughts.
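Purely as an untested sketch of that log-grep idea, an ocs_postrun script might look something like this (the image directory and the strings to grep for are assumptions; check what your own /var/log/clonezilla.log and partclone.log contain after a known-good run first):

#!/bin/bash
# Hypothetical ocs_postrun: decide success/failure from the Clonezilla logs
# and rename the image directory on the samba share accordingly.
IMG_DIR=/home/partimag/my-backup-job   # placeholder: wherever this job saves its image
if grep -qiE 'fail|error' /var/log/clonezilla.log /var/log/partclone.log 2>/dev/null; then
  mv "$IMG_DIR" "${IMG_DIR}_FAILURE"
else
  mv "$IMG_DIR" "${IMG_DIR}_SUCCESS"
fi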
I updated the above to reflect the log location (/var/log). The logs are in the log folder of course. :p
Regards
How come I always get
"GConf Error: Failed to contact configuration server; some possible causes are that you need to enable TCP/IP networking for ORBit, or you have stale NFS locks due to a system crash. See http://projects.gnome.org/gconf/ for information. (Details - 1: Failed to get connection to session: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.)"
when I start 'gedit' from a shell from my superuser account?
I've been using GUI apps as a logged-in user and as a secondary user for 15+ years on various UNIX machines. There are plenty of good reasons to do so (remote shells, testing of configuration files, running multiple sessions of programs that only allow one instance per user, etc.).
There's a bug report on Launchpad that explains how to eliminate this message by setting the following environment variable:
export DBUS_SESSION_BUS_ADDRESS=""
The technical answer is that gedit is a Gtk+/GNOME program and expects to find a current gconf session for its configuration. But when you run it as a separate user who isn't logged in on the desktop, it doesn't find one. So it spits out a warning telling you. The failure should be benign, though, and the editor will still run.
The real answer is: don't do that. You don't want to be running GUI apps as anything but the logged-in user, in general. And you never want to be running any GUI app as root, ever.
For some (RHEL, CentOS) you may need to install the dbus-x11 package ...
sudo yum install dbus-x11
Additional details here.
Setting and exporting DBUS_SESSION_BUS_ADDRESS to "" fixed the problem for me. I only had to do this once and the problem was permanently solved. However, if you have a problem with your umask setting, as I did, then the GUI applications you are trying to run may not be able to properly create the directories and files they need to function correctly.
I suggest creating (or, have created) a new user account solely for test purposes. Then you can see if you still have the problem when logged in to the new user account.
I ran into this issue myself on several different servers. I tried all of the suggestions listed here: made sure ~/.dbus had proper ownership, ran service messagebus restart, etc.
It turns out that my ~/.dbus was mode 755, and the problem went away when I changed the mode to 700. I found this when comparing known working servers with servers showing this error.
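In other words, running something like this as the affected user:

# Make sure the user owns ~/.dbus and that no one else can read it
chown -R "$USER": ~/.dbus
chmod 700 ~/.dbus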
I understand there are several different answers to this problem, as I have been trying to solve this for 3 days.
The one that worked for me was to
rm -r .gconf
rm -r .gconfd
in my home directory. Hope this helps somebody.