Cannot restart neo4j with error "No space left on device" - python-3.x

We have Neo4j 3.3.5 Community Edition running on Ubuntu 16.04.4 LTS, and it crashed while we were creating relationships between certain nodes.
The relationships are made by a Python program using the py2neo module. When the process crashed, py2neo produced the following error:
[2018-07-06 15:47:33] The database has encountered a critical error, and needs to be restarted. Please see database logs for more details.
[2018-07-06 15:47:33] [Errno 24] Too many open files
The latter was printed 1830 times.
The error shown in the neo4j log file is "device out of space". We then cleared some space on the disk, so there are now more than 3.5 GB available.
The problem we are now having is that we cannot restart neo4j any more.
The first error was:
"Store and its lock file has been locked by another process: /var/lib/neo4j/data/databases/graph.db/store_lock. Please ensure no other process is using this database, and that the directory is writable (required even for read-only access)".
For this, we found a solution online that suggested killing the Java processes connected to "org.neo4j.server". However, killing these processes did not help, and we still cannot start the server with the command:
sudo service neo4j start
It now produces the following error:
*..."java.io.IOException: No space left on device". *
But when we check the disk there is more than enough space, and the inodes seem fine as well.
Another suggestion we found is to reinstall the database. However, we would like to understand what happened before taking such a drastic step, in order to prevent it from happening again.
Any ideas and/or suggestions will be greatly appreciated.
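A rough diagnostic sketch for this kind of situation (the disk reports free space, yet the service still sees "No space left on device" and "Too many open files"). The neo4j paths below are taken from the lock-file error above; <pid> is a placeholder for the actual Java process id (e.g. from pgrep -f neo4j).

# Check space and inodes on the filesystem that actually holds the store
# (it may be a different mount than the one where space was freed)
df -h /var/lib/neo4j
df -i /var/lib/neo4j

# Files that were deleted but are still held open keep consuming space
# until the process holding them exits
sudo lsof +L1 | grep -i neo4j

# See which process, if any, still holds the store lock
sudo fuser -v /var/lib/neo4j/data/databases/graph.db/store_lock

# Inspect the open-file limit of the running Java process
grep "open files" /proc/<pid>/limits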

Related

Unable to connect to SSH on Google Cloud VM Instance

I ran into a problem today where I am unable to connect via SSH to my Google Cloud VM instance running debian-10-buster. SSH had been working until today, when the connection suddenly dropped while Docker was running. I've tried rebooting and resetting the VM instance, but the problem persists. This is the serial console output on GCE, but I am not sure what to look for in it, so any help would be highly appreciated.
Another weird thing is that earlier today, before the problem started, my disk usage was fine; then I suddenly got a bunch of errors that the disk was out of space, even after I tried clearing up a bunch of space. df showed that the disk was 100% full, to the point where I couldn't even install ncdu to see what was taking the space. I then tried rebooting the instance to see if that would help, and that's when the SSH problem started. Now I am unable to connect over SSH at all (even through the online GCE interface), so I am not sure what steps to take next.
Your system has run out of disk space for the boot (root) file system.
The error message is:
Root filesystem has insufficient free space
Shut down the VM, resize the disk to a larger size in the Google Cloud web GUI, and then restart the VM.
Provided that there are no uncorrectable file system errors, your system will start up, resize the partition and file system automatically, and be fine.
If you have modified the boot disk (restructured the partitions, added additional partitions, etc.), then you will need to repair and resize manually.
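If you prefer the command line to the web GUI, a rough equivalent is sketched below. The gcloud CLI is assumed to be installed, and my-vm, my-boot-disk, the zone and the 30GB size are placeholders for your own values; growpart comes from the cloud-guest-utils package.

# Stop the VM, grow its boot disk, then start it again
gcloud compute instances stop my-vm --zone=us-central1-a
gcloud compute disks resize my-boot-disk --size=30GB --zone=us-central1-a
gcloud compute instances start my-vm --zone=us-central1-a

# On an unmodified Debian image the root partition and file system are
# grown automatically on boot. If you have repartitioned the disk, grow
# them manually after logging back in (example for an ext4 root on sda1):
sudo growpart /dev/sda 1
sudo resize2fs /dev/sda1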
I wrote an article on resizing the Debian root file system. It goes into more detail than you need, but I do explain the low-level details of what happens.
Google Cloud – Debian 9 – Resize Root File System

ArangoDB Corruption Bad table magic number

We're using ArangoDB 3.3.5 with RocksDB as a single instance.
We shut down the machine where arangod is running, and after the reboot the service didn't come up again; it showed the following warning:
Corruption: Bad table magic number
Is there a way to repair the database? Or any other ways to get rid of the problem?
This is an issue with the RocksDB files. Please try starting arangod with --log.level TRACE, store the log file, then open a GitHub issue and attach the corresponding log file.
Cheers.
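A minimal sketch of how that trace log could be captured, assuming a standard Linux package install (service name arangodb3, config at /etc/arangodb3/arangod.conf, service user arangodb); adjust the paths to your setup.

# Stop the (failing) service first
sudo systemctl stop arangodb3

# Start the server in the foreground with trace logging and keep the output
sudo -u arangodb arangod --configuration /etc/arangodb3/arangod.conf \
    --log.level TRACE 2>&1 | tee /tmp/arangod-trace.log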

MongoDB runs FTDC

So I just installed MongoDB, and when I run it, it runs fine, but it gives me this warning or error:
2017-07-24T12:48:44.119-0700 I FTDC [ftdc] Unclean full-time diagnostic data capture shutdown detected, found interim file, some metrics may have been lost. OK
(screenshot of the mongoDB warning)
Now I have my PATH going to the right place:
C:\Program Files\MongoDB\Server\3.4\bin
I also have my data folder in the right place: C:\data\db
The folder is full of different files from MongoDB.
I looked into my DBs and everything is still saved, and no files are corrupted or missing.
If anyone can help, that would be greatly appreciated. Thanks!
This error means that your MongoDB deployment was not shut down cleanly. This is also shown in the screenshot of the log you posted, where it says Detected unclean shutdown. Typically this is the result of using kill -9 in a UNIX environment to force-kill the mongod process. It could also be the result of a hard crash of the OS.
CTRL-C should result in a clean shutdown, but it is possible that something interferes with mongod during its shutdown process, or that the OS is hard-restarted during the shutdown. To shut down your mongod cleanly, it's usually best to issue the db.shutdownServer() command from a mongo shell connected to the server in question.
FTDC is diagnostic data that is recorded by MongoDB for troubleshooting purposes, and it can be safely removed to clear the FTDC startup warning you are seeing. In your deployment, the diagnostic data should be located in the C:\data\db\diagnostic.data directory.
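For reference, a hedged sketch of both steps, assuming the mongo shell binary is on your PATH (it works the same from a Windows command prompt):

# Ask the running server to shut down cleanly
mongo admin --eval "db.shutdownServer()"

# With mongod stopped, the FTDC data can simply be deleted to clear the
# startup warning; in this deployment that is the folder
# C:\data\db\diagnostic.data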
The WiredTiger storage engine is quite resilient and is designed to cope with hard crashes like this. However, if it was a hardware/OS crash, it's best to check your disk integrity to ensure that there is no hardware-level storage corruption. Please refer to your OS documentation for instructions, as the methods for checking storage integrity differ from OS to OS.

Weird "Stale file handle, errno=116" on remote cluster after dozens of hours running

I'm now running a simulation code called CMAQ on a remote cluster. I first ran a benchmark test in serial to see the performance of the software. However, the job always runs for dozens of hours and then crashes with the following "Stale file handle, errno=116" error message:
PBS Job Id: 91487.master.cluster
Job Name: cmaq_cctm_benchmark_serial.sh
Exec host: hs012/0
An error has occurred processing your job, see below.
Post job file processing error; job 91487.master.cluster on host hs012/0
Unknown resource type REJHOST=hs012.cluster MSG=invalid home directory '/home/shangxin' specified, errno=116 (Stale file handle)
This is very strange because I never modified the home directory, and "/home/shangxin/" is definitely my permanent directory where the code lives....
Also, in the standard output .log file, the following message is always shown when the job fails:
Bus error
100247.930u 34.292s 27:59:02.42 99.5% 0+0k 16480+0io 2pf+0w
What does this message mean specifically?
I initially thought this was a memory overflow issue, i.e. the job using up all of the RAM. However, when I logged into the computing node during a run and checked the memory usage with the "free -m" and "htop" commands, I saw that both RAM and swap usage never exceeded 10%, so memory usage is not the problem.
Because I used "tee" to record the job output to a log file, that file can grow to tens of thousands of lines and over 1 MB in size. To test whether this standard output was overwhelming the cluster system, I ran the same job again without the standard output log file. The new job still failed with the same "Stale file handle, errno=116" error after dozens of hours, so the standard output is not the reason either.
I also tried running the job in parallel on multiple cores; it still failed with the same error after dozens of hours of running.
I'm confident the code I'm using has no problem because it finishes successfully on other clusters. The administrator of this cluster is looking into the issue but cannot find the specific reason for now.
Has anyone ever run into this weird error? What should we do to fix this problem on the cluster? Any help is appreciated!
On academic clusters, home directories are frequently mounted via NFS on each node in the cluster to give you a uniform experience across all of the nodes. If this were not the case, each node would have its own version of your home directory, and you would have to take explicit action to copy relevant files between worker nodes and/or the login node.
It sounds like the NFS mount of your home directory on the worker node probably failed while your job was running. This isn't a problem you can fix directly unless you have administrative privileges on the cluster. If you need a workaround and cannot wait for the sysadmins to address the problem, you could:
Try using a different network drive on the worker node (if one is available). On clusters I've worked on, there is often scratch space or other NFS mounts directly under the root /. You might get lucky and find an NFS mount that is more reliable than your home directory.
Have your job work in a temporary directory local to the worker node, and write all of its output files and logs to that directory (a rough sketch is given below). At the end of the job, it would need to copy everything back to your home directory on the cluster. This could be difficult with ssh if your keys are in your home directory, and could require you to copy keys to the temporary directory, which is generally a bad idea unless you restrict access to them with file permissions.
Try getting assigned to a different node of the cluster. In my experience, academic clusters often have some nodes that are flakier than others. Depending on local settings, you may be able to request certain nodes directly, or request resources that are only available on the stable nodes. If you can track which nodes are unstable and find your job assigned to one, you could resubmit the job and then cancel the copy running on the unstable node.
The easiest solution is to work with the cluster administrators, but I understand they don't always work on your schedule.
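For the local-scratch option, a rough PBS job-script sketch is shown below. It assumes the scheduler is PBS/Torque (the error above mentions a PBS job id), that /tmp on the worker node is local disk, and that inputs, output and run_cctm.sh are placeholders for your actual CMAQ files.

#!/bin/bash
#PBS -N cmaq_cctm_benchmark_serial
#PBS -l nodes=1:ppn=1

# Work in node-local scratch instead of the NFS-mounted home directory
SCRATCH=/tmp/$USER/$PBS_JOBID
mkdir -p "$SCRATCH"
cd "$SCRATCH"

# Copy the inputs from home, run the model, then copy the results back
cp -r "$PBS_O_WORKDIR"/inputs .
"$PBS_O_WORKDIR"/run_cctm.sh > cmaq.log 2>&1

cp -r output cmaq.log "$PBS_O_WORKDIR"/
rm -rf "$SCRATCH"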

Did an Oracle error cause a Linux system to respond to ping but not to ssh?

We have seen this same strange problem twice so far.
First we found that our remote Linux server responded to ping but we could not ssh to it. We went to the server, found the system unresponsive, and had to restart it. After we restarted it, we checked the logs. We found nothing in /var/log/messages, but we found some error messages in Oracle's *_alert.log files:
Thread 1 cannot allocate new log, sequence 296280
Private strand flush not complete
Current log# 3 seq# 296279 mem# 0: /home/oracle/app/oracle/oradata/orcl/redo03.log
Current log# 3 seq# 296279 mem# 1: /home/oracle/app/oracle/oradata/orcl/redo09.log
Thread 1 advanced to log sequence 296280 (LGWR switch)
Current log# 2 seq# 296280 mem# 0: /home/oracle/app/oracle/oradata/orcl/redo02.log
Current log# 2 seq# 296280 mem# 1: /home/oracle/app/oracle/oradata/orcl/redo08.log
Process P098 died, see its trace file
Process P098 died, see its trace file
Process P098 died, see its trace file
Our questions are:
1. Could Oracle cause Linux to hang? I thought that even if Oracle dies, Linux should stay alive. We think this is relevant because these events happened at the same time, i.e. Oracle died before Linux hung.
2. What could be the reason for the server to respond to ping but not allow ssh to it?
We ran a test while the ssh login was failing, and it showed that port 22 was okay:
[administrator@localhost ~]$ nc -v -w 1 172.16.*.* -z 22
Connection to 172.16.*.* 22 port [tcp/ssh] succeeded!
When we run ssh -v, it stops at "load ssh key".
3. Why does a Linux server hang but still respond to ping? How can we prevent the system from hanging?
Any ideas what could be the explanation?
The log snippet you showed doesn't say the database crashed; it looks like a delay in a log switch, and a parallel query slave process dying. That should certainly be investigated - you can start by looking at the trace file - but it could be a symptom rather than the cause.
A very high load can make the server behave like this: responding to some network events, but unable to (or extremely slow to) create new processes. That would explain why you can connect to port 22 but sshd doesn't progress very far and doesn't complete the connection. It could also explain P098 dying - it might not have been able to start in the first place.
It's unlikely that Oracle is causing this directly. It's more likely that you have a script or application process which is spinning for some reason, spawning new processes until the system runs out of resources. (You can certainly get an infinite loop in a PL/SQL block, which would cause high load, but it wouldn't make you run out of processes - so you'd be able to connect eventually.) You could be getting an ORA error that makes a script/app loop if it isn't handled well, but you'd have to hope that shows up in an application log. It might not even be something that talks to the DB.
It's basically impossible to know what happened if it wasn't logged. You might have a starting point if you know what was being run at the time. You could also look back at what the DB was doing before the problem, with the AWR reports in Oracle Enterprise Manager, for example.
Unfortunately there isn't much you can do to recover if you can't connect, and even if you have an existing shell running you might not be able to run useful tools to see what's going on. Sometimes a hard reboot is the only option, though obviously it's a last resort.
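If you do get a working shell the next time this happens (or want something to check periodically before it does), here is a hedged sketch of the kind of data worth capturing on a reasonably standard Linux box; sar requires the sysstat package.

# Load average, memory pressure and total process count
uptime
free -m
ps -e --no-headers | wc -l
ulimit -u    # per-user process limit for this shell

# The heaviest processes right now
ps -eo pid,user,pcpu,pmem,stat,cmd --sort=-pcpu | head -20

# Historical load around the time of the hang, if sysstat is installed
sar -q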
Yes, Oracle can hang your system if the process load is very high. Please let us know whether you have multipath on this system with Oracle, or whether it is RAC.
