DSE Cannot run program df - Too many open files - cassandra

Our agent stopped working, and the only way to get it back is by restarting the agent on every node. The error we get:
ERROR [clojure-agent-send-off-pool-11618] df failed on execute;
returning default value - Cannot run program "df": error=24, Too many open files
Why does it need to have so many files open? How can we reset it without having to restart the agent, and how can we make that happen automatically?

As per the DSE documentation on production settings, make sure that you have made the following adjustments to your /etc/security/limits.d/cassandra.conf file:
cassandra - memlock unlimited
cassandra - nofile 100000
cassandra - nproc 32768
cassandra - as unlimited
Note: This assumes that you are running Cassandra as the cassandra user. If that is not the case, adjust accordingly.
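To confirm the limits actually in effect, you can inspect the running process directly (a quick check, assuming a standard Linux install; the pgrep pattern is illustrative):
# effective limits of the running Cassandra JVM
sudo cat /proc/$(pgrep -f CassandraDaemon | head -1)/limits | grep -i 'open files'
# limits of your current shell, for comparison
ulimit -n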
Why does it need to have so many files open?
Cassandra ultimately writes its data to SSTable files on the underlying filesystem. It also has to serve reads from any of those aforementioned files. Additionally, if you have rows that have been updated a lot and thus contain obsoleted data over time (and compaction has not run), it is entirely possible that a single row might need to be read from multiple files.
In short, databases need to work with many files simultaneously, and Cassandra is no exception.
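If you want to see how many file descriptors the node is actually holding, something like this works (a sketch; the pgrep pattern is illustrative, and lsof output varies slightly by version):
# count file descriptors held by the Cassandra process
PID=$(pgrep -f CassandraDaemon | head -1)
sudo ls /proc/$PID/fd | wc -l
# or, to count just the open SSTable data files:
sudo lsof -p $PID | grep -c 'Data.db'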

Related

do sstablesplit on the side

Since SSTables are immutable and sstablesplit has to be performed offline, i.e. with the node shut down: wouldn't it also be possible to split copies of extremely large SSTables offline in a sideline directory while keeping the node online, and then swap the oversized SSTables for the set of split SSTable files during a short restart of the node, to minimize node downtime?
Or would it be better to decommission the node, spread its data over the rest of the cluster, and then rejoin it as a new, empty node?
E.g., I have some large SSTables that aren't going to be picked up for compaction any time soon. I'd like to split them offline, say in another directory/filesystem/on another box, wherever is out of reach of the running node, while the node keeps serving reads from the original SSTable path. Only it seems sstablesplit wants to find the node's configuration; or can it be tricked into doing a split away from the running node?
I tried to split a copy of an SSTable file, but:
on-a-offlinebox$ sstablesplit --debug -s SOME-VALUE-IN-MB mykeyspc-mycf-*-Data.db
16:58:13.197 [main] ERROR o.a.c.config.DatabaseDescriptor - Fatal configuration error
org.apache.cassandra.exceptions.ConfigurationException: Expecting URI in variable: [cassandra.config]. Please prefix the file with file:/// for local files or file://<server>/ for remote files. Aborting. If you are executing this from an external tool, it needs to set Config.setClientMode(true) to avoid loading configuration.
        at org.apache.cassandra.config.YamlConfigurationLoader.getStorageConfigURL(YamlConfigurationLoader.java:73) ~[apache-cassandra-2.1.15.jar:2.1.15]
        at org.apache.cassandra.config.YamlConfigurationLoader.loadConfig(YamlConfigurationLoader.java:84) ~[apache-cassandra-2.1.15.jar:2.1.15]
        at org.apache.cassandra.config.DatabaseDescriptor.loadConfig(DatabaseDescriptor.java:161) ~[apache-cassandra-2.1.15.jar:2.1.15]
        at org.apache.cassandra.config.DatabaseDescriptor.<clinit>(DatabaseDescriptor.java:136) ~[apache-cassandra-2.1.15.jar:2.1.15]
        at org.apache.cassandra.tools.StandaloneSplitter.main(StandaloneSplitter.java:56) [apache-cassandra-2.1.15.jar:2.1.15]
Expecting URI in variable: [cassandra.config]. Please prefix the file with file:/// for local files or file://<server>/ for remote files. Aborting. If you are executing this from an external tool, it needs to set Config.setClientMode(true) to avoid loading configuration.
Fatal configuration error; unable to start. See log for stacktrace.
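The error itself just means the standalone tool tried to load a cassandra.yaml and could not find one. A possible workaround (a sketch only; whether the wrapper script forwards JVM_OPTS varies by version, and the path is illustrative) is to point the tool at a local copy of the yaml via the cassandra.config system property:
# give sstablesplit a config to load (note the file:/// prefix the error asks for)
export JVM_OPTS="-Dcassandra.config=file:///etc/cassandra/cassandra.yaml"
sstablesplit --debug -s SOME-VALUE-IN-MB mykeyspc-mycf-*-Data.db
If the script does not pass JVM_OPTS through, copying a cassandra.yaml into the conf directory that ships next to the tool achieves the same thing.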
If you can afford downtime for the node, just do it (split the tables on the node itself). In any case, if you do the split on another machine or in another directory, you will need to run a repair on the node after reloading the SSTables, because the rebuilt tables were offline for a while.
You can also simply drop those tables' data files from your node and run a repair; that will probably mean minimal downtime for the node:
Stop the node -> Delete big sstables -> Start the node -> Repair.
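Spelled out as commands, that sequence looks roughly like this (a sketch; the service name, paths, and keyspace/table names are illustrative, and note that an SSTable is a set of component files, so remove the whole generation, not just -Data.db):
sudo service cassandra stop
# remove every component of the oversized SSTable generation
sudo rm /var/lib/cassandra/data/mykeyspc/mycf-*/mykeyspc-mycf-ka-1234-*
sudo service cassandra start
nodetool repair mykeyspc mycf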
EDIT: Since Cassandra 3.4, you can run the compact command on specific SSTables/files. On any earlier version, you can use the forceUserDefinedCompaction JMX call. You can use one of these tools, or make the JMX call yourself:
http://wiki.cyclopsgroup.org/jmxterm/manual.html
https://github.com/hancockks/cassandra-compact-cf
https://gist.github.com/jeromatron/e238e5795b3e79866b83
Example code with jmxterm:
sudo java -jar jmxterm-1.0-alpha-4-uber.jar -l localhost:7199
bean org.apache.cassandra.db:type=CompactionManager
run forceUserDefinedCompaction YourKeySpaceName_YourFileName.db
Also, if the "big SSTables" problem occurs all the time, consider moving to LCS (LeveledCompactionStrategy).

Could not read commit log descriptor in file

I started using Cassandra 3.7, and I always have problems with the commit log. When the PC shuts down unexpectedly (because of a power outage, for example), the Cassandra service doesn't restart. I try to start it from the command line, but the error "Could not read commit log descriptor in file" always appears.
I have to delete all the commit logs to get the Cassandra service to start, and the problem is that I lose a lot of data. I tried increasing the replication factor to 3, but the result is the same.
What can I do to decrease the amount of lost data?
PS: I have only one PC to run the Cassandra database on; it is not possible to add more machines.
I think your option here is to work around the issue, since it's unlikely there is a guaranteed way to prevent commit log files from getting corrupted by a sudden power outage. Having only a single node makes it harder to recover the data, and increasing the replication factor to 3 on a single-node cluster is not going to help.
One thing you can try is to flush the memtables more frequently. When a memtable is flushed, its entries in the commit log are discarded, so less unflushed data sits in the commit log and less is lost when you have to delete it. This will, however, not resolve the root issue.
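One way to force more frequent flushes on a given table (a sketch; the keyspace/table name is hypothetical, and the period is in milliseconds):
# flush this table's memtable at least once a minute
cqlsh -e "ALTER TABLE mykeyspace.mytable WITH memtable_flush_period_in_ms = 60000;"
A blunter alternative is to run nodetool flush from cron at whatever interval of data loss you can tolerate.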

How to prevent Cassandra commit logs filling up disk space

I'm running a two-node DataStax AMI cluster on AWS. Yesterday, Cassandra started refusing connections from everything. The system logs showed nothing. After a lot of tinkering, I discovered that the commit logs had filled up all the disk space on the allotted mount, and this seemed to be causing the connection refusals (I deleted some of the commit logs, restarted, and was able to connect).
I'm on DataStax AMI 2.5.1 and Cassandra 2.1.7
If I decide to wipe and restart everything from scratch, how do I ensure that this does not happen again?
You could try lowering the commitlog_total_space_in_mb setting in your cassandra.yaml. The default is 8192MB for 64-bit systems (it should be commented-out in your .yaml file... you'll have to un-comment it when setting it). It's usually a good idea to plan for that when sizing your disk(s).
You can verify this by running a du on your commitlog directory:
$ du -d 1 -h ./commitlog
8.1G ./commitlog
Although, a smaller commit log space will cause more frequent flushes (increased disk I/O), so you'll want to keep an eye on that.
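For reference, the change itself is a one-liner (a sketch; the path assumes a package install, and 4096 is just an example value):
# /etc/cassandra/cassandra.yaml -- un-comment and lower the cap
# commitlog_total_space_in_mb: 4096
# then restart the node so the setting takes effect
sudo service cassandra restart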
Edit 20190318
Just had a related thought (on my 4-year-old answer). I saw that it received some attention recently, and wanted to make sure that the right information is out there.
It's important to note that sometimes the commit log can grow in an "out of control" fashion. Essentially, this can happen because the write load on the node exceeds Cassandra's ability to keep up with flushing the memtables (and thus, removing old commitlog files). If you find a node with dozens of commitlog files, and the number seems to keep growing, this might be your issue.
Essentially, your memtable_cleanup_threshold may be too low. Although this property is deprecated, you can still control how it is calculated by lowering the number of memtable_flush_writers.
memtable_cleanup_threshold = 1 / (memtable_flush_writers + 1)
The documentation has been updated as of 3.x, but used to say this:
# memtable_flush_writers defaults to the smaller of (number of disks,
# number of cores), with a minimum of 2 and a maximum of 8.
#
# If your data directories are backed by SSD, you should increase this
# to the number of cores.
#memtable_flush_writers: 8
...which (I feel) led to many folks setting this value WAY too high.
Assuming a value of 8, the memtable_cleanup_threshold is .111. When the footprint of all memtables exceeds this ratio of total memory available, flushing occurs. Too many flush (blocking) writers can prevent this from happening expediently. With a single /data dir, I recommend setting this value to 2.
In addition to decreasing the commit log size as suggested by BryceAtNetwork23, a proper solution to ensure this won't happen again is to monitor the disk usage, so that you are alerted when it's getting full and have time to act or increase the disk size.
Seeing as you are using DataStax, you could set an alert for this in OpsCenter. Haven't used this within the cloud myself, but I imagine it would work. Alerts can be set by clicking Alerts in the top banner -> Manage Alerts -> Add Alert. Configure the mounts to watch and the thresholds to trigger on.
Or, I'm sure there are better tools to monitor disk space out there.
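As a minimal home-grown check (a sketch; the mount point and the 80% threshold are illustrative), something like this can run from cron, with the echo wired to your mail or paging system:
#!/bin/sh
# alert when the disk holding the commit log passes 80% used
USED=$(df --output=pcent /var/lib/cassandra | tail -1 | tr -dc '0-9')
[ "$USED" -ge 80 ] && echo "commitlog disk ${USED}% full on $(hostname)"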

Many commitlog files with Cassandra

I've got a mostly idle (till now) 4-node Cassandra cluster that's hooked up with OpsCenter. There's a table that's had very few writes in the last week or so (it's a test cluster). It's running 2.1.0. I happened to ssh in and, out of curiosity, ran du -sh * on the data directory. Here's what I get:
4.2G commitlog
851M data
188K saved_caches
There are 136 files in the commit log directory. I flushed and then drained Cassandra, stopped and started the service. Those files are still there. What's the best way to get rid of these? Most of the stuff is OpsCenter-related, and I'm inclined to just blow them away as I don't need the test data. Wondering what to do in case this pops up again. Appreciate any tips.
The files in the commit log directory have a fixed size determined by your settings in cassandra.yaml. All of the files are pre-allocated, so you cannot shrink them with flush, drain, or other operations on the cluster.
You have to change the configuration if you want to reduce their size.
Look at the configuration settings "commitlog_total_space_in_mb" and "commitlog_segment_size_in_mb" to configure the size of each file and the total space occupied by all of them.
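As a sanity check against the numbers above (assuming the 2.1 default of commitlog_segment_size_in_mb: 32), 136 pre-allocated segments is almost exactly the 4.2G observed:
echo "$((136 * 32)) MB"    # -> 4352 MB, i.e. roughly 4.2G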

Data in Cassandra not consistent even with Quorum configuration

I encountered a consistency problem using Hector and Cassandra, with QUORUM for both reads and writes.
I use MultigetSubSliceQuery to query rows from a super column with a limit of 100, read them, then delete them, and start another round.
I found that a row which should have been deleted by my prior query was still returned by the next query.
Also, with a normal column family, I updated the value of one column from status='FALSE' to status='TRUE', and the next time I queried it, the status was still 'FALSE'.
More detail:
It does not happen every time (roughly 1 in 10,000).
The time between the two queries is around 500 ms (though we found one pair of queries with 2 seconds between them that still showed the inconsistency).
We use ntp as our cluster time synchronization solution.
We have 6 nodes, and replication factor is 3
I understand that Cassandra is supposed to be "eventually consistent", and that a read may not yet see a prior write inside Cassandra. But for two seconds?! And if so, isn't it then meaningless to have QUORUM or other consistency-level configurations?
So first of all: is this correct Cassandra behavior, and if not, what data do we need to analyze for further investigation?
After checking the source code against the system log, I found the root cause of the inconsistency.
Three factors cause the problem:
Create and update same record from different nodes
Local system time is not synchronized accurately enough (although we use NTP)
Consistency level is QUORUM
Here is the problem; take the following as the event sequence:
seqID  NodeA         NodeB         NodeC
1.     New(.050)     New(.050)     New(.050)
2.     Delete(.030)  Delete(.030)
The first Create request comes from node C with local timestamp 00:00:00.050; assume it is recorded on node A and node B first, and synchronized to node C later.
Then the Delete request comes from node A with local timestamp 00:00:00.030, and is recorded on node A and node B.
When a read request comes in, Cassandra does a version-conflict merge, but the merge depends only on timestamps. So although the Delete happened after the Create, the final merged result is "New", because it carries the later timestamp, thanks to the local clock synchronization issue.
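One way to see the timestamps Cassandra will compare in such a merge (shown in CQL for illustration; the keyspace, table, and column names are hypothetical) is the writetime() function:
cqlsh -e "SELECT status, writetime(status) FROM mykeyspace.mytable WHERE key = 'some-row';"
If the writetime of a value that should have been overwritten is later than the overwriting update's timestamp, you are looking at exactly this clock-skew merge problem.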
I also faced a similar issue. It occurred because the Cassandra driver uses server-side timestamps by default to decide which write is latest. In recent versions of the driver this has changed, and client-side timestamps are now the default.
I have described the details of the issue here.
The deleted rows may be showing up as "range ghosts" because of the way that distributed deletes work: see http://wiki.apache.org/cassandra/FAQ#range_ghosts
If you are reading and writing individual columns both at CL_QUORUM, then you should always get full consistency, regardless of the time interval (provided strict ordering is still observed, i.e. you are certain that the read is always after the write). If you are not seeing this, then something, somewhere, is wrong.
To start with, I'd suggest a) verifying that the clients are syncing to NTP properly, and/or reproducing the problem with timestamps cross-checked between clients somehow, and b) trying to reproduce the problem with CL_ALL.
Another thought - are your clients synced with NTP, or just the Cassandra server nodes? Remember that Cassandra uses the client timestamps to determine which value is the most recent.
I'm running into this problem with one of my clients/nodes. The other two clients I'm testing with (and the two other nodes) run smoothly. I have a test that uses QUORUM for all reads and all writes, and it fails very quickly. Actually, some processes do not see anything from the others, and others may still see data even after I remove it at QUORUM.
In my case I turned on the logs and intended to watch the behavior with the tail -F command:
tail -F /var/lib/cassandra/log/system.log
to see whether I was getting some errors as presented here. To my surprise the tail process itself returned an error:
tail: inotify cannot be used, reverting to polling: Too many open files
and from another thread this means that some processes will fail to open files. In other words, the Cassandra node was likely not responding as expected because it could not properly access data on disk.
I'm not too sure whether this is related to the problem the user who posted the question had, but tail -F is certainly a good way to determine whether the limit of open files was reached.
(FYI, I have 5 relatively heavy servers running on the same machine, so I'm not too surprised by that. I'll have to look into increasing the ulimit. I'll report here again if I get it fixed this way.)
More info about the file limit and the ulimit command line option: https://askubuntu.com/questions/181215/too-many-open-files-how-to-find-the-culprit
--------- Update 1
Just in case, I first tested using Java 1.7.0-11 from Oracle (as mentioned below, I first used a limit of 3,000 without success!). The same error would pop up at about the same time when running my Cassandra test (plus, even with a ulimit of 3,000, the tail -F error would still appear...).
--------- Update 2
Okay! That worked. I changed the ulimit to 32,768 and the problems are gone. Note that I had to enlarge the per-user limit in /etc/security/limits.conf and run sudo sysctl -p before I could bump the maximum to such a high number. Somehow the default upper limit of 3,000 was not enough, even though the old limit was only 1,024.
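For the record, the pieces involved look roughly like this (a sketch; the user name and values are illustrative):
# /etc/security/limits.conf -- per-user cap (takes effect on next login)
# youruser  -  nofile  32768
# /etc/sysctl.conf -- system-wide cap, reloaded with: sudo sysctl -p
# fs.file-max = 100000
ulimit -n    # verify the new soft limit from a fresh shell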
