java.io.EOFException using sstabledump - cassandra

I have a script that downloads all the necessary files to use sstabledump from gcloud and then dumps them to JSON, but for some reason I got this error:
java.io.EOFException
at org.apache.cassandra.io.util.RebufferingInputStream.readByte(RebufferingInputStream.java:180)
at org.apache.cassandra.io.util.RebufferingInputStream.readPrimitiveSlowly(RebufferingInputStream.java:142)
at org.apache.cassandra.io.util.RebufferingInputStream.readInt(RebufferingInputStream.java:222)
at org.apache.cassandra.io.sstable.metadata.MetadataSerializer.deserialize(MetadataSerializer.java:157)
at org.apache.cassandra.io.sstable.metadata.MetadataSerializer.deserialize(MetadataSerializer.java:132)
at org.apache.cassandra.tools.Util.metadataFromSSTable(Util.java:317)
at org.apache.cassandra.tools.SSTableExport.main(SSTableExport.java:144)
Any idea why that could happen?

That indicates to me that one (or more) of the SSTables is incomplete or corrupted.
The end-of-file exception (EOFException) gets thrown because the SSTable metadata indicates that there's more data in the *-Data.db file but the end of the file was reached.
This is expected if you are copying the files from the live data/* subdirectories, because the files have not necessarily been (a) fully flushed to disk, or (b) fully compacted yet.
I recommend that you do a nodetool snapshot and only copy the SSTables from the snapshots/* subdirectories to ensure that the files are fully consistent. Cheers!
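For illustration, that workflow might look roughly like this (keyspace, table, snapshot tag and paths below are placeholders, not taken from the question):
# take a snapshot; Cassandra hardlinks consistent, fully-written SSTables under snapshots/
nodetool snapshot -t dump_snap mykeyspace
# copy the snapshot files (not the live ones) somewhere to work on
cp /var/lib/cassandra/data/mykeyspace/mytable-*/snapshots/dump_snap/* /tmp/sstables/
# dump a Data.db file, with its sibling component files present in the same directory
sstabledump /tmp/sstables/mc-1-big-Data.db > /tmp/mytable.json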

Related

do sstablesplit on the side

Since SSTables are immutable and sstablesplit has to be performed offline, i.e. with the node shut down, wouldn't it also be possible to split copies of extremely large SSTables offline in a sideline directory while keeping the node online, and then swap the huge SSTables for the set of split SSTable files during a short restart of the node, to minimize node downtime?
Or would it be better to decommission the node, spreading its data over the rest of the cluster, and then rejoin it as a new, empty node?
E.g. I have some large SSTables that aren't getting into a compaction view any time soon. I'd like to split them offline, say in another directory/filesystem/on another box, wherever is out of reach of the running node, while the node keeps serving from the original SSTable path. Only it seems sstablesplit wants to find the configuration; can it be tricked into doing a split away from the running node?
I tried it on a copy of an SSTable file, but:
on-a-offlinebox$ sstablesplit --debug -s SOME-VALUE-IN-MB mykeyspc-mycf-*-Data.db
16:58:13.197 [main] ERROR o.a.c.config.DatabaseDescriptor - Fatal configuration error
org.apache.cassandra.exceptions.ConfigurationException: Expecting URI in variable: [cassandra.config]. Please prefix the file with file:/// for local files or file://<server>/ for remote files. Aborting. If you are executing this from an external tool, it needs to set Config.setClientMode(true) to avoid loading configuration.
at org.apache.cassandra.config.YamlConfigurationLoader.getStorageConfigURL(YamlConfigurationLoader.java:73) ~[apache-cassandra-2.1.15.jar:2.1.15]
at org.apache.cassandra.config.YamlConfigurationLoader.loadConfig(YamlConfigurationLoader.java:84) ~[apache-cassandra-2.1.15.jar:2.1.15]
at org.apache.cassandra.config.DatabaseDescriptor.loadConfig(DatabaseDescriptor.java:161) ~[apache-cassandra-2.1.15.jar:2.1.15]
at org.apache.cassandra.config.DatabaseDescriptor.<clinit>(DatabaseDescriptor.java:136) ~[apache-cassandra-2.1.15.jar:2.1.15]
at org.apache.cassandra.tools.StandaloneSplitter.main(StandaloneSplitter.java:56) [apache-cassandra-2.1.15.jar:2.1.15]
Expecting URI in variable: [cassandra.config]. Please prefix the file with file:/// for local files or file://<server>/ for remote files. Aborting. If you are executing this from an external tool, it needs to set Config.setClientMode(true) to avoid loading configuration.
Fatal configuration error; unable to start. See log for stacktrace.
If you can afford downtime for the node, just do it (split the tables). In any case, if you do the split on another machine or in another directory, you will need to run repair on the node afterwards (because of the time the rebuilt tables were "offline") once the SSTables are reloaded.
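As for the ConfigurationException in the question: the standalone tool only needs to be pointed at a cassandra.yaml through the cassandra.config system property, which is what the error message hints at. A rough sketch, assuming the sstablesplit wrapper script on your install passes JVM_OPTS through to the JVM and that a copy of the node's cassandra.yaml is available on the offline box (paths and the size are placeholders):
# point the offline tool at a cassandra.yaml; note the file:/// prefix the error asks for
export JVM_OPTS="-Dcassandra.config=file:///home/me/offline/cassandra.yaml"
# split copies of the files in a sideline directory, e.g. into ~50 MB chunks
sstablesplit --debug -s 50 /home/me/offline/mykeyspc-mycf-ka-1234-Data.db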
You can also try dropping the table's data files from your node and running repair; that will probably mean minimal downtime for the node:
Stop the node -> Delete the big SSTables -> Start the node -> Repair.
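In shell terms, a rough sketch of that sequence (service name, keyspace, table and file names are placeholders):
nodetool drain                    # flush memtables and stop accepting writes
sudo service cassandra stop
# remove the oversized SSTable together with its sibling component files
rm /var/lib/cassandra/data/mykeyspace/mytable/mykeyspace-mytable-ka-1234-*
sudo service cassandra start
nodetool repair mykeyspace mytable   # re-stream the deleted data from the replicas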
EDIT: Since Cassandra 3.4, you can run the compact command on specific SSTables/files (see the nodetool sketch after the jmxterm example below). On any earlier version, you can use the forceUserDefinedCompaction JMX call. You can use one of these tools, or make the JMX call yourself:
http://wiki.cyclopsgroup.org/jmxterm/manual.html
https://github.com/hancockks/cassandra-compact-cf
https://gist.github.com/jeromatron/e238e5795b3e79866b83
Example code with jmxterm:
sudo java -jar jmxterm-1.0-alpha-4-uber.jar -l localhost:7199
bean org.apache.cassandra.db:type=CompactionManager
run forceUserDefinedCompaction YourKeySpaceName_YourFileName.db
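On Cassandra 3.4 or later, the same user-defined compaction is also exposed through nodetool, so the JMX detour is not needed; a sketch (the data file path is a placeholder):
# compact exactly the SSTable(s) you name, regardless of the normal compaction strategy
nodetool compact --user-defined /var/lib/cassandra/data/mykeyspace/mytable-1ad54360/mc-1234-big-Data.db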
Also, if the "big SSTables" problem keeps recurring, consider moving to LCS (LeveledCompactionStrategy).

Cassandra SSTable and Memory mapped files

In the article Reading and Writing from SSTable Perspective (yeah, quite an old article), the author says that the Index.db and SSTable files are warmed up using memory-mapped files:
Row keys for each SSTable are stored in separate file called index.db,
during start Cassandra “goes over those files”, in order to warm up.
Cassandra uses memory mapped files, so there is hope, that when
reading files during startup, then first access on those files will be
served from memory.
I see the usage of MappedByteBuffer in CommitLogSegment, but not in the SSTable loader/reader. Also, just mapping a MappedByteBuffer onto the file channel doesn't load the file into memory; I think load() needs to be called explicitly.
So my question is: when Cassandra starts up, how does it warm up? And am I missing something in this article's statement?
"Going over those files" most probably refers to index sampling. At some point Cassandra read the index files on startup for sampling purposes.
Since Cassandra 1.2, the results of that process are persisted in the partition summary file (the Summary.db SSTable component), so the sampling does not have to be redone on every start.
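You can see that summary persisted on disk as one of the SSTable component files, for example (path, format version and generation are illustrative):
$ ls /var/lib/cassandra/data/mykeyspace/mytable-1ad54360/
mykeyspace-mytable-ka-1-CompressionInfo.db  mykeyspace-mytable-ka-1-Index.db
mykeyspace-mytable-ka-1-Data.db             mykeyspace-mytable-ka-1-Statistics.db
mykeyspace-mytable-ka-1-Digest.sha1         mykeyspace-mytable-ka-1-Summary.db
mykeyspace-mytable-ka-1-Filter.db           mykeyspace-mytable-ka-1-TOC.txt
The Summary.db component holds the sampled index that older versions rebuilt by scanning Index.db on every startup.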

Recovering deleted Cassandra DB files

I screwed up massively, and accidentally deleted some SSTable (.db) files from a Cassandra data directory. Literal "rm -rf" style deletion. I've been trying to recover them using foremost, but I need to know the header and footer file signatures for SSTables in order to configure foremost to do the job.
Does anyone know the header/footer file signatures?
Or, can someone give me some ideas on how I can go about recovering these?
Are they in your snapshots directory? Cassandra creates hardlinks to sstables in the snapshots directory precisely for this kind of scenario.
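A quick way to check, assuming the default data directory layout (the path is a placeholder for your install):
# list every SSTable component still hardlinked under any snapshots directory
find /var/lib/cassandra/data -path '*/snapshots/*' -name '*.db'
If the deleted files show up there, you can copy them back into the table's data directory and load them with nodetool refresh (or a node restart).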

is cassandra snapshot a copy or link?

I stopped a Cassandra node and it created a snapshot directory. Under the snapshot directory there are many subfolders, and under those subfolders there are many SSTable files.
I wonder how Cassandra puts/copies SSTable files into those subfolders; in other words, what is the meaning of the subfolders' names?
I also wonder whether the SSTables under a snapshot are links to or copies of the data. I used "ls -l" and cannot see a link; however, when I use "du", the sizes don't make sense either if they are true copies.
The sstables in the snapshot dir are hardlinks. You can see the number of links to an sstable by running stat on the sstable file.
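For example (file name and numbers are illustrative, output trimmed):
$ stat mykeyspace-mytable-ka-1-Data.db
  File: mykeyspace-mytable-ka-1-Data.db
  Size: 10485760   Blocks: 20480   IO Block: 4096   regular file
Device: 801h/2049d  Inode: 393218  Links: 2
A link count of 2 means the live SSTable and the snapshot entry share the same inode. ls -l shows the same count in its second column, and hardlinks never get the -> arrow that symlinks do, which is why they are easy to miss; du also counts each inode only once, which is why the sizes looked off.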

Many commitlog files with Cassandra

I've got a mostly idle (till now) 4-node Cassandra cluster that's hooked up to OpsCenter. There's a table that's had very few writes in the last week or so (test cluster). It's running 2.1.0. I happened to ssh in and, out of curiosity, ran du -sh * on the data directory. Here's what I get:
4.2G commitlog
851M data
188K saved_caches
There are 136 files in the commit log directory. I flushed and then drained Cassandra, and stopped and started the service. Those files are still there. What's the best way to get rid of them? Most of the data is OpsCenter-related, and I'm inclined to just blow it away since I don't need the test data. Wondering what to do in case this pops up again. Appreciate any tips.
The files in the commit log directory have a fixed size determined by your settings in cassandra.yaml. All segment files are pre-allocated, so you cannot shrink them by flushing, draining, or other operations on the cluster.
You have to change the configuration if you want to reduce their size.
Look at the settings "commitlog_segment_size_in_mb" and "commitlog_total_space_in_mb" to configure the size of each file and the total space occupied by all of them.
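For example, to see what a node is currently using (path and values below are illustrative 2.1 defaults, not taken from the question):
$ grep -E 'commitlog_(segment_size|total_space)_in_mb' /etc/cassandra/cassandra.yaml
commitlog_segment_size_in_mb: 32
# commitlog_total_space_in_mb: 8192
With 32 MB segments, the 136 files reported above come to roughly 4.3 GB, which matches the du output; lowering commitlog_total_space_in_mb caps how many segments the node keeps around.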
