Cassandra creates tens of thousands of hd files for a column family

I have a column family with a lot of data: tens of millions of keys with small data items, and it's growing.
I've noticed Cassandra created about 170k files named like this:
my_col_family-hd-702036-Data.db
my_col_family-hd-702036-Index.db
my_col_family-hd-702036-Digest.db
my_col_family-hd-702036-Statistics.db
my_col_family-hd-702036-Filter.db
They only differ by the number in the file name.
When I restart Cassandra it needs about an hour to come up; the log says:
INFO 09:26:34,649 Opening /var/lib/cassandra/data/foo/my_col_family-hd-805240 (5243383 bytes)
INFO 09:26:34,649 Opening /var/lib/cassandra/data/foo/my_col_family-hd-731915 (5242896 bytes)
INFO 09:26:34,714 Opening /var/lib/cassandra/data/foo/my_col_family-hd-797692 (5243454 bytes)
INFO 09:26:34,753 Opening /var/lib/cassandra/data/foo/my_col_family-hd-688013 (5243541 bytes)
It goes on like this for about an hour until it has gone through all 170k files.
Is this normal? Why does it create so many small files, 5 MB each, and then read all of them on startup?

You have a lot of files because you are using an old version of Cassandra, which uses a default file size of 5 MB for leveled compaction. Further testing has shown that ~160 MB is a better file size for this particular compaction strategy. I would recommend switching to the larger size as soon as possible.
https://issues.apache.org/jira/browse/CASSANDRA-5727
As for opening all of them on startup: Cassandra isn't actually reading the full contents of every file. It is just opening file handles so that it can access data from the files during reads from the database. This is necessary and normal.
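For reference, on a CQL3-capable version the switch looks like the rough sketch below (it reuses the keyspace and table names visible in the log lines above; the asker's very old "hd"-format cluster predates CQL3, so there the equivalent change would have to go through the schema tools that version ships with, or after an upgrade):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

// Sketch only: raise the leveled-compaction SSTable target size to 160 MB.
// "foo" and "my_col_family" are the keyspace and table from the question's log output.
public class RaiseSstableSize {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {
            session.execute(
                "ALTER TABLE foo.my_col_family WITH compaction = "
              + "{'class': 'LeveledCompactionStrategy', 'sstable_size_in_mb': 160}");
        }
    }
}

Existing 5 MB SSTables are only rewritten into larger ones as compaction gets around to them, so the file count drops gradually rather than immediately.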

Related

LevelDB-based mqtt-level-store does not clean deleted data from the filesystem

I'm using mqtt.js with mqtt-level-store. I have no idea how LevelDB is being used here or how it works.
mqtt.js puts data in the store and removes it once it is successfully uploaded.
I kept my device offline for 2 days and let it gather data in the store (140 KB every minute)
After I put it online, it uploaded a lot of data and then stopped. Now it was only uploading the new incoming data, so I guess it uploaded everything.
However, before putting the device online, I saw there were files of about 230 MB.
After all of the uploading completed, the files were still there. After a few more fresh data uploads, some of the files were removed; however, there are still about 190 MB of files.
Is there a setting I am missing? How does this cleanup happen?
From Google's LevelDB documentation we know how compaction handles this:
When the size of level L exceeds its limit, we compact it in a background thread. The compaction picks a file from level L and all overlapping files from the next level L+1. Note that if a level-L file overlaps only part of a level-(L+1) file, the entire file at level-(L+1) is used as an input to the compaction and will be discarded after the compaction.
. . .
DeleteObsoleteFiles() is called at the end of every compaction and at the end of recovery.
In my experience you just need to wait a while; the database will be cleaned up over time as compaction runs.
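There is no extra mqtt-level-store setting to flip; the deletion happens as a side effect of LevelDB's background compaction. If you want to verify that space really is being reclaimed, a minimal sketch (the store path below is an assumption) is to watch the on-disk size of the store directory over time:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.concurrent.atomic.AtomicLong;
import java.util.stream.Stream;

// Sketch: print the total size of the LevelDB store directory once a minute.
// Replace STORE_DIR with the path your mqtt-level-store instance actually uses.
public class WatchStoreSize {
    private static final Path STORE_DIR = Paths.get("/path/to/mqtt-level-store");

    public static void main(String[] args) throws Exception {
        while (true) {
            System.out.printf("store size: %.1f MB%n",
                    directorySize(STORE_DIR) / (1024.0 * 1024.0));
            Thread.sleep(60_000);
        }
    }

    private static long directorySize(Path dir) throws IOException {
        AtomicLong total = new AtomicLong();
        try (Stream<Path> paths = Files.walk(dir)) {
            paths.filter(Files::isRegularFile).forEach(p -> {
                try {
                    total.addAndGet(Files.size(p));
                } catch (IOException ignored) {
                    // a compaction may delete files while we are walking the directory
                }
            });
        }
        return total.get();
    }
}

If the number keeps shrinking after bursts of writes, compaction is doing its job.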

Getting a database for Cassandra or building one from scratch?

So, I'm new to Cassandra and I was wondering what the best approach would be to learn Cassandra.
Should I first focus on the design of a database and build one from scratch?
I've also read that Cassandra is great for writes. How can one observe that? Is there open source data one can use to try it? (I didn't really know where to look.)
A good starting point for Cassandra are the free online courses from DataStax (the company behind an enterprise-grade Cassandra distribution): https://academy.datastax.com/courses
As for Cassandra being good at writing data, have a look here: https://docs.datastax.com/en/cassandra/3.0/cassandra/dml/dmlHowDataWritten.html
The write path comes down to these points:
append the data to the commit log (sequential, append-only writes, no random I/O; it should therefore sit on its own disk to avoid head movements, which is a non-issue with SSDs)
write the data into the memtable (kept in memory, very fast)
So in terms of disk, a write is initially a simple append to the commit log. No data is written directly to SSTables (it lives in the commit log and the memtable, which is flushed to disk as SSTables from time to time). Updates do not change an SSTable on disk (SSTables are immutable; the update is written separately with a new timestamp), and a delete does not remove data from SSTables (again, they are immutable; a tombstone is written instead).
All updates and deletes produce new entries in the memtable and SSTables; to remove deleted data and old versions of updated data, the SSTables on disk are compacted from time to time into new ones.
Also read about the different compaction strategies (choosing the right one helps performance), the replication factor (how many copies of your data the cluster keeps) and consistency levels (how Cassandra determines when a write or read is successful; hint: ALL is almost always wrong, look at QUORUM instead).
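To make the consistency-level point concrete, here is a minimal write at QUORUM using the DataStax Java driver (keyspace "demo" and table "events" are made-up placeholders):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;

// Sketch: insert one row and require a quorum of replicas to acknowledge it.
public class QuorumWrite {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {
            Statement insert = new SimpleStatement(
                    "INSERT INTO demo.events (id, payload) VALUES (?, ?)", 42, "hello")
                .setConsistencyLevel(ConsistencyLevel.QUORUM);
            session.execute(insert); // returns once a quorum of replicas has acknowledged
        }
    }
}

The write itself still only touches the commit log and memtable on each replica; the consistency level just controls how many replica acknowledgements the coordinator waits for.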

Too much disk space used by Apache Kudu for WALs

I have a Hive table that is 2.7 MB (stored in Parquet format). When I use impala-shell to convert this Hive table to Kudu, I notice that the /tserver/ folder size increases by around 300 MB. Exploring further, I see it is the /tserver/wals/ folder that accounts for most of this increase. This is a serious problem for me: if a 2.7 MB table generates 300 MB of WALs, I cannot really work with bigger data. Is there a solution to this?
My kudu version is 1.1.0 and impala is 2.7.0.
I have never used Kudu, but I was able to Google a few keywords and read some documentation.
From the Kudu configuration reference section "Unsupported flags"...
--log_preallocate_segments  Whether the WAL should preallocate the entire segment before writing to it. Default: true
--log_segment_size_mb  The default segment size for log roll-overs, in MB. Default: 64
--log_min_segments_to_retain  The minimum number of past log segments to keep at all times, regardless of what is required for durability. Must be at least 1. Default: 2
--log_max_segments_to_retain  The maximum number of past log segments to keep at all times for the purposes of catching up other peers. Default: 10
That looks like a minimum disk footprint of (2+1) x 64 MB per tablet for the WAL alone (the retained segments plus the preallocated current one), and it can grow up to 10 x 64 MB if some replicas are straggling and cannot catch up.
Plus some temporary disk space for compactions, etc.
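A back-of-the-envelope sketch of that arithmetic (the tablet count is a made-up example, since the question does not say how many tablets the table was created with):

// Sketch: WAL disk footprint per tablet under the Kudu 1.1 defaults quoted above.
public class KuduWalFootprint {
    public static void main(String[] args) {
        int segmentSizeMb       = 64; // --log_segment_size_mb
        int minSegmentsRetained = 2;  // --log_min_segments_to_retain
        int maxSegmentsRetained = 10; // --log_max_segments_to_retain

        // retained segments plus the preallocated current segment
        int minPerTabletMb = (minSegmentsRetained + 1) * segmentSizeMb; // 192 MB
        int maxPerTabletMb = maxSegmentsRetained * segmentSizeMb;       // 640 MB

        int tablets = 2; // hypothetical tablet count for the converted table
        System.out.printf("per tablet: %d-%d MB; for %d tablets: %d-%d MB%n",
                minPerTabletMb, maxPerTabletMb,
                tablets, tablets * minPerTabletMb, tablets * maxPerTabletMb);
    }
}

That is why even a 2.7 MB table can pin a few hundred MB of WAL space, regardless of how little data it holds.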
[Edit] These default values have changed in Kudu 1.4 (released in June 2017); quoting the release notes:
The default size for Write Ahead Log (WAL) segments has been reduced
from 64MB to 8MB. Additionally, in the case that all replicas of a
tablet are fully up to date and data has been flushed from memory,
servers will now retain only a single WAL segment rather than two.
These changes are expected to reduce the average consumption of disk
space on the configured WAL disk by 16x

Cassandra SSTable and Memory mapped files

In the article "Reading and Writing from SSTable Perspective" (yes, quite an old article) the author says that the index.db and SSTable files are warmed up using memory-mapped files:
Row keys for each SSTable are stored in separate file called index.db,
during start Cassandra “goes over those files”, in order to warm up.
Cassandra uses memory mapped files, so there is hope, that when
reading files during startup, then first access on those files will be
served from memory.
I see the usage of MappedByteBuffer in CommitLogSegment, but not in the SSTable loader/reader. Also, just mapping a MappedByteBuffer over a file channel doesn't load the file into memory; I think load() needs to be called explicitly.
So my question is: when Cassandra starts up, how does it warm up? And am I missing something in this article's statement?
"Going over index files" most probably refers to index sampling. At some point Cassandra was reading the index files on startup for sampling purposes.
Since Cassandra 1.2 the results of that process are persisted in the partition summary file, so the sampling no longer has to be redone on every start.
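On the narrower point about memory-mapped files: mapping alone does not read the file, it only establishes the mapping, and pages are faulted in on first access; MappedByteBuffer.load() is the explicit (best-effort) request to pre-fault them. A small standalone sketch of the difference, using a placeholder file path:

import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Sketch: map a file read-only, then explicitly ask the OS to page it in.
public class MapVsLoad {
    public static void main(String[] args) throws Exception {
        try (FileChannel ch = FileChannel.open(
                Paths.get("/var/lib/cassandra/data/foo/my_col_family-hd-1-Index.db"), // placeholder
                StandardOpenOption.READ)) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            System.out.println("just mapped, loaded? " + buf.isLoaded()); // typically false
            buf.load();                                                   // best-effort page-in hint
            System.out.println("after load(), loaded? " + buf.isLoaded());
        }
    }
}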

Gridgain: Write timed out (socket was concurrently closed)

While trying to upload data to Gridgain using GridDataLoader, I'm getting
'Write timed out (socket was concurrently closed).'
I'm trying to load 10 million lines of data from a .csv file on a cluster of 13 nodes (16-core CPUs).
My GridDataLoader is declared with a composite key as its key type. While using a primitive data type as the key there was no issue, but when I changed it to a composite key this error appeared.
I guess this is because parsing your CSV and creating the cache entries takes up too much heap space. If your heap size isn't configured large enough, you are likely suffering from GC pauses: when a stop-the-world GC kicks in, everything pauses, and that's why you get the timeout error. It may help to break that large CSV into smaller files and load them one by one.
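Not the exact API from the question, but to illustrate the "load it in bounded chunks" idea: with Apache Ignite (the open-source successor of GridGain, whose IgniteDataStreamer replaced GridDataLoader) you can stream the CSV and flush every N lines so the client never holds the whole file's entries in memory. The cache name, key class and CSV layout below are made-up assumptions:

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.Stream;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteDataStreamer;
import org.apache.ignite.Ignition;

// Sketch: stream a large CSV into a cache in bounded chunks instead of one shot.
// "myCache", MyCompositeKey and the three-column CSV layout are hypothetical.
public class ChunkedCsvLoader {

    // Hypothetical composite key; it must be serializable and define equals/hashCode,
    // which is also worth double-checking for the composite key in the question.
    static class MyCompositeKey implements java.io.Serializable {
        final String part1;
        final String part2;
        MyCompositeKey(String part1, String part2) { this.part1 = part1; this.part2 = part2; }
        @Override public boolean equals(Object o) {
            return o instanceof MyCompositeKey
                    && part1.equals(((MyCompositeKey) o).part1)
                    && part2.equals(((MyCompositeKey) o).part2);
        }
        @Override public int hashCode() { return 31 * part1.hashCode() + part2.hashCode(); }
    }

    public static void main(String[] args) throws Exception {
        try (Ignite ignite = Ignition.start()) {
            ignite.getOrCreateCache("myCache");
            try (IgniteDataStreamer<MyCompositeKey, String> streamer = ignite.dataStreamer("myCache")) {
                streamer.perNodeBufferSize(1024); // keep per-node batches small

                long[] count = {0};
                try (Stream<String> lines = Files.lines(Paths.get("data.csv"))) {
                    lines.forEach(line -> {
                        String[] cols = line.split(",");
                        streamer.addData(new MyCompositeKey(cols[0], cols[1]), cols[2]);
                        if (++count[0] % 100_000 == 0)
                            streamer.flush(); // push this chunk out before parsing more lines
                    });
                }
            } // closing the streamer flushes whatever is left
        }
    }
}

Whether you split the file on disk or just flush periodically like this, the goal is the same: bound how much un-sent data sits on the client heap at any moment.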
