read / view the content of the cassandra commitlog - cassandra

I would like to analyze the cassandra commitlog files.
indeed, I have activated the cdc cassandra on some tables and I want to know the modifications made to these tables from these .log files
my question is if I can easily read the contents of the cassandra commitlog file?
and how can i do it?
cassandra 3.11.10

Full disclosure - I work on the Scylla project
Trying to do a manual CDC by analyzing the commitlog of each node is going to be extremely difficult. Assuming perfect cluster health, I guess it would be a pain, but possible. But there are a million different scenarios involving combinations of partitions, node outage, streaming, repair, etc. that will need to be accounted for in your commitlog reader. Very tough to do.
Scylla is a Cassandra-compatible open source database that has built-in CDC, where CDC is treated as a system table, accessible by standard CQL. This feature is stable and considered "GA". https://docs.scylladb.com/using-scylla/cdc/
Newer versions of Cassandra also have CDC. Though I think it is all still in experimental mode and does require some sort of commit-log reader. (I'm not an expert in this, so others can chime in with better advice.) https://cassandra.apache.org/doc/latest/operating/cdc.html

Related

Getting data OUT of Cassandra?

How can I export data, over a period of time (like hourly or daily) or updated records from a Cassandra database? It seems like using an index with a date field might work, but I definitely get timeouts in my cqlsh when I try that by hand, so I'm concerned that it's not reliable to do that.
If that's not the right way, then how do people get their data out of Cassandra and into a traditional database (for analysis, querying with JOINs, etc..)? It's not a java shop, so using Spark is non-trivial (and we don't want to change our whole system to use Spark instead of cassandra directly). Do I have to read sstables and try to keep track of them that way? Is there a way to say "get me all records affected after point in time X" or "get me all changes after timestamp X" or something similar?
It looks like Cassandra is really awesome at rapidly reading and writing individual records, but beyond that Cassandra seems to not be the right tool if you want to pull its data into anything else for analysis or warehousing or querying...
Spark is the most typical to do exactly that (as you say). It does it efficiently and is used often so pretty reliable. Cassandra is not really designed for OLAP workloads but things like spark connector help bridge the gap. DataStax Enterprise might have some more options available to you but I am not sure their current offerings.
You can still just query and page through the whole data set with normal CQL queries, its just not as fast. You can even use ALLOW FILTERING just be wary as its very expensive and can impact your cluster (creating a separate dc for the workload and using LOCOL_CL queries against it helps). You will probably also in that scenario add a < token() and > token() to the where clause to split up the query and prevent too much work on any one coordinator. Organizing your data so that this query is more efficient would be strongly recommended (ie if doing time slices, put things in a partition bucketed by time and clustering key timeuuids so its sequential read for each part of time).
Kinda cheesy sounding but the CSV dump from cqlsh is actually fast and might work for you if your data set is small enough.
I would not recommend going to the sstables directly unless you are familiar with internals and using hadoop or spark.

Cassandra vs Cassandra+Ignite

(Single Node Cluster)I've got a table having 2 columns, one is of 'text' type and the other is a 'blob'. I'm using Datastax's C++ driver to perform read/write requests in Cassandra.
The blob is storing a C++ structure.(Size: 7 KB).
Since I was getting lesser than desirable throughput when using Cassandra alone, I tried adding Ignite on top of Cassandra, in the hope that there will be significant improvement in the performance as now the data will be read from RAM instead of hard disks.
However, it turned out that after adding Ignite, the performance dropped even more(roughly around 50%!).
Read Throughput when using only Cassandra: 21000 rows/second.
Read Throughput with Cassandra + Ignite: 9000 rows/second.
Since, I am storing a C++ structure in Cassandra's Blob, the Ignite API uses serialization/de-serialization while writing/reading the data. Is this the reason, for the drop in the performance(consider the size of the structure i.e. 7K) or is this drop not at all expected and maybe something's wrong in the configuration?
Cassandra: 3.11.2
RHEL: 6.5
Configurations for Ignite are same as given here.
I got significant improvement in Ignite+Cassandra throughput when I used serialization in raw mode. Now the throughput has increased from 9000 rows/second to 23000 rows/second. But still, it's not significantly superior to Cassandra. I'm still hopeful to find some more tweaks which will improve this further.
I've added some more details about the configurations and client code on github.
Looks like you do one get per each key in this benchmark for Ignite and you didn't invoke loadCache before it. In this case, on each get, Ignite will go to Cassandra to get value from it and only after it will store it in the cache. So, I'd recommend invoking loadCache before benchmarking, or, at least, test gets on the same keys, to give an opportunity to Ignite to store keys in the cache. If you think you already have all the data in caches, please share code where you write data to Ignite too.
Also, you invoke "grid.GetCache" in each thread - it won't take a lot of time, but you definitely should avoid such things inside benchmark, when you already measure time.

Getting database for Cassandra or building one from scratch?

So, I'm new to Cassandra and I was wondering what the best approach would be to learn Cassandra.
Should I first focus on the design of a database and build one from scratch?
And as I was reading that Cassandra is great for writing. How can one observe that? Is there open source data that one can use? (I didn't really know where to look.)
A good point getting started with Cassandra are the free online courses from DataStax (an enterprise grade Cassandra distribution): https://academy.datastax.com/courses
And for Cassandra beeing good at writing data - have a look here: https://docs.datastax.com/en/cassandra/3.0/cassandra/dml/dmlHowDataWritten.html
The write path comes down to these points:
write the data into the commitlog (append only sequentially, no random io - therefore should be on its own disk to prevent head movements, with ssd no issue)
write the data into memtables (kept in memory - very fast)
So in terms of disk, a write is a simple append to the commitlog in the first place. No data is directly written to the sstables (it's in the commitlog and memtable, which becomes flushed to disk at times as sstables), updates are not changing an sstable on disk (sstables are immutable, an update is written separately with a new timestamp), a delete does not remove data from sstables (sstables are immutable - instead a tombstone is written).
All updates and deletes produce new entries in memtable and sstables, to remove deleted data and to get rid of old versions of data from updates sstables on disk are compacted from time to time into a new one.
Also read about the different compaction strategies (can help you provide good performance), replication factor (how many copies of your data the cluster should keep) and consistency levels (how Cassandra should determine when a write or read is successful, hint: ALL is almost wrong all the time, look for QUORUM).

Is Cassandra able to detect corrupted data that doesn't get used often?

Is there anything like HDFS's DataBlockScanner for Cassandra, ie. an automatic mechanism that checks for corrupted data that doesn't get read often?
No.
Cassandra doesn't do that automatically - it can guarantee consistency on
read or write via ConsistencyLevel on each query, and it can run active
(AntiEntropy) repairs. But active repairs must be scheduled (by human or
cron or by third party script like http://cassandra-reaper.io/), and to
be pedantic, repair only fixes consistency issue, there's some work to be
done to properly address/support fixing corrupted replicas (for example,
repair COULD send a bit flip from one node to all of the others)
http://mail-archives.apache.org/mod_mbox/cassandra-user/201709.mbox/%3CCABNXB2CWXqvR_zkGSGfw7DJjU+Emer3a0Dcv0YkHUtKBEc1e+A#mail.gmail.com%3E
Big data as a trash can. Cool.
Best bet is to use nodetool verify to compare the hash of the sstable with the content. Particularly with nodetool verify -e to walk the individual cells.
https://docs.datastax.com/en/cassandra/3.0/cassandra/tools/toolsVerify.html

What is a good Bulk data loading tool for Cassandra

I'm looking for a tool to load CSV into Cassandra. I was hoping to use RazorSQL for this but I've been told that it will be several months out.
What is a good tool?
Thanks
1) If you have all the data to be loaded in place you can try sstableloader(only for cassandra 0.8.x onwards) utility to bulk load the data.For more details see:cassandra bulk loader
2) Cassandra has introduced BulkOutputFormat bulk loading data into cassandra with hadoop job in latest version that is cassandra-1.1.x onwards.
For more details see:Bulkloading to Cassandra with Hadoop
I'm dubious that tool support would help a great deal with this, since a Cassandra schema needs to reflect the queries that you want to run, rather than just being a generic model of your domain.
The built-in bulk loading mechanism for cassandra is via BinaryMemtables: http://wiki.apache.org/cassandra/BinaryMemtable
However, whether you use this or the more usual Thrift interface, you still probably need to manually design a mapping from your CSV into Cassandra ColumnFamilies, taking into account the queries you need to run. A generic mapping from CSV-> Cassandra may not be appropriate since secondary indexes and denormalisation are commonly needed.
For Cassandra 1.1.3 and higher, there is the CQL COPY command that is available for importing (or exporting) data to (or from) a table. According to the documentation, if you are importing less than 2 million rows, roughly, then this is a good option. Is is much easier to use than the sstableloader and less error prone. The sstableloader requires you to create strictly formatted .db files whereas the CQL COPY command accepts a delimited text file. Documenation here:
http://www.datastax.com/docs/1.1/references/cql/COPY
For larger data sets, you should use the sstableloader.http://www.datastax.com/docs/1.1/references/bulkloader. A working example is described here http://www.datastax.com/dev/blog/bulk-loading.

Resources