How to implement CDC in Cassandra?

I am trying to use CDC (change data capture) in Cassandra. I tried using incremental backups as mentioned in this link, but the format of the SSTables is very weird for composite keys. Is there any way to implement CDC in Cassandra?
Any pointers will be very useful.

CDC is available as of Cassandra 3.8:
https://issues.apache.org/jira/browse/CASSANDRA-8844
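For reference, here is a minimal sketch of turning CDC on, assuming Cassandra 3.8+ and the DataStax Java driver 3.x; the keyspace and table names are hypothetical. CDC must first be enabled node-wide (cdc_enabled: true in cassandra.yaml); the per-table flag is then just a schema property.

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;

    public class EnableCdc {
        public static void main(String[] args) {
            // Assumes cdc_enabled: true is already set in cassandra.yaml (node restart required).
            try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
                 Session session = cluster.connect()) {
                // Flip the per-table CDC flag; my_ks.my_table is a hypothetical table.
                session.execute("ALTER TABLE my_ks.my_table WITH cdc = true;");
            }
        }
    }

Once enabled, flushed commit-log segments for CDC tables accumulate under the node's cdc_raw directory; your consumer has to parse and delete them itself, because Cassandra starts rejecting writes to CDC-enabled tables once that directory's space cap fills up.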

Related

Less rows being inserted by sstableloader in ScyllaDB

I'm trying to migrate data from Cassandra to ScyllaDB from a snapshot using sstableloader. Data in some tables gets loaded without any error, but when verifying the counts using PySpark, ScyllaDB reports fewer rows than Cassandra. Help needed!
I work at ScyllaDB.
There are two tools that can be used to help find the differences:
scylla-migrate (https://github.com/scylladb/scylla-migrate; user guide: https://github.com/scylladb/scylla-migrate/blob/master/docs/scylla-migrate-user-guide.md) — you can use its check mode to find the missing rows.
scylla-migrator (https://github.com/scylladb/scylla-migrator) is a tool for migrating from one live CQL cluster to another (Cassandra --> Scylla) and also supports validation (https://github.com/scylladb/scylla-migrator#running-the-validator). There is a blog series on using this tool: https://www.scylladb.com/2019/02/07/moving-from-cassandra-to-scylla-via-apache-spark-scylla-migrator/.
Please file a bug at https://github.com/scylladb/scylla/issues if rows are indeed missing.
Solved this problem by running nodetool repair on the Cassandra keyspace, taking a fresh snapshot, and loading that snapshot into ScyllaDB using sstableloader.
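Before filing an issue, it can also help to reproduce the count mismatch from a single job. Below is a hedged sketch of a count comparison using the spark-cassandra-connector DataFrame source (connector 2.x and Spark 2.x assumed; since ScyllaDB speaks CQL, the same source works against both clusters). The hostnames, keyspace, and table names are placeholders.

    import org.apache.spark.sql.SparkSession;

    public class CountCheck {
        static long count(SparkSession spark, String host, String ks, String table) {
            return spark.read()
                    .format("org.apache.spark.sql.cassandra")
                    .option("spark.cassandra.connection.host", host) // per-read cluster override
                    .option("keyspace", ks)
                    .option("table", table)
                    .load()
                    .count();
        }

        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("count-check").getOrCreate();
            long cass = count(spark, "cassandra-host", "my_ks", "my_table");
            long scylla = count(spark, "scylla-host", "my_ks", "my_table");
            System.out.printf("cassandra=%d scylla=%d diff=%d%n", cass, scylla, cass - scylla);
            spark.stop();
        }
    }

Note that count() does a full scan on large tables, so run it at a quiet time.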

Data Structure in Cassandra

Is there a way to read SSTables in Cassandra? I see from the documentation that sstabledump is an enterprise tool. Is it possible to get a trial version of sstabledump?
Or is there a way to read SSTables using the existing utilities in the Cassandra bin folder?
sstabledump is also available in Apache Cassandra.
It can be found in the tools/bin directory in Cassandra 3.x.
Note: sstable2json was replaced by sstabledump in 3.0.
You can use sstable2json for that.
http://docs.datastax.com/en/archived/cassandra/2.2/cassandra/tools/toolsSSTable2Json.html
https://www.datastax.com/dev/blog/debugging-sstables-in-3-0-with-sstabledump
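For reference, a typical 3.x invocation looks something like this (the data file path is only an example):

    tools/bin/sstabledump /var/lib/cassandra/data/my_ks/my_table-<id>/mc-1-big-Data.db

It prints the SSTable's partitions as JSON.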

Importing blob data from RDBMS (Sybase) to Cassandra

I am trying to import large blob data (around 10 TB) from an RDBMS (Sybase ASE) into Cassandra, using DataStax Enterprise (DSE) 5.0.
Is Sqoop still the recommended way to do this in DSE 5.0? As per the release notes (http://docs.datastax.com/en/latest-dse/datastax_enterprise/RNdse.html):
Hadoop and Sqoop are deprecated. Use Spark instead. (DSP-7848)
So should I use Spark SQL with JDBC data source to load data from Sybase, and then save the data frame to a Cassandra table?
Is there a better way to do this? Any help/suggestions will be appreciated.
Edit: As per the DSE documentation (http://docs.datastax.com/en/latest-dse/datastax_enterprise/spark/sparkIntro.html), writing to blob columns from Spark is not supported:
The following Spark features and APIs are not supported:
Writing to blob columns from Spark
Reading columns of all types is supported; however, you must convert collections of blobs to byte arrays before serialising.
Spark is preferred for ETL of large data sets because it performs a distributed ingest. Source data can be loaded into Spark RDDs or data frames and then saved with saveToCassandra(keyspace, tablename). Cassandra Summit 2016 had a presentation, Using Spark to Load Oracle Data into Cassandra by Jim Hatcher, which discusses this topic in depth and provides examples.
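To make the Spark route concrete, here is a hedged sketch of the JDBC-read / Cassandra-write path the question proposes, assuming the Spark 1.6 bundled with DSE 5.0, the spark-cassandra-connector DataFrame source, and the jTDS driver for Sybase; all URLs, driver class, and table names are placeholders. Given the blob limitation quoted in the question, blob columns would need to be excluded or re-encoded before the write.

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.SQLContext;

    public class SybaseToCassandra {
        public static void main(String[] args) {
            JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("sybase-to-cassandra"));
            SQLContext sql = new SQLContext(sc);

            // Read the source table over JDBC (jTDS-style Sybase URL shown as an example).
            DataFrame source = sql.read().format("jdbc")
                    .option("url", "jdbc:jtds:sybase://sybase-host:5000/mydb")
                    .option("driver", "net.sourceforge.jtds.jdbc.Driver")
                    .option("dbtable", "my_schema.my_table")
                    .load();

            // Write to Cassandra via the Spark connector's DataFrame source.
            Map<String, String> cassOpts = new HashMap<>();
            cassOpts.put("keyspace", "my_ks");
            cassOpts.put("table", "my_table");
            source.write().format("org.apache.spark.sql.cassandra").options(cassOpts).save();

            sc.stop();
        }
    }

For a 10 TB source you would also want to pass partitionColumn, lowerBound, upperBound, and numPartitions to the JDBC reader so the extract itself is parallelized.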
Sqoop is deprecated but should still work in DSE 5.0. If it's a one-time load and you're already comfortable with Sqoop, try that.

Spark saving to Cassandra with TTL

I am using Spark-Cassandra connector 1.1.0 with Cassandra 2.0.12.
I write RDDs to Cassandra via the saveToCassandra() Java API method.
Is there a way to set the TTL property of the persisted records with the connector?
Thanks,
Shai
Unfortunately it doesn't seem like there is a way to do this (that I know of) with version 1.1.0 of the connector. There is a way in 1.2.0-alpha3, however.
saveToCassandra() is a wrapper over WriterBuilder, which has a TTL method. Instead of using saveToCassandra you could use writerBuilder(keyspace, table, rowWriter).withTTL(seconds).saveToCassandra().
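A hedged sketch of that route via the connector's Java API (1.2.x assumed; in the Java API the method appears as withConstantTTL, and the bean class and table names here are hypothetical):

    import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;
    import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapToRow;

    import java.io.Serializable;
    import org.apache.spark.api.java.JavaRDD;

    public class SaveWithTtl {
        // Hypothetical bean mapped onto a my_ks.people table.
        public static class Person implements Serializable {
            private String name;
            public String getName() { return name; }
            public void setName(String name) { this.name = name; }
        }

        public static void save(JavaRDD<Person> rdd) {
            javaFunctions(rdd)
                    .writerBuilder("my_ks", "people", mapToRow(Person.class))
                    .withConstantTTL(86400) // TTL in seconds (here: one day)
                    .saveToCassandra();
        }
    }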
Yes, we can.
Just set the Spark config key "spark.cassandra.output.ttl" while creating the SparkConf object.
Note: the value should be in seconds.
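A minimal sketch of that configuration route (the app name and host are placeholders):

    import org.apache.spark.SparkConf;

    public class TtlConf {
        public static SparkConf build() {
            return new SparkConf()
                    .setAppName("my-app")
                    .set("spark.cassandra.connection.host", "127.0.0.1")
                    .set("spark.cassandra.output.ttl", "86400"); // seconds; here one day
        }
    }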

Cassandra bulk insert solution

I have a Java program that runs as a service and must insert 50k rows/s (each row has 25 columns) into a Cassandra cluster.
My cluster contains 3 nodes; each node has 4 CPU cores (Core i5, 2.4 GHz) and 4 GB of RAM.
I used the Hector API with multithreading and bulk inserts, but the performance is lower than expected (about 25k rows/s).
Does anyone have another solution for this? Does Cassandra support an internal bulk insert (without using Thrift)?
Astyanax is a high-level Java client for Apache Cassandra, a highly available, column-oriented database.
Astyanax is currently in use at Netflix. Issues generally are fixed as quickly as possible and releases are done frequently.
https://github.com/Netflix/astyanax
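For illustration, a hedged sketch of a batched insert with Astyanax, along the lines of its getting-started guide; the cluster, keyspace, and column family names are placeholders:

    import com.netflix.astyanax.AstyanaxContext;
    import com.netflix.astyanax.Keyspace;
    import com.netflix.astyanax.MutationBatch;
    import com.netflix.astyanax.connectionpool.NodeDiscoveryType;
    import com.netflix.astyanax.connectionpool.impl.ConnectionPoolConfigurationImpl;
    import com.netflix.astyanax.connectionpool.impl.CountingConnectionPoolMonitor;
    import com.netflix.astyanax.impl.AstyanaxConfigurationImpl;
    import com.netflix.astyanax.model.ColumnFamily;
    import com.netflix.astyanax.serializers.StringSerializer;
    import com.netflix.astyanax.thrift.ThriftFamilyFactory;

    public class AstyanaxBulkInsert {
        private static final ColumnFamily<String, String> CF =
                new ColumnFamily<>("MyColumnFamily", StringSerializer.get(), StringSerializer.get());

        public static void main(String[] args) throws Exception {
            AstyanaxContext<Keyspace> context = new AstyanaxContext.Builder()
                    .forCluster("MyCluster")
                    .forKeyspace("my_ks")
                    .withAstyanaxConfiguration(new AstyanaxConfigurationImpl()
                            .setDiscoveryType(NodeDiscoveryType.RING_DESCRIBE))
                    .withConnectionPoolConfiguration(new ConnectionPoolConfigurationImpl("pool")
                            .setPort(9160)
                            .setMaxConnsPerHost(4)
                            .setSeeds("127.0.0.1:9160"))
                    .withConnectionPoolMonitor(new CountingConnectionPoolMonitor())
                    .buildKeyspace(ThriftFamilyFactory.getInstance());
            context.start();
            Keyspace keyspace = context.getClient();

            // Batch many columns per mutation instead of one execute() per row.
            MutationBatch m = keyspace.prepareMutationBatch();
            m.withRow(CF, "row-key-1")
                    .putColumn("col1", "value1", null)  // null = no TTL
                    .putColumn("col2", "value2", null);
            m.execute();

            context.shutdown();
        }
    }

Batching many rows per MutationBatch.execute() cuts per-request overhead, which is usually the first win when chasing rows per second.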
I've had good luck creating SSTables and loading them directly. There is an sstableloader tool included in the distribution, as well as a JMX interface. You can create the SSTables using the SSTableSimpleUnsortedWriter class.
Details here.
The fastest way to bulk-insert data into Cassandra is sstableloader, a utility provided with Cassandra from 0.8 onwards. For that you have to create SSTables first, which is possible with SSTableSimpleUnsortedWriter; more about this is described here.
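For illustration, a hedged sketch in the spirit of the classic bulk-loading examples (Thrift-era Cassandra, roughly 0.8 to 1.x; the keyspace, column family, and output path are placeholders):

    import static org.apache.cassandra.utils.ByteBufferUtil.bytes;
    import java.io.File;
    import org.apache.cassandra.db.marshal.AsciiType;
    import org.apache.cassandra.dht.RandomPartitioner;
    import org.apache.cassandra.io.sstable.SSTableSimpleUnsortedWriter;

    public class SstableWriterExample {
        public static void main(String[] args) throws Exception {
            long timestamp = System.currentTimeMillis() * 1000; // microseconds
            SSTableSimpleUnsortedWriter writer = new SSTableSimpleUnsortedWriter(
                    new File("/tmp/my_ks/my_cf"),   // output directory (must exist)
                    new RandomPartitioner(),
                    "my_ks", "my_cf",
                    AsciiType.instance, null,
                    64);                            // buffer size in MB before flushing

            writer.newRow(bytes("row-key-1"));      // start a row for this key
            writer.addColumn(bytes("col1"), bytes("value1"), timestamp);
            writer.addColumn(bytes("col2"), bytes("value2"), timestamp);
            writer.close();                         // flush SSTables to disk
        }
    }

You would then point sstableloader at the output directory to stream the generated SSTables into the cluster.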
Another fast way is Cassandra's BulkOutputFormat for Hadoop. With this we can write a Hadoop job to load data into Cassandra. See more on this: bulk load to Cassandra with Hadoop.
