I am in the process of migrating from DataStax Enterprise (DSE) Cassandra to Apache Cassandra 3.11.
I have a cluster of 7 DSE Cassandra nodes.
Is there a way to create a new Apache Cassandra cluster and connect it to the DSE cluster so that my writes go to both DSE and Apache Cassandra?
That way, once data is being written to both clusters, I can gradually migrate my read APIs from DSE to Apache Cassandra.
Yes, I've done this before.
First of all, find the exact version of Cassandra (not the DSE version) that your cluster is running:
SELECT release_version FROM system.local;
release_version
-----------------
3.11.4
You can also see this version number when connecting with cqlsh. The DSE version of Cassandra will have a (long) build number appended to it. But the idea is that the version of Apache Cassandra on the new nodes should match the DSE version of Cassandra as closely as possible.
Next, build up your Apache Cassandra "replacement" nodes as a new logical datacenter. Make sure that they use a different dc_name (than the existing nodes) in the cassandra-rackdc.properties file. The first node (or two) should use nodes from the existing cluster as seed nodes. The following nodes can then use the first nodes as seeds. Plus, the cluster_name needs to match.
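As a rough sketch of that configuration on the new nodes (the DC/rack names, cluster name, and IP addresses below are placeholders, not values from the question):
# cassandra-rackdc.properties on each new Apache Cassandra node
# (takes effect with endpoint_snitch: GossipingPropertyFileSnitch in cassandra.yaml)
dc=AC_DC
rack=rack1
# cassandra.yaml (relevant settings only)
cluster_name: 'MyExistingClusterName'        # must match the existing DSE cluster exactly
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "10.0.0.11,10.0.0.12"     # existing DSE nodes, for the first new node(s)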
Now check the keyspace definitions for system_auth, system_traces, system_distributed, and any keyspaces that the app needs. Make sure they're using NetworkTopologyStrategy; if they're not, change them, and configure the replication factor (RF) for the existing DC (the DC name must match the dc_name of the existing DSE nodes). Then you can extend replication to the new data center.
If current dc_name is DSE_DC and the new dc_name is AC_DC, then:
ALTER KEYSPACE yourkeyspace WITH replication =
{'class': 'NetworkTopologyStrategy',
'DSE_DC': '3', 'AC_DC': '3'};
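The same pattern applies to the system keyspaces mentioned above; for example, for system_auth (adjust the RFs to whatever your cluster actually uses):
ALTER KEYSPACE system_auth WITH replication =
{'class': 'NetworkTopologyStrategy',
'DSE_DC': '3', 'AC_DC': '3'};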
Once that change is done, run a nodetool rebuild on each new Apache Cassandra node.
nodetool rebuild -- DSE_DC
That will stream the data from DSE_DC to the current node. Then, you should be able to switch your API over by specifying the new data center name.
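Exactly how you point the application at the new DC depends on your driver; as a rough sketch with the DataStax Python driver (the contact point IP and keyspace name here are placeholders), it might look like this:
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
from cassandra.policies import DCAwareRoundRobinPolicy, TokenAwarePolicy

# Route requests to the new Apache Cassandra data center only
profile = ExecutionProfile(
    load_balancing_policy=TokenAwarePolicy(DCAwareRoundRobinPolicy(local_dc='AC_DC')))
cluster = Cluster(['10.0.0.21'], execution_profiles={EXEC_PROFILE_DEFAULT: profile})
session = cluster.connect('yourkeyspace')
Whichever driver you use, the important part is the local-DC setting on the load-balancing policy.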
Edit 20200506
Check your data directories. The most important thing that needs to match up for this to work is the SSTable format.
ver 3.11.4+
43 Feb 20 08:55 md-1-big-CompressionInfo.db
83 Feb 20 08:55 md-1-big-Data.db
10 Feb 20 08:55 md-1-big-Digest.crc32
16 Feb 20 08:55 md-1-big-Filter.db
17 Feb 20 08:55 md-1-big-Index.db
4769 Feb 20 08:55 md-1-big-Statistics.db
57 Feb 20 08:55 md-1-big-Summary.db
92 Feb 20 08:55 md-1-big-TOC.txt
ver 4.0-alpha4:
47 May 6 10:13 na-1-big-CompressionInfo.db
107 May 6 10:13 na-1-big-Data.db
10 May 6 10:13 na-1-big-Digest.crc32
16 May 6 10:13 na-1-big-Filter.db
32 May 6 10:13 na-1-big-Index.db
4687 May 6 10:13 na-1-big-Statistics.db
66 May 6 10:13 na-1-big-Summary.db
92 May 6 10:13 na-1-big-TOC.txt
You can also verify this in DataStax's Product Compatibility Guide.
Basically, if your SSTable files are prefixed with m[a,b,c,d], then 3.11.6 should be able to work.
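A quick way to check that prefix on one of the existing DSE nodes (yourkeyspace/yourtable and the data path are placeholders; use your actual data_file_directories location):
ls /var/lib/cassandra/data/yourkeyspace/yourtable-*/
# e.g. md-1-big-Data.db -> the "md" SSTable format, which Apache Cassandra 3.11.x can read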
Related
I have a 16-node cluster where every node has Spark and Cassandra installed, with a replication factor of 3, spark.sql.shuffle.partitions set to 96, and Spark-Cassandra-Connector 3.1.0.
I am trying to run repartitionByCassandraReplica().joinWithCassandraTable() on the partition keys of an RDD against a Cassandra table. The Cassandra table to be joined holds 84 GB of data, and I would like to know the ideal value for partitionsPerHost. How should I calculate that? Let me know if you need any more information about my cluster.
I'm learning Spark SQL. When I use spark-sql to uncache a table that was previously cached, I can still query the cached table after submitting the uncache command. Why does this happen?
Spark version 3.2.0 (pre-built for Apache Hadoop 2.7)
Hadoop version 2.7.7
Hive metastore 2.3.9
Linux Info
Static hostname: master
Icon name: computer-vm
Chassis: vm
Machine ID: 15c**********************10b2e19
Boot ID: 48b**********************efc169b
Virtualization: vmware
Operating System: Ubuntu 18.04.6 LTS
Kernel: Linux 4.15.0-163-generic
Architecture: x86-64
spark-sql (default)> CACHE TABLE testCache SELECT * FROM students WHERE AGE = 13;
Error in query: Temporary view 'testCache' already exists
spark-sql (default)> UNCACHE TABLE testCache;
Response code
Time taken: 0.092 seconds
spark-sql (default)> SELECT * FROM testCache;
NAME rollno AGE
Kent 8 21
Marry 1 10
Eddie Davis 5 13
Amy Smith 3 13
Barron 3 12
Fleur Laurent 4 9
Ivy 3 8
Time taken: 0.492 seconds, Fetched 7 row(s)
UNCACHE TABLE removes the entries and associated data from the in-memory and/or on-disk cache for a given table or view; it does not drop the table or view itself. So you can still query it.
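In other words, after UNCACHE the name still resolves and queries simply recompute from the source data. If you want the temporary view created by CACHE TABLE ... SELECT to disappear as well, drop it; a quick sketch using the names from the question:
-- UNCACHE only evicts the cached data; the view itself remains and stays queryable:
spark-sql (default)> UNCACHE TABLE testCache;
spark-sql (default)> SELECT * FROM testCache;
-- To remove the temporary view created by CACHE TABLE ... SELECT, drop it:
spark-sql (default)> DROP VIEW testCache;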
We have upgraded from DSE 4.5 to DSE 4.8. After running upgradesstables on 4 nodes of my 10-node cluster, I see the size of my cluster shown in OpsCenter has dropped from 3.2 TB to 2.5 TB. However, there is no impact in production.
Does the data really get compressed or reduced in size after upgrading SSTables?
Answer posted by markc in the comments:
Compaction might play a part here: you may have had pending compactions prior to the upgrade which have now completed. Snapshots will be under the data directory too, and upgradesstables will upgrade those as well, IIRC. – markc Feb 23 at 12:19
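If you want to verify either explanation on your own nodes, a couple of standard nodetool commands help (snapshot_name below is a placeholder):
nodetool compactionstats                   # any compactions still pending or running?
nodetool listsnapshots                     # snapshots and the space they hold under the data directories
nodetool clearsnapshot -t snapshot_name    # remove a snapshot you no longer need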
I use Cassandra 3.4 on some CentOS 7 machines.
I have 2 clusters:
Cluster 1 has 2 DCs: DC1 has 2 machines (192.168.0.171, 192.168.0.172) and DC2 has 1 machine (192.168.0.173). Cluster 1 has some data on it, in one keyspace with replication 2:1.
Cluster 2 has 1 data center: DC3, with 2 machines (192.168.0.174, 192.168.0.175).
On the second cluster (DC3), I created the keyspace "keyspace1" with NetworkTopologyStrategy: DC3: 2.
I ran some cassandra-stress against 192.168.0.175:
cassandra-stress write n=1000000 -node 192.168.0.175.
At this point cassandra-stress should have generated some garbage data.
I checked /var/lib/cassandra/data/keyspace1/standard1-97a771600d4011e69a5a13282caaa658 and there I have ma-1-big-Data.db (57 MB), ma-2-big-Data.db (65 MB), and ma-3-big-Data.db (65 MB).
My question:
Let's assume the garbage data is actual data and I want to stream this data from Cluster 2 into Cluster 1.
How can I do that using sstableloader?
NOTE: Please give an example with commands if possible (I'm quite a newbie in this domain).
bin/sstableloader -d 192.168.0.171,192.168.0.172 /var/lib/cassandra/data/keyspace1/standard1-97a771600d4011e69a5a13282caaa658
This command will load the data from one cluster into the other.
Note: the keyspace and table must already exist in both clusters, and the tables must have the same schema.
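One way to satisfy that note is to copy the schema over from the source cluster before streaming; roughly, using the IPs and path from the question (edit the CREATE KEYSPACE replication to match Cluster 1's data centers before applying it):
# 1. Dump the keyspace schema from Cluster 2
cqlsh 192.168.0.175 -e "DESCRIBE KEYSPACE keyspace1" > keyspace1_schema.cql
# 2. Apply it on Cluster 1 (after fixing the replication settings)
cqlsh 192.168.0.171 -f keyspace1_schema.cql
# 3. Stream the SSTables from the source node's data directory
bin/sstableloader -d 192.168.0.171,192.168.0.172 /var/lib/cassandra/data/keyspace1/standard1-97a771600d4011e69a5a13282caaa658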
I am learning Cassandra from academy.datastax.com. I am trying the Replication and Consistency demo on a local machine, with RF = 3 and consistency = 1.
When node3 is down and I update my table using an UPDATE command, the system.hints table is expected to store a hint for node3, but it is always empty.
Do I need to make any configuration changes for hints to work, or are the defaults OK?
surjanrawat$ ccm node1 nodetool getendpoints mykeyspace mytable 1
127.0.0.3
127.0.0.4
127.0.0.5
surjanrawat$ ccm status
Cluster: 'mycluster'
--------------------
node1: UP
node3: DOWN
node2: UP
node5: UP
node4: UP
cqlsh:mykeyspace> select * from system.hints ;
target_id | hint_id | message_version | mutation
-----------+---------+-----------------+----------
(0 rows)
Did you use the exact same version of Cassandra to create the cluster? Since version 3.0, hints are stored in the local filesystem of the coordinator. I ask because the exact same thing happened to me during that DataStax demo (I used 3.0.3 instead of 2.1.5). I replayed the steps with 2.1.5 and the hints were there in the table just as expected.
I came across this post as I ran into the same issue. Recent versions of Cassandra don't store hints in the system.hints table. I am using Cassandra 3.9 and the hints are stored on the filesystem.
It is configured in cassandra.yaml file.
# Directory where Cassandra should store hints.
# If not set, the default directory is $CASSANDRA_HOME/data/hints.
# hints_directory: /var/lib/cassandra/hints
In my case I was using ccm and I was able to find the hints at
${user_home}/.ccm/mycluster/node1/hints
where mycluster is my cluster name and node1 is my node name. Hope this helps someone.
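For comparison, on a regular (non-ccm) install the hint files sit under the configured hints_directory; the default package path is shown below, and the files are named roughly <target-host-id>-<timestamp>-<version>.hints:
ls /var/lib/cassandra/hints/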
In order to understand why no hints are being written you need a good grasp on how Cassandra replicates data.
A replication factor of 2 means two copies of each row, where each
copy is on a different node. All replicas are equally important; there
is no primary or master replica.
With a replication factor of 3 and 5 nodes in the cluster, you could lose another node and still not store hints, because your data replication strategy is still valid. Try killing two more nodes and then check the hints table.
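With ccm, that experiment looks roughly like this; the column and key names in the UPDATE are illustrative only, so substitute your own:
ccm node2 stop
ccm node4 stop
# three of the five nodes remain up; now write at CONSISTENCY ONE and look for hints
ccm node1 cqlsh
cqlsh> CONSISTENCY ONE;
cqlsh> UPDATE mykeyspace.mytable SET col1 = 'x' WHERE id = 1;   -- illustrative column/key names
cqlsh> SELECT * FROM system.hints;   -- on 2.1.x; on 3.0+ look in the hints directory instead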