Running sstableloader from several nodes - cassandra

With the objective of speeding up the migration of a full production Cassandra cluster, I would like to know if anyone has tried running Cassandra's sstableloader from two nodes simultaneously.
Those nodes would sit outside the destination Cassandra ring and would stream different data into it.
Has anyone tried this?
Thank you.

I have tried this with multiple simultaneous sstableloader instances without any issue. In my case the SSTable sets were created by a map-reduce job, resulting in one set of SSTables per reducer that were later loaded via sstableloader.
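As a rough illustration of the setup described above (contact points and staging paths are placeholders, not taken from the original posts), each loader machine outside the ring streams its own SSTable set into the same destination cluster; note that sstableloader expects the path to end in a keyspace/table directory:

    # On loader machine 1 (outside the ring), streaming its SSTable set:
    sstableloader -d 10.0.0.11,10.0.0.12 /staging/set1/my_keyspace/my_table

    # On loader machine 2, a different SSTable set, same destination cluster:
    sstableloader -d 10.0.0.11,10.0.0.12 /staging/set2/my_keyspace/my_table

    # Optionally throttle each loader (Mbits/s) so the destination is not saturated:
    # sstableloader -t 100 -d 10.0.0.11,10.0.0.12 /staging/set1/my_keyspace/my_table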

Related

Insert Data using Spark in Cassandra

I am writing 1.2 billion rows of data (two columns) to Cassandra using Spark and the DataStax Spark connector. I have a two-DC setup and will be writing with LOCAL_QUORUM, with a replication factor of 3 in each DC. Will there be latency introduced by the other DC? What other things should I keep in mind while inserting the data? I have tested on a single DC and the results are satisfactory.
Writes will be sent to the other DC anyway, but because you're using LOCAL_QUORUM, Spark won't wait for confirmation from nodes in that DC, so it shouldn't affect latency. The only thing I would monitor: if the other DC is far away and/or has a slow link, the nodes handling the writes may start to collect hints, and if that happens it may slightly affect performance, because the hints need to be written and then replayed once the remote nodes are reachable again.
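A minimal PySpark sketch of such a write, assuming the DataStax spark-cassandra-connector is on the classpath (e.g. added via --packages); keyspace, table and contact points are placeholders, and the exact name of the local-DC option can vary between connector versions:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("write-to-cassandra")
        # Contact point(s) in the local DC.
        .config("spark.cassandra.connection.host", "10.0.0.11")
        # Pin connections to the local DC (option name may differ by connector version).
        .config("spark.cassandra.connection.localDC", "DC1")
        # Do not wait for acknowledgements from the remote DC.
        .config("spark.cassandra.output.consistency.level", "LOCAL_QUORUM")
        .getOrCreate()
    )

    # Two-column dataset standing in for the real 1.2 billion rows.
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

    (df.write
        .format("org.apache.spark.sql.cassandra")
        .options(keyspace="my_keyspace", table="my_table")
        .mode("append")
        .save())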

Data Inconsistency in Cassandra Cluster after migration of data to a new cluster

I see some data inconsistency after moving data to a new cluster.
The old cluster has 9 nodes in total, each with 2+ TB of data on it.
The new cluster has the same number of nodes as the old one, with the same configuration.
Here is what I performed, in order:
Took a snapshot with nodetool snapshot.
Copied that snapshot to the destination.
Created a new keyspace on the destination cluster.
Used the sstableloader utility to load the data.
Restarted all nodes.
After the transfer completed successfully, I ran a few queries to compare the old and new clusters and found that the new cluster is not consistent, although the data appears properly distributed across the nodes (nodetool status).
The same query returns different result sets for some partitions: zero rows the first time, then 100 rows, then 200 rows, until it eventually becomes consistent for a few partitions and the record count matches the old cluster.
A few partitions have no data at all in the new cluster, whereas the old cluster has data for them.
I tried running queries in cqlsh with CONSISTENCY ALL, but the problem still exists.
Did I miss any important steps before or after the migration?
Is there any procedure to find the root cause of this?
I am currently running nodetool repair, but I doubt it will solve the problem since I already tried CONSISTENCY ALL.
Your help is highly appreciated!
The fact that the results eventually become consistent indicates that the replicas are out of sync.
You can verify this by reviewing the logs from around the time you were loading data, particularly for dropped mutations. You can also check the output of nodetool netstats: if you're seeing blocking read repairs, that's another confirmation that the replicas are out of sync.
If you still have other partitions you can test, enable TRACING ON in cqlsh when you query with CONSISTENCY ALL. You will see whether there are digest mismatches in the trace output, which should also trigger read repairs. Cheers!
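For illustration, the verification described above could look like this in cqlsh (keyspace, table and partition key value are placeholders):

    -- Run in cqlsh against the new cluster.
    CONSISTENCY ALL
    TRACING ON
    SELECT * FROM my_keyspace.my_table WHERE pk = 'some-partition';
    -- In the trace output, look for "Digest mismatch" events: they mean the replicas
    -- returned different data and a read repair was triggered for that partition.
    TRACING OFF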
[EDIT] Based on your comments below, it sounds like you possibly did not load the snapshots from ALL the nodes in the source cluster with sstableloader. If you missed loading some SSTables into the target cluster, that would explain why data is missing.
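A rough sketch of loading from every source node (staging paths, node names and destination contact points are placeholders); each source node only holds its own replicas, so skipping a node's snapshot can leave partitions missing:

    # One snapshot directory copied from each of the 9 source nodes into /staging/<node>/.
    for src in node1 node2 node3 node4 node5 node6 node7 node8 node9; do
        sstableloader -d new-node1,new-node2 /staging/${src}/my_keyspace/my_table
    done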

Concept for temporary data in Apache Cassandra

I have a question regarding the use of Cassandra for temporary data (data which is written to the database once, read from the database once, and then deleted).
We are using Cassandra to exchange data between processes running on different machines / different containers. Process1 writes some data to Cassandra, Process2 reads this data. After that, the data can be deleted.
As we learned that Cassandra doesn't cope well with data being written and deleted very frequently in one table, because of tombstones and the resulting performance issues, we create temporary tables for this:
Process1: creates a table and writes the data to it.
Process2: reads the data from the table and drops the table.
But doing this at a very high rate (500-1000 tables created and dropped per hour), we are facing problems with schema synchronization between our nodes (we have a cluster with 6 nodes).
The Cassandra cluster got very slow, we got a lot of timeout warnings and errors about different schemas on the nodes, the CPU load on the cluster nodes grew to 100%, and then the cluster was dead :-).
Is Cassandra the right database for this use case?
Is it a problem with how we configured our cluster?
Would it be a better solution to create temporary keyspaces for this?
Does anyone have experience with handling such a use case with Cassandra?
You don't need any database here. Your use case is to enable your applications to handshake with each other to share data asynchronously. There are two possible solutions:
1) For batch-based writes and reads, consider using something like HDFS for intermediate storage: Process 1 writes data files into HDFS directories and Process 2 reads them from HDFS.
2) For a message-based system, consider something like Kafka: Process 1 processes the data stream and writes into Kafka topics, and Process 2's consumers read the data from those topics. Kafka also provides ack/nack features (see the sketch below).
Continuously creating and dropping a large number of tables in Cassandra is not a good practice and is never recommended.
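A minimal sketch of the Kafka option (2), assuming the third-party kafka-python package; the broker address, topic name and handle() function are placeholders, and it is only meant to show the write-once / read-once handshake shape:

    from kafka import KafkaProducer, KafkaConsumer

    TOPIC = "process_handshake"
    BROKER = "kafka-broker:9092"

    def handle(payload):
        # Placeholder for Process2's real processing logic.
        print("got payload:", payload)

    # Process 1: write the payload once.
    producer = KafkaProducer(bootstrap_servers=BROKER)
    producer.send(TOPIC, value=b"payload produced by Process1")
    producer.flush()

    # Process 2: read the payload once; the committed offset ensures it is not re-read.
    consumer = KafkaConsumer(
        TOPIC,
        bootstrap_servers=BROKER,
        group_id="process2",
        auto_offset_reset="earliest",
        enable_auto_commit=True,
    )
    for record in consumer:
        handle(record.value)
        break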

How to execute some instructions on selected nodes in a cluster?

I don't have any RDD to use; I just want to execute some of my own functions on some nodes of my cluster with Apache Spark. So I don't have any data to distribute, only code (which depends on the node that is executing it).
Is it possible? Is Spark compatible with this goal?
Is it possible?
I think it is possible, and I've been asked about it a few times already (so I've had time to think about it :))
Is Spark compatible with this goal?
The way Spark could handle it is to launch as many executors as there are nodes you want to use for the distributed work. It's the cluster manager's job to spread the work across the cluster, so Spark can only use the nodes it is given.
With the nodes assigned, you simply execute a computation on a fake dataset to build an RDD on top of.
If the computation runs on a node that should not be used, you can check the hostname inside the code, see which node you are on, and decide whether to continue or stop.
You could even read the code to execute from a database (I've seen a solution like that already).
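A minimal PySpark sketch of this idea, with placeholder worker names and a stand-in for the real per-node function; how many executors you actually get still depends on the cluster manager:

    import socket
    from pyspark import SparkContext

    def node_specific_work(host):
        # Placeholder for the real per-node logic.
        print("running node-specific work on {}".format(host))

    def run_if_selected(_partition):
        host = socket.gethostname()
        # Only do the work on the nodes we actually care about (placeholder names).
        if host in {"worker-3", "worker-5"}:
            node_specific_work(host)

    if __name__ == "__main__":
        sc = SparkContext(appName="run-on-selected-nodes")
        # A "fake" dataset: one element per task slot so tasks land on as many executors as possible.
        slots = sc.defaultParallelism
        sc.parallelize(range(slots), slots).foreachPartition(run_if_selected)
        sc.stop()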

Cassandra Replication Within a Cluster Without Partitioning Data

I have 3 nodes in a cluster:
Node1 = 127.0.0.1:9160
Node2 = 127.0.0.2:9161
Node3 = 127.0.0.3:9162
I want to use only one node (node1) for insertion. The other two nodes should be used for fault tolerance when writing millions of records, i.e. when node1 is down, either node2 or node3 should take care of the writes. For that, I formed a cluster with a replication factor of 2 and added the seed nodes properly in the cassandra.yaml file. It is working fine. But due to partitioning, whenever I write data to node1 the rows get scattered across all the nodes in the cluster. So is there any way to use the other nodes only for replication within the cluster? Or is there any way to disable the partitioning?
Thanks in advance.
No. Cassandra is a fully distributed system.
What are you trying to achieve here? We have a 6-node cluster with RF=3, and since PlayOrm fixed the config bug they had in Astyanax, even if one node starts getting slow, it automatically starts going to the other nodes to keep the system fast. Why would you want to avoid great features like that? If your primary node got slow, you would be stuck in your setup.
If you describe your use case better, we might be able to give you better ideas.
