We are planning to use a combination of Cassandra and Titan/JanusGraph as the backend for one of our projects. As part of that, I have the requirement below.
Record/Vertex A and Record/Vertex B should be written to the backend atomically, i.e., either both records are written or neither is. Essentially, I need multi-row atomic writes. However, this is what I found in the documentation of both Titan and Cassandra, quoted below.
Titan DB
Titan transactions are not necessarily ACID. They can be so configured on BerkeleyDB, but they are not generally so on Cassandra or HBase, where the underlying storage system does not provide serializable isolation or multi-row atomic writes and the cost of simulating those properties would be substantial.
Cassandra 2.0
In Cassandra, a write is atomic at the partition-level, meaning inserting or updating columns in a row is treated as one write operation.
Cassandra 3.0
In Cassandra, a write operation is atomic at the partition level, meaning the insertions or updates of two or more rows in the same partition are treated as one write operation.
I have the following questions.
1) We use Titan DB with Cassandra 2.1.x. If I want to achieve multi-row atomicity, how do I do that? Is there any solution for this?
2) I see that Cassandra's batch operation provides atomicity for multiple operations, but I don't see a corresponding operation in Titan DB to use this functionality. Am I missing something here, or is there a way to use this?
3) Cassandra is heavily used in various applications, and I am pretty sure people have use cases that require multi-row atomic operations. How do people solve this?
4) I see that Cassandra 3.0 has this support. So once JanusGraph starts supporting Cassandra 3.0 (currently it only supports 2.1.x), should I expect this support in JanusGraph?
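For reference, this is roughly what our write path looks like today; it is a minimal sketch assuming Titan 1.0's TinkerPop-style API, and the labels and property names are just placeholders from our model:

```java
import com.thinkaurelius.titan.core.TitanFactory;
import com.thinkaurelius.titan.core.TitanGraph;
import com.thinkaurelius.titan.core.TitanTransaction;
import com.thinkaurelius.titan.core.TitanVertex;

public class TwoVertexWrite {
    public static void main(String[] args) throws Exception {
        // Open a graph backed by Cassandra (the properties file path is ours)
        TitanGraph graph = TitanFactory.open("conf/titan-cassandra.properties");
        TitanTransaction tx = graph.newTransaction();
        try {
            // Both records are created inside one Titan transaction...
            TitanVertex a = tx.addVertex("recordA");
            TitanVertex b = tx.addVertex("recordB");
            a.property("name", "A");
            b.property("name", "B");
            a.addEdge("linkedTo", b);
            // ...but on Cassandra this commit is not guaranteed to be atomic
            // across the different rows/partitions the two vertices map to.
            tx.commit();
        } catch (Exception e) {
            tx.rollback();
            throw e;
        } finally {
            graph.close();
        }
    }
}
```

The open question is whether the commit() above can be made all-or-nothing when the storage backend is Cassandra.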
Related
Based on the document "Benchmarking Distributed SQL Databases", I see that throughput is higher in YCQL than in YSQL.
If we are using the same table structure, the tool used to insert is the same, and I am not using any SQL-like features, then why does YCQL perform better than YSQL?
This could be because of a few differences between YCQL and YSQL. Note that while these differences are not fundamental to the architecture, they manifest because YSQL started with the PostgreSQL code for the upper half of the DB. Many of these are being addressed.
One-hop optimization: YCQL is shard-aware and knows how the underlying DB (called DocDB) shards and distributes data across nodes. This means it can “hop” directly to the node that contains the data when using PREPARE-BIND statements. YSQL cannot do this today, since it requires a JDBC-level protocol change; this work is being done in the jdbc-yugabytedb project.
Threads instead of processes: YCQL uses threads to handle incoming client queries/statements, while YSQL (and the PostgreSQL code) uses processes. These processes are heavier-weight, and this could affect throughput in certain scenarios (and connection scalability in certain others). This is another enhancement that is planned.
Upsert vs insert: In YCQL, each insert is treated as an upsert (update or insert, without having to check the existing value) by default, and special syntax is needed to perform a pure insert. In YSQL, each insert needs to read the data before performing the write, since an already-existing key is treated as a failure (see the sketch below).
More work has gone into YCQL performance: Currently (end of 2019) the focus for YSQL has been on correctness and functionality, while YCQL performance has been worked on quite a bit. Note that although the performance work on YSQL has just started, it should be possible to improve it relatively quickly because of the underlying architecture.
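To make the upsert-vs-insert point concrete, here is a minimal sketch of the two behaviors in plain CQL through the DataStax Java driver (3.x-style API), which YCQL is designed to be compatible with; the contact point, keyspace, and table are made up:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;

public class UpsertVsInsert {
    public static void main(String[] args) {
        // Host, keyspace, and table names here are hypothetical.
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("demo")) {

            // Plain INSERT has upsert semantics: it silently overwrites an existing row,
            // so no read is needed before the write.
            session.execute("INSERT INTO users (id, name) VALUES (1, 'alice')");
            session.execute("INSERT INTO users (id, name) VALUES (1, 'bob')"); // overwrites row 1

            // A "pure" insert needs IF NOT EXISTS, which forces a check of the existing
            // value and reports whether the write was applied.
            ResultSet rs = session.execute(
                    "INSERT INTO users (id, name) VALUES (1, 'carol') IF NOT EXISTS");
            System.out.println("applied = " + rs.wasApplied()); // false: row 1 already exists
        }
    }
}
```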
I am running a MySQL server in production which currently holds 200 GB of data. It has become very difficult to manage because it is growing exponentially. I have heard a lot about Cassandra and did a POC on it. Cassandra provides high availability and eventually consistent data, and it is a good fit for our requirements. Now the problem is how to transfer all the MySQL data to a Cassandra database.
Since MySQL is a relational database and Cassandra is NoSQL, how do I map MySQL tables and their relations to Cassandra tables?
I believe you are asking the wrong question. There is no rule for transitioning from a relational model to Cassandra.
The first question is the following: what are your requirements in terms of performance, availability, data volume and growth, and, most important of all, query abilities? Do you need ACID? Can you change the application code accessing the database to fit Cassandra's more denormalized model?
The answers to these questions will tell you whether Cassandra is compatible with your use case or not.
As a rule of thumb:
If you use MySQL with a lot of indices and usually perform joins in your queries, then the Cassandra data model and the application code using the database will require a lot of work, or Cassandra may not even be the right choice. Likewise, if you really need ACID you may have a problem with Cassandra's consistency model.
If your SQL data model is fully denormalized and you perform queries without joins, then you can just replicate your table schemas as Cassandra column families and you're done, even if this may not be optimal.
Your use case is probably somewhere in between, and you really need to understand how you can model your data in Cassandra. You have to develop this understanding and perform this analysis yourself, because you know your domain and we don't. However, don't hesitate to give clues about your model and how you need to query your data so you can be advised.
200 GB is low for Cassandra, and you may discover that your data takes much less space in Cassandra than in MySQL, even when widely denormalized, because Cassandra is pretty efficient.
You can migrate data from MySQL to Cassandra using Spark.
Spark has connectivity with both MySQL and Cassandra. First you create the model in Cassandra according to your requirements; after that you pull all the data from MySQL, do some transformation, and then push the data directly into Cassandra.
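A rough sketch of that flow, assuming the Spark DataFrame API with the spark-cassandra-connector and the MySQL JDBC driver on the classpath (the JDBC URL, credentials, keyspace, table names, and the sample transformation are all placeholders):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class MysqlToCassandra {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("mysql-to-cassandra")
                // Cassandra contact point for the connector (address is hypothetical)
                .config("spark.cassandra.connection.host", "127.0.0.1")
                .getOrCreate();

        // 1. Pull the source table from MySQL over JDBC.
        Dataset<Row> users = spark.read()
                .format("jdbc")
                .option("url", "jdbc:mysql://mysql-host:3306/appdb")
                .option("dbtable", "users")
                .option("user", "app")
                .option("password", "secret")
                .load();

        // 2. Reshape to match the Cassandra model designed beforehand
        //    (here just a column rename, as an example transformation).
        Dataset<Row> reshaped = users.withColumnRenamed("user_id", "id");

        // 3. Push into the pre-created Cassandra table via the connector.
        reshaped.write()
                .format("org.apache.spark.sql.cassandra")
                .option("keyspace", "appks")
                .option("table", "users_by_id")
                .mode("append")
                .save();

        spark.stop();
    }
}
```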
Transferring relational data directly to Cassandra isn't possible; you have to denormalize it. However, be warned that some queries, and some ways of denormalizing for them, are anti-patterns. Go through these free courses first:
http://academy.datastax.com/courses/ds201-cassandra-core-concepts
https://academy.datastax.com/courses/ds220-data-modeling
If you get the Cassandra data model design for your relational data wrong, you won't get the nice features Cassandra provides. For example, you won't get horizontal scalability (you might have hot spots in your cluster) or high availability (it might happen that for some queries all of the nodes are needed to build the response).
I need a NoSQL database that will run on Windows Azure and works well for the following parameters. Right now Azure Table Storage, HBase and Cassandra seem to be the most promising options.
1 billion entities
up to 100 reads per second, though caching will mostly make it much less
around 10 - 50 writes per second
Strong consistency would be a plus, so perhaps HBase would be better than Cassandra in that regard.
Querying will often be done on a secondary in-memory database with various indexes in addition to ElasticSearch or Windows Azure Search for fulltext search and perhaps some filtering.
Azure Table Storage looks like it could be nice, but from what I can tell, the big difference between Azure Table Storage and HBase is that HBase supports updating and reading values for a single property instead of the whole entity at once. I guess there must be some disadvantages to HBase, but I'm not sure what they would be in this case.
I also think crate.io looks like it could be interesting, but I wonder if there might be unforeseen problems.
Does anyone have any other ideas about the advantages and disadvantages of the different databases in this case, and whether any of them are really unsuited for some reason?
I currently work with Cassandra and I might help with a few pros and cons.
Requirements
Cassandra can easily handle those 3 requirements. It was designed to have fast reads and writes. In fact, Cassandra is blazing fast with writes, mostly because you can write without doing a read.
Also, Cassandra keeps some of its data in memory, so you could even avoid the secondary database.
Consistency
In Cassandra you choose the consistency level for each query you make, so you can have consistent data when you want it (a driver-level sketch follows this list). Normally you use:
ONE - Only one node has to respond to the read or accept the change. This means fast reads/writes but low consistency (another machine may deliver older information while consistency has not yet been achieved).
QUORUM - A majority of your nodes must respond to the read or accept the change. This means slower reads and writes, but you get full consistency IF you use it for BOTH reads and writes. That's because if more than half of your nodes have the data after you insert/update/delete, then, when reading from more than half of your nodes, at least one node will have the most recent information, and that is what gets delivered.
Both of these options are recommended because they avoid single points of failure. If all machines had to accept the change and one node was down or busy, you wouldn't be able to query.
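Here is what per-query consistency looks like with the DataStax Java driver (3.x-style API), as a minimal sketch; the contact point, keyspace, and table are invented:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class PerQueryConsistency {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("shop")) {

            // Write at QUORUM...
            SimpleStatement insert = new SimpleStatement(
                    "INSERT INTO products (id, name) VALUES (42, 'keyboard')");
            insert.setConsistencyLevel(ConsistencyLevel.QUORUM);
            session.execute(insert);

            // ...and read at QUORUM: together they give consistent reads, because the
            // write and read replica sets must overlap in at least one node.
            SimpleStatement select = new SimpleStatement(
                    "SELECT name FROM products WHERE id = 42");
            select.setConsistencyLevel(ConsistencyLevel.QUORUM);
            ResultSet rs = session.execute(select);
            System.out.println(rs.one().getString("name"));
        }
    }
}
```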
Pros
Cassandra is the solution for performance, linear scalability, and avoiding single points of failure (you can have machines down; the others will take over the work). It does most of its management work automatically: you don't need to manage data distribution, replication, etc.
Cons
The downsides of Cassandra are in the modeling and queries.
With a relational database you model around the entities and the relationships between them. Normally you don't really care about what queries will be made and you work to normalize it.
With Cassandra the strategy is different. You model the tables to serve the queries. And that happens because you can't join and you can't filter the data any way you want (only by its primary key).
So if you have a database for a company with grocery stores, and you want one query that returns all products of a certain store (e.g. New York City) and another query that returns all products of a certain department (e.g. Computers), you would have two tables, "ProductsByStore" and "ProductsByDepartment", with the same data but organized differently to serve each query.
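A minimal sketch of that two-table layout through the DataStax Java driver (the keyspace and column names are invented, and the keyspace is assumed to exist):

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class QueryDrivenTables {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("grocery")) {

            // Same data, two layouts: each table's partition key is the thing you query by.
            session.execute("CREATE TABLE IF NOT EXISTS products_by_store ("
                    + " store text, product_id int, name text, department text,"
                    + " PRIMARY KEY (store, product_id))");
            session.execute("CREATE TABLE IF NOT EXISTS products_by_department ("
                    + " department text, product_id int, name text, store text,"
                    + " PRIMARY KEY (department, product_id))");

            // Every write goes to both tables, since there are no joins to assemble it later.
            session.execute("INSERT INTO products_by_store (store, product_id, name, department)"
                    + " VALUES ('New York City', 1, 'Laptop', 'Computers')");
            session.execute("INSERT INTO products_by_department (department, product_id, name, store)"
                    + " VALUES ('Computers', 1, 'Laptop', 'New York City')");

            // Each query then hits exactly one partition:
            //   SELECT * FROM products_by_store WHERE store = 'New York City';
            //   SELECT * FROM products_by_department WHERE department = 'Computers';
        }
    }
}
```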
Materialized views can help with this, avoiding the need to maintain multiple tables yourself, but the point is to show how differently things work with Cassandra.
Denormalization is also common in Cassandra for the same reason: Performance.
I am storing all my data in Cassandra using the CLI. Is there any possibility of rollback in Cassandra, or some other technique? Please tell me.
The closest you can get to the transactional behavior you are asking about is using BATCH. However, the semantics of BATCH are not equivalent to an RDBMS transaction. Mainly:
all updates in a BATCH belonging to a given partition key are performed atomically and in isolation
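For illustration, a logged batch through the DataStax Java driver looks like the sketch below (keyspace, table, and values are made up). A logged batch guarantees that either all statements are eventually applied or none are, but it gives isolation only within a single partition; it is not an RDBMS-style transaction and it cannot be rolled back once applied:

```java
import com.datastax.driver.core.BatchStatement;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class LoggedBatchExample {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("demo")) {

            // Equivalent CQL: BEGIN BATCH ... APPLY BATCH;
            BatchStatement batch = new BatchStatement(BatchStatement.Type.LOGGED);
            batch.add(new SimpleStatement(
                    "UPDATE accounts SET balance = 90 WHERE user_id = 1"));
            batch.add(new SimpleStatement(
                    "UPDATE accounts SET balance = 110 WHERE user_id = 2"));
            // Rows in different partitions may be briefly visible to readers
            // one at a time, even though both updates will eventually apply.
            session.execute(batch);
        }
    }
}
```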
I am looking for Cassandra/CQL's cousin of the common SQL idiom of INSERT INTO ... SELECT ... FROM ... and have been unable to find anything to do such an operation programmatically or in CQL. Is it just not supported?
My use case is to do a reasonably bulky copy from one table to another. I don't need any particular concurrency guarantees, but it's a lot of data, so I'd like to avoid the extra network overhead of writing a client that retrieves data from one table and then issues batches of inserts into the other table. I understand that the changes will still need to be transported between nodes of the Cassandra cluster according to the replication set-up, but it seems reasonable for there to be an "internal" option to do a bulk operation from one table to another. Is there such a thing in CQL or elsewhere? I'm currently using Hector to talk to Cassandra.
Edit: it looks like sstableloader might be relevant, but it is awfully low-level for something that I'd expect to be a fairly common use case. Taking just a subset of rows from one table to another also seems less than trivial in that framework.
Correct, this is not supported natively. (Another alternative would be a map/reduce job.) Cassandra's API focuses on short requests for applications at scale, not batch or analytical queries.
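In the absence of a server-side INSERT INTO ... SELECT, the usual workaround is either a Spark/Hadoop job or a simple client-side copy. A naive sketch of the latter with the DataStax Java driver follows (the questioner uses Hector, but the idea is the same); the keyspace, tables, and columns are placeholders:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class NaiveTableCopy {
    public static void main(String[] args) {
        // Page through the source table and re-insert each row into the target table.
        // For very large tables a Spark/Hadoop job or sstableloader is a better fit.
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("demo")) {

            PreparedStatement insert = session.prepare(
                    "INSERT INTO target_table (id, payload) VALUES (?, ?)");

            // The driver pages through the full-table scan transparently.
            for (Row row : session.execute("SELECT id, payload FROM source_table")) {
                // Filter here if only a subset of rows should be copied.
                session.execute(insert.bind(row.getInt("id"), row.getString("payload")));
            }
        }
    }
}
```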