Based on the document "Benchmarking Distributed SQL Databases", it appears that throughput is higher in YCQL than in YSQL.
If we use the same table structure and the same tool to insert the data, and I am not using any SQL-specific features, why does YCQL perform better than YSQL?
This could be because of a few differences between YCQL and YSQL. Note that while these differences are not fundamental to the architecture, they manifest because YSQL started with the PostgreSQL code for the upper half of the DB. Many of these are being enhanced.
One-hop optimization: YCQL is shard-aware and knows how the underlying DB (called DocDB) shards and distributes data across nodes. This means it can "hop" directly to the node that contains the data when using PREPARE-BIND statements. YSQL cannot do this today, since it requires a JDBC-level protocol change; this work is being done in the jdbc-yugabytedb project.
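A rough sketch of what the one-hop optimization buys: a shard-aware client computes the owning node itself and goes straight there, while a non-shard-aware client lands on an arbitrary coordinator that may have to forward the request. The node names and the hash-based placement below are illustrative, not YugabyteDB's actual scheme.

```python
# Hypothetical sketch of shard-aware ("one-hop") routing vs. coordinator routing.
import hashlib

NODES = ["node-a", "node-b", "node-c"]

def owner_of(key: str) -> str:
    """Map a partition key to the node that owns its shard."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return NODES[h % len(NODES)]

def shard_aware_route(key: str) -> list:
    # A shard-aware client sends the request straight to the
    # owning node: always one network hop.
    return [owner_of(key)]

def coordinator_route(key: str, contact: str) -> list:
    # A non-shard-aware client sends to whichever node it is
    # connected to; that coordinator forwards to the owner if
    # needed: up to two hops.
    owner = owner_of(key)
    return [contact] if contact == owner else [contact, owner]

key = "user:42"
print(len(shard_aware_route(key)))            # always 1
print(len(coordinator_route(key, "node-a")))  # 1 or 2
```

The extra hop matters most for small point reads and writes, where the network round trip dominates the total latency.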
Threads instead of processes: YCQL uses threads to handle incoming client queries/statements, while YSQL (and the PostgreSQL code) uses processes. These processes are heavier weight, which can affect throughput in some scenarios (and connection scalability in others). This is another enhancement that is planned.
Upsert vs. insert: In YCQL, each insert is treated as an upsert (update or insert, without checking the existing value) by default, and special syntax is needed to perform pure inserts. In YSQL, each insert needs to read the data before writing, because if the key already exists, the insert is treated as a failure.
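The cost difference can be made concrete with a toy model: an upsert writes blindly, while a SQL-style insert must first read to detect duplicates, so every insert carries a hidden read.

```python
# Toy model: upsert (blind write) vs. SQL-style INSERT (read-then-write).
reads = {"count": 0}
table = {}

def ycql_style_upsert(key, value):
    # No read required: the write simply overwrites any existing row.
    table[key] = value

def ysql_style_insert(key, value):
    # Must check for an existing key first, which costs a read.
    reads["count"] += 1
    if key in table:
        raise ValueError(f"duplicate key {key!r}")
    table[key] = value

ycql_style_upsert("k1", "v1")
ycql_style_upsert("k1", "v2")   # silently overwrites; zero reads so far
print(reads["count"])            # 0
ysql_style_insert("k2", "v1")
print(reads["count"])            # 1
```

On a distributed store, that extra read is a network round trip per insert, which directly reduces write throughput.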
More work has gone into YCQL performance: Currently (end of 2019) the focus for YSQL has been on correctness and functionality, while YCQL performance has been worked on quite a bit. Note that although the work on YSQL performance has only just started, it should be possible to improve it relatively quickly because of the underlying architecture.
Related
I am reading about NoSQL DBs (specifically Cassandra), and it says that Cassandra is fast for writes and queries are fast as well. Schema design is driven more by the queries than by the data. For example, you have queries like in this example.
So my question is: suppose I design the RDBMS schema similar to Cassandra's way and ensure that no joins are required for the queries. Would I still get any significant performance gains by using Cassandra (NoSQL DBs)?
There is no exact answer, but a few points:
JOIN is just one of many things - Cassandra stores the data physically based on the partition keys, making reads by partition as fast as possible.
On the performance side - it's not about performance at the beginning, but about keeping performance consistent over time. Say, for example, you have a time-series-like requirement where data is inserted every second: RDBMS performance will usually degrade as the data grows, and it is not easy to keep the indexes and stats up to date, while Cassandra fits a time-series pattern better, and as the data grows it is easy to scale up by adding nodes.
On the write performance - Cassandra's write workflow itself is different and is designed to accept writes faster (the complicated processes, like merging SSTables and compaction, happen in the background without affecting the actual write).
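The shape of that write path can be sketched in a few lines: a write is just a commit-log append plus an in-memory memtable update, and the expensive work (flushing to SSTables, compaction) happens later, off the write path. This is a simplified model, not Cassandra's actual implementation.

```python
# Minimal sketch of an LSM-style write path like Cassandra's.
commit_log = []   # sequential append: fast and durable
memtable = {}     # in-memory, sorted structure in the real thing
sstables = []     # immutable on-disk tables in the real thing

def write(key, value):
    commit_log.append((key, value))  # the only disk I/O on the write path
    memtable[key] = value            # in-memory update

def flush():
    # Background step: persist the memtable as an immutable SSTable.
    if memtable:
        sstables.append(dict(memtable))
        memtable.clear()

def compact():
    # Background step: merge SSTables; the newest value per key wins.
    merged = {}
    for sst in sstables:
        merged.update(sst)
    sstables[:] = [merged] if merged else []

write("a", 1)
write("a", 2)
flush()
write("a", 3)
flush()
compact()
print(sstables)  # [{'a': 3}]
```

Because nothing on the write path requires reading or rewriting existing data, write latency stays flat even as the dataset grows.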
In short - you need to review the business case and make a decision.
We are planning to use combination of Cassandra and Titan/JanusGraph db as the backend for one of our projects. As part of that, I have the below requirement.
Record/Vertex A and Record/Vertex B should be written to the backend atomically, i.e., either both records are written or neither is. Essentially, I need multi-row atomic writes. However, from the documentation of both Titan and Cassandra, listed below, this is what I found.
Titan DB
Titan transactions are not necessarily ACID. They can be so configured on BerkeleyDB, but they are generally not so on Cassandra or HBase, where the underlying storage system does not provide serializable isolation or multi-row atomic writes, and the cost of simulating those properties would be substantial.
Cassandra 2.0
In Cassandra, a write is atomic at the partition-level, meaning inserting or updating columns in a row is treated as one write operation.
Cassandra 3.0
In Cassandra, a write operation is atomic at the partition level, meaning the insertions or updates of two or more rows in the same partition are treated as one write operation.
I have below questions.
1) We use Titan DB with Cassandra 2.1.x. If I want to achieve multi-row atomicity, how do I do that? Is there any solution to achieve this?
2) I see that the Cassandra batch operation provides atomicity for multiple operations, but I don't see a corresponding operation in Titan DB to use this functionality. Am I missing something here, or is there any way to use this?
3) As Cassandra is heavily used in various applications, I am pretty sure people have use cases that require multi-row atomic operations. How do people solve this?
4) I see that Cassandra 3.0 has this support. So when JanusGraph starts supporting Cassandra 3.0 (currently it only supports 2.1.x), should I expect this support in JanusGraph?
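For reference, the guarantee a Cassandra logged BATCH provides for the writes it contains is all-or-nothing application: either every mutation in the batch is eventually applied, or none is. A toy model of that contract, with made-up vertex keys (this says nothing about isolation, which batches do not provide):

```python
# Illustrative all-or-nothing batch: stage every mutation, and apply
# them only if all of them are valid. Keys and the "None means invalid"
# rule are hypothetical stand-ins for a failing write.
store = {}

def apply_batch(mutations):
    staged = []
    for key, value in mutations:
        if value is None:
            return False        # one failure aborts the whole batch
        staged.append((key, value))
    for key, value in staged:   # all writes apply together
        store[key] = value
    return True

ok = apply_batch([("vertex:A", {"label": "A"}), ("vertex:B", {"label": "B"})])
print(ok, sorted(store))         # True ['vertex:A', 'vertex:B']
bad = apply_batch([("vertex:C", {"label": "C"}), ("vertex:D", None)])
print(bad, "vertex:C" in store)  # False False
```

Note that atomicity here is not isolation: another reader may observe one write before the other lands, which is part of why Titan cannot cheaply offer ACID semantics on top of Cassandra.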
A few years ago, Facebook decided to use hbase instead of cassandra for its messaging system: http://highscalability.com/blog/2010/11/16/facebooks-new-real-time-messaging-system-hbase-to-store-135.html
The main reason FB used HBase was that reads are faster than writes compared with Cassandra. Is this still true? I am using Cassandra 3.0, and when setting the read consistency level to ONE or TWO, reads are faster than when setting it to ALL.
Now my question is: if Facebook had to decide between Cassandra and HBase in 2016, would its decision still be HBase?
Cassandra was originally designed and built for optimized write performance. As versions have been released, a lot of work has gone into bringing read performance much closer to write performance. There have been multiple benchmarks and studies of HBase versus Cassandra, and in general they tend to say that performance is about equal, with Cassandra a bit better. However, I always take these performance benchmark studies with a grain of salt, as you can make either one the winner depending on how you set up the test.
You will almost certainly get faster reads and writes with CL=ONE than with ALL, because the coordinator only needs to wait for one of the replicas to respond instead of all of them. If you are in a multi-DC scenario, then LOCAL_ONE will increase the throughput even more.
As for whether or not FB would choose Cassandra over HBase, it is impossible to say, because there is much more to that decision than simple performance metrics. I can say that messaging is a use case Cassandra performs well in. You can read their use cases here:
http://www.planetcassandra.org/blog/functional_use_cases/messaging/
I need a NoSQL database that will run on Windows Azure that works well for the following parameters. Right now Azure Table Storage, HBase and Cassandra seems to be the most promising options.
1 billion entities
up to 100 reads per second, though caching will mostly make it much less
around 10 - 50 writes per second
Strong consistency would be a plus, so perhaps HBase would be better than Cassandra in that regard.
Querying will often be done on a secondary in-memory database with various indexes in addition to ElasticSearch or Windows Azure Search for fulltext search and perhaps some filtering.
Azure Table Storage looks like it could be nice, but from what I can tell, the big difference between Azure Table Storage and HBase is that HBase supports updating and reading values for a single property instead of the whole entity at once. I guess there must be some disadvantages to HBase as well, but I'm not sure what they would be in this case.
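That data-model difference can be made concrete with a toy cost model: a wide-column store can touch a single column of a row, while an entity store rewrites the whole entity. The row contents and byte-counting below are illustrative, not either product's actual wire format.

```python
# Sketch: per-column updates (HBase-style) vs. whole-entity updates
# (Azure Table Storage-style), comparing bytes moved per update.
row = {"name": "Ada", "email": "ada@example.com", "bio": "x" * 1000}
io_bytes = {"column_store": 0, "entity_store": 0}

def update_column_store(col, value):
    # Only the touched column travels over the wire.
    io_bytes["column_store"] += len(str(value))
    row[col] = value

def update_entity_store(col, value):
    # The whole entity is rewritten, large properties included.
    new_row = dict(row, **{col: value})
    io_bytes["entity_store"] += sum(len(str(v)) for v in new_row.values())
    row.update(new_row)

update_column_store("email", "ada@new.example")
update_entity_store("email", "ada@new.example")
print(io_bytes["column_store"] < io_bytes["entity_store"])  # True
```

The gap only matters when entities carry large properties that change rarely; for small entities the whole-entity model is usually fine.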
I also think crate.io looks like it could be interesting, but I wonder if there might be unforeseen problems.
Anyone have any other ideas of the advantages and disadvantages of the different databases in this case, and if any of them are really unsuited for some reason?
I currently work with Cassandra, and I might be able to help with a few pros and cons.
Requirements
Cassandra can easily handle those 3 requirements. It was designed to have fast reads and writes. In fact, Cassandra is blazing fast with writes, mostly because you can write without doing a read.
Also, Cassandra keeps some of its data in memory, so you could even avoid the secondary database.
Consistency
In Cassandra you choose the consistency in each query you make, therefore you can have consistent data if you want to. Normally you use:
ONE - Only one node has to get or accept the change. This means fast reads/writes but low consistency (another machine can deliver older information if consistency has not yet been achieved).
QUORUM - A majority (more than half) of your nodes must get or accept the change. This means slower reads and writes, but you get FULL consistency IF you use it for BOTH reads and writes. That's because if more than half of your nodes have your data after you insert/update/delete, then, when reading from more than half of your nodes, at least one node will have the most recent information, and that is what will be delivered.
Both of these options are recommended because they avoid single points of failure. If all machines had to accept, then if one node was down or busy, you wouldn't be able to query.
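The quorum argument above is just overlap arithmetic: with replication factor N, write consistency W, and read consistency R, every read set intersects the latest write set whenever R + W > N, so at least one replica in the read is current.

```python
# The R + W > N rule behind QUORUM reads and writes.
def is_consistent(n: int, w: int, r: int) -> bool:
    """True if any read of r replicas must overlap any write of w replicas."""
    return r + w > n

N = 3
quorum = N // 2 + 1                       # 2 of 3
print(is_consistent(N, quorum, quorum))   # True: QUORUM/QUORUM
print(is_consistent(N, 1, 1))             # False: ONE/ONE can read stale data
print(is_consistent(N, N, 1))             # True: write ALL, read ONE
```

This is also why ONE/ONE is fast but only eventually consistent: the read set and write set can miss each other entirely.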
Pros
Cassandra is the solution for performance, linear scalability, and avoiding single points of failure (you can have machines down; the others will take over the work). And it does most of its management work automatically: you don't need to manage data distribution, replication, etc.
Cons
The downsides of Cassandra are in the modeling and queries.
With a relational database, you model around the entities and the relationships between them. Normally you don't really care about what queries will be made, and you work to normalize the data.
With Cassandra, the strategy is different: you model the tables to serve the queries. That's because you can't join, and you can't filter the data any way you want (only by its primary key).
So if you have a database for a company with grocery stores, and you want one query that returns all products of a certain store (e.g., New York City) and another query that returns all products of a certain department (e.g., Computers), you would have two tables, "ProductsByStore" and "ProductsByDepartment", with the same data but organized differently to serve each query.
Materialized views can help with this, avoiding the need to write to multiple tables yourself, but it shows how differently things work in Cassandra.
Denormalization is also common in Cassandra for the same reason: Performance.
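The grocery-store example above can be sketched with plain dictionaries: the same product rows are stored twice, each copy partitioned for exactly one query, so every query is a single-partition lookup. Table and field names here are made up for illustration.

```python
# Query-driven modeling: duplicate the data, partition it per query.
products = [
    {"store": "New York City", "department": "Computers", "name": "Laptop"},
    {"store": "New York City", "department": "Grocery",   "name": "Apples"},
    {"store": "Boston",        "department": "Computers", "name": "Mouse"},
]

by_store = {}        # "ProductsByStore": partitioned by store
by_department = {}   # "ProductsByDepartment": same data, partitioned by department
for p in products:
    by_store.setdefault(p["store"], []).append(p)
    by_department.setdefault(p["department"], []).append(p)

# Each query hits exactly one partition — no joins, no scans.
print([p["name"] for p in by_store["New York City"]])   # ['Laptop', 'Apples']
print([p["name"] for p in by_department["Computers"]])  # ['Laptop', 'Mouse']
```

The trade-off is that the application (or a materialized view) must keep both copies in sync on every write.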
I am looking for the Cassandra/CQL cousin of the common SQL idiom INSERT INTO ... SELECT ... FROM ..., and have been unable to find anything to do such an operation programmatically or in CQL. Is it just not supported?
My use case is to do a reasonably bulky copy from one table to another. I don't need any particular concurrent guarantees, but it's a lot of data so I'd like to avoid the additional network overhead of writing a client that retrieves data from one table, then issues batches of inserts into the other table. I understand that the changes will still need to be transported between nodes of the Cassandra cluster according to the replication set-up, but it seems reasonable for there to be an "internal" option to do a bulk operation from one table to another. Is there such a thing in CQL or elsewhere? I'm currently using Hector to talk to Cassandra.
Edit: it looks like sstableloader might be relevant, but is awfully low-level for something that I'd expect to be a fairly common use case. Taking just a subset of rows from one table to another also seems less than trivial in that framework.
Correct, this is not supported natively. (An alternative would be a map/reduce job.) Cassandra's API focuses on short requests for applications at scale, not batch or analytical queries.
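In practice, the workaround is the client-side copy the asker was hoping to avoid: page through the source table and issue batched writes to the destination, optionally filtering rows along the way. This sketch stands in in-memory dicts for the real tables; the page size and predicate are illustrative.

```python
# Client-side INSERT ... SELECT: page through the source, write in batches.
def copy_table(source: dict, dest: dict, page_size: int = 2,
               predicate=lambda k, v: True):
    keys = sorted(source)
    for i in range(0, len(keys), page_size):      # one "page" per round trip
        page = keys[i:i + page_size]
        batch = {k: source[k] for k in page if predicate(k, source[k])}
        dest.update(batch)                        # one batched write per page

src = {"a": 1, "b": 2, "c": 3, "d": 4}
dst = {}
copy_table(src, dst, predicate=lambda k, v: v % 2 == 0)  # copy a subset
print(dst)  # {'b': 2, 'd': 4}
```

With a real driver, the paging is what keeps memory bounded for large tables, and the predicate is where "a subset of rows" gets selected, since CQL itself cannot express the copy.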