When to use Materialized Views? - cassandra

I'm learning Cassandra now and I understand I should make a table for each query. I'm not sure when I should make separate tables or materialized views. For example, I have the following queries for users and posts:
users_by_id
users_by_email
users_by_session_key
posts_by_id
posts_by_category
posts_by_user
Should I always use materialized views?
It seems to me that if you want to keep the Posts or Users consistent across queries, then I have to use materialized views. However materialized views I read have a read before write latency.
On the other hand, if I use different tables, am I supposed to make 3 Inserts every time a new post is created? I noticed that I get the error batch with conditions cannot span multiple tables, which means I have to insert it one at a time into each separate table, which can cause consistency problems if one of the queries fails. (A batch statement, would fail all 3 if one of them failed).
So, since it makes sense to have consistency, then it seems to me that I will always want to use materialized views, and have to take the read before write penalty.
I guess my other question is when would it ever be okay for data to be inconsistent?
So hoping someone can provide more clarity for me for how to handle multiple queries in cassandra on a 'theoretical model` like Users or Posts. Should I be using materialized views? If I use 3 different tables for each model, how do I keep them consistent? Just hope that all 3 inserts don't fail? Doesn't seem right.

Read my deep dive blog post for all the trade-offs when using materialized views. Once you understand the trade-offs, choose wisely: http://www.doanduyhai.com/blog/?p=1930

No, you shouldn't always use materialized views. The perfect solution is a interface for your database. In this application, you handle all your different tables. But there's are also some use case for the materialized views: If you haven't the time for this application but you need this feature, use materialized views. You have a performance trade off but in this scenario, the time is more important. If you also need real updates instead of upserts on all tables: use materialized views.
Batch is useful for buffering or putting data-sets with the same partition key together. For example: You have a high data troughput application. Between your heartbeats or between execution another query with QUORUM, you got 10 other events with the same partition key. But you won't execute them because you're waiting for a successful response. If a success comes back, you execute a batch query. But please keep in mind: Use only a batch for the same partition keys.
Generally, remember one important thing: Cassandra has an eventually consistency model. That means: If you use qourum, you will have consistency but not every time. If your application needs a full consistency, not only eventually use another solution. E.g. SQL with sharding. Cassandra is optimized for writes and you will only get happy when you're using the cassandra features.
Some performance tips:
If you need a better consistency: Use QUORUM, never use ALL. And, generally, write you queries standalone. Sometimes batch is useful. Don't execute queries with ALLOW FILTERING. Don't use token ranges or IN operator on partition keys :)

Related

Best way of querying table without providing the primary key

I am designing the data model of our Scylla database. For example, I created a table, intraday_history, with fields:
CREATE TABLE intraday_history (id bigint,timestamp_seconds bigint,timestamp timestamp,sec_code text,open float,high float,low float,close float,volume float,trade int, PRIMARY KEY ((id,sec_code),timestamp_seconds,timestamp));
My id is a twitter_snowflake generated 64-bit integers.. My problem is how can I use WHERE without providing always the id (most of the time I will use the timestamp with bigint). I also encounter this problem in other tables. Because the id is unique then I cannot query a batch of timestamp.
Is it okay if lets say for a bunch of tables for my 1 node, I will use an ID like cluster1 so that when I query the id I will just id=cluster1 ? But it loss the uniqueness feature
Allow filtering comes as an option here. But I keep reading that it is a bad practice, especially when dealing with millions of query.
I'm using the ScyllaDB, a compatible c++ version of Apache Cassandra.
In Cassandra, as you've probably already read, the queries derive the tables, not the other way around. So your situation where you want to query by a different filter would ideally entail you creating another Cassandra table. That's the optimal way. Partition keys are required in filters unless you provide the "allow filtering" "switch", but it isn't recommended as it will perform a DC (possibly cluster)-wide search, and you're still subjected to timeouts. You could consider using indexes or materialized views, which are basically cassandra maintained tables populated by the base table's changes. That would save you the troubles of having the application populate multiple tables (Cassandra would do it for you). We've had some luck with materialized views, but with either of these components, there can be side effects like any other cassandra table (inconsistencies, latencies, additional rules, etc.). I would say do a bit of research to determine the best approach, but most likely providing "allow filtering" isn't the best choice (especially for high volume and frequent queries or with tables containing high volumes of data). You could also investigate SOLR if that's an option, depending on what you're filtering.
Hope that helps.
-Jim

Dynamic Cassandra queries

I have a messenger application with a history page, on which you can see your sent and received messages.
Since the amount of messages has lowered my performance I have been thinking about using Cassandra.
After researching on the topic of Cassandra, I found out that you have to build tables to satisfy your queries.
Now the problem: on the history page you can use x amount of different filters at the same time. e.g filter by date,receiver and sender.
If I were to use Cassandra, would I need to create a table for every combination of these filters?
Or is this a bad use case for Cassandra in general?
If so, are there any alternatives?
Why don't you just make a SELECT statement.
You should definately have a look into CQL (Cassandra Query Language).
While CQL and SQL share a similar syntax queries are a lot different.
The reasons for these differences is the fact that Cassandra is dealing with distributed data and aims to prevent inefficient queries.
See this link for reference. It shows queries you can or cannot do.

Choosing a NoSQL database

I need a NoSQL database that will run on Windows Azure that works well for the following parameters. Right now Azure Table Storage, HBase and Cassandra seems to be the most promising options.
1 billion entities
up to 100 reads per second, though caching will mostly make it much less
around 10 - 50 writes per second
Strong consistency would be a plus, so perhaps HBase would be better than Cassandra in that regard.
Querying will often be done on a secondary in-memory database with various indexes in addition to ElasticSearch or Windows Azure Search for fulltext search and perhaps some filtering.
Azure Table Storage looks like it could be nice, but from what I can tell, the big difference between Azure Table Storage and HBase is that HBase supports updating and reading values for a single property instead of the whole entity at once. I guess there must be some disadvantages to HBase however, but I'm not sure what they would be in this case.
I also think crate.io looks like it could be interesting, but I wonder if there might be unforseen problems.
Anyone have any other ideas of the advantages and disadvantages of the different databases in this case, and if any of them are really unsuited for some reason?
I currently work with Cassandra and I might help with a few pros and cons.
Requirements
Cassandra can easily handle those 3 requirements. It was designed to have fast reads and writes. In fact, Cassandra is blazing fast with writes, mostly because you can write without doing a read.
Also, Cassandra keeps some of its data in memory, so you could even avoid the secondary database.
Consistency
In Cassandra you choose the consistency in each query you make, therefore you can have consistent data if you want to. Normally you use:
ONE - Only one node has to get or accept the change. This means fast reads/writes, but low consistency (You can have other machine delivering the older information while consistency was not achieved).
QUORUM - 51% of your nodes must get or accept the change. This means not as fast reads and writes, but you get FULL consistency IF you use it in BOTH reads and writes. That's because if more than half of your nodes have your data after you inserted/updated/deleted, then, when reading from more than half your nodes, at least one node will have the most recent information, which would be the one to be delivered.
Both this options are the ones recommended because they avoid single points of failure. If all machines had to accept, if one node was down or busy, you wouldn't be able to query.
Pros
Cassandra is the solution for performance, linear scalability and avoid single points of failure (You can have machines down, the others will take the work). And it does most of its management work automatically. You don't need to manage the data distribution, replication, etc.
Cons
The downsides of Cassandra are in the modeling and queries.
With a relational database you model around the entities and the relationships between them. Normally you don't really care about what queries will be made and you work to normalize it.
With Cassandra the strategy is different. You model the tables to serve the queries. And that happens because you can't join and you can't filter the data any way you want (only by its primary key).
So if you have a database for a company with grocery stores and you want to make a query that returns all products of a certain store (Ex.: New York City), and another query to return all products of a certain department (Ex.: Computers), you would have two tables "ProductsByStore" and "ProductsByDepartment" with the same data, but organized differently to serve the query.
Materialized Views can help with this, avoiding the need to change in multiple tables, but it is to show how things work differently with Cassandra.
Denormalization is also common in Cassandra for the same reason: Performance.

Inner Join in cassandra CQL

How do I write subqueries/nested queries in cassandra. Is this facility is provided in CQL?
Example I tried:
cqlsh:testdb> select itemname from item where itemid = (select itemid from orders where customerid=1);
It just throws the following error -
Bad Request: line 1:87 no viable alternative at input ';'
Because of its distributed nature, Cassandra has no support for RDBMS style joins. You have a few options for when you want something like a join.
One option perform separate queries and then have your application join the data itself. This makes sense if the data is relatively small and you only have to perform a small number of queries. Based on the example you gave above, this would probably be a good solution for you.
For more complicated joins, the usual strategy is to denormalize the data and store a materialized view of the join. The advantage to this is that fetching this data will be much faster than having to build it join in your application every time you need it. The cost is now you have multiple places where you are storing the same data and you will need to keep it all in sync. You can either update all your views when new data comes into the system or you can have a periodic batch job that rebuilds thems.
You might find this article useful: Do You Really Need SQL to Do It All in Cassandra? Its a bit old but its principles still apply.

Cassandra bulk insert operation, internally

I am looking for Cassandra/CQL's cousin of the common SQL idiom of INSERT INTO ... SELECT ... FROM ... and have been unable to find anything to do such an operation programmatically or in CQL. Is it just not supported?
My use case is to do a reasonably bulky copy from one table to another. I don't need any particular concurrent guarantees, but it's a lot of data so I'd like to avoid the additional network overhead of writing a client that retrieves data from one table, then issues batches of inserts into the other table. I understand that the changes will still need to be transported between nodes of the Cassandra cluster according to the replication set-up, but it seems reasonable for there to be an "internal" option to do a bulk operation from one table to another. Is there such a thing in CQL or elsewhere? I'm currently using Hector to talk to Cassandra.
Edit: it looks like sstableloader might be relevant, but is awfully low-level for something that I'd expect to be a fairly common use case. Taking just a subset of rows from one table to another also seems less than trivial in that framework.
Correct, this is not supported natively. (Another alternative would be a map/reduce job.) Cassandra's API focuses on short requests for applications at scale, not batch or analytical queries.

Resources