Inner Join in cassandra CQL - cassandra

How do I write subqueries/nested queries in Cassandra? Is this facility provided in CQL?
Example I tried:
cqlsh:testdb> select itemname from item where itemid = (select itemid from orders where customerid=1);
It just throws the following error -
Bad Request: line 1:87 no viable alternative at input ';'

Because of its distributed nature, Cassandra has no support for RDBMS style joins. You have a few options for when you want something like a join.
One option is to perform separate queries and then have your application join the data itself. This makes sense if the data is relatively small and you only have to perform a small number of queries. Based on the example you gave above, this would probably be a good solution for you.
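As a rough sketch of that first option, here is the two-query pattern with plain Python data standing in for driver result sets. The table and column names follow the question; in a real client each list would come from something like `session.execute(...)`:

```python
# Sketch of a client-side "join": run two separate queries and merge
# the results in application code. The row data here is hypothetical;
# in a real app each list would come from a driver call.

def item_ids_for_customer(orders, customer_id):
    """First query: find the item ids in a customer's orders."""
    return [row["itemid"] for row in orders if row["customerid"] == customer_id]

def join_item_names(orders, items, customer_id):
    """Second query + merge: look up each item's name by id."""
    item_ids = item_ids_for_customer(orders, customer_id)
    names_by_id = {row["itemid"]: row["itemname"] for row in items}
    return [names_by_id[i] for i in item_ids if i in names_by_id]

orders = [{"customerid": 1, "itemid": 10}, {"customerid": 2, "itemid": 11}]
items = [{"itemid": 10, "itemname": "widget"}, {"itemid": 11, "itemname": "gadget"}]

print(join_item_names(orders, items, 1))  # ['widget']
```

The merge step is just a dictionary lookup, so the cost is dominated by the two round trips to the cluster.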
For more complicated joins, the usual strategy is to denormalize the data and store a materialized view of the join. The advantage is that fetching this data will be much faster than having to build the join in your application every time you need it. The cost is that you now have multiple places storing the same data, and you will need to keep them all in sync. You can either update all your views when new data comes into the system, or have a periodic batch job that rebuilds them.
You might find this article useful: Do You Really Need SQL to Do It All in Cassandra? It's a bit old, but its principles still apply.

Related

What is the performance difference between stream.filter and CQL ALLOW FILTERING?

There isn't much data in my Cassandra table right now.
However, since it is a table where data continuously accumulates, I am concerned about performance.
First of all, please set aside the question of whether the table needs to be redesigned.
Think of it as a typical RDBMS date-based lookup (startDate ~ endDate).
Option 1: apply ALLOW FILTERING and force the query against Cassandra.
This will get you exactly the data you want.
Option 2: query all the data in Cassandra once (no WHERE clause).
After that, extract only the rows within the desired date range with the stream().filter() function.
Which method would you choose?
In general, which one has more performance issues?
Summary: I need to run about 6 such lookups, so the choice is:
Execute the ALLOW FILTERING query 6 times, with no stream filter; or
Execute a findAll query once, then run the stream filter 6 times.
The challenge with both options is that neither will scale. It may work with very small data sets, say less than 1000 partitions, but you will quickly find that neither will work once your tables grow.
Cassandra is designed for real-time OLTP workloads where you are retrieving a single partition for real-time applications.
For analytics workloads, you should instead use Spark with the spark-cassandra-connector because it optimises analytics queries. Cheers!
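To see why the stream().filter() option doesn't scale, here is a minimal sketch with hypothetical rows: however few rows the date range keeps, every row in the table has to be fetched and examined, and ALLOW FILTERING merely moves the same full scan server-side:

```python
from datetime import date

# Hypothetical rows standing in for the full contents of a table.
rows = [{"id": i, "created": date(2023, 1, i)} for i in range(1, 31)]

def scan_and_filter(all_rows, start, end):
    """Client-side stream().filter() analogue: every row is examined."""
    examined = 0
    kept = []
    for r in all_rows:
        examined += 1
        if start <= r["created"] <= end:
            kept.append(r)
    return kept, examined

kept, examined = scan_and_filter(rows, date(2023, 1, 8), date(2023, 1, 14))
print(len(kept), examined)  # 7 30 -- kept 7 rows but touched all 30
```

With 30 rows this is harmless; with millions of partitions the `examined` count (and the network transfer behind it) is what makes both options fail.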

Best way of querying table without providing the primary key

I am designing the data model of our Scylla database. For example, I created a table, intraday_history, with fields:
CREATE TABLE intraday_history (
    id bigint,
    timestamp_seconds bigint,
    timestamp timestamp,
    sec_code text,
    open float, high float, low float, close float,
    volume float,
    trade int,
    PRIMARY KEY ((id, sec_code), timestamp_seconds, timestamp)
);
My id is a twitter_snowflake-generated 64-bit integer. My problem is how I can use WHERE without always providing the id (most of the time I will query by the bigint timestamp). I also encounter this problem in other tables. Because the id is unique, I cannot query a batch of timestamps.
Would it be okay if, say, for a bunch of tables on my single node, I used an ID like cluster1, so that when I query I can just use id=cluster1? But that loses the uniqueness feature.
ALLOW FILTERING comes up as an option here, but I keep reading that it is bad practice, especially when dealing with millions of queries.
I'm using ScyllaDB, a C++ implementation compatible with Apache Cassandra.
In Cassandra, as you've probably already read, the queries drive the table design, not the other way around. So your situation, where you want to query by a different filter, would ideally entail creating another Cassandra table. That's the optimal approach.

Partition keys are required in filters unless you provide the ALLOW FILTERING "switch", but that isn't recommended: it performs a DC-wide (possibly cluster-wide) search, and you're still subject to timeouts.

You could consider secondary indexes or materialized views, which are essentially Cassandra-maintained tables populated by the base table's changes. That saves you the trouble of having the application populate multiple tables (Cassandra does it for you). We've had some luck with materialized views, but either of these components can have side effects like any other Cassandra table (inconsistencies, latencies, additional rules, etc.).

I would say do a bit of research to determine the best approach, but most likely ALLOW FILTERING isn't the best choice, especially for high-volume, frequent queries or tables containing a lot of data. You could also investigate Solr if that's an option, depending on what you're filtering.
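The "another table per query" advice can be sketched with in-memory stand-ins. The dicts below play the role of two Cassandra tables, and the day-bucket scheme plus the `history_by_day` name are invented for illustration:

```python
from collections import defaultdict

# Two in-memory stand-ins for Cassandra tables: the base table keyed by
# (id, sec_code, timestamp_seconds), and a second query table keyed by a
# day bucket, so lookups by timestamp need no ALLOW FILTERING.
intraday_history = {}
history_by_day = defaultdict(list)

def insert_tick(row):
    """Dual write: the application inserts into both tables."""
    intraday_history[(row["id"], row["sec_code"], row["timestamp_seconds"])] = row
    day_bucket = row["timestamp_seconds"] // 86400  # one partition per day
    history_by_day[day_bucket].append(row)

insert_tick({"id": 1, "sec_code": "AAPL", "timestamp_seconds": 86401, "close": 10.0})
insert_tick({"id": 2, "sec_code": "MSFT", "timestamp_seconds": 86500, "close": 20.0})

# Query by time without knowing any id: read one day-bucket partition.
print([r["sec_code"] for r in history_by_day[1]])  # ['AAPL', 'MSFT']
```

In a real schema the second table would make the day bucket its partition key and `timestamp_seconds` a clustering column, so a date-range query becomes a handful of single-partition reads.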
Hope that helps.
-Jim

How to use join in cassandra database tables like mysql

I am new to node.js.
I use the Express framework, with Cassandra as the database.
My question: is it possible to use joins across multiple related tables, like in MySQL?
For e.g.,
In mysql
select * from `table1`
left join `table2` on table1.id=table2.id
Short: No.
There are no joins in Cassandra (or other NoSQL databases), which is different from relational databases.
In Cassandra the standard way is to denormalize data, and maybe store multiple copies if necessary. As a rule of thumb, think query first: store your data in the way you need to query it later.
Cassandra will perform very well if your data is evenly spread across your cluster and the everyday queries hit only one or a few primary keys (to be exact, partition keys).
Have a look at: http://www.datastax.com/dev/blog/basic-rules-of-cassandra-data-modeling
And there are trainings from DataStax: https://academy.datastax.com/resources/ds220-data-modeling (and others too).
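The "think query first" advice can be sketched like this: instead of joining table1 and table2 at read time, store the already-merged row at write time. Plain dicts stand in for a denormalized Cassandra table, and the column names are invented:

```python
# The dict below stands in for one denormalized table that holds the
# pre-joined result of table1 LEFT JOIN table2 from the question.
joined_by_id = {}

def upsert_table1(id_, t1_cols):
    """On a write to table1, merge its columns into the joined row."""
    joined_by_id.setdefault(id_, {}).update(t1_cols)

def upsert_table2(id_, t2_cols):
    """On a write to table2, merge its columns into the same row."""
    joined_by_id.setdefault(id_, {}).update(t2_cols)

upsert_table1(1, {"name": "alice"})
upsert_table2(1, {"order_total": 42})

# The "join" is now a single-partition read by id:
print(joined_by_id[1])  # {'name': 'alice', 'order_total': 42}
```

The trade-off is exactly the one described above: reads become a single lookup, but every writer must update the denormalized copy.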
No. Joins are not possible in Cassandra (or most other NoSQL databases).
You should design your tables in Cassandra based on your query requirements.
When using a NoSQL system, it's recommended to denormalize your data.
Basic Rules of Cassandra Data Modeling
Basic googling returns this blog post from the company behind Cassandra: https://www.datastax.com/2015/03/how-to-do-joins-in-apache-cassandra-and-datastax-enterprise
In short, it says that joins are possible via Spark or ODBC connections.
It comes down to the performance characteristics of such joins, especially compared to making a "join" by hand, i.e. a lookup query on one table for every (relevant) row of the other. Any ideas?

When to use Materialized Views?

I'm learning Cassandra now and I understand I should make a table for each query. I'm not sure when I should make separate tables or materialized views. For example, I have the following queries for users and posts:
users_by_id
users_by_email
users_by_session_key
posts_by_id
posts_by_category
posts_by_user
Should I always use materialized views?
It seems to me that if I want to keep the Posts or Users consistent across queries, then I have to use materialized views. However, I've read that materialized views incur read-before-write latency.
On the other hand, if I use different tables, am I supposed to make 3 inserts every time a new post is created? I noticed that I get the error batch with conditions cannot span multiple tables, which means I have to insert one at a time into each separate table; that can cause consistency problems if one of the inserts fails. (A batch statement would fail all 3 if one of them failed.)
So, since consistency matters, it seems I will always want to use materialized views and take the read-before-write penalty.
I guess my other question is: when would it ever be okay for data to be inconsistent?
So I'm hoping someone can provide more clarity on how to handle multiple queries in Cassandra for a model like Users or Posts. Should I be using materialized views? If I use 3 different tables for each model, how do I keep them consistent? Just hope that all 3 inserts don't fail? That doesn't seem right.
Read my deep dive blog post for all the trade-offs when using materialized views. Once you understand the trade-offs, choose wisely: http://www.doanduyhai.com/blog/?p=1930
No, you shouldn't always use materialized views. The ideal solution is an interface layer in your application that handles all the different tables itself. But there are also some use cases for materialized views: if you don't have the time to build that layer but you need the feature, use materialized views. You take a performance trade-off, but in that scenario time matters more. If you also need real updates instead of upserts on all tables, use materialized views.
Batch is useful for buffering, or for grouping writes with the same partition key. For example: you have a high-throughput application. Between heartbeats, or while another QUORUM query is executing, you receive 10 other events with the same partition key. You don't execute them yet because you're waiting for a successful response. When success comes back, you execute them as one batch query. But keep in mind: only use a batch for the same partition key.
Generally, remember one important thing: Cassandra has an eventual consistency model. That means: if you use QUORUM, you will usually have consistency, but not always. If your application needs full consistency, not just eventual consistency, use another solution, e.g. SQL with sharding. Cassandra is optimized for writes, and you will only be happy when you're using Cassandra's features.
Some performance tips:
If you need better consistency, use QUORUM; never use ALL. And, generally, write your queries standalone; sometimes batch is useful. Don't execute queries with ALLOW FILTERING. Don't use token ranges or the IN operator on partition keys :)
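The same-partition batching pattern described above can be sketched as follows. The in-memory buffers stand in for the driver; a real `flush()` would execute a single-partition batch statement:

```python
from collections import defaultdict

# Queue events per partition key and flush each group as one batch,
# so a batch never spans partitions.
buffers = defaultdict(list)
flushed = []  # records what a real client would send as batches

def enqueue(partition_key, event):
    """Buffer an event while an earlier query is still in flight."""
    buffers[partition_key].append(event)

def flush(partition_key):
    """Send one single-partition batch; here we just record it."""
    batch = buffers.pop(partition_key, [])
    if batch:
        flushed.append((partition_key, batch))
    return batch

enqueue("user:1", {"seq": 1})
enqueue("user:1", {"seq": 2})
enqueue("user:2", {"seq": 1})

print(len(flush("user:1")))  # 2 -- both user:1 events go in one batch
```

Note that `user:2`'s event stays buffered until its own partition is flushed; nothing ever mixes partitions in one batch.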

Cassandra bulk insert operation, internally

I am looking for Cassandra/CQL's cousin of the common SQL idiom of INSERT INTO ... SELECT ... FROM ... and have been unable to find anything to do such an operation programmatically or in CQL. Is it just not supported?
My use case is to do a reasonably bulky copy from one table to another. I don't need any particular concurrent guarantees, but it's a lot of data so I'd like to avoid the additional network overhead of writing a client that retrieves data from one table, then issues batches of inserts into the other table. I understand that the changes will still need to be transported between nodes of the Cassandra cluster according to the replication set-up, but it seems reasonable for there to be an "internal" option to do a bulk operation from one table to another. Is there such a thing in CQL or elsewhere? I'm currently using Hector to talk to Cassandra.
Edit: it looks like sstableloader might be relevant, but is awfully low-level for something that I'd expect to be a fairly common use case. Taking just a subset of rows from one table to another also seems less than trivial in that framework.
Correct, this is not supported natively. (Another alternative would be a map/reduce job.) Cassandra's API focuses on short requests for applications at scale, not batch or analytical queries.
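Lacking INSERT INTO ... SELECT, the usual client-side fallback is to page through the source table and re-insert each row. A sketch, with lists standing in for tables (a real client would use the driver's result paging):

```python
# Copy rows from one "table" to another in pages, optionally keeping
# only a subset. The lists stand in for Cassandra tables; a real
# client would iterate a paged result set and issue inserts.
source = [{"k": i, "v": i * i} for i in range(10)]
destination = []

def copy_table(src, dst, page_size=4):
    for start in range(0, len(src), page_size):
        page = src[start:start + page_size]      # one "page" of rows
        for row in page:
            if row["v"] % 2 == 0:                # optional subset filter
                dst.append(dict(row))

copy_table(source, destination)
print(len(destination))  # 5 -- only the rows with even v
```

The paging keeps memory bounded for large tables; the filter shows where "a subset of rows" from the edit would slot in.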
