How to use join in cassandra database tables like mysql - node.js

I am new to Node.js.
I use the Express framework with Cassandra as the database.
My question: is it possible to join multiple related tables in Cassandra, the way you can in MySQL?
For example, in MySQL:
select * from `table1`
left join `table2` on table1.id=table2.id

Short answer: no.
Cassandra has no joins, and neither do most other NoSQL databases; this is a key difference from relational databases.
The standard approach in Cassandra is to denormalize your data and store multiple copies where necessary. As a rule of thumb, think query first: store your data in the shape in which you will need to query it later.
Cassandra performs very well when your data is evenly spread across the cluster and your everyday queries hit only one or a few primary keys (to be exact, partition keys).
Have a look at: http://www.datastax.com/dev/blog/basic-rules-of-cassandra-data-modeling
DataStax also offers training: https://academy.datastax.com/resources/ds220-data-modeling (among others).

No. Joins are not possible in Cassandra (or in most other NoSQL databases).
You should design your tables in Cassandra based on your query requirements.
When using a NoSQL system, it is generally recommended to denormalize your data.
Basic Rules of Cassandra Data Modeling

Basic googling returns this blog post from the company behind Cassandra: https://www.datastax.com/2015/03/how-to-do-joins-in-apache-cassandra-and-datastax-enterprise
In short, it says that joins are possible via Spark or ODBC connections.
The open question is how such joins perform, especially compared with doing a "join" by hand, i.e. running a lookup query on one table for every relevant row of the other. Any ideas?

Related

Best way of querying table without providing the primary key

I am designing the data model of our Scylla database. For example, I created a table, intraday_history, with fields:
CREATE TABLE intraday_history (
    id bigint,
    timestamp_seconds bigint,
    timestamp timestamp,
    sec_code text,
    open float,
    high float,
    low float,
    close float,
    volume float,
    trade int,
    PRIMARY KEY ((id, sec_code), timestamp_seconds, timestamp)
);
My id is a 64-bit integer generated in the Twitter Snowflake style. My problem is how to use WHERE without always providing the id (most of the time I will query by the bigint timestamp). I have the same problem in other tables: because the id is unique, I cannot query a range of timestamps.
Would it be okay to use a shared id such as cluster1 for a group of tables on my single node, so that I can simply query with id=cluster1? That would lose the uniqueness, though.
ALLOW FILTERING is an option here, but I keep reading that it is bad practice, especially when dealing with millions of queries.
I'm using ScyllaDB, a Cassandra-compatible database written in C++.
In Cassandra, as you've probably already read, the queries drive the table design, not the other way around. So if you want to query by a different filter, the ideal solution is to create another Cassandra table for that query. The partition key is required in the WHERE clause unless you add ALLOW FILTERING, but that isn't recommended: it performs a datacenter-wide (possibly cluster-wide) scan, and you are still subject to timeouts. You could consider secondary indexes or materialized views; the latter are basically Cassandra-maintained tables populated from the base table's changes. That would save you the trouble of having the application populate multiple tables (Cassandra would do it for you). We've had some luck with materialized views, but either feature can have side effects, as with any other Cassandra table (inconsistencies, latency, additional rules, etc.). I would do a bit of research to determine the best approach, but ALLOW FILTERING is most likely not the best choice, especially for frequent, high-volume queries or for tables holding a lot of data. You could also investigate Solr if that's an option, depending on what you're filtering.
Hope that helps.
-Jim
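One way to picture the "another table for that query" advice is a table partitioned by something the query always knows. The following is a minimal in-memory sketch, not ScyllaDB code: it assumes the common query is "all rows for one sec_code in a time range" and uses a hypothetical partition key of (sec_code, UTC day). The class name, the day-sized bucket and the reduced column set are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone


def day_bucket(ts_seconds: int) -> str:
    """Return the UTC day (YYYY-MM-DD) that a Unix timestamp falls into."""
    return datetime.fromtimestamp(ts_seconds, tz=timezone.utc).strftime("%Y-%m-%d")


class IntradayBySecurityAndDay:
    """In-memory stand-in for a table partitioned by (sec_code, day).

    Hypothetical design: the partition key is derived from values the
    query already has, so no unique id is needed in the WHERE clause.
    """

    def __init__(self):
        # (sec_code, day) -> list of (timestamp_seconds, close), kept sorted
        self._partitions = defaultdict(list)

    def insert(self, sec_code: str, ts_seconds: int, close: float) -> None:
        rows = self._partitions[(sec_code, day_bucket(ts_seconds))]
        rows.append((ts_seconds, close))
        rows.sort()  # mimics ordering by a clustering column

    def query(self, sec_code: str, start_ts: int, end_ts: int):
        """Read only the day partitions that overlap [start_ts, end_ts]."""
        results = []
        day = datetime.fromtimestamp(start_ts, tz=timezone.utc).date()
        last = datetime.fromtimestamp(end_ts, tz=timezone.utc).date()
        while day <= last:
            for ts, close in self._partitions.get((sec_code, day.isoformat()), []):
                if start_ts <= ts <= end_ts:
                    results.append((ts, close))
            day += timedelta(days=1)
        return results


if __name__ == "__main__":
    table = IntradayBySecurityAndDay()
    table.insert("ABC", 1700000000, 10.0)
    table.insert("ABC", 1700007200, 11.0)  # two hours later, next UTC day
    table.insert("XYZ", 1700000000, 99.0)
    print(table.query("ABC", 1700000000, 1700007200))
```

The trade-off in this sketch is that a range query touches one partition per day in the range, so the bucket size would have to be chosen to match how wide typical queries are.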

Dynamic Cassandra queries

I have a messenger application with a history page, on which you can see your sent and received messages.
Since the number of messages has degraded performance, I have been thinking about using Cassandra.
After researching Cassandra, I found out that you have to build tables to satisfy your queries.
Now the problem: on the history page you can use any number of different filters at the same time, e.g. filter by date, receiver and sender.
If I were to use Cassandra, would I need to create a table for every combination of these filters?
Or is this a bad use case for Cassandra in general?
If so, are there any alternatives?
Why don't you just make a SELECT statement?
You should definitely have a look at CQL (Cassandra Query Language).
While CQL and SQL share a similar syntax, the queries you can write are quite different.
The reason for these differences is that Cassandra deals with distributed data and aims to prevent inefficient queries.
See this link for reference. It shows queries you can or cannot do.

How to transfer mysql data to cassandra database?

I am running a MySQL server in production which currently holds 200 GB of data. It is becoming very difficult to manage because the data is growing exponentially. I have heard a lot about Cassandra and did a proof of concept with it. Cassandra provides high availability and eventually consistent data, which is a good fit for our requirements. Now the problem is how to transfer all the MySQL data to a Cassandra database.
MySQL is a relational database and Cassandra is NoSQL. How do I map a MySQL table and its related tables to Cassandra tables?
I believe you are asking the wrong question. There is no rule for transitioning from a relational model to Cassandra.
The first question is: what are your requirements in terms of performance, availability, data volume and growth, and, most important of all, query capabilities? Do you need ACID? Can you change the application code that accesses the database to fit a more denormalized Cassandra model?
The answers to these questions will tell you whether Cassandra is compatible with your use case.
As a rule of thumb:
If you use MySQL with a lot of indices and usually perform joins in your queries, then the Cassandra data model and the application code that uses the database will require a lot of work, or Cassandra may not even be the right choice. Likewise, if you really need ACID, you may have a problem with Cassandra's consistency model.
If your SQL data model is fully denormalized and you perform queries without joins, then you can just replicate your table schemas as Cassandra column families and you're done, even if this may not be optimal.
Your use case is probably somewhere in between, so you really need to understand how to model your data in Cassandra. You have to gain this understanding and perform this analysis yourself, because you know your domain and we don't. However, don't hesitate to share details about your model and how you need to query your data so that you can be advised.
200 GB is small for Cassandra, and you may find that your data takes much less space in Cassandra than in MySQL, even when heavily denormalized, because Cassandra is pretty efficient.
You can migrate data from MySQL to Cassandra using Spark.
Spark has connectors for both MySQL and Cassandra. First create the data model in Cassandra according to your requirements. Then pull the data from MySQL, apply the necessary transformations, and push the result directly into Cassandra.
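The pull, transform, push flow can be illustrated without a cluster. The sketch below is not Spark code: it uses Python's built-in sqlite3 module as a stand-in for MySQL and a plain dict as a stand-in for a Cassandra table. The tables (customers, orders), the sample rows and the target name orders_by_customer are all hypothetical; in a real migration the join would run in Spark or MySQL and the writes would go through a Cassandra connector or driver.

```python
import sqlite3
from collections import defaultdict


def load_source() -> sqlite3.Connection:
    """Build a tiny relational source (stand-in for the MySQL database)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, item TEXT);
        INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob');
        INSERT INTO orders VALUES (10, 1, 'keyboard'), (11, 1, 'mouse'), (12, 2, 'monitor');
        """
    )
    return conn


def denormalize_orders(conn: sqlite3.Connection) -> dict:
    """Join once at migration time and emit rows keyed by the query's partition key.

    The result models a hypothetical orders_by_customer table: the customer
    name is copied into every order row so that reads need no join.
    """
    orders_by_customer = defaultdict(list)
    rows = conn.execute(
        "SELECT c.id, c.name, o.id, o.item "
        "FROM orders o JOIN customers c ON c.id = o.customer_id "
        "ORDER BY o.id"
    )
    for customer_id, customer_name, order_id, item in rows:
        orders_by_customer[customer_id].append(
            {"order_id": order_id, "customer_name": customer_name, "item": item}
        )
    return dict(orders_by_customer)


if __name__ == "__main__":
    target = denormalize_orders(load_source())
    for customer_id, order_rows in target.items():
        print(customer_id, order_rows)
```

The point of the sketch is the shape of the transformation: the join is paid once during migration, and the target rows are organized around the query they will serve.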
Transferring relational data directly to Cassandra isn't possible; you have to denormalize it. Be warned, however, that some queries, and some ways of denormalizing for them, are anti-patterns. Go through these free courses first:
http://academy.datastax.com/courses/ds201-cassandra-core-concepts
https://academy.datastax.com/courses/ds220-data-modeling
If you get the Cassandra data model for your relational data wrong, you won't get the nice features Cassandra provides. For example, you won't get horizontal scalability (you might have hot spots in your cluster) or high availability (some queries might need every node to build a response).

Choosing a NoSQL database

I need a NoSQL database that will run on Windows Azure and works well for the following parameters. Right now Azure Table Storage, HBase and Cassandra seem to be the most promising options.
1 billion entities
up to 100 reads per second, though caching will mostly make it much less
around 10 - 50 writes per second
Strong consistency would be a plus, so perhaps HBase would be better than Cassandra in that regard.
Querying will often be done on a secondary in-memory database with various indexes in addition to ElasticSearch or Windows Azure Search for fulltext search and perhaps some filtering.
Azure Table Storage looks like it could be nice, but from what I can tell, the big difference between Azure Table Storage and HBase is that HBase supports updating and reading the value of a single property instead of the whole entity at once. I guess HBase must have some disadvantages, but I'm not sure what they would be in this case.
I also think crate.io looks interesting, but I wonder whether there might be unforeseen problems.
Does anyone have other thoughts on the advantages and disadvantages of these databases in this case, and on whether any of them is really unsuited for some reason?
I currently work with Cassandra and can help with a few pros and cons.
Requirements
Cassandra can easily handle those 3 requirements. It was designed to have fast reads and writes. In fact, Cassandra is blazing fast with writes, mostly because you can write without doing a read.
Also, Cassandra keeps some of its data in memory, so you could even avoid the secondary database.
Consistency
In Cassandra you choose the consistency level for each query, so you can have consistent data if you want it. Normally you use:
ONE - Only one replica has to acknowledge the read or write. This means fast reads and writes but weak consistency (another machine may still serve older data until the change has propagated).
QUORUM - A majority of the replicas (more than half) must acknowledge the read or write. Reads and writes are not as fast, but you get strong consistency if you use it for both reads and writes. That is because, once more than half of the replicas hold your data after an insert, update or delete, any read from more than half of the replicas includes at least one replica with the most recent data, and that is the version returned.
Both of these options are recommended because they avoid single points of failure. If all machines had to acknowledge, you would not be able to query whenever one node was down or busy.
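The overlap argument behind QUORUM can be checked with a little arithmetic. This is a minimal sketch, assuming quorum is counted over the replicas of a row (the replication factor) and that a read is guaranteed to see the latest write when the number of write acknowledgements plus the number of replicas read exceeds the replication factor. The function names are made up for illustration.

```python
def quorum(replication_factor: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1


def read_sees_latest_write(replication_factor: int, write_acks: int, read_acks: int) -> bool:
    """True when every possible read set must overlap every possible write set."""
    return write_acks + read_acks > replication_factor


if __name__ == "__main__":
    rf = 3
    q = quorum(rf)
    print("quorum for RF=3:", q)
    print("QUORUM write + QUORUM read overlap:", read_sees_latest_write(rf, q, q))
    print("ONE write + ONE read overlap:", read_sees_latest_write(rf, 1, 1))
```

With a replication factor of 3, a quorum is 2 replicas, so one replica can be down and both quorum reads and quorum writes still succeed.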
Pros
Cassandra is a strong choice for performance, linear scalability and avoiding single points of failure (if some machines are down, the others take over the work). It also does most of its management work automatically: you don't need to manage data distribution, replication, etc.
Cons
The downsides of Cassandra are in the modeling and queries.
With a relational database you model around the entities and the relationships between them. Normally you don't really care about what queries will be made and you work to normalize it.
With Cassandra the strategy is different: you model the tables to serve the queries. That is because you can't join and you can't filter the data any way you want (only by primary key).
So if you have a database for a company with grocery stores, and you want one query that returns all products of a certain store (e.g. New York City) and another that returns all products of a certain department (e.g. Computers), you would have two tables, "ProductsByStore" and "ProductsByDepartment", holding the same data but organized differently to serve each query.
Materialized views can help with this by avoiding the need to write to multiple tables yourself, but the example shows how differently things work with Cassandra.
Denormalization is also common in Cassandra for the same reason: performance.
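The two-table example can be modeled in a few lines. This is a minimal in-memory sketch rather than Cassandra code: each dict stands in for one table, and its key stands in for the partition key. The class and method names are hypothetical, and the sample products are invented.

```python
from collections import defaultdict


class ProductCatalog:
    """Stores every product twice, once per query it has to serve."""

    def __init__(self):
        self.products_by_store = defaultdict(list)       # key: store
        self.products_by_department = defaultdict(list)  # key: department

    def add_product(self, name: str, store: str, department: str) -> None:
        row = {"name": name, "store": store, "department": department}
        # One logical write becomes one physical write per query table.
        self.products_by_store[store].append(dict(row))
        self.products_by_department[department].append(dict(row))

    def in_store(self, store: str) -> list:
        # Single-key lookup: no join, no scan over other stores.
        return [row["name"] for row in self.products_by_store.get(store, [])]

    def in_department(self, department: str) -> list:
        return [row["name"] for row in self.products_by_department.get(department, [])]


if __name__ == "__main__":
    catalog = ProductCatalog()
    catalog.add_product("Laptop", "New York City", "Computers")
    catalog.add_product("Apples", "New York City", "Produce")
    catalog.add_product("Monitor", "Boston", "Computers")
    print(catalog.in_store("New York City"))
    print(catalog.in_department("Computers"))
```

The cost is visible in add_product: every write has to touch both structures, which is the synchronization work that materialized views take over.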

Inner Join in cassandra CQL

How do I write subqueries/nested queries in Cassandra? Is this facility provided in CQL?
Example I tried:
cqlsh:testdb> select itemname from item where itemid = (select itemid from orders where customerid=1);
It just throws the following error -
Bad Request: line 1:87 no viable alternative at input ';'
Because of its distributed nature, Cassandra has no support for RDBMS-style joins. You have a few options when you want something like a join.
One option is to perform separate queries and have your application join the data itself. This makes sense if the data is relatively small and you only have to perform a small number of queries. Based on the example you gave above, this would probably be a good solution for you.
For more complicated joins, the usual strategy is to denormalize the data and store a materialized view of the join. The advantage is that fetching this data will be much faster than having to build the join in your application every time you need it. The cost is that you now store the same data in multiple places and need to keep it all in sync. You can either update all your views when new data comes into the system, or have a periodic batch job that rebuilds them.
You might find this article useful: Do You Really Need SQL to Do It All in Cassandra? It's a bit old, but its principles still apply.
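The "separate queries, join in the application" option for the itemname example could look roughly like this. It is a minimal in-memory sketch: the two dicts stand in for the orders and item tables, and it assumes orders can be looked up by customerid and item by itemid. Against a real cluster, each dict lookup would be its own CQL query by key; the function name and sample rows are hypothetical.

```python
# Stand-ins for the two tables, keyed the way the lookups need them.
orders_by_customer = {
    1: [{"orderid": 100, "itemid": 7}, {"orderid": 101, "itemid": 9}],
    2: [{"orderid": 102, "itemid": 7}],
}
item_by_id = {
    7: {"itemname": "keyboard"},
    9: {"itemname": "mouse"},
}


def item_names_for_customer(customer_id, orders, items):
    """Emulate: select itemname from item where itemid in (orders of customer)."""
    names = []
    # Query 1: fetch the customer's orders.
    for order in orders.get(customer_id, []):
        # Query 2..n: one key lookup per order row.
        item = items.get(order["itemid"])
        if item is not None:  # inner-join semantics: skip unmatched orders
            names.append(item["itemname"])
    return names


if __name__ == "__main__":
    print(item_names_for_customer(1, orders_by_customer, item_by_id))
```

The number of lookups grows with the number of orders, which is why this approach suits small result sets and why larger joins are usually precomputed instead.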

Resources