Columnar/column-oriented database vs wide-column/column-family database (Cassandra)

I got really confused about Cassandra recently because most online material, even from AWS and Google, describes it as a columnar database. But it is actually a row-based, partitioned, column-family database. Now everything makes sense to me.
So are all the characteristics of columnar databases, such as:
being optimised for foreign joins
aggregating many rows over a few columns / single-column aggregation
scanning column by column, as opposed to a row-based database
still valid or true for Cassandra, AKA a wide-column/column-family database? If not, what are its real characteristics? Write performance?
In addition, I really need an example of a true columnar database to study (as opposed to Cassandra, i.e. one where data is stored in separate column blocks).
Can anyone help? This has troubled me for a long time.

Related

Best way of querying table without providing the primary key

I am designing the data model of our Scylla database. For example, I created a table, intraday_history, with fields:
CREATE TABLE intraday_history (
    id bigint,
    timestamp_seconds bigint,
    timestamp timestamp,
    sec_code text,
    open float,
    high float,
    low float,
    close float,
    volume float,
    trade int,
    PRIMARY KEY ((id, sec_code), timestamp_seconds, timestamp)
);
My id is a Twitter snowflake-generated 64-bit integer. My problem is how I can use WHERE without always providing the id (most of the time I will query by the bigint timestamp). I also encounter this problem in other tables. Because the id is unique, I cannot query a batch of timestamps.
Is it okay if, let's say for a bunch of tables on my 1 node, I use an ID like cluster1, so that when I query I can just use id=cluster1? But then I lose the uniqueness feature.
ALLOW FILTERING comes up as an option here. But I keep reading that it is bad practice, especially when dealing with millions of queries.
I'm using ScyllaDB, a C++ implementation compatible with Apache Cassandra.
In Cassandra, as you've probably already read, the tables are derived from the queries, not the other way around. So in your situation, where you want to query by a different filter, you would ideally create another Cassandra table; that's the optimal approach. Partition keys are required in filters unless you provide the ALLOW FILTERING "switch", but that isn't recommended, as it will perform a datacenter-wide (possibly cluster-wide) search, and you're still subject to timeouts. You could also consider secondary indexes or materialized views, which are basically Cassandra-maintained tables populated from the base table's changes. That would save you the trouble of having the application populate multiple tables (Cassandra would do it for you). We've had some luck with materialized views, but either of these components can have the same side effects as any other Cassandra table (inconsistencies, latencies, additional rules, etc.). I would say do a bit of research to determine the best approach, but most likely ALLOW FILTERING isn't the best choice, especially for high-volume, frequent queries, or for tables containing a lot of data. You could also investigate Solr if that's an option, depending on what you're filtering.
Hope that helps.
-Jim
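The "create another table for the other query" approach the answer recommends can be sketched outside Cassandra with plain Python dicts. The table and field names follow the question's intraday_history schema, but the insert helper and the dual-write logic are illustrative stand-ins, not real driver code:

```python
from collections import defaultdict

# Base "table": partitioned by (id, sec_code), as in the question's schema.
intraday_by_id = {}

# Second "table" (what you would CREATE in Cassandra, or let a materialized
# view maintain): partitioned by sec_code, so you can query without the id.
intraday_by_sec = defaultdict(list)

def insert(row):
    # Application-side dual write: the base table alone cannot serve a
    # sec_code-only query without ALLOW FILTERING.
    intraday_by_id[(row["id"], row["sec_code"], row["timestamp_seconds"])] = row
    intraday_by_sec[row["sec_code"]].append(row)

insert({"id": 1, "sec_code": "AAPL", "timestamp_seconds": 1700000000, "close": 189.5})
insert({"id": 2, "sec_code": "AAPL", "timestamp_seconds": 1700000060, "close": 189.7})

# Query by sec_code only -- efficient, because it reads a single partition.
rows = intraday_by_sec["AAPL"]
```

A materialized view would do the second write for you; the trade-offs (consistency, latency) mentioned in the answer still apply.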

How to use join in cassandra database tables like mysql

I am new to node.js.
And use Express framework and use cassandra as a database.
My question is: is it possible to use joins across multiple related tables, as in MySQL?
For e.g.,
In mysql
select * from `table1`
left join `table2` on table1.id=table2.id
Short: No.
There are no joins in Cassandra (and many other NoSQL databases), which is a difference from relational databases.
In Cassandra the standard approach is to denormalize data, storing multiple copies if necessary. As a rule of thumb, think query-first: store your data the way you will need to query it later.
Cassandra will perform very well if your data is evenly spread across your cluster and your everyday queries hit only one or a few primary keys (to be exact, partition keys).
Have a look at: http://www.datastax.com/dev/blog/basic-rules-of-cassandra-data-modeling
And there are trainings from DataStax: https://academy.datastax.com/resources/ds220-data-modeling (and others too).
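The "query-first, store multiple copies" rule above can be sketched with plain Python dicts (illustrative only, not Cassandra driver code): each query pattern gets its own copy of the data, and the application writes to both.

```python
# One "table" per query pattern, keyed by what the query filters on.
users_by_id = {}       # serves: SELECT ... WHERE id = ?
users_by_email = {}    # serves: SELECT ... WHERE email = ?

def create_user(user):
    # The denormalized dual write replaces the relational JOIN / secondary
    # lookup: both copies hold the full row.
    users_by_id[user["id"]] = user
    users_by_email[user["email"]] = user

create_user({"id": 42, "email": "ada@example.com", "name": "Ada"})

# Either query pattern is now a single direct lookup.
name = users_by_email["ada@example.com"]["name"]
```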
No. Joins are not possible in Cassandra (or most other NoSQL databases).
You should design your tables in Cassandra based on your query requirements.
And when using a NoSQL system, it's recommended to denormalize your data.
Basic Rules of Cassandra Data Modeling
Basic googling returns this blog post from DataStax, the company behind Cassandra: https://www.datastax.com/2015/03/how-to-do-joins-in-apache-cassandra-and-datastax-enterprise
In short, it says that joins are possible via Spark or ODBC connections.
It then comes down to the performance characteristics of such joins, especially compared to making a "join" by hand, i.e. running a lookup query on one table for every (relevant) row of the other. Any ideas?
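The hand-rolled "join" alluded to above can be sketched like this, with dicts standing in for Cassandra tables (illustrative only; in real code each lookup would be a separate partition-key query against the second table):

```python
# Client-side left join: one lookup on table2 for every relevant row of table1.
table1 = [{"id": 1, "a": "x"}, {"id": 2, "a": "y"}, {"id": 3, "a": "z"}]
table2 = {1: {"b": 10}, 3: {"b": 30}}  # second table, keyed by id

def left_join(left_rows, right_by_id):
    joined = []
    for row in left_rows:
        match = right_by_id.get(row["id"])  # one query per left row
        joined.append({**row, **(match or {"b": None})})
    return joined

result = left_join(table1, table2)
```

Note the cost model: N left rows mean N round trips, which is exactly why the performance comparison against a Spark-side join matters.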

Big data solution for frequent queries

I need a big data storage solution for batch inserts of denormalized data which happen infrequently and queries on the inserted data which happen frequently.
I've gone through Cassandra and feel that it's not that good for batch inserts, but an OK solution for querying. Also, it would be good if there were a mechanism to segregate data based on a data attribute.
As you mentioned Cassandra, I will talk about it:
Can you insert in an unbatched way, or is batching imposed by the system? If you can insert unbatched, Cassandra will probably be able to handle it easily.
Batched inserts should also be manageable for Cassandra nodes, but batching won't distribute the load properly among all the nodes (note: I'm talking about load balancing, not data balance, which depends only on your partition key setup). If you are not very familiar with Cassandra, you could tell us your data structure and your query types, and we could suggest how to fit them into Cassandra's data model.
For the filtering part of the question, Cassandra has clustering keys and secondary indexes; a secondary index is basically another column configuration alongside the clustering key, so that you can query by both.
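The two mechanisms named above can be sketched with plain Python structures (names and layout are illustrative, not Cassandra internals): a partition holds rows sorted by clustering key, and a secondary index maps a column value back to partition keys.

```python
import bisect
from collections import defaultdict

partitions = defaultdict(list)   # partition_key -> sorted [(clustering_key, row)]
status_index = defaultdict(set)  # hypothetical secondary index on "status"

def insert(pk, ck, row):
    # insort keeps each partition ordered by clustering key, mimicking how
    # Cassandra stores rows within a partition.
    bisect.insort(partitions[pk], (ck, row))
    status_index[row["status"]].add(pk)

insert("sensor-1", 100, {"status": "ok", "v": 1.0})
insert("sensor-1", 50,  {"status": "bad", "v": 0.2})

# Efficient: scan one partition in clustering-key order.
ordered = [ck for ck, _ in partitions["sensor-1"]]
# Secondary-index lookup: which partitions contain a "bad" reading?
bad_parts = status_index["bad"]
```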

How to transfer mysql data to cassandra database?

I am running a MySQL server in production which currently holds 200 GB of data. It is becoming very difficult to manage because it is growing exponentially. I have heard a lot about Cassandra and did a POC on it. Cassandra provides high availability and eventually consistent data, and it is perfect for our requirements. Now the problem is how to transfer all the MySQL data to a Cassandra database.
Since MySQL is a relational database and Cassandra is NoSQL, how do I map MySQL tables and their related tables to Cassandra tables?
I believe you are asking the wrong question. There is no fixed rule for transitioning from a relational model to Cassandra.
The first question is the following: what are your requirements in terms of performance, availability, data volume and growth, and, most important of all, query patterns? Do you need ACID? Can you change the application code accessing the database to fit a more denormalized Cassandra model?
The answer to these questions will tell you whether Cassandra is compatible with your use case or not.
As a rule of thumb:
If you use MySQL with a lot of indexes and usually perform joins in your queries, then adapting to the Cassandra data model, and adapting the application code that uses the database, will require a lot of work; maybe Cassandra is not even the right choice. Likewise, if you really need ACID, you may have a problem with Cassandra's consistency model.
If your SQL data model is fully denormalized and you perform queries without joins, then you can just replicate your DB table schemas as Cassandra column families and you're done, even if this may not be optimal.
Your use case is probably somewhere in between, and you really need to understand how to model your data in Cassandra. You have to do this analysis yourself, because you know your domain and we don't. However, don't hesitate to give clues about your model and how you need to query your data, so you can be advised.
200 GB is low for Cassandra, and you may discover that your data takes much less space in Cassandra than in MySQL, even when widely denormalized, because Cassandra is pretty efficient.
You can migrate data from MySQL to Cassandra using Spark.
Spark has connectors for both MySQL and Cassandra. First create your model in Cassandra according to your requirements; then pull all the data from MySQL, apply the necessary transformations, and push the data directly into Cassandra.
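The "pull, transform, push" flow described above can be sketched with in-memory lists standing in for the MySQL source and the Cassandra target (real code would use Spark's JDBC source and the Cassandra connector; everything here is an illustrative stand-in):

```python
# Source rows as they might come out of two relational MySQL tables.
mysql_users = [{"id": 1, "name": "Ada"}]
mysql_orders = [{"user_id": 1, "item": "book"}, {"user_id": 1, "item": "pen"}]

def transform(users, orders):
    # The transformation step: denormalize the relational rows into one
    # query-shaped record per order, ready to push into a Cassandra table
    # partitioned by user.
    by_id = {u["id"]: u for u in users}
    out = []
    for o in orders:
        u = by_id[o["user_id"]]
        out.append({"user_id": u["id"], "name": u["name"], "item": o["item"]})
    return out

cassandra_orders_by_user = transform(mysql_users, mysql_orders)
```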
Transferring relational data directly to Cassandra isn't possible; you have to denormalize it. However, be warned that some queries, and some ways of denormalizing for them, are anti-patterns. Go through these free courses first:
http://academy.datastax.com/courses/ds201-cassandra-core-concepts
https://academy.datastax.com/courses/ds220-data-modeling
If you get the Cassandra data model design of your relational data wrong, you won't get the nice features Cassandra provides. For example, you may lose horizontal scalability (you might have hot-spots in your cluster) or high availability (it might happen that, for some queries, all of the nodes are needed to build the response).

Choosing a NoSQL database

I need a NoSQL database that will run on Windows Azure and works well for the following parameters. Right now Azure Table Storage, HBase and Cassandra seem to be the most promising options.
1 billion entities
up to 100 reads per second, though caching will mostly make it much less
around 10 - 50 writes per second
Strong consistency would be a plus, so perhaps HBase would be better than Cassandra in that regard.
Querying will often be done on a secondary in-memory database with various indexes in addition to ElasticSearch or Windows Azure Search for fulltext search and perhaps some filtering.
Azure Table Storage looks like it could be nice, but from what I can tell, the big difference between Azure Table Storage and HBase is that HBase supports updating and reading values for a single property instead of the whole entity at once. I guess there must be some disadvantages to HBase, but I'm not sure what they would be in this case.
I also think crate.io looks interesting, but I wonder if there might be unforeseen problems.
Anyone have any other ideas of the advantages and disadvantages of the different databases in this case, and if any of them are really unsuited for some reason?
I currently work with Cassandra and I might help with a few pros and cons.
Requirements
Cassandra can easily handle those 3 requirements. It was designed to have fast reads and writes. In fact, Cassandra is blazing fast with writes, mostly because you can write without doing a read.
Also, Cassandra keeps some of its data in memory, so you could even avoid the secondary database.
Consistency
In Cassandra you choose the consistency in each query you make, therefore you can have consistent data if you want to. Normally you use:
ONE - Only one node has to serve or accept the change. This means fast reads/writes but low consistency (another replica may deliver older information before consistency is achieved).
QUORUM - A majority (more than half) of your replicas must serve or accept the change. This means somewhat slower reads and writes, but you get FULL consistency IF you use it for BOTH reads and writes. That's because if more than half of your nodes have the data after you insert/update/delete, then when reading from more than half of your nodes, at least one node will have the most recent information, and that is what will be delivered.
Both of these options are recommended because they avoid single points of failure. If all machines had to accept every change, then with a single node down or busy you wouldn't be able to query at all.
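The overlap argument above boils down to simple arithmetic: with replication factor RF, a read from R replicas and a write to W replicas are guaranteed to intersect whenever R + W > RF. A small illustrative checker:

```python
def quorum(rf):
    # Majority of rf replicas: floor(rf / 2) + 1.
    return rf // 2 + 1

def overlaps(reads, writes, rf):
    # The read set and write set must share at least one replica.
    return reads + writes > rf

rf = 3
# QUORUM reads + QUORUM writes: 2 + 2 > 3, so some replica saw the
# latest write and is part of the read -- consistent.
q_ok = overlaps(quorum(rf), quorum(rf), rf)
# ONE + ONE: 1 + 1 > 3 is false, so stale reads are possible.
one_ok = overlaps(1, 1, rf)
```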
Pros
Cassandra is the solution for performance, linear scalability, and avoiding single points of failure (you can have machines down; the others will take over the work). And it does most of its management work automatically: you don't need to manage data distribution, replication, etc.
Cons
The downsides of Cassandra are in the modeling and queries.
With a relational database you model around the entities and the relationships between them. Normally you don't care much about what queries will be made, and you work to normalize the data.
With Cassandra the strategy is different. You model the tables to serve the queries. And that happens because you can't join and you can't filter the data any way you want (only by its primary key).
So if you have a database for a company with grocery stores and you want to make a query that returns all products of a certain store (Ex.: New York City), and another query to return all products of a certain department (Ex.: Computers), you would have two tables "ProductsByStore" and "ProductsByDepartment" with the same data, but organized differently to serve the query.
Materialized views can help with this, avoiding the need to write changes to multiple tables, but it shows how differently things work with Cassandra.
Denormalization is also common in Cassandra for the same reason: Performance.
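The grocery-store example above can be sketched with dicts standing in for the two tables (illustrative only): the same product rows are duplicated into "ProductsByStore" and "ProductsByDepartment", each organized for its query.

```python
from collections import defaultdict

products_by_store = defaultdict(list)       # "ProductsByStore"
products_by_department = defaultdict(list)  # "ProductsByDepartment"

def add_product(product):
    # The dual write is the denormalization (or a materialized view
    # would maintain the second copy for you).
    products_by_store[product["store"]].append(product)
    products_by_department[product["department"]].append(product)

add_product({"name": "laptop", "store": "New York City", "department": "Computers"})
add_product({"name": "mouse", "store": "Boston", "department": "Computers"})

# Each query reads exactly one partition of the table shaped for it.
nyc_products = products_by_store["New York City"]
computer_products = products_by_department["Computers"]
```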
