I have close to six tables, each of them have from 20 to 60 columns in Cassandra. I am designing the schema for this database.
The requirement from the query is that all the columns must be queriable individually.
I know if the data has High-Cardinality using secondary indexes is not encouraged.
Materialized views will solve my purpose to an extent where I will be able to query on other columns as well.
My question is :
In this scenario, if each table has 30 to 50+ materialized views, is this an okay pattern to follow or is it going on a totally wrong track. Is it taking this functionality to its extreme. Maybe writes will start to become expensive on the system (I know they are written eventually and not with the immediate write to the actual table).
You definitely do not want 30 to 50 materialized views.
It sounds like the use case you're trying to satisfy is search, more so than a specific query.
If the queries that are going to be done on each column can be pre defined, then you can also go the denormalization route, trading flexibility of search for better performance and less operational overhead.
If you're interested in the search route, here's what I suggest you take a look at:
SASI Indexes (depending on Cassandra version you're using)
Elastic Search
Solr
DataStax Enterprise Search (disclaimer I work for DataStax)
Elassandra
Stratio
Those are just the ones I know off the top of my head. There may be others (Sorry if I missed you). I provided links to each so you can make your own informed decision as to which makes more sense for your use case.
Related
I've been evaluating Cassandra to replace MySQL in our microservices environment, due to MySQL being the only portion of the infrastructure that is not distributed. Our needs are both write and read intensive as it's a platform for exchanging raw data. A type of "bus" for lack of better description. Our selects are fairly simple and should remain that way, but I'm already struggling to get past some basic filtering due to the extreme limitations of select queries.
For example, if I need to filter data it has to be in the key. At that point I can't change data in the fields because they're part of the key. I can use a SASI index but then I hit a wall if I need to filter by more than one field. The hope was that materialized views would help with this but in another post I was told to avoid them, due to some instability and problematic behavior.
It would seem that Cassandra is good at storage but realistically, not good as a standalone database platform for non-trivial applications beyond very basic filtering (i.e. a single field.) I'm guessing I'll have to accept the use of another front-end like Elastic, Solr, etc. The other option might be to accept the idea of filtering data within application logic, which is do-able, as long as the data sets coming back remain small enough.
Apache Cassandra is far more than just a storage engine. Its design is a distributed database oriented towards providing high availability and partition tolerance which can limit query capability if you want good and reliable performance.
It has a query engine, CQL, which is quite powerful, but it is limited in a way to guide user to make effective queries. In order to use it effectively you need to model your tables around your queries.
More often than not, you need to query your data in multiple ways, so users will often denormalize their data into multiple tables. Materialized views aim to make that user experience better, but it has had its share of bugs and limitations as you indicated. At this point if you consider using them you should be aware of their limitations, although that is generally good idea for evaluating anything.
If you need advanced querying capabilities or do not have an ahead of time knowledge of what the queries will be, Cassandra may not be a good fit. You can build these capabilities using products like Spark and Solr on top of Cassandra (such as what DataStax Enterprise does), but it may be difficult to achieve using Cassandra alone.
On the other hand there are many use cases where Cassandra is a great fit, such as messaging, personalization, sensor data, and so on.
We're investigating options to store and read a lot of immutable data (events) and I'd like some feedback on whether Cassandra would be a good fit.
Requirements:
We need to store about 10 events per seconds (but the rate will increase). Each event is small, about 1 Kb.
A really important requirement is that we need to be able to replay all events in order. For us it would be fine to read all data in insertion order (like a table scan) so an explicit sort might not be necessary.
Querying the data in any other way is not a prime concern and since Cassandra is a schema db I don't suppose it's possible when the events come in many different forms? Would Cassandra be a good fit for this? If so is there something one should be aware of?
I've had the exact same requirements for a "project" (rather a tool) a year ago, and I used Cassandra and I didn't regret. In general it fits very well. You can fit quite a lot of data in a Cassandra cluster and the performance is impressive (although you might need tweaking) and the natural ordering is a nice thing to have.
Rather than expressing the benefits of using it, I'll rather concentrate on possible pitfalls you might not consider before starting.
You have to think about your schema. The data is naturally ordered within one row by the clustering key, in your case it will be the timestamp. However, you cannot order data between different rows. They might be ordered after the query, but it is not guaranteed in any way so don't think about it. There was some kind of way to write a query before 2.1 I believe (using order by and disabling paging and allowing filtering) but that introduced bad performance and I don't think it is even possible now. So you should order data between rows on your querying side.
This might be an issue if you have multiple variable types (such as temperature and pressure) that have to be replayed at the same time, and you put them in different rows. You have to get those rows with different variable types, then do your resorting on the querying side. Another way to do it is to put all variable types in one row, but than filtering for only a subset is an issue to solve.
Rowlength is limited to 2 billion elements, and although that seems a lot, it really is not unreachable with time series data. Especially because you don't want to get near those two billions, keep it lower in hundreds of millions maximum. If you put some parameter on which you will split the rows (some increasing index or rounding by day/month/year) you will have to implement that in your query logic as well.
Experiment with your queries first on a dummy example. You cannot arbitrarily use <, > or = in queries. There are specific rules in SQL with filtering, or using the WHERE clause..
All in all these things might seem important, but they are really not too much of a hassle when you get to know Cassandra a bit. I'm underlining them just to give you a heads up. If something is not logical at first just fall back to understanding why it is like that and the whole theory about data distribution and the ring topology.
Don't expect too much from the collections within the columns, their length is limited to ~65000 elements.
Don't fall into the misconception that batched statements are faster (this one is a classic :) )
Based on the requirements you expressed, Cassandra could be a good fit as it's a write-optimized data store. Timeseries are quite a common pattern and you can define a clustering order, for example, on the timestamp of the events in order to retrieve all the events in time order. I've found this article on Datastax Academy very useful when wanted to learn about time series.
Variable data structure it's not a problem: you can store the data in a BLOB, then parse it internally from your application (i.e. store it as JSON and read it in your model), or you could even store the data in a map, although collections in Cassandra have some caveats that it's good to be aware of. Here you can find docs about collections in Cassandra 2.0/2.1.
Cassandra is quite different from a SQL database, and although CQL has some similarities there are fundamental differences in usage patterns. It's very important to know how Cassandra works and how to model your data in order to pursue efficiency - a great article from Datastax explains the basics of data modelling.
In a nutshell: Cassandra may be a good fit for you, but before using it take some time to understand its internals as it could be a bad beast if you use it poorly.
I'm trying to migrate our postgres database containing millions of clicks (few years click history) to more performing system. Our current analytic queries, which are running on postgres are taking forever to complete and it degrades performance of the whole database. I've been investigating possible solutions and I've decided to closely investigate 2 options:
HBase with Hadoop (mapreduce)
Cassandra with Spark
I was working with NoSQL before, however never used it for analytical purposes. At first I was a bit disapointed how little analytical query options those databases provide (missing groupBy, count, ...). After reading many articles and presentations I've found out, that I need to design my schema according how I intend to read my data and that storage layer is separated from query layer. Which adds more redundant data, however in the world of NoSQL this is not an issue.
Eventually I've found one nice grails plugin cassandra-orm, which internally encapsulates orderBy feature in cassandra counters counters. However I'm still worried about howto make this design extendable. What about the queries, that will come in the future, which I have no clue about today, how can I design my schema prepared for that ?
One option would be to use Spark, but Spark doesn't provide data in real time.
Could you give me some insight or advice what are the best possible options for bigdata analysis. Should I use combination of real time queries vs. pre-aggregated ones?
Thanks,
If you are looking at near real time data analysis, Spark + HBase combination is one of the solutions.
If you want to compromise on throughput, Solr + Cassandra combination from Datastax can be used.
I am using Solr + Cassandra from Datastax for my use case, which does not require real time processing. The performance of search option is not that great with this combo but I am OK with the throughput.
Spark+HBase combination seems to be promising. Depending on your business requirement & expertise, you can chose the right combination.
If you want the ability to analyse data in near-real-time with complete flexibility in query structure, I think your best bet would be to throw a scalable indexing engine such as Elasticsearch or Solr into your polyglot persistence mix. You could still use Cassanra as the primary data store and then index those fields you're interested in querying and/or aggregating.
Have a look at Datastax Enterprise which bundles together Cassandra and Solr. Also have a look at Solr's Stats component and its faceting capabilities. These, combined with the indexing engine's rich query language, are handy for implementing many analytics use cases.
If your data set consists of a few million records 'only', I think you'll be able to get some good response times from Solr or ES on a reasonably spec'ed cluster.
we trying to build a data-ware house for our transaction system.
- We make 5000 -6000 transaction per day, they can go > 20,000.
- Each transaction produce a file, size (> 4MB)
we want to have a system, which can make updates to the existing data, consistent and availability, and have good read performance. Infrastructure is not any issue.
Hbase or cassandra or any other ? your help and guidance is highly appreciated.
Many thanks!
Most of newer nosql platform can do what you need in terms of performance - both hbase and cassandra scales horizontally (also Aerospike and others) so performances can be guaranteed if the data-model respect the "product-patterns" for data distribution.
I would not choose the technology in terms of performances.
What I would do is:
a list of different features offered by a bunch of products and then consider the one that, out of the box, best fit my needs
a list of operation I need to do on data and check if I am not going "against" some specific product
While 1 is easily done the 2 need a deep product analysis. For instance you say you need to update existing data -- let's imagine you choose Cassandra and you update very very frequently a column on which you put a secondary index (that, under the hood, creates a lookup table) for searching purpose. Any time you perform an update on this column on the lookup table a deletion and insertion is performed. You can read in this article that performing many deletes in Cassandra is considered an anti-pattern and can lead to problematic situations. This is just an example I did on Cassandra because is the one I know best among nosql products and not to tell you avoid Cassandra.
Currently our system uses PostgreSQL, however we seem to have pushed the limit of its capabilities. Some of our tables need to handle over 100 read/write operations per second so it is probably time to scale horizontally across multiple machines.
Have a lot of experience using GAE's Big Table. Big Table had rich options for querying. For example, queries were possible against list data fields. Cassandra is supposed to be based off of Big Table, but if I understand correctly, for Cassandra, we will actually have to custom-code a layer on top of Cassandra that uses and maintains index tables.
Would be great if there was an open source database available for which we did not have to build our own custom logic for maintaining index tables, zig-zag merge joins, etc...
Is Cassandra a good candidate here? Or are there ones that might be considered better?
Unless the operations are huge joins or return hundreds of thousands of rows, any database you choose will be able to sustain 100 ops/s. Cassandra will have no problems serving thousands if not tens of thousands of reads and writes per node.
Without knowing more about your particular use case it's impossible to give you meaningful advice. Cassandra is a great database, but if it's right for you I don't know. I'd suggest looking through the cassandra tag here on Stack Overflow and look at what people ask about and if it looks at all like what you're trying to do, and if the answers say that it's possible with Cassandra (I know I've answered quite a few questions where the answer was that Cassandra wasn't the best choice for that particular case).
Cassandra and GAE Big Table have big similarities, but also big differences. One thing that trips up new Cassandra users is that there really isn't any way of doing things like "add this thing only unless that other thing was there" or "add an item and remove all but the last N items".