I understand that Cassandra is a NoSQL db and patching it with many indices is not the way to go, but here I'm looking at solution for my analytics cluster, not for the production/real-time one.
So I think it makes sense to add indices to reduce the amount of data filtered by Spark.
How do native Cassandra secondary indices compare to Lucene's indices?
Many functionalities are not available with Cassandra alone, but what about things that you can do with both?
Is it better / does it make sense to only use Lucene?
Another advantage that I see is that I can install Lucene only on my analytics cluster, without overloading the real-time one with indices (and therefore improving the write performance on that side).
Don't bother with Lucene integration
Since Cassandra 3.4, we have a new secondary index called SASI that offers full text search and is quite performant.
Read this: https://github.com/apache/cassandra/blob/trunk/doc/SASI.md
Related
I've been evaluating Cassandra to replace MySQL in our microservices environment, due to MySQL being the only portion of the infrastructure that is not distributed. Our needs are both write and read intensive as it's a platform for exchanging raw data. A type of "bus" for lack of better description. Our selects are fairly simple and should remain that way, but I'm already struggling to get past some basic filtering due to the extreme limitations of select queries.
For example, if I need to filter data it has to be in the key. At that point I can't change data in the fields because they're part of the key. I can use a SASI index but then I hit a wall if I need to filter by more than one field. The hope was that materialized views would help with this but in another post I was told to avoid them, due to some instability and problematic behavior.
It would seem that Cassandra is good at storage but realistically, not good as a standalone database platform for non-trivial applications beyond very basic filtering (i.e. a single field.) I'm guessing I'll have to accept the use of another front-end like Elastic, Solr, etc. The other option might be to accept the idea of filtering data within application logic, which is do-able, as long as the data sets coming back remain small enough.
Apache Cassandra is far more than just a storage engine. Its design is a distributed database oriented towards providing high availability and partition tolerance which can limit query capability if you want good and reliable performance.
It has a query engine, CQL, which is quite powerful, but it is limited in a way to guide user to make effective queries. In order to use it effectively you need to model your tables around your queries.
More often than not, you need to query your data in multiple ways, so users will often denormalize their data into multiple tables. Materialized views aim to make that user experience better, but it has had its share of bugs and limitations as you indicated. At this point if you consider using them you should be aware of their limitations, although that is generally good idea for evaluating anything.
If you need advanced querying capabilities or do not have an ahead of time knowledge of what the queries will be, Cassandra may not be a good fit. You can build these capabilities using products like Spark and Solr on top of Cassandra (such as what DataStax Enterprise does), but it may be difficult to achieve using Cassandra alone.
On the other hand there are many use cases where Cassandra is a great fit, such as messaging, personalization, sensor data, and so on.
I have been using Elasticsearch for quite sometime now and little experience using Cassandra.
Now, I have a project we want to use spark to process the data but I need to decide if we should use Cassandra or Elasticsearch as the datastore to load my data.
In terms of connector, both Cassandra and Elasticsearch now has a good connector to load the data so that won't be deciding factor.
The winning factor to decide will be how fast I can load my data inside Spark. My data is almost 20 terabytes.
I know I can run some test using JMeter and see the result myself but I would like to ask anyone familiar with both systems.
Thanks
The short exact answer is "it depends", mostly on cluster sizes =)
I wouldn't chose Elastisearch as a primary source for the data, because it's good at searching. Searching is a very specific task and it requires a very specific approach, which in this cases uses inverted index to store actual data. Each field basically goes into separate index and because of that the indexes are very compact. Although it's possible to store into index complete objects, such an index will hardly get any benefit of compression. That requires much more disk space to store indexes and much more cpu clocks, spinning disks to process them.
Cassandra on the other hand is pretty good at storing and retrieving data.
Without any more or less specific requirements, I'd say that Cassandra is good at being primary storage (and provides pretty simple search scenarios) and ES is good at searching.
I will refute Evgenii answer about how ES is only good at searching.
YES ES exceed at text search but it doesnt mean it can't do data.
You can actually treat it as if it was "Mongo" style Documentation and run "filter" queries on it to have fast fetch results. HOWEVER the question now becomes: how fast do you need your read/write and do you need any distributions? What ES lacks is distribution. Yes ES can do sharding but it has issues doing multi region distribution and reliability of replication of your data.
If you need the flexibility / reliability of your data I would swing to Cassanda. Also since you are dealing with TB - Cassandra might be a winner too because it is fitted for extreme volume.
If you need an easier time to to run searches (not limited to text search, eg: geo spacial you can do too) then ES might be a better fit. (note for the shear volume you are doing, you will need to shard in order to distribute your load).
I'm trying to migrate our postgres database containing millions of clicks (few years click history) to more performing system. Our current analytic queries, which are running on postgres are taking forever to complete and it degrades performance of the whole database. I've been investigating possible solutions and I've decided to closely investigate 2 options:
HBase with Hadoop (mapreduce)
Cassandra with Spark
I was working with NoSQL before, however never used it for analytical purposes. At first I was a bit disapointed how little analytical query options those databases provide (missing groupBy, count, ...). After reading many articles and presentations I've found out, that I need to design my schema according how I intend to read my data and that storage layer is separated from query layer. Which adds more redundant data, however in the world of NoSQL this is not an issue.
Eventually I've found one nice grails plugin cassandra-orm, which internally encapsulates orderBy feature in cassandra counters counters. However I'm still worried about howto make this design extendable. What about the queries, that will come in the future, which I have no clue about today, how can I design my schema prepared for that ?
One option would be to use Spark, but Spark doesn't provide data in real time.
Could you give me some insight or advice what are the best possible options for bigdata analysis. Should I use combination of real time queries vs. pre-aggregated ones?
Thanks,
If you are looking at near real time data analysis, Spark + HBase combination is one of the solutions.
If you want to compromise on throughput, Solr + Cassandra combination from Datastax can be used.
I am using Solr + Cassandra from Datastax for my use case, which does not require real time processing. The performance of search option is not that great with this combo but I am OK with the throughput.
Spark+HBase combination seems to be promising. Depending on your business requirement & expertise, you can chose the right combination.
If you want the ability to analyse data in near-real-time with complete flexibility in query structure, I think your best bet would be to throw a scalable indexing engine such as Elasticsearch or Solr into your polyglot persistence mix. You could still use Cassanra as the primary data store and then index those fields you're interested in querying and/or aggregating.
Have a look at Datastax Enterprise which bundles together Cassandra and Solr. Also have a look at Solr's Stats component and its faceting capabilities. These, combined with the indexing engine's rich query language, are handy for implementing many analytics use cases.
If your data set consists of a few million records 'only', I think you'll be able to get some good response times from Solr or ES on a reasonably spec'ed cluster.
we trying to build a data-ware house for our transaction system.
- We make 5000 -6000 transaction per day, they can go > 20,000.
- Each transaction produce a file, size (> 4MB)
we want to have a system, which can make updates to the existing data, consistent and availability, and have good read performance. Infrastructure is not any issue.
Hbase or cassandra or any other ? your help and guidance is highly appreciated.
Many thanks!
Most of newer nosql platform can do what you need in terms of performance - both hbase and cassandra scales horizontally (also Aerospike and others) so performances can be guaranteed if the data-model respect the "product-patterns" for data distribution.
I would not choose the technology in terms of performances.
What I would do is:
a list of different features offered by a bunch of products and then consider the one that, out of the box, best fit my needs
a list of operation I need to do on data and check if I am not going "against" some specific product
While 1 is easily done the 2 need a deep product analysis. For instance you say you need to update existing data -- let's imagine you choose Cassandra and you update very very frequently a column on which you put a secondary index (that, under the hood, creates a lookup table) for searching purpose. Any time you perform an update on this column on the lookup table a deletion and insertion is performed. You can read in this article that performing many deletes in Cassandra is considered an anti-pattern and can lead to problematic situations. This is just an example I did on Cassandra because is the one I know best among nosql products and not to tell you avoid Cassandra.
According to the neo4j documentation, indexing can be done i 2 ways"
Indexing in Neo4j can be done in two different ways:
1. The database itself is a natural index consisting of its relationships of different types between nodes. For example a tree
structure can be layered on top of the data and used for index lookups
performed by a traverser.
2. Separate index engines can be used, with Apache Lucene being the default
backend included with Neo4j.
But there is no comparison which is better in what and what is better in which cases.
Which one is better and why?
Is this a data warehouse/mart or reporting database? If you have both transactions and search going against the database it might give interesting pros or cons.
Lucene exists for one reason search and it does it really well. If you have a large system with multiple services, for ultimate scalability it is always to split the services up and keep them doing their single responsibility. This would give you flexibility of using that Lucene index against other services if necessary...also if you ever got rid off neo4j, then you still have your index/search artifacts around not coupled to Neo4j.
I would look at it from the overall system architecture not just specific functionality.