How to use "LIKE" clause in Cassandra - cassandra

I started reading about DataStax Cassandra a few days ago, so I am something of a newbie to this technology. I have some doubts/questions that need clarification:
Which version of Cassandra is more suitable, 2.1 or 2.0? Right now I am using 2.1, which is reportedly not yet stable or recommended for use. If using 2.1 leads to problems down the road, what would be the better choice?
Does Cassandra support a "LIKE" clause? If yes, in which version, and how? If not, what are the alternatives?

I'm using Apache Cassandra 2.1.2. It's been running in production since release with no major issues.
No, not as of Cassandra 2.1; look to pair it with Lucene or Elasticsearch. If you're on DSE, DSE Search nodes can give you this feature. You might also want to check out http://www.openstratio.org/blog/advanced-search-in-cassandra/. The team at Stratio has added the ability to have a text column representing a Lucene query, which is quite interesting.
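If neither an external search engine nor DSE is an option, a last-resort workaround is to fetch candidate rows and emulate LIKE on the client. A minimal Python sketch (the `rows` data and column name are made up for illustration; beware that this filters after the fetch, so it does not reduce the data read from Cassandra):

```python
import re

def like_to_regex(pattern: str) -> re.Pattern:
    """Translate a SQL LIKE pattern ('%' = any run, '_' = one char) to a regex."""
    parts = []
    for ch in pattern:
        if ch == '%':
            parts.append('.*')
        elif ch == '_':
            parts.append('.')
        else:
            parts.append(re.escape(ch))
    return re.compile('^' + ''.join(parts) + '$', re.IGNORECASE)

def filter_like(rows, column, pattern):
    """Emulate WHERE column LIKE pattern on rows already fetched from Cassandra."""
    rx = like_to_regex(pattern)
    return [row for row in rows if rx.match(row[column])]

rows = [{'name': 'Alice'}, {'name': 'Alina'}, {'name': 'Bob'}]
print(filter_like(rows, 'name', 'Ali%'))  # [{'name': 'Alice'}, {'name': 'Alina'}]
```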

Related

Does the Cassandra Lucene index work with vnodes? Say num_tokens = 256

I heard it's not good practice to enable vnodes when using the Cassandra Lucene index. Is that still true? I'm using Apache Cassandra 3.0.5 and the matching Stratio Lucene indexer.
Thanks.
According to the docs, it is still true.

Cassandra secondary indices v. Lucene

I understand that Cassandra is a NoSQL DB and that patching it with many indices is not the way to go, but here I'm looking at a solution for my analytics cluster, not for the production/real-time one.
So I think it makes sense to add indices to reduce the amount of data filtered by Spark.
How do native Cassandra secondary indices compare to Lucene's indices?
Many functionalities are not available with Cassandra alone, but what about things that you can do with both?
Is it better / does it make sense to only use Lucene?
Another advantage I see is that I can install Lucene only on my analytics cluster, without burdening the real-time one with indices (thereby preserving write performance on that side).
Don't bother with Lucene integration
Since Cassandra 3.4, there is a new secondary index called SASI that offers full-text search and is quite performant.
Read this: https://github.com/apache/cassandra/blob/trunk/doc/SASI.md
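For example, a SASI index can be created and queried with LIKE directly in CQL. A minimal sketch, assuming a hypothetical users table (see SASI.md for the full set of options):

```sql
CREATE TABLE users (
    id   uuid PRIMARY KEY,
    name text
);

-- CONTAINS mode allows '%term%' queries; the default PREFIX mode
-- only supports 'term%' patterns.
CREATE CUSTOM INDEX users_name_idx ON users (name)
USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS = { 'mode': 'CONTAINS' };

SELECT * FROM users WHERE name LIKE '%smith%';
```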

Bigdata analysis in nosql

I'm trying to migrate our Postgres database containing millions of clicks (a few years of click history) to a more performant system. Our current analytic queries, which run on Postgres, take forever to complete and degrade the performance of the whole database. I've been investigating possible solutions and have decided to look closely at two options:
HBase with Hadoop (mapreduce)
Cassandra with Spark
I have worked with NoSQL before, but never for analytical purposes. At first I was a bit disappointed by how few analytical query options these databases provide (no groupBy, count, ...). After reading many articles and presentations, I found out that I need to design my schema according to how I intend to read my data, and that the storage layer is separated from the query layer. This adds redundant data, but in the world of NoSQL that is not an issue.
Eventually I found a nice Grails plugin, cassandra-orm, which internally implements an orderBy feature on top of Cassandra counters. However, I'm still worried about how to make this design extensible. What about the queries that will come in the future, which I have no idea about today? How can I design my schema to be prepared for that?
One option would be to use Spark, but Spark doesn't provide data in real time.
Could you give me some insight or advice on the best options for big-data analysis? Should I use a combination of real-time queries and pre-aggregated ones?
Thanks.
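The pre-aggregation idea mentioned above can be sketched in CQL with a query-driven counter table. This is a minimal illustration under assumptions of my own (table and column names are hypothetical; the `date` type requires Cassandra 2.2+):

```sql
-- One partition per page; clicks pre-aggregated per day at write time.
CREATE TABLE clicks_by_page_day (
    page   text,
    day    date,
    clicks counter,
    PRIMARY KEY (page, day)
);

-- Increment on every click event:
UPDATE clicks_by_page_day SET clicks = clicks + 1
WHERE page = '/home' AND day = '2016-03-01';

-- The "group by day" analytics query becomes a single-partition read:
SELECT day, clicks FROM clicks_by_page_day WHERE page = '/home';
```

The trade-off is the one described in the question: each new query shape generally needs its own table maintained at write time.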
If you are looking at near-real-time data analysis, the Spark + HBase combination is one of the solutions.
If you can compromise on throughput, the Solr + Cassandra combination from DataStax can be used.
I am using Solr + Cassandra from DataStax for my use case, which does not require real-time processing. The performance of the search option is not that great with this combo, but I am OK with the throughput.
The Spark + HBase combination seems promising. Depending on your business requirements and expertise, you can choose the right combination.
If you want the ability to analyse data in near real time with complete flexibility in query structure, I think your best bet would be to throw a scalable indexing engine such as Elasticsearch or Solr into your polyglot-persistence mix. You could still use Cassandra as the primary data store and then index the fields you're interested in querying and/or aggregating.
Have a look at DataStax Enterprise, which bundles together Cassandra and Solr. Also have a look at Solr's Stats component and its faceting capabilities. These, combined with the indexing engine's rich query language, are handy for implementing many analytics use cases.
If your data set consists of a few million records 'only', I think you'll be able to get some good response times from Solr or ES on a reasonably spec'ed cluster.

Storing a Lucene index in a Cassandra DB

Is there any way to use Apache Lucene and have it store values and retrieve values from a Cassandra cluster?
The hard way: implement a custom index type on top of Lucene and teach Cassandra to query it. There is also a two-year-old ticket open for this that you could watch.
The expensive way: buy a DataStax Enterprise license.
You should try out https://code.google.com/p/lucene-on-cassandra/. It takes a different approach from the DataStax one.

Confusion between the Thrift API and CQL

I am working on a Java web application using NoSQL (the target is Cassandra). I use Astyanax as the Cassandra client, since it is currently suggested as the best client for Cassandra. I've only been working with Cassandra for two weeks, so many things are still strange to me.
During my work, I have encountered some problems that I do not know how to resolve:
Is a table created with CQL the same as a column family created through the Thrift API? They feel similar, but maybe there are differences underneath. For example:
a table created by a CQL command cannot be accessed through the Thrift API
Thrift-based APIs cannot work with tables created by CQL, but CQL methods can access column families created through the Thrift API!
Is the primary key of a table equivalent to the row key of a column family?
In CQL I can declare a table that contains a collection/set/map. Can I do the same thing with the Thrift API?
If my application needs both of them (column families and tables), how can they work with each other?
I have realized one thing: I cannot use the Thrift API to manipulate data in tables created by CQL, and vice versa. How can I keep track of which table/column family was created which way, so that I can use the correct API to process the data? For the time being, we don't have a general way to handle both, do we? AFAIK, the Thrift API and CQL do not share an interface, so they cannot understand each other?!
Could you please help explain these things? Thank you so much.
Yes. It's impossible to update the Thrift APIs to be CQL-aware without breaking existing applications. So if you use CQL you are committing to using CQL clients only like the Java driver, and not Astyanax, Hector, et al. But this is no great sacrifice since CQL is much more usable.
For a simple PK (i.e., single column), yes. For a compound PK, it's a bit more complicated.
No. The Thrift API operates at a lower level, by design. (So you'd see the individual storage cells that make up the Map, for instance.)
I don't understand the question. With CQL you can do everything you could do with Thrift, but more easily.
Simple; don't mix the two. Stick with one or the other.
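To make the compound-key point above concrete, here is a sketch of how a CQL table maps onto Thrift-level storage (the table and column names are hypothetical):

```sql
CREATE TABLE events (
    user_id    text,
    event_time timeuuid,
    payload    text,
    PRIMARY KEY (user_id, event_time)
);

-- Thrift view: user_id (the partition key) is the row key of one wide
-- row, and each CQL row is stored as cells whose composite column names
-- combine the clustering value and the column name, e.g.
-- (event_time, 'payload'). This is why Thrift sees individual storage
-- cells rather than CQL rows.
```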
In my opinion, the focus is shifting towards making Cassandra look like an RDBMS with SQL-style queries in order to gain wider adoption.
But with the inconsistencies between work done using Hector/Astyanax (Thrift) and CQL, I think it will hurt adoption. It's almost a U-turn from Hector/Astyanax to CQL in the middle of the journey.
At the least, CQL should have been planned in such a way that the Thrift API (and the high-level Java APIs on top of it) would have no problem transitioning.
