Most common cassandra cql commands - cassandra

I am exploring how to optimize Cassandra for a limited set of commands. To do that, I would like to know which of SELECT, INSERT, UPDATE, DELETE and BATCH is the most frequently used CQL command in real-time systems. Any pointers or thoughts on this would be a great help.

There is no such thing as a "most common" CQL command; it all depends on the use case Cassandra is deployed for.
So instead of optimizing individual commands, you could go for use-case-based optimization.
E.g. for a write-oriented workload:
Optimize the INSERT and UPDATE commands.

Related

"LIKE" and "ILIKE" queries in AWS Keyspaces

As there is no support for custom index in AWS Keyspaces what would be the best solution / pattern to be able to run LIKE or ILIKE queries on specific columns of a Cassandra Table?
In vanilla Cassandra, you can use an SSTable Attached Secondary Index (SASI) to run LIKE queries, but we can't in AWS...
Is there any query for Cassandra as same as SQL:LIKE Condition?
Feeding an OpenSearch service, or even a good old Postgres at the same time of updating Keyspaces seems a bit overkill to me.
Fetching all columns in-memory somewhere to do the query seems slow as well.
What would be the lightest infra / architecture to implement to provide a LIKE query support based on AWS Keyspaces as source of truth?
You can use a lexicographical range SELECT statement to narrow your query down in the same way a LIKE statement would. If you need to narrow it down further, you can do that on the client side. I would love to learn more about your use case so I can better assist you.
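As a rough sketch of the lexicographical-range idea, the helper below turns a prefix into a lower and upper bound, which is roughly what `LIKE 'smi%'` means. The table `users`, partition key `country` and clustering column `last_name` are hypothetical names chosen for illustration, and the sketch assumes an ASCII prefix and that the filtered column is a clustering column (range predicates are not generally available on other columns without an index).

```python
def prefix_range(prefix: str) -> tuple[str, str]:
    """Return (lower, upper) such that lower <= s < upper holds for
    every ASCII string s that starts with `prefix`."""
    if not prefix or not prefix.isascii() or ord(prefix[-1]) >= 0x7F:
        raise ValueError("sketch only handles non-empty ASCII prefixes")
    # Bumping the last character gives the smallest string that sorts
    # after every string sharing the prefix.
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    return prefix, upper


def prefix_query(partition_value: str, prefix: str) -> tuple[str, tuple]:
    """Build a parameterized CQL statement for a hypothetical table
    users(country, last_name, ...) with last_name as clustering column."""
    lower, upper = prefix_range(prefix)
    cql = (
        "SELECT * FROM users "
        "WHERE country = ? AND last_name >= ? AND last_name < ?"
    )
    return cql, (partition_value, lower, upper)


cql, params = prefix_query("FR", "smi")
print(cql)
print(params)
```

A case-insensitive (ILIKE-style) match would still need something extra, for example storing a lower-cased copy of the column or filtering the returned rows on the client.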

Getting data OUT of Cassandra?

How can I export data, over a period of time (like hourly or daily) or updated records from a Cassandra database? It seems like using an index with a date field might work, but I definitely get timeouts in my cqlsh when I try that by hand, so I'm concerned that it's not reliable to do that.
If that's not the right way, then how do people get their data out of Cassandra and into a traditional database (for analysis, querying with JOINs, etc..)? It's not a java shop, so using Spark is non-trivial (and we don't want to change our whole system to use Spark instead of cassandra directly). Do I have to read sstables and try to keep track of them that way? Is there a way to say "get me all records affected after point in time X" or "get me all changes after timestamp X" or something similar?
It looks like Cassandra is really awesome at rapidly reading and writing individual records, but beyond that Cassandra seems to not be the right tool if you want to pull its data into anything else for analysis or warehousing or querying...
Spark is the most typical way to do exactly that (as you say). It does it efficiently and is widely used, so it is pretty reliable. Cassandra is not really designed for OLAP workloads, but things like the Spark connector help bridge the gap. DataStax Enterprise might have some more options available to you, but I am not sure of their current offerings.
You can still query and page through the whole data set with normal CQL queries; it's just not as fast. You can even use ALLOW FILTERING, but be wary: it's very expensive and can impact your cluster (creating a separate DC for the workload and querying it with a LOCAL_* consistency level such as LOCAL_ONE or LOCAL_QUORUM helps). In that scenario you will probably also add lower and upper token() bounds to the WHERE clause to split up the query and prevent too much work landing on any one coordinator. Organizing your data so that this query is more efficient is strongly recommended (i.e. if doing time slices, put things in a partition bucketed by time with timeuuid clustering keys, so that each slice of time is a sequential read).
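A minimal sketch of the token-splitting idea, assuming the default Murmur3Partitioner (token range -2^63 to 2^63-1). The table name `events` and partition key `id` are hypothetical; a real export would run each generated statement through a driver with paging enabled.

```python
MIN_TOKEN = -(2**63)      # lowest Murmur3Partitioner token
MAX_TOKEN = 2**63 - 1     # highest Murmur3Partitioner token


def token_ranges(splits: int) -> list[tuple[int, int]]:
    """Divide the full token ring into `splits` contiguous,
    non-overlapping inclusive ranges."""
    if splits < 1:
        raise ValueError("splits must be >= 1")
    step = (MAX_TOKEN - MIN_TOKEN + 1) // splits
    ranges = []
    start = MIN_TOKEN
    for i in range(splits):
        # The last range absorbs any remainder so the ring is fully covered.
        end = MAX_TOKEN if i == splits - 1 else start + step - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


def range_query(table: str, partition_key: str, start: int, end: int) -> str:
    """CQL for one slice of the ring (table and key names are placeholders)."""
    return (
        f"SELECT * FROM {table} "
        f"WHERE token({partition_key}) >= {start} "
        f"AND token({partition_key}) <= {end}"
    )


for start, end in token_ranges(4):
    print(range_query("events", "id", start, end))
```

Each slice can then be exported independently, which spreads the work across coordinators and lets a failed slice be retried on its own.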
It sounds kind of cheesy, but the CSV dump from cqlsh (the COPY ... TO command) is actually fast and might work for you if your data set is small enough.
I would not recommend going to the sstables directly unless you are familiar with internals and using hadoop or spark.

How to Create slowness in Cassandra?

I want to create slowness in Cassandra to test my application. Are there any specific ways to induce slowness in Cassandra? In an RDBMS we use locking to make other operations wait until the lock is released. As Cassandra doesn't have locking, is there any other way to create deadlocks, slowness, etc.?
You could use the cassandra-stress tool.
You could check out our project, Simulacron: https://github.com/datastax/simulacron
This is a C*/DSE simulator that was written specifically to test things like race conditions and error conditions. You would have to prime all your relevant queries ahead of time, but it would allow you to introduce a wait time, or errors, into your responses. You can also simulate a large cluster on your local machine.
There is also a similar tool called scassandra, which does much of the same thing.
http://www.scassandra.org/
There are many ways to do it; I'll list two:
Create a UDF with a sleep/wait function inside it, if your version of Cassandra supports it.
Link to the docs:
https://docs.datastax.com/en/cql/3.3/cql/cql_using/useCreateUDF.html
Create a large table (the larger it is, the slower the query will run), and run:
select some_column from table where other_column = 'something' allow filtering;
where other_column is not a partition key of the table. It will result in a full table scan, and since Cassandra isn't built for that, it will take some time (and also I/O and CPU).
Maybe it is easier to just limit the network on the nodes. Depending on the OS you're using, there are different options.

Does Accumulo support aggregation?

I am new to Accumulo. I know that I can write Java code to scan, insert, update and delete data using Hadoop and MapReduce. What I would like to know is whether aggregation is possible in Accumulo.
I know that in MySQL we can use GROUP BY, ORDER BY, MAX, MIN, COUNT, SUM, joins, nested queries, etc. Is there any possibility of using these functions in Accumulo, either directly or indirectly?
Accumulo does support aggregation through the use of combiner iterators (Accumulo Combiner Example).
Iterators mostly run server-side, but can be run client-side, and can perform quite a bit of computation before sending the data back to your client.
Accumulo comes packaged with many iterators; in particular, the SummingCombiner is used to sum the values of entries. Dave Medinets has a blog with some good examples (Accumulo Blog), including one that uses the SummingCombiner to implement word count (Word Count in Accumulo). I also suggest signing up for the Accumulo users mailing lists (mailing lists).
I like to think Accumulo has great aggregation functionality. I run an OLAP solution on it with hundreds of millions of keys on 40 nodes. In addition to the basic SummingCombiner, I recommend the newer StatsCombiner as well:
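To make the combiner idea concrete, here is a plain-Python illustration of what a summing combiner does conceptually: entries arrive sorted by key, and all values sharing a key are folded into one. This is not the Accumulo API (real combiners are Java iterators configured on a table); the `(row, column)` key shape and the word-count data are simplifying assumptions.

```python
from itertools import groupby


def summing_combine(entries):
    """Collapse sorted ((row, column), value) entries so that each key
    appears once with the sum of its values."""
    combined = []
    # groupby only merges adjacent items, which matches the way a
    # combiner sees entries in sorted key order.
    for key, group in groupby(entries, key=lambda entry: entry[0]):
        combined.append((key, sum(value for _, value in group)))
    return combined


# Word count: every occurrence of a word is ingested as a value of 1.
entries = sorted([
    (("the", "count"), 1),
    (("word", "count"), 1),
    (("the", "count"), 1),
])
print(summing_combine(entries))
```

Because the fold happens on the server as entries are read or compacted, the client only ever sees the already-summed entry per key.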
http://accumulo.apache.org/1.4/apidocs/org/apache/accumulo/examples/simple/combiner/StatsCombiner.html
which gives you basic stats about a set of keys.
You can set combiners at major compaction, minor compaction or scan time. If you have a ton of data with a lot of trickled keys, I don't recommend scan-time combining, because it can slow down the scan (not always).
HTH
Some aggregation is supported in Accumulo, over multiple entries, and even multiple rows, within each tablet. Aggregation across tablets would need to be done on the client side or in a MapReduce job.
Yes, aggregations are possible in Accumulo. You can achieve them by:
1) Using the built-in Combiners, which aggregate data when you ingest.
2) Writing a customised aggregation Iterator and then deploying it at minor or major compactions.

Cassandra bulk insert operation, internally

I am looking for Cassandra/CQL's cousin of the common SQL idiom of INSERT INTO ... SELECT ... FROM ... and have been unable to find anything to do such an operation programmatically or in CQL. Is it just not supported?
My use case is to do a reasonably bulky copy from one table to another. I don't need any particular concurrent guarantees, but it's a lot of data so I'd like to avoid the additional network overhead of writing a client that retrieves data from one table, then issues batches of inserts into the other table. I understand that the changes will still need to be transported between nodes of the Cassandra cluster according to the replication set-up, but it seems reasonable for there to be an "internal" option to do a bulk operation from one table to another. Is there such a thing in CQL or elsewhere? I'm currently using Hector to talk to Cassandra.
Edit: it looks like sstableloader might be relevant, but is awfully low-level for something that I'd expect to be a fairly common use case. Taking just a subset of rows from one table to another also seems less than trivial in that framework.
Correct, this is not supported natively. (Another alternative would be a map/reduce job.) Cassandra's API focuses on short requests for applications at scale, not batch or analytical queries.
