Is it bad to use INDEX in Cassandra if performance is not important? - cassandra

Background
We have recently started a "Big Data" project where we want to track what users are doing with our product: how often they are logging in, which features they are clicking on, etc. Your basic user analytics stuff. We still don't know exactly what questions we will be asking, but most of them will be of the "how often did X occur over the last Y months?" type. So we started storing the data sooner rather than later, thinking we can always migrate, re-shape, etc. when we need to, whereas if we don't store it, it is gone forever.
We are now looking at what sorts of questions we can ask. In a typical RDBMS, this stage would consist of slicing and dicing the data in many different dimensions, exporting to Excel, producing graphs, looking for trends etc - it seems that for Cassandra, this is rather difficult to do.
Currently we are using Apache Spark, and submitting Spark SQL jobs to slice and dice the data. This actually works really well, and we are getting the data we need, but it is rather cumbersome as there doesn't seem to be any native API for Spark that we can connect to from our workstations, so we are stuck using the spark-submit script and a Spark app that wraps some SQL from the command line and outputs to a file which we then have to read.
The question
In a table (or Column Family) with ~30 columns running on 3 nodes with RF 2, how bad would it be to add an INDEX to every non-PK column, so that we could simply query it using CQL across any column? Would there be a horrendous impact on the performance of writes? Would there be a large increase in disk space usage?
The other option I have been investigating is using Triggers, so that for each row inserted, we populated another handful of tables (essentially, custom secondary index tables) - is this a more acceptable approach? Does anyone have any experience of the performance impact of Triggers?

Impact of adding more indexes:
This really depends on your data structure, its distribution and how you access it; you were right earlier when you compared this process to an RDBMS. For Cassandra, it's best to define your queries first and then build the data model.
These guys have a nice write-up on the performance impacts of secondary indexes:
https://pantheon.io/blog/cassandra-scale-problem-secondary-indexes
The main impact (from the post) is that secondary indexes are local to each node, so to satisfy a query by indexed value, each node has to query its own records to build the final result set (as opposed to a primary key query, where it is known exactly which node needs to be queried). So there's not just an impact on writes, but on read performance as well.
In terms of working out the performance on your data model, I'd recommend using the cassandra-stress tool; you can combine it with a data modeler tool that DataStax have built to quickly generate profile yamls:
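To illustrate the point about node-local indexes, here is a toy Python model. It is not Cassandra code: the Node and Cluster classes, the modulo "partitioner" and the "feature" column are all made up for the sketch, and replication is ignored. It only shows that a primary key read touches one node while an index read has to ask every node, and that each write also has to maintain the local index.

```python
from collections import defaultdict


class Node:
    """Hypothetical node: holds its own rows plus an index over only those rows."""

    def __init__(self):
        self.rows = {}                       # primary key -> row
        self.local_index = defaultdict(set)  # indexed value -> primary keys on this node

    def write(self, pk, row):
        self.rows[pk] = row
        # Extra bookkeeping on every write: keep the node-local index up to date.
        self.local_index[row["feature"]].add(pk)


class Cluster:
    """Hypothetical cluster without replication, for illustration only."""

    def __init__(self, node_count):
        self.nodes = [Node() for _ in range(node_count)]

    def owner(self, pk):
        # Stand-in for the partitioner: integer keys are spread by modulo.
        return self.nodes[pk % len(self.nodes)]

    def write(self, pk, row):
        self.owner(pk).write(pk, row)

    def read_by_pk(self, pk):
        # The owning node is known up front, so exactly one node is contacted.
        return self.owner(pk).rows.get(pk), 1

    def read_by_index(self, value):
        # No node knows where matching rows live, so every node is asked.
        results = []
        for node in self.nodes:
            results.extend(node.rows[pk] for pk in node.local_index.get(value, ()))
        return results, len(self.nodes)


cluster = Cluster(node_count=3)
for pk in range(6):
    cluster.write(pk, {"id": pk, "feature": "login" if pk % 2 == 0 else "export"})

row, nodes_for_pk = cluster.read_by_pk(4)
rows, nodes_for_index = cluster.read_by_index("login")
```

With 3 nodes the difference is small, but the number of nodes contacted by the index read grows with the cluster while the primary key read stays at one.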
http://www.datastax.com/dev/blog/data-modeler
For example, I ran the basic stress profile without and then with secondary indexes on the default table, and the "with indexes" batch of writes took a little over 40% longer to complete. There was also an increase in GC operations / duration etc.

Related

Is RDBMS with redundancy as good as nosql dbs?

I am reading about NoSQL DBs (specifically Cassandra), and it says that Cassandra is fast for writes and that queries are fast as well. Schema design is based more on the queries than on the data. For example, you have queries like in this example
So I have a question: suppose I design the RDBMS schema the Cassandra way and ensure that no joins are required for queries. Will I still get any significant performance gains by using Cassandra (NoSQL DBs)?
There is no exact answer, but here are a few points:
JOIN is just one of many things - Cassandra stores the data physically based on the partition keys, which makes reads by partition as fast as possible.
On the performance side - it's not about the performance at the beginning but about keeping the performance consistent over a period of time. Say, for example, you have a time-series-like requirement where data is inserted every second. RDBMS performance will usually degrade as the data grows, and it is not easy to keep the indexes and stats up to date, while Cassandra fits a time series pattern better, and as the data grows it's easy to scale out by adding nodes.
On the write performance - Cassandra's write workflow itself is different and is designed to accept writes quickly (the complicated processes like merging SSTables, compaction etc. happen in the background without affecting the actual write).
In short - you need to review the business case and make a decision.

Getting data OUT of Cassandra?

How can I export data, over a period of time (like hourly or daily) or updated records from a Cassandra database? It seems like using an index with a date field might work, but I definitely get timeouts in my cqlsh when I try that by hand, so I'm concerned that it's not reliable to do that.
If that's not the right way, then how do people get their data out of Cassandra and into a traditional database (for analysis, querying with JOINs, etc.)? It's not a Java shop, so using Spark is non-trivial (and we don't want to change our whole system to use Spark instead of Cassandra directly). Do I have to read SSTables and try to keep track of them that way? Is there a way to say "get me all records affected after point in time X" or "get me all changes after timestamp X" or something similar?
It looks like Cassandra is really awesome at rapidly reading and writing individual records, but beyond that Cassandra seems to not be the right tool if you want to pull its data into anything else for analysis or warehousing or querying...
Spark is the most typical way to do exactly that (as you say). It does it efficiently and is used often, so it is pretty reliable. Cassandra is not really designed for OLAP workloads, but things like the Spark connector help bridge the gap. DataStax Enterprise might have some more options available to you, but I am not sure about their current offerings.
You can still just query and page through the whole data set with normal CQL queries; it's just not as fast. You can even use ALLOW FILTERING, just be wary as it's very expensive and can impact your cluster (creating a separate DC for the workload and querying it with a LOCAL_ consistency level such as LOCAL_ONE or LOCAL_QUORUM helps). In that scenario you will probably also add lower and upper token() bounds to the WHERE clause to split up the query and prevent too much work on any one coordinator. Organizing your data so that this query is more efficient is strongly recommended (i.e. if doing time slices, put things in partitions bucketed by time with timeuuid clustering keys, so that each slice of time is a sequential read).
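The token() splitting mentioned above can be sketched roughly as follows. This assumes the default Murmur3Partitioner, whose tokens span -2^63 to 2^63 - 1; the table name analytics.events and the partition key user_id are hypothetical. The sketch only builds the CQL strings; running them (and paging through each result) is left to whichever driver you use.

```python
# Token bounds of the Murmur3Partitioner (assumed to be the partitioner in use).
MIN_TOKEN = -2**63
MAX_TOKEN = 2**63 - 1


def token_ranges(splits):
    """Split the full token range into `splits` contiguous (start, end) pairs."""
    span = MAX_TOKEN - MIN_TOKEN
    bounds = [MIN_TOKEN + span * i // splits for i in range(splits)] + [MAX_TOKEN]
    return list(zip(bounds[:-1], bounds[1:]))


def range_queries(table, partition_key, splits):
    """Build one CQL query per token range; together they cover every partition once."""
    queries = []
    for i, (start, end) in enumerate(token_ranges(splits)):
        # The first range includes its start; later ranges exclude it so that
        # no token is read twice.
        lower = ">=" if i == 0 else ">"
        queries.append(
            f"SELECT * FROM {table} "
            f"WHERE token({partition_key}) {lower} {start} "
            f"AND token({partition_key}) <= {end}"
        )
    return queries


# Hypothetical table and key, split into 4 chunks of work.
queries = range_queries("analytics.events", "user_id", splits=4)
```

More splits mean smaller, cheaper queries per coordinator at the cost of more round trips; the right number depends on data volume and timeouts.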
Kinda cheesy sounding but the CSV dump from cqlsh is actually fast and might work for you if your data set is small enough.
I would not recommend going to the SSTables directly unless you are familiar with the internals and are using Hadoop or Spark.

Cassandra data model too many table

I have a single structured row as input with a write rate of 10K per second. Each row has 20 columns. Some queries should be answered on these inputs. Because most of the queries need a different WHERE, GROUP BY or ORDER BY, the final data model ended up like this:
primary key for table of query1 : ((column1,column2),column3,column4)
primary key for table of query2 : ((column3,column4),column2,column1)
and so on
I am aware of the limit in number of tables in Cassandra data model (200 is warning and 500 would fail)
Because for every input row I should do an insert in every table, the final writes per second became big * big data!:
writes per second = 10K (input)
* number of tables (queries)
* replication factor
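As a quick sanity check of that formula, here is the arithmetic in Python. The 10K input rate is from the question; the 10 query tables and replication factor of 3 are made-up example values, not figures from the post.

```python
def replica_writes_per_second(input_rate, query_tables, replication_factor):
    """Writes landing on replicas per second when every input row goes to every table."""
    return input_rate * query_tables * replication_factor


# 10K rows/s from the question; table count and RF are assumed for illustration.
total = replica_writes_per_second(input_rate=10_000, query_tables=10, replication_factor=3)
```

Under those assumed values the cluster as a whole absorbs 300,000 replica writes per second, spread across however many nodes it has.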
The main question: am I on the right path? Is it normal to have a table for every query even when the input rate is already so high?
Shouldn't I use something like Spark or Hadoop instead of relying on the bare data model? Or even HBase instead of Cassandra?
It could be that Elassandra would resolve your problem.
The query system is quite different from CQL, but the duplication for indexing would automatically be managed by Elassandra on the backend. All the columns of one table will be indexed so the Elasticsearch part of Elassandra can be used with the REST API to query anything you'd like.
In one of my tests, I pushed a huge amount of data to an Elassandra database (8Gb) going non-stop and I never timed out. Also, the search engine remained ready pretty much the whole time. More or less what you are talking about. The docs say that it takes 5 to 10 seconds for newly added data to become available in the Elassandra indexes. I guess it will somewhat depend on your installation, but I think that's more than enough speed for most applications.
The use of Elassandra may sound a bit hairy at first, but once in place, it's incredible how fast you can find results. It certainly includes a very powerful equivalent of WHERE. The GROUP BY is a bit difficult to put in place. The ORDER BY is simple enough; however, when (re-)ordering you lose some speed... Something to keep in mind. In my tests, though, even the ORDER BY equivalents were very fast.

Pros and Cons of Cassandra User Defined Functions

I am using Apache Cassandra to store mostly time series data, and I am grouping the data and aggregating/counting it based on some conditions. At the moment I am doing this in a Java 8 application, but with the release of Cassandra 3.0 and the User Defined Functions, I have been asking myself if moving the grouping and aggregation/counting logic into Cassandra is a good idea. To my understanding this functionality is something like stored procedures in SQL.
My concern is if this will impact the computation performance and the overall performance of the database. I am also not sure if there are other issues with it and if this new feature is something like the secondary indexes in Cassandra - you can do them, but it is not recommended at all.
Have you used user defined functions in Cassandra? Do you have any observations on the performance? What are the good and bad sides of this new functionality? Is it applicable in my use case?
You can compare it to using count() or avg() kinds of aggregations. They can save you a lot of network traffic and object creation/GC by having the coordinator only send the result, but it's easy to get carried away and make the coordinator do a lot of work. This extra work takes away from normal C* duties, and can just as likely increase GCs as reduce them.
If you're aggregating 100 rows in a partition it's probably fine, and if you're aggregating 10000 it's probably not the end of the world if it's very rare. If you're calling it once a second, though, it's a problem. If you're aggregating over 1000 I would be very careful.
If you absolutely need to do it and it's a lot of data often, you may want to create dedicated proxy coordinators (-Dcassandra.join_ring=false) to bear the brunt of the load without impacting normal C* reads/writes. At that point it's just as easy to create a dedicated workload DC for it or something (with RF=0 for your keyspace in that DC, and your application set to be part of that DC with DCAwareRoundRobinPolicy). This is also the point where using Spark is probably the right thing to do.

Choosing a NoSQL database

I need a NoSQL database that will run on Windows Azure and works well for the following parameters. Right now Azure Table Storage, HBase and Cassandra seem to be the most promising options.
1 billion entities
up to 100 reads per second, though caching will mostly make it much less
around 10 - 50 writes per second
Strong consistency would be a plus, so perhaps HBase would be better than Cassandra in that regard.
Querying will often be done on a secondary in-memory database with various indexes in addition to ElasticSearch or Windows Azure Search for fulltext search and perhaps some filtering.
Azure Table Storage looks like it could be nice, but from what I can tell, the big difference between Azure Table Storage and HBase is that HBase supports updating and reading values for a single property instead of the whole entity at once. I guess there must be some disadvantages to HBase however, but I'm not sure what they would be in this case.
I also think crate.io looks like it could be interesting, but I wonder if there might be unforeseen problems.
Anyone have any other ideas of the advantages and disadvantages of the different databases in this case, and if any of them are really unsuited for some reason?
I currently work with Cassandra and I might help with a few pros and cons.
Requirements
Cassandra can easily handle those 3 requirements. It was designed to have fast reads and writes. In fact, Cassandra is blazing fast with writes, mostly because you can write without doing a read.
Also, Cassandra keeps some of its data in memory, so you could even avoid the secondary database.
Consistency
In Cassandra you choose the consistency in each query you make, therefore you can have consistent data if you want to. Normally you use:
ONE - Only one replica has to get or accept the change. This means fast reads/writes, but low consistency (another machine can deliver older information while consistency has not yet been achieved).
QUORUM - A majority of the replicas for the data (more than half of the replication factor, not of all nodes in the cluster) must get or accept the change. This means reads and writes are not as fast, but you get strong consistency IF you use it for BOTH reads and writes. That's because if more than half of your replicas have your data after you inserted/updated/deleted, then, when reading from more than half of your replicas, at least one of them will have the most recent information, which is the one that gets delivered.
Both of these options are recommended because they avoid single points of failure. If all machines had to accept, then whenever one node was down or busy you wouldn't be able to query.
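The overlap argument for QUORUM can be checked with a few lines of Python. This is plain arithmetic, not a driver API: a quorum is a majority of the replicas, and a read is guaranteed to see the latest write whenever the replicas read plus the replicas written exceed the replication factor. The function names are made up for the sketch.

```python
def quorum(replication_factor):
    """Number of replicas that form a majority for the given replication factor."""
    return replication_factor // 2 + 1


def read_sees_latest_write(read_replicas, write_replicas, replication_factor):
    """True when every read set must share at least one replica with every write set."""
    return read_replicas + write_replicas > replication_factor


rf = 3
q = quorum(rf)                                         # majority of 3 replicas
quorum_both = read_sees_latest_write(q, q, rf)         # QUORUM reads + QUORUM writes
one_both = read_sees_latest_write(1, 1, rf)            # ONE reads + ONE writes
```

One consequence worth noting: with a replication factor of 2, a quorum is also 2, so QUORUM then requires every replica to be available.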
Pros
Cassandra is the solution for performance, linear scalability and avoiding single points of failure (you can have machines down and the others will take over the work). And it does most of its management work automatically. You don't need to manage the data distribution, replication, etc.
Cons
The downsides of Cassandra are in the modeling and queries.
With a relational database you model around the entities and the relationships between them. Normally you don't really care about what queries will be made and you work to normalize it.
With Cassandra the strategy is different. You model the tables to serve the queries. And that happens because you can't join and you can't filter the data any way you want (only by its primary key).
So if you have a database for a company with grocery stores and you want to make a query that returns all products of a certain store (Ex.: New York City), and another query to return all products of a certain department (Ex.: Computers), you would have two tables "ProductsByStore" and "ProductsByDepartment" with the same data, but organized differently to serve the query.
Materialized Views can help with this by avoiding the need to apply each change to multiple tables yourself, but the example shows how differently things work with Cassandra.
Denormalization is also common in Cassandra for the same reason: Performance.
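The grocery store example above can be mimicked in plain Python to show what "one table per query" means in practice. The two dictionaries stand in for the ProductsByStore and ProductsByDepartment tables, and the product data is invented for the sketch; this is a model of the idea, not of Cassandra itself.

```python
from collections import defaultdict

# Each "table" is keyed by the column its query filters on.
products_by_store = defaultdict(list)
products_by_department = defaultdict(list)


def add_product(name, store, department):
    """Write the same product twice, once per query table (denormalization)."""
    row = {"name": name, "store": store, "department": department}
    products_by_store[store].append(dict(row))
    products_by_department[department].append(dict(row))


# Made-up sample data.
add_product("Laptop", "New York City", "Computers")
add_product("Apples", "New York City", "Groceries")
add_product("Monitor", "Boston", "Computers")

# Each query is now a single lookup by key, with no join and no filtering.
nyc_products = [p["name"] for p in products_by_store["New York City"]]
computer_products = [p["name"] for p in products_by_department["Computers"]]
```

The cost is visible in add_product: every write is done once per table, which is the trade-off the modeling advice above accepts in exchange for fast reads.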
