DCE Cassandra 3.9 slow secondary index creation during joining existing cluster - cassandra

We have cassandra cluster with 32 nodes, average node size is about 1TB. Node configuration 1xIntel Xeon E3-1271v3, 32GB ram, 2x3TB HDD.
We have one DB with some small tables and one big table, that holds is about 90-95% of total cluster size.
I try to add additional nodes to this cluster, but suddenly find out, that adding one node to existing cluster take is about 13-14 days for joining to cluster. Build secondary indexes take most of this time and all this time i see that all compactor threads take all available CPU.
I have changed cassandra config to extends limits:
concurrent_compactors: 4
compaction_throughput_mb_per_sec: 0
Cassandra full config
Schema
Is about 1 year ago we also add new nodes to this cluster and extend it from 16 nodes to 32 nodes cluster, average node size was 1TB before cluster extends. Cassandra version was 2.1. One node joining time was 1-1.5days.
So the question how can we speed up this process ? Did we miss something ?
Thanks.

This one is a bit longer so I can't put it into comment ... sorry.
I know that this sounds a bit strange, especially for a later stage of
your project, but the thing is with the indexes the situation won't get
any better over time. I would strongly recommend to start making your own
tables instead of just putting index on following stuff. Depending on
how often the data is accessed you can use "inverted indexes".
CREATE INDEX links_by_author_url_idx ON keyspace.links_by_author (url);
CREATE INDEX docs_url_idx ON keyspace.docs (url);
CREATE INDEX om_master_object_id_idx ON keyspace.om (master_object_id);
CREATE INDEX actions_pday_idx ON keyspace.actions (pday);
CREATE INDEX authors_yauid_idx ON keyspace.authors (yauid);
CREATE INDEX authors_login_lr_idx ON keyspace.authors (login_lr);
CREATE INDEX authors_login_idx ON keyspace.authors (login);
CREATE INDEX authors_email_idx ON keyspace.authors (email);
CREATE INDEX authors_name_idx ON keyspace.authors (name);
Basically every index that you have here enables you to "search" over base
entities to find them by some condition. Most of the conditions are
actually pretty narrow which is a good news. But the thing is the indexes
will become massive (already did), especially on docs and authors. But I guess
doc's is more problematic.
You should consider making separate tables for this. Every index that
you create will be there on every node in the cluster and in the end
you will hold far more data than you really need too because under
the hood data is multiplied per node. When you add replication factor to this
system is using a lot of space without you even being aware.
The problem with joining nodes is that when they receive new data all
the data in the cluster needs to be rebuilt ... for every single node
in the cluster and this is costing you a lot of time. So basically you loose
all the benefits of "easy node joining" that cassandra has.
Now you might think that space will become problem when you write the data
into your new schema that is denormalized ....
If space is the problem you can use a technique called inverted indexes
where you just put the id of the information into the search table and
then you make second load in the main table. I used this on some project
where space was the issue but since you have all the main stuff indexed
space will probably not be a problem because you are already using a lot
more than you think. (my bet would be that you will also probably save significantly on space)
Anyway all the indexes should become tables ... if consistency is problem,
use batches (don't use materialized views yet because you might loose data).
My honest tip is that you stay away from indexes. I know
it's hell to refactor this plus it's hard to get the time to refactor :( But
I think it should be manageable.

Related

Cassandra performance on querying fewer or more nodes

Consider a growing number of data, let's choose from two extreme choices:
Evenly distribute all data across all nodes in the cluster
We pack them to as few nodes as possible
I prefer option 1 because as the volume of data grows, we can scatter it with all nodes, so that when each node is queried, it has the lowest load.
However, some resources state that we shouldn't query all the nodes because that will slow down the query. Why would that slow the query? Isn't that just a normal scatter and gather? They even claim this hurts linear scalability as adding more nodes will further drag down the query.
(Maybe I am missing on how Cassandra performs the query, some background reference is appreciated).
On the contrary, some resources state that we should go with option 2 because it queries the least number of nodes.
Of course there is no black and white choices here; everything must have a tradeoff.
I want to know, what's the real difference between option 1 and option 2. Plus, regarding the network querying, why option 1 would be slow.
I prefer option 1 because as the volume of data grows, we can scatter it with all nodes, so that when each node is queried, it has the lowest load.
You definitely want to go with option #1. This is also preferable, in that new or replacement nodes will stream much faster than a cluster made of fewer, dense nodes.
However, some resources state that we shouldn't query all the nodes because that will slow down the query.
And those resources are absolutely correct. First of all, if you read through the resources which Alex posted above you'll discover how to build your tables so that your queries can be served by a single node. Running queries which only hit a single node is the best way around that problem.
Why would that slow the query?
Because in a distributed database environment, query time becomes network time. There are many people out there who like to run multi-key or unbound queries against Cassandra. When that happens, and the query is unable to find a single node with the data, Cassandra picks one node to designate as a "coordinator."
That node builds the result set with data from the other nodes. Which means in a 30 node cluster, that one node is now pulling data from the other 29. Assuming that these requests don't time-out, the likelihood that the coordinator will crash due to trying to manage too much data is very high.
The bottom line, is that this is one of those tradeoffs between a CA relational database and an AP partitioned row store. Build your tables to support your queries, store data together which is queried together, and Cassandra will perform just fine.

Cassandra data model too many table

I have a single structured row as input with write rate of 10K per seconds. Each row has 20 columns. Some queries should be answered on these inputs. Because most of the queries needs different WHERE, GROUP BY or ORDER BY, The final data model ended up like this:
primary key for table of query1 : ((column1,column2),column3,column4)
primary key for table of query2 : ((column3,column4),column2,column1)
and so on
I am aware of the limit in number of tables in Cassandra data model (200 is warning and 500 would fail)
Because for every input row I should do an insert in every table, the final write per seconds became big * big data!:
writes per seconds = 10K (input)
* number of tables (queries)
* replication factor
The main question: am I on the right path? Is it normal to have a table for every query even when the input rate is already so high?
Shouldn't I use something like spark or hadoop instead of relying on bare datamodel? Or event Hbase instead of Cassandra?
It could be that Elassandra would resolve your problem.
The query system is quite different from CQL, but the duplication for indexing would automatically be managed by Elassandra on the backend. All the columns of one table will be indexed so the Elasticsearch part of Elassandra can be used with the REST API to query anything you'd like.
In one of my tests, I pushed a huge amount of data to an Elassandra database (8Gb) going non-stop and I never timed out. Also the search engine remained ready pretty much the whole time. More or less what you are talking about. The docs says that it takes 5 to 10 seconds for newly added data to become available in the Elassandra indexes. I guess it will somewhat depend on your installation, but I think that's more than enough speed for most applications.
The use of Elassandra may sound a bit hairy at first, but once in place, it's incredible how fast you can find results. It includes incredible (powerful) WHERE for sure. The GROUP BY is a bit difficult to put in place. The ORDER BY is simple enough, however, when (re-)ordering you lose on speed... Something to keep in mind. On my tests, though, even the ORDER BY equivalents was very fast.

What is the best way to query timeseries data with cassandra?

My table is a time series one. The queries are going to process the latest entries and TTL expire them after successful processing. If they are not successfully processed, TTL will not set.
The only query I plan to run on this is to select all entries for a given entry_type. They will be processed and records corresponding to processed entries will be expired.
This way every time I run this query I will get all records in the table that are not processed and processing will be done. Is this a reasonable approach?
Would using a listenablefuture with my own executor add any value to this considering that the thread doing the select is just processing.
I am concerned about the TTL and tombstones. But if I use clustering key of timeuuid type is this ok?
You are right one important thing getting in your way will be tombstones. By Default you will keep them around for 10 days. Depending on your access patter this might cause significant problems. You can lower this by setting the directly on the table or change it in the cassandra yaml file. Then it will be valid for all the newly created table gc_grace_seconds
http://docs.datastax.com/en/cql/3.1/cql/cql_reference/tabProp.html
It is very important that you make sure you are running the repair on whole cluster once within this period. So if you lower this setting to let's say 2 days, then within two days you have to have one full repair done on the cluster. This is very important because processed data will reaper. I saw this happening multiple times, and is never pleasant especially if you are using cassandra as a queue and it seems to me that you might be using it in your solution. I'll try to give some tips at the end of the answer.
I'm slightly worried about you setting the ttl dynamically depending on result. What would be the point of inserting the ttl-ed data that was successful and keeping forever the data that wasn't. I guess some sort of audit or something similar. Again this is a queue pattern, try to avoid this if possible. Also one thing to keep in mind is that you will almost always insert the data once in the beginning and then once again with the ttl should your processing be o.k.
Also getting all entries might be a bit tricky. For very moderate load 10-100 req/s this might be reasonable but if you have thousands per second getting all the requests every time might not be a good idea. At least not if you put them into single partition.
Separating the workload is also good idea. So yes using listenable future seems totally legit.
Setting clustering key to be timeuuid is usually the case with time series thata and I totally agree with you on this one.
In reality as I mentioned earlier you have to to take into account you will be saving 10 days worth of data (unless you tweak it) no matter what you do, it doesn't matter if you ttl it. It's still going to be ther, and every time cassandra will scan the partition will have to read the ttl-ed columns. In short this is just pain. I would seriously consider actually using something as kafka if I were you because what you are describing simply looks to me like a queue.
If you still want to stick with cassandra then please consider using buckets (adding date info to partitioning key and having a composite partitioning key). Depending on the load you are expecting you will have to bucket by month, week, day, hour even minutes. In some cases you might even want to add artificial columns to reduce load on the cluster. But then again this might be out of scope of this question.
Be very careful when using cassandra as a queue, it's a known antipattern. You can do it, but there are a lot of variables and it extremely depends on the load you are using. I once consulted a team that sort of went down the path of cassandra as a queue. Since basically using cassandra there was a must I recommended them bucketing the data by day (did some calculations that proved this is o.k. time unit) and I also had a look at this solution https://github.com/paradoxical-io/cassieq basically there are a lot of good stuff in this repo when using cassandra as a queue, data models etc. Basically this team had zombie rows, slow reading because of the tombstones etc. etc.
Also the way you described it it might happen that you have "hot rows" basically since you would just have one wide partition where all your data would go some nodes in the cluster might not even be that good utilised. This can be avoided by artificial columns.
When using cassandra as a queue it's very easy to mess a lot of things up. (But it's possible for moderate workloads)

Is it bad to use INDEX in Cassandra if performance is not important?

Background
We have recently started a "Big Data" project where we want to track what users are doing with our product - how often they are logging in, which features they are clicking on, etc - your basic user analytics stuff. We still don't know exactly what questions we will be asking, but most of it will be "how often did X occur over the last Y months?" type of thing, so we started storing the data sooner rather than later thinking we can always migrate, re-shape etc when we need to but if we don't store it it is gone forever.
We are now looking at what sorts of questions we can ask. In a typical RDBMS, this stage would consist of slicing and dicing the data in many different dimensions, exporting to Excel, producing graphs, looking for trends etc - it seems that for Cassandra, this is rather difficult to do.
Currently we are using Apache Spark, and submitting Spark SQL jobs to slice and dice the data. This actually works really well, and we are getting the data we need, but it is rather cumbersome as there doesn't seem to be any native API for Spark that we can connect to from our workstations, so we are stuck using the spark-submit script and a Spark app that wraps some SQL from the command line and outputs to a file which we then have to read.
The question
In a table (or Column Family) with ~30 columns running on 3 nodes with RF 2, how bad would it be to add an INDEX to every non-PK column, so that we could simply query it using CQL across any column? Would there be a horrendous impact on the performance of writes? Would there be a large increase in disk space usage?
The other option I have been investigating is using Triggers, so that for each row inserted, we populated another handful of tables (essentially, custom secondary index tables) - is this a more acceptable approach? Does anyone have any experience of the performance impact of Triggers?
Impact of adding more indexes:
This really depends on your data structure, distribution and how you access it; you were right before when you compared this process to RDMS. For Cassandra, it's best to define your queries first and then build the data model.
These guys have a nice write-up on the performance impacts of secondary indexes:
https://pantheon.io/blog/cassandra-scale-problem-secondary-indexes
The main impact (from the post) is that secondary indexes are local to each node, so to satisfy a query by indexed value, each node has to query its own records to build the final result set (as opposed to a primary key query where it is known exactly which node needs to be quired). So there's not just an impact on writes, but on read performance as well.
In terms of working out the performance on your data model, I'd recommend using the cassandra-stress tool; you can combine it with a data modeler tool that Datastax have built, to quickly generate profile yamls:
http://www.datastax.com/dev/blog/data-modeler
For example, I ran the basic stress profile without and then with secondary indexes on the default table, and the "with indexes" batch of writes took a little over 40% longer to complete. There was also an increase in GC operations / duration etc.

Cassandra - Distributing data and Multiple tables (Data modeling)

I am trying to learn cassandra. One thing I am not clear is how to ask Cassandra to distribute various tables. ie say I have time series data coming into table t1,t2,t3
T1 is heavily loaded ( ratio is 2000: 2:4 for num of rows).
I want the data of T1 for a given day to be not on the same machine as T2 or T3; so my queries are equally distributed ie not put too much load on one machine.
Also as the data gets older, its queried less, how can I take into account this factor.
regards
Cassandra is automatically distributed, you do not have a direct control on how the data gets distributed. In most cases, by default it makes use of an md5 on the row key and depending on that selects which nodes (computers) will be used to save the data.
What you are talking about would be more of a planning for a standard SQL database. However, if you generate extremely large amount of statistical data that is only to be used by some backend processes and users, you could have a separate cluster of 2 or 3 nodes. That way your other tables would not be affected by those statistics.
However, the true power of Cassandra is to be used with one large cluster. If it slows down, add nodes to it and do the necessary repair to spread the data properly. That's it... pretty much.
As for the way a table is used, you can use all the parameters defined on a table to tweak its setup. If you mainly do writes to a table, then you can tweak the parameters to get faster writes and slower reads. The other way around is also available: one write, many reads. And also many writes and many reads. To tweak those settings, in most cases you will need to have your software running and gather various stats and make changes as time passes.
Update:
There is actually a solution, thinking about it, just... I never use that mode so I did not think about it.
When you use a cluster which supports sorted rows, you can use a specific row name and the data will then go to a specific node. Again, you do not directly have control over what goes where, but if you really really want to do it that way, that's probably the solution you are looking for.
In this case, the row name would start with a number such as 0x0001 for T1 data, and 0x0100 and 0x0200 for T2 and T3. Since you do not know where the data really goes and how Cassandra decides to use it, it is rather complicated to obtain the right results here. And if you change your cluster (i.e. add nodes) then all your assumptions of where the data goes may very well go to the toilet! (and that's not speaking of upgrading to a new version of Cassandra...)

Resources