What is the Impact of ALLOW FILTERING on Cassandra?

According to the official Cassandra blog, ALLOW FILTERING is highly inefficient. But if for some reason one has to use such a query, what would be the impact on other applications that use Cassandra to get data? Would only the thread(s) busy fetching rows for my query be slow, or would the whole of Cassandra be slow, so that all other applications reading from Cassandra also get slow responses?

It will likely affect the whole node. One problem is that your one query with a limit of 10 will not just read 10 records and return, but (possibly) a LOT of data. It is possible to make efficient ALLOW FILTERING queries, which things like the Spark driver can do (token-limited queries per token range, or queries within a partition). I would very strongly recommend not even attempting it, though. It might work at first, but your poor operations team will curse your name.
With faster disks, the object allocations will cause serious GC overhead, since this is unthrottled. This is very similar to the issue seen when using queues or a lot of tombstones: the JVM building and throwing away the rows produces an allocation rate that the garbage collector cannot keep up with without longer pauses (early promotions, fragmentation in CMS, allocation spikes messing up G1 young gen ratios).
If the query crosses partitions, then as with normal range queries the coordinator will attempt to estimate the ranges it will need to read, and the replicas for them, to fan out with some limited concurrency. It's a rough estimate because it only has its own data to extrapolate from, and when the data is then further filtered, rather than just counting "number of partitions within range", the estimate is likely to be wrong and too low. Most likely it will query one range at a time, moving on to the next replica set's range if the limit isn't met. With vnodes this can be a very long list, and sequentially walking it will likely not complete within the timeout. Luckily this will impact mostly just the one query, but it is still essentially reading the entire dataset off disk from every replica set in the cluster for 1 query. If you make 100/sec the cluster will probably be hosed.
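To illustrate the "token-limited queries per token range" idea mentioned above, here is a minimal sketch that splits the full token range into bounded subranges and builds one CQL statement per subrange. It assumes the default Murmur3Partitioner; the table name, column names, and filter clause in the usage note are hypothetical, and this is not how the Spark driver is actually implemented.

```python
# Full token range of Murmur3Partitioner (assumed; it is Cassandra's default).
MIN_TOKEN = -(2**63)
MAX_TOKEN = 2**63 - 1


def split_token_ranges(num_splits):
    """Split the full token range into contiguous (start, end) subranges."""
    total = MAX_TOKEN - MIN_TOKEN
    bounds = [MIN_TOKEN + (total * i) // num_splits for i in range(num_splits + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def build_range_queries(table, partition_key, filter_clause, num_splits):
    """Build one token-bounded ALLOW FILTERING statement per subrange."""
    queries = []
    for i, (start, end) in enumerate(split_token_ranges(num_splits)):
        # The first subrange is inclusive at the start so no token is skipped;
        # later subranges are exclusive so adjacent ranges do not overlap.
        lower_op = ">=" if i == 0 else ">"
        queries.append(
            f"SELECT * FROM {table} "
            f"WHERE token({partition_key}) {lower_op} {start} "
            f"AND token({partition_key}) <= {end} "
            f"AND {filter_clause} ALLOW FILTERING"
        )
    return queries
```

For example, build_range_queries("ks.events", "id", "status = 'open'", 4) returns four statements that each scan a quarter of the ring, so each one touches a bounded slice of data and can be paced or retried on its own instead of walking every range in a single request.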


Could my large number of tables (2k+) be causing my write timeout exceptions?

I'm running OS Cassandra 3.11.9 with Datastax Java Driver 3.8.0. I have a Cassandra keyspace that has multiple tables functioning as lookup tables / search indices. Whenever I receive a new POST request to my endpoint, I parse the object and insert it in the corresponding Cassandra table. I also put inserts to each corresponding lookup table. (10-20 per object)
When ingesting a lot of data into the system, I've been running into WriteTimeoutExceptions in the driver.
I tried to serialize the insert requests into the lookup tables by introducing Apache Camel and putting all the Statements into a queue that the Session could work off of, but it did not help.
With Camel, since the exceptions are now happening in the Camel thread, the test continues to run, instead of failing on the first exception. Eventually, the test seems to crash Cassandra. (Nothing in the Cassandra logs though)
I also tried to turn off my lookup tables and instead insert into the main table 15x per object (to simulate a similar number of writes as if I had the lookup tables on). This test passed with no exception, which makes me think the large number of tables is the problem.
Is a large number (2k+) of Cassandra tables a code smell? Should we rearchitect or just throw more resources at it? (Nothing indicative has shown in the logs, mostly just some status about the number of tables etc.; no exceptions.)
Can the Datastax Java Driver be used multithreaded like this? It says it is threadsafe.
A high number of tables has a direct effect on performance: see this doc (the whole series is a good source of information), and this blog post for more details. Basically, with ~1000 tables, you get ~20-25% degradation of performance.
That could be a reason; not completely direct, but related. For each table, Cassandra needs to allocate memory, reserve a part of the memtable space for it, keep information about it, etc. This specific problem could come from blocked memtable flushes, or something like that. Check nodetool tpstats and nodetool tablestats for blocked or pending memtable flushes. It's better to set up a continuous monitoring solution, such as Metrics Collector for Apache Cassandra, and watch the important metrics, which include that information as well, over a period of time.

Why is it so bad to have large partitions in Cassandra?

I have seen this warning everywhere but cannot find any detailed explanation on this topic.
For starters:
The maximum number of cells (rows x columns) in a single partition is 2 billion.
If you allow a partition to grow unbounded you will eventually hit this limitation.
Outside that theoretical limit, there are practical limitations tied to the impact large partitions have on the JVM and on read times. These practical limitations are constantly increasing from version to version. They are not fixed but vary with data model, query patterns, heap size, and configuration, which makes it hard to give a straight answer on what's too large.
As of 2.1 and early 3.0 releases, the primary cost on reads and compactions comes from deserializing the index, which marks a row every column_index_size_in_kb. You can increase key_cache_size_in_mb for reads to prevent unnecessary deserialization, but that reduces heap space and fills old gen. You can increase the column index size, but that will increase worst-case IO costs on reads. There are also many different settings for CMS and G1 to tune the impact of a huge spike in object allocations when reading these big partitions. There are active efforts on improving this, so in the future it might no longer be the bottleneck.
Repairs also only go down to (in the best case) the partition level. So if, say, you are constantly appending to a partition, and the hashes of that partition on 2 nodes are computed at slightly different times (a distributed system essentially guarantees this), the entire partition must be streamed over to ensure consistency. Incremental repairs can reduce the impact of this, but you're still streaming massive amounts of data and causing significant disk usage fluctuations, and that data will then need to be compacted together unnecessarily.
You can probably keep adding corner cases and problem scenarios to this list. Many times large partitions are possible to read, but the tuning and corner cases involved are not really worth it; it is better to just design the data model to be friendly with how Cassandra expects it. I would recommend targeting 100 MB, but you can go far beyond that comfortably. Into the GBs, you will need to start considering tuning for it (depending on data model, use case, etc.).
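As a rough way to sanity-check a data model against the 2 billion cell limit and the ~100 MB target mentioned above, the arithmetic can be sketched as follows. The average cell size is an assumption you would have to measure for your own data, and the estimate ignores per-cell metadata and compression, so treat it as an order-of-magnitude check only.

```python
MAX_CELLS_PER_PARTITION = 2_000_000_000       # hard limit quoted above
TARGET_PARTITION_BYTES = 100 * 1024 * 1024    # the ~100 MB target suggested above


def estimate_partition(rows, columns_per_row, avg_cell_bytes):
    """Estimate cell count and raw size of one partition.

    avg_cell_bytes is an assumed average; real on-disk size also depends
    on metadata overhead and compression, which are ignored here.
    """
    cells = rows * columns_per_row
    size_bytes = cells * avg_cell_bytes
    return {
        "cells": cells,
        "size_bytes": size_bytes,
        "exceeds_cell_limit": cells > MAX_CELLS_PER_PARTITION,
        "exceeds_size_target": size_bytes > TARGET_PARTITION_BYTES,
    }
```

For instance, a hypothetical sensor writing one 5-column row per second with 20-byte cells gives estimate_partition(86_400, 5, 20) for a daily partition, which stays well under both thresholds, while a yearly partition (31_536_000 rows) stays under the cell limit but lands in the multi-GB range.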

Pros and Cons of Cassandra User Defined Functions

I am using Apache Cassandra to store mostly time series data, and I am grouping the data and aggregating/counting it based on some conditions. At the moment I am doing this in a Java 8 application, but with the release of Cassandra 3.0 and User Defined Functions, I have been asking myself whether moving the grouping and aggregation/counting logic into Cassandra is a good idea. To my understanding this functionality is something like stored procedures in SQL.
My concern is if this will impact the computation performance and the overall performance of the database. I am also not sure if there are other issues with it and if this new feature is something like the secondary indexes in Cassandra - you can do them, but it is not recommended at all.
Have you used user defined functions in Cassandra? Do you have any observations on the performance? What are the good and bad sides of this new functionality? Is it applicable in my use case?
You can compare it to using count() or avg() kinds of aggregations. They can save you a lot of network traffic and object creation/GC by having the coordinator only send the result, but it's easy to get carried away and make the coordinator do a lot of work. This extra work takes away from normal C* duties, and can just as likely increase GCs as reduce them.
If you're aggregating 100 rows in a partition it's probably fine, and if you're aggregating 10000 it's probably not the end of the world if it's very rare. If you're calling it once a second, though, it's a problem. If you're aggregating over 1000 rows I would be very careful.
If you absolutely need to do it, and it's a lot of data often, you may want to create dedicated proxy coordinators (-Dcassandra.join_ring=false) to bear the brunt of the load without impacting normal C* reads/writes. At that point it's just as easy to create a dedicated workload DC for it or something (with RF=0 for your keyspace, and the application set to be part of that DC with DCAwareRoundRobinPolicy). This is also the point where using Spark is probably the right thing to do.

Will having 1000s of CFs lead to OOM in Cassandra?

I have a cluster with many CFs (around 1000, maybe more), and I get OOM errors from time to time on different nodes. We have three Cassandra nodes. Is this expected behavior in Cassandra?
Each table (columnfamily) requires a minimum of 1MB of heap memory, so it's quite possible this is causing some pressure for you.
The best solution is to redesign your application to use fewer tables; most of the time I've seen this, it's because someone designed it to have "one table per X" where X is a customer or a data source or even a time period. Instead, combine tables with a common schema and add a column to the primary key with the distinguishing element.
In the short term, you probably need to increase your heap size.
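To put the 1MB-per-table minimum quoted above into perspective, a back-of-the-envelope calculation might look like the sketch below. The 4096 MB heap size in the usage note is purely an assumed example value, and real per-table overhead can be higher than the minimum.

```python
MIN_HEAP_PER_TABLE_MB = 1  # minimum per-table heap cost quoted above


def min_table_heap_mb(num_tables, per_table_mb=MIN_HEAP_PER_TABLE_MB):
    """Lower bound on heap (in MB) consumed just by having the tables."""
    return num_tables * per_table_mb


def table_heap_fraction(num_tables, heap_size_mb, per_table_mb=MIN_HEAP_PER_TABLE_MB):
    """Fraction of the heap taken by the per-table minimum alone."""
    return min_table_heap_mb(num_tables, per_table_mb) / heap_size_mb
```

With 1000 tables, min_table_heap_mb(1000) is about 1 GB, and table_heap_fraction(1000, 4096) shows that roughly a quarter of an assumed 4 GB heap is gone before any data is read or written.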

Cassandra multiget performance

I've got a cassandra cluster with a fairly small number of rows (2 million or so, which I would hope is "small" for cassandra). Each row is keyed on a unique UUID, and each row has about 200 columns (give or take a few). All in all these are pretty small rows, no binary data or large amounts of text. Just short strings.
I've just finished the initial import into the cassandra cluster from our old database. I've tuned the hell out of cassandra on each machine. There were hundreds of millions of writes, but no reads. Now that it's time to USE this thing, I'm finding that read speeds are absolutely dismal. I'm doing a multiget using pycassa on anywhere from 500 to 10000 rows at a time. Even at 500 rows, the performance is awful, sometimes taking 30+ seconds.
What would cause this type of behavior? What sort of things would you recommend after a large import like this? Thanks.
Sounds like you are I/O-bottlenecked. Cassandra does about 4000 reads/s per core, IF your data fits in RAM. Otherwise you will be seek-bound just like anything else.
I note that normally "tuning the hell" out of a system is reserved for AFTER you start putting load on it. :)
Is it an option to split up the multi-get into smaller chunks? By doing this you would be able to spread your get across multiple nodes, and potentially increase your performance, both by spreading the load across nodes and having smaller packets to deserialize.
That brings me to the next question: what is your read consistency set to? In addition to an IO bottleneck as @jbellis mentioned, you could also have a network traffic issue if you are requiring a particularly high level of consistency.
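The suggestion above to split the multiget into smaller chunks could be sketched like this. The fetch argument is a hypothetical caller-supplied function (for example, a thin wrapper around your client's multiget call) that takes a list of keys and returns a dict of results; the default chunk size of 100 is an assumption to tune, not a recommendation from the answers.

```python
from itertools import islice


def chunked(keys, chunk_size):
    """Yield successive lists of at most chunk_size keys."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    it = iter(keys)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


def multiget_in_chunks(fetch, keys, chunk_size=100):
    """Call fetch(chunk) once per chunk and merge the returned dicts."""
    results = {}
    for chunk in chunked(keys, chunk_size):
        results.update(fetch(chunk))
    return results
```

This version issues the chunks one after another, which keeps each response small; to actually spread load across nodes at the same time, the chunks would need to be sent concurrently (for example from a thread pool), at the cost of more simultaneous work on the cluster.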
