Somewhere I have heard that using multi row selection in cassandra is bad because for each row selection it runs new query, so for example if i want to fetch 1000 rows at once it would be the same as running 1000 separate queries at once, is that true?
And if it is how bad would it be to keep selecting around 50 rows each time page is loaded if say i have 1000 page views in a single minute, would it severely slow cassandra down or not?
P.S I'm using PHPCassa for my project
Yes, running a query for 1000 rows is the same as running 1000 queries (if you use the recommended RandomPartitioner). However, I wouldn't be overly concerned by this. In Cassandra, querying for a row by its key is a very common, very fast operation.
As to your second question, it's difficult to tell ahead of time. Build it and test it. Note that Cassandra does use in memory caching so if you are querying the same rows then they will cache.
We are using Playorm for Cassandra and there is a "findAll" pattern there which provides support to fetch all rows quickly. Visit
https://github.com/deanhiller/playorm/wiki/Support-for-retrieving-many-entities-in-parallel for more details.
1) I have little bit debugged the Cassandra code base and as per my observation to query multiple rows at the same time cassandra has provided the multiget() functionality which is also inherited in phpcassa.
2) Multiget is optimized to to handle the batch request and it saves your network hop.(like for 1k rows there will be 1k round trips, so it definitely reduces the time for 999 round trips)
3) More about multiget() in phpcassa: php cassa multiget()
Related
I have a table in cassandra DB that is populated. It provably has around 10000 records. When I try to execute select count(*), my query times out. Surprisingly, it times out even when i restrict the query with the partition key. The table has a column that is filled with a lot of text. I can't understand how that would be a problem, but i thought, i'd mention it. Any suggestions?
Doing a COUNT() of the rows in a partition shouldn't timeout unless it contains thousands and thousands of rows. More importantly, the query is most likely timing out when the partition contains thousands of tombstones.
You haven't provided a lot of information in your question and ideally you should have included:
the table schema
a sample query
In any case if you are storing queue-like datasets and deleting rows after they've been processed (because it's a queue) then you are generating lots of tombstones within the partition. Once you've reached the maximum tombstone_failure_threshold (default is 100K tombstones) then Cassandra will stop reading any more rows.
Unfortunately, it's hard to say what's happening in your case without the necessary details. Cheers!
A SELECT COUNT(*) needs to scan the entire database and can potentially take an extremely long time - much longer than the typical timeout. Ideally, such a query would be paged - periodically returning empty pages until the final page contains the count - to avoid timing out. But currently in Cassandra - and also in Scylla - this isn't done.
As Erick noted in his reply, everything becomes worse if you also have a lot of tombstones: You said you only have 10,000 rows, but it's easy to imagine a use case where the data changes frequently, and you actually have for each row 100 deleted rows - so Cassandra needs to scan through 1 million rows (most of them already dead), not 10,000.
Another issue to consider is that when your cluster is very large, scanning usually contact nodes sequentially, and each node many times (depending on the number of vnodes), so the scan time on a very large cluster will be large even if there are just a few actual rows in the database. By the way, unlike a regular scan, an aggregation like COUNT(*) can actually be done internally in parallel. Scylla recently implemented this and it speeds up counts (and other aggregation), but if I understand correctly, this feature is not in Cassandra.
Finally, you said that "Surprisingly, it times out even when i restrict the query with the partition key.". The question is how you restricted the query with a partition key. If you restricted the partition key itself to a range, it will still be slow because Cassandra still needs to scan all the partitions and compare their keys to the range. What you should have done is to restrict the token of the partition key, e.g., something like
where token(p) >= -9223372036854775808 and token(p) < ....
I have a single structured row as input with write rate of 10K per seconds. Each row has 20 columns. Some queries should be answered on these inputs. Because most of the queries needs different WHERE, GROUP BY or ORDER BY, The final data model ended up like this:
primary key for table of query1 : ((column1,column2),column3,column4)
primary key for table of query2 : ((column3,column4),column2,column1)
and so on
I am aware of the limit in number of tables in Cassandra data model (200 is warning and 500 would fail)
Because for every input row I should do an insert in every table, the final write per seconds became big * big data!:
writes per seconds = 10K (input)
* number of tables (queries)
* replication factor
The main question: am I on the right path? Is it normal to have a table for every query even when the input rate is already so high?
Shouldn't I use something like spark or hadoop instead of relying on bare datamodel? Or event Hbase instead of Cassandra?
It could be that Elassandra would resolve your problem.
The query system is quite different from CQL, but the duplication for indexing would automatically be managed by Elassandra on the backend. All the columns of one table will be indexed so the Elasticsearch part of Elassandra can be used with the REST API to query anything you'd like.
In one of my tests, I pushed a huge amount of data to an Elassandra database (8Gb) going non-stop and I never timed out. Also the search engine remained ready pretty much the whole time. More or less what you are talking about. The docs says that it takes 5 to 10 seconds for newly added data to become available in the Elassandra indexes. I guess it will somewhat depend on your installation, but I think that's more than enough speed for most applications.
The use of Elassandra may sound a bit hairy at first, but once in place, it's incredible how fast you can find results. It includes incredible (powerful) WHERE for sure. The GROUP BY is a bit difficult to put in place. The ORDER BY is simple enough, however, when (re-)ordering you lose on speed... Something to keep in mind. On my tests, though, even the ORDER BY equivalents was very fast.
I plan to use memsql to store my last 7 days data for real time analytics using SQL.
I checked the documentation and find out that there is no such TTL / expiration feature in MemSQL
Is there any such feature (in case I missed it)?
Is memsql fit the use case if I do daily delete on >7 days data? I quite curious about the fragmentation
We tried it on postgresql and we need to execute Vacuum command, it takes a long time to run.
There is no TTL/expiration feature. You can do it by running delete queries. Many customer use cases are doing this type of thing, so yes MemSQL does fit the use case. Fragmentation generally shouldn't be too much of a problem here - what kind of fragmentation are you concerned about?
There is No Out of the Box TTL feature in MemSQL.
We achieved TTL by adding an additional TS column in our MemSQL Rowstore table with TIMESTAMP(6) datatype.
This provides automatic current timestamp insertion when you add a new row to the table.
When querying data from this table, you can apply a simple filter based on this TIMESTAMP column to filter older records beyond your TTL value.
https://docs.memsql.com/sql-reference/v6.7/datatypes/#time-and-date
You can always have a batch job which can run one a month which can delete older data.
we have not seen any issues due to fragmentation but you can do below once in a while if fragmentation is a concern for you:
MemSQL’s memory allocators can become fragmented over time (especially if a large table is shrunk dramatically by deleting data randomly). There is no command currently available that will compact them, but running ALTER TABLE ADD INDEX followed by ALTER TABLE DROP INDEX will do it.
Warning
Caution should be taken with this work around. Plans will rebuild and the two ALTER queries are going to move all moves in the table twice, so this should not be used that often.
Reference:
https://docs.memsql.com/troubleshooting/latest/troubleshooting/
Background
We have recently started a "Big Data" project where we want to track what users are doing with our product - how often they are logging in, which features they are clicking on, etc - your basic user analytics stuff. We still don't know exactly what questions we will be asking, but most of it will be "how often did X occur over the last Y months?" type of thing, so we started storing the data sooner rather than later thinking we can always migrate, re-shape etc when we need to but if we don't store it it is gone forever.
We are now looking at what sorts of questions we can ask. In a typical RDBMS, this stage would consist of slicing and dicing the data in many different dimensions, exporting to Excel, producing graphs, looking for trends etc - it seems that for Cassandra, this is rather difficult to do.
Currently we are using Apache Spark, and submitting Spark SQL jobs to slice and dice the data. This actually works really well, and we are getting the data we need, but it is rather cumbersome as there doesn't seem to be any native API for Spark that we can connect to from our workstations, so we are stuck using the spark-submit script and a Spark app that wraps some SQL from the command line and outputs to a file which we then have to read.
The question
In a table (or Column Family) with ~30 columns running on 3 nodes with RF 2, how bad would it be to add an INDEX to every non-PK column, so that we could simply query it using CQL across any column? Would there be a horrendous impact on the performance of writes? Would there be a large increase in disk space usage?
The other option I have been investigating is using Triggers, so that for each row inserted, we populated another handful of tables (essentially, custom secondary index tables) - is this a more acceptable approach? Does anyone have any experience of the performance impact of Triggers?
Impact of adding more indexes:
This really depends on your data structure, distribution and how you access it; you were right before when you compared this process to RDMS. For Cassandra, it's best to define your queries first and then build the data model.
These guys have a nice write-up on the performance impacts of secondary indexes:
https://pantheon.io/blog/cassandra-scale-problem-secondary-indexes
The main impact (from the post) is that secondary indexes are local to each node, so to satisfy a query by indexed value, each node has to query its own records to build the final result set (as opposed to a primary key query where it is known exactly which node needs to be quired). So there's not just an impact on writes, but on read performance as well.
In terms of working out the performance on your data model, I'd recommend using the cassandra-stress tool; you can combine it with a data modeler tool that Datastax have built, to quickly generate profile yamls:
http://www.datastax.com/dev/blog/data-modeler
For example, I ran the basic stress profile without and then with secondary indexes on the default table, and the "with indexes" batch of writes took a little over 40% longer to complete. There was also an increase in GC operations / duration etc.
Have a table set up in Cassandra that is set up like this:
Primary key columns
shard - an integer between 1 and 1000
last_used - a timestamp
Value columns:
value - a 22 character string
Example if how this table is used:
shard last_used | value
------------------------------------
457 5/16/2012 4:56pm NBJO3poisdjdsa4djmka8k >-- Remove from front...
600 6/17/2013 5:58pm dndiapas09eidjs9dkakah |
...(1 million more rows) |
457 NOW NBJO3poisdjdsa4djmka8k <-- ..and put in back
The table is used as a giant queue. Very many threads are trying to "pop" the row off with the lowest last_used value, then update the last_used value to the current moment in time. This means that once a row is read, since last_used is part of the primary key, that row is deleted, then a new row with the same shard, value, and updated last_used time is added to the table, at the "end of the queue".
The shard is there because so many processes are trying to pop the oldest row off the front of the queue and put it at the back, that they would severely bottleneck each other if only one could access the queue at the same time. The rows are randomly separated into 1000 different "shards". Each time a thread "pops" a row off the beginning of the queue, it selects a shard that no other thread is currently using (using redis).
Holy crap, we must be dumb!
The problem we are having is that this operation has become very slow on the order of about 30 seconds, a virtual eternity.
We have only been using Cassandra for less than a month, so we are not sure what we are doing wrong here. We have gotten some indication that perhaps we should not be writing and reading so much to and from the same table. Is it the case that we should not be doing this in Cassandra? Or is there perhaps some nuance in the way we are doing it or the way that we have it configured that we need to change and/or adjust? How might be trouble-shoot this?
More Info
We are using the MurMur3Partitioner (the new random partitioner)
The cluster is currently running on 9 servers with 2GB RAM each.
The replication factor is 3
Thanks so much!
This is something you should not use Cassandra for. The reason you're having performance issues is because Cassandra has to scan through mountains of tombstones to find the remaining live columns. Every time you delete something Cassandra writes a tombstone, it's a marker that the column has been deleted. Nothing is actually deleted from disk until there is a compaction. When compacting Cassandra looks at the tombstones and determines which columns are dead and which are still live, the dead ones are thrown away (but then there is also GC grace, which means that in order to avoid spurious resurrections of columns Cassandra keeps the tombstones around for a while longer).
Since you're constantly adding and removing columns there will be enormous amounts of tombstones, and they will be spread across many SSTables. This means that there is a lot of overhead work Cassandra has to do to piece together a row.
Read the blog post "Cassandra anti-patterns: queues and queue-like datasets" for some more details. It also shows you how to trace the queries to verify the issue yourself.
It's not entirely clear from your description what a better solution would be, but it very much sounds like a message queue like RabbitMQ, or possibly Kafka would be a much better solution. They are made to have a constant churn and FIFO semantics, Cassandra is not.
There is a way to make the queries a bit less heavy for Cassandra, which you can try (although I still would say Cassandra is the wrong tool for this job): if you can include a timestamp in the query you should hit mostly live columns. E.g. add last_used > ? (where ? is a timestamp) to the query. This requires you to have a rough idea of the first timestamp (and don't do a query to find it out, that would be just as costly), so it might not work for you, but it would take some of the load off of Cassandra.
The system appears to be under stress (2GB or RAM may be not enough).
Please have nodetool tpstats run and report back on its results.
Use RabbitMQ. Cassandra is probably a bad choice for this application.