I'm writing a file with user profiles into cassandra with 5M profiles.
My write operation finished sucessfully.
I want to count the number of rows in my column family.
Keyspace keyspaceOperator = HFactory.createKeyspace(KEY_SPACE, cluster);
CqlQuery<String,String,Long> cqlQuery = new CqlQuery<String,String,Long>(keyspaceOperator, se, se, new LongSerializer());
cqlQuery.setQuery("SELECT COUNT(*) FROM up");
QueryResult<CqlRows<String,String,Long>> result = cqlQuery.execute();
System.out.println(result.get().getAsCount());
But the following code prints me always 10000.
What am I doing wrong? And how can I make this operation from cli?
You can't for now. There's a default limit of 10K rows per query. There's an open ticket for this (CASSANDRA-3702) but no fix as of yet.
Only other alternative is to iterate via RangeSlicesQuery. I created a "census" program to count both rows and total columns; here's a version for long types. But, if this is a frequent activity, conventional wisdom seems to be to use a separate counter column to keep track; some discussion here.
You simply need to give a limit that's as large as you want to count. If you don't expect the count ever to go over 1e9, then do
SELECT COUNT(*) FROM up LIMIT 1000000000;
But be aware that COUNT (and RangeSlicesQuery too) are not at all performant, or even meant to be. They're essentially the same as a "sequential scan" in relational db parlance. A counter is a better way to address this sort of problem in a distributed system.
Please refer here for an example that does this.
You can freely use the code. Please note that Astyanax has been branched out of Hector and we are finding that it is a very good Cassandra client in Java.
Related
I have a use case in which I utilize ScyllaDB to limit users' actions in the past 24h. Let's say the user is only allowed to make an order 3 times in the last 24h. I am using ScyllaDB's ttl and making a count on the number of records in the table to achieve this. I am also using https://github.com/spaolacci/murmur3 to get the hash for the partition key.
However, I would like to know what is the most efficient way to query the table. So I have a few queries in which I'd like to understand better and compare the behavior(please correct me if any of my statement is wrong):
using count()
count() will implement a full-scan query, meaning that it may query more than necessary records into the table.
SELECT COUNT(1) FROM orders WHERE hash_id=? AND user_id=?;
using limit
limit will only limit the number of records being returned to the client. Meaning it will still query all records that match its predicates but only limit the ones returned.
SELECT user_id FROM orders WHERE hash_id=? AND user_id=? LIMIT ?;
using paging
I'm a bit new to this, but if I read the docs correctly it should only query the up until it received the first N records without having to query the whole table. So if I limit the page size to a number of records I want to fetch and only query the first page, would it work correctly? and will it have a consistent result?
docs: https://java-driver.docs.scylladb.com/stable/manual/core/paging/index.html
my query is still using limit, but utilizing the driver to achieve this with https://github.com/gocql/gocql
iter := conn.Query(
"SELECT user_id FROM orders WHERE hash_id=? AND user_id=? LIMIT ?",
hashID,
userID,3
).PageSize(3).PageState(nil).Iter()
Please let me know if my analysis was correct and which method would be best to choose
Your client should always use paging - otherwise you risk adding pressure to the query coordinator, which may introduce latency and memory fragmentation. If you use the Scylla Monitoring stack (and you should if you don't!), refer to the CQL Optimization dashboard and - more specifically - to the Paged Queries panel.
Now, to your question. It seems to be that your example is a bit minimalist for what you are actually wanting to achieve and - even then - should it not be, we have to consider such set-up at scale. Eg: There may be a tenant allowed which is allowed to place 3 orders within a day, but another tenant allowed to place 1 million orders within a week?
If the above assumption is correct - and with the options at hand you have given - you are better off using LIMIT with paging. The reason is because there are some particular problems with the description you've given at hand:
First, you want to retrieve N amount of records within a particular time-frame, but your queries don't specify such time-frame
Second, either COUNT or LIMIT will initiate a partition scan, and it is not clear how a hash_id + user_id combination can be done to determine the number of records within a time-frame.
Of course, it may be that I am wrong, but I'd like to suggest different some approaches which may be or not applicable for you and your use case.
Consider a timestamp component part of the clustering key. This will allow you to avoid full partition scans, with queries such as:
SELECT something FROM orders WHERE hash_id=? AND user_id=? AND ts >= ? AND ts < ?;
If the above is not applicable, then perhaps a Counter Table would suffice your needs? You could simply increment a counter after an order is placed, and - after - query the counter table as in:
SELECT count FROM counter_table WHERE hash_id=? AND user_id=? AND date=?;
I hope that helps!
I have a few points I want to add to what Felipe wrote already:
First, you don't need to hash the partition key yourself. You can use anything you want for the partition key, even consecutive numbers, the partition key doesn't need to be random-looking. Scylla will internally hash the partition key on its own to improve the load balancing. You don't need to know or care which hashing algorithm ScyllaDB uses, but interestingly, it's a variant of murmur3 too (which is not identical to the one you used - it's a modified algorithm originally picked by the Cassandra developers).
Second, you should know - and decide whether you care - that the limit you are trying to enforce is not a hard limit when faced with concurrent operations: Imagine that the given partition already has two records - and now two concurrent record addition requests come in. Both can check that there are just two records, decide it's fine to add the third - and then when both add their record - and you end up with four records. You'll need to decide whether this is fine for you that a user can get in 4 requests in a day if they are lucky, or it's a disaster. Note that theoretically you can get even more than 4 - if the user managest to send N requests at exactly the same time, they may be able to get 2+N records in the database (but in the usual case, they won't manage to get many superflous records). If you'll want 3 to be a hard limit, you'll probably needs to change your solution - perhaps to one based on LWT and not use TTL.
Third, I want to note that there is not an important performance difference between COUNT and LIMIT when you know a-priori that there will only be up to 3 (or perhaps, as explained above, 4 or some other similarly small number) results. If you assume that the SELECT only yields three or less results, and it can never be a thousand results, then it doesn't really matter if you just retrieve them or count them - you should just do whichever is convenient for you. In any case, I think that paging is not a good solution your need. For such short results and you can just use the default page size and you'll never reach it anyway, and also paging hints the server that you will likely continue reading on the next page - and it caches the buffers it needs to do that - while in this case you know that you'll never continue after the first three results. So in short, don't use any special paging setup here - just use the default page size (which is 1MB) and it will never be reached anyway.
Please help me to resolve a confusion. Cassandra book Claims that attempts to query based on column that is not a part of a PK should fail (No secondary index for this column as well). However when I try to do it I can see this warning:
Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING
Once I append ALLOW FILTERING to my query, there is no more error. I understand the implication on performance - however there is a clear contradiction to what is written in the book. Was this feature added later or book authors simply missed this?
I think it is great you have a textbook to guide you through important noSQL concepts, but don't rely on it as CASSANDRA is open source and is constantly updated by the community. Online resources such as the official apache documentation is a much better option to retrieve updated information / tutorials on new and existing features.
Although ALLOW FILTERING does exist, it is still recommended to use a different table construction (e.g. changing the column to a key) or create an INDEX to keep querying fast.
AFAIK, Cassandra has ALLOW FILTERING from version 1.
Also to explain ALLOW FILTERING,
As per the datastax documentation,
Let’s take for example the following table:
CREATE TABLE blogs (blogId int,
time1 int,
time2 int,
author text,
content text,
PRIMARY KEY(blogId, time1, time2));
If you execute the following query:
SELECT * FROM blogs;
Cassandra will return you all the data that the table blogs contains.
If you now want only the data at a specified time1, you will naturally add an equal condition on the column time1:
SELECT * FROM blogs WHERE time1 = 1418306451235;
In response, you will receive the following error message:
Bad Request: Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING.
Cassandra knows that it might not be able to execute the query in an efficient way. It is therefore warning you: “Be careful. Executing this query as such might not be a good idea as it can use a lot of your computing resources”.
The only way Cassandra can execute this query is by retrieving all the rows from the table blogs and then by filtering out the ones which do not have the requested value for the time1 column.
If your table contains for example a 1 million rows and 95% of them have the requested value for the time1 column, the query will still be relatively efficient and you should use ALLOW FILTERING.
On the other hand, if your table contains 1 million rows and only 2 rows contain the requested value for the time1 column, your query is extremely inefficient. Cassandra will load 999, 998 rows for nothing. If the query is often used, it is probably better to add an index on the time1 column.
Unfortunately, Cassandra has no way to differentiate between the 2 cases above as they are depending on the data distribution of the table. Cassandra is therefore warning you and relying on you to make the good choice.
Thanks,
Harry
I have a single structured row as input with write rate of 10K per seconds. Each row has 20 columns. Some queries should be answered on these inputs. Because most of the queries needs different WHERE, GROUP BY or ORDER BY, The final data model ended up like this:
primary key for table of query1 : ((column1,column2),column3,column4)
primary key for table of query2 : ((column3,column4),column2,column1)
and so on
I am aware of the limit in number of tables in Cassandra data model (200 is warning and 500 would fail)
Because for every input row I should do an insert in every table, the final write per seconds became big * big data!:
writes per seconds = 10K (input)
* number of tables (queries)
* replication factor
The main question: am I on the right path? Is it normal to have a table for every query even when the input rate is already so high?
Shouldn't I use something like spark or hadoop instead of relying on bare datamodel? Or event Hbase instead of Cassandra?
It could be that Elassandra would resolve your problem.
The query system is quite different from CQL, but the duplication for indexing would automatically be managed by Elassandra on the backend. All the columns of one table will be indexed so the Elasticsearch part of Elassandra can be used with the REST API to query anything you'd like.
In one of my tests, I pushed a huge amount of data to an Elassandra database (8Gb) going non-stop and I never timed out. Also the search engine remained ready pretty much the whole time. More or less what you are talking about. The docs says that it takes 5 to 10 seconds for newly added data to become available in the Elassandra indexes. I guess it will somewhat depend on your installation, but I think that's more than enough speed for most applications.
The use of Elassandra may sound a bit hairy at first, but once in place, it's incredible how fast you can find results. It includes incredible (powerful) WHERE for sure. The GROUP BY is a bit difficult to put in place. The ORDER BY is simple enough, however, when (re-)ordering you lose on speed... Something to keep in mind. On my tests, though, even the ORDER BY equivalents was very fast.
I have a cassandra table that looks like the following:
create table position_snapshots_by_security(
securityCode text,
portfolioId int,
lastUpdated date,
units decimal,
primary key((securityCode), portfolioId)
)
And I would like to something like this:
update position_snapshots_by_security
set units = units + 12.3,
lastUpdated = '2017-03-02'
where securityCode = 'SPY'
and portfolioId = '5dfxa2561db9'
But it doesn't work.
Is it possible to do this kind of operation in Cassandra? I am using version 3.10, the latest one.
Thank you!
J
This is not possible in Cassandra (any version) because it would require a read-before-write (anti-)pattern.
You can try the counter columns if they suit your needs. You could also try to caching/counting at application level.
You need to issue a read at application level otherwise, killing your cluster performance.
Cassandra doesn't do a read before a write (except when using Lightweight Transactions) so it doesn't support operations like the one you're trying to do which rely on the existing value of a column. With that said, it's still possible to do this in your application code with Cassandra. If you'll have multiple writers possibly updating this value, you'll want to use the aforementioned LWT to make sure the value is accurate and multiple writers don't "step on" each other. Basically, the steps you'll want to follow to do that are:
Read the current value from Cassandra using a SELECT. Make sure you're doing the read with a consistency level of SERIAL or LOCAL_SERIAL if you're using LWTs.
Do the calculation to add to the current value in your application code.
Update the value in Cassandra with an UPDATE statement. If using a LWT you'll want to do UPDATE ... IF value = previous_value_you_read.
If using LWTs, the UPDATE will be rejected if the previous value that you read changed while you were doing the calculation. (And you can retry the whole series of steps again.) Keep in mind that LWTs are expensive operations, particularly if the keys you are reading/updating are heavily contended.
Hope that helps!
I have the following table (using CQL3):
create table test (
shard text,
tuuid timeuuid,
some_data text,
status text,
primary key (shard, tuuid, some_data, status)
);
I would like to get rows ordered by tuuid. But this is only possible when I restrict shard - I get this is due to performance.
I have shard purely for sharding, and I can potentially restrict its range of values to some small range [0-16) say. Then, I could run a query like this:
select * from test where shard in (0,...,15) order by tuuid limit L;
I may have millions of rows in the table, so I would like to understand the performance characteristics of such a order by query. It would seem like the performance could be pretty bad in general, BUT with a limit clause of some reasonable number (order of 10K), this may not be so bad - i.e. a 16 way merge but with a fairly low limit.
Any tips, advice or pointers into the code on where to look would be appreciated.
Your data is sorted according to your column key. So the performance issue in your merge in your query above does not happen due to the WHERE clause but because of your LIMIT clause, afaik.
Your columns are inserted IN ORDER according to tuuid so there is no performance issue there.
If you are fetching too many rows at once, I recommended creating a test_meta table where you store the latest timeuuid every X-inserts, to get an upper bound on the rows your query will fetch. Then, you can change your query to:
select * from test where shard in (0,...,15) and tuuid > x and tuuid < y;
In short: make use of your column keys and get rid of the limit. Alternatively, in Cassandra 2.0, there will be pagination which will help here, too.
Another issue I stumbled over, you say that
I may have millions of rows in the table
But according to your data model, you will have exactly shard number of rows. This is your row key and - together with the partitioner - will determine the distribution/sharding of your data.
hope that helps!
UPDATE
From my personal experience, cassandra performances quite well during heavy reads as well as writes. If the result sets became too large, I rather experienced memory issues on the receiving/client side rather then timeouts on the server side. Still, to prevent either, I recommend having a look a the upcoming (2.0) pagination feature.
In the meanwhile:
Try to investigate using the trace functionality in 1.2.
If you are mostly reading the "latest" data, try adding a reversed type.
For general optimizations like caches etc, first, read how cassandra handles reads on a node and then, see this tuning guide.