Is there a Cassandra operation to verify whether a Column Family contains a given key? We don't need any row data, only whether the key exists or not.
Best Regards
If you're using Java then create a SliceQuery for the rowKey and set begin/end values equal to the specific column key you're looking for. If there is a column with the specific key then the following expression will be true:
sliceQuery.execute().get().getColumns().size() > 0
One quick way of doing it is to ask for the column count for the row; if it's positive, the row exists. Because of tombstones there's a gray area around "does not exist": you can remove all columns for a row, but asking for data for that row may then return an empty set of columns instead of null (this depends a lot on which driver you're using). You should consider rows that have no columns as non-existent, so asking for the column count is probably the best way to determine whether a row exists.
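For reference, a minimal CQL sketch of the "ask for the column count" approach could look like this (the table name, key column, and key value below are assumptions, not from the question):
// Returns 0 when the row has no live columns, which you would treat as "does not exist"
SELECT count(*) FROM user_data WHERE key = 'some-row-key'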
There's some more information about this in the Cassandra FAQ under "range ghosts".
I am wondering how column slicing in a CQL WHERE clause affects read performance. Does Cassandra have some optimization that can fetch only the specific columns with the value, or does it have to retrieve all the columns of a row and check them one after another? E.g.: I have a primary key of (key1, key2), where key2 is the clustering key. I only want to find columns that match a certain key2, say value2.
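For concreteness, the kind of schema and query described above might look like this (the table and column names here are placeholders):
CREATE TABLE IF NOT EXISTS example
(
    key1 int,
    key2 int,
    value text,
    PRIMARY KEY (key1, key2)
);
// Restricting the clustering key key2 to a single value
SELECT value FROM example WHERE key1 = 1 AND key2 = 2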
Cassandra saves the data as cells - each value for a key+column is a cell. If you save several values for the key at once, they will be placed together in the same file. Also, since Cassandra writes to sstables, you can have several values saved for the same key+column (cell) in different files; Cassandra will read all of them and return the last written one, until compaction or repair occurs and the irrelevant values are deleted.
Good article about deletes/reads/tombstones:
http://thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html
I'm on my path to learning Cassandra and the differences between CQL and SQL, but I'm noticing the absence of a way to check whether a record exists in Cassandra. Currently, the best way that I have is to use
SELECT primary_keys FROM TABLE WHERE primary_keys = blah,
and check whether the result set is empty. Is there a better way to do this, or do I have the right idea for now?
Using count will make it traverse all the matching rows just to be able to count them. But you only need to check for one, so just add a LIMIT and return whatever comes back. Then interpret the presence of a result as true and its absence as false. E.g.,
SELECT primary_keys FROM TABLE WHERE primary_keys = blah LIMIT 1
That's the usual way in Cassandra to check whether a row exists. You might not want to return all the primary keys if all you care about is whether the row exists, so you could do this:
SELECT count(*) FROM TABLE WHERE primary_keys = blah
This would just return a 1 if the row exists, and a 0 if it doesn't exist.
If you are using the primary key to filter rows, all three of the above solutions (including yours) are fine, and I don't think there are real differences between them.
But if you are filtering rows in a more general way (such as by an indexed column or just the partition key), you should go with the "LIMIT 1" solution, which avoids useless network traffic.
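For example, with an indexed column the same "LIMIT 1" pattern would look like this (the column name indexed_column is an assumption):
SELECT primary_keys FROM TABLE WHERE indexed_column = blah LIMIT 1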
There is a related example at:
The best way to check existence of filtered rows in Cassandra? by user-defined aggregate?
The CQL3 specification description of the UPDATE statement begins with the following paragraph:
The UPDATE statement writes one or more columns for a given row in a
table. The (where-clause) is used to select the row to update and must
include all columns composing the PRIMARY KEY (the IN relation is only
supported for the last column of the partition key). Other columns
values are specified through (assignment) after the SET keyword.
The description in the specification of the DELETE statement begins with a similar paragraph:
The DELETE statement deletes columns and rows. If column names are provided
directly after the DELETE keyword, only those columns are deleted from the row
indicated by the (where-clause) (the id[value] syntax in (selection) is for
collection, please refer to the collection section for more details).
Otherwise whole rows are removed. The (where-clause) allows to specify the
key for the row(s) to delete (the IN relation is only supported for the last
column of the partition key).
The bolded portions of each of these descriptions state, in layman's terms, that these statements can be used to modify data in a solely row-based manner.
However, given the nature of the relationship (or lack thereof) between the rows of a table and its static columns (which exist independently of any particular row), it seems as though there should be a way to modify such columns given only the keys of the partitions they're respectively contained in. According to the specification, however, that does not seem to be possible, and I'm not sure whether that is a product of the difficulty of allowing it in the CQL3 syntax, or something else.
If a static column cannot be updated or deleted independently of any row in its table, then such operations become coupled with their non-static-column-based counterparts, making the set of columns targeted by such operations difficult to determine. For example, given a populated table with the following definition:
CREATE TABLE IF NOT EXISTS example_table
(
    partitionKeyColumn int,
    clusteringColumn int,
    nonPrimaryKeyColumn int,
    staticColumn varchar static,
    PRIMARY KEY (partitionKeyColumn, clusteringColumn)
)
... it is not immediately obvious if the following DELETE statements are equivalent:
//#1 (Explicitly specifies all of the columns in and "in" the target row)
DELETE partitionKeyColumn, clusteringColumn, nonPrimaryKeyColumn, staticColumn FROM example_table WHERE partitionKeyColumn = 1 AND clusteringColumn = 2
//#2 (Implicitly specifies all of the columns in (but not "in"?) the target row)
DELETE FROM example_table WHERE partitionKeyColumn = 1 AND clusteringColumn = 2
So, phrasing my observations in question form:
Are the above DELETE statements equivalent?
Does the primary key of at least one row in a CQL3 table have to be supplied in order to update or delete a static column in said table? If so, why?
I do not know about the specification, but in the real Cassandra world, your two DELETE statements are not equivalent.
The first statement deletes the static column, whereas the second one does not. The reason for this is that static columns are shared by the rows of a partition. You have to specify the static column explicitly to actually delete it.
Furthermore, I do not think it's a good idea to DELETE static columns and non-static columns at the same time. By the way, this statement won't work:
DELETE staticColumn FROM example_table WHERE partitionKeyColumn = 1 AND clusteringColumn = 2
The error output is:
Bad Request: Invalid restriction on clustering column priceable_name since the DELETE statement modifies only static columns
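For reference, a minimal sketch of the statements that do work for the example table above, restricting only the partition key (the values used are arbitrary):
// Deletes only the shared static column of partition 1
DELETE staticColumn FROM example_table WHERE partitionKeyColumn = 1
// A static column can likewise be updated with just the partition key
UPDATE example_table SET staticColumn = 'foo' WHERE partitionKeyColumn = 1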
I have a table with about 20 million rows in Cassandra.
The table is ordered by a primary_key column, which is a string. We are using 'ByteOrderedPartitioner', so the rows are ordered by the primary_key and not a hash of the primary_key column.
What is a good way to get the very last record in the table?
Thanks so much!
If for "very last record" you mean the one ordered as last I don't think you can do it like a "GET", you have to scan rows. The best you can do, afaik, is select a good range to scan (good start key) according to your primary key.
From datastax docs:
"Using the ordered partitioner allows ordered scans by primary key.
This means you can scan rows as though you were moving a cursor
through a traditional index. For example, if your application has user
names as the row key, you can scan rows for users whose names fall
between Jake and Joe. This type of query is not possible using
randomly partitioned row keys because the keys are stored in the order
of their MD5 hash (not sequentially)."
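As a rough sketch, that kind of ordered scan can be expressed in CQL with token() ranges (the table and column names are assumptions):
// With ByteOrderedPartitioner the token order follows the key's byte order,
// so this scans row keys between 'Jake' and 'Joe'
SELECT * FROM users WHERE token(user_name) >= token('Jake') AND token(user_name) <= token('Joe')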
If you find a better solution, let me know.
Regards,
Carlo
Here is an example use case:
You need to store the last N (let's say 1000, as a fixed bucket size) user actions, with all their details, in timeuuid-based columns.
Normally, each user's actions are already in a "UserAction" column family, with the user id as the row key and the actions in timeuuid columns. You may also have an "AllActions" column family which stores all actions, with the same timeuuid as the column name and the user id as the column value. It's basically a relationship column family, but unfortunately without any details of the user actions. Querying with this column family is expensive, I guess, because of the random partitioner. On the other hand, if you store all the details in the "AllActions" CF, then Cassandra can't handle that big row properly at some point. This is why I want to store the last N user actions, with all details, in a fixed number of timeuuid-based columns.
Maybe you have a better design solution for this use case... I'd like to hear it...
If not, the question is: how do you effectively implement a fixed number of (timeuuid) columns in Cassandra (with CQL)?
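To make that concrete, the per-user column family described above could be sketched in CQL roughly like this (all names, and the details column, are assumptions):
CREATE TABLE IF NOT EXISTS user_action
(
    user_id int,
    action_time timeuuid,
    details text,
    PRIMARY KEY (user_id, action_time)
) WITH CLUSTERING ORDER BY (action_time DESC);
// Read the most recent N (e.g. 1000) actions for one user
SELECT * FROM user_action WHERE user_id = 42 LIMIT 1000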
After insertion, we could delete the old (overflow) columns if we had some sort of range support in CQL's DELETE. AFAIK there is no support for this.
So, any idea? Thanks in advance...
IMHO, this is something that C* should handle itself, like compaction. It's not a good idea to handle this on the client side.
Maybe we need some configuration (storage) options for column families to make them suitable for "most recent data".