I've defined a column family that uses composite ids for the row keys; say the composite key is CompositeType(LongType, LongType). I've tested storing items with this type, and that works fine, and SELECT works as expected too when I know the full key. But let's say I want all keys that have 0 as the first element and anything as the second. So far the only way I can see to perform this query is as follows:
if I want all keys that are 0:*, then I would do a CQL query for key >= 0:0 AND key < 1:0, which works as long as there is an order-preserving partitioner.
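In other words, the query I'm running is shaped roughly like this (a sketch; the table name items and the way the 0:0 / 1:0 composite literals are encoded are placeholders for whatever your driver produces):

-- fetch every row whose composite key starts with 0, i.e. 0:*
SELECT * FROM items WHERE key >= '0:0' AND key < '1:0';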
My questions are:
1) Is this odd syntax only because I'm using a CQL driver (the only option for Node.js aside from Thrift)?
2) Is there any inefficiency with this type of query? Essentially I'm using a composite key instead of super columns, since those aren't supported in CQL. I have no problem dealing with this logic in the code as long as there are no limitations to using it like this.
I would suggest you change your data model. Use RandomPartitioner and just have the first component as the row key. Push the second component into the column names, that is, make your column names composites instead.
Since column names are always sorted, you can do easy slicing operations. For example,
a) When you know both components, do a get slice on the row key (the first component) and the first element of the composite column name.
b) When you know just the first component, fetch the complete row for that row key.
This is the approach CQL3 takes when you ask it to create a table with multiple primary key columns.
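For example, a minimal CQL 3 sketch of this model (table and column names invented for illustration):

-- 'a' becomes the partition (row) key, 'b' becomes a clustering
-- column stored in the composite column names
CREATE TABLE items (
    a bigint,
    b bigint,
    value text,
    PRIMARY KEY (a, b)
);

-- case (a): both components known
SELECT * FROM items WHERE a = 0 AND b = 42;

-- case (b): only the first component known; fetches the whole row
SELECT * FROM items WHERE a = 0;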
Your best option is to use CQL 3. This will let you use composites underneath to optimize your lookups while still allowing you to use the parts of the composite values as though they were separate columns. You're currently using composites in your row keys, and CQL 3 only supports composites in column names (so far), but that's probably ok. In many cases like this, shifting the compositing from the row key to the column name won't have an adverse effect on your performance or data distribution, but if your row keys aren't sufficiently selective, then it might.
Either way, though, you should be looking at CQL 3. CQL 2 is deprecated. I could tell you more about how to adapt your model for CQL 3 if I knew more about your situation.
Related
I've been doing a lot of reading lately on Cassandra data modelling and best practices.
What escapes me is what the best practice is for choosing a partition key if I want an application to page through results via the token function.
My current problem is that I want to display 100 results per page in my application and be able to move on to the next 100 after.
From this post: https://stackoverflow.com/a/24953331/1224608
I was under the impression a partition key should be selected such that data spreads evenly across each node. That is, a partition key does not necessarily need to be unique.
However, if I'm using the token function to page through results, eg:
SELECT * FROM table WHERE token(partitionKey) > token('someKey') LIMIT 100;
That would mean that the number of results returned from my partition may not necessarily match the number of results I show on my page, since multiple rows may have the same token(partitionKey) value. Or worse, if the number of rows that share the partition key exceeds 100, I will miss results.
The only way I could guarantee 100 results on every page (barring the last page) is if I were to make the partition key unique. I could then read the last key in my page and fetch the next page with an almost identical query:
SELECT * FROM table WHERE token(partitionKey) > token('lastKeyOfCurrentPage') LIMIT 100;
But I'm not certain if it's good practice to have a unique partition key for a complex table.
Any help is greatly appreciated!
"But I'm not certain if it's good practice to have a unique partition key for a complex table."
It depends on your requirements and data model how you should choose your partition key. If your primary key is just a partition key, it has to be unique; otherwise the data will be upserted (overwritten with new data). If you have a wide row (a clustering key), then making your partition key unique (a key that appears only once in the table) will not serve the purpose of wide rows. In CQL, "wide rows" just means that there can be more than one row per partition, but here there would be only one row per partition. It would be better if you could provide the schema.
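To illustrate the distinction with a hypothetical schema (invented here, since the actual one wasn't posted):

-- wide-row model: many rows share one partition key and are
-- ordered within the partition by the clustering key
CREATE TABLE user_events (
    user_id  uuid,
    event_ts timeuuid,
    payload  text,
    PRIMARY KEY (user_id, event_ts)
);

-- unique-partition-key model: exactly one row per partition, so
-- token()-based paging returns predictable counts, but you lose
-- the wide-row layout
CREATE TABLE events_flat (
    event_id uuid PRIMARY KEY,
    payload  text
);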
Please follow the links below about pagination in Cassandra.
You do not need to use tokens if you are using Cassandra 2.0+. Cassandra 2.0 has automatic paging: instead of using the token function to page through results, it is now a built-in feature.
Results pagination in Cassandra (CQL)
https://www.datastax.com/dev/blog/client-side-improvements-in-cassandra-2-0
https://docs.datastax.com/en/developer/java-driver/2.1/manual/paging/
Saving and reusing the paging state
You can use the pagingState object, which represents where you were in the result set when the last page was fetched.
EDITED:
Please check the link below:
Paging Resultsets in Cassandra with compound primary keys - Missing out on rows
I recently did a POC for a similar problem, so let me add it here quickly.
First, there is a table with two fields; just for illustration we use only a few fields. Say we insert a million rows into this table.
Along comes the product owner with a (rather strange) requirement that we need to list all the data as pages in the GUI; assume a hundred entries split into ten pages of ten rows each.
For this we add a column called page_no to the table.
Create a secondary index on this column.
Then do a one-time update of this column with page numbers: page number 10 means ten contiguous rows updated with page_no set to 10.
Since we can query on the secondary index, each page can be fetched independently, as sketched below.
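A minimal CQL sketch of the idea (table and column names are mine, not necessarily those used in the repo linked below):

-- hypothetical two-field table
CREATE TABLE test_data (
    id   uuid PRIMARY KEY,
    name text
);

-- add the paging column and index it
ALTER TABLE test_data ADD page_no int;
CREATE INDEX ON test_data (page_no);

-- one-time backfill: each batch of ten contiguous rows gets its
-- page number, e.g. UPDATE test_data SET page_no = 10 WHERE id = ...;

-- each page can then be queried independently
SELECT * FROM test_data WHERE page_no = 10;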
The code is self-explanatory and lives here: https://github.com/alexcpn/testgo
Note that cautions on how to use secondary indexes properly abound, so please check them; in this use case I hope I am using one properly. I have not tested this with multiple clusters.
"In practice, this means indexing is most useful for returning tens,
maybe hundreds of results. Bear this in mind when you next consider
using a secondary index." From http://www.wentnet.com/blog/?p=77
I have inserted string and integer values into dynamic columns in a Cassandra column family. When I query the values in CQL, they are displayed as hex-encoded bytes.
Can I somehow tell the query to decode the value into a string or integer?
I also would be happy to do this in the CLI if that's easier. There I see you can specify assume <column_family> validator as <type>;, but that applies to all columns and they have different types, so I have to run the assumption and query many times.
(Note that the columns are dynamic, so I haven't specified the validator when creating the column family).
You can use ASSUME in cqlsh like in cassandra-cli (although it only applies to printing values, not sending them, but that ought to be ok for you). You can also use it on a per-column basis, like:
ASSUME <column_family> ('anchor:cnnsi.com') VALUES ARE text;
...although (a), I just tested it, and this functionality is broken in cassandra-1.1.1 and later; I posted a fix at CASSANDRA-4352. And (b), this probably isn't a very versatile or helpful solution for more than a few one-off uses. I'd strongly recommend using CQL 3 here, as direct CQL support for wide storage-engine rows like this is deprecated. Your table is certainly adaptable to an (easier to use) CQL 3 model, but I couldn't say exactly what it would be without knowing more about how you're using it.
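As a rough idea of the kind of mapping involved (purely illustrative, since the actual model isn't shown), a dynamic column family usually translates into a CQL 3 table where the old column name becomes a clustering column:

-- the old row key becomes the partition key, the dynamic column
-- name becomes a clustering column, and the cell becomes 'value';
-- the types here are guesses
CREATE TABLE anchors (
    key     text,
    column1 text,   -- e.g. 'anchor:cnnsi.com'
    value   text,
    PRIMARY KEY (key, column1)
) WITH COMPACT STORAGE;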
I have to test different data models for Cassandra. I'm thinking about using a composite key made of key1:key2 for the row key.
With this configuration on Cassandra, for example, I can query for all the rows having a specific key1 value and any key2 value, but not the other way around (all the rows having a specific key2 value and any key1).
Is that right?
thanks in advance
Cesare
If you use Order Preserving Partitioning (OPP), then yes, the keys will be stored sorted, and then you can get slices over a range of keys e.g. A:A to A:Z -- but not necessarily any:A to any:Z.
But OPP is not guaranteed to evenly distribute the keys across the nodes, and you could end up with "hot spots" of too many or too few keys. You probably want to use Random Partitioning (RP), which distributes the keys across all nodes by storing them by hash.
However, since Columns are stored sorted, using Composite values can be pretty powerful for accessing ranges of data.
See this question for details on querying composite columns using Hector.
If necessary, the column names could then be used as keys to do Multiget queries for additional lookups.
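For instance, the key1:key2 model under RP might look like this in CQL 3 terms (a sketch; names invented):

-- key1 is the partition key (randomly distributed); key2 is a
-- clustering column (sorted within the partition)
CREATE TABLE pairs (
    key1  bigint,
    key2  bigint,
    value text,
    PRIMARY KEY (key1, key2)
);

-- all rows with a specific key1 and any key2:
SELECT * FROM pairs WHERE key1 = 0;

-- a key2 range within one key1:
SELECT * FROM pairs WHERE key1 = 0 AND key2 >= 10 AND key2 < 20;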
I hope these articles help you :)
http://pkghosh.wordpress.com/2011/03/02/cassandra-secondary-index-patterns/
http://www.datastax.com/docs/0.7/data_model/cfs_as_indexes
http://www.anuff.com/2011/02/indexing-in-cassandra.html
Also check out this question:
Storing a list of values in Cassandra
Here is an example use case:
You need to store the last N (let's say 1000 as a fixed bucket size) user actions, with all details, in timeuuid-based columns.
Normally, each user's actions are already in a "UserAction" column family, with the user id as the row key and the actions in timeuuid columns. You may also have an "AllActions" column family which stores all actions, with the same timeuuid as the column name and the user id as the column value. It's basically a relationship column family, but unfortunately without any details of the user actions. Querying this column family is expensive, I guess, because of the random partitioner. On the other hand, if you store all the details in the "AllActions" CF, then Cassandra can't handle that big a row properly at some point. This is why I want to store the last N user actions, with all details, in a fixed number of timeuuid-based columns.
Maybe you have a better design solution for this use case... I'd like to hear it...
If not, the question is: how do I implement a fixed number of (timeuuid) columns in Cassandra (with CQL) effectively?
After insertion we could delete the old (overflow) columns if we had some sort of range support in CQL's DELETE. AFAIK there is no support for this.
So, any idea? Thanks in advance...
IMHO, this is something that C* should handle itself, like compaction; it's not a good idea to handle it on the client side.
Maybe we need some configuration (storage) options for column families to make them suitable for "most recent data".
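Until something like that exists, the closest CQL approximation I know of is to cluster on the timeuuid in descending order and trim the overflow columns by exact name (a sketch with invented names; not a true fixed-size bucket):

-- newest actions first within each user's partition
CREATE TABLE user_actions (
    user_id   uuid,
    action_ts timeuuid,
    details   text,
    PRIMARY KEY (user_id, action_ts)
) WITH CLUSTERING ORDER BY (action_ts DESC);

-- reading the "last N" is cheap:
SELECT * FROM user_actions WHERE user_id = ? LIMIT 1000;

-- trimming overflow means deleting columns one at a time, since
-- CQL DELETE has no range support:
DELETE FROM user_actions WHERE user_id = ? AND action_ts = ?;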
I want to fetch all rows having a common prefix using the Hector API. I played with RangeSuperSlicesQuery a bit but didn't find a way to get it working properly. Do the key range parameters work with wildcards, etc.?
Update: I used ByteOrderedPartitioner instead of RandomPartitioner and it works fine with that. Is this the expected behavior?
Yes, that's the expected behavior. In RandomPartitioner, rows are stored in the order of the MD5 hash of their keys, so to get a meaningful range of keys, you need to use an order preserving partitioner like ByteOrderedPartitioner.
However, there are downsides to using ByteOrderedPartitioner or OrderPreservingPartitioner that you can usually avoid with a slightly different data model and RandomPartitioner.
To elaborate on the above answer, you should consider using column names as your "common prefix" instead of the key. Then you can either use a column slice to get all column names in a certain range, or you can use a secondary index and then do an indexed slice for all keys with that column value; see the sketch after the examples below.
Column slice example:
Key (without prefix)
<prefix1> : <data>
<prefix2> : <data>
...
Secondary index example:
Key (with or without prefix)
"prefix" : <the_prefix> <-- this column is indexed
otherCol1 : <data>
...
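In CQL terms, the secondary-index variant might look roughly like this (names invented for illustration):

-- each row carries its prefix in an indexed column
CREATE TABLE prefixed_rows (
    key    text PRIMARY KEY,
    prefix text,
    data   text
);
CREATE INDEX ON prefixed_rows (prefix);

-- fetch all rows sharing a common prefix, without needing an
-- order-preserving partitioner
SELECT * FROM prefixed_rows WHERE prefix = 'someprefix';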