Fetching key range with common prefix in Cassandra - cassandra

I want to fetch all rows having a common prefix using the Hector API. I played with RangeSuperSlicesQuery a bit but didn't find a way to get it working properly. Do key range parameters work with wildcards or similar?
Update: I used ByteOrderedPartitioner instead of RandomPartitioner and it works fine with that. Is this the expected behavior?

Yes, that's the expected behavior. In RandomPartitioner, rows are stored in the order of the MD5 hash of their keys, so to get a meaningful range of keys, you need to use an order preserving partitioner like ByteOrderedPartitioner.
However, there are downsides to using ByteOrderedPartitioner or OrderPreservingPartitioner that you can usually avoid with a slightly different data model and RandomPartitioner.
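To illustrate why a key range only lines up with a prefix under an order-preserving partitioner, here is a minimal sketch in plain Python. It does not talk to Cassandra; the keys are made up, and md5_token is a hypothetical stand-in for the partitioner's token rather than its exact computation.

```python
import hashlib

# Hypothetical row keys sharing two prefixes.
keys = ["user:1", "user:2", "user:3", "video:1", "video:2"]

def md5_token(key: str) -> int:
    """Integer form of the key's MD5 digest (stand-in for a RandomPartitioner token)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")

# Order seen by a byte-ordered partitioner vs. a hash-based one.
byte_order = sorted(keys, key=lambda k: k.encode("utf-8"))
hash_order = sorted(keys, key=md5_token)

def prefix_range(ordered, prefix):
    """Everything between the first and last key with `prefix`, inclusive."""
    hits = [i for i, k in enumerate(ordered) if k.startswith(prefix)]
    return ordered[hits[0]:hits[-1] + 1]

print(prefix_range(byte_order, "user:"))  # contiguous: only the user:* keys
print(prefix_range(hash_order, "user:"))  # may also include video:* keys
```

In byte order the prefixed keys sit next to each other, so a start/end key pair selects exactly them. In hash order the same keys are scattered, so no key range corresponds to the prefix.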

To elaborate on the above answer, you should consider using column names as your "common prefix" instead of the key. Then you can either use a column slice to get all column names in a certain range, or you could use a secondary index then do an indexed slice for all keys with that column name.
Column slice example:
Key (without prefix)
<prefix1> : <data>
<prefix2> : <data>
...
Secondary index example:
Key (with or without prefix)
"prefix" : <the_prefix> <-- this column is indexed
otherCol1 : <data>
...
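The column slice layout above can be sketched in plain Python, assuming column names are compared as strings and kept sorted within a row. The row contents and the slice_by_prefix helper are hypothetical; a real client would pass the same start and end bounds to its slice call.

```python
from bisect import bisect_left

def slice_by_prefix(sorted_names, prefix):
    """Return the column names that start with `prefix` from a sorted list."""
    start = bisect_left(sorted_names, prefix)
    # prefix + a very high code point acts as the exclusive upper bound.
    end = bisect_left(sorted_names, prefix + "\uffff")
    return sorted_names[start:end]

# One hypothetical row: column name -> value.
row = {
    "fruit:apple": "a",
    "fruit:banana": "b",
    "veg:carrot": "c",
    "fruit:cherry": "d",
}
names = sorted(row)  # columns within a row are always stored sorted
matches = slice_by_prefix(names, "fruit:")
print(matches)  # ['fruit:apple', 'fruit:banana', 'fruit:cherry']
```

Because columns within a row are sorted regardless of the partitioner, this works with RandomPartitioner as well.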

Related

Is there a way to dynamically pass multiple keys for partition in Mapping Data Flow

Right now, multiple keys can be given one after the other, and each key can be set dynamically. But is there a way to populate the entire "Unique Value Per Partition" field at once, for example by passing a list of keys?
You can use Fixed range partitioning and build an expression that provides a fixed range of values.

DynamoDb with sort?

I'm very new to DynamoDB, so forgive me if my question is a bit stupid.
I have a file that looks like this:
Appel,www.appel.com,www.cnn.com,www.bla.com....
Blabla,www.test.com,www.fox.com,www.bla.com.....
test,www.test.com,www.fox.com,www.bla.com...
www.appel.com,300
www.cnn.com,400
and so on. In short, each line is one of:
1: a word and all the URLs it appears in
2: a URL and its number of appearances
What I need to do is query DynamoDB so that, given a word, the output is the list of its URLs sorted by number of appearances.
For example, with this file,
for the word Appel the output is:
www.cnn.com,www.appel.com,www.bla.com....
I have tried creating two tables, 'Invert-index' and 'rank': the first for the word and its list of URLs, the second for the URL and its rank. But I can't find a way to make the query without sorting the results myself.
So first: is the DynamoDB structure (the two tables) correct?
Is there a way to query the database and sort the results?
In order to rely on DynamoDB to sort your data, you have to use a Range Key. That being said, to meet your requirements, the number of appearances has to be part of the Range Key.
The Hash Key could then be the word (e.g. Appel or Blabla), and lastly you can store the URLs as a string array in each record.
From the documentation:
Query results are always sorted by the range key. If the data type of
the range key is Number, the results are returned in numeric order;
otherwise, the results are returned in order of ASCII character code
values. By default, the sort order is ascending. To reverse the order
use the ScanIndexForward parameter set to false.
Source: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/QueryAndScan.html
You can find more information about the available key types on DynamoDB on the links below:
When to use what primary key type
What is the use of a hash range in a dynamodb table
Q: If I use the number of appearances as the range key, how can I store the string array? Each URL in it has a different number, so if each record has a primary key (word), a range key (number) and a value (string array), what is the number in this case?
In that case I would recommend you to compose the Range Key with two fields (number and url) using a separator character (e.g. '#'). Your final table structure would be:
Hash Key : <Word>
Range Key : <AppearanceNumber>#<Url>
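Your Range Key would be of the String type, which still sorts your data correctly because <AppearanceNumber> is the prefix, provided the number is zero-padded to a fixed width (strings are compared character by character, so an unpadded 95 would sort after 400).
As an example, querying for the <Word> 'Appel' with ScanIndexForward set to false would return the following results: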
Appel,900#www.appel.com
Appel,800#www.cnn.com
Appel,700#www.bla.com
Notice that you can still have the url and the appearanceNumber as separate fields in your table in case you want to minimize processing on your application side.
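Here is a small sketch of how such a composite Range Key could be built and why padding matters. It is plain Python with no AWS calls; make_range_key and the 10-digit width are assumptions for illustration, and the count for www.bla.com is made up (the question only gives counts for the other two URLs).

```python
def make_range_key(appearances: int, url: str, width: int = 10) -> str:
    """Build '<zero-padded count>#<url>' so that string order matches numeric order."""
    return f"{appearances:0{width}d}#{url}"

# (url, appearances); the 95 is a made-up value for illustration.
items = [("www.appel.com", 300), ("www.cnn.com", 400), ("www.bla.com", 95)]

# reverse=True mimics a query with ScanIndexForward set to false.
unpadded = sorted((f"{n}#{u}" for u, n in items), reverse=True)
padded = sorted((make_range_key(n, u) for u, n in items), reverse=True)

print(unpadded[0])  # 95#www.bla.com  (wrong: '9' > '4' as characters)
print(padded[0])    # 0000000400#www.cnn.com

# Recover the URLs in ranked order from the padded keys.
urls = [key.split("#", 1)[1] for key in padded]
print(urls)  # ['www.cnn.com', 'www.appel.com', 'www.bla.com']
```

Pick a width large enough for the biggest count you expect, since changing it later means rewriting every key.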

select compositetype keys in cassandra

I've defined a column family that uses composite ids for the row keys. Say the composite key is CompositeType(LongType,LongType). I've tested storing items with this type and that works fine, and SELECT works as expected too when I know the full key. But let's say I want all keys that have 0 as the first element and anything as the second. So far the only way that I can see to perform this query is as follows:
If I want all keys that are 0:*, then I do a CQL query for key >= 0:0 AND key < 1:0, which works as long as there is an order preserving partitioner.
My questions are:
1) Is this odd syntax only because I'm using a CQL driver (the only option for nodejs aside from thrift)?
2) Is there any inefficiency with this type of query? Essentially I'm using a composite key instead of super columns, since those aren't supported in CQL. I have no problem dealing with this logic in the code as long as there are no limitations to using it like this.
I would suggest you change your data model. Use RandomPartitioner and just have the first component as the row key. Push the second component into the column names, that is make your column names composites instead.
Since column names are always sorted, you can do easy slicing operations. For example,
a) When you know both components, do a get slice on the row key (the first component) and the first component of the composite column name.
b) When you know just the first component, fetch the complete row for the row key (the first component).
This is the approach CQL3 takes when you ask it to create a table whose primary key has multiple columns.
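The remodelling described above can be sketched as an in-memory toy in Python: the first component becomes the row key, and the second component leads a composite column name kept in sorted order. The store, insert, get_row and get_slice names are hypothetical and only mimic the behaviour; they are not a Cassandra client API.

```python
from bisect import bisect_left

# row key (first component) -> list of (composite column name, value), sorted by name
store = {}

def insert(row_key, column, value):
    """Insert a column, keeping the row's columns sorted by composite name."""
    cols = store.setdefault(row_key, [])
    names = [name for name, _ in cols]
    cols.insert(bisect_left(names, column), (column, value))

def get_row(row_key):
    """Case (b): only the first component is known, so fetch the whole row."""
    return list(store.get(row_key, []))

def get_slice(row_key, second):
    """Case (a): both components known, so slice the columns led by `second`."""
    cols = store.get(row_key, [])
    names = [name for name, _ in cols]
    start = bisect_left(names, (second,))    # (7,) sorts before (7, "age")
    end = bisect_left(names, (second + 1,))  # exclusive upper bound
    return cols[start:end]

insert(0, (7, "name"), "x")
insert(0, (3, "name"), "y")
insert(0, (7, "age"), "z")
insert(1, (7, "name"), "w")

print(get_row(0))       # every column stored under row key 0
print(get_slice(0, 7))  # only the columns whose composite name starts with 7
```

Neither lookup depends on row ordering, so RandomPartitioner is sufficient.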
Your best option is to use CQL 3. This will let you use composites underneath to optimize your lookups while still allowing you to use the parts of the composite values as though they were separate columns. You're currently using composites in your row keys, and CQL 3 only supports composites in column names (so far), but that's probably ok. In many cases like this, shifting the compositing from the row key to the column name won't have an adverse effect on your performance or data distribution, but if your row keys aren't sufficiently selective, then it might.
Either way, though, you should be looking at CQL 3. CQL 2 is deprecated. I could tell you more about how to adapt your model for CQL 3 if I knew more about your situation.

Cassandra CQL: How to select encoded value from column

I have inserted string and integer values into dynamic columns in a Cassandra Column Family. When I query for the values in CQL they are displayed as hex encoded bits.
Can I somehow tell the query to decode the value into a string or integer?
I also would be happy to do this in the CLI if that's easier. There I see you can specify assume <column_family> validator as <type>;, but that applies to all columns and they have different types, so I have to run the assumption and query many times.
(Note that the columns are dynamic, so I haven't specified the validator when creating the column family).
You can use ASSUME in cqlsh like in cassandra-cli (although it only applies to printing values, not sending them, but that ought to be ok for you). You can also use it on a per-column basis, like:
ASSUME <column_family> ('anchor:cnnsi.com') VALUES ARE text;
However, (a) I just tested it, and this functionality is broken in cassandra-1.1.1 and later; I posted a fix at CASSANDRA-4352. And (b) this probably isn't a very versatile or helpful solution for more than a few one-off uses. I'd strongly recommend using CQL 3 here, as CQL direct support for wide storage engine rows like this is deprecated. Your table here is certainly adaptable to an (easier to use) CQL 3 model, but I couldn't say exactly what it would be without knowing more about how you're using it.

Time UUID type in pycassa

I'm having problems with using the time_uuid type as a key in my columnfamily. I want to store my records, and have them ordered by when they were inserted, and then I figured that the time_uuid is a good way to go. This is how I've set up my column family:
sys.create_column_family("keyspace", "records", comparator_type=TIME_UUID_TYPE)
When I try to insert, I do this:
q=pycassa.ColumnFamily(pycassa.connect("keyspace"), "records")
myKey=pycassa.util.convert_time_to_uuid(datetime.datetime.utcnow())
q.insert(myKey,{'somedata':'comevalue'})
However, when I insert data, I always get an error:
Argument for a v1 UUID column name or value was neither a UUID, a datetime, or a number.
If I change the comparator_type to UTF8_TYPE, it works, but the items are not returned in the order I expect. What am I doing wrong?
The problem is that in your data model, you are using the time as a row key. Although this is possible, you won't get a meaningful ordering unless you also use the ByteOrderedPartitioner.
For this reason, most people insert time-ordered data using the time as a column name, not a row key. In this model, your insert statement would look like:
q.insert(someKey, {datetime.datetime.utcnow(): 'somevalue'})
where someKey is a key that relates to the entire time series that you're inserting (for example, a username). (Note that you don't have to convert the time to UUID, pycassa does it for you.) To store something more than a single value, use a supercolumn or a composite key.
If you really want to store the time in your row keys, then you need to specify key_validation_class, not comparator_type. comparator_type sets the type of the column names, while key_validation_class sets the type of the row keys.
sys.create_column_family("keyspace", "records", key_validation_class=TIME_UUID_TYPE)
Remember the rows will not be sorted unless you also use the ByteOrderedPartitioner.
The comparator for a column family is used for ordering the columns within each row. You are seeing that error because 'somedata' is valid utf-8 but not a valid uuid.
The ordering of the rows stored in Cassandra is determined by the partitioner. Most likely you are using RandomPartitioner, which distributes load evenly across your cluster but does not allow for meaningful range queries (the rows will be returned in what looks like a random order).
http://wiki.apache.org/cassandra/FAQ#range_rp
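To make the comparator error concrete, here is a rough Python sketch of the kind of check the error message describes. The acceptable_timeuuid_name function is hypothetical and only loosely mirrors the message's wording (a UUID, a datetime, or a number); it is not pycassa's actual code.

```python
import datetime
import numbers
import uuid

def acceptable_timeuuid_name(value) -> bool:
    """Rough approximation of what a TimeUUID comparator accepts as a column name."""
    if isinstance(value, uuid.UUID):
        return value.version == 1  # only time-based (v1) UUIDs qualify
    return isinstance(value, (datetime.datetime, numbers.Number))

now = datetime.datetime.now(datetime.timezone.utc)

print(acceptable_timeuuid_name("somedata"))    # False: a plain string, as in the question
print(acceptable_timeuuid_name(now))           # True: a datetime
print(acceptable_timeuuid_name(uuid.uuid1()))  # True: a v1 UUID
```

With comparator_type set to TIME_UUID_TYPE, the check applies to column names such as 'somedata', not to the row key, which is why the insert in the question fails even though the key is a valid time UUID.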

Resources