Azure Table Storage: Order by

I am building a web site that has a wish list. I want to store the wish list(s) in Azure table storage, but I also want the user to be able to sort their wish list a number of different ways when viewing it: date added, date added reversed, item name, etc. I also want to implement paging, which I believe I can do by making use of the continuation token.
As I understand it, "order by" isn't implemented, and the order in which results are returned from table storage is based on the partition key and row key. Therefore, if I want to implement the paging and sorting I describe, is the best approach to store the wish list multiple times with different partition key / row key combinations?
In this simple case the wish list probably won't be very large, and I could restrict the maximum number of items in the list, drop paging, and sort in memory. However, I have more complex cases for which I also need paging and sorting.

On today's hardware, holding thousands of rows in an in-memory list and sorting them is easily supportable. The real issue is whether you can access the rows in table storage using the keys rather than a table scan. Duplicating rows across multiple tables could get quite cumbersome to maintain.
An alternative solution would be to temporarily stage your rows into SQL Azure and apply an ORDER BY there. This may be effective if your result set is too large to work with in memory. For best results the temporary table would need the necessary indexes.

Azure Table Storage keeps entities in lexicographic order, indexed by Partition Key as the primary index and Row Key as the secondary index. For your scenario, UserId sounds like a good fit for the partition key, leaving the Row Key free to optimize each query.
If you want the user to see the latest wish list items on top, you can use the log tail pattern, where the row key is the inverted DateTime ticks of the moment the item was added:
https://learn.microsoft.com/azure/storage/tables/table-storage-design-patterns#log-tail-pattern
If you want the user to see their wish list ordered by item name, you can make the item name the row key, so the entities are naturally sorted by Azure.
When writing the data, you may want to denormalize it and do multiple writes with these different row key schemas. Since both writes share the same partition key (the user id), you can do them as a single batch insert and not worry about consistency, because Azure table batch operations (entity group transactions) are atomic.
To differentiate the row key schemas, you may want to prepend each with a constant string. For instance, the inverted-ticks row key would be something like "InvertedTicks_[InvertedDateTimeTicksOfTheWishList]" and the item-name row key would be "ItemName_[ItemNameOfTheWishList]".
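As a rough illustration, the denormalized write might look like this (a minimal sketch using the classic Microsoft.WindowsAzure.Storage.Table client; the WishListEntry type, the table instance, and the key prefixes are assumptions for illustration, not part of the original answer):

using System;
using Microsoft.WindowsAzure.Storage.Table;

public class WishListEntry : TableEntity
{
    public string ItemName { get; set; }
    public DateTime AddedUtc { get; set; }
}

public static void SaveWishListEntry(CloudTable table, string userId, string itemName, DateTime addedUtc)
{
    // Inverted ticks: newer items get lexicographically smaller row keys,
    // so they come back first (log tail pattern).
    string invertedTicks = (DateTime.MaxValue.Ticks - addedUtc.Ticks).ToString("D19");

    var byDate = new WishListEntry
    {
        PartitionKey = userId,
        RowKey = "InvertedTicks_" + invertedTicks,
        ItemName = itemName,
        AddedUtc = addedUtc
    };
    var byName = new WishListEntry
    {
        PartitionKey = userId,
        RowKey = "ItemName_" + itemName,
        ItemName = itemName,
        AddedUtc = addedUtc
    };

    // Both entities share the same partition key, so they can be written in
    // one atomic batch (entity group transaction).
    var batch = new TableBatchOperation();
    batch.Insert(byDate);
    batch.Insert(byName);
    table.ExecuteBatch(batch);
}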

Why not do all of this in .NET using a List&lt;T&gt;?
For this type of application I would have thought SQL Azure would be more appropriate.

Something like this worked just fine for me:
// Query the "insysdata" table through the (legacy) TableServiceContext,
// filtering server-side on partition key and one property...
List<TableEntityType> rawData =
    (from c in ctx.CreateQuery<TableEntityType>("insysdata")
     where c.PartitionKey == "PartitionKey" && c.Field == fieldvalue
     select c).AsTableServiceQuery().ToList();

// ...then sort client-side, since table storage has no ORDER BY.
List<TableEntityType> sortedData = rawData.OrderBy(c => c.DateTime).ToList();
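If the result set is too large for a single ToList(), the newer Microsoft.WindowsAzure.Storage.Table client supports paging with continuation tokens, which matches the paging the question describes. A minimal sketch (the TableEntityType, the CloudTable instance, and ProcessPage are assumptions):

using Microsoft.WindowsAzure.Storage.Table;

// Fetch 50 entities per page; the continuation token marks where the
// next page starts.
var query = new TableQuery<TableEntityType>()
    .Where(TableQuery.GenerateFilterCondition(
        "PartitionKey", QueryComparisons.Equal, "PartitionKey"))
    .Take(50);

TableContinuationToken token = null;
do
{
    TableQuerySegment<TableEntityType> segment =
        table.ExecuteQuerySegmented(query, token);
    token = segment.ContinuationToken;
    ProcessPage(segment.Results);   // e.g. render one page of the wish list
} while (token != null);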

Related

Cassandra pagination and token function; selecting a partition key

I've been doing a lot of reading lately on Cassandra data modelling and best practices.
What escapes me is what the best practice is for choosing a partition key if I want an application to page through results via the token function.
My current problem is that I want to display 100 results per page in my application and be able to move on to the next 100 after.
From this post: https://stackoverflow.com/a/24953331/1224608
I was under the impression a partition key should be selected such that data spreads evenly across each node. That is, a partition key does not necessarily need to be unique.
However, if I'm using the token function to page through results, e.g.:
SELECT * FROM table WHERE token(partitionKey) > token('someKey') LIMIT 100;
That would mean that the number of results returned from my partition may not necessarily match the number of results I show on my page, since multiple rows may have the same token(partitionKey) value. Or worse, if the number of rows that share the partition key exceeds 100, I will miss results.
The only way I could guarantee 100 results on every page (barring the last page) is if I were to make the partition key unique. I could then read the last value on my page and retrieve the next page with an almost identical query:
SELECT * FROM table WHERE token(partitionKey) > token('lastKeyOfCurrentPage') LIMIT 100;
But I'm not certain if it's good practice to have a unique partition key for a complex table.
Any help is greatly appreciated!
"But I'm not certain if it's good practice to have a unique partition key for a complex table."
How you should choose your partition key depends on your requirements and data model. If your primary key is a single column used as the partition key, it has to be unique; otherwise the data will be upserted (overwritten with new data). If you have a wide row (a clustering key), then making your partition key unique (a key that appears once in the table) defeats the purpose of the wide row. In CQL, "wide rows" just means that there can be more than one row per partition; with a unique partition key there will be exactly one row per partition. It would be better if you could provide the schema.
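To make the distinction concrete, here is a minimal sketch of the two shapes, executed via the DataStax C# driver (session is an open ISession; table and column names are made up for illustration):

// Partition key alone: one row per partition, so inserting the same
// user_id again upserts (overwrites) the existing row.
session.Execute("CREATE TABLE users_narrow (user_id uuid PRIMARY KEY, name text)");

// Partition key plus clustering key: many rows per partition (a wide row),
// sorted within the partition by created_at.
session.Execute("CREATE TABLE users_wide (user_id uuid, created_at timeuuid, " +
                "name text, PRIMARY KEY ((user_id), created_at))");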
Please follow the links below about pagination in Cassandra.
You do not need to use tokens if you are using Cassandra 2.0+: Cassandra 2.0 has auto paging. Instead of using the token function to create paging, it is now a built-in feature.
Results pagination in Cassandra (CQL)
https://www.datastax.com/dev/blog/client-side-improvements-in-cassandra-2-0
https://docs.datastax.com/en/developer/java-driver/2.1/manual/paging/
Saving and reusing the paging state:
You can use the pagingState object, which represents where you were in the result set when the last page was fetched.
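With the DataStax C# driver, for example, saving and reusing the paging state might look roughly like this (a sketch; the contact point, keyspace, and table names are placeholders):

using Cassandra;

var cluster = Cluster.Builder().AddContactPoint("127.0.0.1").Build();
var session = cluster.Connect("mykeyspace");

// First page: the driver fetches 100 rows and exposes an opaque paging state.
var statement = new SimpleStatement("SELECT * FROM mytable").SetPageSize(100);
var firstPage = session.Execute(statement);
byte[] pagingState = firstPage.PagingState;

// Later, possibly in another request: resume exactly where the last page ended.
var nextStatement = new SimpleStatement("SELECT * FROM mytable")
    .SetPageSize(100)
    .SetPagingState(pagingState);
var nextPage = session.Execute(nextStatement);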
EDITED:
Please check the link below:
Paging Resultsets in Cassandra with compound primary keys - Missing out on rows
I recently did a POC for a similar problem, so let me add it here quickly.
First there is a table with two fields (just for illustration we use only a few fields).
Say we insert a million rows into this table.
Along comes the product owner with a (rather strange) requirement that we need to list all the data as pages in the GUI; assume, for illustration, a hundred entries split into 10 pages.
For this we update the table with a column called page_no.
Create a secondary index for this column.
Then do a one time update for this column with page numbers. Page number 10 will mean 10 contiguous rows updated with page_no as value 10.
Since we can query on a secondary index each page can be queried independently.
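The per-page query then becomes a simple equality lookup on the indexed column; a sketch via the DataStax C# driver (session is an open ISession; the table and column names are assumptions following the description above):

// Fetch page 10 through the secondary index on page_no.
var page10 = session.Execute(
    new SimpleStatement("SELECT * FROM my_table WHERE page_no = ?", 10));
foreach (var row in page10)
{
    // render the row in the GUI
}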
The code is self-explanatory and available here: https://github.com/alexcpn/testgo
Note: cautions about using secondary indexes properly abound, so please check them. In this use case I am hoping that I am using the index properly. I have not tested with multiple clusters.
"In practice, this means indexing is most useful for returning tens,
maybe hundreds of results. Bear this in mind when you next consider
using a secondary index." From http://www.wentnet.com/blog/?p=77

Azure Table Storage Delete where Row Key is Between two Row Key Values

Is there a good way to delete entities that are in the same partition given a row key range? It looks like the only way to do this would be to do a range lookup and then batch the deletes after looking them up. I'll know my range at the time that entities will be deleted so I'd rather skip the lookup.
I want to be able to delete entities to keep my partitions from getting too big. As far as I know, a single partition cannot be scaled across multiple servers. Each partition is going to represent a type of message that a user sends; there will probably be fewer than 50 types. I need a way to show all the messages of each type that were sent (e.g. show recent messages of type 0 regardless of who sent them), which is why I plan to make the type the partition key. Since the types don't scale with the number of users/messages, though, I don't want to let each partition grow indefinitely.
Unfortunately, you need to know the precise Partition Keys and Row Keys in order to issue deletes. You do not need to retrieve the entities from storage if you already know the precise RowKeys, but you do need them to issue a batch delete. There is no magic "DELETE FROM table WHERE partitionkey = 10" command like there is in SQL.
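If a lookup is unavoidable, projecting just the keys keeps it cheap; a sketch of a range lookup plus batched deletes using the Microsoft.WindowsAzure.Storage.Table client (the CloudTable instance and key bounds are placeholders):

using Microsoft.WindowsAzure.Storage.Table;

// Find entities in one partition within a row key range, then delete them
// in batches of up to 100 (the batch size limit).
string lowRowKey = "0000100", highRowKey = "0000200";   // known range bounds (placeholders)

string filter = TableQuery.CombineFilters(
    TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, "0"),
    TableOperators.And,
    TableQuery.CombineFilters(
        TableQuery.GenerateFilterCondition("RowKey", QueryComparisons.GreaterThanOrEqual, lowRowKey),
        TableOperators.And,
        TableQuery.GenerateFilterCondition("RowKey", QueryComparisons.LessThan, highRowKey)));

// Project only RowKey; we just need PK/RK (and ETag) to issue the deletes.
var query = new TableQuery<DynamicTableEntity>().Where(filter).Select(new[] { "RowKey" });

var batch = new TableBatchOperation();
foreach (var entity in table.ExecuteQuery(query))
{
    batch.Delete(entity);
    if (batch.Count == 100) { table.ExecuteBatch(batch); batch.Clear(); }
}
if (batch.Count > 0) table.ExecuteBatch(batch);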
However, consider breaking your data up into tables that represent archivable time units. For example, in AzureWatch we store all of the metric data in tables that represent one month of data, i.e. Metrics201401, Metrics201402, etc. Thus, when it comes time to archive, a full table is purged for a particular month.
The obvious downside of this approach is the need to "union" data from multiple tables if your queries span wide time ranges. However, if you keep the number of time ranges to a minimum, the number of unions will not be large. Essentially, this approach lets you use the table name as another partitioning opportunity.
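Purging an archived month then becomes a single call rather than millions of deletes, e.g. (a sketch; tableClient is an assumed CloudTableClient):

// Drop the whole month's table instead of deleting row by row.
CloudTable oldMonth = tableClient.GetTableReference("Metrics201401");
oldMonth.DeleteIfExists();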

Azure Tables - Partition Key and Row Key - Correct Choice

I am new to Azure tables and having read a lot of articles but would like some reassurance on the above given its fundamental.
I have data which is similar to this:
CustomerId, GUID
TripId, GUID
JourneyStep, GUID
Time, DateTime
AverageSpeed, int
Based on what I have read, is CustomerId a good PartitionKey? Where I become stuck is that the combination of CustomerId and TripId does not make a unique row. My justification for TripId as the Row Key is that every query will be a dataset based on CustomerId and TripId.
Just for context, the CustomerId is clearly unique, the TripId represents one journey in a vehicle and within that journey the JourneyStep represents a unit within that Trip which may be 10 steps or 1000.
The intention is to aggregate the data into further tables, with each level being used for a different purpose. At the most aggregated level, the customer will be given some scores.
The amount of data will obviously be huge so need to think about query performance from the outset.
Updated:
As requested: the solution is for vehicle telematics, so think of yourself in your own car. A black box ships data to a server, which in turn passes it to Azure Tables. In relational DB terms, I would have a Customer table and a Trip table with a foreign key back to the Customer table.
The TripId is auto-generated by the black box. TripId does not need to be stored sorted by date/time from a query point of view, though it may be relevant from a query performance point of view.
Queries will be split into two:
Display a map of a single journey for each customer: filter by customer and then trip, then iterate each row (journey step) onto a map.
Per customer, score each trip and then retrieve trips for, let's say, the last month to aggregate a score. I do have a SQL Database to enrich the data with client records etc., but for the volume data (the trip data) I wish to use Azure Tables.
The aggregates from the second query will probably be stored in a separate table, so if someone made 10 trips in one month, I would run the second query which would score each trip, then produce a score for all trips that month and store both answers so potentially a table of trip aggregates and a table of monthly aggregates.
The thing about the Partition Key is that it represents a logical grouping; you cannot, for example, perform a single atomic batch operation that spans multiple partition keys. Similarly, rows with the same partition key are likely to be stored on the same server, making it quick to retrieve all the data for a given partition key.
As such, it is important to look at your domain and figure out what aggregate you are likely to work with.
If I understand your domain model correctly, I would actually be tempted to use the TripId as the Partition Key and the JourneyStep as the Row Key.
You will need, separately, a table that lists all the Trip IDs that belong to a given Customer; that sort of makes sense anyway, as you probably want to store some data such as "trip name" in such a table.
Your design has to be driven by your queries. You can filter your data on two columns, PartitionKey and RowKey; PartitionKey is the more important one, since your queries hit that column first.
In your case CustomerId should be your PartitionKey, since most of the time you will look up data by customer. (You may also need to keep another table for your client list.)
The RowKey can then be your tripId or time. If I were you, I would probably use a RowKey in the format yyyyMMddHHmm|tripId, which lets you run lexicographic "starts with" / "ends with" style range queries.
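For instance, with RowKey = yyyyMMddHHmm|tripId, "trips for the last month" becomes a lexicographic range scan inside the customer's partition; a sketch (the TripEntity type, the CloudTable instance, and customerId are assumptions):

using System;
using Microsoft.WindowsAzure.Storage.Table;

string from = DateTime.UtcNow.AddMonths(-1).ToString("yyyyMMddHHmm");
string to = DateTime.UtcNow.AddMinutes(1).ToString("yyyyMMddHHmm");

// PartitionKey == customerId AND from <= RowKey < to
string filter = TableQuery.CombineFilters(
    TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, customerId),
    TableOperators.And,
    TableQuery.CombineFilters(
        TableQuery.GenerateFilterCondition("RowKey", QueryComparisons.GreaterThanOrEqual, from),
        TableOperators.And,
        TableQuery.GenerateFilterCondition("RowKey", QueryComparisons.LessThan, to)));

var lastMonthsTrips = table.ExecuteQuery(new TableQuery<TripEntity>().Where(filter));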
Adding to @Frans' answer:
One thing you could do is create a separate table for each customer, so each customer's data is nicely segregated into its own table. Then you could use TripId as PartitionKey and JourneyStep as RowKey, as @Frans suggested. For storing metadata about the trip, instead of going to a separate table, I would still use the same table but keep the RowKey empty and put the other trip information there.
I would suggest considering the following approach to your PK/RK design. I believe it would yield the best performance for your outlined queries:
PartitionKey: combination of CustomerId and TripId.
string.Format("{0}_{1}", customerId.ToString(), tripId.ToString())
RowKey: DateTime.MaxValue.Ticks - Time.Ticks formatted as a large zero-padded string, combined with the JourneyStep.
string.Format("{0}_{1}", (DateTime.MaxValue.Ticks - Time.Ticks).ToString("D19"), JourneyStep.ToString())
Such a combination will allow you to do the following queries "quickly":
Get data by CustomerId only. Example: context.Trips.Where(n => string.Compare(id + "_00000000-0000-0000-0000-000000000000", n.PartitionKey) <= 0 && string.Compare(id + "_zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz", n.PartitionKey) >= 0).AsTableServiceQuery(context);
Get data by CustomerId and TripId. Example: context.Trips.Where(n => n.PartitionKey == string.Format("{0}_{1}", customerId, tripId)).AsTableServiceQuery(context);
Get last X amount of journey steps if you were to search by either CustomerId or CustomerId/TripId by using the "Take" function
Get data via date-range queries by translating timestamps into Ticks
Save data into a trip with a single storage transaction (assuming you have less than 100 steps)
If you can guarantee uniqueness of the step Times within each Trip, you don't even have to put the JourneyStep into the RowKey, as it is somewhat inconvenient there anyway.
The only downside to this schema is not being able to retrieve a particular single journey step without knowing its Time and Id. However, unless you have very specific use cases, downloading all of the steps inside a trip and then picking a particular one from the list should not be so bad.
HTH
The design of your table storage schema should optimize for the two major capabilities of Azure Tables:
Scalability
Search performance
As @Frans already pointed out, Azure Tables uses the partition key to decide how to scale your data out across multiple storage server nodes. Because of this, I would advise against unique partition keys, since in theory you would then have Azure spreading out storage nodes each able to serve only one customer. I say "in theory" because, in practice, Azure uses smart algorithms to identify patterns in your partition keys and group them (for example, if your ids are consecutive numbers). You don't want to fall into this scenario, because the scalability of your storage would be unpredictable and at the mercy of obscure algorithms making those decisions. See HERE for more information about scalability.
Regarding performance, the fastest searches hit both partition key and row key in the query. Contrary to Amazon DynamoDB, Azure Tables does not support secondary column indexes; if your queries search on attributes stored in columns other than those two, Azure will need to do a full table scan.
I faced a situation similar to yours, where the design of the partition/row keys was not trivial. In the end, we expanded our data model to include more information so we could design our table in such a way that ~80% of all search queries can be matched to partition+row keys, while the remaining 20% require a table scan. We decided to include the user's location, so our partition key is the user's country and the row key is a unique customer ID. This means our data model had to be expanded to include the user's country, which was not a big deal. Maybe you can do the same thing: group your customers by segment, by location, or by email address SMTP domain?
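In code the grouping is just a different choice of keys at write time; a tiny sketch (the customer object and CloudTable instance are assumptions):

// Partition by country, unique customer id as row key, per the
// ~80/20 design described above.
var entity = new DynamicTableEntity(customer.Country, customer.Id.ToString());
entity.Properties["Email"] = new EntityProperty(customer.Email);
table.Execute(TableOperation.InsertOrReplace(entity));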

get_range_slices and CQL query handling, need for ALLOW FILTERING

I have the following CQL table (a bit simplified for clarity):
CREATE TABLE test_table (
user uuid,
app_id ascii,
domain_id ascii,
props map<ascii,blob>,
PRIMARY KEY ((user), app_id, domain_id)
)
The idea is that this table would contain many users (i.e. rows, say, dozens of millions). For each user there would be a few domains of interest and there would be a few apps per domain. And for each user/domain/app there would be a small set of properties.
I need to scan this entire table and load its contents in chunks for given app_id and domain_id. My idea was to use the TOKEN function to be able to read the whole data set in several iterations. So, something like this:
SELECT props FROM test_table WHERE app_id='myapp1'
AND domain_id='mydomain1'
AND TOKEN(user) > -9223372036854775808
AND TOKEN(user) < 9223372036854775807;
I was assuming that this query would be efficient because I specify the range of the row keys and by specifying the values of the clustering keys I effectively specify the column range. But when I try to run this query I get the error message "Bad Request: Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING".
I have limited experience with Cassandra, and I assumed this sort of query would map onto a get_range_slices() call, which accepts a slice predicate (i.e. the range of columns defined by my app_id/domain_id values) and the key range defined by my token range. It seems either I misunderstand how this sort of query is handled, or I misunderstand the efficiency of the get_range_slices() call.
To be more specific, my questions are:
- if this data model does make sense for the kind of query I have in mind
- if this query is expected to be efficient
- if it is efficient, then why am I getting this error message asking me to ALLOW FILTERING
My only guess about the last one was that the rows that do not have the given combination of app_id/domain_id would need to be skipped from the result.
--- update ----
Thanks for all the comments. I have been doing more research on this and there is still something that I do not fully understand.
In the given structure, what I am trying to get is like a rectangular area of my data set (assuming that all rows have the same columns): the top and bottom of the rectangle are determined by the token range, and the left/right sides are defined by the column range (slice). So this should naturally translate into a get_range_slices request. My understanding (correct me if I am wrong) is that the reason CQL requires the ALLOW FILTERING clause is that there will be rows that do not contain the columns I am looking for, so they will have to be skipped. And since nobody knows whether it will have to skip every second row or the first million rows before finding one that fits my criteria (in the given range), this is what causes the unpredictable latency and possibly even a timeout. Am I right?
I have tried to write a test that does the same kind of query but using the low-level Astyanax API (over the same table; I had to read the data generated with CQL, which turned out to be quite simple), and this test does work, except that it returns keys with no columns where the row does not contain the slice of columns I am asking for. Of course I had to implement some kind of simple paging based on the starting token and a limit to fetch the data in small chunks.
Now I am wondering - again, considering that I would need to deal with dozens of millions of users: would it be better to partially "rotate" this table and organize it in something like this:
Row key: domain_id + app_id + partition no (something like hash(user) mod X)
Clustering key: column partition no (something like hash(user) >> 16 mod Y) + user
For the "column partition no"...I am not sure if it is really needed. I assume that if I go with this model I will have relatively small number of rows (X=1000..10000) for each domain + app combination. This will allow me to query the individual partitions, even in parallel if I want to. But (assuming the user is random UUID) for 100M users it will result in dozens or hundreds of thousands of columns per row. Is it a good idea to read one such a row in one request? It should created some memory pressure for Cassandra, I am sure. So maybe reading them in groups (say, Y=10..100) would be better?
I realize that what I am trying to do is not what Cassandra does well - reading "all" or large subset of CF data in chunks that can be pre-calculated (like token range or partition keys) for parallel fetching from different hosts. But I am trying to find a pattern that is the most efficient for such a use case.
By the way, a query like "select * from ... where TOKEN(user) > X and TOKEN(user) < Y" ...
Short answer
This warning means that Cassandra would have to read non-indexed data and filter out the rows that don't satisfy the criteria. If you add ALLOW FILTERING to the end of the query, it will work; however, it will scan a lot of data:
SELECT props FROM test_table
WHERE app_id='myapp1'
AND domain_id='mydomain1'
AND TOKEN(user) > -9223372036854775808
AND TOKEN(user) < 9223372036854775807
ALLOW FILTERING;
Longer explanation
In your example the primary key consists of two parts: user is used as the partition key, and <app_id, domain_id> forms the remaining (clustering) part. Rows for different users are distributed across the cluster, with each node responsible for a specific range of the token ring.
Rows on a single node are sorted by the hash of the partition key (token(user) in your example). The different rows for a single user are stored on a single node, sorted by the <app_id, domain_id> tuple.
So the primary key forms a tree-like structure: the partition key adds one level of hierarchy, and each remaining field of the primary key adds another. By default, Cassandra processes only those queries that return all rows from a continuous range of the tree (or several ranges if you use the key IN (...) construct). If Cassandra would have to filter out some rows, ALLOW FILTERING must be specified.
Example queries that don't require ALLOW FILTERING:
SELECT * FROM test_table
WHERE user = 'user1';
//OK, returns all rows for a single partition key
SELECT * FROM test_table
WHERE TOKEN(user) > -9223372036854775808
AND TOKEN(user) < 9223372036854775807;
//OK, returns all rows for a continuous range of the token ring
SELECT * FROM test_table
WHERE user = 'user1'
AND app_id='myapp1';
//OK, the rows for specific user/app combination
//are stored together, sorted by domain_id field
SELECT * FROM test_table
WHERE user = 'user1'
AND app_id > 'abc' AND app_id < 'xyz';
//OK, since rows for a single user are sorted by app
Example queries that do require ALLOW FILTERING:
SELECT props FROM test_table
WHERE app_id='myapp1';
//Must scan all the cluster for rows,
//but return only those with specific app_id
SELECT props FROM test_table
WHERE user='user1'
AND domain_id='mydomain1';
//Must scan all rows having user='user1' (all app_ids),
//but return only those having specific domain
SELECT props FROM test_table
WHERE user='user1'
AND app_id > 'abc' AND app_id < 'xyz'
AND domain_id='mydomain1';
//Must scan the range of rows satisfying <user, app_id> condition,
//but return only those having specific domain
What to do?
In Cassandra it's not possible to create a secondary index on part of the primary key. There are a few options, each with its pros and cons:
Add a separate table that has primary key ((app_id), domain_id, user) and duplicate the necessary data across the two tables. It will allow you to query the necessary data for a specific app_id or <app_id, domain_id> combination; if you need to query a specific domain across all apps, a third table is necessary. This approach is called materialized views (see the sketch after this list).
Use some sort of parallel processing (Hadoop, Spark, etc.) to perform the necessary calculations for all app/domain combinations. Since Cassandra needs to read all the data anyway, there probably won't be much difference from handling a single pair; if the results for the other pairs can be cached for later use, it will probably save some time.
Just use ALLOW FILTERING if the query performance is acceptable for your needs. Dozens of millions of partition keys is probably not too much for Cassandra.
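A sketch of the first option's duplicate table, executed here via the DataStax C# driver (session is an open ISession; the table name is made up):

// Duplicate table keyed for app-first lookups (option 1 above).
session.Execute(@"CREATE TABLE test_table_by_app (
    app_id ascii,
    domain_id ascii,
    user uuid,
    props map<ascii, blob>,
    PRIMARY KEY ((app_id), domain_id, user))");

// This is now a single-partition query; no ALLOW FILTERING needed.
var rows = session.Execute(new SimpleStatement(
    "SELECT props FROM test_table_by_app WHERE app_id = ? AND domain_id = ?",
    "myapp1", "mydomain1"));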
Presuming you are using the Murmur3Partitioner (which is the right choice), you do not want to run range queries on the row key. This key is hashed to determine which node holds the row, and is therefore not stored in sorted order. Doing this kind of range query would therefore require a full scan.
If you want to do this query, you should store some known value as a sentinel for your row key, such that you can query for equality rather than range. From your data it appears that either app_id or domain_id would be a good choice, since it sounds like you always know these values when performing your query.

Handling the following use case in Cassandra?

I've been given the task of modelling a simple feed store in Cassandra. Coming from an almost solely SQL background, though, I'm having a bit of trouble figuring it out.
Basically, we have a list of feeds that we're listening to that update periodically. This can be in RSS, JSON, ATOM, XML, etc (depending on the feed).
What we want to do is periodically check for new items in each feed, convert the data into a few formats (i.e. JSON and RSS) and store that in a Cassandra store.
So, in an RBDMS, the structure would be something akin to:
Feed:
feedId
name
URL
FeedItem:
feedItemId
feedId
title
json
rss
created_time
I'm confused as to how to model that data in Cassandra to facilitate simple things such as getting the latest x items for a specific feed in descending created-time order (which is probably the most common query).
I've heard of one strategy that mentions having a composite key storing, in this example, the created_time as a time-based UUID together with the feed item ID, but I'm still a little confused.
For example, let's say I have a series of rows whose key is basically the feedId. Inside each row I store a range of columns, as mentioned above. The question is, where does the actual data go (i.e. JSON, RSS, title)? Would I have to store all the data for that 'record' as the column value?
I think I'm confusing wide rows and narrow (short?) rows as I like the idea of the composite key but I also want to store other data with each record and I'm not sure how to meld the two together...
You can store everything in one column family. However If the data for each FeedItem is very large, you can split the data for each FeedItem into another column family.
For example, you can have one column family for Feed, where the columns of each key are FeedItem ids, something like:
Feeds # column family
FeedId1 #key
time-stamp-1-feed-item-id1 #columns have no value, or values are enough info
time-stamp-2-feed-item-id2 #to show summary info in a results list
The Feeds column family allows you to quickly get the last N items from a feed, and querying for the last N items of a feed doesn't require fetching all the data for each FeedItem: either nothing is fetched, or just a summary.
Then you can use another column family to store the actual FeedItem data,
FeedItems # column family
feed-item-id1 # key
rss # 1 column for each field of a FeedItem
title #
...
Using CQL should be easier for you to understand, given your SQL background.
Cassandra (and NoSQL in general) is very fast, and you don't get real benefits from using a separate, related table for feeds; in any case you would not be able to do JOINs. Obviously you can still create two tables if that's more comfortable for you, but you will have to manage linking the data inside your application code.
You can use something like:
CREATE TABLE FeedItem (
    feedItemId ascii PRIMARY KEY,
    feedId ascii,
    feedName ascii,
    feedURL ascii,
    title ascii,
    json ascii,
    rss ascii,
    created_time ascii
);
Here I used ascii fields for everything. You can choose different data types for feedItemId or created_time (available data types can be found here); depending on which language and client you are using, they can be transparent to work with or require some extra effort.
You may want to add some secondary indexes. For example, to search for feed items from a specific feedId:
SELECT * FROM FeedItem where feedId = '123';
To create the index:
CREATE INDEX FeedItem_feedId ON FeedItem (feedId);
Sorting / ordering, alas, is not easy in Cassandra. Maybe reading here and here can give you some clues about where to start looking; it also depends heavily on the Cassandra version you're going to use.
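One approach worth noting: if the main query is "latest N items for a feed", you can get the ordering for free from the clustering key instead of a secondary index or client-side sorting. A hedged sketch via the DataStax C# driver (session is an open ISession; the table name and the timeuuid choice for created_time are assumptions):

// Cluster items under the feed, kept sorted newest-first on disk, so
// "last 10 items of a feed" needs no index and no client-side sort.
session.Execute(@"CREATE TABLE feed_items_by_feed (
    feedId ascii,
    created_time timeuuid,
    title ascii,
    json ascii,
    rss ascii,
    PRIMARY KEY ((feedId), created_time))
    WITH CLUSTERING ORDER BY (created_time DESC)");

var latest = session.Execute(new SimpleStatement(
    "SELECT title, json, rss FROM feed_items_by_feed WHERE feedId = ? LIMIT 10",
    "feed-123"));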
